Connecting Spark to a remote HDFS

I have a Docker Swarm cluster. In this cluster there are Spark containers (1 master and 1 worker) and Hadoop containers (1 namenode and 1 datanode). I created the containers using the following docker-compose file:

version: "3"

services:
  master:
    image: singularities/spark
    command: start-spark master
    hostname: master
    networks:
      - overlay
    ports:
      - "6066:6066"
      - "7070:7070"
      - "8080:8080"
      - "50070:50070"
      - "7077:7077"
    deploy:
      placement:
        constraints:
          - node.role == manager
  worker:
    image: singularities/spark
    command: start-spark worker master
    networks:
      - overlay
    environment:
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 4g
    links:
      - master

  namenode:
    image: sfedyakov/hadoop-271-cluster 
    command: "/etc/bootstrap.sh -d -namenode"
    networks:
      - overlay
    hostname: namenode
    ports:
      - "8088:8088"
      - "50090:50090"
      - "19888:19888"
    deploy:
      placement:
        constraints:
          - node.role == manager
  datanode:
    image: sfedyakov/hadoop-271-cluster
    command: "/etc/bootstrap.sh -d -datanode"
    networks:
      - overlay
    links:
      - namenode
networks:
  overlay:
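
For context, the stack is presumably brought up with docker stack deploy, which prefixes the stack name to network and container names (hence spark_overlay and spark_namenode.1.… in the output below), here assuming the stack is named spark:

docker stack deploy --compose-file docker-compose.yml spark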

After the containers come up, running docker inspect <namenode container id> to find the namenode's IP address gives the following:

"Networks": {
                "ingress": {
                    "IPAMConfig": {
                        "IPv4Address": "10.255.0.20"
                    },
                    "Links": null,
                    "Aliases": [
                        "b4ec63d0330c"
                    ],
                    "NetworkID": "etkd22i440xtnyedekv0769dw",
                    "EndpointID": "5a76f2cb028d40e55ebe7e01688f13ec8f2176c4d134a7e6a2397ad1986eb9f2",
                    "Gateway": "",
                    "IPAddress": "10.255.0.20",
                    "IPPrefixLen": 16,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "02:42:0a:ff:00:14",
                    "DriverOpts": null
                },
                "spark_overlay": {
                    "IPAMConfig": {
                        "IPv4Address": "10.0.4.8"
                    },
                    "Links": null,
                    "Aliases": [
                        "b4ec63d0330c"
                    ],
                    "NetworkID": "07r7yh470ipyxy1vzc6b0j4g2",
                    "EndpointID": "14996683ea1e30a8ed9f2ff75fbd1776786bbac01323176ad1dac6669cb150b9",
                    "Gateway": "",
                    "IPAddress": "10.0.4.8",
                    "IPPrefixLen": 24,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "02:42:0a:00:04:08",
                    "DriverOpts": null
                }
            }
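
The same addresses can be pulled out without reading the full JSON by using a Go template (a convenience one-liner, not essential to the problem):

docker inspect -f '{{range $name, $net := .NetworkSettings.Networks}}{{$name}}: {{$net.IPAddress}}{{println}}{{end}}' <namenode container id>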

I wrote a simple WordCount example with Spark:

import org.apache.spark.sql.SparkSession

// Local Spark session that reads its input from the remote HDFS namenode
val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
val data = spark.sparkContext.textFile("hdfs://10.0.4.8:9000/Sample.txt")

// Classic word count: split lines on spaces, pair each word with 1, sum per word
val counts = data.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.foreach(println)
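
The presence of Sample.txt can be double-checked from inside the namenode container, assuming the hdfs binary is on the container's PATH:

docker exec -it <namenode container id> hdfs dfs -ls /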

However, it fails with the following error:

Caused by: java.net.URISyntaxException: Illegal character in hostname at index 12: hdfs://spark_namenode.1.ywlf9yx9hcm4duhxnywn91i35.spark_overlay:9000
    at java.net.URI$Parser.fail(URI.java:2848)
    at java.net.URI$Parser.parseHostname(URI.java:3387)
    at java.net.URI$Parser.parseServer(URI.java:3236)
    at java.net.URI$Parser.parseAuthority(URI.java:3155)
    at java.net.URI$Parser.parseHierarchical(URI.java:3097)
    at java.net.URI$Parser.parse(URI.java:3053)
    at java.net.URI.<init>(URI.java:673)
    at org.apache.hadoop.net.NetUtils.getCanonicalUri(NetUtils.java:270)
...