Question

Я пытаюсь вывести схему Dynami c json из kafka topi c. Нашел этот фрагмент кода в блоге, который выводит схему с помощью PYSPARK.

  def read_kafka_topic(topic):
    
    df_json = (spark.read
               .format("kafka")
               .option("kafka.bootstrap.servers", kafka_broker)
               .option("subscribe", topic)
               .option("startingOffsets", "earliest")
               .option("endingOffsets", "latest")
               .option("failOnDataLoss", "false")
               .load()
               .withColumn("value", expr("string(value)"))
               .filter(col("value").isNotNull())
               .select("key", expr("struct(offset, value) r"))
               .groupBy("key").agg(expr("max(r) r")) 
               .select("r.value"))

    df_read = spark.read.json(
    df_json.rdd.map(lambda x: x.value), multiLine=True)**

Пробовал с SCALA:

**val df_read = spark.read.json(df_json.rdd.map(x=>x))**

Но я получаю ошибку ниже.

не может быть применено к (org. apache .spark.rdd.RDD [org. apache .spark. sql .Row]) val df_read = spark.read. json (df_ json .rdd.map (x => x))

Исправление? Пожалуйста, помогите.

thebluephantom · Answer 1 · 04 августа 2020

RDD не поддерживается в структурированной потоковой передаче.
Структурированная потоковая передача не допускает вывода схемы.
Схема необходимо определить.

например, для источника файла

val dataSchema = "Recorded_At timestamp, Device string, Index long, Model string, User string, _corrupt_record String, gt string, x double, y double, z double"
val dataPath = "dbfs:/mnt/training/definitive-guide/data/activity-data-stream.json"

val initialDF = spark
  .readStream                             // Returns DataStreamReader
  .option("maxFilesPerTrigger", 1)        // Force processing of only 1 file per trigger 
  .schema(dataSchema)                     // Required for all streaming DataFrames
  .json(dataPath)                         // The stream's source directory and file type

например, ситуация Kafka, поскольку Databricks научит вас

spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

val kafkaServer = "server1.databricks.training:9092"  // US (Oregon)
// kafkaServer = "server2.databricks.training:9092"   // Singapore

val editsDF = spark.readStream                        // Get the DataStreamReader
  .format("kafka")                                    // Specify the source format as "kafka"
  .option("kafka.bootstrap.servers", kafkaServer)     // Configure the Kafka server name and port
  .option("subscribe", "en")                          // Subscribe to the "en" Kafka topic 
  .option("startingOffsets", "earliest")              // Rewind stream to beginning when we restart notebook
  .option("maxOffsetsPerTrigger", 1000)               // Throttle Kafka's processing of the streams
  .load()                                             // Load the DataFrame
  .select($"value".cast("STRING"))                    // Cast the "value" column to STRING

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, BooleanType, TimestampType}

lazy val schema = StructType(List(
  StructField("channel", StringType, true),
  StructField("comment", StringType, true),
  StructField("delta", IntegerType, true),
  StructField("flag", StringType, true),
  StructField("geocoding", StructType(List(            //  (OBJECT): Added by the server, field contains IP address geocoding information for anonymous edit.
    StructField("city", StringType, true),
    StructField("country", StringType, true),
    StructField("countryCode2", StringType, true),
    StructField("countryCode3", StringType, true),
    StructField("stateProvince", StringType, true),
    StructField("latitude", DoubleType, true),
    StructField("longitude", DoubleType, true)
  )), true),
  StructField("isAnonymous", BooleanType, true),
  StructField("isNewPage", BooleanType, true),
  StructField("isRobot", BooleanType, true),
  StructField("isUnpatrolled", BooleanType, true),
  StructField("namespace", StringType, true),           //   (STRING): Page's namespace. See https://en.wikipedia.org/wiki/Wikipedia:Namespace 
  StructField("page", StringType, true),                //   (STRING): Printable name of the page that was edited
  StructField("pageURL", StringType, true),             //   (STRING): URL of the page that was edited
  StructField("timestamp", TimestampType, true),        //   (STRING): Time the edit occurred, in ISO-8601 format
  StructField("url", StringType, true),
  StructField("user", StringType, true),                //   (STRING): User who made the edit or the IP address associated with the anonymous editor
  StructField("userURL", StringType, true),
  StructField("wikipediaURL", StringType, true),
  StructField("wikipedia", StringType, true)            //   (STRING): Short name of the Wikipedia that was edited (e.g., "en" for the English)
))

import org.apache.spark.sql.functions.from_json

val jsonEdits = editsDF.select(
  from_json($"value", schema).as("json")) 
...
...

Структурированная потоковая передача Spark: вывод схемы в Scala

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

1 Ответ

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Структурированная потоковая передача Spark: вывод схемы в Scala

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

1 Ответ

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Похожие темы