Как перенести данные в pyspark для нескольких разных столбцов - PullRequest
0 голосов
/ 10 июля 2020

Я пытаюсь перенести данные в pyspark. Я смог транспонировать с помощью одного столбца. Однако с несколькими столбцами я не уверен, как передавать параметры для функции разнесения.

Формат ввода:

enter image description here

Output Format :

введите описание изображения здесь

Кто-нибудь может намекнуть мне, пожалуйста, любым примером или ссылкой? Заранее спасибо.

Ответы [ 2 ]

2 голосов
/ 10 июля 2020

используйте stack, чтобы транспонировать, как показано ниже (spark>=2.4) -

Загрузить тестовые данные

val data =
      """
        |PersonId | Education1CollegeName | Education1Degree | Education2CollegeName | Education2Degree |Education3CollegeName | Education3Degree
        | 1 | xyz | MS | abc | Phd | pqr | BS
        |  2 | POR | MS | ABC | Phd | null | null
      """.stripMargin
    val stringDS1 = data.split(System.lineSeparator())
      .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
      .toSeq.toDS()
    val df1 = spark.read
      .option("sep", "|")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("nullValue", "null")
      .csv(stringDS1)
    df1.show(false)
    df1.printSchema()

    /**
      * +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
      * |PersonId|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
      * +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
      * |1       |xyz                  |MS              |abc                  |Phd             |pqr                  |BS              |
      * |2       |POR                  |MS              |ABC                  |Phd             |null                 |null            |
      * +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
      *
      * root
      * |-- PersonId: integer (nullable = true)
      * |-- Education1CollegeName: string (nullable = true)
      * |-- Education1Degree: string (nullable = true)
      * |-- Education2CollegeName: string (nullable = true)
      * |-- Education2Degree: string (nullable = true)
      * |-- Education3CollegeName: string (nullable = true)
      * |-- Education3Degree: string (nullable = true)
      */

Un- развернуть таблицу с помощью стека


    df1.selectExpr("PersonId",
      "stack(3, Education1CollegeName, Education1Degree, Education2CollegeName, Education2Degree, " +
        "Education3CollegeName, Education3Degree) as (CollegeName, EducationDegree)")
      .where("CollegeName is not null and EducationDegree is not null")
      .show(false)

    /**
      * +--------+-----------+---------------+
      * |PersonId|CollegeName|EducationDegree|
      * +--------+-----------+---------------+
      * |1       |xyz        |MS             |
      * |1       |abc        |Phd            |
      * |1       |pqr        |BS             |
      * |2       |POR        |MS             |
      * |2       |ABC        |Phd            |
      * +--------+-----------+---------------+
      */
0 голосов
/ 10 июля 2020

Очищенная версия PySpark этого

from pyspark.sql import functions as F
df_a = spark.createDataFrame([(1,'xyz','MS','abc','Phd','pqr','BS'),(2,"POR","MS","ABC","Phd","","")],[ 

"id","Education1CollegeName","Education1Degree","Education2CollegeName","Education2Degree","Education3CollegeName","Education3Degree"])

+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| id|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|

    +---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
    |  1|                  xyz|              MS|                  abc|             Phd|                  pqr|              BS|
    |  2|                  POR|              MS|                  ABC|             Phd|                     |                |
    +---+---------------------+----------------+---------------------+----------------+---------------------+----------------+

Код -

df = df_a.selectExpr("id", "stack(3, Education1CollegeName, Education1Degree,Education2CollegeName, Education2Degree,Education3CollegeName, Education3Degree) as (B, C)")

+---+---+---+
| id|  B|  C|
+---+---+---+
|  1|xyz| MS|
|  1|abc|Phd|
|  1|pqr| BS|
|  2|POR| MS|
|  2|ABC|Phd|
|  2|   |   |
+---+---+---+
...