Создайте новый столбец из одного значения, доступного в других столбцах, как массив пары Key Value - PullRequest
0 голосов
/ 04 октября 2019

Я извлек некоторые данные из улья в датафрейм в указанном ниже формате.

+--------------------+-----------------+--------------------+---------------+
| NUM_ID|            SIG1|           SIG2|             SIG3|            SIG4|
+----------------------+---------------+--------------------+---------------+
|XXXXX01|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX02|[{15695604780...|[{15695604780...|[{15695604780...|[{15695604780...|
|XXXXX03|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX04|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX05|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX06|[{15695605340...|[{15695605340...|[{15695605340...|[{15695605340...|
|XXXXX07|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX08|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|

Если мы возьмем только один сигнал, он будет таким, как показано ниже.

|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|
    [{1569560537000,3.7825},{1569560481000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|
    [{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560527000,34.7825}]|
    [{1569560535000,34.7825},{1569560479000,34.7825},{1569560487000,34.7825}]

Для каждого NUM_ID каждый столбец SIG будет иметь массив пар E и V.

Схема для приведенных выше данных:

fromHive.printSchema
root
|-- NUM_ID: string (nullable = true)
|-- SIG1: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- E: long (nullable = true)
|    |    |-- V: double (nullable = true)
|-- SIG2: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- E: long (nullable = true)
|    |    |-- V: double (nullable = true)
|-- SIG3: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- E: long (nullable = true)
|    |    |-- V: double (nullable = true)
|-- SIG4: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- E: long (nullable = true)
|    |    |-- V: double (nullable = true)

Мое требование - получить все значения E извсе столбцы для определенного NUM_ID и создать в качестве нового столбца с соответствующими значениями сигналов в других столбцах, как показано ниже.

+-------+-------------+-------+-------+-------+-------+
| NUM_ID|            E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560531000|33.7825|34.7825|   null|96.3354|
|XXXXX01|1569560505000|   null|   null|35.5501|   null|
|XXXXX01|1569560531001|73.7825|   null|   null|   null|
|XXXXX02|1569560505000|34.7825|   null|35.5501|96.3354|
|XXXXX02|1569560531000|33.7825|34.7825|35.5501|96.3354|
|XXXXX02|1569560505001|73.7825|   null|   null|   null|
|XXXXX02|1569560502000|   null|   null|35.5501|96.3354|
|XXXXX03[1569560531000|73.7825|   null|   null|   null|
|XXXXX03|1569560505000|34.7825|   null|35.5501|96.3354|
|XXXXX03|1569560509000|   null|34.7825|35.5501|96.3354|

Значения E из всех четырех столбцов сигналов для конкретного NUM_ID следует принимать какодин столбец без дубликатов и значения V для соответствующего E должны быть заполнены в разных столбцах. Предположим, что у сигнала нет какой-либо пары EV для конкретного E, тогда этот столбец должен быть нулевым. как показано выше.

Заранее спасибо. Любое преимущество приветствуется.

Для лучшего понимания ниже приведен пример структуры входных и ожидаемых выходных данных.

ВХОД:

+-------------------------+-----------------+-----------------+------------------+
| NUM_ID|             SIG1|           SIG2|             SIG3|            SIG4|
+-------------------------+-----------------+-----------------+------------------+
|XXXXX01|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}] |
|XXXXX02|[{E7,V1},{E8,V2}]|[{E1,V3},{E3,V4}]|[{E1,V5},{E5,V6}]|[{E9,V7},{E8,V8}]|
|XXXXX03|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}] |

ОЖИДАЕМЫЙ ВЫХОД:



+-------+---+--------+-------+-------+-------+
| NUM_ID|  E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+---+-------+-------+-------+-------+
|XXXXX01| E1|     V1|     V3|   null|   null|
|XXXXX01| E2|     V2|   null|   null|     V8|
|XXXXX01| E3|   null|     V4|   null|   null|
|XXXXX01| E4|   null|   null|     V5|   null|
|XXXXX01| E5|   null|   null|     V6|     V7|

|XXXXX02| E1|   null|     V3|     V5|   null|
|XXXXX02| E3|   null|     V4|   null|   null|
|XXXXX02| E5|   null|   null|     V6|   null|
|XXXXX02[ E7|     V1|   null|   null|   null|
|XXXXX02| E8|     V2|   null|   null|     V7|
|XXXXX02| E9|   null|34.7825|   null|     V8|

1 Ответ

0 голосов
/ 04 октября 2019

Входной файл CSV имеет следующий вид:

NUM_ID | SIG1 | SIG2 | SIG3 | SIG4 XXXXX01 | [{1569560531000,3.7825}, {1569560475000,3.7812}, {1569560483000,3.7812}, {1569560491000,34.7875}] | [{1569560537000,3.7825}, {1569560531000,34.7825}, {1569560489000,34.7825}, {1569560497000,34.7825}] | [{1569560505000,34.7825}, {1569560513000,34.7825}, {1569560521000, 34.7825}, {1569560531000,34.7825}] | [{1569560535000,34.7825}, {1569560531000,34.7825}, {1569560483000,34.7825}

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.UserDefinedFunction

    val df = spark.read.format("csv").option("header","true").option("delimiter", "|").load("path .csv")
    df.show(false)
    +-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
    |NUM_ID |SIG1                                                                                          |SIG2                                                                                            |SIG3                                                                                             |SIG4                                                                     |
    +-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|
    +-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+


    //UDF to generate column E

    def UDF_E:UserDefinedFunction=udf((r: Row)=>{
    val SigColumn = "SIG1,SIG2,SIG3,SIG4"
    val colList = SigColumn.split(",").toList
    val rr = "[\\}],[\\{]".r
    var out = ""
    colList.foreach{ x =>
    val a = (rr replaceAllIn(r.getAs(x).toString, "|")).replaceAll("\\[\\{","").replaceAll("\\}\\]","")
    val b = a.split("\\|").map(x => x.split(",")(0)).toSet
    out = out + "," + b.mkString(",")
    }
    val out1 = out.replaceFirst(s""",""","").split(",").toSet.mkString(",")
    out1
    })


    //UDF to generate column value with Signal

    def UDF_V:UserDefinedFunction=udf((E: String, SIG:String)=>{
    val Signal = SIG.replaceAll("\\{", "\\(").replaceAll("\\}", "\\)").replaceAll("\\[", "").replaceAll("\\]", "")
    val SigMap = "(\\w+),([\\w 0-9 .]+)".r.findAllIn(Signal).matchData.map(i => {(i.group(1), i.group(2))}).toMap
    var out = ""
    if(SigMap.keys.toList.contains(E)){
    out = SigMap(E).toString
    }
    out})

    //new DataFrame with Column "E"
    val df1 = df.withColumn("E", UDF_E(struct(df.columns map col: _*))).withColumn("E", explode(split(col("E"), ",")))

     df1.show(false)
    +-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
    |NUM_ID |SIG1                                                                                          |SIG2                                                                                            |SIG3                                                                                             |SIG4                                                                     |E            |
    +-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560483000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560497000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560475000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560489000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560535000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560531000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560513000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560537000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560491000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560521000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560505000|
    +-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+

    //Final DataFrame
    val df2 =  df1.withColumn("SIG1_V", UDF_V(col("E"),col("SIG1"))).withColumn("SIG2_V", UDF_V(col("E"),col("SIG2"))).withColumn("SIG3_V", UDF_V(col("E"),col("SIG3"))).withColumn("SIG4_V", UDF_V(col("E"),col("SIG4"))).drop("SIG1","SIG2","SIG3","SIG4")

    df2.show()
    +-------+-------------+-------+-------+-------+-------+
    | NUM_ID|            E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
    +-------+-------------+-------+-------+-------+-------+
    |XXXXX01|1569560475000| 3.7812|       |       |       |
    |XXXXX01|1569560483000| 3.7812|       |       |34.7825|
    |XXXXX01|1569560489000|       |34.7825|       |       |
    |XXXXX01|1569560491000|34.7875|       |       |       |
    |XXXXX01|1569560497000|       |34.7825|       |       |
    |XXXXX01|1569560505000|       |       |34.7825|       |
    |XXXXX01|1569560513000|       |       |34.7825|       |
    |XXXXX01|1569560521000|       |       |34.7825|       |
    |XXXXX01|1569560531000| 3.7825|34.7825|34.7825|34.7825|
    |XXXXX01|1569560535000|       |       |       |34.7825|
    |XXXXX01|1569560537000|       | 3.7825|       |       |
    +-------+-------------+-------+-------+-------+-------+
...