Ошибка при запуске UDF для столбцов данных - PullRequest
0 голосов
/ 11 октября 2019

Я извлек некоторые данные из улья в датафрейм в указанном ниже формате.

| NUM_ID|            SIG1|           SIG2|             SIG3|            SIG4|
+----------------------+---------------+--------------------+---------------+
|XXXXX01|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX02|[{15695604780...|[{15695604780...|[{15695604780...|[{15695604780...|
|XXXXX03|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX04|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX05|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX06|[{15695605340...|[{15695605340...|[{15695605340...|[{15695605340...|
|XXXXX07|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX08|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|

Если мы примем только один сигнал, он будет таким, как показано ниже.

|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|
    [{1569560537000,3.7825},{1569560481000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|
    [{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560527000,34.7825}]|
    [{1569560535000,34.7825},{1569560479000,34.7825},{1569560487000,34.7825}]

Схема для приведенных выше данных:

 fromHive.printSchema
root
|-- NUM_ID: string (nullable = true)
|-- SIG1: string (nullable = true)
|-- SIG2: string (nullable = true)
|-- SIG3: string (nullable = true)
|-- SIG4: string (nullable = true)

Мое требование - получить все значения E из всех столбцов для определенного NUM_ID и создать в качестве нового cloumn с соответствующими значениями сигнала. в других столбцах, как показано ниже.

+-------+-------------+-------+-------+-------+-------+
| NUM_ID|            E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560531000|33.7825|34.7825|   null|96.3354|
|XXXXX01|1569560505000|   null|   null|35.5501|   null|
|XXXXX01|1569560531001|73.7825|   null|   null|   null|
|XXXXX02|1569560505000|34.7825|   null|35.5501|96.3354|
|XXXXX02|1569560531000|33.7825|34.7825|35.5501|96.3354|
|XXXXX02|1569560505001|73.7825|   null|   null|   null|
|XXXXX02|1569560502000|   null|   null|35.5501|96.3354|
|XXXXX03[1569560531000|73.7825|   null|   null|   null|
|XXXXX03|1569560505000|34.7825|   null|35.5501|96.3354|
|XXXXX03|1569560509000|   null|34.7825|35.5501|96.3354|

Я пытался написать UDF для достижения этой цели, как показано ниже.

UDF # 1:

def UDF_E:UserDefinedFunction=udf((r: Row)=>{
val SigColumn = "SIG1,SIG2,SIG3,SIG4,SIG5,SIG6"
val colList = SigColumn.split(",").toList
val rr = "[\\}],[\\{]".r
var out = ""
colList.foreach{ x =>
val a = (rr replaceAllIn(r.getAs(x).toString, "|")).replaceAll("\\[\\{","").replaceAll("\\}\\]","").replaceAll(""""E":""","")
val st = a.split("\\|").map(x => x.split(",")(0)).toSet
out = out + "," + st.mkString(",")
}
val out1 = out.replaceFirst(s""",""","").split(",").toSet.mkString(",")
out1
})

UDF # 2:

def UDF_V:UserDefinedFunction=udf((E: String,SIG:String)=>{
val Signal = SIG.replaceAll("\\{", "\\(").replaceAll("\\}", "\\)").replaceAll("\\[", "").replaceAll("\\]", "").replaceAll(""""E":""","").replaceAll(""","V":""","=")
val SigMap = "(\\w+)=([\\w 0-9 .]+)".r.findAllIn(Signal).matchData.map(i => {(i.group(1), i.group(2))}).toMap
var out = ""
if(SigMap.keys.toList.contains(E)){
out = SigMap(E).toString
}
out})

вывод UDF # 1 показан ниже:

+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
    |NUM_ID |SIG1                                                                                          |SIG2                                                                                            |SIG3                                                                                             |SIG4                                                                     |E            |
    +-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560483000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560497000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560475000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560489000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560535000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560531000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560513000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560537000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560491000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560521000|
    |XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560505000|
    +-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+

Вывод UDF # 2:

+-------+-------------+-------+-------+-------+-------+
    | NUM_ID|            E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
    +-------+-------------+-------+-------+-------+-------+
    |XXXXX01|1569560475000| 3.7812|       |       |       |
    |XXXXX01|1569560483000| 3.7812|       |       |34.7825|
    |XXXXX01|1569560489000|       |34.7825|       |       |
    |XXXXX01|1569560491000|34.7875|       |       |       |
    |XXXXX01|1569560497000|       |34.7825|       |       |
    |XXXXX01|1569560505000|       |       |34.7825|       |
    |XXXXX01|1569560513000|       |       |34.7825|       |
    |XXXXX01|1569560521000|       |       |34.7825|       |
    |XXXXX01|1569560531000| 3.7825|34.7825|34.7825|34.7825|
    |XXXXX01|1569560535000|       |       |       |34.7825|
    |XXXXX01|1569560537000|       | 3.7825|       |       |
    +-------+-------------+-------+-------+-------+-------+

Когда я запускаю df.show() iбудет выводиться, как указано выше, но когда я получу df.count(), я получу следующую ошибку:

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$UDF_E$1: (struct<NUM_ID:string,SIG1:string,SIG2:string,SIG3:string,SIG4:string>) => string)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)


Caused by: java.lang.NullPointerException
        at $line74.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$UDF_E$1$$anonfun$apply$1.apply(<console>:57

В чем может быть причина этой ошибки? какие-либо приводит к решению этой проблемы?

1 Ответ

0 голосов
/ 12 октября 2019

Решил эту проблему, добавив нулевую проверку, как показано ниже.

def UDF_E:UserDefinedFunction=udf((r: Row)=>{
val SigColumn = "SIG1,SIG2,SIG3,SIG4,SIG5,SIG6"
val colList = SigColumn.split(",").toList
val rr = "[\\}],[\\{]".r
var out = ""
colList.foreach{ x =>
if(r.getAs(x) != null){
val a = (rr replaceAllIn(r.getAs(x).toString, "|")).replaceAll("\\[\\{","").replaceAll("\\}\\]","").replaceAll(""""E":""","")
val st = a.split("\\|").map(x => x.split(",")(0)).toSet
out = out + "," + st.mkString(",")
}}
val out1 = out.replaceFirst(s""",""","").split(",").toSet.mkString(",")
out1
})
...