Converting a nested JSON value (an array of structs) into a new DataFrame
import org.apache.spark.sql.functions._

val rd1 = spark.read.option("multiLine", "true").option("mode", "PERMISSIVE").json("data.json")
val ds1 = rd1.select("alpha._id", "alpha.Description", "alpha.Sub-Tower", "alpha.Tower", "alpha.input_data")
ds1.show() // gives a single row with an array in each column; I need a table of 4 rows instead
My approach 1:
val ds2 = ds1
  .withColumn("Description", explode(col("Description")))
  .withColumn("Tower", explode(col("Tower")))
  .withColumn("input_data", explode(col("input_data")))
  .withColumn("Sub-Tower", explode(col("Sub-Tower")))
  .withColumn("_id", explode(col("_id")))
println(ds2.count()) // the json array length is 4, but this prints 1025, which is incorrect
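The blow-up happens because each `explode` multiplies rows independently, so five explodes over 4-element arrays behave like a cross product of roughly 4^5 = 1024 combinations instead of 4 aligned rows. A plain-Scala sketch of the difference (illustrative only, not Spark):

```scala
// Independent explodes act like a cross product over the five arrays:
val n = 4
val crossed = for {
  a <- 1 to n; b <- 1 to n; c <- 1 to n; d <- 1 to n; e <- 1 to n
} yield (a, b, c, d, e)
println(crossed.size) // 1024 = 4^5

// Exploding the array of structs once keeps each element's fields together:
case class Alpha(_id: String, input_data: String)
val rows = Seq(
  Alpha("27", "alpha beta gamma"), Alpha("91", "alpha beta gamma"),
  Alpha("21", "alpha beta gamma"), Alpha("29", "alpha beta gamma")
)
println(rows.size) // 4
```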
Input (data.json):
{
"name": "raxvsdbsd",
"stack": "raw",
"threshold": "50",
"alpha": [
{
"_id": "27",
"input_data": "alpha beta gamma",
"Tower": "A B C",
"Description": "a b,c",
"Sub-Tower": "crt"
},
{
"_id": "91",
"input_data": "alpha beta gamma",
"Tower": "A B C",
"Description": "a b,c",
"Sub-Tower": "crt"
},
{
"_id": "21",
"input_data": "alpha beta gamma",
"Tower": "A B C",
"Description": "a b,c",
"Sub-Tower": "crt"
},
{
"_id": "29",
"input_data": "alpha beta gamma",
"Tower": "A B C",
"Description": "a b,c",
"Sub-Tower": "crt"
}
]
}
Expected output: a table for alpha, as shown below:
+-----------+---------+-----+---+----------------+
|Description|Sub-Tower|Tower|_id| input_data|
+-----------+---------+-----+---+----------------+
| a b,c| crt|A B C| 27|alpha beta gamma|
| a b,c| crt|A B C| 91|alpha beta gamma|
| a b,c| crt|A B C| 21|alpha beta gamma|
| a b,c| crt|A B C| 29|alpha beta gamma|
+-----------+---------+-----+---+----------------+
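A likely fix (a sketch, assuming the schema above) is to explode the `alpha` array once and then select the struct's fields, e.g. `rd1.select(explode(col("alpha")).as("a")).select("a.*")` in Spark, so each array element becomes one row with its fields intact. In plain Scala, the flattening this performs looks like:

```scala
// One row per element of the "alpha" array; a struct's fields stay together.
case class Alpha(description: String, subTower: String, tower: String, _id: String, input_data: String)

val alpha = Seq(
  Alpha("a b,c", "crt", "A B C", "27", "alpha beta gamma"),
  Alpha("a b,c", "crt", "A B C", "91", "alpha beta gamma"),
  Alpha("a b,c", "crt", "A B C", "21", "alpha beta gamma"),
  Alpha("a b,c", "crt", "A B C", "29", "alpha beta gamma")
)

// Matches the 4-row expected table above:
val table = alpha.map(a => (a.description, a.subTower, a.tower, a._id, a.input_data))
table.foreach(println)
```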