I am new to Apache Spark and I need some help. Can anyone tell me how to correctly join the following two DataFrames?
First DataFrame:
| DATE_TIME | PHONE_NUMBER |
|---------------------|--------------|
| 2019-01-01 00:00:00 | 7056589658 |
| 2019-02-02 00:00:00 | 7778965896 |
Second DataFrame:
| DATE_TIME | IP |
|---------------------|---------------|
| 2019-01-01 01:00:00 | 194.67.45.126 |
| 2019-02-02 00:00:00 | 102.85.62.100 |
| 2019-03-03 03:00:00 | 102.85.62.100 |
The final DataFrame I want:
| DATE_TIME | PHONE_NUMBER | IP |
|---------------------|--------------|---------------|
| 2019-01-01 00:00:00 | 7056589658 | |
| 2019-01-01 01:00:00 | | 194.67.45.126 |
| 2019-02-02 00:00:00 | 7778965896 | 102.85.62.100 |
| 2019-03-03 03:00:00 | | 102.85.62.100 |
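Since the result above keeps rows that exist in only one of the two DataFrames, my assumption is that I actually need a `"full_outer"` join rather than the `"left_outer"` one I tried below (a minimal sketch of what I think the correct call would be, using the same `df1` and `df2` as in my code):

```scala
// Assumption: "full_outer" keeps unmatched rows from BOTH sides,
// which matches the table I want. Sorting is just for readability.
val expected = df1.join(df2, Seq("DATE_TIME"), "full_outer")
  .orderBy("DATE_TIME")
expected.show()
```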
Here is the code I tried:
```scala
import org.apache.spark.sql.Dataset
import spark.implicits._

val df1 = Seq(
  ("2019-01-01 00:00:00", "7056589658"),
  ("2019-02-02 00:00:00", "7778965896")
).toDF("DATE_TIME", "PHONE_NUMBER")
df1.show()

val df2 = Seq(
  ("2019-01-01 01:00:00", "194.67.45.126"),
  ("2019-02-02 00:00:00", "102.85.62.100"),
  ("2019-03-03 03:00:00", "102.85.62.100")
).toDF("DATE_TIME", "IP")
df2.show()

val total = df1.join(df2, Seq("DATE_TIME"), "left_outer")
total.show()
```
Unfortunately, this throws an error:
```
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:367)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:140)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:135)
  ...
```