Here is a Scala answer, since the problem has nothing to do with pyspark as such; you can convert it. I could not reproduce your exact final output, but the alternative below should suffice.
// This could be optimized further, but that is not done here.
// Assumes distinct values to compare against; if not, some further logic is required.
// It looks like a bug in ranking - or my mistake??? Worked around this and dropped that logic.
// Pivoting does not help here, and grouping by specific column names in SQL is not elegant,
// so the more Scala-like approach is used.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
import java.time._
// Convert an ISO date string to its epoch day so day distances can be computed arithmetically.
def toEpochDay(s: String): Long = LocalDate.parse(s).toEpochDay
val toEpochDayUdf = udf(toEpochDay(_: String))
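// Note (my assumption, not from the original): the UDF could likely be avoided with
// built-ins, e.g. abs(datediff(to_date($"dia_dt"), to_date($"obs_dt"))); the epoch-day
// UDF is kept here to stay close to the original approach.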
// Our input.
val df0 = Seq(
("1","2018-09-05"), ("1","2018-09-14"),
("2","2018-12-23"), ("5","2015-12-20"),
("6","2018-12-23")
).toDF("id", "dia_dt")
val df1 = Seq(
("1","2018-09-06", 5), ("1","2018-09-07", 6), ("6","2023-09-07", 7),
("2","2018-12-23", 4), ("2","2018-12-24", 5), ("2","2018-10-23", 5),
("1","2017-09-06", 5), ("1","2017-09-07", 6),
("5","2015-12-20", 5), ("5","2015-12-21", 6), ("5","2015-12-19", 5), ("5","2015-12-18", 7), ("5","2015-12-22", 5),
("5","2015-12-23", 6), ("5","2015-12-17", 6), ("5","2015-12-26", 60)
).toDF("id", "obs_dt", "obs_val")
// Absolute distance in days between the diagnosis date and the observation date.
val myExpression = "abs(dia_epoch - obs_epoch)"
// Hard to know how to restrict the joined data further at this point.
val df2 = df1.withColumn("obs_epoch", toEpochDayUdf($"obs_dt"))
val df3 = df2.join(df0, Seq("id"), "inner")
             .withColumn("dia_epoch", toEpochDayUdf($"dia_dt"))
             .withColumn("abs_diff", expr(myExpression))
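// The inner join on "id" pairs every diagnosis date with every observation for that id;
// abs_diff is then the distance in days between the two dates of each pair.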
// Rank observations within each (id, dia_epoch) group by closeness in days.
@transient val w1 = Window.partitionBy("id", "dia_epoch").orderBy(asc("abs_diff"))
val df4 = df3.select($"*", rank().over(w1).alias("rank")) // The rank is required for the top-N filter below.
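// Note (my assumption): rank leaves gaps on ties, so "rank <= 3" may keep more than three
// rows per group; row_number() over the same window would guarantee exactly three.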
// Final results as a collect_list. Distinct column names are not so easy here because pivot could not be used - possibly a limitation of my own knowledge.
df4.orderBy("id", "dia_dt")
.filter($"rank" <= 3)
.groupBy($"id", $"dia_dt")
.agg(collect_list(struct($"obs_dt", $"obs_val")).as("observations"))
.show(false)
This returns:
+---+----------+---------------------------------------------------+
|id |dia_dt |observations |
+---+----------+---------------------------------------------------+
|1 |2018-09-05|[[2017-09-07, 6], [2018-09-06, 5], [2018-09-07, 6]]|
|1 |2018-09-14|[[2017-09-07, 6], [2018-09-06, 5], [2018-09-07, 6]]|
|2 |2018-12-23|[[2018-10-23, 5], [2018-12-23, 4], [2018-12-24, 5]]|
|5 |2015-12-20|[[2015-12-19, 5], [2015-12-20, 5], [2015-12-21, 6]]|
|6 |2018-12-23|[[2023-09-07, 7]] |
+---+----------+---------------------------------------------------+
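If separate columns per nearest observation are really wanted (the output I could not reproduce), here is a minimal sketch of one possible way to get them. It is an assumption on my part, not part of the answer above: it uses row_number instead of rank to avoid ties, and the obs_1/obs_2/obs_3 column names are my own choice.
// Sketch only, not part of the original answer: pivot on a tie-free row number so the
// three nearest observations become separate columns.
val wRn = Window.partitionBy("id", "dia_epoch").orderBy(asc("abs_diff"), asc("obs_epoch"))
val dfWide = df3
  .withColumn("rn", row_number().over(wRn))
  .filter($"rn" <= 3)
  .groupBy($"id", $"dia_dt")
  .pivot("rn", Seq(1, 2, 3))
  .agg(first(struct($"obs_dt", $"obs_val")))
  .toDF("id", "dia_dt", "obs_1", "obs_2", "obs_3")
dfWide.show(false)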
You can take it from here; the hard work is done.