Below is my PySpark code for this:
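from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Assumes `sc` and `spark` already exist, as in the pyspark shell.
values = [
    (1, "2019-10-11, 2019-10-12, 2019-10-13, 2019-10-14, 2019-10-15"),
    (2, "2019-11-11, 2019-11-12, 2019-11-17, 2019-11-18"),
]
rdd = sc.parallelize(values)
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("dates", StringType(), True),
])
data = spark.createDataFrame(rdd, schema)
data.createOrReplaceTempView("data")
# Split the comma-separated string and count the resulting elements.
spark.sql("""select id,
       dates,
       size(split(dates, ",")) as date_count
from data""").show(20, False)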
Result:
+---+----------------------------------------------------------+----------+
|id |dates                                                     |date_count|
+---+----------------------------------------------------------+----------+
|1  |2019-10-11, 2019-10-12, 2019-10-13, 2019-10-14, 2019-10-15|5         |
|2  |2019-11-11, 2019-11-12, 2019-11-17, 2019-11-18            |4         |
+---+----------------------------------------------------------+----------+
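
For anyone who wants to sanity-check the date_count column without a Spark session, here is a minimal plain-Python sketch of the same split-and-count logic. The helper name count_dates is my own and is not part of the code above; it only assumes the dates are separated by commas, as in the sample rows.

```python
# Plain-Python check of the date_count column (no Spark session needed).
# count_dates is a hypothetical helper, not part of the original code.
rows = [
    (1, "2019-10-11, 2019-10-12, 2019-10-13, 2019-10-14, 2019-10-15"),
    (2, "2019-11-11, 2019-11-12, 2019-11-17, 2019-11-18"),
]


def count_dates(dates: str) -> int:
    # Mirrors size(split(dates, ",")): split on commas and count the pieces.
    return len(dates.split(","))


counts = {row_id: count_dates(dates) for row_id, dates in rows}
print(counts)  # {1: 5, 2: 4}
```

The counts match the date_count values in the output above (5 and 4).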