Split a column into multiple rows
ref: explode in PySpark
import pyspark.sql.functions as F
# Assumes an active SparkSession named `spark`, as in the pyspark shell.
df = spark.createDataFrame([(132, "economics,engineering"), (201, "engineering"), (123, "sociology,philosophy"), (222, "philosophy")], ["id", "classes"])
df.show()
+---+--------------------+
| id|             classes|
+---+--------------------+
|132|economics,enginee...|
|201|         engineering|
|123|sociology,philosophy|
|222|          philosophy|
+---+--------------------+
explodeCol = df.select(F.col("id"), F.explode(F.split(F.col("classes"), ",")).alias("branch"))
explodeCol.show()
+---+-----------+
| id|     branch|
+---+-----------+
|132|  economics|
|132|engineering|
|201|engineering|
|123|  sociology|
|123| philosophy|
|222| philosophy|
+---+-----------+
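To make the split-then-explode step concrete, here is a minimal plain-Python sketch of the same transformation. It does not use Spark at all; `split_and_explode` is a hypothetical helper written only for illustration, and it assumes the same four sample rows as above.

```python
# Plain-Python model of F.explode(F.split(col, ",")):
# each input row produces one output row per comma-separated item.
rows = [
    (132, "economics,engineering"),
    (201, "engineering"),
    (123, "sociology,philosophy"),
    (222, "philosophy"),
]


def split_and_explode(rows, sep=","):
    """Yield (id, item) pairs, one per separated item (hypothetical helper)."""
    for row_id, classes in rows:
        if classes is None:
            # Mirrors explode: a null value produces no output rows.
            continue
        for item in classes.split(sep):
            yield row_id, item


exploded = list(split_and_explode(rows))
for pair in exploded:
    print(pair)
```

Note that in Spark the second argument of `F.split` is a regular expression, so a separator such as `|` or `.` would need escaping there, whereas `str.split` in this sketch treats it literally.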
# One-hot encode: count rows per (id, branch); fill missing combinations with 0.
explodeCol.groupBy("id").pivot("branch").agg(F.sum(F.lit(1))).na.fill(0).show()
+---+---------+-----------+----------+---------+
| id|economics|engineering|philosophy|sociology|
+---+---------+-----------+----------+---------+
|222|        0|          0|         1|        0|
|201|        0|          1|         0|        0|
|132|        1|          1|         0|        0|
|123|        0|          0|         1|        1|
+---+---------+-----------+----------+---------+
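The group-by, pivot, and fill step can be modelled the same way. The sketch below is plain Python, not Spark; `pivot_counts` is a hypothetical helper, and the ids are sorted only to make the result deterministic, since Spark does not guarantee row order after `groupBy`.

```python
from collections import Counter

# (id, branch) pairs, i.e. the exploded rows from the previous step.
exploded = [
    (132, "economics"),
    (132, "engineering"),
    (201, "engineering"),
    (123, "sociology"),
    (123, "philosophy"),
    (222, "philosophy"),
]


def pivot_counts(pairs):
    """Count pairs per (id, branch) and lay them out as one row per id.

    Hypothetical helper modelling groupBy("id").pivot("branch")
    .agg(sum(lit(1))).na.fill(0).
    """
    counts = Counter(pairs)
    # One column per distinct branch value, in sorted order.
    columns = sorted({branch for _, branch in pairs})
    ids = sorted({row_id for row_id, _ in pairs})
    # Counter returns 0 for missing keys, which plays the role of na.fill(0).
    table = {row_id: [counts[(row_id, c)] for c in columns] for row_id in ids}
    return columns, table


columns, table = pivot_counts(exploded)
print(["id"] + columns)
for row_id, values in table.items():
    print([row_id] + values)
```

In the Spark version the fill is needed because combinations with no matching rows come back as null rather than 0.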
For more detailed Spark documentation, see http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html