Assuming you need to keep the other identifiers, you can group your DataFrame by those columns with groupBy and then aggregate the reason column with collect_list. Like this:
val df = Seq(("RM", "10", "001","1","NOT PRESENT"),
("RM","10","001","1","NOT VALID DATA"))
.toDF("cust_name","cust_id","cust_code","flag","reason")
import org.apache.spark.sql.functions.collect_list
df.groupBy("cust_name", "cust_id", "cust_code", "flag") // group by your keys
.agg(collect_list('reason) alias "reason_list")
.show(truncate = false)
Output:
+---------+-------+---------+----+-----------------------------+
|cust_name|cust_id|cust_code|flag|reason_list |
+---------+-------+---------+----+-----------------------------+
|RM |10 |001 |1 |[NOT PRESENT, NOT VALID DATA]|
+---------+-------+---------+----+-----------------------------+
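If you need a single delimited string rather than an array, a minimal variation (not part of the original answer, assuming the same df) is to wrap the aggregate in concat_ws; you could also substitute collect_set for collect_list if duplicate reasons should be dropped:

import org.apache.spark.sql.functions.{collect_list, concat_ws}

df.groupBy("cust_name", "cust_id", "cust_code", "flag")
  // join the collected reasons into one comma-separated string,
  // e.g. "NOT PRESENT, NOT VALID DATA"
  .agg(concat_ws(", ", collect_list('reason)) alias "reasons")
  .show(truncate = false)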