Use collect_list as the aggregate in the groupBy, then apply concat_ws to join the resulting list into a single string.
df.show(false)
+-------------------------------------+------+---------------+---------------+----------------+-------+
|Errors                               |userid|associationtype|associationrank|associationvalue|sparkId|
+-------------------------------------+------+---------------+---------------+----------------+-------+
|Primary Key Constraint Violated      |3     |Brand5         |error          |Lee             |4      |
|Incorrect datatype in associationrank|3     |Brand5         |error          |Lee             |4      |
+-------------------------------------+------+---------------+---------------+----------------+-------+
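import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

df.groupBy("userid", "associationtype", "associationrank", "associationvalue", "sparkId")
  .agg(collect_list("Errors").as("Errors"))              // array of messages per group
  .withColumn("Errors", concat_ws(", ", col("Errors")))  // array -> ", "-separated string
  .show(false)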
+------+---------------+---------------+----------------+-------+----------------------------------------------------------------------+
|userid|associationtype|associationrank|associationvalue|sparkId|Errors                                                                |
+------+---------------+---------------+----------------+-------+----------------------------------------------------------------------+
|3     |Brand5         |error          |Lee             |4      |Primary Key Constraint Violated, Incorrect datatype in associationrank|
+------+---------------+---------------+----------------+-------+----------------------------------------------------------------------+
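
If you want to try this end to end, here is a minimal self-contained sketch. The sample rows are copied from the table above; the object name MergeErrors, the helper mergeErrors, and the local SparkSession settings are my own assumptions for illustration, not part of the original question. It also folds concat_ws and collect_list into a single agg expression and groups by every column except Errors, which should be equivalent to the two-step version for this data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

object MergeErrors {
  // Hypothetical helper: collapse rows that differ only in "Errors" into one
  // row per key, joining the collected messages with ", ".
  def mergeErrors(df: DataFrame): DataFrame = {
    // Group by every column except "Errors".
    val keys = df.columns.filter(_ != "Errors").map(c => col(c))
    df.groupBy(keys: _*)
      .agg(concat_ws(", ", collect_list(col("Errors"))).as("Errors"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and appName are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-errors-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the input table above.
    val df = Seq(
      ("Primary Key Constraint Violated", 3, "Brand5", "error", "Lee", 4),
      ("Incorrect datatype in associationrank", 3, "Brand5", "error", "Lee", 4)
    ).toDF("Errors", "userid", "associationtype", "associationrank",
      "associationvalue", "sparkId")

    mergeErrors(df).show(false)
    spark.stop()
  }
}
```

One caveat: Spark does not guarantee the order of elements returned by collect_list after a shuffle, so the messages inside the joined string may come out in a different order from run to run.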