With a UDF, inspired by "Regex to match only commas not in parentheses?":
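import java.util.regex.Pattern
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

val df = List(
  ("item (foo bar) is available, soaps", true),
  ("item (bar) is available", false),
  ("soaps, shampoo", false),
  ("item (foo bar, bar) is available", true),
  ("item (foo bar, bar) is available, (soap, shampoo)", true)
).toDF("itemNames", "coupons")
df.show(false)

// Pattern.COMMENTS lets the pattern carry whitespace and #-comments, which the regex engine ignores
val regex = Pattern.compile(
  ", # Match a comma\n" +
  "(?! # only if it's not followed by...\n" +
  " [^(]* # any number of characters except opening parens\n" +
  " \\) # followed by a closing parens\n" +
  ") # End of lookahead",
  Pattern.COMMENTS)

// Split on commas outside parentheses and wrap the function as a UDF
val customSplit = (value: String) => regex.split(value)
val customSplitUDF = udf(customSplit)

// explode turns each element of the split array into its own row
val result = df.withColumn("itemNames", explode(customSplitUDF($"itemNames")))
result.show(false)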
Output:
+--------------------------------+-------+
|itemNames                       |coupons|
+--------------------------------+-------+
|item (foo bar) is available     |true   |
| soaps                          |true   |
|item (bar) is available         |false  |
|soaps                           |false  |
| shampoo                        |false  |
|item (foo bar, bar) is available|true   |
|item (foo bar, bar) is available|true   |
| (soap, shampoo)                |true   |
+--------------------------------+-------+
If trimming is required, it can easily be added to "customSplit".
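A minimal sketch of a trimming variant, written as plain Scala so it can be tried without Spark. The name customSplitTrimmed is hypothetical (it is not part of the answer above), and the pattern is the same regex written compactly, without Pattern.COMMENTS:

```scala
import java.util.regex.Pattern

// Same idea as the commented pattern above: a comma that is not followed by
// a closing paren without an opening paren in between.
val regex: Pattern = Pattern.compile(",(?![^(]*\\))")

// Hypothetical variant of customSplit: split, then trim whitespace from each piece.
val customSplitTrimmed: String => Array[String] =
  (value: String) => regex.split(value).map(_.trim)

// Example: the leading space of " soaps" is removed.
println(customSplitTrimmed("item (foo bar) is available, soaps").mkString("|"))
// prints: item (foo bar) is available|soaps
```

Assuming the rest of the code stays unchanged, passing customSplitTrimmed to udf in place of customSplit should give the same rows without the leading spaces.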