You are looking for array_except(col1, col2), which returns the elements that are present in col1 but absent from col2, without duplicates.
from pyspark.sql import functions as F
df = spark.createDataFrame([(["10","20","30"],["10","20","30"]), (["10","20","30"],["10","20"]), (["10","20","30"],["10","20","50"])],["age","id"])
df = df.withColumn('col3', F.array_except('age', 'id'))
#In case you want to get the difference between both columns
#df=df.withColumn('col3', F.array_union(F.array_except('age', 'id'), F.array_except('id', 'age')))
df.show()
You need to use a UDF if you are working with Spark < 2.4:
from pyspark.sql import types as T
from pyspark.sql import functions as F
df = spark.createDataFrame([(["10","20","30"],["10","20","30"]), (["10","20","30"],["10","20"]), (["10","20","30"],["10","20","50"])],["age","id"])
def arrayDiff(col1, col2):
    # set will remove duplicates
    diff = set(col1) - set(col2)
    return list(diff)
diff = F.udf(arrayDiff, T.ArrayType(T.StringType()))
df = df.withColumn('col3', diff('age', 'id'))
df.show()
Output:
+------------+------------+----+
| age| id|col3|
+------------+------------+----+
|[10, 20, 30]|[10, 20, 30]| []|
|[10, 20, 30]| [10, 20]|[30]|
|[10, 20, 30]|[10, 20, 50]|[30]|
+------------+------------+----+
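One caveat with the set-based UDF: Python sets do not guarantee element order, so the result may come back in a different order than the input array. If order matters, a sketch of an order-preserving variant (plain Python, using a hypothetical helper name `array_diff_ordered`; wrap it with F.udf exactly as above to use it in Spark):

```python
def array_diff_ordered(col1, col2):
    # Keep the original order of col1 instead of relying on set ordering
    seen = set(col2)  # O(1) membership tests against col2
    result = []
    for x in col1:
        if x not in seen:
            result.append(x)
            seen.add(x)  # also drops duplicates within col1 itself
    return result

print(array_diff_ordered(["10", "20", "30", "30"], ["10"]))  # ['20', '30']
```

The trade-off is the same as array_except's: duplicates are dropped, but here the first-seen order of col1 is preserved.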