You can explode the concatenated array, then group by each element and count its occurrences; a window per row ordered by that count gives you the most frequently occurring element.
Example:
from pyspark.sql.functions import *
from pyspark.sql import Window

df = spark.createDataFrame([
    [['a', 'a', 'b'], ['a']],
    [['c', 'd', 'd'], ['']],
    [['e'], ['e', 'f']],
    [[''], ['']]
]).toDF("arr_1", "arr_2")

# Concatenate both arrays and tag each row with a unique id for the re-join later.
df_new = df.withColumn('arr_concat', concat(col('arr_1'), col('arr_2')))
df1 = df_new.withColumn("mid", monotonically_increasing_id())

# Explode the concatenated array and count occurrences of each element per row.
df2 = df1.selectExpr("explode(arr_concat) as arr", "mid").groupBy("mid", "arr").agg(count(lit(1)).alias("cnt"))

# Keep only the most frequent element per row, then join back to the original columns.
w = Window.partitionBy("mid").orderBy(desc("cnt"))
df3 = df2.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn", "cnt")
df3.join(df1, ['mid'], 'inner').drop('mid', 'arr_concat').withColumn("arr", array(col("arr"))).show()
#+---+---------+------+
#|arr| arr_1| arr_2|
#+---+---------+------+
#|[d]|[c, d, d]| []|
#|[e]| [e]|[e, f]|
#|[a]|[a, a, b]| [a]|
#| []| []| []|
#+---+---------+------+
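
If you happen to be on Spark 3.4+, a shorter sketch of the same idea is to replace the count + row_number window with the built-in mode aggregate (assuming df1 from above; whether mode is available depends on your Spark version):

# Spark >= 3.4 only: mode() returns the most frequent value per group,
# replacing the count + window step above.
df_alt = (df1.selectExpr("explode(arr_concat) as arr", "mid")
             .groupBy("mid")
             .agg(mode("arr").alias("arr")))
df_alt.join(df1, ['mid'], 'inner').drop('mid', 'arr_concat') \
      .withColumn("arr", array(col("arr"))).show()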