We can use the Spark SQL function translate() to build a grouping column for your strings: every uppercase letter is mapped to "1" and every lowercase letter to "0", while all other characters are left unchanged, so strings with the same case pattern land in the same group.
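For context, translate() substitutes characters positionally: the i-th character of the matching string is replaced by the i-th character of the replacement string, and any character not present in the matching string passes through untouched. A minimal sketch of that behaviour (the literal "Ab C" and the alias are purely illustrative):
from pyspark.sql.functions import lit, translate
# 'A' -> '1', 'b' -> '0', 'C' -> '1'; the space is not in the
# matching string, so it is kept as-is.
spark.range(1).select(translate(lit("Ab C"), "ABCabc", "111000").alias("t")).show()
# prints "10 1"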
With PySpark:
# Example dataframe for testing
from pyspark.sql.types import StringType
df = spark.createDataFrame(["adaz", "LssA ss", "Leds ST", "Pear QA", "Lear QA"], StringType())
# The actual transformation
from pyspark.sql.functions import translate, collect_list, col
import string

lowercases = string.ascii_lowercase
uppercases = string.ascii_uppercase
length_alphabet = len(uppercases)
# Map every uppercase letter to "1" and every lowercase letter to "0";
# anything else (spaces, digits, punctuation) passes through unchanged.
ones = "1" * length_alphabet
zeroes = "0" * length_alphabet
old = uppercases + lowercases
new = ones + zeroes

df.withColumn("group", translate(df.value, old, new)) \
  .groupBy(col("group")).agg(collect_list(df.value).alias("strings")) \
  .show(truncate=False)
Result:
+-------+---------------------------+
|group  |strings                    |
+-------+---------------------------+
|1000 11|[Leds ST, Pear QA, Lear QA]|
|0000   |[adaz]                     |
|1001 00|[LssA ss]                  |
+-------+---------------------------+
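Because translate() is also exposed directly in Spark SQL, the same grouping can be written as a plain SQL query. A sketch, assuming old and new from the snippet above are still in scope and that the dataframe is registered under the (illustrative) view name strings:
df.createOrReplaceTempView("strings")
spark.sql(f"""
    SELECT translate(value, '{old}', '{new}') AS group,
           collect_list(value) AS strings
    FROM strings
    GROUP BY translate(value, '{old}', '{new}')
""").show(truncate=False)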
With Scala Spark:
import org.apache.spark.sql.functions.{col, collect_list, translate}
import spark.implicits._ // enables the $"value" column syntax

// Example dataframe for testing
val df = Seq("adaz", "LssA ss", "Leds ST", "Pear QA", "Lear QA").toDF("value")

val lower = 'a' to 'z'
val upper = 'A' to 'Z'
val length_alphabet = upper.size
val lowercases = lower.mkString("")
val uppercases = upper.mkString("")
val ones = "1" * length_alphabet
val zeroes = "0" * length_alphabet
val old = uppercases + lowercases
val news = ones + zeroes // "new" is a reserved word in Scala

df.withColumn("group", translate($"value", old, news))
  .groupBy(col("group")).agg(collect_list($"value").alias("strings"))
  .show(truncate = false)
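This produces the same three groups as the PySpark version above.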