PySpark also has the Column methods contains() and like(), which we can use inside .filter().
Example:
from pyspark.sql.functions import col, count, lower
# sample data
df.show()
#+-----------+------------------+
#|issue_month|       loan_status|
#+-----------+------------------+
#|         10|        Fully Paid|
#|         10|           Default|
#|         10|Late (31-120 days)|
#+-----------+------------------+
# In the filter, convert loan_status to lower case and look for the substring "late".
df.groupBy("issue_month","loan_status").\
count().\
filter(lower(col("loan_status")).contains("late")).\
show()
# Using like(): the pattern "late%" matches values that start with "late" (use "%late%" to match it anywhere).
df.groupBy("issue_month","loan_status").\
count().\
filter(lower(col("loan_status")).like("late%")).\
show()
# I would suggest filtering rows before the groupBy; on big data this can significantly improve performance because fewer rows get grouped.
df.filter(lower(col("loan_status")).like("late%")).\
groupBy("issue_month","loan_status").\
count().\
show()
#+-----------+------------------+-----+
#|issue_month|       loan_status|count|
#+-----------+------------------+-----+
#|         10|Late (31-120 days)|    1|
#+-----------+------------------+-----+
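One caveat: contains("late") and like("late%") pick the same row on this sample data, but they are not the same test. Here is a minimal plain-Python sketch (not PySpark) of the two matching rules. The helper names `sql_like` and `contains` are hypothetical, the "Paid late" value is made up for illustration, and the LIKE translation only handles the % and _ wildcards (no escape characters).

```python
import re

def sql_like(value: str, pattern: str) -> bool:
    """Approximate SQL LIKE: % is any run of characters, _ is exactly one character."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

def contains(value: str, needle: str) -> bool:
    """Substring test, the rule that Column.contains() applies."""
    return needle in value

# "Paid late" is a hypothetical status, added only to show where the rules differ.
statuses = ["Fully Paid", "Default", "Late (31-120 days)", "Paid late"]

by_contains = [s for s in statuses if contains(s.lower(), "late")]
by_like_prefix = [s for s in statuses if sql_like(s.lower(), "late%")]
by_like_anywhere = [s for s in statuses if sql_like(s.lower(), "%late%")]
```

So if the status can have "late" somewhere other than the start, use contains("late") or like("%late%") rather than like("late%").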
We can use .agg(sum("count")) to get the total count regardless of issue_month.
Example:
from pyspark.sql.functions import col, count, lower, sum as _sum
df.show()
#+-----------+------------------+
#|issue_month|       loan_status|
#+-----------+------------------+
#|         10|        Fully Paid|
#|         10|           Default|
#|         11|Late (31-120 days)|
#|         11|Late (31-120 days)|
#|         10| Late (16-30 days)|
#+-----------+------------------+
df.filter(lower(col("loan_status")).contains("late")).\
groupBy("issue_month","loan_status").\
count().\
agg(_sum("count").alias("sum")).\
show()
#+---+
#|sum|
#+---+
#|  3|
#+---+
df.filter(lower(col("loan_status")).like("late%")).\
groupBy("issue_month","loan_status").\
count().\
groupBy("loan_status").\
agg(_sum("count").alias("sum_count")).\
show()
# The same result can be obtained with a single groupBy.
df.filter(lower(col("loan_status")).contains("late")).\
groupBy("loan_status").\
agg(count("*").alias("sum_count")).\
show()
#+------------------+---------+
#|       loan_status|sum_count|
#+------------------+---------+
#|Late (31-120 days)|        2|
#| Late (16-30 days)|        1|
#+------------------+---------+
UPDATE:
df.filter(lower(col("loan_status")).contains("late")).\
groupBy("issue_month").\
agg(count("*").alias("sum_count")).\
show()
#+-----------+---------+
#|issue_month|sum_count|
#+-----------+---------+
#|         10|        1|
#|         11|        2|
#+-----------+---------+
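If you want to sanity-check the expected counts without a Spark session, here is a plain-Python sketch of the same filter-then-count logic using collections.Counter. It only mirrors the logic and is not how Spark executes it. The `rows` list is an assumption that copies the second sample DataFrame.

```python
from collections import Counter

# Assumed copy of the second sample DataFrame as (issue_month, loan_status) tuples.
rows = [
    (10, "Fully Paid"),
    (10, "Default"),
    (11, "Late (31-120 days)"),
    (11, "Late (31-120 days)"),
    (10, "Late (16-30 days)"),
]

# Same idea as filter(lower(col("loan_status")).contains("late")).
late_rows = [(month, status) for month, status in rows if "late" in status.lower()]

total = len(late_rows)                                  # overall count of late loans
per_status = Counter(status for _, status in late_rows)  # like groupBy("loan_status")
per_month = Counter(month for month, _ in late_rows)     # like groupBy("issue_month")
```

Here `total` corresponds to the single `sum` value, `per_status` to the `loan_status` table, and `per_month` to the `issue_month` table.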