Создание примера DataFrame:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
data= [[200,'Marketing','Jane','Smith'],
[140,'Marketing','Jerry','Soreky'],
[120,'Marketing','Justin','Sauren'],
[170,'Sales','Joe','Statham'],
[190,'Sales','Jeremy','Sage'],
[220,'Sales','Jay','Sawyer']]
columns= ['SALARY','Dept_name','First_name','Last_name']
df= spark.createDataFrame(data,columns)
df.show()
+------+---------+----------+---------+
|SALARY|Dept_name|First_name|Last_name|
+------+---------+----------+---------+
| 200|Marketing| Jane| Smith|
| 140|Marketing| Jerry| Soreky|
| 120|Marketing| Justin| Sauren|
| 170| Sales| Joe| Statham|
| 190| Sales| Jeremy| Sage|
| 220| Sales| Jay| Sawyer|
+------+---------+----------+---------+
Создание запроса для получения людей с более высокой зарплатой, чем в среднем по их отделу:
w=Window().partitionBy("Dept_name")
df.withColumn("Average_Salary", F.avg("SALARY").over(w))\
.filter(F.col("SALARY")>F.col("Average_Salary"))\
.select("SALARY","Dept_name","First_name","Last_name")\
.show()
+------+---------+----------+---------+
|SALARY|Dept_name|First_name|Last_name|
+------+---------+----------+---------+
| 220| Sales| Jay| Sawyer|
| 200|Marketing| Jane| Smith|
+------+---------+----------+---------+