To add to @user10954945's answer, here are the execution plans for both:
import pyspark

# Set up a SparkContext and wrap it in a SparkSession
sc = pyspark.SparkContext.getOrCreate()
spark = pyspark.sql.SparkSession(sc)

# Toy DataFrame with a single numeric column
df = spark.createDataFrame([(1,), (2,)], ['timeDiff'])

# Bracket-indexing syntax vs. the filter() method
filtered_1 = df[df["timeDiff"] <= 30]
filtered_2 = df.filter(df["timeDiff"] <= 30)
filtered_1.explain()
== Physical Plan ==
*(1) Filter (isnotnull(timeDiff#6L) && (timeDiff#6L <= 30))
+- Scan ExistingRDD[timeDiff#6L]
filtered_2.explain()
== Physical Plan ==
*(1) Filter (isnotnull(timeDiff#6L) && (timeDiff#6L <= 30))
+- Scan ExistingRDD[timeDiff#6L]
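For completeness, filter() also accepts a SQL expression string; a quick sketch on the same df (the variable name filtered_str is just chosen here to avoid clashing with the SQL example below), which should compile down to the identical plan:

filtered_str = df.filter("timeDiff <= 30")  # SQL-expression string variant
filtered_str.explain()                      # prints the same Filter + Scan ExistingRDD plan as above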
In fact, you get the same result when using the SQL API:
df.createOrReplaceTempView('df')
filtered_3 = spark.sql("SELECT * FROM df WHERE timeDiff <= 30")
filtered_3.explain()
== Physical Plan ==
*(1) Filter (isnotnull(timeDiff#6L) && (timeDiff#6L <= 30))
+- Scan ExistingRDD[timeDiff#6L]
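The same holds for a col()-based filter; a minimal sketch (the col import is the only extra piece assumed here):

from pyspark.sql.functions import col

filtered_4 = df.filter(col("timeDiff") <= 30)
filtered_4.explain()  # again the same Filter + Scan ExistingRDD plan

All of these variants go through Catalyst, which compiles them to the same physical plan, so there is no performance difference between them.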