You can use the Bucketizer transformer from pyspark.ml.feature:
from pyspark.ml.feature import Bucketizer
from pyspark.sql import functions as f

df = sqlContext.createDataFrame([("ABC", 20, 35, 12),
                                 ("ABC", 36, 47, 25),
                                 ("CDE", 20, 27, 8),
                                 ("CDE", 28, 33, 13),
                                 ("CDE", 34, 42, 20),
                                 ("CDE", 43, 47, 22)],
                                ["UserID", "Start_KM", "End_KM", "Time_Taken(secs)"])

# Bucketizer needs a numeric (double) input column
df = df.withColumn("Time_Taken(secs)", f.col("Time_Taken(secs)").cast("double"))

# splits define half-open intervals: [-inf, 5), [5, 10), ..., [30, inf)
bucketizer = Bucketizer(splits=[-float("inf"), 5., 10., 15., 20., 25., 30., float("inf")],
                        inputCol="Time_Taken(secs)",
                        outputCol="Time_Taken(buckets)")

bucketed = bucketizer.transform(df)
bucketed.show()
+------+--------+------+----------------+-------------------+
|UserID|Start_KM|End_KM|Time_Taken(secs)|Time_Taken(buckets)|
+------+--------+------+----------------+-------------------+
| ABC| 20| 35| 12.0| 2.0|
| ABC| 36| 47| 25.0| 5.0|
| CDE| 20| 27| 8.0| 1.0|
| CDE| 28| 33| 13.0| 2.0|
| CDE| 34| 42| 20.0| 4.0|
| CDE| 43| 47| 22.0| 4.0|
+------+--------+------+----------------+-------------------+
Each bucket index corresponds to one of the half-open intervals defined by splits, e.g. 12.0 falls into [10, 15) and gets index 2.0. You can then use the resulting column for whatever calculation you need.
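For example, here is a minimal sketch of aggregating over the new column; the count and average aggregations are only illustrative assumptions, since it is not specified what exactly has to be computed per bucket:

per_bucket = (bucketed
              .groupBy("Time_Taken(buckets)")
              .agg(f.count("*").alias("num_segments"),
                   f.avg("Time_Taken(secs)").alias("avg_time_secs"))
              .orderBy("Time_Taken(buckets)"))
per_bucket.show()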