With window functions you can find each client's previous and next dates, and then keep only the rows whose gap to a neighbouring date is less than 24 hours and falls on the following calendar day. Rows with a larger gap are filtered out.
Data preparation
import org.apache.spark.sql.functions.to_timestamp
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(
  ("C1", "08-NOV-18 11.29.43"),
  ("C2", "09-NOV-18 13.29.43"),
  ("C2", "09-NOV-18 18.29.43"),
  ("C3", "11-NOV-18 19.29.43"),
  ("C1", "12-NOV-18 10.29.43"),
  ("C2", "13-NOV-18 09.29.43"),
  ("C4", "14-NOV-18 20.29.43"),
  ("C1", "15-NOV-18 11.29.43"),
  ("C5", "16-NOV-18 15.29.43"),
  ("C10", "17-NOV-18 19.29.43"),
  ("C1", "18-NOV-18 12.29.43"),
  ("C2", "18-NOV-18 10.29.43"),
  ("C2", "19-NOV-18 09.29.43"),
  ("C6", "20-NOV-18 13.29.43"),
  ("C6", "21-NOV-18 14.29.43"),
  ("C1", "21-NOV-18 18.29.43"),
  ("C1", "22-NOV-18 11.29.43")
).toDF("client", "dt")
  .withColumn("dt", to_timestamp($"dt", "dd-MMM-yy HH.mm.ss"))
Working code
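import java.util.concurrent.TimeUnit
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{datediff, lag, lead}
import org.apache.spark.sql.types.LongType

// Previous and next date for each client, ordered by time
val dateWindow = Window.partitionBy("client").orderBy("dt")
val withNextPrevDates = df
  .withColumn("previousDate", lag($"dt", 1).over(dateWindow))
  .withColumn("nextDate", lead($"dt", 1).over(dateWindow))

// True when the timestamps are less than 24 hours apart and fall on consecutive calendar days
val secondsInDay = TimeUnit.DAYS.toSeconds(1)
val dateDiffLessThanDay = (startTimeStamp: Column, endTimeStamp: Column) =>
  (endTimeStamp.cast(LongType) - startTimeStamp.cast(LongType) < secondsInDay) &&
    datediff(endTimeStamp, startTimeStamp) === 1

// Keep rows that form such a pair with the previous or the next row
val result = withNextPrevDates
  .where(dateDiffLessThanDay($"previousDate", $"dt") || dateDiffLessThanDay($"dt", $"nextDate"))
  .drop("previousDate", "nextDate")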
Result
+------+-------------------+
|client|dt                 |
+------+-------------------+
|C1    |2018-11-21 18:29:43|
|C1    |2018-11-22 11:29:43|
|C2    |2018-11-18 10:29:43|
|C2    |2018-11-19 09:29:43|
+------+-------------------+
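To sanity-check the rule without a Spark session, the same logic can be sketched in plain Scala with `java.time`. This is only an illustration, not part of the Spark solution: the object `ConsecutiveDayVisits` and the helpers `parse`, `nextDayWithin24h` and `filterVisits` are hypothetical names. It also assumes plain local timestamps, whereas Spark's `datediff` works in the session time zone, so daylight-saving shifts are ignored here.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatterBuilder
import java.time.temporal.ChronoUnit
import java.util.Locale

// Hypothetical helper object: a plain-Scala mirror of the Spark pipeline.
object ConsecutiveDayVisits {
  // Case-insensitive parsing so that "NOV" matches the MMM pattern.
  private val fmt = new DateTimeFormatterBuilder()
    .parseCaseInsensitive()
    .appendPattern("dd-MMM-yy HH.mm.ss")
    .toFormatter(Locale.ENGLISH)

  def parse(s: String): LocalDateTime = LocalDateTime.parse(s, fmt)

  // Same rule as dateDiffLessThanDay: under 24 hours apart and on the next calendar day.
  def nextDayWithin24h(start: LocalDateTime, end: LocalDateTime): Boolean =
    Duration.between(start, end).getSeconds < 24L * 60 * 60 &&
      ChronoUnit.DAYS.between(start.toLocalDate, end.toLocalDate) == 1

  // Per client: sort by time, then keep a row if it pairs with its previous or next row
  // (the role played by lag/lead over the window in the Spark code).
  def filterVisits(visits: Seq[(String, LocalDateTime)]): Seq[(String, LocalDateTime)] =
    visits.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (client, rows) =>
      val sorted = rows.map(_._2).sortWith(_.isBefore(_))
      sorted.indices.collect {
        case i if (i > 0 && nextDayWithin24h(sorted(i - 1), sorted(i))) ||
          (i < sorted.size - 1 && nextDayWithin24h(sorted(i), sorted(i + 1))) =>
          (client, sorted(i))
      }
    }

  // The sample data from the answer.
  val sample: Seq[(String, LocalDateTime)] = Seq(
    ("C1", "08-NOV-18 11.29.43"), ("C2", "09-NOV-18 13.29.43"),
    ("C2", "09-NOV-18 18.29.43"), ("C3", "11-NOV-18 19.29.43"),
    ("C1", "12-NOV-18 10.29.43"), ("C2", "13-NOV-18 09.29.43"),
    ("C4", "14-NOV-18 20.29.43"), ("C1", "15-NOV-18 11.29.43"),
    ("C5", "16-NOV-18 15.29.43"), ("C10", "17-NOV-18 19.29.43"),
    ("C1", "18-NOV-18 12.29.43"), ("C2", "18-NOV-18 10.29.43"),
    ("C2", "19-NOV-18 09.29.43"), ("C6", "20-NOV-18 13.29.43"),
    ("C6", "21-NOV-18 14.29.43"), ("C1", "21-NOV-18 18.29.43"),
    ("C1", "22-NOV-18 11.29.43")
  ).map { case (client, dt) => (client, parse(dt)) }

  def main(args: Array[String]): Unit =
    filterVisits(sample).foreach { case (client, dt) => println(s"$client $dt") }
}
```

On the sample data this keeps the same four rows as the Spark result. The two C6 rows are dropped because they are 25 hours apart, and the two C2 rows on 9 November are dropped because they fall on the same calendar day.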