You can use dropDuplicates().
Sample data in a DataFrame:
>>> cols = ['ID', 'Date']
>>> vals = [
('213412', '2008-10-26T06:04:00.000Z'),
('213412', '2008-10-26T06:04:00.000Z'),
('393859 ', '2018-10-26T09:17:00.000Z'),
]
>>> # Create the DataFrame
>>> df = spark.createDataFrame(vals, cols)
>>> df.show(3, False)
+--------+------------------------+
|ID |Date |
+--------+------------------------+
|213412 |2008-10-26T06:04:00.000Z|
|213412 |2008-10-26T06:04:00.000Z|
|393859 |2018-10-26T09:17:00.000Z|
+--------+------------------------+
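To confirm which IDs actually occur more than once before dropping anything, here is a quick sketch against the same df (groupBy and count are standard DataFrame methods; the filter uses a SQL expression string):
>>> df.groupBy("ID").count().filter("count > 1").show()
+------+-----+
|    ID|count|
+------+-----+
|213412|    2|
+------+-----+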
Use dropDuplicates():
# You can simply call df.dropDuplicates() to deduplicate entire rows; by
# specifying the column ("ID") you tell Spark to drop duplicates based on
# that column only. See the sketch after the output below.
>>> df_dist = df.dropDuplicates(["ID"])
>>> df_dist.show(2, False)
+--------+------------------------+
|ID |Date |
+--------+------------------------+
|213412 |2008-10-26T06:04:00.000Z|
|393859 |2018-10-26T09:17:00.000Z|
+--------+------------------------+
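On this sample data the no-argument form produces the same result, because the duplicated rows match in every column: dropDuplicates() with no subset (equivalent to df.distinct()) compares entire rows. A minimal sketch with the same df:
>>> df.dropDuplicates().show(3, False)
+--------+------------------------+
|ID      |Date                    |
+--------+------------------------+
|213412  |2008-10-26T06:04:00.000Z|
|393859  |2018-10-26T09:17:00.000Z|
+--------+------------------------+
Note that when deduplicating on a subset such as ["ID"], which of the duplicate rows survives is not guaranteed; if you need a specific one (for example the latest Date per ID), use a window function like row_number() partitioned by ID instead.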
For more information, see the PySpark documentation for DataFrame.dropDuplicates().