Here is a complete solution for both points.
First problem: parsing the date.
date_format takes a date column and formats it into whatever pattern you like. Here, however, Last Updated is a string column, and converting a string to a date requires to_date. In the code below I parse the string into a date.
data = sqlContext.createDataFrame([
["Photo Editor & Ca...", " January 7, 2018"],
[" Coloring book moana", " January 15, 2018"],
["U Launcher Lite –...", " August 1, 2018"],
["ketch - Draw & P...", " June 8, 2018"],
["Pixel Draw - Numb...", " June 20, 2018"],
["Paper flowers ins...", " March 26, 2017"],
["moke Effect Phot...", " April 26, 2018"],
[" Infinite Painter", " June 14, 2018"],
["Garden Coloring Book", "September 20, 2017"],
["Kids Paint Free -...", " July 3, 2018"],
["Text on Photo - F...", " October 27, 2017"],
["Name Art Photo Ed...", " July 31, 2018"],
["Tattoo Name On My...", " April 2, 2018"],
["Mandala Coloring ...", " June 26, 2018"],
["3D Color Pixel by...", " August 3, 2018"],
["Learn To Draw Kaw...", " June 6, 2018"]
], ["app", "Last Updated"])
from pyspark.sql import functions as F
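parsed_date_data = data.withColumn(
    "date",
    F.to_date(
        F.trim(F.col("Last Updated")),  # strip the stray leading spaces
        "MMMM d, yyyy"  # single "d" also matches one-digit days (Spark 3 is strict about "dd")
    )
)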
parsed_date_data.show()
+--------------------+------------------+----------+
| app| Last Updated| date|
+--------------------+------------------+----------+
|Photo Editor & Ca...| January 7, 2018|2018-01-07|
| Coloring book moana| January 15, 2018|2018-01-15|
|U Launcher Lite –...| August 1, 2018|2018-08-01|
| ketch - Draw & P...| June 8, 2018|2018-06-08|
|Pixel Draw - Numb...| June 20, 2018|2018-06-20|
|Paper flowers ins...| March 26, 2017|2017-03-26|
| moke Effect Phot...| April 26, 2018|2018-04-26|
| Infinite Painter| June 14, 2018|2018-06-14|
|Garden Coloring Book|September 20, 2017|2017-09-20|
|Kids Paint Free -...| July 3, 2018|2018-07-03|
|Text on Photo - F...| October 27, 2017|2017-10-27|
|Name Art Photo Ed...| July 31, 2018|2018-07-31|
|Tattoo Name On My...| April 2, 2018|2018-04-02|
|Mandala Coloring ...| June 26, 2018|2018-06-26|
|3D Color Pixel by...| August 3, 2018|2018-08-03|
|Learn To Draw Kaw...| June 6, 2018|2018-06-06|
+--------------------+------------------+----------+
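As a rough analogy outside Spark, the same split between parsing and formatting exists in plain Python's datetime module: strptime turns a string into a date, and strftime turns a date back into a string. The sketch below is not PySpark code. The helper name parse_last_updated is made up for illustration, strptime uses %-style directives rather than Spark's pattern letters, and %B assumes an English (default C) locale.

```python
from datetime import date, datetime


def parse_last_updated(raw: str) -> date:
    """Parse a 'Month D, YYYY' string into a date (hypothetical helper)."""
    # strip() plays the role of F.trim: the sample data has leading spaces.
    return datetime.strptime(raw.strip(), "%B %d, %Y").date()


parsed = parse_last_updated(" January 7, 2018")
print(parsed.isoformat())           # ISO form, like the date column above
print(parsed.strftime("%d/%m/%Y"))  # formatting needs a real date, not a string
```

Calling strftime on the raw string would fail, which mirrors why date_format alone is not enough for a string column.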
Second problem: applying a filter to the DataFrame. Any of the following forms works:
parsed_date_data.where("date = '2018-01-07'").show()
+--------------------+------------------+----------+
| app| Last Updated| date|
+--------------------+------------------+----------+
|Photo Editor & Ca...| January 7, 2018|2018-01-07|
+--------------------+------------------+----------+
parsed_date_data.filter("date = '2018-01-07'").show()
+--------------------+------------------+----------+
| app| Last Updated| date|
+--------------------+------------------+----------+
|Photo Editor & Ca...| January 7, 2018|2018-01-07|
+--------------------+------------------+----------+
parsed_date_data.where(F.col("date") == '2018-01-07').show()
+--------------------+------------------+----------+
| app| Last Updated| date|
+--------------------+------------------+----------+
|Photo Editor & Ca...| January 7, 2018|2018-01-07|
+--------------------+------------------+----------+
parsed_date_data.filter(F.col("date") == '2018-01-07').show()
+--------------------+------------------+----------+
| app| Last Updated| date|
+--------------------+------------------+----------+
|Photo Editor & Ca...| January 7, 2018|2018-01-07|
+--------------------+------------------+----------+
parsed_date_data.filter(parsed_date_data.date == '2018-01-07').show()
+--------------------+------------------+----------+
| app| Last Updated| date|
+--------------------+------------------+----------+
|Photo Editor & Ca...| January 7, 2018|2018-01-07|
+--------------------+------------------+----------+
parsed_date_data.where(parsed_date_data.date == '2018-01-07').show()
+--------------------+------------------+----------+
| app| Last Updated| date|
+--------------------+------------------+----------+
|Photo Editor & Ca...| January 7, 2018|2018-01-07|
+--------------------+------------------+----------+
parsed_date_data.where(parsed_date_data.date.isin('2018-01-07')).show()
+--------------------+------------------+----------+
| app| Last Updated| date|
+--------------------+------------------+----------+
|Photo Editor & Ca...| January 7, 2018|2018-01-07|
+--------------------+------------------+----------+
parsed_date_data.filter(parsed_date_data.date.isin('2018-01-07')).show()
+--------------------+------------------+----------+
| app| Last Updated| date|
+--------------------+------------------+----------+
|Photo Editor & Ca...| January 7, 2018|2018-01-07|
+--------------------+------------------+----------+
You can also apply other kinds of filters, for example on the month:
parsed_date_data.filter(F.month(parsed_date_data.date) == 8).show()  # F.month returns an integer
+--------------------+------------------+----------+
| app| Last Updated| date|
+--------------------+------------------+----------+
|U Launcher Lite –...| August 1, 2018|2018-08-01|
|3D Color Pixel by...| August 3, 2018|2018-08-03|
+--------------------+------------------+----------+
See the full API reference to learn more about the pyspark functions.