How do I handle multiple date formats? Spark - Scala - PullRequest
2 votes
/ June 20, 2020

I have data in JSON format, for example:

....
{"Title":"51 Birch Street","US_Gross":84689,"Worldwide_Gross":84689,"US_DVD_Sales":null,"Production_Budget":350000,"Release_Date":"18-Oct-06","MPAA_Rating":"Not Rated","Running_Time_min":null,"Distributor":"Truly Indie","Source":null,"Major_Genre":null,"Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":97,"IMDB_Rating":7.4,"IMDB_Votes":439}
{"Title":"55 Days at Peking","US_Gross":10000000,"Worldwide_Gross":10000000,"US_DVD_Sales":null,"Production_Budget":17000000,"Release_Date":"1963-01-01","MPAA_Rating":null,"Running_Time_min":null,"Distributor":null,"Source":"Original Screenplay","Major_Genre":"Drama","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":57,"IMDB_Rating":6.8,"IMDB_Votes":2104}
{"Title":"Nine 1/2 Weeks","US_Gross":6734844,"Worldwide_Gross":6734844,"US_DVD_Sales":null,"Production_Budget":18000000,"Release_Date":"21-Feb-86","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"MGM","Source":"Based on Book/Short Story","Major_Genre":"Drama","Creative_Type":"Contemporary Fiction","Director":"Adrian Lyne","Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.4,"IMDB_Votes":12731}
{"Title":"AstÈrix aux Jeux Olympiques","US_Gross":999811,"Worldwide_Gross":132999811,"US_DVD_Sales":null,"Production_Budget":113500000,"Release_Date":"4-Jul-08","MPAA_Rating":"Not Rated","Running_Time_min":null,"Distributor":"Alliance","Source":"Based on Comic/Graphic Novel","Major_Genre":"Adventure","Creative_Type":"Fantasy","Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":4.9,"IMDB_Votes":5620}
{"Title":"The Abyss","US_Gross":54243125,"Worldwide_Gross":54243125,"US_DVD_Sales":null,"Production_Budget":70000000,"Release_Date":"9-Aug-89","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"20th Century Fox","Source":"Original Screenplay","Major_Genre":"Action","Creative_Type":"Science Fiction","Director":"James Cameron","Rotten_Tomatoes_Rating":88,"IMDB_Rating":7.6,"IMDB_Votes":51018}
{"Title":"Action Jackson","US_Gross":20257000,"Worldwide_Gross":20257000,"US_DVD_Sales":null,"Production_Budget":7000000,"Release_Date":"12-Feb-88","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"Lorimar Motion Pictures","Source":"Original Screenplay","Major_Genre":"Action","Creative_Type":"Contemporary Fiction","Director":null,"Rotten_Tomatoes_Rating":10,"IMDB_Rating":4.6,"IMDB_Votes":3856}
{"Title":"Ace Ventura: Pet Detective","US_Gross":72217396,"Worldwide_Gross":107217396,"US_DVD_Sales":null,"Production_Budget":12000000,"Release_Date":"4-Feb-94","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"Warner Bros.","Source":"Original Screenplay","Major_Genre":"Comedy","Creative_Type":"Contemporary Fiction","Director":"Tom Shadyac","Rotten_Tomatoes_Rating":49,"IMDB_Rating":6.6,"IMDB_Votes":63543}
{"Title":"Ace Ventura: When Nature Calls","US_Gross":108360063,"Worldwide_Gross":212400000,"US_DVD_Sales":null,"Production_Budget":30000000,"Release_Date":"10-Nov-95","MPAA_Rating":"PG-13","Running_Time_min":null,"Distributor":"Warner Bros.","Source":"Original Screenplay","Major_Genre":"Comedy","Creative_Type":"Contemporary Fiction","Director":"Steve Oedekerk","Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.6,"IMDB_Votes":51275}
{"Title":"April Fool's Day","US_Gross":12947763,"Worldwide_Gross":12947763,"US_DVD_Sales":null,"Production_Budget":5000000,"Release_Date":"27-Mar-86","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"Paramount Pictures","Source":"Original Screenplay","Major_Genre":"Horror","Creative_Type":"Contemporary Fiction","Director":null,"Rotten_Tomatoes_Rating":31,"IMDB_Rating":null,"IMDB_Votes":null}
{"Title":"Among Giants","US_Gross":64359,"Worldwide_Gross":64359,"US_DVD_Sales":null,"Production_Budget":4000000,"Release_Date":"26-Mar-99","MPAA_Rating":"R","Running_Time_min":null,"Distributor":"Fox Searchlight","Source":"Original Screenplay","Major_Genre":"Romantic Comedy","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":5.7,"IMDB_Votes":546}
{"Title":"Annie Get Your Gun","US_Gross":8000000,"Worldwide_Gross":8000000,"US_DVD_Sales":null,"Production_Budget":3768785,"Release_Date":"17-May-50","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"MGM","Source":"Based on Book/Short Story","Major_Genre":"Musical","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":100,"IMDB_Rating":7.1,"IMDB_Votes":1326}
{"Title":"Alice in Wonderland","US_Gross":0,"Worldwide_Gross":0,"US_DVD_Sales":null,"Production_Budget":3000000,"Release_Date":"28-Jul-51","MPAA_Rating":null,"Running_Time_min":null,"Distributor":"RKO Radio Pictures","Source":"Based on Book/Short Story","Major_Genre":"Musical","Creative_Type":null,"Director":null,"Rotten_Tomatoes_Rating":20,"IMDB_Rating":6.7,"IMDB_Votes":63458}
{"Title":"The Princess and the Cobbler","US_Gross":669276,"Worldwide_Gross":669276,"US_DVD_Sales":null,"Production_Budget":24000000,"Release_Date":"25-Aug-95","MPAA_Rating":"G","Running_Time_min":null,"Distributor":"Miramax","Source":"Original Screenplay","Major_Genre":"Adventure","Creative_Type":"Fantasy","Director":null,"Rotten_Tomatoes_Rating":null,"IMDB_Rating":7.3,"IMDB_Votes":893}
....

The "Release_Date" field contains several date formats, for example 26-Mar-99, 1963-01-01, or 4-Jul-08.

I have working code:

      val moviesDF = spark.read
        .option("inferSchema", "true")
        .json(s"${path}/movies.json")

       moviesDF.show(truncate = false)

      val moviesWithReleaseDates = moviesDF
        .select(col("Title"), to_date(col("Release_Date"), "dd-MMM-yy").as("Actual_Release")) // conversion
      moviesWithReleaseDates.show(truncate = false)

but the output is:

|Four Rooms                                |1995-12-25    |
|The Four Seasons                          |1981-05-22    |
|Four Weddings and a Funeral               |1994-03-09    |
|51 Birch Street                           |2006-10-18    |
|55 Days at Peking                         |null          |
|Nine 1/2 Weeks                            |1986-02-21    |
|AstÈrix aux Jeux Olympiques               |2008-07-04    |
|The Abyss                                 |1989-08-09    |
|Action Jackson                            |1988-02-12    |
|Ace Ventura: Pet Detective                |1994-02-04    |

When the date looks like "18-Oct-06", the conversion works fine, but when the format is different, the result is null.

How can I show all the dates without nulls in a simple and elegant way?

Thanks in advance.

Answers [ 3 ]

2 votes
/ June 21, 2020

Either way, you need a finite list of the date formats that occur in the file for Release_Date, or you have to maintain that list as you process the data.

You can write a UDF to parse the date string using the method below:

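import java.text.SimpleDateFormat
import scala.util.{Success, Try}

val formatStrings = Seq("dd-MMM-yy", "yyyy-MM-dd", "other-formats")

// SimpleDateFormat.parse throws on a mismatch instead of returning null,
// so wrap each attempt in Try and keep the first format that succeeds.
def tryParse(dateString: String): Option[java.util.Date] =
  formatStrings.iterator
    .map(fmt => Try(new SimpleDateFormat(fmt).parse(dateString)))
    .collectFirst { case Success(date) => date }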

or use coalesce:

coalesce(
to_date(col("Release_Date"), "dd-MMM-yy"),
to_date(col("Release_Date"), "yyyy-MM-dd"),
to_date(col("Release_Date"), "other-date-format")
).as("Actual_Release")

or

val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")

val newDF =  df.withColumn("Actual_Release", coalesce(dt_formats.map(fmt => to_date($"Release_Date", fmt)):_*))

2 votes
/ June 20, 2020

This happens because of to_date(col("Release_Date"), "dd-MMM-yy"). Here you specify the input date format, and the value is parsed correctly only if the date in the JSON matches that format. Otherwise the result is null.

So you need to read the date text from the JSON using every date format that can occur.

Write a UDF. Pass the date text in as input. Inside the UDF, try each possible date format and, if one matches, return the corresponding date object. A UDF is definitely useful here.
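
As a rough sketch of the parsing logic such a UDF could wrap, here is one way to try each known format in turn. It is not taken from the answer: the object name ReleaseDateParser is made up for illustration, it uses java.time rather than SimpleDateFormat, and it assumes English month abbreviations and that two-digit years fall between 1930 and 2029 (so "50" means 1950). Adjust both assumptions to your data.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}
import java.time.temporal.ChronoField
import java.util.Locale
import scala.util.{Success, Try}

// Hypothetical helper, named for illustration only.
object ReleaseDateParser {

  // Matches values such as "18-Oct-06" or "4-Jul-08".
  // Assumption: two-digit years are mapped into the range 1930..2029.
  private val dayMonthShortYear: DateTimeFormatter =
    new DateTimeFormatterBuilder()
      .parseCaseInsensitive()
      .appendPattern("d-MMM-")
      .appendValueReduced(ChronoField.YEAR, 2, 2, 1930)
      .toFormatter(Locale.ENGLISH)

  // Matches values such as "1963-01-01".
  private val isoDate: DateTimeFormatter = DateTimeFormatter.ISO_LOCAL_DATE

  // The finite list of formats expected in Release_Date; extend as needed.
  private val formatters: Seq[DateTimeFormatter] = Seq(dayMonthShortYear, isoDate)

  // Returns the date from the first formatter that accepts the input,
  // or None for null input or when no formatter matches.
  def parse(raw: String): Option[LocalDate] =
    Option(raw).flatMap { text =>
      formatters.iterator
        .map(formatter => Try(LocalDate.parse(text.trim, formatter)))
        .collectFirst { case Success(date) => date }
    }
}
```

In Spark this could presumably be wrapped as udf((s: String) => ReleaseDateParser.parse(s).map(d => java.sql.Date.valueOf(d))) and applied to col("Release_Date"), so that unparseable values come back as null. Where the formats are known up front, the built-in to_date with coalesce usually avoids the overhead of a UDF.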

1 vote
/ June 24, 2020

You can try something like this. I don't know whether it's elegant, but it's simple:

val mWRD = moviesDF.selectExpr("""Title""",
"""IF(LENGTH(Release_Date) <= 9,to_date(Release_Date,'dd-MMM-yy'),
to_date(Release_Date,'yyyy-MM-dd')) AS Actual_Release""")
mWRD.show(truncate = false)
...