Is there a way in Scala Spark to transform this dataframe into that one?
0 votes
/ 23 January 2020

Transform this dataframe:

Col A || date1      || Value1 || Value2   || Date2
  11  || 2002-08-14 || 44.234 || 485.5975 || 2002-05-30
  11  || 2003-02-14 || 52.699 || 485.5975 || 2002-05-30
  11  || 2003-05-15 || 32.484 || 485.5975 || 2002-05-30
  11  || 2003-08-14 || 39.797 || 817.2205 || 2003-05-30
  11  || 2004-02-14 || 36.114 || 817.2205 || 2003-05-30
  11  || 2004-05-15 || 41.137 || 817.2205 || 2003-05-30

into the following:

Col A || date1      || Value1 || Value2   || Date2      || Required
  11  || 2002-08-14 || 44.234 || 485.5975 || 2002-05-30 || 44.234+485.5975
  11  || 2003-02-14 || 52.699 || 485.5975 || 2002-05-30 || 52.699+44.234+485.5975
  11  || 2003-05-15 || 32.484 || 485.5975 || 2002-05-30 || 32.484+52.699+44.234+485.5975
  11  || 2003-08-14 || 39.797 || 817.2205 || 2003-05-30 || 39.797+817.2205
  11  || 2004-02-14 || 36.114 || 817.2205 || 2003-05-30 || 36.114+39.797+817.2205
  11  || 2004-05-15 || 41.137 || 817.2205 || 2003-05-30 || 41.137+36.114+39.797+817.2205

Answers [2]

2 votes
/ 23 January 2020

Your grouping logic is not entirely clear, but you can adjust the grouping below if needed. I assumed Value2 is the grouping candidate for this sample dataset.

Here is sample code that produces the result; if you want to sum the values instead, change the aggregation accordingly.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val inDF = spark.read.option("header", "true").csv("one.csv")

// Number the rows within each Value2 group; ordering by date1 keeps the
// numbering deterministic (Date2 is constant within a group, so ordering
// by it alone would make row_number arbitrary).
val rowNumWindow = Window.partitionBy(col("Value2")).orderBy("date1")

// Accumulate Value1 from the highest idx (latest date1) downwards.
val w = Window.partitionBy(col("Value2")).orderBy(desc("idx"))

val outDF = inDF.
  withColumn("idx", row_number() over rowNumWindow).
  withColumn("item_list", collect_list(col("Value1")) over w).
  withColumn("Required", concat_ws("+", col("item_list"), col("Value2")))



outDF.select("ColA","date1","Value1","Value2","Date2","Required").show(10,false)
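The accumulation works because an ordered window defaults to a running frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), so collect_list grows row by row. Since idx values from row_number are unique, spelling the frame out with a rows clause is an equivalent sketch (wExplicit is an illustrative name):

// Equivalent explicit frame; with unique idx values a ROWS frame
// behaves the same as the default RANGE frame.
val wExplicit = Window.partitionBy(col("Value2")).
  orderBy(desc("idx")).
  rowsBetween(Window.unboundedPreceding, Window.currentRow)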


The input schema (all columns are read as strings):

root
 |-- ColA: string (nullable = true)
 |-- date1: string (nullable = true)
 |-- Value1: string (nullable = true)
 |-- Value2: string (nullable = true)
 |-- Date2: string (nullable = true)

The input data:

+----+----------+------+--------+----------+
|ColA|date1     |Value1|Value2  |Date2     |
+----+----------+------+--------+----------+
|11  |2002-08-14|44.234|485.5975|2002-05-30|
|11  |2003-02-14|52.699|485.5975|2002-05-30|
|11  |2003-05-15|32.484|485.5975|2002-05-30|
|11  |2003-08-14|39.797|817.2205|2003-05-30|
|11  |2004-02-14|36.114|817.2205|2003-05-30|
|11  |2004-05-15|41.137|817.2205|2003-05-30|
+----+----------+------+--------+----------+

The output, shown the same way as in the question:

+----+----------+------+--------+----------+-----------------------------+
|ColA|date1     |Value1|Value2  |Date2     |Required                     |
+----+----------+------+--------+----------+-----------------------------+
|11  |2004-05-15|41.137|817.2205|2003-05-30|41.137+817.2205              |
|11  |2004-02-14|36.114|817.2205|2003-05-30|41.137+36.114+817.2205       |
|11  |2003-08-14|39.797|817.2205|2003-05-30|41.137+36.114+39.797+817.2205|
|11  |2003-05-15|32.484|485.5975|2002-05-30|32.484+485.5975              |
|11  |2003-02-14|52.699|485.5975|2002-05-30|32.484+52.699+485.5975       |
|11  |2002-08-14|44.234|485.5975|2002-05-30|32.484+52.699+44.234+485.5975|
+----+----------+------+--------+----------+-----------------------------+
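Note that this builds the list from the latest date1 backwards. If you want the accumulation to start from the earliest date1, exactly as in the question's expected output, here is a minimal sketch (assuming Spark 2.4+, where reverse works on array columns; wAsc and outAscDF are illustrative names):

val wAsc = Window.partitionBy(col("Value2")).orderBy("date1")

val outAscDF = inDF.
  withColumn("item_list", collect_list(col("Value1")) over wAsc).
  // reverse puts the newest value first, then Value2 is appended
  withColumn("Required", concat_ws("+", reverse(col("item_list")), col("Value2")))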

If you want the actual sum, you can do this:

val outDF = inDF.
  withColumn("idx", row_number() over rowNumWindow).
  withColumn("item_list", sum(col("Value1")) over w).       // running sum instead of a list
  withColumn("Required", col("item_list") + col("Value2"))  // string columns are implicitly cast to double

outDF.select("ColA","date1","Value1","Value2","Date2","Required").show(10,false)

Output:

+----+----------+------+--------+----------+-----------------+
|ColA|date1     |Value1|Value2  |Date2     |Required         |
+----+----------+------+--------+----------+-----------------+
|11  |2004-05-15|41.137|817.2205|2003-05-30|858.3575000000001|
|11  |2004-02-14|36.114|817.2205|2003-05-30|894.4715         |
|11  |2003-08-14|39.797|817.2205|2003-05-30|934.2685         |
|11  |2003-05-15|32.484|485.5975|2002-05-30|518.0815         |
|11  |2003-02-14|52.699|485.5975|2002-05-30|570.7805000000001|
|11  |2002-08-14|44.234|485.5975|2002-05-30|615.0145         |
+----+----------+------+--------+----------+-----------------+
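The trailing digits in values such as 858.3575000000001 are ordinary double-precision noise from the implicit string-to-double cast. A sketch that casts the result to a fixed-scale decimal instead (outExactDF is an illustrative name):

val outExactDF = outDF.withColumn("Required",
  (col("item_list") + col("Value2")).cast("decimal(14,4)"))

outExactDF.select("ColA","date1","Required").show(10,false)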
1 vote
/ 23 January 2020
scala> import org.apache.spark.sql.expressions.Window

scala> import org.apache.spark.sql.functions._

scala> df.show(false)
+-----+----------+------+--------+----------+
|Col A|date1     |Value1|Value2  |Date2     |
+-----+----------+------+--------+----------+
|11   |2002-08-14|44.234|485.5975|2002-05-30|
|11   |2003-02-14|52.699|485.5975|2002-05-30|
|11   |2003-05-15|32.484|485.5975|2002-05-30|
|11   |2003-08-14|39.797|817.2205|2003-05-30|
|11   |2004-02-14|36.114|817.2205|2003-05-30|
|11   |2004-05-15|41.137|817.2205|2003-05-30|
+-----+----------+------+--------+----------+

scala> val W = Window.partitionBy("Col A", "Date2").orderBy("date1").rowsBetween(Window.unboundedPreceding, Window.currentRow) // order by date1 so the running frame is deterministic

scala> val df1 = df.withColumn("sumListValue1", sum(col("value1")).over(W).cast("Decimal(14,2)"))

scala> df1.show()
+-----+----------+------+--------+----------+-------------+
|Col A|     date1|Value1|  Value2|     Date2|sumListValue1|
+-----+----------+------+--------+----------+-------------+
|   11|2003-08-14|39.797|817.2205|2003-05-30|        39.80|
|   11|2004-02-14|36.114|817.2205|2003-05-30|        75.91|
|   11|2004-05-15|41.137|817.2205|2003-05-30|       117.05|
|   11|2002-08-14|44.234|485.5975|2002-05-30|        44.23|
|   11|2003-02-14|52.699|485.5975|2002-05-30|        96.93|
|   11|2003-05-15|32.484|485.5975|2002-05-30|       129.42|
+-----+----------+------+--------+----------+-------------+


scala> df1.withColumn("Required", col("sumListValue1") + col("Value2")).drop("sumListValue1").show()
+-----+----------+------+--------+----------+--------+
|Col A|     date1|Value1|  Value2|     Date2|Required|
+-----+----------+------+--------+----------+--------+
|   11|2003-08-14|39.797|817.2205|2003-05-30|857.0205|
|   11|2004-02-14|36.114|817.2205|2003-05-30|893.1305|
|   11|2004-05-15|41.137|817.2205|2003-05-30|934.2705|
|   11|2002-08-14|44.234|485.5975|2002-05-30|529.8275|
|   11|2003-02-14|52.699|485.5975|2002-05-30|582.5275|
|   11|2003-05-15|32.484|485.5975|2002-05-30|615.0175|
+-----+----------+------+--------+----------+--------+
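To get the concatenated string form from the question instead of the numeric sum, the same window W works with collect_list; a sketch in the same session (assuming Spark 2.4+ for reverse on arrays; items is an illustrative column name):

scala> val dfList = df.withColumn("items", collect_list(col("Value1")).over(W))

scala> dfList.withColumn("Required", concat_ws("+", reverse(col("items")), col("Value2"))).drop("items").show(false)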