Aggregating column names in a Spark DataFrame
1 vote
/ August 4, 2020

I have a DataFrame with the following structure:

root
 |-- very_hot: string (nullable = true)
 |-- hot: string (nullable = true)
 |-- cold: string (nullable = true)
 |-- little_snow: string (nullable = true)
 |-- medium_snow: string (nullable = true)
 |-- very_cold: string (nullable = true)
 |-- deep_snow: string (nullable = true)
 |-- freezing: string (nullable = true)
 |-- windy: string (nullable = true)

Each of these columns contains True or False. I want to create a new column that holds, for each row, an array of the names of the columns whose value is True. How can I do this?

EDIT: Here is the table I have:

+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
|very_hot|  hot| cold|little_snow|medium_snow|very_cold|deep_snow|freezing|windy|
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
|    True|False|False|      False|      False|    False|    False|   False| True|
|   False|False| True|       True|      False|    False|    False|   False|False|
|   False|False| True|      False|       True|    False|    False|   False|False|
|   False|False|False|      False|      False|     True|     True|   False|False|
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+

The column I need should look like this:

+--------------------+
|            features|
+--------------------+
|     very_hot, windy|
|   cold, little_snow|
|   cold, medium_snow|
|very_cold, deep_snow|
+--------------------+

Answers [ 4 ]

0 votes
/ August 5, 2020

Another option:

    df2.show(false)
    df2.printSchema()
    /**
      * +--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
      * |very_hot|hot  |cold |little_snow|medium_snow|very_cold|deep_snow|freezing|windy|
      * +--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
      * |True    |False|False|False      |False      |False    |False    |False   |True |
      * |False   |False|True |True       |False      |False    |False    |False   |False|
      * |False   |False|True |False      |True       |False    |False    |False   |False|
      * |False   |False|False|False      |False      |True     |True     |False   |False|
      * +--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
      *
      * root
      * |-- very_hot: string (nullable = true)
      * |-- hot: string (nullable = true)
      * |-- cold: string (nullable = true)
      * |-- little_snow: string (nullable = true)
      * |-- medium_snow: string (nullable = true)
      * |-- very_cold: string (nullable = true)
      * |-- deep_snow: string (nullable = true)
      * |-- freezing: string (nullable = true)
      * |-- windy: string (nullable = true)
      */

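    // Wrap every column in a struct (name, value), keep the structs whose value is 'True',
    // then project the names. FILTER and TRANSFORM are SQL higher-order functions (Spark 2.4+).
    val columns = df2.columns.map(c => s"named_struct('name', '$c', 'value', `$c`)").mkString(", ")
    df2.selectExpr(s"TRANSFORM(FILTER(array($columns), x -> x.value='True'), x -> x.name) as features")
      .show(false)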
    /**
      * +----------------------+
      * |features              |
      * +----------------------+
      * |[very_hot, windy]     |
      * |[cold, little_snow]   |
      * |[cold, medium_snow]   |
      * |[very_cold, deep_snow]|
      * +----------------------+
      */
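
The same idea can also be sketched with the DataFrame API instead of one long SQL string. This is an illustration under assumptions rather than part of the answer above: the `sample` data, the intermediate `candidates` column, and the local SparkSession are hypothetical, and the `filter` lambda assumes Spark 2.4 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, expr, lit, when}

// Hypothetical local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("features-array-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample with the same "True"/"False" string encoding as in the question.
val sample = Seq(
  ("True", "False", "True"),
  ("False", "True", "False")
).toDF("very_hot", "cold", "windy")

// One array entry per column: the column name when the value is "True", otherwise null.
val candidates = array(sample.columns.map(c => when(col(c) === "True", lit(c))): _*)

// Drop the null entries so that only the names of the True columns remain.
val result = sample
  .withColumn("candidates", candidates)
  .select(expr("filter(candidates, x -> x is not null)").as("features"))

val rows = result.collect().map(_.getSeq[String](0).toList).toList
```

The result is an array column, as requested in the question, rather than a comma-separated string.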

0 votes
/ August 5, 2020

This code may be useful to you:

import org.apache.spark.sql.functions._
// toDF needs spark.implicits._ (imported automatically in spark-shell)
val df = Seq(
  ("True", "False", "False", "False", "False", "False", "False", "False", "True"),
  ("False", "False", "True", "True", "False", "False", "False", "False", "False"),
  ("False", "False", "True", "False", "True", "False", "False", "False", "False"),
  ("False", "False", "False", "False", "False", "True", "True", "False", "False")
).toDF("very_hot", "hot", "cold", "little_snow", "medium_snow", "very_cold", "deep_snow", "freezing", "windy")

df.show()

/*
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
|very_hot|  hot| cold|little_snow|medium_snow|very_cold|deep_snow|freezing|windy|
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
|    True|False|False|      False|      False|    False|    False|   False| True|
|   False|False| True|       True|      False|    False|    False|   False|False|
|   False|False| True|      False|       True|    False|    False|   False|False|
|   False|False|False|      False|      False|     True|     True|   False|False|
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+
*/

// when(...) without otherwise(...) yields null for non-matching rows,
// and concat_ws skips nulls, so only the names of the True columns are joined.
val df1 = df.withColumn("features", concat_ws(",",
  when(col("very_hot").contains("True"), "very_hot"),
  when(col("hot").contains("True"), "hot"),
  when(col("cold").contains("True"), "cold"),
  when(col("little_snow").contains("True"), "little_snow"),
  when(col("medium_snow").contains("True"), "medium_snow"),
  when(col("very_cold").contains("True"), "very_cold"),
  when(col("deep_snow").contains("True"), "deep_snow"),
  when(col("freezing").contains("True"), "freezing"),
  when(col("windy").contains("True"), "windy")
)).select("features") // keep only the new column instead of dropping the others one by one

df1.show()
/*
+-------------------+
|           features|
+-------------------+
|     very_hot,windy|
|   cold,little_snow|
|   cold,medium_snow|
|very_cold,deep_snow|
+-------------------+
*/

0 votes
/ August 5, 2020

Try this.

val df2 = df.withColumn("feature", concat_ws(", ", df.columns.map(c => when(col(c)===lit("True"), c)): _*))
df2.show(false)

+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+--------------------+
|very_hot|hot  |cold |little_snow|medium_snow|very_cold|deep_snow|freezing|windy|feature             |
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+--------------------+
|true    |false|false|false      |false      |false    |false    |false   |true |very_hot, windy     |
|false   |false|true |true       |false      |false    |false    |false   |false|cold, little_snow   |
|false   |false|true |false      |true       |false    |false    |false   |false|cold, medium_snow   |
|false   |false|false|false      |false      |true     |true     |false   |false|very_cold, deep_snow|
+--------+-----+-----+-----------+-----------+---------+---------+--------+-----+--------------------+


df2.drop(df.columns: _*).show(false)

+--------------------+
|feature             |
+--------------------+
|very_hot, windy     |
|cold, little_snow   |
|cold, medium_snow   |
|very_cold, deep_snow|
+--------------------+
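
The one-liner above relies on two behaviours: `when` without `otherwise` yields null when the condition is false, and `concat_ws` skips null arguments. A minimal sketch of that mechanism follows; the two-column `demo` DataFrame and the local SparkSession are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, when}

// Hypothetical local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("concat-ws-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: the first row has one True column, the second has none.
val demo = Seq(("True", "False"), ("False", "False")).toDF("a", "b")

// Each when(...) is either the column name or null; concat_ws joins the non-null ones.
val joined = demo.select(
  concat_ws(", ", demo.columns.map(c => when(col(c) === "True", c)): _*).as("feature")
)

val values = joined.collect().map(_.getString(0)).toList
```

A row with no True column ends up as an empty string rather than null, which may matter if the result is filtered or split afterwards.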

0 votes
/ August 4, 2020

This Scala code

val data = Seq((true, true, false), (true, false, true), (false, true, true))
val df = data.toDF("first", "second", "third")
val names = df.schema.map(_.name).zipWithIndex
df.rdd
  .map(r => names
    .filter(n => r.getBoolean(n._2))
    .map(_._1)
    .mkString(",")
  ).toDF("feature").show

produces

+------------+
|     feature|
+------------+
|first,second|
| first,third|
|second,third|
+------------+
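
Note that `getBoolean` assumes columns of Boolean type, whereas the schema in the question declares them as strings. A possible adaptation for string columns is sketched below; the `sample` data and the local SparkSession are hypothetical, and the comparison assumes the values are spelled exactly "True".

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("rdd-string-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample using the string encoding from the question.
val sample = Seq(
  ("True", "False", "True"),
  ("False", "True", "True")
).toDF("very_hot", "cold", "windy")

// Pair every column name with its position in the row.
val names = sample.columns.zipWithIndex

// Keep the names whose string value is "True"; a null value simply does not match.
val features = sample.rdd
  .map(r => names.filter { case (_, i) => r.getString(i) == "True" }.map(_._1).mkString(", "))
  .toDF("features")

val collected = features.collect().map(_.getString(0)).toList
```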
...