Yes, that is the case. For example:
import org.apache.spark.sql.functions.col
import spark.implicits._
val df = List(
("1001", "[physics, chemistry]", "pass"),
("1001", "[biology, math]", "fail"),
("3002", "[economics]", "pass"),
("2002", "[physics, chemistry]", "fail")
).toDF("student_id", "subjects", "result")
df.filter(col("student_id").startsWith("3")).show
returns:
+----------+-----------+------+
|student_id| subjects|result|
+----------+-----------+------+
| 3002|[economics]| pass|
+----------+-----------+------+
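The same prefix filter can be written in a couple of equivalent ways; a minimal sketch, assuming the same `df` as above is in scope:

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._

// $-interpolator syntax, equivalent to col("student_id").startsWith("3")
df.filter($"student_id".startsWith("3")).show

// SQL-style prefix match via Column.like
df.filter(col("student_id").like("3%")).show
```

Both forms produce the single "3002" row shown above; which to use is purely a matter of style.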
For input data derived from JSON. It matters little here, but the example uses a DF rather than a DS (it works for a DS as well); the only minor difference is how the field inside the struct is addressed:
import org.apache.spark.sql.functions._
val df = spark.read.json("/FileStore/tables/json_nested_4.txt")
val flattened = df.select($"name", explode($"schools").as("schools_flat"))
flattened.filter(col("name").startsWith("J")).show
flattened.filter(col("schools_flat.sname").startsWith("u")).show
Base input and schema:
+-------+----------------+
| name| schools_flat|
+-------+----------------+
|Michael|[stanford, 2010]|
|Michael|[berkeley, 2012]|
| Andy| [ucsb, 2011]|
| Justin|[berkeley, 2014]|
+-------+----------------+
flattened: org.apache.spark.sql.DataFrame = [name: string, schools_flat: struct<sname: string, year: bigint>]
returns:
+------+----------------+
| name| schools_flat|
+------+----------------+
|Justin|[berkeley, 2014]|
+------+----------------+
+----+------------+
|name|schools_flat|
+----+------------+
|Andy|[ucsb, 2011]|
+----+------------+
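Note that `startsWith` is case-sensitive, so `startsWith("u")` matches "ucsb" but would not match "UCSB". A minimal sketch of a case-insensitive variant, assuming the `flattened` DataFrame from above, wraps the nested field with `lower` first:

```scala
import org.apache.spark.sql.functions.{col, lower}

// Lower-case the struct field before the prefix test,
// so "ucsb", "UCSB", "Ucsb" all match.
flattened
  .filter(lower(col("schools_flat.sname")).startsWith("u"))
  .show
```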