It works fine for me -
By default, an empty string ("") is treated as null when reading a CSV.
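If you need a different mapping, the read-side CSV options can usually be tuned. A minimal sketch, assuming Spark 2.4+ (where the reader supports both nullValue and emptyValue) and the same spark session as in the snippets below; the \N token and the path are just placeholders for this example:
// Assumption: Spark 2.4+; "\N" and the path are example values, not Spark defaults.
Dataset<Row> tuned = spark.read()
        .option("header", true)
        .option("sep", ",")
        .option("nullValue", "\\N")   // only the literal \N is read back as null
        .option("emptyValue", "")     // "" stays an empty string instead of becoming null
        .csv("path/to/data.csv");     // hypothetical path
tuned.show();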
Test 1. Dataset with a null value & schema nullable = false
String data = "id Col_1 Col_2 Col_3 Col_4 Col_5\n" +
"1 A B C D E\n" +
"2 X Y Z P \"\"";
// turn the whitespace-separated test rows into comma-separated CSV lines
List<String> list = Arrays.stream(data.split("\n"))
.map(s -> s.replaceAll("\\s+", ","))
.collect(Collectors.toList());
List<StructField> fields = Arrays.stream("id Col_1 Col_2 Col_3 Col_4 Col_5".split("\\s+"))
.map(s -> new StructField(s, DataTypes.StringType, false, Metadata.empty())) // nullable = false
.collect(Collectors.toList());
Dataset<Row> df1 = spark.read()
.schema(new StructType(fields.toArray(new StructField[fields.size()])))
.option("header", true)
.option("sep", ",")
.csv(spark.createDataset(list, Encoders.STRING()));
df1.show();
df1.printSchema();
Output -
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow_0_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3387)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3384)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
Conclusion - expected behavior, PASS: the schema declares every column non-nullable, so when the empty string in Col_5 is parsed as null, converting the row fails with a NullPointerException.
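A quick way to locate the offending values, as a hedged sketch (it reuses list and fields from the snippet above and assumes org.apache.spark.sql.functions is imported): relax every field to nullable = true and count the nulls per column before applying the strict schema.
// Sketch: relax the schema to nullable = true, then report nulls per column.
Dataset<Row> probe = spark.read()
        .schema(new StructType(fields.stream()
                .map(f -> new StructField(f.name(), f.dataType(), true, f.metadata()))
                .toArray(StructField[]::new)))
        .option("header", true)
        .option("sep", ",")
        .csv(spark.createDataset(list, Encoders.STRING()));
for (String c : probe.columns()) {
    long nulls = probe.filter(functions.col(c).isNull()).count();
    System.out.println(c + ": " + nulls + " null value(s)");  // only Col_5 should report 1
}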
Test 2. Dataset with a null value & schema nullable = true
String data = "id Col_1 Col_2 Col_3 Col_4 Col_5\n" +
"1 A B C D E\n" +
"2 X Y Z P \"\"";
List<String> list = Arrays.stream(data.split("\n"))
.map(s -> s.replaceAll("\\s+", ","))
.collect(Collectors.toList());
List<StructField> fields = Arrays.stream("id Col_1 Col_2 Col_3 Col_4 Col_5".split("\\s+"))
.map(s -> new StructField(s, DataTypes.StringType, true, Metadata.empty())) // nullable = true
.collect(Collectors.toList());
Dataset<Row> df1 = spark.read()
.schema(new StructType(fields.toArray(new StructField[fields.size()])))
.option("header", true)
.option("sep", ",")
.csv(spark.createDataset(list, Encoders.STRING()));
df1.show();
df1.printSchema();
Output -
+---+-----+-----+-----+-----+-----+
| id|Col_1|Col_2|Col_3|Col_4|Col_5|
+---+-----+-----+-----+-----+-----+
| 1| A| B| C| D| E|
| 2| X| Y| Z| P| null|
+---+-----+-----+-----+-----+-----+
root
|-- id: string (nullable = true)
|-- Col_1: string (nullable = true)
|-- Col_2: string (nullable = true)
|-- Col_3: string (nullable = true)
|-- Col_4: string (nullable = true)
|-- Col_5: string (nullable = true)
Conclusion - expected behavior, PASS: the empty string in Col_5 is read as null and the nullable schema accepts it.
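If the actual requirement is "no nulls allowed", an alternative to a nullable = false schema is to read with a nullable schema (as in this test) and enforce the constraint afterwards. A minimal sketch using df1 from Test 2:
// Sketch: keep only complete rows, or fail fast if anything was null.
Dataset<Row> clean = df1.na().drop();  // drops rows that contain a null in any column
if (clean.count() != df1.count()) {
    throw new IllegalStateException("input contains null values");
}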
Test 3. Dataset without nulls & schema nullable = true
String data1 = "id Col_1 Col_2 Col_3 Col_4 Col_5\n" +
"1 A B C D E\n" +
"2 X Y Z P F";
List<String> list1 = Arrays.stream(data1.split("\n"))
.map(s -> s.replaceAll("\\s+", ","))
.collect(Collectors.toList());
List<StructField> fields1 = Arrays.stream("id Col_1 Col_2 Col_3 Col_4 Col_5".split("\\s+"))
.map(s -> new StructField(s, DataTypes.StringType, true, Metadata.empty()))
.collect(Collectors.toList());
Dataset<Row> df2 = spark.read()
.schema(new StructType(fields1.toArray(new StructField[fields1.size()])))
.option("header", true)
.option("sep", ",")
.csv(spark.createDataset(list1, Encoders.STRING()));
df2.show();
df2.printSchema();
Output -
+---+-----+-----+-----+-----+-----+
| id|Col_1|Col_2|Col_3|Col_4|Col_5|
+---+-----+-----+-----+-----+-----+
| 1| A| B| C| D| E|
| 2| X| Y| Z| P| F|
+---+-----+-----+-----+-----+-----+
root
|-- id: string (nullable = true)
|-- Col_1: string (nullable = true)
|-- Col_2: string (nullable = true)
|-- Col_3: string (nullable = true)
|-- Col_4: string (nullable = true)
|-- Col_5: string (nullable = true)
Conclusion - expected behavior, PASS.