The NullPointerException does not occur ... with Spark 2.2.0 and the XML dependency below:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.11</artifactId>
<version>0.4.1</version>
</dependency>
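For sbt-based builds, the equivalent coordinate (a sketch, assuming a Scala 2.11 build so that %% resolves to spark-xml_2.11) would be:

libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1"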
I saved the same data and successfully read it back, as shown below:
package com.examples
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.log4j.Level
import org.apache.spark.sql.{SQLContext, SparkSession}
/**
* Created by Ram Ghadiyaram
*/
object SparkXmlTest {
org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)
def main(args: Array[String]) {
val spark = SparkSession.builder.
master("local")
.appName(this.getClass.getName)
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val sc = spark.sparkContext
val sqlContext = new SQLContext(sc)
val str =
"""
|<?xml version="1.0"?>
|<Company>
| <Employee id="1">
| <Email>tp@xyz.com</Email>
| <UserData id="id32" type="AttributesInContext">
| <UserValue value="7in" title="Height"></UserValue>
| <UserValue value="23lb" title="Weight"></UserValue>
|</UserData>
| </Employee>
| <Measures id="1">
| <Email>newdata@rty.com</Email>
| <UserData id="id32" type="SitesInContext">
|</UserData>
| </Measures>
| <Employee id="2">
| <Email>tp@xyz.com</Email>
| <UserData id="id33" type="AttributesInContext">
| <UserValue value="7in" title="Height"></UserValue>
| <UserValue value="34lb" title="Weight"></UserValue>
|</UserData>
| </Employee>
| <Measures id="2">
| <Email>nextrow@rty.com</Email>
| <UserData id="id35" type="SitesInContext">
|</UserData>
| </Measures>
| <Employee id="3">
| <Email>tp@xyz.com</Email>
| <UserData id="id34" type="AttributesInContext">
| <UserValue value="7in" title="Height"></UserValue>
| <UserValue value="" title="Weight"></UserValue>
|</UserData>
| </Employee>
|</Company>
""".stripMargin
println("save to file ")
val f = new File("xmltest.xml")
FileUtils.writeStringToFile(f, str)
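// read the XML file with spark-xml; the nullValue option turns empty strings into nulls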
val xmldf = spark.read.format("com.databricks.spark.xml")
.option("rootTag", "Company")
.option("rowTag", "Employee")
.option("nullValue", "")
.load(f.getAbsolutePath)
val columnNames = xmldf.columns.toSeq
val sdf = xmldf.select(columnNames.map(c => xmldf.col(c)): _*)
sdf.write.format("com.databricks.spark.xml")
.option("rootTag", "Company")
.option("rowTag", "Employee")
.mode("overwrite")
.save("/src/main/resources/Rel1")
println("read back from saved file ....")
val readbackdf = spark.read.format("com.databricks.spark.xml")
.option("rootTag", "Company")
.option("rowTag", "Employee")
.option("nullValue", "")
.load("/src/main/resources/Rel1")
readbackdf.show(false)
}
}
Result:
save to file
read back from saved file ....
+----------+------------------------------------------------------------------------------+---+
|Email |UserData |_id|
+----------+------------------------------------------------------------------------------+---+
|tp@xyz.com|[WrappedArray([null,Height,7in], [null,Weight,23lb]),id32,AttributesInContext]|1 |
|tp@xyz.com|[WrappedArray([null,Height,7in], [null,Weight,34lb]),id33,AttributesInContext]|2 |
|tp@xyz.com|[WrappedArray([null,Height,7in], [null,Weight,null]),id34,AttributesInContext]|3 |
+----------+------------------------------------------------------------------------------+---+
Update: with the OP's latest XML I was able to reproduce the exception and fix it with the options below...
.option("attributePrefix", "_Att")
.option("valueTag", "_VALUE")
Full set of options and descriptions from Databricks:
This package allows reading XML files in local or distributed filesystem as Spark DataFrames. When reading files the API accepts several options:
path: Location of files. Similar to Spark can accept standard Hadoop globbing expressions.
rowTag: The row tag of your xml files to treat as a row. For example, in this xml ..., the appropriate value would be book. Default is ROW. At the moment, rows containing self closing xml tags are not supported.
samplingRatio: Sampling ratio for inferring schema (0.0 ~ 1). Default is 1. Possible types are StructType, ArrayType, StringType, LongType, DoubleType, BooleanType, TimestampType and NullType, unless user provides a schema for this.
excludeAttribute : Whether you want to exclude attributes in elements or not. Default is false.
treatEmptyValuesAsNulls : (DEPRECATED: use nullValue set to "") Whether you want to treat whitespaces as a null value. Default is false
mode: The mode for dealing with corrupt records during parsing. Default is PERMISSIVE.
PERMISSIVE :
When it encounters a corrupted record, it sets all fields to null and puts the malformed string into a new field configured by columnNameOfCorruptRecord.
When it encounters a field of the wrong datatype, it sets the offending field to null.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
inferSchema: if true, attempts to infer an appropriate type for each resulting DataFrame column, like a boolean, numeric or date type. If false, all resulting columns are of string type. Default is true.
columnNameOfCorruptRecord: The name of new field where malformed strings are stored. Default is _corrupt_record.
attributePrefix: The prefix for attributes so that we can differentiate attributes and elements. This will be the prefix for field names. Default is _.
valueTag: The tag used for the value when there are attributes in the element having no child. Default is _VALUE.
charset: Defaults to 'UTF-8' but can be set to other valid charset names
ignoreSurroundingSpaces: Defines whether or not surrounding whitespaces from values being read should be skipped. Default is false.
When writing files the API accepts several options:
path: Location to write files.
rowTag: The row tag of your xml files to treat as a row. For example, in this xml ..., the appropriate value would be book. Default is ROW.
rootTag: The root tag of your xml files to treat as the root. For example, in this xml ..., the appropriate value would be books. Default is ROWS.
nullValue: The value to write null value. Default is string null. When this is null, it does not write attributes and elements for fields.
attributePrefix: The prefix for attributes so that we can differentiate attributes and elements. This will be the prefix for field names. Default is _.
valueTag: The tag used for the value when there are attributes in the element having no child. Default is _VALUE.
compression: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of case-insensitive shorten names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified.
Currently it supports the shortened name usage. You can use just xml instead of com.databricks.spark.xml.
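As a quick illustration of the shortened format name together with a few of the read options listed above (a sketch only; the path and option values are illustrative, reusing the file from the example above):

// illustrative sketch: short "xml" format name plus a few of the read options described above
val df = spark.read.format("xml") // same as com.databricks.spark.xml
  .option("rowTag", "Employee") // element to treat as a row
  .option("samplingRatio", "1.0") // fraction of rows sampled for schema inference
  .option("mode", "PERMISSIVE") // keep corrupt records, nulling out bad fields
  .option("columnNameOfCorruptRecord", "_corrupt_record") // where malformed strings go
  .option("charset", "UTF-8")
  .load("xmltest.xml")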
Full example here:
package com.examples
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.{SQLContext, SparkSession}
/**
* Created by Ram Ghadiyaram
*/
object SparkXmlTest {
// org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)
def main(args: Array[String]) {
val spark = SparkSession.builder.
master("local")
.appName(this.getClass.getName)
.getOrCreate()
// spark.sparkContext.setLogLevel("ERROR")
val sc = spark.sparkContext
val sqlContext = new SQLContext(sc)
// the original sample XML from the first example is omitted here for brevity
val str =
"""
|<Company>
| <Employee id="id47" masterRef="#id53" revision="" nomenclature="">
|<ApplicationRef version="J.0" application="Teamcenter"></ApplicationRef>
|<UserData id="id52">
|<UserValue valueRef="#id4" value="" title="_CONFIG_CONTEXT"></UserValue></UserData></Employee>
|<Employee id="id47" masterRef="#id53" revision="" nomenclature="">
|<ApplicationRef version="B.0" application="Teamcenter"></ApplicationRef>
|<UserData id="id63">
|<UserValue valueRef="#id5" value="" title="_CONFIG_CONTEXT"></UserValue></UserData></Employee>
|</Company>
""".stripMargin
println("save to file ")
val f = new File("xmltest.xml")
FileUtils.writeStringToFile(f, str)
val xmldf = spark.read.format("com.databricks.spark.xml")
.option("rootTag", "Company")
.option("rowTag", "Employee")
.option("nullValue", "")
.load(f.getAbsolutePath)
val columnNames = xmldf.columns.toSeq
val sdf = xmldf.select(columnNames.map(c => xmldf.col(c)): _*)
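// write back as XML; attributePrefix and valueTag are set explicitly here (the options used above to fix the exception)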
sdf.write.format("com.databricks.spark.xml")
.option("rootTag", "Company")
.option("rowTag", "Employee")
.option("attributePrefix", "_Att")
.option("valueTag", "_VALUE")
.mode("overwrite")
.save("./src/main/resources/Rel1")
println("read back from saved file ....")
val readbackdf = spark.read.format("com.databricks.spark.xml")
.option("rootTag", "Company")
.option("rowTag", "Employee")
.option("nullValue", "")
.load("./src/main/resources/Rel1")
readbackdf.show(false)
}
}
Result:
save to file
read back from saved file ....
+-----------------+-------------------------------+----+----------+
|ApplicationRef |UserData |_id |_masterRef|
+-----------------+-------------------------------+----+----------+
|[Teamcenter, J.0]|[[_CONFIG_CONTEXT, #id4], id52]|id47|#id53 |
|[Teamcenter, B.0]|[[_CONFIG_CONTEXT, #id5], id63]|id47|#id53 |
+-----------------+-------------------------------+----+----------+