I have a CSV file with data in the format below:
02/04/2018,MZE-RM00007(Kg.),29530,14.5,428185
02/04/2018,MZE-RM00007(Kg.),29160,14.5,422820
02/04/2018,MZE-RM00007(Kg.),22500,14.501,326272.5
02/04/2018,MZE-RM00007(Kg.),29490,14.5,427605
02/04/2018,MZE-RM00007(Kg.),19750,14.5,286375
02/04/2018,MZE-RM00007(Kg.),30140,14.5,437030
02/04/2018,MZE-RM00007(Kg.),24730,14.25,352402.5
02/04/2018,MZE-RM00007(Kg.),29520,14.5,428040
03/04/2018,CHOLINE CHLORIDE-MD00027(Kg.),3000,93,279000
I am trying to read it in PySpark as shown below:
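from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DateType, StringType, DoubleType

# Local Spark session used to load the purchase data
spark = SparkSession.builder.\
    appName("Weather_Data_Extraction_To_Delhi_Only_2017").\
    master("local").\
    config("spark.driver.memory", "4g").\
    config("spark.executor.memory", "2g").\
    getOrCreate()

# Explicit schema: one date column, one string column, three numeric columns
MySchema = StructType([
    StructField("sDate", DateType(), True),
    StructField("Items", StringType(), True),
    StructField("purchasedQTY", DoubleType(), True),
    StructField("rate", DoubleType(), True),
    StructField("purchasedVolume", DoubleType(), True),
])
linesDataFrame = spark.read.format("csv").schema(MySchema).load("/home/rajnish.kumar/eclipse-workspace/ShivShakti/Data/RMPurchaseData.csv")
# printSchema() prints the schema itself and returns None, so this also prints "None"
print(linesDataFrame.printSchema())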
and the printed schema is:
root
|-- sDate: date (nullable = true)
|-- Items: string (nullable = true)
|-- purchasedQTY: double (nullable = true)
|-- rate: double (nullable = true)
|-- purchasedVolume: double (nullable = true)
None
Now, when I query:
linesDataFrame.select("sDate","Items","purchasedQTY","rate","purchasedVolume").show()
I get the results below:
+-----+-----+------------+----+---------------+
|sDate|Items|purchasedQTY|rate|purchasedVolume|
+-----+-----+------------+----+---------------+
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
| null| null|        null|null|           null|
+-----+-----+------------+----+---------------+
only showing top 20 rows
But when I query:
linesDataFrame.select("Items","purchasedQTY","rate","purchasedVolume").show()
my result is below:
+--------------------+------------+------+---------------+
|               Items|purchasedQTY|  rate|purchasedVolume|
+--------------------+------------+------+---------------+
|    MZE-RM00007(Kg.)|     29530.0|  14.5|       428185.0|
|    MZE-RM00007(Kg.)|     29160.0|  14.5|       422820.0|
|    MZE-RM00007(Kg.)|     22500.0|14.501|       326272.5|
|    MZE-RM00007(Kg.)|     29490.0|  14.5|       427605.0|
|    MZE-RM00007(Kg.)|     19750.0|  14.5|       286375.0|
|    MZE-RM00007(Kg.)|     30140.0|  14.5|       437030.0|
|    MZE-RM00007(Kg.)|     24730.0| 14.25|       352402.5|
|    MZE-RM00007(Kg.)|     29520.0|  14.5|       428040.0|
|CHOLINE CHLORIDE-...|      3000.0|  93.0|       279000.0|
|    MZE-RM00007(Kg.)|     19790.0|  14.0|       277060.0|
|    MZE-RM00007(Kg.)|     28020.0|  14.5|       406290.0|
|    MZE-RM00007(Kg.)|     26330.0|  14.0|       368620.0|
|    MZE-RM00007(Kg.)|     26430.0|  14.0|       370020.0|
|MOP DRY-MD00183(Kg.)|       300.0| 158.0|        47400.0|
|    mop-MD00094(Kg.)|       500.0| 147.0|        73500.0|
|    MZE-RM00007(Kg.)|     23380.0|  14.0|       327320.0|
|    MZE-RM00007(Kg.)|     31840.0|  14.0|       445760.0|
|    MZE-RM00007(Kg.)|     14370.0|  14.5|       208365.0|
|    MZE-RM00007(Kg.)|     20660.0|  14.5|       299570.0|
|    MZE-RM00007(Kg.)|     20220.0|  13.9|       281058.0|
+--------------------+------------+------+---------------+
only showing top 20 rows
Why does including "sDate" in the query give me NULL values for every column, and how can I fix this?
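One thing worth checking, offered as a guess rather than a confirmed diagnosis: the dates in the file look like day/month/year (`02/04/2018`), while Spark's CSV reader expects `yyyy-MM-dd` for `DateType` columns unless told otherwise. The plain-Python sketch below only illustrates that mismatch using `datetime.strptime`. The helper `parses` and the list `samples` are hypothetical names introduced here, and the day-first reading of the dates is an assumption, since `02/04` and `03/04` are ambiguous on their own.

```python
from datetime import datetime

# Sample values copied from the first column of the CSV in the question.
samples = ["02/04/2018", "03/04/2018"]


def parses(value, pattern):
    """Return True if `value` can be parsed with the strptime `pattern`."""
    try:
        datetime.strptime(value, pattern)
        return True
    except ValueError:
        return False


# ISO-style pattern, the Python analogue of Spark's default yyyy-MM-dd.
iso_ok = [parses(s, "%Y-%m-%d") for s in samples]

# Day/month/year pattern, the Python analogue of dd/MM/yyyy
# (assumption: the file stores the day first).
dmy_ok = [parses(s, "%d/%m/%Y") for s in samples]

print(iso_ok)  # [False, False]
print(dmy_ok)  # [True, True]
```

If this is indeed the cause, adding `.option("dateFormat", "dd/MM/yyyy")` to the reader chain before `.load(...)` would be the first thing to try. It would also be consistent with the symptom: when `sDate` is not selected, the reader may skip parsing that column, so the other columns come through intact.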