I want to read a CSV file in which one field contains a multiline record. Whenever I split the rows with a lambda function, as shown in the code below, and then try to use the different columns, the code throws an error such as "list index out of range".
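The failure mode can be reproduced without Spark at all. The sketch below (plain stdlib, not the original code) shows why splitting physical lines on `","` tears a quoted multiline field apart, while a quote-aware parser like `csv.reader` keeps it as one record:

```python
import csv
import io

# A record whose last field is quoted and spans two physical lines,
# like the "Location" column in the hospital CSV.
raw = '10005,MARSHALL MEDICAL CENTER SOUTH,"2505 US HIGHWAY 431 NORTH\nBOAZ, AL"\n'

# Naive line-based splitting (what map(lambda x: x.split(",")) does after
# the file has already been split on newlines): the quoted field is torn
# across two "lines", so the second line has far too few columns and
# indexing past its end raises IndexError.
lines = raw.splitlines()
naive = [line.split(",") for line in lines]
print(naive[1])   # ['BOAZ', ' AL"']  -- a fragment, not a record

# csv.reader understands quoting, so the multiline field stays inside a
# single three-column record.
rows = list(csv.reader(io.StringIO(raw)))
print(rows[0])    # one record; the address field still contains the newline
```

This is why a line-oriented RDD of `str.split` results cannot handle this file, and a quote-aware reader (Spark's DataFrame CSV reader with `multiLine`, or `csv` in plain Python) is needed instead.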
My code:
from pyspark import SparkContext, SparkConf
import spark
import os
import shutil
print("What you want to analyze:")
print("1. No. of patients from same hospital")
print("2. No. of patients from a particular city")
choice=int(input())
conf = SparkConf().setMaster("local").setAppName("Word_count")
conf.set("spark.driver.memory", "3G")
rdd=spark.read.option("wholeFile", true).option("multiline",true).option("header", true).option("inferSchema", "true").csv("/home/karan/HospInfo3.csv")
if(choice==1):
    print("No. of patients from same hospital is:-")
    diss=rdd.map(lambda x:x.split(",")).\
         map(lambda x:(x[1],1)).\
         reduceByKey(lambda x,y:x+y).\
         map(lambda x:(x[1],x[0])).\
         sortByKey(ascending=False).\
         map(lambda x:(x[1],x[0]))
    rs=diss.collect()
elif(choice==2):
    print("No. of patients from a particular city is:-")
    diss=rdd.map(lambda x:x.split(",")).\
         map(lambda x:(x[3],1)).\
         reduceByKey(lambda x,y:x+y).\
         map(lambda x:(x[1],x[0])).\
         sortByKey(ascending=False).\
         map(lambda x:(x[1],x[0]))
    rs=diss.collect()
if os.path.exists("/home/karan/output2"):
    shutil.rmtree("/home/karan/output2")
for x in rs:
    print(str(x[0])+"\t\t"+str(x[1]))
sc.parallelize(rs).saveAsTextFile("output2")
I also get the following error:
Traceback (most recent call last):
File "hospitalsDemo2.py", line 11, in <module>
rdd=spark.read.option("wholeFile", true).option("multiline",true).option("header", true).option("inferSchema", "true").csv("/home/karan/HospInfo3.csv")
AttributeError: 'module' object has no attribute 'read'
And the SparkContext method of reading it as a CSV file gives the following error (which is where all this started):
Traceback (most recent call last):
File "hospitalsDemo2.py", line 17, in <module>
reduceByKey(lambda x,y:x+y).\
File "/home/karan/.local/lib/python2.7/site-packages/pyspark/rdd.py", line 1640, in reduceByKey
return self.combineByKey(lambda x: x, func, func, numPartitions, partitionFunc)
File "/home/karan/.local/lib/python2.7/site-packages/pyspark/rdd.py", line 1868, in combineByKey
numPartitions = self._defaultReducePartitions()
File "/home/karan/.local/lib/python2.7/site-packages/pyspark/rdd.py", line 2278, in _defaultReducePartitions
return self.getNumPartitions()
File "/home/karan/.local/lib/python2.7/site-packages/pyspark/rdd.py", line 2474, in getNumPartitions
return self._prev_jrdd.partitions().size()
File "/home/karan/.local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/karan/.local/lib/python2.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.io.IOException: Illegal file pattern: \k is not followed by '<' for named capturing group near index 7
\home\karan\hospInfo\.csv
^
at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71)
at org.apache.hadoop.fs.GlobFilter.<init>(GlobFilter.java:50)
at org.apache.hadoop.fs.Globber.glob(Globber.java:192)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.regex.PatternSyntaxException: \k is not followed by '<' for named capturing group near index 7
\home\karan\hospInfo\.csv
^
at java.util.regex.Pattern.error(Pattern.java:1957)
at java.util.regex.Pattern.escape(Pattern.java:2413)
at java.util.regex.Pattern.atom(Pattern.java:2200)
at java.util.regex.Pattern.sequence(Pattern.java:2132)
at java.util.regex.Pattern.expr(Pattern.java:1998)
at java.util.regex.Pattern.compile(Pattern.java:1698)
at java.util.regex.Pattern.<init>(Pattern.java:1351)
at java.util.regex.Pattern.compile(Pattern.java:1028)
at org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156)
at org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42)
at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67)
... 29 more
CSV sample
The CSV opened as a text file (a sample is shown for reference):
Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Hospital Type,Timeliness of care national comparison,Timeliness of care national comparison footnote,Efficient use of medical imaging national comparison,Efficient use of medical imaging national comparison footnote,Location
10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,Acute Care Hospitals,Government - Hospital District or Authority,,Same as the national average,,Above the national average,,Below the national average,,"2505 US HIGHWAY 431 NORTH
BOAZ, AL
"
10012,DEKALB REGIONAL MEDICAL CENTER,200 MED CENTER DRIVE,FORT PAYNE,AL,35968,DE KALB,2568453150,Acute Care Hospitals,Proprietary,,Above the national average,,Same as the national average,,"200 MED CENTER DRIVE
FORT PAYNE, AL
"
10032,WEDOWEE HOSPITAL,209 NORTH MAIN STREET,WEDOWEE,AL,36278,RANDOLPH,2563572111,Acute Care Hospitals,Government - Hospital District or Authority,,Not Available,Results are not available for this reporting period,"209 NORTH MAIN STREET
WEDOWEE, AL
"
10131,CRESTWOOD MEDICAL CENTER,ONE HOSPITAL DR SE,HUNTSVILLE,AL,35801,MADISON,2568823100,Acute Care Hospitals,Proprietary,,Same as the national average,,Same as the national average,,"ONE HOSPITAL DR SE
HUNTSVILLE, AL
"