Serialization error in PySpark when trying to read WARC records
August 26, 2018

I'm trying to read WARC records in PySpark using a custom input format. The same approach works fine in Scala. Here is my code:

r = sc.newAPIHadoopFile(
    '/Users/akshanshgupta/Workspace/00.warc',
    'org.warcbase.mapreduce.WacWarcInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.warcbase.io.WarcRecordWritable')

Here is the Scala code, which works fine:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.LongWritable
import org.apache.spark.rdd.RDD
import org.apache.spark.{SerializableWritable, SparkConf, SparkContext}
import org.warcbase.io.WarcRecordWritable
import org.warcbase.mapreduce.WacWarcInputFormat
import org.warcbase.spark.archive.io.{ArchiveRecord, WarcRecord}

val r = sc.newAPIHadoopFile("/Users/akshanshgupta/Workspace/00.warc",
  classOf[WacWarcInputFormat], classOf[LongWritable], classOf[WarcRecordWritable])
  .filter(r => r._2.getRecord.getHeader.getHeaderValue("WARC-Type").equals("response"))
  .map(r => new WarcRecord(new SerializableWritable(r._2))).asInstanceOf[RDD[ArchiveRecord]]

This is the error I get in PySpark:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.warcbase.io.WarcRecordWritable
Serialization stack:
    - object not serializable (class: org.warcbase.io.WarcRecordWritable, value: org.warcbase.io.WarcRecordWritable@aee8520)
    - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
    - object (class scala.Tuple2, (0,org.warcbase.io.WarcRecordWritable@aee8520))
    - element of array (index: 0)
    - array (class [Lscala.Tuple2;, size 1)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
    at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:239)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopFile(PythonRDD.scala:265)
    at org.apache.spark.api.python.PythonRDD.newAPIHadoopFile(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

What am I doing wrong?
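A note on what the trace seems to show: the failure happens inside `PythonRDD.newAPIHadoopFile`, which goes through `SerDeUtil.pairRDDToPython` and `RDD.take`, so a `(LongWritable, WarcRecordWritable)` pair is sent to the driver, and `WarcRecordWritable` is not Java-serializable. One direction that might help is the `valueConverter` argument of `sc.newAPIHadoopFile`, which names a JVM-side class that converts each value before it is handed to Python. Below is a minimal sketch under that assumption. The converter class name `com.example.WarcRecordToStringConverter` and the helper `read_warc` are hypothetical: no such class ships with warcbase as far as I know, so it would have to be written in Scala or Java (implementing `org.apache.spark.api.python.Converter`) and put on the Spark classpath.

```python
# Hypothetical converter class name. It is NOT part of warcbase; it would
# have to be written in Scala/Java (implementing
# org.apache.spark.api.python.Converter) and added to the Spark classpath.
WARC_VALUE_CONVERTER = 'com.example.WarcRecordToStringConverter'


def read_warc(sc, path, value_converter=WARC_VALUE_CONVERTER):
    """Read WARC records, asking Spark to convert each value on the JVM
    side so that no raw WarcRecordWritable has to be serialized."""
    return sc.newAPIHadoopFile(
        path,
        'org.warcbase.mapreduce.WacWarcInputFormat',
        'org.apache.hadoop.io.LongWritable',
        'org.warcbase.io.WarcRecordWritable',
        valueConverter=value_converter)
```

With a converter like that available, the call would become `r = read_warc(sc, '/Users/akshanshgupta/Workspace/00.warc')`. I have not verified that this resolves the error for this input format.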

...