Error while reading an Avro file in Spark code in Scala IDE
0 votes
/ 28 May 2020

I am creating a DataFrame by reading an Avro file, but I get an error while reading the file in a Spark application in Scala IDE.

package dataFrameBasics

import org.apache.spark.sql.SparkSession

object WorkingWithAvroFile {
  def main(ar: Array[String]): Unit = {
    val ss = SparkSession.builder().master("local")
      .appName("Working with Avro File")
      .getOrCreate()

    val avroDF = ss.read
      .format("com.databricks.spark.avro")
      .load("C:/Spark_Files/userdata1.avro")

    avroDF.printSchema()
    avroDF.show(10)
    println("Count:" + avroDF.count())
  }
}

The console shows the error below:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.avro.AvroFileFormat. Please find packages at http://spark.apache.org/third-party-projects.html

The Problems tab shows the error:

spark-avro_2.11-3.2.0.jar of SparkCourseAsMavenProject build path is cross-compiled with an incompatible version of Scala (2.11.0). In case this report is mistaken, this check can be disabled in the compiler preference page.

In pom.xml, I added the dependency:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>3.2.0</version>
</dependency>

I have tried different versions of this library, but it still gives the same error.

Answers [ 4 ]

0 votes
/ 28 May 2020
@QuickSilver --- Continuing stack trace (3)

***************
Caused by: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
        at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
        at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
        at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
        at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
        at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
        at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
        at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    20/05/28 19:53:52 INFO SparkContext: Invoking stop() from shutdown hook
    20/05/28 19:53:52 INFO SparkUI: Stopped Spark web UI at http://Lenovo-PC:4040
    20/05/28 19:53:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    20/05/28 19:53:53 INFO MemoryStore: MemoryStore cleared
    20/05/28 19:53:53 INFO BlockManager: BlockManager stopped
    20/05/28 19:53:53 INFO BlockManagerMaster: BlockManagerMaster stopped
    20/05/28 19:53:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    20/05/28 19:53:53 INFO SparkContext: Successfully stopped SparkContext
    20/05/28 19:53:53 INFO ShutdownHookManager: Shutdown hook called
    20/05/28 19:53:53 INFO ShutdownHookManager: Deleting directory C:\Users\santosh\AppData\Local\Temp\spark-5a9bc571-30e8-4a94-8f95-f62c6fc9023c
0 votes
/ 28 May 2020
@QuickSilver -- Please find the full stack trace below

******************
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/05/28 19:53:10 INFO SparkContext: Running Spark version 2.4.5
20/05/28 19:53:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/05/28 19:53:12 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:386)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
    at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
    at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
    at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2422)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:293)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
    at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
    at dataFrameBasics.WorkingWithAvroFile$.main(WorkingWithAvroFile.scala:9)
    at dataFrameBasics.WorkingWithAvroFile.main(WorkingWithAvroFile.scala)
20/05/28 19:53:12 INFO SparkContext: Submitted application: Working with Avro File
20/05/28 19:53:12 INFO SecurityManager: Changing view acls to: santosh
20/05/28 19:53:12 INFO SecurityManager: Changing modify acls to: santosh
20/05/28 19:53:12 INFO SecurityManager: Changing view acls groups to: 
20/05/28 19:53:12 INFO SecurityManager: Changing modify acls groups to: 
20/05/28 19:53:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(santosh); groups with view permissions: Set(); users  with modify permissions: Set(santosh); groups with modify permissions: Set()
20/05/28 19:53:16 INFO Utils: Successfully started service 'sparkDriver' on port 50965.
20/05/28 19:53:16 INFO SparkEnv: Registering MapOutputTracker
20/05/28 19:53:16 INFO SparkEnv: Registering BlockManagerMaster
20/05/28 19:53:16 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/05/28 19:53:16 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/05/28 19:53:16 INFO DiskBlockManager: Created local directory at C:\Users\santosh\AppData\Local\Temp\blockmgr-0efb8730-052b-474d-ad33-669ad1b5d5ed
20/05/28 19:53:16 INFO MemoryStore: MemoryStore started with capacity 351.3 MB
20/05/28 19:53:16 INFO SparkEnv: Registering OutputCommitCoordinator
20/05/28 19:53:17 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/05/28 19:53:17 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://Lenovo-PC:4040
20/05/28 19:53:18 INFO Executor: Starting executor ID driver on host localhost
20/05/28 19:53:18 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50976.
20/05/28 19:53:18 INFO NettyBlockTransferService: Server created on Lenovo-PC:50976
20/05/28 19:53:18 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/05/28 19:53:18 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, Lenovo-PC, 50976, None)
20/05/28 19:53:18 INFO BlockManagerMasterEndpoint: Registering block manager Lenovo-PC:50976 with 351.3 MB RAM, BlockManagerId(driver, Lenovo-PC, 50976, None)
20/05/28 19:53:18 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, Lenovo-PC, 50976, None)
20/05/28 19:53:18 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, Lenovo-PC, 50976, None)
20/05/28 19:53:19 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/C:/Users/santosh/Scala/Workspace/SparkCourseAsMavenProject/spark-warehouse').
20/05/28 19:53:19 INFO SharedState: Warehouse path is 'file:/C:/Users/santosh/Scala/Workspace/SparkCourseAsMavenProject/spark-warehouse'.
20/05/28 19:53:23 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/05/28 19:53:24 INFO InMemoryFileIndex: It took 232 ms to list leaf files for 1 paths.
**root
 |-- registration_dttm: string (nullable = true)
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- cc: long (nullable = true)
 |-- country: string (nullable = true)
 |-- birthdate: string (nullable = true)
 |-- salary: double (nullable = true)
 |-- title: string (nullable = true)
 |-- comments: string (nullable = true)**

20/05/28 19:53:31 INFO FileSourceStrategy: Pruning directories with: 
20/05/28 19:53:31 INFO FileSourceStrategy: Post-Scan Filters: 
20/05/28 19:53:31 INFO FileSourceStrategy: Output Data Schema: struct<registration_dttm: string, id: bigint, first_name: string, last_name: string, email: string ... 11 more fields>
20/05/28 19:53:31 INFO FileSourceScanExec: Pushed Filters: 
20/05/28 19:53:32 INFO CodeGenerator: Code generated in 790.65082 ms
20/05/28 19:53:35 INFO CodeGenerator: Code generated in 188.480854 ms
20/05/28 19:53:36 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 220.3 KB, free 351.1 MB)
20/05/28 19:53:42 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 20.6 KB, free 351.1 MB)
20/05/28 19:53:42 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on Lenovo-PC:50976 (size: 20.6 KB, free: 351.3 MB)
20/05/28 19:53:42 INFO SparkContext: Created broadcast 0 from show at WorkingWithAvroFile.scala:24
20/05/28 19:53:42 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4287865 bytes, open cost is considered as scanning 4194304 bytes.
20/05/28 19:53:42 INFO SparkContext: Starting job: show at WorkingWithAvroFile.scala:24
20/05/28 19:53:42 INFO DAGScheduler: Got job 0 (show at WorkingWithAvroFile.scala:24) with 1 output partitions
20/05/28 19:53:42 INFO DAGScheduler: Final stage: ResultStage 0 (show at WorkingWithAvroFile.scala:24)
20/05/28 19:53:42 INFO DAGScheduler: Parents of final stage: List()
20/05/28 19:53:42 INFO DAGScheduler: Missing parents: List()
20/05/28 19:53:42 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at show at WorkingWithAvroFile.scala:24), which has no missing parents
20/05/28 19:53:43 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 14.1 KB, free 351.1 MB)
20/05/28 19:53:43 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 6.0 KB, free 351.0 MB)
20/05/28 19:53:43 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on Lenovo-PC:50976 (size: 6.0 KB, free: 351.3 MB)
20/05/28 19:53:43 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1163
20/05/28 19:53:43 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at show at WorkingWithAvroFile.scala:24) (first 15 tasks are for partitions Vector(0))
20/05/28 19:53:43 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/05/28 19:53:43 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7732 bytes)
20/05/28 19:53:43 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
20/05/28 19:53:44 INFO FileScanRDD: Reading File path: file:///C:/Spark_Files/userdata1.avro, range: 0-93561, partition values: [empty row]
20/05/28 19:53:44 INFO CodeGenerator: Code generated in 85.990927 ms
20/05/28 19:53:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 3182 bytes result sent to driver
20/05/28 19:53:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2830 ms on localhost (executor driver) (1/1)
20/05/28 19:53:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
20/05/28 19:53:46 INFO DAGScheduler: ResultStage 0 (show at WorkingWithAvroFile.scala:24) finished in 3.362 s
20/05/28 19:53:46 INFO DAGScheduler: Job 0 finished: show at WorkingWithAvroFile.scala:24, took 4.340480 s
**+--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
|   registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
+--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
|2016-02-03T07:55:29Z|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
|2016-02-03T17:04:03Z|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|            null|              Canada| 1/16/1968|150280.17|       Accountant IV|        |
|2016-02-03T01:09:31Z|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
|2016-02-03T12:36:21Z|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
|2016-02-03T05:05:31Z|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
|2016-02-03T07:22:34Z|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
|2016-02-03T08:33:08Z|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
|2016-02-03T06:47:06Z|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|            null|Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
|2016-02-03T03:52:53Z|  9|      Jose|   Foster|   jfoster8@yelp.com|  Male|  132.31.53.61|            null|         South Korea| 3/27/1992|231067.84|Software Test Eng...|   1E+02|
|2016-02-03T18:29:47Z| 10|     Emily|  Stewart|estewart9@opensou...|Female|143.28.251.245|3574254110301671|             Nigeria| 1/28/1997| 27234.28|     Health Coach IV|        |
+--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
only showing top 10 rows**

20/05/28 19:53:47 INFO FileSourceStrategy: Pruning directories with: 
20/05/28 19:53:47 INFO FileSourceStrategy: Post-Scan Filters: 
20/05/28 19:53:47 INFO FileSourceStrategy: Output Data Schema: struct<>
20/05/28 19:53:47 INFO FileSourceScanExec: Pushed Filters: 
20/05/28 19:53:48 INFO CodeGenerator: Code generated in 50.675225 ms
20/05/28 19:53:48 INFO CodeGenerator: Code generated in 45.623914 ms
20/05/28 19:53:48 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 220.3 KB, free 350.8 MB)
20/05/28 19:53:48 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 20.6 KB, free 350.8 MB)
20/05/28 19:53:48 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on Lenovo-PC:50976 (size: 20.6 KB, free: 351.3 MB)
20/05/28 19:53:48 INFO SparkContext: Created broadcast 2 from count at WorkingWithAvroFile.scala:25
20/05/28 19:53:48 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4287865 bytes, open cost is considered as scanning 4194304 bytes.
20/05/28 19:53:48 INFO SparkContext: Starting job: count at WorkingWithAvroFile.scala:25
20/05/28 19:53:48 INFO DAGScheduler: Registering RDD 6 (count at WorkingWithAvroFile.scala:25) as input to shuffle 0
20/05/28 19:53:48 INFO DAGScheduler: Got job 1 (count at WorkingWithAvroFile.scala:25) with 1 output partitions
20/05/28 19:53:48 INFO DAGScheduler: Final stage: ResultStage 2 (count at WorkingWithAvroFile.scala:25)
20/05/28 19:53:48 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
20/05/28 19:53:48 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
20/05/28 19:53:48 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[6] at count at WorkingWithAvroFile.scala:25), which has no missing parents
20/05/28 19:53:48 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 10.9 KB, free 350.8 MB)
20/05/28 19:53:48 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 5.5 KB, free 350.8 MB)
20/05/28 19:53:48 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on Lenovo-PC:50976 (size: 5.5 KB, free: 351.2 MB)
20/05/28 19:53:48 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1163
20/05/28 19:53:48 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[6] at count at WorkingWithAvroFile.scala:25) (first 15 tasks are for partitions Vector(0))
20/05/28 19:53:48 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
20/05/28 19:53:48 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 7721 bytes)
20/05/28 19:53:48 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 27
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 17
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 14
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 20
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 22
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 28
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 12
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 13
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 30
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 21
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 9
20/05/28 19:53:49 INFO FileScanRDD: Reading File path: file:///C:/Spark_Files/userdata1.avro, range: 0-93561, partition values: [empty row]
20/05/28 19:53:49 INFO CodeGenerator: Code generated in 33.699831 ms
20/05/28 19:53:49 INFO BlockManagerInfo: Removed broadcast_1_piece0 on Lenovo-PC:50976 in memory (size: 6.0 KB, free: 351.3 MB)
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 19
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 7
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 23
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 16
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 11
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 15
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 26
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 8
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 29
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 10
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 25
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 18
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 5
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 6
20/05/28 19:53:49 INFO ContextCleaner: Cleaned accumulator 24
20/05/28 19:53:49 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1638 bytes result sent to driver
20/05/28 19:53:49 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 618 ms on localhost (executor driver) (1/1)
20/05/28 19:53:49 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
20/05/28 19:53:49 INFO DAGScheduler: ShuffleMapStage 1 (count at WorkingWithAvroFile.scala:25) finished in 0.778 s
20/05/28 19:53:49 INFO DAGScheduler: looking for newly runnable stages
20/05/28 19:53:49 INFO DAGScheduler: running: Set()
20/05/28 19:53:49 INFO DAGScheduler: waiting: Set(ResultStage 2)
20/05/28 19:53:49 INFO DAGScheduler: failed: Set()
20/05/28 19:53:49 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[9] at count at WorkingWithAvroFile.scala:25), which has no missing parents
20/05/28 19:53:49 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 8.6 KB, free 350.8 MB)
20/05/28 19:53:49 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.4 KB, free 350.8 MB)
20/05/28 19:53:49 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on Lenovo-PC:50976 (size: 4.4 KB, free: 351.3 MB)
20/05/28 19:53:49 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1163
20/05/28 19:53:49 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[9] at count at WorkingWithAvroFile.scala:25) (first 15 tasks are for partitions Vector(0))
20/05/28 19:53:49 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
20/05/28 19:53:49 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, executor driver, partition 0, ANY, 7246 bytes)
20/05/28 19:53:49 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
20/05/28 19:53:49 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks including 1 local blocks and 0 remote blocks
20/05/28 19:53:49 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 36 ms
20/05/28 19:53:50 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 1749 bytes result sent to driver
20/05/28 19:53:50 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 397 ms on localhost (executor driver) (1/1)
20/05/28 19:53:50 INFO DAGScheduler: ResultStage 2 (count at WorkingWithAvroFile.scala:25) finished in 0.477 s
20/05/28 19:53:50 INFO DAGScheduler: Job 1 finished: count at WorkingWithAvroFile.scala:25, took 1.384763 s
20/05/28 19:53:50 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
**Count:1000**
0 votes
/ 28 May 2020
@QuickSilver -- continuing stack trace

**********
20/05/28 19:53:50 INFO FileSourceStrategy: Pruning directories with: 
20/05/28 19:53:50 INFO FileSourceStrategy: Post-Scan Filters: 
20/05/28 19:53:50 INFO FileSourceStrategy: Output Data Schema: struct<registration_dttm: string, id: bigint, first_name: string, last_name: string, email: string ... 11 more fields>
20/05/28 19:53:50 INFO FileSourceScanExec: Pushed Filters: 
20/05/28 19:53:50 INFO deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
20/05/28 19:53:50 INFO AvroFileFormat: Compressing Avro output using the snappy codec
20/05/28 19:53:50 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
20/05/28 19:53:50 INFO CodeGenerator: Code generated in 22.138244 ms
20/05/28 19:53:50 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 220.3 KB, free 350.6 MB)
20/05/28 19:53:51 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 20.6 KB, free 350.6 MB)
20/05/28 19:53:51 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on Lenovo-PC:50976 (size: 20.6 KB, free: 351.2 MB)
20/05/28 19:53:51 INFO SparkContext: Created broadcast 5 from save at WorkingWithAvroFile.scala:27
20/05/28 19:53:51 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4287865 bytes, open cost is considered as scanning 4194304 bytes.
20/05/28 19:53:51 INFO SparkContext: Starting job: save at WorkingWithAvroFile.scala:27
20/05/28 19:53:51 INFO DAGScheduler: Got job 2 (save at WorkingWithAvroFile.scala:27) with 1 output partitions
20/05/28 19:53:51 INFO DAGScheduler: Final stage: ResultStage 3 (save at WorkingWithAvroFile.scala:27)
20/05/28 19:53:51 INFO DAGScheduler: Parents of final stage: List()
20/05/28 19:53:51 INFO DAGScheduler: Missing parents: List()
20/05/28 19:53:51 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[11] at save at WorkingWithAvroFile.scala:27), which has no missing parents
20/05/28 19:53:51 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 135.4 KB, free 350.4 MB)
20/05/28 19:53:51 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 48.6 KB, free 350.4 MB)
20/05/28 19:53:51 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on Lenovo-PC:50976 (size: 48.6 KB, free: 351.2 MB)
20/05/28 19:53:51 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1163
20/05/28 19:53:51 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[11] at save at WorkingWithAvroFile.scala:27) (first 15 tasks are for partitions Vector(0))
20/05/28 19:53:51 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
20/05/28 19:53:51 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, executor driver, partition 0, PROCESS_LOCAL, 7732 bytes)
20/05/28 19:53:51 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
20/05/28 19:53:51 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
**20/05/28 19:53:51 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro**
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
    at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
    at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
    at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
    at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
    at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
    at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
    at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
20/05/28 19:53:52 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
    at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
    at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
    at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
    at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
    at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
    at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
    at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

20/05/28 19:53:52 ERROR TaskSetManager: Task 0 in stage 3.0 failed 1 times; aborting job
20/05/28 19:53:52 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
20/05/28 19:53:52 INFO TaskSchedulerImpl: Cancelling stage 3
20/05/28 19:53:52 INFO TaskSchedulerImpl: Killing all running tasks in stage 3: Stage cancelled
20/05/28 19:53:52 INFO DAGScheduler: ResultStage 3 (save at WorkingWithAvroFile.scala:27) failed in 0.880 s due to Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver)**: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro**
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
    at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
    at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
    at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
    at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
    at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
    at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
    at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
20/05/28 19:53:52 INFO DAGScheduler: Job 2 failed: save at WorkingWithAvroFile.scala:27, took 0.916810 s
20/05/28 19:53:52 ERROR FileFormatWriter: Aborting job 39b6cdd2-d53c-4cb0-afa4-8d21a34d4638.
**org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro**
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
    at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
    at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
    at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
    at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
    at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
    at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
    at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1891)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1879)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1878)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:927)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:927)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:80)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at dataFrameBasics.WorkingWithAvroFile$.main(WorkingWithAvroFile.scala:27)
    at dataFrameBasics.WorkingWithAvroFile.main(WorkingWithAvroFile.scala)
Caused by: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
    at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
    at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
    at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
    at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
    at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
    at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
    at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
20/05/28 19:53:52 WARN FileUtil: Failed to delete file or dir [C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\.part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro.crc]: it still exists.
20/05/28 19:53:52 WARN FileUtil: Failed to delete file or dir [C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro]: it still exists.
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:80)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at dataFrameBasics.WorkingWithAvroFile$.main(WorkingWithAvroFile.scala:27)
    at dataFrameBasics.WorkingWithAvroFile.main(WorkingWithAvroFile.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\santosh\Scala\Workspace\SparkCourseAsMavenProject\output_destination\avro_file\_temporary\0\_temporary\attempt_20200528195351_0003_m_000000_3\part-00000-370dab6e-2c60-4b7c-82d6-e6c3645b538c-c000.avro
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
    at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
    at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
    at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:775)
    at org.apache.spark.sql.avro.AvroOutputWriter$$anon$1.getAvroFileOutputStream(AvroOutputWriter.scala:58)
    at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
    at org.apache.spark.sql.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:61)
    at org.apache.spark.sql.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:43)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:123)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:236)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1891)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1879)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1878)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:927)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:927)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
    ... 21 more
0 votes
/ 28 May 2020

First of all, you cannot import the com.databricks Avro jar and expect the org.apache.spark.sql Avro file format to be inside it. Better to pick something like the below. Note: use the right Scala and Spark versions.

Remove your Databricks avro dependency and replace it with the one below:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_2.12</artifactId>
            <version>2.4.0</version>
        </dependency>
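
Note that the artifact's Scala suffix has to match your project's Scala version, and the library version should track your Spark version. Since the logs above show Spark 2.4.5 on Scala 2.11, the matching coordinates for that setup would presumably be:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_2.11</artifactId>
            <version>2.4.5</version>
        </dependency>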

Use the code below to read and write Avro files:

val df = spark.read.format("avro").load("person.avro") // read via the SparkSession (ss in the question's code), not via a DataFrame
df.write.format("avro").save("person.avro")
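
For reference, here is a minimal self-contained version of the question's program ported to the built-in source — a sketch assuming Spark 2.4.x with the spark-avro module above on the classpath; the path and names are the ones from the question:

package dataFrameBasics

import org.apache.spark.sql.SparkSession

object WorkingWithAvroFile {
  def main(ar: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("Working with Avro File")
      .getOrCreate()

    // With org.apache.spark:spark-avro on the classpath, the short
    // name "avro" resolves to the built-in Avro data source.
    val avroDF = spark.read
      .format("avro")
      .load("C:/Spark_Files/userdata1.avro")

    avroDF.printSchema()
    avroDF.show(10)
    println("Count:" + avroDF.count())

    spark.stop()
  }
}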

You can find the Maven dependency information at the URL below: Apache Spark Avro

...