At the risk of being downvoted, I have to ask for help with this error. I am running a few simple joins and computations on a fairly large DataFrame. Everything works fine until I try to write the data out with the following statement:
FinalTable.repartition(1).write.mode(SaveMode.Overwrite).parquet(OutputFilePath + "/day=" + Day)
and then I get the following error:
Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *(3) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#218L])
   +- *(3) Project
      +- *(3) BroadcastHashJoin [cast(sid#30 as bigint), reid#26L, substring(cast(ra#27L as string), -13, (length(cast(ra#27L as string)) - 7))], [sid#75L, reid#73L, substring(cast(ra#74L as string), -13, (length(cast(ra#74L as string)) - 7))], LeftOuter, BuildRight
         :- *(3) Project [reid#26L, ra#27L, sid#30]
         :  +- *(3) BroadcastHashJoin [cast(sid#30 as bigint), reid#26L, substring(cast(ra#27L as string), -13, (length(cast(ra#27L as string)) - 7))], [sid#101L, reid#99L, substring(cast(ra#100L as string), -13, (length(cast(ra#100L as string)) - 7))], LeftOuter, BuildRight
         :     :- *(3) Project [reid#26L, ra#27L, sid#30]
         :     :  +- *(3) Filter isnotnull(ppbc#4)
         :     :     +- *(3) FileScan parquet [ppbc#4,reid#26L,ra#27L,sid#30] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://bucket/folder/parquet/day=2018-10-22], PartitionFilters: [], PushedFilters: [IsNotNull(pbbc)], ReadSchema: struct<ppbc:string,reid:bigint,ra:bigint,sid:string>
         :     +- BroadcastExchange HashedRelationBroadcastMode(List(input[2, bigint, true], input[0, bigint, true], substring(cast(input[1, bigint, true] as string), -13, (length(cast(input[1, bigint, true] as string)) - 7))))
         :        +- *(1) Project [reid#99L, ra#100L, sid#101L]
         :           +- *(1) FileScan json [reid#99L,ra#100L,sid#101L,day#103] Batched: false, Format: JSON, Location: InMemoryFileIndex[s3://bucket/folder/pathtodata], PartitionCount: 69, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<reid:bigint,ra:bigint,sid:bigint>
         +- BroadcastExchange HashedRelationBroadcastMode(List(input[2, bigint, true], input[0, bigint, true], substring(cast(input[1, bigint, true] as string), -13, (length(cast(input[1, bigint, true] as string)) - 7))))
            +- *(2) Project [reid#73L, ra#74L, sid#75L]
               +- *(2) FileScan json [reid#73L,ra#74L,sid#75L,day#77] Batched: false, Format: JSON, Location: InMemoryFileIndex[s3://bucket/folder/pathtodata2], PartitionCount: 11, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<reid:bigint,ra:bigint,sid:bigint>