Hadoop shuffle fails with a Checksum Error
June 6, 2019

We are seeing a very slow copy phase:

reduce > copy task(attempt_1559832449421_0209_m_000006_0 succeeded at 0.03 MB/s) Aggregated copy rate(21 of 22 at 0.54 MB/s)

The reducer log contains:

2019-04-01 19:14:46,919 WARN [fetcher#10] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to shuffle for fetcher#10
org.apache.hadoop.fs.ChecksumException: Checksum Error
    at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:212)
    at org.apache.hadoop.mapred.IFileInputStream.readWithChecksum(IFileInputStream.java:189)
    at org.apache.hadoop.mapreduce.task.reduce.OnDiskMapOutput.shuffle(OnDiskMapOutput.java:103)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:562)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:348)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)
2019-04-01 19:14:46,920 WARN [fetcher#10] org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to shuffle output of attempt_1559832449421_0209_m_000010_0 from phdp100.g:13562
java.io.IOException: org.apache.hadoop.fs.ChecksumException: Checksum Error
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:566)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:348)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:198)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum Error
    at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:212)
    at org.apache.hadoop.mapred.IFileInputStream.readWithChecksum(IFileInputStream.java:189)
    at org.apache.hadoop.mapreduce.task.reduce.OnDiskMapOutput.shuffle(OnDiskMapOutput.java:103)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:562)
    ... 2 more
2019-04-01 19:14:46,920 WARN [fetcher#10] org.apache.hadoop.mapreduce.task.reduce.Fetcher: copyMapOutput failed for tasks [attempt_1559832449421_0209_m_000010_0]
2019-04-01 19:14:46,921 INFO [fetcher#10] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Reporting fetch failure for attempt_1559832449421_0209_m_000010_0 to MRAppMaster.
2019-04-01 19:14:46,921 INFO [fetcher#10] org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: phdp100.g:13562 freed by fetcher#10 in 118716ms
2019-04-01 19:16:36,779 INFO [fetcher#10] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: attempt_1559832449421_0209_m_000010_0: Shuffling to disk since 367993615 is greater than maxSingleShuffleLimit (58274612)
2019-04-01 19:16:36,830 INFO [fetcher#10] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#10 about to shuffle output of map attempt_1559832449421_0209_m_000010_0 decomp: 367993615 len: 27373924 to DISK
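For context, as we understand the Hadoop 2.x shuffle code, each map output segment (IFile) is written through IFileOutputStream, which appends a single 4-byte CRC32 covering the whole segment. IFileInputStream recomputes that CRC32 while the reducer reads the segment and throws `ChecksumException: Checksum Error` if the trailing value does not match. The standalone sketch below mimics that check with `java.util.zip.CRC32`. It is a simplified illustration, not Hadoop's code; the class and method names are hypothetical, and the big-endian trailer layout is our assumption about the format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical sketch of an IFile-style trailing checksum (not Hadoop's actual classes).
public class IFileChecksumSketch {
    static final int CHECKSUM_SIZE = 4; // one CRC32 value for the whole segment

    // Writer side: payload followed by a big-endian CRC32 of the payload.
    static byte[] withChecksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + CHECKSUM_SIZE)
                .put(payload)
                .putInt((int) crc.getValue()) // ByteBuffer is big-endian by default
                .array();
    }

    // Reader side: recompute the CRC32 over the data and compare with the trailer.
    static boolean verify(byte[] segment) {
        if (segment.length < CHECKSUM_SIZE) {
            return false; // too short to even hold a checksum
        }
        int dataLength = segment.length - CHECKSUM_SIZE;
        CRC32 crc = new CRC32();
        crc.update(segment, 0, dataLength);
        int expected = ByteBuffer.wrap(segment, dataLength, CHECKSUM_SIZE).getInt();
        return expected == (int) crc.getValue();
    }

    public static void main(String[] args) {
        byte[] segment = withChecksum("map output bytes".getBytes(StandardCharsets.UTF_8));
        System.out.println("intact: " + verify(segment));       // intact: true

        byte[] corrupted = Arrays.copyOf(segment, segment.length);
        corrupted[3] ^= 0x01; // flip a single bit inside the payload
        System.out.println("corrupted: " + verify(corrupted));  // corrupted: false
    }
}
```

A single flipped bit is always caught by CRC32, so in this model the reducer's error means the bytes it received for attempt_1559832449421_0209_m_000010_0 differ from what was checksummed when the segment was written.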

We are running CDH 5.16. What could be causing this?

...