pyspark parquet read error when reading parquet files stored in HDFS: BlockMissingException
2 votes
/ 17 June 2019

I have data stored in Parquet format on HDFS that I want to process with Spark.

Platform:
Ubuntu 16.04
Spark 2.1.3
Hadoop 2.6.5

Here is a listing of the directory where the data is stored:

hdfs dfs -ls /databases/crawl_data_stats/scraped_metadata
Found 6 items
drwxr-xr-x   - root supergroup          0 2019-06-13 12:30 /databases/crawl_data_stats/scraped_metadata/.metadata
drwxr-xr-x   - root supergroup          0 2019-06-13 12:32 /databases/crawl_data_stats/scraped_metadata/.signals
-rw-r--r--   1 root supergroup   87819081 2019-06-13 12:32 /databases/crawl_data_stats/scraped_metadata/3695c4ed-e140-4a01-aa27-bd29b5fb7be5.parquet
-rw-r--r--   1 root supergroup   92307005 2019-06-13 12:31 /databases/crawl_data_stats/scraped_metadata/4fc7732b-2a7b-4a56-a034-16bc0393c0b9.parquet
-rw-r--r--   1 root supergroup   69329182 2019-06-13 12:31 /databases/crawl_data_stats/scraped_metadata/a69db553-1ac7-469d-b55c-ff4133f4b8dc.parquet
-rw-r--r--   1 root supergroup   90382508 2019-06-13 12:32 /databases/crawl_data_stats/scraped_metadata/d7ca247f-7832-4b0b-88b6-f940dcfe9df4.parquet

I tried to read the parquet files with:

temp = spark.read.parquet("hdfs://localhost:9000/databases/crawl_data_stats/scraped_metadata")
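
For reference, the same files can also be streamed directly through the HDFS client, with Spark and Parquet out of the picture. A minimal probe over the four files from the listing above (an illustrative sketch, assuming the hdfs CLI is on the client's PATH):

for f in 3695c4ed-e140-4a01-aa27-bd29b5fb7be5 \
         4fc7732b-2a7b-4a56-a034-16bc0393c0b9 \
         a69db553-1ac7-469d-b55c-ff4133f4b8dc \
         d7ca247f-7832-4b0b-88b6-f940dcfe9df4; do
  # Stream the whole file to /dev/null; a file with an unreadable block
  # should fail here too, with the same "Could not obtain block" error.
  hdfs dfs -cat "/databases/crawl_data_stats/scraped_metadata/$f.parquet" > /dev/null \
    && echo "OK: $f" || echo "FAILED: $f"
done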

Here is part of the error I get:

19/06/17 11:00:14 WARN DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 1300.3553468732534 msec.  
19/06/17 11:00:16 WARN DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 4782.536317936689 msec.  
19/06/17 11:00:21 WARN DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 6014.259159548854 msec.  
19/06/17 11:00:27 WARN DFSClient: Could not obtain block: BP-317098980-127.0.0.1-1560408421762:blk_1073741878_1054 file=/databases/crawl_data_stats/scraped_metadata/3695c4ed-e140-4a01-aa27-bd29b5fb7be5.parquet No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException  
19/06/17 11:00:27 WARN DFSClient: Could not obtain block: BP-317098980-127.0.0.1-1560408421762:blk_1073741878_1054 file=/databases/crawl_data_stats/scraped_metadata/3695c4ed-e140-4a01-aa27-bd29b5fb7be5.parquet No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException  
19/06/17 11:00:27 WARN DFSClient: DFS Read  
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-317098980-127.0.0.1-1560408421762:blk_1073741878_1054 file=/databases/crawl_data_stats/scraped_metadata/3695c4ed-e140-4a01-aa27-bd29b5fb7be5.parquet  

...

19/06/17 11:00:27 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job  
Traceback (most recent call last):  
  File "<stdin>", line 1, in <module>  
  File "/usr/local/spark/spark-2.1.3-bin-hadoop2.6/python/pyspark/sql/readwriter.py", line 274, in parquet  
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))  
  File "/usr/local/spark/spark-2.1.3-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__  
  File "/usr/local/spark/spark-2.1.3-bin-hadoop2.6/python/pyspark/sql/utils.py", line 63, in deco  
    return f(*a, **kw)  
  File "/usr/local/spark/spark-2.1.3-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.parquet.  
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-317098980-127.0.0.1-1560408421762:blk_1073741878_1054 file=/databases/crawl_data_stats/scraped_metadata/3695c4ed-e140-4a01-aa27-bd29b5fb7be5.parquet  

...
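
The log itself suggests the problem is in HDFS rather than in Spark: "No live nodes contain current block ... Dead nodes:" means the NameNode still has metadata for block blk_1073741878_1054, but no DataNode can serve a replica (and per the listing above, the files were written with replication factor 1). HDFS's own tooling can confirm which files and blocks are affected; a minimal sketch, assuming the hdfs CLI is available:

# List each file's blocks and flag missing or corrupt replicas.
hdfs fsck /databases/crawl_data_stats/scraped_metadata -files -blocks -locations

# Summarize DataNode status: live vs. dead nodes.
hdfs dfsadmin -report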

Thanks for the help.
