I am having trouble listing a directory on HDFS with PyArrow 0.15.1. PyArrow is installed inside an Ubuntu 18.04 Docker image, using Hadoop 3.2.1 and openjdk-8-jdk.
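For context, the environment inside the container is set up roughly like this before the session below (the paths are illustrative, not my exact ones; as far as I understand, pyarrow 0.15 can also populate CLASSPATH itself by running `hadoop classpath --glob` if it is unset):

import os
import subprocess

# Illustrative paths -- adjust to the actual locations inside the image.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["HADOOP_HOME"] = "/opt/hadoop-3.2.1"

# libhdfs needs the Hadoop jars on the JVM classpath; `hadoop classpath --glob`
# expands the wildcard entries into concrete jar paths.
os.environ["CLASSPATH"] = subprocess.check_output(
    [os.path.join(os.environ["HADOOP_HOME"], "bin", "hadoop"), "classpath", "--glob"]
).decode().strip()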
>>> import pyarrow as pa
>>> pa.__version__
'0.15.1'
>>> fs = pa.hdfs.connect(<ip>, <port>)
>>> fs.ls('/')
hdfsListDirectory(/): FileSystem#listStatus error:
ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetListingRequestProto cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Messagejava.lang.ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetListingRequestProto cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy9.getListing(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:674)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy10.getListing(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1647)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1631)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1048)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1112)
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1109)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1119)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/varun.patil/anaconda3/lib/python3.7/site-packages/pyarrow/hdfs.py", line 103, in ls
return super(HadoopFileSystem, self).ls(path, detail)
File "pyarrow/io-hdfs.pxi", line 272, in pyarrow.lib.HadoopFileSystem.ls
File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS list directory failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
I have set JAVA_HOME and HADOOP_HOME correctly.
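For reference, this is how I verify them inside the container (the expected values in the comments are illustrative, not my exact paths):

import os

print(os.environ["JAVA_HOME"])    # e.g. /usr/lib/jvm/java-8-openjdk-amd64
print(os.environ["HADOOP_HOME"])  # e.g. /opt/hadoop-3.2.1
print(len(os.environ.get("CLASSPATH", "")))  # non-zero, i.e. the Hadoop jars are listed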
The same code works fine with PyArrow 0.11.1, but I need to use 0.15.1.