No FileSystem for scheme: abfss - running pyspark standalone
Question
Trying to read a CSV file stored in Azure Data Lake Gen2 using standalone Spark, but getting java.io.IOException: No FileSystem for scheme: abfss.
Installed PySpark using pip install pyspark==3.0.3 and ran it with the following command, which includes the required dependencies:
pyspark --packages "org.apache.hadoop:hadoop-azure:3.0.3,org.apache.hadoop:hadoop-azure-datalake:3.0.3"
I found another answer here suggesting Spark 3.2+ with org.apache.spark:hadoop-cloud_2.12, but it didn't work either; I still get the same exception. The complete stack trace is pasted below:
>>> spark.read.csv("abfss://raw@teststorageaccount.dfs.core.windows.net/members.csv")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 737, in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
^^^^^^^^^^^
File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.csv.
: java.io.IOException: No FileSystem for scheme: abfss
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:795)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Answer 1
Score: 2
- You need to use a version of the hadoop-* JARs built in the last five years. ABFS support came in Hadoop 3.2.0 with HADOOP-15407 (Support Windows Azure Storage - Blob file system) and has not been backported, for the reason "if people can't be bothered to upgrade, why should we bother to backport?". Do make sure all the hadoop libraries are the same version, unless you want to see different stack traces. When integrating with Spark, you can package these versions in your app and look for a config mentioning "user classpath first"; a sketch of such a setup follows this answer.
- You need to define an environment variable SPARK_CONF_DIR pointing to a folder and add a core-site.xml file in that path, which defines the Hadoop Azure specifics such as the ABFS account details: https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html. Make note of the fs.azure.always.use.https config option for using abfss; a programmatic equivalent is sketched below as well.
Also, your error says you're actually using Spark 3.1.2, so your package versions need to match that.
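As an illustration of the first point, here is a minimal sketch of starting a standalone PySpark session with an ABFS-capable connector. It assumes a Spark build that bundles Hadoop 3.2+, so hadoop-azure 3.2.0 is used below; match the coordinate to whatever Hadoop version your distribution actually ships.
# Minimal sketch, assuming a Spark build bundling Hadoop 3.2+.
# hadoop-azure 3.2.0 is the first release with the ABFS connector
# (HADOOP-15407); --packages-style resolution pulls its transitive
# dependencies as well.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("abfss-read")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.2.0")
    # The "user classpath first" knobs mentioned above; these matter when
    # you ship your own copies of the hadoop jars inside the application.
    .config("spark.driver.userClassPathFirst", "true")
    .config("spark.executor.userClassPathFirst", "true")
    .getOrCreate()
)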
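For the second point, the same properties a core-site.xml would hold can also be set programmatically on the running session. A sketch continuing from the session above, using the storage account name from the question (teststorageaccount) and a placeholder access key; SharedKey is only one of the auth mechanisms the ABFS docs describe:
# Equivalent of core-site.xml entries, set on the session's Hadoop
# configuration (_jsc is PySpark's handle to the JVM SparkContext).
# The account name comes from the question; the key is a placeholder.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set(
    "fs.azure.account.key.teststorageaccount.dfs.core.windows.net",
    "<storage-account-access-key>",
)

df = spark.read.csv("abfss://raw@teststorageaccount.dfs.core.windows.net/members.csv")
df.show()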