No FileSystem for scheme: abfss – running pyspark standalone.

Question

I'm trying to read a CSV file stored in Azure Data Lake Gen2 using standalone Spark, but I get java.io.IOException: No FileSystem for scheme: abfss

I installed PySpark with pip install pyspark==3.0.3 and launch it with the following command, which includes the required dependencies:

pyspark --packages "org.apache.hadoop:hadoop-azure:3.0.3,org.apache.hadoop:hadoop-azure-datalake:3.0.3"

I found another answer here suggesting Spark 3.2+ with org.apache.spark:hadoop-cloud_2.12, but that didn't work either; I still get the same exception. The complete stack trace is pasted below:

>>> spark.read.csv("abfss://raw@teststorageaccount.dfs.core.windows.net/members.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 737, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.csv.
: java.io.IOException: No FileSystem for scheme: abfss
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:795)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Answer 1

Score: 2

  1. You need hadoop-* JARs built in the last five years. ABFS support arrived in Hadoop 3.2.0 with HADOOP-15407 ("Support Windows Azure Storage - Blob file system in Hadoop") and has not been backported, on the reasoning that "if people can't be bothered to upgrade, why should we bother to backport?". Make sure all of the Hadoop libraries are the same version, unless you want to see different stack traces. When integrating with Spark, you can package these versions in your app and look for a config option mentioning "user classpath first".
  2. You need to point the environment variable SPARK_CONF_DIR at a folder and add a core-site.xml file in that folder which defines the Hadoop Azure specifics, such as the ABFS account details.

https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
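As a rough illustration of item 2, the sketch below writes a minimal core-site.xml into a folder and points SPARK_CONF_DIR at it. The helper name write_core_site, the temporary folder, and the placeholder key are assumptions made for this example; the account name is taken from the URI in the question, and shared-key authentication via fs.azure.account.key.<account>.dfs.core.windows.net is only one of the options described in the linked documentation.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_core_site(conf_dir, properties):
    """Write a Hadoop-style core-site.xml holding the given name/value pairs."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    os.makedirs(conf_dir, exist_ok=True)
    path = os.path.join(conf_dir, "core-site.xml")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path


# Assumptions: account name copied from the question's URI; the key is a
# placeholder and a real one should never be committed to source control.
ACCOUNT = "teststorageaccount"
conf_dir = os.path.join(tempfile.mkdtemp(), "spark-conf")  # hypothetical folder
core_site_path = write_core_site(
    conf_dir,
    {f"fs.azure.account.key.{ACCOUNT}.dfs.core.windows.net": "<storage-account-key>"},
)

# Only processes started after this point (for example a pyspark shell
# launched from this script) inherit the variable.
os.environ["SPARK_CONF_DIR"] = conf_dir
print(core_site_path)
```

Setting the variable from Python only helps if Spark is started afterwards from the same process; exporting SPARK_CONF_DIR in the shell before running pyspark is the more usual route.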

Note the fs.azure.always.use.https config option when using abfss.
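To make the scheme distinction concrete: abfss is the TLS variant of abfs, and both use the form scheme://container@account-host/path. The sketch below only splits such a URI with the standard library so the scheme can be checked before handing the path to Spark. The function name parse_abfs_uri and the returned keys are made up for this example and are not part of any Hadoop or Spark API.

```python
from urllib.parse import urlsplit


def parse_abfs_uri(uri):
    """Split an abfs:// or abfss:// URI into its parts (illustrative helper)."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri!r}")
    return {
        "secure": parts.scheme == "abfss",  # abfss means the TLS variant
        "container": parts.username,        # the part before '@'
        "account_host": parts.hostname,     # <account>.dfs.core.windows.net
        "path": parts.path,
    }


# URI copied from the question.
info = parse_abfs_uri(
    "abfss://raw@teststorageaccount.dfs.core.windows.net/members.csv"
)
print(info)
```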

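Also, the paths in your stack trace show that you're running Spark 3.1.2 (spark-3.1.2-bin-hadoop2.7), not the 3.0.3 you installed with pip, so your package versions need to match that installation.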
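The version advice above can be turned into a quick sanity check. This is a minimal sketch, assuming the Spark distribution folder name carries a Hadoop tag such as hadoop2.7 (as in the stack trace); the helper names are hypothetical, and the 3.2.0 threshold is the one stated in this answer. It does not inspect the JARs actually on the classpath.

```python
import re

# From the answer: ABFS support arrived in Hadoop 3.2.0 (HADOOP-15407).
ABFS_MIN_HADOOP = (3, 2, 0)


def hadoop_version_from_dist(dist_name):
    """Extract the Hadoop tag from a name like 'spark-3.1.2-bin-hadoop2.7'."""
    match = re.search(r"hadoop(\d+(?:\.\d+)*)", dist_name)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def supports_abfs(hadoop_version):
    """Return True if the given (major, minor[, patch]) tuple is >= 3.2.0."""
    if len(hadoop_version) < 2:
        # A bare major number (e.g. 'hadoop3') is not enough to decide.
        raise ValueError("need at least major and minor version")
    padded = (tuple(hadoop_version) + (0, 0, 0))[:3]
    return padded >= ABFS_MIN_HADOOP


def azure_package(hadoop_version_string):
    """Build the Maven coordinate for hadoop-azure at a given Hadoop version."""
    return f"org.apache.hadoop:hadoop-azure:{hadoop_version_string}"


# Folder name copied from the stack trace in the question.
bundled = hadoop_version_from_dist("spark-3.1.2-bin-hadoop2.7")
print(bundled, supports_abfs(bundled))
print(azure_package("3.2.0"))
```

If the check fails, passing a newer hadoop-azure through --packages on top of older bundled Hadoop JARs mixes versions, which is exactly what item 1 warns against; a Spark build that bundles Hadoop 3.2 or later avoids that.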

Posted by huangapple on June 29, 2023 at 15:31:50
When reposting, please keep the link to this article: https://go.coder-hub.com/76578907.html