No FileSystem for scheme: abfss - running pyspark standalone

Question

Trying to read a CSV file stored in Azure Data Lake Gen2 using standalone Spark, but getting `java.io.IOException: No FileSystem for scheme: abfss`.

Installed PySpark with `pip install pyspark==3.0.3` and launched it with the following command, including the required dependencies:

pyspark --packages "org.apache.hadoop:hadoop-azure:3.0.3,org.apache.hadoop:hadoop-azure-datalake:3.0.3"

I found another answer here suggesting Spark 3.2+ with org.apache.spark:hadoop-cloud_2.12, but that didn't work either; I still get the same exception. The complete stack trace is pasted below:

>>> spark.read.csv("abfss://raw@teststorageaccount.dfs.core.windows.net/members.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 737, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/Users/dev/binaries/spark-3.1.2-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.csv.
: java.io.IOException: No FileSystem for scheme: abfss
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:795)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Answer 1

Score: 2

  1. You need to use a version of the hadoop-* JARs built in the last five years. ABFS support arrived in Hadoop 3.2.0 with HADOOP-15407 ("Support Windows Azure Storage - Blob file system in Hadoop") and has not been backported, the stated reason being "if people can't be bothered to upgrade, why should we bother to backport?". Make sure all hadoop library versions are identical, unless you want to see different stack traces. When integrating with Spark, you can package these versions in your app and look for a config mentioning "user classpath first".
  2. You need to define an environment variable SPARK_CONF_DIR pointing to a folder, and add a core-site.xml file in that path that defines the Hadoop Azure specifics, such as the ABFS account details.

https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html
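A minimal core-site.xml sketch for point 2 above, assuming shared-key authentication and reusing the storage account name from the question (`teststorageaccount`) for illustration; the placeholder key and the choice of auth type are assumptions — OAuth and SAS are also supported, see the linked docs:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Shared-key auth for the example account; replace the placeholder
       value with the real storage account access key. -->
  <property>
    <name>fs.azure.account.auth.type.teststorageaccount.dfs.core.windows.net</name>
    <value>SharedKey</value>
  </property>
  <property>
    <name>fs.azure.account.key.teststorageaccount.dfs.core.windows.net</name>
    <value>YOUR_STORAGE_ACCOUNT_KEY</value>
  </property>
</configuration>
```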

Make note of the fs.azure.always.use.https config option when using abfss.

Also, your stack trace shows you're actually running Spark 3.1.2 (built against Hadoop 2.7), not the 3.0.3 you installed via pip, so your package versions need to match that.
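Putting the two points together, a launch sketch might look like the following — assuming a Spark distribution bundled with Hadoop 3.3.x (the hadoop-azure version must match the bundled Hadoop version; 3.3.1 and the conf path here are illustrative assumptions):

```shell
# Point SPARK_CONF_DIR at the folder containing core-site.xml
# with the ABFS account details.
export SPARK_CONF_DIR=/path/to/conf

# Pull hadoop-azure at the same version as the Hadoop
# bundled with this Spark distribution.
pyspark --packages org.apache.hadoop:hadoop-azure:3.3.1
```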

huangapple
  • Published on 2023-06-29 15:31:50
  • Please retain this link when republishing: https://go.coder-hub.com/76578907.html