No FileSystem for scheme "az" error when trying to read csv from ADLS Gen2 using PySpark

Question

import pandas as pd
import pyspark.pandas as ps

I am trying to use the pandas API on Spark (pyspark.pandas) to compare the performance of two similar scripts, one using pandas and one using PySpark through the pandas interface. However, I am having trouble loading my data into PySpark from our ADLS Gen2 storage.

When I run the following code, it works as expected:

df_pandas = pd.read_csv("az://container/path/to/file.csv", sep=';', dtype=str)

However, when I run the same call through the pandas API on Spark:

df_spark = ps.read_csv("az://container/path/to/file.csv", sep=';', dtype=str)

the following error is thrown:

Py4JJavaError: An error occurred while calling o1840.load.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "az"

I have looked online and found others with similar problems on AWS, but I'm not sure how to solve it for Azure. I tried replacing az with abfs, but then I get this error:

An error occurred while calling o1852.load.
: abfs://container/path/to/file.csv has invalid authority.

By the way, I'm running these from Azure Synapse notebooks.

Answer 1

Score: 1

I reproduced the same setup in my environment and got this output.

> Reading CSV files from ADLS Gen2.

Code:

import pandas
df = pandas.read_csv('abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<file_path>', storage_options={'account_key': 'account_key_value'})

Output:

[Output screenshot not preserved]

For more information, refer to link1 and link2.
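
For the pyspark.pandas side of the question, a likely explanation is that `az://` is an fsspec/adlfs convention that pandas understands, while pyspark.pandas passes the path to Spark's Hadoop file system layer, which expects the ABFS form `abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`. The "invalid authority" error is consistent with the storage account part missing from the URI. Below is a minimal sketch, not a verified fix for this exact workspace: the helper names `az_to_abfss` and `read_csv_on_spark` and the account name `myaccount` are hypothetical, and it assumes the Spark pool already has access to the storage account.

```python
from urllib.parse import urlsplit


def az_to_abfss(url: str, account: str) -> str:
    """Rewrite an az:// or abfs:// style URL into a fully qualified abfss:// URI.

    `account` is the storage account name (an assumption; it is not part of
    the az:// URL and must be supplied by the caller).
    """
    parts = urlsplit(url)
    if parts.scheme not in ("az", "abfs", "abfss") or not parts.netloc:
        raise ValueError(f"not an Azure storage URL: {url!r}")
    # The container is the part of the authority before any '@'
    container = parts.netloc.split("@")[0]
    return f"abfss://{container}@{account}.dfs.core.windows.net{parts.path}"


def read_csv_on_spark(url: str, account: str):
    """Read a semicolon-separated CSV through the pandas API on Spark."""
    # Imported lazily so the URI helper can be used without a Spark runtime
    import pyspark.pandas as ps

    return ps.read_csv(az_to_abfss(url, account), sep=";", dtype=str)


if __name__ == "__main__":
    # Only builds the URI; no Spark session or storage access is needed here
    print(az_to_abfss("az://container/path/to/file.csv", "myaccount"))
```

In a notebook, the call would then look like `df_spark = read_csv_on_spark("az://container/path/to/file.csv", "myaccount")`, with the real storage account name in place of `myaccount`.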

Posted by huangapple on January 9, 2023, 19:34:53
When reposting, please keep the link to this article: https://go.coder-hub.com/75056695.html