Error while loading data from S3 using Spark

Question

I'm facing an error while loading data from S3 using Spark. First, here is my code:

# Loading packages and configuration options
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3,databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11 pyspark-shell --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true",spark.hadoop.fs.s3a.endpoint=s3.eu-west-1.amazonaws.com'

from pyspark import SparkContext
from pyspark.sql import SparkSession

# Creating SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession.builder \
    .appName('Image_P8') \
    .config('spark.driver.extraJavaOptions', '-Dio.netty.tryReflectionSetAccessible=true') \
    .config('spark.hadoop.fs.s3a.endpoint', 's3.eu-west-1.amazonaws.com') \
    .getOrCreate()

path = "s3a://ocr-fruits/Test/*"
image_df = spark.read.format("binaryFile") \
  .option("pathGlobFilter", "*.jpg") \
  .option("recursiveFileLookup", "true") \
  .load(path)

image_df.show(2)

However, when I try to load the image files using the spark.read.format("binaryFile") method, a NumberFormatException is thrown with the message "For input string: '64M'". Here is the corresponding traceback:

NumberFormatException                     Traceback (most recent call last)
Cell In[65], line 4
      1 image_df = spark.read.format("binaryFile") \
      2   .option("pathGlobFilter", "*.jpg") \
      3   .option("recursiveFileLookup", "true") \
----> 4   .load(path)
      6 image_df.show(2)

File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\pyspark\sql\readwriter.py:300, in DataFrameReader.load(self, path, format, schema, **options)
    298 self.options(**options)
    299 if isinstance(path, str):
--> 300     return self._df(self._jreader.load(path))
    301 elif path is not None:
    302     if type(path) != list:

File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File C:\apps\opt\spark-3.4.0-bin-hadoop3\python\pyspark\errors\exceptions\captured.py:175, in capture_sql_exception.<locals>.deco(*a, **kw)
    171 converted = convert_exception(e.java_exception)
    172 if not isinstance(converted, UnknownException):
    173     # Hide where the exception came from that shows a non-Pythonic
    174     # JVM exception message.
--> 175     raise converted from None
    176 else:
    177     raise

NumberFormatException: For input string: "64M"

I would greatly appreciate it if someone could help me understand the cause of this exception and find a solution for loading image files from S3 using Spark. Thank you in advance for your help!


Answer 1

Score: 1

This is related to the multipart block size: newer Hadoop releases ship a default of "64M" for fs.s3a.multipart.size, while the old hadoop-aws 2.7.x code parses that property as a plain long, hence the NumberFormatException.

Upgrade your hadoop-* JARs to a recent version (3.3.0, for example) and make sure the aws-sdk you're using is compatible with Hadoop.

The one you're using dates back to Nov 13, 2015.
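As a minimal sketch of what the upgraded submit arguments could look like (the coordinates are assumptions to verify: spark-3.4.0-bin-hadoop3 bundles Hadoop 3.3.4, and hadoop-aws 3.3.4 is built against aws-java-sdk-bundle 1.12.262; check the hadoop-aws POM for your exact release):

import os

# Assumed coordinates: hadoop-aws should match the Hadoop version bundled
# with your Spark build (3.3.4 for spark-3.4.0-bin-hadoop3); it also pulls in
# its matching aws-java-sdk-bundle transitively.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.hadoop:hadoop-aws:3.3.4,'
    'com.amazonaws:aws-java-sdk-bundle:1.12.262 '
    'pyspark-shell'
)

Note that PYSPARK_SUBMIT_ARGS is only read when the JVM gateway starts, so this must run before the first SparkContext is created.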

Also try adjusting this configuration when building the Spark session:

"spark.hadoop.fs.s3a.multipart.size"
