How to read and write from/to S3 using Spark 3.0.0?

Question

I'm trying to launch a Spark application that should be able to read from and write to S3, using the Spark Operator on Kubernetes and PySpark version 3.0.0. The Spark Operator is working nicely, but I soon realized that the launched application can't read files from S3 properly.

This command:

spark.read.json("s3a://bucket/path/to/data.json")

is throwing this exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o58.json.
java.lang.RuntimeException: java.lang.ClassNotFoundException: 
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

I first tried this using gcr.io/spark-operator/spark-py:v3.0.0 as the Spark image, and then tried adding some .jar files to it, with no success:

ADD https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.9/httpclient-4.5.9.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar $SPARK_HOME/jars

Here's my Spark conf:

    "spark.hadoop.fs.s3a.endpoint": "S3A_ENDPOINT"
    "spark.hadoop.fs.s3a.access.key": "ACCESS_KEY"
    "spark.hadoop.fs.s3a.secret.key": "SECRET_KEY"
    "spark.hadoop.fs.s3a.connection.ssl.enabled": "false"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.driver.extraClassPath": "/opt/spark/jars/*"
    "spark.executor.extraClassPath": "/opt/spark/jars/*"

And my $SPARK_HOME is /opt/spark.

Is anyone able to read/write from S3 using Spark 3.0.0? Is this an issue with PySpark exclusively? How can I "fix" this?
Thanks in advance!

Answer 1

Score: 3

I figured out how to do it:
Here's a fork with the changes I made to the base Docker image (just a few changes):

https://github.com/Coqueiro/spark/tree/branch-3.0-s3

I created a Makefile to aid distribution creation, but I basically just followed the official doc:

http://spark.apache.org/docs/latest/building-spark.html

Also, here’s the image, already built and pushed to Docker Hub:
https://hub.docker.com/repository/docker/coqueirotree/spark-py

It has Spark 3.0.0, Hadoop 3.2.0, S3A and Kubernetes support.
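
For reference, here is a minimal PySpark sketch of the kind of read/write this image is meant to support; the endpoint, credentials, and bucket path are the placeholders from the question, not tested values:

    from pyspark.sql import SparkSession

    # Sketch: with the hadoop-aws / AWS SDK jars already baked into the image,
    # only the S3A connection settings need to be supplied at runtime.
    spark = (
        SparkSession.builder
        .appName("s3a-smoke-test")
        .config("spark.hadoop.fs.s3a.endpoint", "S3A_ENDPOINT")          # placeholder
        .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")          # placeholder
        .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")          # placeholder
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
        .getOrCreate()
    )

    # Read from and write back to S3 via the s3a:// scheme.
    df = spark.read.json("s3a://bucket/path/to/data.json")
    df.write.mode("overwrite").parquet("s3a://bucket/path/to/output/")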

Answer 2

Score: 0

Have you tried using the Spark jars that come with pre-built Hadoop libraries (https://spark.apache.org/downloads.html)? You can also add the Hadoop dependencies to your classpath.
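
As a sketch of that approach: rather than baking the jars into the image, the hadoop-aws artifact (which pulls in the matching aws-java-sdk-bundle) can be fetched at startup via spark.jars.packages. The version below assumes a Spark build bundled with Hadoop 3.2.x; hadoop-aws must match whatever Hadoop version your Spark distribution actually ships.

    from pyspark.sql import SparkSession

    # Sketch: spark.jars.packages must be set before the session (and its JVM)
    # starts; hadoop-aws:3.2.0 is assumed to match the bundled Hadoop version.
    spark = (
        SparkSession.builder
        .appName("s3a-via-packages")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
        .config("spark.hadoop.fs.s3a.endpoint", "S3A_ENDPOINT")    # placeholder
        .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")    # placeholder
        .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")    # placeholder
        .getOrCreate()
    )

    df = spark.read.json("s3a://bucket/path/to/data.json")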
