PySpark with a custom container on GCP Dataproc Serverless: accessing a class in the custom container image


Question

I'm trying to start a PySpark job on GCP Dataproc Serverless with a custom container, but when I try to access the main file in my custom image, I get the following exception:

Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error '/var/dataproc/tmp/srvls-batch-10bc1778-798f-4477-b0ea-e8440770784f (Is a directory)'. Please specify one with --class.

To reproduce this exception, I created a minimal "Hello World" example and a basic image. My image is hosted on Google Container Registry, and here is its Dockerfile:

# Base image
FROM centos:7

# Copy the Python source code
COPY helloword.py helloword.py

# Useful tools
RUN yum install -y curl wget procps

# Versions
ENV TINI_VERSION=v0.19.0

# Install tini
RUN curl -fL "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini" -o /usr/bin/tini \
    && chmod +x /usr/bin/tini

# Create the 'spark' group/user.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark

And here is the command used to submit the job:

gcloud dataproc batches submit pyspark --batch name_batch file://helloword.py \
  --project name_project \
  --region europe-west9 \
  --version 1.1.19 \
  --container-image "eu.gcr.io/name_project/image-test" \
  --subnet default \
  --service-account service_account

Does anyone know how I can access my helloword.py?

Thanks in advance for your help.


Answer 1

Score: 0

You are seeing this error because the path file://helloword.py is relative to the Spark working directory, whereas you copied the file into the Docker image's working directory (/ by default) in your container.

To fix this issue, reference the file using an absolute path: file:///helloword.py
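With that change applied, the submit command from the question would look as follows (a sketch only: the batch, project, and service-account names are the same placeholders used in the question, not real resources):

```shell
# Same command as in the question, but with an absolute file:// URI so Spark
# resolves the driver script from the image's filesystem root (/helloword.py,
# where the Dockerfile's COPY placed it) instead of the Spark working directory.
gcloud dataproc batches submit pyspark --batch name_batch file:///helloword.py \
  --project name_project \
  --region europe-west9 \
  --version 1.1.19 \
  --container-image "eu.gcr.io/name_project/image-test" \
  --subnet default \
  --service-account service_account
```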


huangapple
  • Published on 2023-06-16 15:11:07
  • Please retain this link when reposting: https://go.coder-hub.com/76487745.html