PySpark with custom container on GCP Dataproc Serverless: accessing a class in a custom container image
Question
I'm trying to start a PySpark job on GCP Dataproc Serverless with a custom container, but when I try to access my main class in my custom image, I get this exception:
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error '/var/dataproc/tmp/srvls-batch-10bc1778-798f-4477-b0ea-e8440770784f (Is a directory)'. Please specify one with --class.
To reproduce this exception, I just made a "Hello World" example and a basic image. My image is hosted on Google Container Registry, and here are its contents:
# Base image
FROM centos:7
# Copy the Python source code
COPY helloword.py helloword.py
# Useful tools
RUN yum install -y curl wget procps
# Versions
ENV TINI_VERSION=v0.19.0
# Install tini
RUN curl -fL "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini" -o /usr/bin/tini \
&& chmod +x /usr/bin/tini
# Create the 'spark' group/user.
RUN groupadd -g 1099 spark
RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
USER spark
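helloword.py itself is trivial; a minimal sketch of what it could contain (the exact contents don't matter, any hello-world PySpark script reproduces the error):
# helloword.py - minimal "Hello World" sketch (assumed contents)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("helloword").getOrCreate()
print("Hello World")
spark.stop()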
and this is the command line associated with the job:
gcloud dataproc batches submit pyspark --batch name_batch file://helloword.py \
--project name_project \
--region europe-west9 \
--version 1.1.19 \
--container-image "eu.gcr.io/name_project/image-test" \
--subnet default \
--service-account service_account
Do you know how I can access my helloword.py?
Thanks in advance.
Answer 1
Score: 0
You are seeing this error because the file://helloword.py path is relative to the Spark working directory, but you have copied this file to the Docker working directory (/ by default) in your container.
To fix this issue, you need to reference this file using an absolute path: file:///helloword.py
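For example, the submit command from the question then becomes:
gcloud dataproc batches submit pyspark --batch name_batch file:///helloword.py \
--project name_project \
--region europe-west9 \
--version 1.1.19 \
--container-image "eu.gcr.io/name_project/image-test" \
--subnet default \
--service-account service_account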