Pyspark with custom container on GCP Dataproc Serverless: access to class in custom container image

Question

I'm trying to start a Pyspark job on GCP Dataproc Serverless with a custom container, but when I try to access my main class in my custom image, I get the following exception:

    Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error '/var/dataproc/tmp/srvls-batch-10bc1778-798f-4477-b0ea-e8440770784f (Is a directory)'. Please specify one with --class.

To replicate this exception, I made a simple "Hello World" example and a basic image. My image is hosted on Google Container Registry, and here are its contents:

    # Base image
    FROM centos:7
    # Copy the Python source code
    COPY helloword.py helloword.py
    # Useful tools
    RUN yum install -y curl wget procps
    # Versions
    ENV TINI_VERSION=v0.19.0
    # Install tini
    RUN curl -fL "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini" -o /usr/bin/tini \
        && chmod +x /usr/bin/tini
    # Create the 'spark' group/user.
    RUN groupadd -g 1099 spark
    RUN useradd -u 1099 -g 1099 -d /home/spark -m spark
    USER spark
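For context, the contents of helloword.py are not shown in the question. A minimal stand-in (purely an assumption; any script that prints something is enough to verify that the batch starts) could be:

```python
# Hypothetical helloword.py -- the question never shows the real file.
# spark-submit executes this like an ordinary Python script, so a job
# this small does not strictly need to create a SparkSession.

def main():
    # Return the greeting so the logic is easy to check in isolation.
    return "Hello World"

if __name__ == "__main__":
    print(main())
```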

and this is the command line used to submit the job:

    gcloud dataproc batches submit pyspark --batch name_batch file://helloword.py \
        --project name_project \
        --region europe-west9 \
        --version 1.1.19 \
        --container-image "eu.gcr.io/name_project/image-test" \
        --subnet default \
        --service-account service_account

Do you know how I can access my helloword.py?

Thanks in advance.

Answer 1

Score: 0

You are seeing this error because the file://helloword.py path is relative to the Spark working directory, but you copied this file to the Docker working directory (/ by default) in your container.

To fix this issue, you need to reference the file using an absolute path: file:///helloword.py
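The distinction between the two forms can be seen with generic URI parsing. This sketch uses Python's standard urllib as an illustration, not Spark's internal path-resolution code: in a URI, the two slashes after the scheme introduce an authority (host) component, so in file://helloword.py the name "helloword.py" is parsed as a host and the path is empty, while file:///helloword.py yields an empty host and the absolute path /helloword.py.

```python
from urllib.parse import urlparse

# Two slashes: "helloword.py" lands in the authority slot, path is empty.
two_slashes = urlparse("file://helloword.py")
print(two_slashes.netloc)   # helloword.py
print(two_slashes.path)     # (empty string)

# Three slashes: empty authority, absolute path /helloword.py.
three_slashes = urlparse("file:///helloword.py")
print(three_slashes.path)   # /helloword.py
```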

huangapple
  • Published on 2023-06-16 15:11:07
  • Please retain this link when reposting: https://go.coder-hub.com/76487745.html