Force Dataflow workers to use Python 3?

Question

I have a simple batch Apache Beam pipeline. When run locally with DirectRunner it works fine, but with DataflowRunner it fails to install one dependency from requirements.txt. The reason is that the specific package is Python 3 only, and the workers are (apparently) running the pipeline with Python 2.

The pipeline is finished and works fine locally (DirectRunner) with Python 3.7.6. I'm using the latest Apache Beam SDK (apache-beam==2.16.0 in my requirements.txt).

One of the modules required by my pipeline is:
from lbcapi3 import api

So my requirements.txt sent to GCP has a line with:
lbcapi3==1.0.0

That module (lbcapi3) is on PyPI, but it only targets Python 3.x. When I run the pipeline in Dataflow I get:

ERROR: Could not find a version that satisfies the requirement lbcapi3==1.0.0 (from -r requirements.txt (line 27)) (from versions: none)
ERROR: No matching distribution found for lbcapi3==1.0.0 (from -r requirements.txt (line 27))

That makes me think the Dataflow workers are using Python 2.x to install the dependencies from requirements.txt.

Is there a way to specify the Python version to be used by a Google Dataflow pipeline (the workers)?

I tried adding this as the first line of my file api-etl.py, but it didn't work:

#!/usr/bin/env python3

Thanks!

Answer 1

Score: 1

Follow the instructions in the quickstart to get up and running with your pipeline. When installing the Apache Beam SDK, make sure to install version 2.16 (the first version that officially supports Python 3). Please check your version.

You can use the Apache Beam SDK with Python versions 3.5, 3.6, or 3.7 if you are keen to migrate from Python 2.x environments.
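
As a quick sanity check before submitting, you can print the interpreter and SDK versions from the environment that will launch the job. This is just a minimal sketch; the point is that the Dataflow workers follow the Python major/minor version of the interpreter that submits the pipeline:

    import sys
    import apache_beam as beam

    # The workers run an SDK matching the submitting interpreter's Python
    # version, so verify both the interpreter and the Beam SDK locally.
    print('Python:', sys.version.split()[0])   # expect 3.5.x, 3.6.x, or 3.7.x
    print('Beam SDK:', beam.__version__)       # expect 2.16.0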

For more information, refer to this documentation. Also, take a look at the preinstalled dependencies.

Edit, after additional information was provided:

I have reproduced the problem on Dataflow. I see two solutions.

  1. You can use the --extra_package option, which allows you to stage local packages in an accessible way. Instead of listing the local package in requirements.txt, create a tarball of it (e.g. my_package.tar.gz) and use --extra_package to stage it.

Clone the repository from Github:

$ git clone https://github.com/6ones/lbcapi3.git
$ cd lbcapi3/

Build the tarball with the following command:

$ python setup.py sdist

The last few lines will look like this:

Writing lbcapi3-1.0.0/setup.cfg
creating dist
Creating tar archive
removing 'lbcapi3-1.0.0' (and everything under it)

Then, run your pipeline with the following command-line option:

 --extra_package /path/to/package/package-name

In my case:

--extra_package /home/user/dataflow-prediction-example/lbcapi3/dist/lbcapi3-1.0.0.tar.gz

Make sure that all of the required options are provided in the command (job_name, project, runner, staging_location, temp_location):

python prediction/run.py --runner DataflowRunner --project $PROJECT --staging_location $BUCKET/staging --temp_location $BUCKET/temp --job_name $PROJECT-prediction-cs --setup_file prediction/setup.py --model $BUCKET/model --source cs --input $BUCKET/input/images.txt --output $BUCKET/output/predict --extra_package /home/user/dataflow-prediction-example/lbcapi3/dist/lbcapi3-1.0.0.tar.gz

The error you faced should disappear.
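
If you build your PipelineOptions in code rather than entirely on the command line, the same flag can be passed programmatically. This is only a sketch, with placeholder project, bucket, and tarball paths:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholders (project, bucket, tarball path) must be replaced with
    # your own values; --extra_package points at the locally built sdist.
    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=your-project',
        '--staging_location=gs://your-bucket/staging',
        '--temp_location=gs://your-bucket/temp',
        '--job_name=api-etl-job',
        '--extra_package=/path/to/lbcapi3-1.0.0.tar.gz',
    ])

    with beam.Pipeline(options=options) as p:
        ...  # your transforms go here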

  2. The second solution is to list the additional libraries that your app uses in a setup.py file; refer to the documentation.

Create a setup.py file for your project:

    import setuptools

    setuptools.setup(
        name='PACKAGE-NAME',
        version='PACKAGE-VERSION',
        install_requires=[],
        packages=setuptools.find_packages(),
    )

You can get rid of the requirements.txt file and instead add all of the packages it contained to the install_requires field of the setup() call.
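
For this particular question, a minimal sketch of such a setup.py could look like the following; the package name and version are made up for illustration, and install_requires should carry whatever your real requirements.txt listed:

    import setuptools

    setuptools.setup(
        name='api-etl',                  # hypothetical name for this pipeline
        version='0.0.1',
        install_requires=[
            'lbcapi3==1.0.0',            # the Python 3-only dependency from the question
        ],
        packages=setuptools.find_packages(),
    )

The pipeline is then launched with --setup_file /path/to/setup.py, as in the command above, so that Dataflow installs these packages on the workers.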

Answer 2

Score: 0

The simple answer is that when deploying your pipeline, you need to make sure your local environment is running Python 3.5, 3.6, or 3.7. If it is, the Dataflow workers will use the same Python version once your job is launched.
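
One way to make that requirement explicit is a small guard at the top of the pipeline file; this is a hypothetical addition, not something the question's api-etl.py already contains:

    import sys

    # The Dataflow workers follow the submitting interpreter's Python
    # version, so refuse to submit from anything older than Python 3.5.
    if sys.version_info < (3, 5):
        raise RuntimeError(
            'Run this pipeline with Python 3.5+ (e.g. python3 api-etl.py ...), '
            'found Python %d.%d' % sys.version_info[:2])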
