Force Dataflow workers to use Python 3?
Question
I have a simple batch Apache Beam pipeline. When run locally (DirectRunner) it works fine, but with DataflowRunner it fails to install one dependency from requirements.txt. The reason is that the specific package is Python 3 only, and the workers are (apparently) running the pipeline with Python 2.
The pipeline is done and working fine locally (DirectRunner) with Python 3.7.6. I'm using the latest Apache Beam SDK (apache-beam==2.16.0 in my requirements.txt).
One of the modules required by my pipeline is:
from lbcapi3 import api
So my requirements.txt sent to GCP has a line with:
lbcapi3==1.0.0
That module (lbcapi3) is on PyPI, but it's only targeted at Python 3.x. When I run the pipeline in Dataflow I get:
ERROR: Could not find a version that satisfies the requirement lbcapi3==1.0.0 (from -r requirements.txt (line 27)) (from versions: none)
ERROR: No matching distribution found for lbcapi3==1.0.0 (from -r requirements.txt (line 27))
That makes me think that the Dataflow workers are using Python 2.x to install the dependencies in requirements.txt.
Is there a way to specify the Python version to be used by a Google Dataflow pipeline (the workers)?
I tried adding this as the first line of my file api-etl.py, but it didn't work:
#!/usr/bin/env python3
Thanks!
Answer 1
Score: 1
Follow the instructions in the quickstart to get up and running with your pipeline. When installing the Apache Beam SDK, make sure to install version 2.16 (since this is the first version that officially supports Python 3). Please check your version.
You can use the Apache Beam SDK with Python versions 3.5, 3.6, or 3.7 if you are keen to migrate from Python 2.x environments.
For more information, refer to this documentation. Also, take a look at the preinstalled dependencies.
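A quick way to confirm which interpreter and Beam SDK version your local environment is actually using (a minimal sketch; run it in the same environment you launch the pipeline from):
$ python3 --version
$ python3 -c "import apache_beam; print(apache_beam.__version__)"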
Edited after additional information was provided:
I have reproduced the problem on Dataflow. I see two solutions.
- You can use the --extra_package option, which allows staging local packages in an accessible way. Instead of listing the local package in requirements.txt, create a tarball of the local package (e.g. my_package.tar.gz) and use the --extra_package option to stage it.
Clone the repository from GitHub:
$ git clone https://github.com/6ones/lbcapi3.git
$ cd lbcapi3/
Build the tarball with the following command:
$ python setup.py sdist
The last few lines will look like this:
Writing lbcapi3-1.0.0/setup.cfg
creating dist
Creating tar archive
removing 'lbcapi3-1.0.0' (and everything under it)
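The built tarball should then appear under dist/ (a hedged sketch of what you would see; the exact file name depends on the package version):
$ ls dist/
lbcapi3-1.0.0.tar.gz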
Then, run your pipeline with the following command-line option:
--extra_package /path/to/package/package-name
In my case:
--extra_package /home/user/dataflow-prediction-example/lbcapi3/dist/lbcapi3-1.0.0.tar.gz
Make sure that all of the required options are provided in the command (job_name, project, runner, staging_location, temp_location):
python prediction/run.py --runner DataflowRunner --project $PROJECT --staging_location $BUCKET/staging --temp_location $BUCKET/temp --job_name $PROJECT-prediction-cs --setup_file prediction/setup.py --model $BUCKET/model --source cs --input $BUCKET/input/images.txt --output $BUCKET/output/predict --extra_package /home/user/dataflow-prediction-example/lbcapi3/dist/lbcapi3-1.0.0.tar.gz
The error you faced should disappear.
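If you prefer to configure these options in code rather than on the command line, here is a minimal sketch of the same idea using PipelineOptions. It assumes the programmatic option name extra_packages (the list form of the --extra_package flag); the project, bucket, and path values are placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All project, bucket, and path values below are placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    job_name='my-job',
    staging_location='gs://my-bucket/staging',
    temp_location='gs://my-bucket/temp',
    extra_packages=['/path/to/lbcapi3-1.0.0.tar.gz'],  # assumed list form of --extra_package
)

with beam.Pipeline(options=options) as p:
    ...  # your pipeline definition goes here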
- Second solution - list the additional libraries that your app is using in a setup.py file; refer to the documentation.
Create a setup.py file for your project:
import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    version='PACKAGE-VERSION',
    install_requires=[],
    packages=setuptools.find_packages(),
)
You can get rid of the requirements.txt file and instead add all packages contained in requirements.txt to the install_requires field of the setup call.
Answer 2
Score: 0
The simple answer is that when deploying your pipeline, you need to make sure your local environment is running Python 3.5, 3.6, or 3.7. If it is, the Dataflow workers will use the same version once your job is launched.
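For example, a minimal sketch of launching from a Python 3 environment (api-etl.py is the file name from the question; the Beam extra and the project/bucket values are placeholders):
$ python3 -m venv env && source env/bin/activate
$ pip install "apache-beam[gcp]==2.16.0"
$ python3 api-etl.py --runner DataflowRunner --project my-project --temp_location gs://my-bucket/temp --requirements_file requirements.txt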