Force Dataflow workers to use Python 3?
Question
I have a simple batch Apache Beam pipeline. When run locally (DirectRunner) it works fine, but with DataflowRunner it fails to install one dependency from requirements.txt. The reason is that the specific package is Python 3 only, and the workers are (apparently) running the pipeline with Python 2.
The pipeline is done and working fine locally (DirectRunner) with Python 3.7.6. I'm using the latest Apache Beam SDK (apache-beam==2.16.0 in my requirements.txt).
One of the modules required by my pipeline is:
from lbcapi3 import api
So my requirements.txt sent to GCP has a line with:
lbcapi3==1.0.0
That module (lbcapi3) is on PyPI, but it's only targeted at Python 3.x. When I run the pipeline in Dataflow I get:
ERROR: Could not find a version that satisfies the requirement lbcapi3==1.0.0 (from -r requirements.txt (line 27)) (from versions: none)
ERROR: No matching distribution found for lbcapi3==1.0.0 (from -r requirements.txt (line 27))
That makes me think that the Dataflow workers are using Python 2.x to install the dependencies in requirements.txt.
Is there a way to specify the Python version to be used by a Google Dataflow pipeline (the workers)?
I tried adding this as the first line of my file api-etl.py, but it didn't work:
#!/usr/bin/env python3
Thanks!
Answer 1
Score: 1
Follow the instructions in the quickstart to get up and running with your pipeline. When installing the Apache Beam SDK, make sure to install version 2.16 (since this is the first version that officially supports Python 3). Please check your version.
You can use the Apache Beam SDK with Python versions 3.5, 3.6, or 3.7 if you are keen to migrate from Python 2.x environments.
For more information, refer to this documentation. Also, take a look at the preinstalled dependencies.
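A quick way to confirm which interpreter and Beam SDK version your local environment is actually using (a minimal sketch; run it in the same environment you launch the pipeline from):
$ python3 --version
$ python3 -c "import apache_beam; print(apache_beam.__version__)"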
Edited after additional information was provided:
I have reproduced the problem on Dataflow. I see two solutions.
- You can use the --extra_package option, which allows staging local packages in an accessible way. Instead of listing the local package in requirements.txt, create a tarball of the local package (e.g. my_package.tar.gz) and use the --extra_package option to stage it.
Clone the repository from GitHub:
$ git clone https://github.com/6ones/lbcapi3.git
$ cd lbcapi3/
Build the tarball with the following command:
$ python setup.py sdist
The last few lines will look like this:
Writing lbcapi3-1.0.0/setup.cfg
creating dist
Creating tar archive
removing 'lbcapi3-1.0.0' (and everything under it)
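The built tarball should then appear under dist/ (a hedged sketch of what you would see; the exact file name depends on the package version):
$ ls dist/
lbcapi3-1.0.0.tar.gz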
Then, run your pipeline with the following command-line option:
--extra_package /path/to/package/package-name
In my case:
--extra_package /home/user/dataflow-prediction-example/lbcapi3/dist/lbcapi3-1.0.0.tar.gz
Make sure that all of the required options are provided in the command (job_name, project, runner, staging_location, temp_location):
python prediction/run.py --runner DataflowRunner --project $PROJECT --staging_location $BUCKET/staging --temp_location $BUCKET/temp --job_name $PROJECT-prediction-cs --setup_file prediction/setup.py --model $BUCKET/model --source cs --input $BUCKET/input/images.txt --output $BUCKET/output/predict --extra_package /home/user/dataflow-prediction-example/lbcapi3/dist/lbcapi3-1.0.0.tar.gz
The error you faced should disappear.
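If you prefer to configure these options in code rather than on the command line, here is a minimal sketch of the same idea using PipelineOptions. It assumes the programmatic option name extra_packages (the list form of the --extra_package flag); the project, bucket, and path values are placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All project, bucket, and path values below are placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    job_name='my-job',
    staging_location='gs://my-bucket/staging',
    temp_location='gs://my-bucket/temp',
    extra_packages=['/path/to/lbcapi3-1.0.0.tar.gz'],  # assumed list form of --extra_package
)

with beam.Pipeline(options=options) as p:
    ...  # your pipeline definition goes here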
- Second solution - list the additional libraries that your app is using in a setup.py file; refer to the documentation.
Create a setup.py file for your project:
import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    version='PACKAGE-VERSION',
    install_requires=[],
    packages=setuptools.find_packages(),
)
You can get rid of the requirements.txt file and instead add all packages contained in requirements.txt to the install_requires field of the setup call.
Answer 2
Score: 0
The simple answer is that when deploying your pipeline, you need to make sure your local environment is running Python 3.5, 3.6, or 3.7. If it is, the Dataflow workers will use the same version once your job is launched.
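For example, a minimal sketch of launching from a Python 3 environment (api-etl.py is the file name from the question; the Beam extra and the project/bucket values are placeholders):
$ python3 -m venv env && source env/bin/activate
$ pip install "apache-beam[gcp]==2.16.0"
$ python3 api-etl.py --runner DataflowRunner --project my-project --temp_location gs://my-bucket/temp --requirements_file requirements.txt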