How to install private repository on Dataflow Worker?
<details>
<summary>English:</summary>
We're running into issues when deploying Dataflow jobs.
### The error
We use CustomCommands to install a private repo on the workers, but we now get the following error in the `worker-startup` logs of our jobs:

    Running command: ['pip', 'install', 'git+ssh://git@github.com/my_private_repo.git@v1.0.0']
    Command output: b'Traceback (most recent call last):
      File "/usr/local/bin/pip", line 6, in <module>
        from pip._internal import main\nModuleNotFoundError: No module named \'pip\'\n'

This code used to work, but it has been failing since our last deployment of the service on Friday.
### Some context
1. We use a GAE service with a cron job to deploy Dataflow jobs, using the Python SDK.
2. Our jobs use code stored in a private repository.
3. To allow the workers to pull private repositories, we use a `setup.py` with CustomCommands, which are run during worker startup (code example from the official repo [here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py)).
4. The commands retrieve an encrypted SSH key from GCS, decrypt it with KMS, fetch an SSH config file that specifies the key path and the allowed hosts, and then run `pip install git+ssh://git@github.com/my_private_repo.git@v1.0.0` (see the commands below).

    CUSTOM_COMMANDS = [
        # Retrieve the encrypted SSH key and decrypt it with KMS
        ["gsutil", "cp", "gs://{bucket_name}/encrypted_python_repo_ssh_key".format(bucket_name=credentials_bucket), "encrypted_key"],
        [
            "gcloud",
            "kms",
            "decrypt",
            "--location",
            "global",
            "--keyring",
            project,
            "--key",
            project,
            "--plaintext-file",
            "decrypted_key",
            "--ciphertext-file",
            "encrypted_key",
        ],
        ["chmod", "700", "decrypted_key"],
        # Install git and ssh
        ["apt-get", "update"],
        ["apt-get", "install", "-y", "openssh-server"],
        ["apt-get", "install", "-y", "git"],
        # Add the SSH config that specifies the location of the key and the host
        [
            "gsutil",
            "cp",
            "gs://{bucket_name}/ssh_config_gcloud".format(bucket_name=credentials_bucket),
            "~/.ssh/config",
        ],
        # Install the private repository over SSH
        [
            "pip",
            "install",
            "git+ssh://git@github.com/my_private_repo.git@v1.0.0",
        ],
    ]

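For context, the Beam example `setup.py` linked above runs each entry of such a list in a subprocess while the package is installed on the worker, which is where the `Running command` / `Command output` log lines come from. The following is a minimal standalone sketch of that pattern, not the exact Beam code: the function names `run_custom_command` and `run_all` are hypothetical, and in a real `setup.py` this logic would be called from a `setuptools.Command` subclass.

```python
import subprocess
import sys


def run_custom_command(command_list):
    """Run one command and return its decoded output, raising on failure."""
    print("Running command: %s" % command_list)
    process = subprocess.Popen(
        command_list,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        # Merge stderr into stdout so tracebacks end up in the same log line.
        stderr=subprocess.STDOUT,
    )
    stdout_data, _ = process.communicate()
    output = stdout_data.decode("utf-8", errors="replace")
    print("Command output: %s" % output)
    if process.returncode != 0:
        raise RuntimeError(
            "Command %s failed with exit code %s" % (command_list, process.returncode)
        )
    return output


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    return [run_custom_command(command) for command in commands]


if __name__ == "__main__":
    # Harmless stand-in for the real CUSTOM_COMMANDS list.
    run_all([[sys.executable, "-c", "print('hello from a custom command')"]])
```

Because the commands are passed as lists and executed without a shell, shell features such as `&&` or `~` expansion are not applied to the arguments.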
### What we tried
- Based on pip issue [#5599](https://github.com/pypa/pip/issues/5599), there seems to be a conflict between several installed versions of pip.
We tried reinstalling it by adding `apt-get --reinstall install -y python-setuptools python-wheel python-pip` (and variations such as `curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py --force-reinstall`) to the CustomCommands, but saw no improvement.
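
The traceback shows that the `/usr/local/bin/pip` script is bound to an interpreter that cannot import the `pip` module. One thing that might be worth trying (an untested assumption for this setup, not a confirmed fix) is to call pip as a module of the interpreter that is running `setup.py`, rather than relying on whichever `pip` script is first on `PATH`. The helper names below are hypothetical.

```python
import importlib.util
import sys


def pip_available():
    """Return True if the interpreter running this code can import pip."""
    return importlib.util.find_spec("pip") is not None


def pip_install_command(requirement):
    """Build a pip install command bound to the current interpreter."""
    # "python -m pip" uses the pip module of this exact interpreter,
    # so it cannot pick up a pip script that belongs to another Python.
    return [sys.executable, "-m", "pip", "install", requirement]


if __name__ == "__main__":
    print("pip importable:", pip_available())
    print(pip_install_command("git+ssh://git@github.com/my_private_repo.git@v1.0.0"))
```

The returned list could replace the final `["pip", "install", ...]` entry of the CustomCommands. Logging `pip_available()` first would show whether the worker's interpreter has pip at all.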
### Notes
- Jobs started locally work. (How? I'm curious how this can work, since the CustomCommands are not run.)
- Logging in to the Compute Engine instance, attaching to the Docker process, and running the commands manually produces no errors.
- The service is deployed using a custom Dockerfile, defined by the following snippet:

    FROM gcr.io/google-appengine/python
    RUN apt-get update && apt-get install -y openssh-server
    RUN virtualenv /env -p python3.7
    # Setting these environment variables is the same as running
    # source /env/bin/activate.
    ENV VIRTUAL_ENV /env
    ENV PATH /env/bin:$PATH
    # Set credentials for git, then run pip to install all
    # dependencies into the virtualenv.
    ... specify SSH KEY, host, to allow private git repo pull
    # Add the application source code.
    ADD . /app
    RUN pip install -r /app/requirements.txt && python /app/setup.py install && python /app/setup.py build
    CMD gunicorn -b :$PORT main:app

Any ideas on how to solve this issue, or is there a workaround available?
Thanks for your help!
### Edit
This seems to be mostly due to the local state of the machine that deploys the job, i.e. our computers.
After running commands like `python setup.py install` or `python setup.py build`, I can no longer deploy jobs (I get the same `worker-startup` error as when the service deploys them), but my colleague can still deploy jobs that run correctly (same code, same branch; the only difference is the directories excluded by .gitignore, such as `build`, `dist`, ...). In his case, the CustomCommands are not run on job deployment, but the workers are still able to use the locally packaged pipeline.
Is there any way to specify a compiled package for the workers to use? I was not able to find documentation on that.
### Workaround
Since we were not able to pull private code from the Dataflow workers, we used the following workaround:
- Build a wheel of our private repo using `python setup.py sdist bdist_wheel`.
- Embed this wheel in our Dataflow package as `lib/my-package-1.0.0-py3-none-any.whl`.
- Pass the wheel to the Dataflow options as an extra package (see the Beam code [here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/options/pipeline_options.py#L879)).
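#### Commands used

    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    pipeline_options = PipelineOptions()
    # Ship the local package and the prebuilt private wheel to the workers.
    pipeline_options.view_as(SetupOptions).setup_file = "./setup.py"
    pipeline_options.view_as(SetupOptions).extra_packages = ["./lib/my-package-1.0.0-py3-none-any.whl"]
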
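
As a variation on the commands above, the same options can be passed as command-line style flags (`--setup_file` and `--extra_package`), which avoids hard-coding the wheel version. This is only a sketch: the helper name `build_setup_args` and the assumption that every wheel under `lib/` should be shipped are mine, not part of the original setup.

```python
import glob
import os


def build_setup_args(lib_dir="./lib", setup_file="./setup.py"):
    """Build Beam setup flags for every wheel found in lib_dir."""
    wheels = sorted(glob.glob(os.path.join(lib_dir, "*.whl")))
    if not wheels:
        # Fail at submission time instead of at worker startup.
        raise FileNotFoundError("No wheel found in %s" % lib_dir)
    args = ["--setup_file=%s" % setup_file]
    # --extra_package may be repeated, once per local package to stage.
    args.extend("--extra_package=%s" % wheel for wheel in wheels)
    return args
```

The result would then be passed to the options constructor, for example `PipelineOptions(build_setup_args())`.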
</details>
# Answer 1
**Score**: 2
For anything beyond trivial, public dependencies, I would recommend using [custom containers][1] and installing all the dependencies ahead of time.

[1]: https://cloud.google.com/dataflow/docs/guides/using-custom-containers
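
With a custom container, the private package is installed when the image is built, so the workers never need SSH keys or a startup-time `pip install`. A rough sketch of the pipeline flags involved follows. The helper name and the image, project, and region values are placeholders, and `--sdk_container_image` is the flag used by recent Beam SDKs (older releases used a different flag name, so check the linked guide for your version).

```python
def custom_container_args(image, project, region):
    """Build Dataflow flags that run workers from a prebuilt SDK container image."""
    return [
        "--runner=DataflowRunner",
        "--project=%s" % project,
        "--region=%s" % region,
        # The image is expected to already contain the private package.
        "--sdk_container_image=%s" % image,
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(custom_container_args("gcr.io/my-project/my-beam-image:latest", "my-project", "europe-west1"))
```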