Using code_path in mlflow.pyfunc models on Databricks
Question
We are using Databricks over AWS infra, registering models on mlflow.
We write our in-project imports as from src.(module location) import (objects).
Following examples online, I expected that using mlflow.pyfunc.log_model(..., code_path=['PROJECT_ROOT/src'], ...) would add the entire code tree to the model's running environment and thus allow us to keep our imports as they are.
When logging the model, I get a long list of [Errno 95] Operation not supported errors, one for each notebook in our repo. This blocks us from registering the model to mlflow.
We have tried several ad-hoc solutions and workarounds, from forcing ourselves to keep all code in one file, to only working with files in the same directory (code_path=['./filename.py']), to adding specific libraries (and changing import paths accordingly), etc.
However, none of these is optimal. As a result, we either duplicate code (killing DRY), or we put some imports inside the wrapper (i.e. those that cannot be run in our working environment, since it differs from the one the model will see when deployed), etc.
We have not yet tried putting all the notebooks (which we believe cause the [Errno 95] Operation not supported errors) into a separate folder. That would be highly disruptive to our current setup and processes, and we'd like to avoid it as much as we can.
Please advise.
Answer 1
Score: 1
I had a similar struggle with Databricks when using custom model logic from an src directory (a structure similar to cookiecutter-data-science). The solution was to log the entire src directory using a relative path.
So if you have the following project structure:
.
├── notebooks
│   └── train.py
└── src
    ├── __init__.py
    └── model.py
Your train.py should look like this. Note AddN comes from the MLflow docs.
import mlflow

from src.model import AddN

model = AddN(n=5)

mlflow.pyfunc.log_model(
    registered_model_name="add_n_model",
    artifact_path="add_n_model",
    python_model=model,
    # Log the whole src/ tree, relative to notebooks/, so in-project imports keep working.
    code_path=["../src"],
)
This will copy all code in src/ and log it in the MLflow artifact, allowing the model to load all dependencies.
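For reference, the AddN class imported above lives in src/model.py. A minimal sketch of that file, modeled on the AddN example in the MLflow docs (an assumption, not the original answer's code), could look like this:

import mlflow.pyfunc

class AddN(mlflow.pyfunc.PythonModel):
    # Toy custom pyfunc model from the MLflow docs: adds n to every column of the input.
    def __init__(self, n):
        self.n = n

    def predict(self, context, model_input):
        return model_input.apply(lambda column: column + self.n)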
If you are not using a notebooks/ directory, you will set code_path=["src"]. If you are using sub-directories like notebooks/train/train.py, you will set code_path=["../../src"].
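As a quick sanity check that the logged code tree is actually picked up, you can load the model back through the standard pyfunc API. The model URI below is hypothetical (it assumes version 1 of the registered add_n_model); substitute the run or registry URI from your own logging step:

import mlflow
import pandas as pd

# Hypothetical registry URI; replace with your own run or model-registry URI.
loaded_model = mlflow.pyfunc.load_model("models:/add_n_model/1")

# The directory logged via code_path is restored with the model and placed on sys.path
# at load time, so the model's own `from src.model import AddN` import resolves as-is.
model_input = pd.DataFrame([range(10)])
print(loaded_model.predict(model_input))  # each value shifted up by n=5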