March 9, 2023, 17:00:10

Using numpy in sklearn FunctionTransformer inside pipeline

Question

I'm training a regression model and inside my pipeline I have something like this:

best_pipeline = Pipeline(
    steps=[
        (
            "features",
            ColumnTransformer(
                transformers=[
                    (
                        "area",
                        make_pipeline(
                            impute.SimpleImputer(),
                            pr.FunctionTransformer(lambda x: np.log1p(x)),
                            StandardScaler(),
                        ),
                        ["area"],
                    )
                ]
            ),
        ),
        (
            "regressor",
            TransformedTargetRegressor(
                regressor=model,
                transformer=PowerTransformer(method='box-cox')
            ),
        ),
    ]
)

There are obviously more features, but the code would be too long. I train the model, and if I predict in the same script, everything is fine. I store the model using dill and then try to use it in another Python file.

In this other file I load the model and try this:

import numpy as np
df['prediction'] = self.model.predict(df)

Internally, when it tries to run the transform, it raises:

NameError: name 'np' is not defined
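One plausible reading of this error, offered as a sketch and not as a diagnosis of dill's internals: a lambda does not store the `np` object itself. It looks the name up in its own globals each time it is called. If the restored function ends up attached to a namespace where `np` was never bound, the call fails even though the calling file imports numpy. The snippet below reproduces that lookup behaviour with plain `exec`; the `restored_globals` dict is a hypothetical stand-in for whatever globals the deserialized lambda receives.

```python
import numpy

# Hypothetical stand-in for the globals the restored lambda is attached to.
# Note that it contains no binding for the name "np".
restored_globals = {}
exec("transform = lambda x: np.log1p(x)", restored_globals)
transform = restored_globals["transform"]

data = numpy.array([0.0, 1.0])

# The name "np" is resolved at call time, in the lambda's own globals.
try:
    transform(data)
    error_message = None
except NameError as exc:
    error_message = str(exc)

print(error_message)

# Binding the name inside those globals makes the very same function work.
restored_globals["np"] = numpy
result = transform(data)
print(result)
```

Importing numpy in the calling file only helps if the lambda happens to share that file's globals, which is not guaranteed after deserialization.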

Answer 1

Score: 2

You can use third-party library functions by passing the function itself as the func argument:

import numpy
transformer = FunctionTransformer(numpy.log1p)

There is no need for lambdas or custom wrapper classes. In addition, the transformer above can be persisted with plain pickle.

When porting objects between different environments, it's probably a good idea to use canonical module names. Hence numpy.log1p instead of np.log1p.
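The persistence claim can be checked with a short round trip. This is a minimal sketch, assuming numpy and scikit-learn are installed in both the saving and loading environments; the sample array `X` is made up for illustration.

```python
import pickle

import numpy
from sklearn.preprocessing import FunctionTransformer

# Pass the library function directly; no lambda is involved.
transformer = FunctionTransformer(numpy.log1p)

# Plain pickle is enough here: the function is a named, importable numpy
# object, so no function body has to be serialized.
payload = pickle.dumps(transformer)
restored = pickle.loads(payload)

# Illustrative input only (a single "area"-like column).
X = numpy.array([[0.0], [1.0], [9.0]])
restored_output = restored.transform(X)
print(restored_output)
```

Because the restored transformer refers to numpy's own function, it does not depend on any alias such as `np` being defined in the loading script.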

Answer 2

Score: 0

I've found a way to fix it, although there might be a better approach.

I created a class that wraps the numpy function:

class LogTransformer(pr.FunctionTransformer):
    
    def transform(self, X):
        import numpy as np
        return np.log1p(X)

Then, when I create the pipeline:

make_pipeline(
    impute.SimpleImputer(),
    LogTransformer(),
    StandardScaler(),
),

Any other approaches are welcome.
