Using numpy in sklearn FunctionTransformer inside pipeline

Question

I'm training a regression model and inside my pipeline I have something like this:

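    # Imports inferred from the aliases used below; `model` is any regressor defined elsewhere.
    import numpy as np
    from sklearn import impute
    from sklearn import preprocessing as pr
    from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
    from sklearn.pipeline import Pipeline, make_pipeline
    from sklearn.preprocessing import PowerTransformer, StandardScaler

    best_pipeline = Pipeline(
        steps=[
            (
                "features",
                ColumnTransformer(
                    transformers=[
                        (
                            "area",
                            make_pipeline(
                                impute.SimpleImputer(),
                                pr.FunctionTransformer(lambda x: np.log1p(x)),
                                StandardScaler(),
                            ),
                            ["area"],
                        )
                    ]
                ),
            ),
            (
                "regressor",
                TransformedTargetRegressor(
                    regressor=model,
                    transformer=PowerTransformer(method='box-cox')
                ),
            ),
        ]
    )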

There are obviously more features, but the code would be too long. I train the model, and predicting in the same script works fine. I then store the model using dill and try to use it in another Python file.

In this other file I load the model and try this:

    import numpy as np
    df['prediction'] = self.model.predict(df)

Internally, when the pipeline reaches the transform step, it raises:

    NameError: name 'np' is not defined

Answer 1

Score: 2

You can use third-party library functions by passing the function itself as the func argument:

    import numpy
    from sklearn.preprocessing import FunctionTransformer

    transformer = FunctionTransformer(numpy.log1p)

There is no need for lambdas or custom wrapper classes. The resulting transformer can also be persisted with plain pickle, because numpy.log1p is pickled by name rather than by value.

When porting objects between different environments, it's probably a good idea to use canonical module names. Hence numpy.log1p instead of np.log1p.
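As a minimal sketch of the point above, the following round-trips a small pipeline through plain pickle and, for contrast, shows that plain pickle refuses a lambda. The toy data and the name `area_pipeline` are made up for illustration and are not part of the original question.

```python
import pickle

import numpy
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy single-column data standing in for the "area" feature (made up for illustration).
X = numpy.array([[50.0], [80.0], [numpy.nan], [120.0]])

area_pipeline = make_pipeline(
    SimpleImputer(),
    FunctionTransformer(numpy.log1p),  # a named function, not a lambda
    StandardScaler(),
)
expected = area_pipeline.fit_transform(X)

# Round-trip through plain pickle, as if saving and loading in another script.
restored = pickle.loads(pickle.dumps(area_pipeline))
roundtrip_ok = numpy.allclose(restored.transform(X), expected)

# For contrast: plain pickle cannot serialize a lambda.
try:
    pickle.dumps(FunctionTransformer(lambda x: numpy.log1p(x)))
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
```

Here `roundtrip_ok` should be true and `lambda_pickled` false. The loading script still needs numpy and scikit-learn installed, since pickle stores references to them rather than their code.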

Answer 2

Score: 0

I've found a way to fix it, although there might be a better approach.

I created a class that encapsulates the numpy function:

    from sklearn import preprocessing as pr

    class LogTransformer(pr.FunctionTransformer):
        def transform(self, X):
            # Import inside the method so `np` is resolved when transform runs.
            import numpy as np
            return np.log1p(X)

Then when I create the pipeline:

    make_pipeline(
        impute.SimpleImputer(),
        LogTransformer(),
        StandardScaler(),
    ),

Other approaches are welcome.
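For reference, here is a self-contained sketch of the wrapper-class approach from this answer, run on toy data. The data values and the name `area_pipeline` are hypothetical. Note that with plain pickle the class itself would need to live in a module that the loading script can import; that is an assumption about deployment and is not covered in the answer.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


class LogTransformer(FunctionTransformer):
    """FunctionTransformer subclass that applies log1p, as in the answer above."""

    def transform(self, X):
        import numpy as np  # resolved when transform runs, not when the object is saved
        return np.log1p(X)


# Toy data (hypothetical values) with one missing entry for the imputer to fill.
X = np.array([[50.0], [80.0], [np.nan], [120.0]])

area_pipeline = make_pipeline(SimpleImputer(), LogTransformer(), StandardScaler())
scaled = area_pipeline.fit_transform(X)
```

The output keeps the input shape, and after scaling the column has mean zero up to floating-point error.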

huangapple
  • Posted on March 9, 2023, 17:00:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75682363.html