How to create pandas output for custom transformers?


Question

scikit-learn 1.2.0 introduced many changes, including pandas output support for all of its transformers. How can I use that in a custom transformer?

In [1]: Here is my custom transformer, which is a standard scaler:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.std

In [2]: I created a dedicated scaling pipeline:

from sklearn.pipeline import make_pipeline

scale_pipe = make_pipeline(StandardScalerCustom())

In [3]: I added it to a full pipeline, where it may be combined with scalers, imputers, encoders, etc.:

from sklearn.compose import ColumnTransformer

# impute_pipe is an imputation pipeline defined elsewhere
full_pipeline = ColumnTransformer([
    ("imputer", impute_pipe, ['column_1']),
    ("scaler", scale_pipe, ['column_2'])
])

# From documentation
full_pipeline.set_output(transform="pandas")

This raised the following error:

ValueError: Unable to configure output for StandardScalerCustom() because set_output is not available.


One workaround is to change the global configuration:
set_config(transform_output="pandas")

But on a case-by-case basis, how can I add a method to the StandardScalerCustom class that fixes the error above?

Answer 1

Score: 3

My guess is that one of the rationales behind enhancing set_config() with the transform_output parameter was indeed to let custom transformers output pandas DataFrames as well.

By looking at the underlying code, I've found one hack that allows custom transformers to output pandas DataFrames without explicitly setting the global configuration: it is sufficient to implement a dummy .get_feature_names_out() method. However, this works only because defining that method makes .set_output() available on the transformer.
Indeed, _auto_wrap_is_configured() returns True if .get_feature_names_out() is available. In that case full_pipeline can call the transformer's .set_output() method with transform="pandas", instead of ._safe_set_output() raising the ValueError that you're getting.

Here's a working example:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_1': [np.nan, 1.34, 10.98, 3.34, 5.32], 'column_2': [9.78, 20.34, 43.54, 1.98, 7.85]})

class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.std

    def get_feature_names_out(self):
        # Dummy method: its presence alone makes set_output() available
        pass

impute_pipe = make_pipeline(SimpleImputer())
scale_pipe = make_pipeline(StandardScalerCustom())

full_pipeline = ColumnTransformer([
    ("imputer", impute_pipe, ['column_1']),
    ("scaler", scale_pipe, ['column_2'])
])

full_pipeline.set_output(transform="pandas")
full_pipeline.fit_transform(df)

Answer 2

Score: 2

In most cases, a custom transform method returns a NumPy array. To convert it back to a pandas DataFrame, you need to store the column names while fitting. You then need to add a get_feature_names_out method that returns those column names. Try this code:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.columns_ = X.columns
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.std
    
    def get_feature_names_out(self, *args, **params):
        return self.columns_

huangapple
  • Posted on January 6, 2023, 11:14:45
  • Please keep this link when reposting: https://go.coder-hub.com/75026592.html