2023-01-06 11:14:45

How to create pandas output for custom transformers?

Question

scikit-learn 1.2.0 brought many changes, including pandas output support for all of its transformers, but how can I use that in a custom transformer?

In [1]: Here is my custom transformer, which is a standard scaler:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self
    def transform(self, X):
        return (X - self.mean) / self.std

In [2]: Created a specific scale pipeline

from sklearn.pipeline import make_pipeline
scale_pipe = make_pipeline(StandardScalerCustom())

In [3]: Added in a full pipeline where it may get mixed with scalers, imputers, encoders etc.

from sklearn.compose import ColumnTransformer
full_pipeline = ColumnTransformer([
    ("imputer", impute_pipe, ['column_1']),
    ("scaler", scale_pipe, ['column_2'])
])
# From documentation
full_pipeline.set_output(transform="pandas")

Got this error:

ValueError: Unable to configure output for StandardScalerCustom() because set_output is not available.

One workaround is to set the global configuration:
set_config(transform_output="pandas")

But on a case-by-case basis, how can I add a method to the StandardScalerCustom class that fixes the error above?

Answer 1

Score: 3

My guess is that one of the rationales behind enhancing set_config() with the transform_output parameter was indeed to enable custom transformers to output pandas DataFrames as well.

By looking at the underlying code, I found one hack that lets custom transformers output pandas DataFrames without explicitly setting the global configuration: it is sufficient to implement a dummy .get_feature_names_out() method. This works because the presence of that method is what makes .set_output() available on the transformer.
Indeed, _auto_wrap_is_configured() returns True if .get_feature_names_out() is available. In that case the transformer exposes a .set_output() method, so ._safe_set_output() can call it with transform="pandas" for each step of full_pipeline instead of raising the ValueError that you're getting.

Here's a working example:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import pandas as pd
df = pd.DataFrame({'column_1': [np.nan, 1.34, 10.98, 3.34, 5.32], 'column_2': [9.78, 20.34, 43.54, 1.98, 7.85]})
class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self
    def transform(self, X):
        return (X - self.mean) / self.std
    def get_feature_names_out(self):
        pass
impute_pipe = make_pipeline(SimpleImputer())
scale_pipe = make_pipeline(StandardScalerCustom())
full_pipeline = ColumnTransformer([
    ("imputer", impute_pipe, ['column_1']),
    ("scaler", scale_pipe, ['column_2'])
])
full_pipeline.set_output(transform="pandas")
full_pipeline.fit_transform(df)

Answer 2

Score: 2

In most cases, a custom transform method returns a NumPy array. To convert the result back to a pandas DataFrame, you need to store the column names during fitting and then add a get_feature_names_out method that returns them. Try this code:

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.columns_ = X.columns
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self
    def transform(self, X):
        return (X - self.mean) / self.std
    
    def get_feature_names_out(self, *args, **params):
        return self.columns_
