How to create pandas output for custom transformers?
Question
scikit-learn 1.2.0 introduced pandas output support for its built-in transformers, but how can I use it in a custom transformer?
In [1]: Here is my custom transformer, a standard scaler:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.std
In [2]: I created a dedicated scaling pipeline:
from sklearn.pipeline import make_pipeline
scale_pipe = make_pipeline(StandardScalerCustom())
In [3]: I added it to a full pipeline, where it may be combined with scalers, imputers, encoders, etc.:
from sklearn.compose import ColumnTransformer
# impute_pipe is another pipeline defined elsewhere (not shown)
full_pipeline = ColumnTransformer([
    ("imputer", impute_pipe, ['column_1']),
    ("scaler", scale_pipe, ['column_2'])
])
# From the documentation
full_pipeline.set_output(transform="pandas")
I got this error:
ValueError: Unable to configure output for StandardScalerCustom() because set_output is not available.
One workaround is to set the output format globally:
from sklearn import set_config
set_config(transform_output="pandas")
But to configure this on a case-by-case basis, what method can I add to the StandardScalerCustom class to fix the error above?
Answer 1
Score: 3
My guess is that one of the rationales behind extending set_config() with the transform_output parameter was indeed to let custom transformers output pandas DataFrames as well.
Looking at the underlying code, I found a hack that lets custom transformers output pandas DataFrames without explicitly setting the global configuration: it is sufficient to implement a dummy .get_feature_names_out() method.
This works because _auto_wrap_is_configured() returns True only if .get_feature_names_out() is available. When it does, the transformer exposes a .set_output() method, so the ._safe_set_output() helper that full_pipeline applies to each of its transformers can call it with transform="pandas" instead of raising the ValueError that you're getting. As far as I can tell from the source, this stores the setting on the estimator itself rather than changing the global configuration.
Here's a working example:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_1': [np.nan, 1.34, 10.98, 3.34, 5.32], 'column_2': [9.78, 20.34, 43.54, 1.98, 7.85]})

class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.std

    # Dummy method: its mere presence makes set_output() available
    def get_feature_names_out(self):
        pass

impute_pipe = make_pipeline(SimpleImputer())
scale_pipe = make_pipeline(StandardScalerCustom())
full_pipeline = ColumnTransformer([
    ("imputer", impute_pipe, ['column_1']),
    ("scaler", scale_pipe, ['column_2'])
])
full_pipeline.set_output(transform="pandas")
full_pipeline.fit_transform(df)
Answer 2
Score: 2
In most cases, a custom transform method returns a NumPy array. To convert the result back to a pandas DataFrame, you need to record the column names during fitting and then add a get_feature_names_out method that returns them. Try this code:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class StandardScalerCustom(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Remember the input column names (X must be a DataFrame)
        self.columns_ = X.columns
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean) / self.std

    # Report the column names recorded in fit
    def get_feature_names_out(self, *args, **params):
        return self.columns_