英文:
Global variable in Column Transformer scikit-learn
问题
在这个代码片段中,我构建了一个转换器(transformer),用于插入一列数据,但似乎修改了原始变量。实际上,我无法重复进行fit操作。这是否是预期行为?这会破坏例如grid_search的行为,因为它会尝试为每个网格拟合(insert)列。
from sklearn.base import BaseEstimator, TransformerMixin
class customColumnTransformer(BaseEstimator, TransformerMixin):
def fit(self, X, y=None):
return self
def transform(self, X):
X.insert(0, 'newCol', 1)
return X
df = pd.DataFrame([[1, 2], [3, 4]])
display(customColumnTransformer().fit_transform(df))
display(df)
# display(customColumnTransformer().fit_transform(df)) 这会生成一个错误 ValueError: 无法插入 newCol,因为它已经存在
我找到的唯一解决方法是使用:
customColumnTransformer().fit_transform(df.copy())
英文:
in this snippet I built a transformer that insert one column, but it seems to modify the original variable. In fact I cannot repeat the fit operation. Is this expected? This breaks the behavior of grid_search for example because it tries to insert the column for every grid fit.
from sklearn.base import BaseEstimator,TransformerMixin
class customColumnTransformer(BaseEstimator,TransformerMixin):
def fit(self,X,y=None):
return self
def transform(self,X):
X.insert(0,'newCol',1)
return X
df = pd.DataFrame([[1,2],[3,4]])
display(customColumnTransformer().fit_transform(df))
display(df)
# display(customColumnTransformer().fit_transform(df)) THIS GENERATE AN ERROR ValueError: cannot insert newCol, already exists
The only solution I found is to use
customColumnTransformer().fit_transform(df.copy())
答案1
得分: 1
当您在 transform 函数内部使用 insert 方法修改 DataFrame X 时,这种行为是完全符合预期的。它会原地修改 DataFrame。因此,如果在同一个 DataFrame 上连续调用 fit_transform,会导致错误,因为列 'newCol' 已经存在。
为了解决这个问题并避免修改原始 DataFrame,您可以在插入新列之前创建一个它的副本。这样,每次调用 fit_transform 都会在一个单独的副本上操作,不会影响原始 DataFrame。
下面是修改后的代码版本:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
class CustomColumnTransformer(BaseEstimator, TransformerMixin):
def fit(self, X, y=None):
return self
def transform(self, X):
X_copy = X.copy() # 创建 DataFrame 的副本
X_copy.insert(0, 'newCol', 1)
return X_copy
df = pd.DataFrame([[1, 2], [3, 4]])
display(CustomColumnTransformer().fit_transform(df))
display(df)
通过使用 X_copy = X.copy(),您生成了一个新的 DataFrame 对象 X_copy,可以独立修改,而不会影响原始的 df DataFrame。因此,您可以多次重复使用转换器而不会遇到“已存在”错误。
这个修改后的代码允许您在同一个 DataFrame 上多次调用 fit_transform 而不会生成任何错误。
英文:
Certainly! It's completely expected behavior. When you use the insert method to modify the DataFrame X within your transform function, it modifies the DataFrame in-place. Consequently, subsequent calls to fit_transform using the same DataFrame will result in an error since the column 'newCol' already exists.
To circumvent this issue and avoid modifying the original DataFrame, you can create a copy of it before inserting the new column. This way, each call to fit_transform will operate on a separate copy, leaving the original DataFrame intact.
Here's an updated version of your code that incorporates this change:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
class CustomColumnTransformer(BaseEstimator, TransformerMixin):
def fit(self, X, y=None):
return self
def transform(self, X):
X_copy = X.copy() # Create a copy of the DataFrame
X_copy.insert(0, 'newCol', 1)
return X_copy
df = pd.DataFrame([[1, 2], [3, 4]])
display(CustomColumnTransformer().fit_transform(df))
display(df)
By utilizing X_copy = X.copy(), you generate a new DataFrame object X_copy that can be modified independently without affecting the original df DataFrame. Consequently, you can reuse the transformer multiple times without encountering the "already exists" error.
This revised code allows you to call fit_transform repeatedly on the same DataFrame without generating any errors.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论