Python ColumnTransformer SettingWithCopyWarning
Question
I'm receiving a SettingWithCopyWarning when applying transformations to a DataFrame using a scikit-learn ColumnTransformer, and I'm not sure why that is.
This is my code.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
import warnings

warnings.filterwarnings("always")


def filling_nan(frame):
    """Fills columns that have null values with zeros."""
    frame.fillna(0, inplace=True)
    return frame


def as_is(frame):
    """Returns the DataFrame as it is."""
    return frame


np.random.seed(1337)
df = pd.DataFrame(data=np.random.random(size=(5, 5)), columns=list('ABCDE'))
df = df.applymap(lambda x: np.nan if x < 0.15 else x)  # Set a few numbers in the dataframe to NaN.
print(df)
This is what the original DataFrame looks like...
          A         B         C         D         E
0  0.262025  0.158684  0.278127  0.459317  0.321001
1  0.518393  0.261943  0.976085  0.732815       NaN
2  0.386275  0.628501       NaN  0.983549  0.443225
3  0.789558  0.794119  0.361262  0.416104  0.584258
4  0.760172  0.187808  0.288167  0.670219  0.499648
Then I create the steps for the ColumnTransformer, specifying each column by its index rather than by its name.
step_filling_nans = ('filling_nans', FunctionTransformer(filling_nan, validate=False), [2, 4])
step_as_is = ('as_is', FunctionTransformer(as_is, validate=False), [0, 1, 3])
Then I create the ColumnTransformer...
trans = ColumnTransformer(
    transformers=[
        step_filling_nans,
        step_as_is,  # I could pass 'passthrough' to the remainder keyword instead of doing this step.
    ],
    remainder='drop')
Finally, I print the result of applying the ColumnTransformer to my DataFrame.
print(trans.fit_transform(df))
This is the output of the transformations. The ColumnTransformer returns a numpy array as expected (with columns 'C' and 'E' first and second respectively), but I don't understand why I'm receiving the SettingWithCopy warning.
[[0.27812652 0.32100054 0.26202468 0.15868397 0.45931689]
 [0.97608528 0.         0.51839282 0.26194293 0.73281455]
 [0.         0.44322487 0.38627507 0.62850118 0.98354861]
 [0.36126157 0.58425813 0.78955834 0.79411858 0.41610394]
 [0.28816715 0.49964826 0.76017177 0.18780841 0.67021886]]
/bigdisk0/users/belladam/.conda/envs/day_zero_retention/lib/python3.6/site-packages/pandas/core/frame.py:3787: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
downcast=downcast, **kwargs)
I have managed to fix it by changing the filling_nan() function slightly, but I don't understand why that fixes it.
def filling_nan(frame):
    """Fills columns that have null values with zeros."""
    frame = frame.fillna(0)
    return frame
I've been unable to reproduce the warning outside of a ColumnTransformer, so I was wondering whether it has something to do with that.
Answer 1
Score: 0
I believe ColumnTransformer behaves this way because it can run the transformations for different columns in parallel. You can take a look at the sklearn code itself here, at line 448. If you want to use parallelization, it is not really safe for several transformers to work on the same object (the same memory location).
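As a rough illustration of why in-place modification is risky here, the sketch below mimics what a transformer plausibly receives: a column subset taken from the original frame by position. That `ColumnTransformer` selects columns this way is an assumption about its internals, and the small `df` and the name `subset` are made up for the example. Whether pandas actually emits a SettingWithCopyWarning depends on the pandas version, so the code only records whatever warnings appear instead of expecting one.

```python
import warnings

import numpy as np
import pandas as pd

# Small hypothetical frame with one NaN in each of the last two columns.
df = pd.DataFrame(
    {"A": [0.1, 0.2], "C": [np.nan, 0.3], "E": [0.4, np.nan]}
)

# Positional selection of columns 1 and 2 (assumed to resemble what a
# transformer registered for those columns is handed).
subset = df.iloc[:, [1, 2]]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # In-place fill on the subset; older pandas versions may warn here.
    subset.fillna(0, inplace=True)

warning_names = [w.category.__name__ for w in caught]
print(warning_names)
print(df.isna().sum().sum())      # NaNs left in the original frame
print(subset.isna().sum().sum())  # NaNs left in the subset
```

The original `df` keeps its NaN values because the list-based selection produced a separate object; only `subset` is filled.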
By avoiding inplace, you are actually working on a copy of the original object, which resolves the problem:
def filling_nan(frame):
    """Fills columns that have null values with zeros."""
    frame = frame.fillna(0)
    return frame


def filling_nan_inplace(frame):
    """Fills columns that have null values with zeros."""
    frame.fillna(0, inplace=True)
    return frame


print(id(df))
print(id(filling_nan_inplace(df)))
print(id(filling_nan(df)))
Output:
2088604584760
2088604584760
2088604583304
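Putting this together, a minimal sketch of the transformer with the non-in-place function might look like the following. It uses `remainder='passthrough'`, as the question's code comment suggests, in place of the separate `as_is` step. The small `df` is invented for the example, and the exact return type of `fit_transform` can vary with the scikit-learn version and output configuration, hence the `np.asarray` call.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer


def filling_nan(frame):
    """Return a new frame with nulls replaced by zeros (no in-place change)."""
    return frame.fillna(0)


# Hypothetical data: NaNs only in columns C (index 2) and E (index 4).
df = pd.DataFrame(
    {
        "A": [0.1, 0.2, 0.3],
        "B": [0.4, 0.5, 0.6],
        "C": [np.nan, 0.7, 0.8],
        "D": [0.9, 1.0, 1.1],
        "E": [1.2, np.nan, 1.3],
    }
)

trans = ColumnTransformer(
    transformers=[
        ("filling_nans", FunctionTransformer(filling_nan, validate=False), [2, 4]),
    ],
    remainder="passthrough",  # columns A, B and D are passed through unchanged
)

result = np.asarray(trans.fit_transform(df))
print(result.shape)
print(np.isnan(result).any())  # whether any NaN survived the transform
```

The transformed columns come first in the result, followed by the passed-through ones, which matches the column order seen in the question's output.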