Python ColumnTransformer SettingWithCopyWarning

Question

I'm receiving a SettingWithCopyWarning when applying transformations to a DataFrame using a scikit-learn ColumnTransformer, and I'm not sure why that is.

This is my code.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

import warnings
warnings.filterwarnings("always")

def filling_nan(frame):
    """Fills columns that have null values with zeros."""
    frame.fillna(0, inplace=True)
    return frame

def as_is(frame):
    """Returns the DataFrame as it is."""
    return frame

np.random.seed(1337)
df = pd.DataFrame(data=np.random.random(size=(5,5)), columns=list('ABCDE'))
df = df.applymap(lambda x: np.nan if x < 0.15 else x) # Set a few numbers in the dataframe to NaN.

print(df)

This is what the original DataFrame looks like...

          A         B         C         D         E
0  0.262025  0.158684  0.278127  0.459317  0.321001
1  0.518393  0.261943  0.976085  0.732815       NaN
2  0.386275  0.628501       NaN  0.983549  0.443225
3  0.789558  0.794119  0.361262  0.416104  0.584258
4  0.760172  0.187808  0.288167  0.670219  0.499648

Then I create the steps for the ColumnTransformer, specifying each column by its index rather than by its name.

step_filling_nans = ('filling_nans', FunctionTransformer(filling_nan, validate=False), [2, 4])
step_as_is = ('as_is', FunctionTransformer(as_is, validate=False), [0, 1, 3])

Then I create the ColumnTransformer...

trans = ColumnTransformer(
    transformers=[
        step_filling_nans
        , step_as_is # I could pass 'passthrough' to the remainder keyword instead of doing this step.
    ], remainder='drop')
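
The alternative mentioned in the comment above can be sketched as follows. This is a minimal illustration on a small hand-made frame (the values and the name `fill_nan` are assumptions for the example, not part of the original code): the `as_is` step is dropped and `remainder='passthrough'` forwards the untouched columns instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def fill_nan(frame):
    """Return a new frame with nulls replaced by zeros (no inplace)."""
    return frame.fillna(0)

# Small illustrative frame; columns C and E each contain one NaN.
df = pd.DataFrame({
    "A": [0.1, 0.2], "B": [0.3, 0.4], "C": [np.nan, 0.5],
    "D": [0.6, 0.7], "E": [0.8, np.nan],
})

trans = ColumnTransformer(
    transformers=[("filling_nans", FunctionTransformer(fill_nan), [2, 4])],
    remainder="passthrough",  # forwards A, B and D unchanged
)

# Transformed columns (C, E) come first, then the passed-through ones.
result = trans.fit_transform(df)
print(result)
```

The column order of the result matches the two-step version: the filled columns first, followed by the remaining ones.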

Finally, I print the result of applying the ColumnTransformer to my DataFrame.

print(trans.fit_transform(df))

This is the output of the transformations. The ColumnTransformer returns a numpy array as expected (with columns 'C' and 'E' first and second respectively), but I don't understand why I'm receiving the SettingWithCopy warning.

[[0.27812652 0.32100054 0.26202468 0.15868397 0.45931689]
 [0.97608528 0.         0.51839282 0.26194293 0.73281455]
 [0.         0.44322487 0.38627507 0.62850118 0.98354861]
 [0.36126157 0.58425813 0.78955834 0.79411858 0.41610394]
 [0.28816715 0.49964826 0.76017177 0.18780841 0.67021886]]

/bigdisk0/users/belladam/.conda/envs/day_zero_retention/lib/python3.6/site-packages/pandas/core/frame.py:3787: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)

I have managed to fix it by changing the filling_nan() function slightly, but I don't understand why that fixes it.

def filling_nan(frame):
    """Fills columns that have null values with zeros."""
    frame = frame.fillna(0)
    return frame

I've been unable to reproduce the warning outside of a ColumnTransformer, so I was wondering whether it has something to do with that.
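
One way to probe this outside a ColumnTransformer is to select the same columns by position and fill them in place. The sketch below assumes that the transformer step receives such a positional selection of the input frame; the small frame is made up for illustration. Whether pandas actually emits SettingWithCopyWarning here depends on the installed version (newer releases with copy-on-write may not), so the script reports it rather than assuming it.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0.1, 0.2], "B": [0.3, 0.4], "C": [np.nan, 0.5],
    "D": [0.6, 0.7], "E": [0.8, np.nan],
})

# Select columns C and E by position, like the filling_nans step does.
subset = df.iloc[:, [2, 4]]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    subset.fillna(0, inplace=True)

# Compare by class name so the check does not depend on where (or whether)
# the installed pandas version exposes the warning class.
names = [w.category.__name__ for w in caught]
print("SettingWithCopyWarning raised:", "SettingWithCopyWarning" in names)
print("NaNs left in subset:", int(subset.isna().sum().sum()))
print("NaNs left in df:", int(df.isna().sum().sum()))
```

In either case the selection is filled while the original frame keeps its NaN values, which is the situation the warning is meant to flag.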

Answer 1

Score: 0

I believe ColumnTransformer behaves this way because it can run the transformations for different columns in parallel. You can take a look at the sklearn code itself here, at line 448. If you want to use parallelization, it is not really safe to work on the same object (the same memory location).

By avoiding inplace, you are actually working on a copy of the original object, which resolves the problem:

def filling_nan(frame):
    """Fills columns that have null values with zeros."""
    frame = frame.fillna(0)
    return frame

def filling_nan_inplace(frame):
    """Fills columns that have null values with zeros."""
    frame.fillna(0, inplace=True)
    return frame

print(id(df))
print(id(filling_nan_inplace(df)))
print(id(filling_nan(df)))

Output:

2088604584760
2088604584760
2088604583304
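
The same point can be checked without comparing raw id() values, which differ on every run. The sketch below uses a small made-up frame (an assumption for illustration) and identity checks with `is` to show that the non-inplace version returns a new object and leaves its input untouched, while the inplace version mutates and returns the very object it was given.

```python
import numpy as np
import pandas as pd

def filling_nan(frame):
    """Return a new frame with nulls replaced by zeros."""
    return frame.fillna(0)

def filling_nan_inplace(frame):
    """Replace nulls with zeros in the frame that was passed in."""
    frame.fillna(0, inplace=True)
    return frame

original = pd.DataFrame({"C": [0.3, np.nan], "E": [np.nan, 0.5]})

copy_result = filling_nan(original)
print(copy_result is original)           # False: a new object is returned
print(int(original.isna().sum().sum()))  # 2: the input still has its NaNs

inplace_result = filling_nan_inplace(original)
print(inplace_result is original)        # True: the same object comes back
print(int(original.isna().sum().sum()))  # 0: the input itself was modified
```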

huangapple
  • Posted on January 3, 2020 at 19:18:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59577674.html