How to correct this code to not raise a SettingWithCopyWarning?

Question

I'm following along with this: https://www.kdnuggets.com/2021/01/cleaner-data-analysis-pandas-pipes.html

About halfway down, the author creates a function that converts low-cardinality object columns to the category dtype:

def to_category(df):
    cols = df.select_dtypes(include='object').columns
    for col in cols:
        ratio = len(df[col].value_counts()) / len(df)
        if ratio < 0.05:
            df[col] = df[col].astype('category')
    return df

This raised a warning from Python:

Warning (from warnings module):
  File "D:/I7_Education/pandas_pipe_function1/pipes3.py", line 51
    df[col] = df[col].astype('category')
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

I don't fully understand what the problem is, though I'm working through the documentation's explanation and some posts online to try to make sense of it.

I'm aware that I can suppress the warnings from Python (The code runs fine if you suppress the warnings). I would like to know how to change the code in the article so it doesn't raise a warning in the first place.

I tried contacting the author, but haven't heard back.

What I want is for no suppression to be necessary, but I don't understand the problem well enough to figure out how to change the code so it doesn't trip a SettingWithCopyWarning in the first place.

I was not expecting the warning. The documentation, as well as a few posts online, says to change df using loc, but I'm not changing values or elements in the dataframe; I'm changing the dtype of columns from object to category. astype('category') is how to do that, and I would assume that looping through the columns to do it should be fine. A friend told me to create a copy of the df that's passed to the function, manipulate that, and then return the copy. I don't fully understand that either, and it doesn't solve the problem: it still raises the same warning.

The dataframe I'm passing to the function is a copy. The article is only manipulating the dataset (directmarketing.csv); it reads the csv into a pandas dataframe and manipulates it directly. I had instead created two dataframes: the first is dataset = pd.read_csv(".\directmarketing.csv") and the second is marketing = dataset.copy(), and I'm only manipulating the marketing dataframe. That way I can go back and check against the dataset dataframe and make sure things have changed the way they're supposed to, etc.

But when I call the function, I'm calling to_category(marketing) - I haven't touched the dataset dataframe at all.

There is a thread on stackoverflow - https://stackoverflow.com/questions/66336670/returning-a-copy-versus-a-view-warning-when-using-python-pandas-dataframe?rq=2 - that talks about this, but it's saying to make a copy to avoid the warning, and so I'm very confused.
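For context, the "make a copy" advice can be sketched as follows. This is only an illustration under stated assumptions: the frame, its column names, and the filter are hypothetical, and whether a filtering step like this happened before to_category was called in the question is not known. In pandas versions before 3.0 (without copy-on-write), a frame produced by filtering another frame may be flagged as a "copy of a slice", and assigning a column on it can trigger the warning; calling .copy() on the filtered result is the usual way to avoid that.

```python
import warnings

import pandas as pd

# Hypothetical data; not the directmarketing.csv dataset from the question.
raw = pd.DataFrame({"age": [25, 40, 31, 58], "city": ["A", "B", "A", "B"]})

# Boolean filtering returns a new object that older pandas tracks as
# derived from `raw`.
subset = raw[raw["age"] > 30]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    subset["city"] = subset["city"].astype("category")
names = [type(w.message).__name__ for w in caught]
# On pandas < 3.0 this list may contain 'SettingWithCopyWarning'.
print(names)

# Taking an explicit copy of the filtered result before assigning.
safe = raw[raw["age"] > 30].copy()

with warnings.catch_warnings(record=True) as caught_safe:
    warnings.simplefilter("always")
    safe["city"] = safe["city"].astype("category")
safe_names = [type(w.message).__name__ for w in caught_safe]
print(safe_names)
```

The point of the sketch is that the warning depends on how the frame being assigned to was created, not on whether the assignment changes values or dtypes.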

Is there a way to correct the code in the article so it does not trip this warning?

I'm using Python 3.10 and IDLE - I'm not using an IDE with this.

Answer 1

Score: 0

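One idea is to rewrite the function so that it collects the qualifying column names in a list (final), converts that list to a dictionary with dict.fromkeys, and passes the dictionary to DataFrame.astype, which returns a new DataFrame instead of assigning columns on the one passed in: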

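def to_category(df):
    final = []  # names of the columns to convert
    cols = df.select_dtypes(include='object').columns
    for col in cols:
        # share of distinct values relative to the number of rows
        ratio = len(df[col].value_counts()) / len(df)
        if ratio < 0.05:
            final.append(col)
    # {'col_a': 'category', 'col_b': 'category', ...} applied in a single call
    return df.astype(dict.fromkeys(final, 'category'))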
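A short usage sketch of this version, assuming a small made-up frame (the column names gender and customer and the 100-row size are hypothetical, chosen so that one column falls below the 0.05 ratio and the other does not). The function is repeated here so the example runs on its own.

```python
import pandas as pd


def to_category(df):
    # Same logic as the answer: collect qualifying columns, convert them at once.
    final = []
    for col in df.select_dtypes(include="object").columns:
        ratio = len(df[col].value_counts()) / len(df)
        if ratio < 0.05:
            final.append(col)
    return df.astype(dict.fromkeys(final, "category"))


# Hypothetical data with explicit object dtype so select_dtypes picks it up.
marketing = pd.DataFrame({
    # 2 distinct values over 100 rows: ratio 0.02, below the threshold
    "gender": pd.Series(["F", "M"] * 50, dtype=object),
    # 100 distinct values over 100 rows: ratio 1.0, left unchanged
    "customer": pd.Series([f"c{i}" for i in range(100)], dtype=object),
})

converted = to_category(marketing)

print(converted.dtypes)  # dtypes of the returned frame
print(marketing.dtypes)  # dtypes of the frame that was passed in
```

Because the function returns the result of astype and never assigns into df, the caller has to use the return value, for example marketing = to_category(marketing) or as a step in a pipe chain.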

huangapple
  • Posted on June 16, 2023, 13:14:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76487120.html