2023-02-27 08:15:38

Compute outliers 2 standard dev away for each pandas DataFrame column and replace with NaN

Question

I have the following DataFrame:

df = 

col_1   col_2   col_3
 0.5      7       300
 0.4      4       340
 0.6      8       276
 5.6      32      764
 11.2     98      1032

As shown above, the last two rows are outliers. I'm trying to compute the mean and standard deviation of each column. For each column, I'd like to replace any value more than 2 standard deviations from the mean with NaN, i.e. any value outside the range [mean - 2 SDs, mean + 2 SDs].

I was trying to do this with pd.DataFrame.mask or numpy, but couldn't get it to work.

Any help would be awesome, thanks!

Answer 1

Score: 2

df.loc works best for this sort of thing.

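import numpy as np
import pandas as pd

# Cast to float up front so every column can hold NaN
df = pd.DataFrame({"col_1": [0.5, 0.4, 0.6, 5.6, 11.2],
                   "col_2": [7, 4, 8, 32, 98],
                   "col_3": [300, 340, 276, 764, 1032]}).astype(float)

for col in df.columns:
    # Compute both bounds before changing anything in the column
    mean = df[col].mean()
    std = df[col].std(ddof=0)  # population SD, same as np.std
    outliers = (df[col] < mean - 2 * std) | (df[col] > mean + 2 * std)
    df.loc[outliers, col] = np.nan

Note that with only five rows, no value in this sample is more than 2 standard deviations from its column mean (the most extreme, 98 in col_2, is about 1.9 SDs above the mean), so df comes back unchanged here. On a longer column the same loop replaces the outliers with NaN.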
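
Since the question mentions pd.DataFrame.mask, here is a minimal sketch of a vectorized variant that avoids the explicit loop. The helper name mask_outliers, its n_sd and ddof parameters, and the demo frame are hypothetical, chosen only for illustration. It returns a new frame rather than modifying the input.

```python
import numpy as np
import pandas as pd


def mask_outliers(df: pd.DataFrame, n_sd: float = 2.0, ddof: int = 1) -> pd.DataFrame:
    """Return a copy of df where values more than n_sd standard
    deviations from their column mean are replaced with NaN."""
    mean = df.mean()
    std = df.std(ddof=ddof)
    # Subtracting/dividing by a Series aligns it with the frame's columns,
    # giving the absolute z-score of every cell.
    z = (df - mean).abs() / std
    # DataFrame.mask replaces cells where the condition is True (NaN by default).
    return df.mask(z > n_sd)


# Hypothetical demo data: ten rows, with one extreme value in column "a".
demo = pd.DataFrame({"a": [1.0] * 9 + [100.0],
                     "b": np.arange(10, dtype=float)})
cleaned = mask_outliers(demo)
```

In this demo only the 100.0 in column a should become NaN, while column b is left untouched. Keep in mind that pandas' std defaults to the sample standard deviation (ddof=1) while np.std defaults to the population one (ddof=0), so pass ddof=0 if you want the helper to agree with np.std.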
