Compute outliers 2 standard dev away for each pandas DataFrame column and replace with NaN


Question


I have the following DataFrame:

df = 

col_1   col_2   col_3
 0.5      7       300
 0.4      4       340
 0.6      8       276
 5.6      32      764
 11.2     98      1032

As shown above, the last two rows look like outliers. I'm trying to compute the mean and standard deviation of each column. For each column, I'd like to replace any value that is more than 2 standard deviations from the column mean with NaN, i.e. any value outside the range [mean - 2 SDs, mean + 2 SDs].

I was trying to do this with pd.DataFrame.mask or numpy, but couldn't get it to work.

Any help would be awesome, thanks!

Answer 1

Score: 2


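df.loc works well for this sort of thing. Compute both bounds before changing the column, and assign np.nan rather than 0:

import numpy as np
import pandas as pd

# Cast to float so the integer columns can hold NaN.
df = pd.DataFrame({"col_1": [0.5, 0.4, 0.6, 5.6, 11.2],
                   "col_2": [7, 4, 8, 32, 98],
                   "col_3": [300, 340, 276, 764, 1032]}).astype(float)

for col in df.columns:
    # Bounds are computed once, before any value in the column is replaced.
    mean = df[col].mean()
    std = df[col].std(ddof=0)  # population SD, the same as np.std
    outside = (df[col] < mean - 2 * std) | (df[col] > mean + 2 * std)
    df.loc[outside, col] = np.nan

With only five rows, none of the values above is actually more than 2 standard deviations from its column mean (the largest z-scores are about 1.77, 1.92 and 1.62 for col_1, col_2 and col_3), so this loop leaves the sample frame unchanged. On longer columns, or with a smaller multiplier, the same loop replaces the extreme values with NaN.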
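
The question mentions pd.DataFrame.mask, so here is a minimal sketch of that route without a loop. It assumes the population standard deviation (ddof=0, which is what np.std uses) is the one wanted. The helper name mask_outliers and the ten-row demo frame are hypothetical, made up for illustration only.

```python
import pandas as pd


def mask_outliers(frame: pd.DataFrame, n_std: float = 2.0, ddof: int = 0) -> pd.DataFrame:
    """Return a float copy of `frame` in which every value more than
    `n_std` standard deviations from its column mean is NaN."""
    frame = frame.astype(float)          # NaN needs float columns
    mean = frame.mean()                  # per-column means
    std = frame.std(ddof=ddof)           # per-column standard deviations
    z = (frame - mean) / std             # column-wise z-scores via broadcasting
    return frame.mask(z.abs() > n_std)   # NaN where the condition is True


# Hypothetical data: nine typical readings plus one extreme value per column.
demo = pd.DataFrame({
    "col_1": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6, 0.5, 0.4, 0.6, 11.2],
    "col_2": [7, 4, 8, 7, 4, 8, 7, 4, 8, 98],
})
cleaned = mask_outliers(demo)
print(cleaned)
```

In this demo frame only the last row of each column ends up as NaN. Passing ddof=1 switches to the sample standard deviation, which is the default of pandas' own std and gives slightly wider bounds.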

huangapple
  • Posted on 2023-02-27 08:15:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75575829.html