Compute outliers 2 standard deviations away for each pandas DataFrame column and replace with NaN
Question
I have the following DataFrame:
df =
col_1  col_2  col_3
  0.5      7    300
  0.4      4    340
  0.6      8    276
  5.6     32    764
 11.2     98   1032
As shown above, the last two rows are outliers. I'm trying to compute the mean and standard deviation of each column. For each column, I'd like to replace any value more than 2 standard deviations from the mean with NaN, i.e. replace outliers that fall outside the range [mean - 2 SDs, mean + 2 SDs].
I was trying to do this with pd.DataFrame.mask or numpy, but couldn't get it to work.
Any help would be awesome, thanks!
Answer 1
Score: 2
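df.loc works well for this sort of thing. Compute each column's bounds once, before assigning, so that the first replacement cannot shift the mean and standard deviation used for the second.
import pandas as pd
import numpy as np
df = pd.DataFrame({"col_1": [0.5, 0.4, 0.6, 5.6, 11.2],
                   "col_2": [7, 4, 8, 32, 98],
                   "col_3": [300, 340, 276, 764, 1032]})
df = df.astype(float)  # NaN needs float columns
for col in df.columns:
    # np.std defaults to the population standard deviation (ddof=0)
    mean, sd = np.mean(df[col]), np.std(df[col])
    outside = (df[col] < mean - 2 * sd) | (df[col] > mean + 2 * sd)
    df.loc[outside, col] = np.nan  # the question asks for NaN, not 0
df
On this five-row sample no value is actually replaced: the most extreme entry, 98 in col_2, is only about 1.92 standard deviations from its column mean, and with five observations no value can lie more than 2 population standard deviations from the mean. On a larger frame, the out-of-range cells become NaN.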
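Since the question mentions pd.DataFrame.mask, here is a minimal loop-free sketch of the same rule. The helper name mask_outliers and its parameters n_sd and ddof are made up for illustration; they are not part of pandas. It assumes all columns are numeric.

```python
import pandas as pd


def mask_outliers(frame: pd.DataFrame, n_sd: float = 2.0, ddof: int = 0) -> pd.DataFrame:
    """Return a copy of `frame` with values outside mean +/- n_sd * SD set to NaN.

    Hypothetical helper for illustration; assumes every column is numeric.
    """
    mean = frame.mean()
    sd = frame.std(ddof=ddof)  # ddof=0 mirrors np.std's default (population SD)
    # Compare each column's absolute deviation against that column's own threshold.
    too_far = (frame - mean).abs().gt(n_sd * sd, axis="columns")
    # DataFrame.mask replaces cells where the condition is True (NaN by default).
    return frame.mask(too_far)


df = pd.DataFrame({"col_1": [0.5, 0.4, 0.6, 5.6, 11.2],
                   "col_2": [7, 4, 8, 32, 98],
                   "col_3": [300, 340, 276, 764, 1032]})

cleaned = mask_outliers(df)            # 2-SD rule from the question
looser = mask_outliers(df, n_sd=1.5)   # a tighter band, for comparison
```

The original df is left untouched because mask returns a new frame. On the five-row sample, the 2-SD rule flags nothing, while n_sd=1.5 turns the last row into NaN. Note that pandas' own DataFrame.std defaults to ddof=1 (sample SD), so pass ddof explicitly if the two conventions need to agree.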