Why do these different outlier methods fail to detect outliers?
Question
I am trying to find the outliers by group in my dataframe. I have two grouping columns, Group1 and Group2, and I am trying to find the best way to implement an outlier method:
import numpy as np
import pandas as pd

data = {'Group1': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
        'Group2': ['C', 'C', 'C', 'C', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'C', 'D', 'D', 'C', 'C', 'D', 'D', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'D', 'C', 'C'],
        'Age': [20, 21, 19, 24, 11, 15, 18, 1, 17, 23, 35, 2000, 22, 24, 24, 18, 17, 19, 21, 22, 20, 25, 18, 24, 17, 19, 16, 18, 25, 23]}
df = pd.DataFrame(data)

# Method 1: group-wise mean and sample std (ddof=1) via transform
groups = df.groupby(['Group1', 'Group2'])
means = groups.Age.transform('mean')
stds = groups.Age.transform('std')
df['Flag'] = ~df.Age.between(means - stds*3, means + stds*3)

# Method 2: helper applied per group; np.std on a Series uses the population std (ddof=0)
def flag_outlier(x):
    lower_limit = np.mean(x) - np.std(x) * 3
    upper_limit = np.mean(x) + np.std(x) * 3
    return (x > upper_limit) | (x < lower_limit)
# group_keys=False keeps the original row index so the result aligns with df
df['Flag2'] = df.groupby(['Group1', 'Group2'], group_keys=False)['Age'].apply(flag_outlier)

# Method 3: same rule as Method 1, written as a lambda
df["Flag3"] = df.groupby(['Group1', 'Group2'])['Age'].transform(lambda x: (x - x.mean()).abs() > 3*x.std())
However, all 3 methods fail to detect obvious outliers. For example, when Age is 2000, none of these methods treat it as an outlier. Is there a reason for this? Or is it possible that my code for all three outlier detection methods is incorrect?
I have a strong feeling I've made a foolish mistake somewhere but I'm not sure where, so any help would be appreciated, thanks!
Answer 1
Score: 1
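Within its group (Group1 = A, Group2 = C), that age of 2000 just isn't more than 3 standard deviations away from the group mean. The group mean is 239.666667 and the group standard deviation is 660.129722, so 2000 sits only about 2.67 standard deviations above the mean, while the upper limit (mean + 3 * std) is roughly 2220.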
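The numbers above can be reproduced with a short check. This is a minimal sketch that rebuilds the question's dataframe and looks only at the (A, C) group; the variable names `ages`, `group_mean`, `group_std`, `z_score` and `upper_limit` are illustrative and not part of the original code.

```python
import pandas as pd

data = {'Group1': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
        'Group2': ['C', 'C', 'C', 'C', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'C', 'D', 'D', 'C', 'C', 'D', 'D', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'D', 'C', 'C'],
        'Age': [20, 21, 19, 24, 11, 15, 18, 1, 17, 23, 35, 2000, 22, 24, 24, 18, 17, 19, 21, 22, 20, 25, 18, 24, 17, 19, 16, 18, 25, 23]}
df = pd.DataFrame(data)

# Ages in the group that contains the value 2000 (Group1 == 'A', Group2 == 'C').
ages = df.loc[(df['Group1'] == 'A') & (df['Group2'] == 'C'), 'Age']

group_mean = ages.mean()  # the 2000 is included, so the mean is pulled upward
group_std = ages.std()    # pandas default: sample standard deviation (ddof=1)

# Distance of 2000 from the group mean, measured in group standard deviations.
z_score = (2000 - group_mean) / group_std
upper_limit = group_mean + 3 * group_std

print(f"n = {len(ages)}, mean = {group_mean:.6f}, std = {group_std:.6f}")
print(f"z-score of 2000 = {z_score:.3f}")
print(f"mean + 3*std = {upper_limit:.2f}")
```

Because 2000 is itself part of the group, it inflates both the mean and the standard deviation it is being compared against, which is why it stays inside the 3-sigma band.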
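It might look like an obvious outlier to you, but its group has only nine values and the 2000 itself inflates both the mean and the standard deviation, so you don't have enough data to label it an outlier by that standard.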
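The "not enough data" point can be made concrete. With the sample standard deviation (ddof=1), no single value in a group of n points can lie more than (n - 1) / sqrt(n) standard deviations from the group mean (a consequence of Samuelson's inequality), so a 3-sigma rule cannot fire in groups smaller than 11. The sketch below prints that ceiling for each group. It then shows one possible alternative that the answer does not propose: a modified z-score based on the median and the median absolute deviation (MAD). The helper name `flag_outlier_mad`, the column name `FlagMAD`, the 0.6745 scaling constant and the 3.5 cutoff are assumptions chosen for illustration (the constant and cutoff follow a commonly cited convention).

```python
import numpy as np
import pandas as pd

data = {'Group1': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
        'Group2': ['C', 'C', 'C', 'C', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'C', 'D', 'D', 'C', 'C', 'D', 'D', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'D', 'C', 'C'],
        'Age': [20, 21, 19, 24, 11, 15, 18, 1, 17, 23, 35, 2000, 22, 24, 24, 18, 17, 19, 21, 22, 20, 25, 18, 24, 17, 19, 16, 18, 25, 23]}
df = pd.DataFrame(data)
grouped = df.groupby(['Group1', 'Group2'])['Age']

# Largest z-score any single point can reach in a sample of size n when the
# sample standard deviation (ddof=1) is used: (n - 1) / sqrt(n).
# With the population std (ddof=0, as np.std uses) the ceiling is sqrt(n - 1).
sizes = grouped.size()
max_z = (sizes - 1) / np.sqrt(sizes)
print(pd.DataFrame({'n': sizes, 'max_possible_z': max_z.round(3)}))


def flag_outlier_mad(x, threshold=3.5):
    """Hypothetical helper: flag values by a median/MAD modified z-score."""
    median = x.median()
    mad = (x - median).abs().median()
    if mad == 0:
        # Assumption: with no spread around the median, flag nothing.
        return pd.Series(False, index=x.index)
    return 0.6745 * (x - median).abs() / mad > threshold


# astype(bool) guards against pandas versions that cast the result to int.
df['FlagMAD'] = grouped.transform(flag_outlier_mad).astype(bool)
print(df.loc[df['FlagMAD'], ['Group1', 'Group2', 'Age']])
```

Every group here has fewer than 11 rows, so the ceiling is below 3 in all four of them and the three original flags can never be True on this data. The median-based rule is less affected by the extreme value itself, but its cutoff is a judgment call and should be checked against what counts as an outlier in the actual application.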