Why do these different outlier methods fail to detect outliers?

Question

I am trying to find outliers by group in my DataFrame. I have two grouping columns, Group1 and Group2, and I am looking for the best way to implement an outlier detection method. These are the three approaches I have tried:

import numpy as np
import pandas as pd

data = {
    'Group1': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'Group2': ['C', 'C', 'C', 'C', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'C', 'D', 'D', 'C', 'C', 'D', 'D', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'D', 'C', 'C'],
    'Age': [20, 21, 19, 24, 11, 15, 18, 1, 17, 23, 35, 2000, 22, 24, 24, 18, 17, 19, 21, 22, 20, 25, 18, 24, 17, 19, 16, 18, 25, 23],
}
df = pd.DataFrame(data)

# Method 1: broadcast each group's mean and std back to the rows, then test the 3-sigma band
groups = df.groupby(['Group1', 'Group2'])
means = groups.Age.transform('mean')
stds = groups.Age.transform('std')

df['Flag'] = ~df.Age.between(means - stds*3, means + stds*3)

# Method 2: the same 3-sigma rule, written as a function applied to each group
def flag_outlier(x):
    lower_limit = np.mean(x) - np.std(x) * 3
    upper_limit = np.mean(x) + np.std(x) * 3
    return (x > upper_limit) | (x < lower_limit)

# group_keys=False keeps the original row index so the result aligns with df
df['Flag2'] = df.groupby(['Group1', 'Group2'], group_keys=False)['Age'].apply(flag_outlier)

# Method 3: the same rule as a one-line transform
df["Flag3"] = df.groupby(['Group1', 'Group2'])['Age'].transform(lambda x: (x - x.mean()).abs() > 3*x.std())

However, all 3 methods fail to detect obvious outliers - for example, when Age is 2000, none of these methods treat it as an outlier. Is there a reason for this? Or is it possible that my code for all three outlier detection models is incorrect?

I have a strong feeling I've made a foolish mistake somewhere but I'm not sure where, so any help would be appreciated, thanks!

Answer 1

Score: 1

Within its group, that age of 2000 just isn't over 3 standard deviations away from the group mean. The group mean is 239.666667 and the group standard deviation is 660.129722.
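
These figures can be reproduced directly. The following is a minimal sketch that copies the nine Age values of the (Group1 = 'A', Group2 = 'C') group from the question's data; the variable names (`ages`, `z`, `max_possible_z`) are illustrative and not from the original post. It also computes a general bound: with the sample standard deviation (ddof=1, the pandas default), no single value in a sample of size n can lie more than (n - 1)/sqrt(n) standard deviations from the sample mean. For n = 9 that bound is 8/3, so the 3-standard-deviation rule cannot fire in this group however extreme the value is.

```python
import numpy as np
import pandas as pd

# Ages of the rows where Group1 == 'A' and Group2 == 'C', copied from the question.
ages = pd.Series([20, 21, 19, 18, 2000, 22, 24, 17, 16], name="Age")

mean = ages.mean()   # group mean
std = ages.std()     # sample standard deviation (pandas default ddof=1)

# Distance of every value from the group mean, in units of the group std.
z = (ages - mean).abs() / std

print(f"mean = {mean:.6f}, std = {std:.6f}")
print(f"z-score of the largest value: {z.max():.4f}")

# Upper bound on any single z-score in a sample of size n (with ddof=1).
n = len(ages)
max_possible_z = (n - 1) / np.sqrt(n)
print(f"largest possible z-score for n = {n}: {max_possible_z:.4f}")
```

By the same bound, a group needs at least 11 rows before any single point can exceed 3 sample standard deviations.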

It might look like an obvious outlier to you, but you don't have enough data to label it an outlier by that standard.
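
If the goal is to flag values like 2000 even in small groups, one option is to switch to a standard that the outlier itself cannot inflate. The sketch below uses the per-group median and median absolute deviation (MAD). This is a different technique from the 3-standard-deviation rule in the question, and it is not something the answer above proposed. The function name `mad_outlier`, the column name `FlagMAD`, the scaling constant 0.6745 and the cutoff 3.5 are assumptions (the two constants are commonly used for the modified z-score) and should be tuned to the data.

```python
import pandas as pd

data = {
    'Group1': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'Group2': ['C', 'C', 'C', 'C', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'C', 'D', 'D', 'C', 'C', 'D', 'D', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'D', 'C', 'C'],
    'Age': [20, 21, 19, 24, 11, 15, 18, 1, 17, 23, 35, 2000, 22, 24, 24, 18, 17, 19, 21, 22, 20, 25, 18, 24, 17, 19, 16, 18, 25, 23],
}
df = pd.DataFrame(data)


def mad_outlier(x: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Flag values whose modified z-score (median/MAD based) exceeds `threshold`."""
    median = x.median()
    mad = (x - median).abs().median()
    if mad == 0:
        # Degenerate group (at least half the values equal the median):
        # this sketch conservatively flags nothing rather than dividing by zero.
        return pd.Series(False, index=x.index)
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # for normally distributed data (assumed constant, see lead-in).
    modified_z = 0.6745 * (x - median).abs() / mad
    return modified_z > threshold


# transform returns one boolean per row, aligned with df's index.
df['FlagMAD'] = df.groupby(['Group1', 'Group2'])['Age'].transform(mad_outlier)

print(df[df['FlagMAD'].astype(bool)])
```

On the question's data this flags the Age of 2000 in group (A, C) and the Age of 1 in group (A, D), and nothing else. Because the median and MAD barely move when a single extreme value is added, the cutoff stays meaningful even for groups of fewer than ten rows.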

Posted by huangapple on March 7, 2023, 08:25:46
When reposting, please keep the link to this article: https://go.coder-hub.com/75657025.html