March 7, 2023, 08:25:46

Why do these different outlier methods fail to detect outliers?

Question

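I am trying to find outliers by group in my DataFrame. I have two grouping columns, Group1 and Group2, and I am looking for the best way to implement outlier detection.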

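import numpy as np
import pandas as pd

data = {
    'Group1': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'Group2': ['C', 'C', 'C', 'C', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'C', 'D', 'D', 'C', 'C', 'D', 'D', 'D', 'D', 'C', 'D', 'C', 'C', 'D', 'C', 'D', 'C', 'C'],
    'Age': [20, 21, 19, 24, 11, 15, 18, 1, 17, 23, 35, 2000, 22, 24, 24, 18, 17, 19, 21, 22, 20, 25, 18, 24, 17, 19, 16, 18, 25, 23],
}
df = pd.DataFrame(data)

# Method 1: flag values outside mean +/- 3 sample standard deviations, per group
groups = df.groupby(['Group1', 'Group2'])
means = groups.Age.transform('mean')
stds = groups.Age.transform('std')
df['Flag'] = ~df.Age.between(means - stds * 3, means + stds * 3)

# Method 2: same rule via a helper (np.std defaults to ddof=0, the population std)
def flag_outlier(x):
    lower_limit = np.mean(x) - np.std(x) * 3
    upper_limit = np.mean(x) + np.std(x) * 3
    return (x > upper_limit) | (x < lower_limit)
# transform (rather than apply) keeps the original row index, so the result aligns with df
df['Flag2'] = df.groupby(['Group1', 'Group2'])['Age'].transform(flag_outlier)

# Method 3: absolute deviation from the group mean compared with 3 sample stds
df['Flag3'] = df.groupby(['Group1', 'Group2'])['Age'].transform(lambda x: (x - x.mean()).abs() > 3 * x.std())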

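However, all three methods fail to detect obvious outliers. For example, when Age is 2000, none of them flags it as an outlier. Is there a reason for this? Or is my code for all three outlier detection methods incorrect?

I have a strong feeling I've made a silly mistake somewhere, but I'm not sure where, so any help would be appreciated. Thanks!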

Answer 1

Score: 1

Within its group, that age of 2000 just isn't more than 3 standard deviations away from the group mean. The group mean is 239.666667 and the group standard deviation is 660.129722, so 2000 sits only about 2.67 standard deviations above the mean.
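
These figures can be checked by hand. The sketch below uses only the standard library and assumes the list `ages_ac` (a name chosen here for illustration) holds the Age values of the rows where Group1 is 'A' and Group2 is 'C', copied manually from the question's data. It computes the distance of 2000 from the group mean in units of both the sample standard deviation (what pandas `.std()` uses by default) and the population standard deviation (what `np.std` uses by default).

```python
import statistics

# Age values for the rows with Group1 == 'A' and Group2 == 'C',
# copied by hand from the question's data (the group containing 2000).
ages_ac = [20, 21, 19, 18, 2000, 22, 24, 17, 16]

mean = statistics.mean(ages_ac)
sample_std = statistics.stdev(ages_ac)       # ddof=1, the pandas .std() default
population_std = statistics.pstdev(ages_ac)  # ddof=0, the np.std() default

# Distance of the suspicious value from the mean, in standard deviations
z_sample = (2000 - mean) / sample_std
z_population = (2000 - mean) / population_std

print(f"mean = {mean:.2f}")                        # mean = 239.67
print(f"sample std = {sample_std:.2f}")            # sample std = 660.13
print(f"z (sample std) = {z_sample:.2f}")          # z (sample std) = 2.67
print(f"z (population std) = {z_population:.2f}")  # z (population std) = 2.83
print(z_sample > 3, z_population > 3)              # False False
```

Either way the value stays under the 3-standard-deviation cutoff, which is consistent with all three flags in the question coming out False for that row.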

It might look like an obvious outlier to you, but you don't have enough data to label it an outlier by that standard.
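
The "not enough data" point can be made concrete. A known bound (Samuelson's inequality, which the answer itself does not mention) implies that among n values, no single point can lie more than (n - 1) / sqrt(n) sample standard deviations from the mean, because an extreme value inflates the standard deviation along with its own deviation. The sketch below, with the hypothetical helper names `max_sample_z` and `min_group_size`, evaluates that bound for a group of 9 rows and searches for the smallest group size at which a 3-standard-deviation rule could flag anything at all.

```python
import math

def max_sample_z(n: int) -> float:
    """Largest |x - mean| / std (with ddof=1) a single point can reach among n points."""
    return (n - 1) / math.sqrt(n)

def min_group_size(threshold: float) -> int:
    """Smallest n for which some point could exceed `threshold` sample stds."""
    n = 2
    while max_sample_z(n) <= threshold:
        n += 1
    return n

print(f"{max_sample_z(9):.3f}")  # 2.667
print(min_group_size(3))         # 11
```

Under these assumptions, a group of 9 rows can never produce a value beyond roughly 2.667 sample standard deviations, however extreme that value is, so the 3-standard-deviation rule only becomes usable once a group has at least 11 rows.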
