2023年7月17日 17:32:10go评论105阅读模式

英文:

Selecting all rows which contain values greater than a percentage of Average

问题

我有一个DataFrame，其中有3个数值列A、B、C。我需要提取仅包含这3个列A、B、C中的所有值都大于其行平均值的40%的行。

df = pd.DataFrame([['AA', 10, 8, 12], ['BB', 10, 2, 18], ['CC', 10, 6, 14]],
                  columns=['ID', 'A', 'B', 'C'])
print(df)

我的意思是以下内容：对于第一行，A、B、C的平均值是30/3=10，我希望在第一行中所有值，无论是A、B还是C，都应大于10的40%，即4。类似地，对于第二行和第三行也是如此。如果有任何一个元素小于这个值，我们就移除该行。

**我的尝试：**我使用了any()函数，但在涉及到列的平均值时无法帮助我。我总是得到一个空的DataFrame。

df = df[(df[['A', 'B', 'C']] > (0.4 * df[['A', 'B', 'C']].mean(axis=1))).all(1)]
print(df)

我期望的结果是：

   ID   A  B   C
0  AA  10  8  12
2  CC  10  6  14

所有行的平均值都是10，所以如果我将其硬编码，就会工作，像这样：

df[(df[['A', 'B', 'C']] > 0.4 * 10).all(1)]

如何以动态方式实现这个目标？
谢谢。

英文:

I have a DataFrame, which have 3 numeric columns A,B,C. I need to extract only those rows where values in all these 3 columns A,B,C is more than 40% of their row average.

df = pd.DataFrame([[&#39;AA&#39;,10,8,12],[&#39;BB&#39;,10,2,18],[&#39;CC&#39;,10,6,14]],
                  columns=[&#39;ID&#39;,&#39;A&#39;, &#39;B&#39;, &#39;C&#39;])
print(df)
	ID	A	B	C
0	AA	10	8	12
1	BB	10	2	18
2	CC	10	6	14

I mean the following: For Row 1, the mean of A,B,C is 30/3=10, and I want that in Row 1 all values, be it A or B or C should be more than 40% of 10, i.e; 4. Similarly, for Row 2 and Row 3. In case even one element is less than that, we remove that row.

My attempt: I used any() function, but that does't help me when involving average of columns. I always get empty DF.

df = df[(df[[&#39;A&#39;,&#39;B&#39;,&#39;C&#39;]] &gt; (0.4*df[[&#39;A&#39;,&#39;B&#39;,&#39;C&#39;]].mean(axis=1))).all(1)]
print(df)
ID	A	B	C

I was expecting this:

	ID	A	B	C
0	AA	10	8	12
2	CC	10	6	14

The average of all rows is 10, so if I would have hardcoded it, it would work, like this:

df[(df[[&#39;A&#39;,&#39;B&#39;,&#39;C&#39;]] &gt; 0.4*10).all(1)]

How can I do this dynamically?
Thanks.

答案1

得分: 2

你可以在与列的均值的40%进行比较后，使用 loc，并与 all 进行聚合。gt (>) 用于方便处理比较：

tmp = df.drop(columns='ID')
out = df.loc[tmp.gt(tmp.mean(axis=1).mul(0.4), axis=0).all(axis=1)]

替代方法：

cols = ['A', 'B', 'C']
df.loc[df[cols].gt(df[cols].mean(axis=1).mul(0.4), axis=0).all(axis=1)]

输出：

   ID   A  B   C
0  AA  10  8  12
2  CC  10  6  14

英文:

You can use, loc after comparing each value to 40% of the columns mean, and aggregating with all. gt (>) is used for easy handling of the comparison:

tmp = df.drop(columns=&#39;ID&#39;)
out = df.loc[tmp.gt(tmp.mean(axis=1).mul(0.4), axis=0).all(axis=1)]

Alternative:

cols = [&#39;A&#39;, &#39;B&#39;, &#39;C&#39;]
df.loc[df[cols].gt(df[cols].mean(axis=1).mul(0.4), axis=0).all(axis=1)]

Output:

   ID   A  B   C
0  AA  10  8  12
2  CC  10  6  14

答案2

得分: 1

另一个可能的解决方案：

a = df[['A', 'B', 'C']].values
df[(a > 0.4 * (a.mean(1)[:, None])).all(1)]

输出：

   ID   A  B   C
0  AA  10  8  12
2  CC  10  6  14

英文:

Another possible solution:

a = df[[&#39;A&#39;, &#39;B&#39;, &#39;C&#39;]].values
df[(a &gt; 0.4*(a.mean(1)[:, None])).all(1)]

Output:

   ID   A  B   C
0  AA  10  8  12
2  CC  10  6  14

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

选择所有包含大于平均值百分比的值的行

问题

答案1

答案2

使用lapply在R中出现问题，该函数在全局环境中使用常量。

添加最高分

从头创建要素图层

VS Code无法打印输出。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。