选择所有包含大于平均值百分比的值的行

huangapple go评论62阅读模式
英文:

Selecting all rows which contain values greater than a percentage of Average

问题

我有一个DataFrame,其中有3个数值列A、B、C。我需要提取仅包含这3个列A、B、C中的所有值都大于其行平均值的40%的行。

df = pd.DataFrame([['AA', 10, 8, 12], ['BB', 10, 2, 18], ['CC', 10, 6, 14]],
                  columns=['ID', 'A', 'B', 'C'])
print(df)

我的意思是以下内容:对于第一行,A、B、C的平均值是30/3=10,我希望在第一行中所有值,无论是A、B还是C,都应大于10的40%,即4。类似地,对于第二行和第三行也是如此。如果有任何一个元素小于这个值,我们就移除该行。

**我的尝试:**我使用了any()函数,但在涉及到列的平均值时无法帮助我。我总是得到一个空的DataFrame。

df = df[(df[['A', 'B', 'C']] > (0.4 * df[['A', 'B', 'C']].mean(axis=1))).all(1)]
print(df)

我期望的结果是:

   ID   A  B   C
0  AA  10  8  12
2  CC  10  6  14

所有行的平均值都是10,所以如果我将其硬编码,就会工作,像这样:

df[(df[['A', 'B', 'C']] > 0.4 * 10).all(1)]

如何以动态方式实现这个目标?
谢谢。

英文:

I have a DataFrame, which have 3 numeric columns A,B,C. I need to extract only those rows where values in all these 3 columns A,B,C is more than 40% of their row average.

df = pd.DataFrame([['AA',10,8,12],['BB',10,2,18],['CC',10,6,14]],
                  columns=['ID','A', 'B', 'C'])
print(df)
	ID	A	B	C
0	AA	10	8	12
1	BB	10	2	18
2	CC	10	6	14

I mean the following: For Row 1, the mean of A,B,C is 30/3=10, and I want that in Row 1 all values, be it A or B or C should be more than 40% of 10, i.e; 4. Similarly, for Row 2 and Row 3. In case even one element is less than that, we remove that row.

My attempt: I used any() function, but that does't help me when involving average of columns. I always get empty DF.

df = df[(df[['A','B','C']] > (0.4*df[['A','B','C']].mean(axis=1))).all(1)]
print(df)
ID	A	B	C

I was expecting this:

	ID	A	B	C
0	AA	10	8	12
2	CC	10	6	14

The average of all rows is 10, so if I would have hardcoded it, it would work, like this:

df[(df[['A','B','C']] > 0.4*10).all(1)]

How can I do this dynamically?
Thanks.

答案1

得分: 2

你可以在与列的均值的40%进行比较后,使用 loc,并与 all 进行聚合。gt (>) 用于方便处理比较:

tmp = df.drop(columns='ID')

out = df.loc[tmp.gt(tmp.mean(axis=1).mul(0.4), axis=0).all(axis=1)]

替代方法:

cols = ['A', 'B', 'C']

df.loc[df[cols].gt(df[cols].mean(axis=1).mul(0.4), axis=0).all(axis=1)]

输出:

   ID   A  B   C
0  AA  10  8  12
2  CC  10  6  14
英文:

You can use, loc after comparing each value to 40% of the columns mean, and aggregating with all. gt (>) is used for easy handling of the comparison:

tmp = df.drop(columns='ID')

out = df.loc[tmp.gt(tmp.mean(axis=1).mul(0.4), axis=0).all(axis=1)]

Alternative:

cols = ['A', 'B', 'C']

df.loc[df[cols].gt(df[cols].mean(axis=1).mul(0.4), axis=0).all(axis=1)]

Output:

   ID   A  B   C
0  AA  10  8  12
2  CC  10  6  14

答案2

得分: 1

另一个可能的解决方案:

a = df[['A', 'B', 'C']].values

df[(a > 0.4 * (a.mean(1)[:, None])).all(1)]

输出:

   ID   A  B   C
0  AA  10  8  12
2  CC  10  6  14
英文:

Another possible solution:

a = df[['A', 'B', 'C']].values

df[(a > 0.4*(a.mean(1)[:, None])).all(1)]

Output:

   ID   A  B   C
0  AA  10  8  12
2  CC  10  6  14

huangapple
  • 本文由 发表于 2023年7月17日 17:32:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/76703126.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定