英文:
Selecting all rows which contain values greater than a percentage of Average
问题
我有一个DataFrame,其中有3个数值列A、B、C。我需要提取仅包含这3个列A、B、C中的所有值都大于其行平均值的40%的行。
df = pd.DataFrame([['AA', 10, 8, 12], ['BB', 10, 2, 18], ['CC', 10, 6, 14]],
columns=['ID', 'A', 'B', 'C'])
print(df)
我的意思是以下内容:对于第一行,A、B、C的平均值是30/3=10,我希望在第一行中所有值,无论是A、B还是C,都应大于10的40%,即4。类似地,对于第二行和第三行也是如此。如果有任何一个元素小于这个值,我们就移除该行。
**我的尝试:**我使用了any()
函数,但在涉及到列的平均值时无法帮助我。我总是得到一个空的DataFrame。
df = df[(df[['A', 'B', 'C']] > (0.4 * df[['A', 'B', 'C']].mean(axis=1))).all(1)]
print(df)
我期望的结果是:
ID A B C
0 AA 10 8 12
2 CC 10 6 14
所有行的平均值都是10,所以如果我将其硬编码,就会工作,像这样:
df[(df[['A', 'B', 'C']] > 0.4 * 10).all(1)]
如何以动态方式实现这个目标?
谢谢。
英文:
I have a DataFrame
, which have 3 numeric columns A,B,C
. I need to extract only those rows where values in all these 3 columns A,B,C
is more than 40% of their row average.
df = pd.DataFrame([['AA',10,8,12],['BB',10,2,18],['CC',10,6,14]],
columns=['ID','A', 'B', 'C'])
print(df)
ID A B C
0 AA 10 8 12
1 BB 10 2 18
2 CC 10 6 14
I mean the following: For Row 1, the mean of A,B,C is 30/3=10, and I want that in Row 1 all values, be it A or B or C should be more than 40% of 10, i.e; 4. Similarly, for Row 2 and Row 3. In case even one element is less than that, we remove that row.
My attempt: I used any()
function, but that does't help me when involving average of columns. I always get empty DF
.
df = df[(df[['A','B','C']] > (0.4*df[['A','B','C']].mean(axis=1))).all(1)]
print(df)
ID A B C
I was expecting this:
ID A B C
0 AA 10 8 12
2 CC 10 6 14
The average of all rows is 10, so if I would have hardcoded it, it would work, like this:
df[(df[['A','B','C']] > 0.4*10).all(1)]
How can I do this dynamically?
Thanks.
答案1
得分: 2
你可以在与列的均值的40%进行比较后,使用 loc
,并与 all
进行聚合。gt
(>
) 用于方便处理比较:
tmp = df.drop(columns='ID')
out = df.loc[tmp.gt(tmp.mean(axis=1).mul(0.4), axis=0).all(axis=1)]
替代方法:
cols = ['A', 'B', 'C']
df.loc[df[cols].gt(df[cols].mean(axis=1).mul(0.4), axis=0).all(axis=1)]
输出:
ID A B C
0 AA 10 8 12
2 CC 10 6 14
英文:
You can use, loc
after comparing each value to 40% of the columns mean
, and aggregating with all
. gt
(>
) is used for easy handling of the comparison:
tmp = df.drop(columns='ID')
out = df.loc[tmp.gt(tmp.mean(axis=1).mul(0.4), axis=0).all(axis=1)]
Alternative:
cols = ['A', 'B', 'C']
df.loc[df[cols].gt(df[cols].mean(axis=1).mul(0.4), axis=0).all(axis=1)]
Output:
ID A B C
0 AA 10 8 12
2 CC 10 6 14
答案2
得分: 1
另一个可能的解决方案:
a = df[['A', 'B', 'C']].values
df[(a > 0.4 * (a.mean(1)[:, None])).all(1)]
输出:
ID A B C
0 AA 10 8 12
2 CC 10 6 14
英文:
Another possible solution:
a = df[['A', 'B', 'C']].values
df[(a > 0.4*(a.mean(1)[:, None])).all(1)]
Output:
ID A B C
0 AA 10 8 12
2 CC 10 6 14
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论