pandas: Filter a whole group out based on a condition
Question
I've created the following MWE:
import pandas as pd
data = {'Name': ['Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom'], 'Article': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B'], 'Weekday': [1, 2, 3, 2, 3, 1, 2, 3, 1, 2], 'Value': [1, 40, 3, 91, 10, 6, 9, 10, 20, 10]}
df_test = pd.DataFrame(data)
  Name Article  Weekday  Value
0  Tom       A        1      1
1  Tom       A        2     40
2  Tom       A        3      3
3  Tom       B        2     91
4  Tom       B        3     10
5  Tom       A        1      6
6  Tom       A        2      9
7  Tom       A        3     10
8  Tom       B        1     20
9  Tom       B        2     10
Here, a group is a Name-Article pair. I want to filter out every group that doesn't have at least 2 rows (counts) for each of the three existing weekdays. Only groups that cover all three weekdays (1, 2, 3) with at least two rows each should remain. A Name-Article pair that reaches two rows on only two of the weekdays should be filtered out as well.
The expected output should look like this:
  Name Article  Weekday  Value
0  Tom       A        1      1
1  Tom       A        2     40
2  Tom       A        3      3
5  Tom       A        1      6
6  Tom       A        2      9
7  Tom       A        3     10
Answer 1
Score: 3
If you want to ensure at least 2 counts per Name/Article per Weekday, you can compute a crosstab to count the combinations of Name/Article and Weekday.
Then you can apply any filter you want; here we keep the Name/Article combinations whose counts are all at least 2:
counts = pd.crosstab([df_test['Name'], df_test['Article']], df_test['Weekday'])
keep = counts[counts.ge(2).all(axis=1)]
out = df_test.set_index(['Name', 'Article']).loc[keep.index].reset_index()
# or
# out = df_test.merge(keep[[]].reset_index())
To require not all days but only a given number of days (e.g. ≥3) with at least 2 counts, use:
keep = counts[counts.ge(2).sum(axis=1).ge(3)]
Output:
  Name Article  Weekday  Value
0  Tom       A        1      1
1  Tom       A        2     40
2  Tom       A        3      3
3  Tom       A        1      6
4  Tom       A        2      9
5  Tom       A        3     10
Intermediate counts:
Weekday       1  2  3
Name Article
Tom  A        2  2  2  # all 3 have ≥ 2, we keep
     B        1  2  1  # not all ≥ 2, discard
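Note that the set_index/reset_index round trip above renumbers the rows 0 to 5, while the expected output in the question keeps the original labels 0, 1, 2, 5, 6, 7. A minimal sketch of one way to keep them, assuming the same df_test as in the question, is to use the crosstab result only to build a boolean mask. The names ok, keep_keys and mask are illustrative and not part of the original answer.

```python
import pandas as pd

data = {'Name': ['Tom'] * 10,
        'Article': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B'],
        'Weekday': [1, 2, 3, 2, 3, 1, 2, 3, 1, 2],
        'Value': [1, 40, 3, 91, 10, 6, 9, 10, 20, 10]}
df_test = pd.DataFrame(data)

# Count rows per (Name, Article) x Weekday; absent combinations become 0.
counts = pd.crosstab([df_test['Name'], df_test['Article']], df_test['Weekday'])

# True for the (Name, Article) pairs where every weekday has at least 2 rows.
ok = counts.ge(2).all(axis=1)
keep_keys = ok[ok].index

# Boolean mask over the original rows, so the original index labels survive.
mask = df_test.set_index(['Name', 'Article']).index.isin(keep_keys)
out = df_test[mask]
print(out)
```

Because the frame is only masked, never re-indexed, the column order and row labels of df_test are left untouched.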
Answer 2
Score: 1
Use boolean indexing for filtering:
s = df_test.groupby(['Name', 'Article', 'Weekday']).size()
m = s.ge(2).groupby(level=[0, 1]).sum().ge(3)
df = df_test[df_test.set_index(['Name', 'Article']).index.isin(m.index[m])]
print(df)
  Name Article  Weekday  Value
0  Tom       A        1      1
1  Tom       A        2     40
2  Tom       A        3      3
5  Tom       A        1      6
6  Tom       A        2      9
7  Tom       A        3     10
How it works:
# get counts per Name/Article/Weekday
print(df_test.groupby(['Name', 'Article', 'Weekday']).size())
Name  Article  Weekday
Tom   A        1          2
               2          2
               3          2
      B        1          1
               2          2
               3          1
dtype: int64
# test whether each count is greater than or equal to 2
print(s.ge(2))
Tom   A        1           True
               2           True
               3           True
      B        1          False
               2           True
               3          False
dtype: bool
# count the True values per Name/Article
print(s.ge(2).groupby(level=[0, 1]).sum())
Tom   A    3
      B    1
dtype: int64
# test whether that number is greater than or equal to 3
print(s.ge(2).groupby(level=[0, 1]).sum().ge(3))
Tom   A     True
      B    False
# keep only the Name/Article pairs that passed
print(m.index[m])
MultiIndex([('Tom', 'A')],
           names=['Name', 'Article'])
# flag the rows of the original DataFrame that belong to those Name/Article pairs
print(df_test.set_index(['Name', 'Article']).index.isin(m.index[m]))
[ True  True  True False False  True  True  True False False]
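The steps above can also be folded into a single GroupBy.filter call, which evaluates a predicate once per Name/Article group. This is a sketch of an alternative, not the answer's own method; weekdays and has_enough_rows are hypothetical names, and calling a Python function per group will likely be slower than the vectorized version when there are many groups.

```python
import pandas as pd

data = {'Name': ['Tom'] * 10,
        'Article': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B'],
        'Weekday': [1, 2, 3, 2, 3, 1, 2, 3, 1, 2],
        'Value': [1, 40, 3, 91, 10, 6, 9, 10, 20, 10]}
df_test = pd.DataFrame(data)

# All weekdays that exist anywhere in the data (assumption: these are the
# weekdays every group is required to cover).
weekdays = sorted(df_test['Weekday'].unique())


def has_enough_rows(group, min_count=2):
    """Return True if the group has at least min_count rows on every weekday."""
    per_day = group['Weekday'].value_counts().reindex(weekdays, fill_value=0)
    return bool(per_day.ge(min_count).all())


# filter() keeps the rows of every group for which the predicate is True.
out = df_test.groupby(['Name', 'Article']).filter(has_enough_rows)
print(out)
```

The reindex with fill_value=0 is what makes a weekday that is entirely missing from a group count as zero rows rather than being skipped.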
A similar solution, if you need to test whether all Weekday counts within each Name/Article pair are 2 or more, uses GroupBy.all:
s = df_test.groupby(['Name', 'Article', 'Weekday']).size()
m = s.ge(2).groupby(level=[0, 1]).all()
df1 = df_test[df_test.set_index(['Name', 'Article']).index.isin(m.index[m])]
print(df1)
  Name Article  Weekday  Value
0  Tom       A        1      1
1  Tom       A        2     40
2  Tom       A        3      3
5  Tom       A        1      6
6  Tom       A        2      9
7  Tom       A        3     10
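One difference between the variants may be worth checking on your own data: groupby(...).size() lists only the Weekday values that actually occur in a group, so the GroupBy.all test cannot see a weekday that is missing entirely, whereas the sum().ge(3) test and a crosstab (which fills absent combinations with 0) can. The sketch below uses made-up data, with a hypothetical article C that has only weekdays 1 and 2, to show the three results side by side.

```python
import pandas as pd

# Made-up data: article A covers weekdays 1-3 twice each,
# article C covers only weekdays 1 and 2 (twice each).
df_demo = pd.DataFrame({
    'Name': ['Tom'] * 10,
    'Article': ['A'] * 6 + ['C'] * 4,
    'Weekday': [1, 2, 3, 1, 2, 3, 1, 2, 1, 2],
    'Value': [1, 40, 3, 6, 9, 10, 5, 6, 7, 8],
})

s = df_demo.groupby(['Name', 'Article', 'Weekday']).size()

# all(): only the weekdays present in the group are tested.
by_all = s.ge(2).groupby(level=[0, 1]).all()

# sum().ge(3): requires three weekdays with at least 2 rows.
by_sum = s.ge(2).groupby(level=[0, 1]).sum().ge(3)

# crosstab: the missing weekday 3 shows up as a 0 count for C.
counts = pd.crosstab([df_demo['Name'], df_demo['Article']], df_demo['Weekday'])
by_crosstab = counts.ge(2).all(axis=1)

print(pd.DataFrame({'all': by_all, 'sum>=3': by_sum, 'crosstab': by_crosstab}))
```

On the question's own data every group contains all three weekdays, so the variants agree there; they only diverge when a group lacks a weekday altogether.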