How can I find redundant groups in a pandas dataframe using groupby in Python 3.x?
Question
Below is the example dataframe:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5],
                   'b': [1001, 1002, 1232, 1001, 1002, 3002, 1021, 2021, 4000, 1002, 1002, 2031, 1200]})
df
    a     b
0   1  1001
1   1  1002
2   1  1232
3   2  1001
4   2  1002
5   3  3002
6   4  1021
7   4  2021
8   4  4000
9   5  1002
10  5  1002
11  5  2031
12  5  1200
I grouped the dataframe by column 'a', so that each group holds a set of values from column 'b'. For example, group 1 contains {1001, 1002, 1232}, group 2 contains {1001, 1002}, and so on:
df.groupby('a')
Going from top to bottom, let's call a group redundant if it does not contain any new value, i.e. every value it contains already appears in some earlier group of this dataframe.
I need to write code that counts how many groups in the dataframe are redundant.
This is what I tried:
groups = [set(i[1]['b']) for i in df.groupby('a')]
covered_groups = [groups[0]]
counter = 0
for i in groups[1:]:
    for k in covered_groups:
        if i.issubset(k):
            counter += 1
            break
        covered_groups.append(i)
print(counter)
The output is 4 instead of 1. I'm not sure what's wrong here.
Also, is there perhaps a built-in pandas method that achieves the same result?
Answer 1
Score: 0
You could try:
# aggregate as set and sort by decreasing size
tmp = (df.groupby('a')['b'].agg(set)
         .sort_values(key=lambda s: s.str.len(), ascending=False)
      )

# compare each set to the larger ones already kept
keep = []
drop = []
for s in tmp:
    if any(s.issubset(s2) for s2 in keep):
        drop.append(s)
    else:
        keep.append(s)

keep
# [{1001, 1002, 1232}, {1021, 2021, 4000}, {1002, 1200, 2031}, {3002}]

drop
# [{1001, 1002}]
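
As for why the attempt in the question prints 4: if its indentation is read so that it reproduces that output, `covered_groups.append(i)` sits inside the inner loop. A group is then appended while the list is still being scanned, matches its own copy a moment later, and gets counted. Below is a minimal sketch of one possible fix. It reuses the `any(...)` test from above but keeps the groups in their original order instead of sorting them by size. It assumes that "redundant" means "contained in one single earlier group", which matches the expected result for the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5],
    'b': [1001, 1002, 1232, 1001, 1002, 3002, 1021, 2021, 4000, 1002, 1002, 2031, 1200],
})

# one set of 'b' values per group, in the order of the group keys
groups = df.groupby('a')['b'].agg(set).tolist()

covered_groups = []
counter = 0
for s in groups:
    if any(s.issubset(k) for k in covered_groups):
        counter += 1              # fully contained in one earlier group
    else:
        covered_groups.append(s)  # appended at most once, after the scan is done

print(counter)  # 1
```

If a group should also count as redundant when its values are spread over several earlier groups, a running union of all values seen so far would be needed instead of `covered_groups`.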
Answer 2
Score: 0
I couldn't figure out a pure pandas solution, but you can try something like this:
values = set()
count = 0
for group in df.groupby('a')['b']:
    are_in = group[1].isin(values)
    if are_in.all():
        count += 1
    values = values.union(group[1])
print(count)  # 1
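
For a version that stays within pandas, one option is `Series.duplicated`, which flags every value that has already appeared higher up in the column. The sketch below assumes that "earlier" means a smaller key in `a` (the order `groupby` iterates in), so it sorts by `a` first. The names `ordered`, `is_new`, `groups_with_new` and `redundant` are only illustrative. Under that assumption, a group is redundant exactly when none of its rows holds the first occurrence of a value.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5],
    'b': [1001, 1002, 1232, 1001, 1002, 3002, 1021, 2021, 4000, 1002, 1002, 2031, 1200],
})

# Assumption: groups are visited in increasing order of 'a'
ordered = df.sort_values('a')

# True on the first occurrence of each 'b' value, False on every repeat
is_new = ~ordered['b'].duplicated()

# number of groups that contribute at least one new value
groups_with_new = ordered.loc[is_new, 'a'].nunique()

# every remaining group only repeats values seen earlier
redundant = ordered['a'].nunique() - groups_with_new
print(redundant)  # 1
```

Unlike a subset test against individual groups, this also treats a group as redundant when its values come from several different earlier groups, which is the same behaviour as the running union in the loop above.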