How can I find redundant groups in a pandas dataframe using groupby in Python 3.x?

Question

Below is the example dataframe:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5],
                   'b': [1001, 1002, 1232, 1001, 1002, 3002, 1021, 2021, 4000, 1002, 1002, 2031, 1200]})
df
	a	b
0	1	1001
1	1	1002
2	1	1232
3	2	1001
4	2	1002
5	3	3002
6	4	1021
7	4	2021
8	4	4000
9	5	1002
10	5	1002
11	5	2031
12	5	1200

I grouped the dataframe by column 'a', so that each group contains a set of values; for example, group 1 contains {1001, 1002, 1232}, group 2 contains {1001, 1002}, and so on:

df.groupby('a')

Going from top to bottom, let's call a group redundant if it does not contain any new value (every value it contains is already included in some earlier group of this dataframe).
I need to write code that finds how many groups in the dataframe are redundant.
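
For reference, the per-group sets can be listed with a quick set aggregation (just a rough sketch; the element order inside each printed set may vary):

# Aggregate column 'b' into one set per value of 'a'
print(df.groupby('a')['b'].agg(set))
# The groups are:
#   a=1: {1001, 1002, 1232}
#   a=2: {1001, 1002}        <- adds no new values, so it is the only redundant group
#   a=3: {3002}
#   a=4: {1021, 2021, 4000}
#   a=5: {1002, 1200, 2031}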

This is what I tried:

groups=[set(i[1]['b']) for i in df.groupby('a')]
covered_groups=[groups[0]]

counter=0
for i in groups[1:]:
    for k in covered_groups:
        if i.issubset(k):
            counter+=1
            break
    covered_groups.append(i)
        
print(counter) 

The output is 4 instead of 1, and I'm not sure what's wrong here.
Also, is there perhaps a built-in pandas method that achieves the same result?

Answer 1

Score: 0

You could try:

# aggregate as set and sort by decreasing size
tmp = (df.groupby('a')['b'].agg(set)
         .sort_values(key=lambda s: s.str.len(), ascending=False)
      )

# compare set to larger ones
keep = []
drop = []
for s in tmp:
    if any(s.issubset(s2) for s2 in keep):
        drop.append(s)
        continue
    else:
        keep.append(s)

keep
# [{1001, 1002, 1232}, {1021, 2021, 4000}, {1002, 1200, 2031}, {3002}]

drop
# [{1001, 1002}]
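
If the goal is just the number of redundant groups, a tiny follow-up can reuse the drop list built above; a variant that iterates with tmp.items() (a sketch, not part of the snippet above) also keeps track of which values of 'a' get dropped:

# Number of redundant groups collected above
print(len(drop))  # 1

# Sketch: same subset check, but remember which 'a' labels are dropped
keep, dropped_labels = [], []
for label, s in tmp.items():
    if any(s.issubset(s2) for s2 in keep):
        dropped_labels.append(label)   # this group adds nothing new
    else:
        keep.append(s)

print(dropped_labels)  # [2]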

Answer 2

Score: 0

I couldn't figure out a pure pandas solution, but you can try something like this:

values = set()
count = 0

for group in df.groupby('a')['b']:
    # group is a (key, Series) pair; check whether every value already appeared
    are_in = group[1].isin(values)

    if are_in.all():
        count += 1

    # add this group's values to the running set of seen values
    values = values.union(group[1])

print(count)  # 1
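
Along the same lines, a small variation of this loop (just a sketch, using the same df) can also report which groups are redundant instead of only counting them:

# Sketch: collect the redundant group labels as well as the count
seen = set()
redundant = []
for label, series in df.groupby('a')['b']:
    if series.isin(seen).all():   # every value in this group was seen before
        redundant.append(label)
    seen.update(series)           # add this group's values to the running set

print(redundant)       # [2]
print(len(redundant))  # 1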
