pandas:根据条件筛选整个分组。

huangapple go评论85阅读模式
英文:

pandas: Filter a whole group out based on a condition

问题

以下是您要翻译的内容:

我已经创建了以下的 MWE

    data = {'Name': ['Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom'], 'Article': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B'], 'Weekday': [1, 2, 3, 2, 3, 1, 2, 3, 1, 2], 'Value': [1, 40, 3, 91, 10, 6, 9, 10, 20, 10]}
    
    df_test = pd.DataFrame(data)

    Name    Article    Weekday    Value
    0    Tom    A    1    1
    1    Tom    A    2    40
    2    Tom    A    3    3
    3    Tom    B    2    91
    4    Tom    B    3    10
    5    Tom    A    1    6
    6    Tom    A    2    9
    7    Tom    A    3    10
    8    Tom    B    1    20
    9    Tom    B    2    10


其中一个组由 Name-Article 对组成我想要过滤掉所有没有至少在所有三个现有的工作日中拥有至少两次计数的组因此只应该有包含三个工作日123且至少有两次计数的组如果一个 Name-Article 只有两个工作日至少有两次计数那么它也应该被过滤掉

期望的输出应该如下所示

    Name    Article    Weekday    Value
    0    Tom    A    1    1
    1    Tom    A    2    40
    2    Tom    A    3    3
    5    Tom    A    1    6
    6    Tom    A    2    9
    7    Tom    A    3    10
英文:

I've created the following MWE

data = {'Name': ['Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom' , 'Tom', 'Tom', 'Tom'], 'Article': ['A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B'], 'Weekday' : [1,2,3,2,3,1,2,3, 1, 2], 'Value': [1,40,3,91,10,6,9,10, 20, 10]}  

df_test = pd.DataFrame(data)

	    Name	Article	Weekday	Value
    0	Tom	       A	1	    1
    1	Tom	       A	2	   40
    2	Tom	       A    3	    3
    3	Tom        B	2	   91
    4	Tom	       B	3	   10
    5	Tom	       A	1	    6
    6	Tom	       A	2	    9
    7	Tom	       A	3	   10
    8	Tom	       B	1	   20
    9	Tom	       B	2	   10

where a group consists of Name-Article pairs. I want to filter all groups out that dont have at least 2 counts in values for all three existing weekdays. So there should be only groups having three weekdays (1,2,3) and with at least two counts. If a Name-Article has only two weekdays with at least two counts it should be filtered out as well.

The expected output should look like this

	    Name	Article	Weekday	Value
    0	Tom	       A	1	    1
    1	Tom	       A	2	   40
    2	Tom	       A    3	    3
    5	Tom	       A	1	    6
    6	Tom	       A	2	    9
    7	Tom	       A	3	   10

答案1

得分: 3

如果您想要确保每个Name/Article在**每个Weekday**至少有2次计数,您可以使用crosstab来计算Name/ArticleWeekday的组合的计数。

然后,您可以使用任何筛选条件,这里我们保留具有all至少有2个值的Name/Article组合:

counts = pd.crosstab([df_test['Name'], df_test['Article']], df_test['Weekday'])
keep = counts[counts.ge(2).all(axis=1)]

out = df_test.set_index(['Name', 'Article']).loc[keep.index].reset_index()

# 或者
# out = df_test.merge(keep[[]].reset_index())

要计算不是所有天数,而是只有给定数量(例如,≥3),可以使用:

keep = counts[counts.ge(2).sum(axis=1).ge(3)]

输出:

  Name Article  Weekday  Value
0  Tom       A        1      1
1  Tom       A        2     40
2  Tom       A        3      3
3  Tom       A        1      6
4  Tom       A        2      9
5  Tom       A        3     10

中间的counts

Weekday       1  2  3
Name Article         
Tom  A        2  2  2  # 所有3个都有≥2,我们保留
     B        1  2  1  # 不是全部≥2,丢弃

请注意,以上是您提供的代码和注释的中文翻译部分。

英文:

If you want to ensure at least 2 counts per Name/Article per Weekday, you can compute a crosstab to count the combinations of Name/Article and Weekday.

Then you can use any filter you want, here we keep the Name/Article combinations that have all at least 2 values:

counts = pd.crosstab([df_test['Name'], df_test['Article']], df_test['Weekday'])
keep = counts[counts.ge(2).all(axis=1)]

out = df_test.set_index(['Name', 'Article']).loc[keep.index].reset_index()

# or
# out = df_test.merge(keep[[]].reset_index())

To count not all days but only a given number (e.g. ≥3), use:

keep = counts[counts.ge(2).sum(axis=1).ge(3)]

Output:

  Name Article  Weekday  Value
0  Tom       A        1      1
1  Tom       A        2     40
2  Tom       A        3      3
3  Tom       A        1      6
4  Tom       A        2      9
5  Tom       A        3     10

Intermediate counts:

Weekday       1  2  3
Name Article         
Tom  A        2  2  2  # all 3 have ≥ 2, we keep
     B        1  2  1  # not all ≥ 2, discard

答案2

得分: 1

使用boolean indexing进行筛选:

s = df_test.groupby(['Name', 'Article', 'Weekday']).size()
m = s.ge(2).groupby(level=[0, 1]).sum().ge(3)
df = df_test[df_test.set_index(['Name', 'Article']).index.isin(m.index[m])]
print(df)

它的工作原理:

# 按Name/Article/Weekday计算计数
print(df_test.groupby(['Name', 'Article', 'Weekday']).size())

# 测试是否大于等于2
print(s.ge(2))

# 计算每个Name/Article中True的数量
print(s.ge(2).groupby(level=[0, 1]).sum())

# 测试是否计数大于等于3
print(s.ge(2).groupby(level=[0, 1]).sum().ge(3))

# 过滤Name/Article
print(m.index[m])

# 过滤原始DataFrame中Name/Article的组合
print(df_test.set_index(['Name', 'Article']).index.isin(m.index[m]))

如果需要测试是否所有Name/Article都具有2个或更多计数,可以使用GroupBy.all进行类似的解决方案:

s = df_test.groupby(['Name', 'Article', 'Weekday']).size()
m = s.ge(2).groupby(level=[0, 1]).all()

df1 = df_test[df_test.set_index(['Name', 'Article']).index.isin(m.index[m])]
print(df1)
英文:

Use boolean indexing for filtering:

s = df_test.groupby(['Name','Article','Weekday']).size()
m = s.ge(2).groupby(level=[0,1]).sum().ge(3)
df = df_test[df_test.set_index(['Name', 'Article']).index.isin(m.index[m])]
print (df)
  Name Article  Weekday  Value
0  Tom       A        1      1
1  Tom       A        2     40
2  Tom       A        3      3
5  Tom       A        1      6
6  Tom       A        2      9
7  Tom       A        3     10

How it working:

#get counts per Name/Article/Weekday
print (df_test.groupby(['Name','Article','Weekday']).size())
Name  Article  Weekday
Tom   A        1          2
               2          2
               3          2
      B        1          1
               2          2
               3          1
dtype: int64

#test if greater or equal like 2
print (s.ge(2))
Tom   A        1           True
               2           True
               3           True
      B        1          False
               2           True
               3          False
dtype: bool

#count Trues per Name/Article
print (s.ge(2).groupby(level=[0,1]).sum())
Tom   A          3
      B          1
dtype: int64

#test if counts greater/ equal like 3
print (s.ge(2).groupby(level=[0,1]).sum().ge(3))
Tom   A           True
      B          False
      
#filter Name/Article
print (m.index[m])
MultiIndex([('Tom', 'A')],
           names=['Name', 'Article'])

#filter combinaton from original DataFrame Name/Article
print (df_test.set_index(['Name', 'Article']).index.isin(m.index[m]))
[ True  True  True False False  True  True  True False False]

Similar solution if need test if all NAme/Article has 2 or more counts with GroupBy.all:

s = df_test.groupby(['Name','Article','Weekday']).size()
m = s.ge(2).groupby(level=[0,1]).all()

df1 = df_test[df_test.set_index(['Name', 'Article']).index.isin(m.index[m])]
print (df1)
  Name Article  Weekday  Value
0  Tom       A        1      1
1  Tom       A        2     40
2  Tom       A        3      3
5  Tom       A        1      6
6  Tom       A        2      9
7  Tom       A        3     10

huangapple
  • 本文由 发表于 2023年3月9日 15:09:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/75681421.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定