Group by with conditions in Python, keeping all lines

Question

I have the following pandas dataframe:

    import pandas as pd

    df = pd.DataFrame({
        "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
        "review": ["The second review", "The second review", "The second review",
                   "This is the first review", "This is the first review",
                   "This is the first review", "This is the first review",
                   "This is the first review", "No"],
        "token_num": [1, 2, 3, 1, 2, 3, 4, 5, 1],
        "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
        "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3]
    })

       review_num                    review  token_num   token  score
    0           2         The second review          1     The    0.3
    1           2         The second review          2  second   -0.6
    2           2         The second review          3  review    0.4
    3           1  This is the first review          1    This    0.5
    4           1  This is the first review          2      is    0.8
    5           1  This is the first review          3     the   -0.7
    6           1  This is the first review          4   first    0.6
    7           1  This is the first review          5  review    0.4
    8           3                        No          1      No    0.3

I need to select rows according to these rules:

  1. If the review contains "t" or "r": get the row of that review with the max score (considering only rows whose token contains "t" or "r")
  2. If the review doesn't contain "t" or "r": get just one row of the review
  3. Keep the reviews in the same order as in the original table

With the code below I satisfy rules 1 and 3, but I don't see how to satisfy rule 2 without breaking rule 3.

    prelist = df['token'].str.contains('|'.join(['t', 'r']))
    token_max_score = df[prelist].groupby('review_num', sort=False)['score'].idxmax()

Current result:

    review_num
    2    2
    1    6

Expected result:

    review_num
    2    2
    1    6
    3    8

Answer 1

Score: 1

Use:

    # rows with t/r in token
    m = df['token'].str.contains('r|t')
    # identify reviews with no match
    m2 = (~m).groupby(df['review_num']).transform('all')
    # for each group get idxmax
    df[m | m2].groupby('review_num', sort=False)['score'].idxmax()

Output:

    review_num
    2    2
    1    6
    3    8
    Name: score, dtype: int64
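
The result above holds index labels, while the question asks for the rows themselves. A minimal, self-contained sketch of that last step, assuming the same data as in the question (the "review" and "token_num" columns are left out for brevity, and the names idx and result are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

# True for rows whose token contains a lowercase "t" or "r"
m = df["token"].str.contains("r|t")
# True for every row of a review in which no token matched
m2 = (~m).groupby(df["review_num"]).transform("all")
# Index label of the highest-scoring kept row per review, in original review order
idx = df[m | m2].groupby("review_num", sort=False)["score"].idxmax()

# Select the full rows that carry those labels
result = df.loc[idx]
print(result)
```

Note that for a review without any match, all of its rows are kept by m2, so the row returned is its highest-scoring one rather than an arbitrary one. That still satisfies the "just one line" rule. The match is also case-sensitive, so "The" and "This" do not count as containing "t".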

Previous answer

You can use a custom groupby.apply:

    (df.groupby('review_num', sort=False)
       .apply(lambda g: g['score'].idxmax()
              # does the review text contain the character "t" or "r"?
              if set(g['review'].iloc[0]).intersection(['t', 'r'])
              # otherwise pick one random row of the group
              else g.sample(n=1).index[0])
    )

Example output:

    review_num
    2    2
    1    4
    3    8
    dtype: int64

Logic:

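  • We group by "review_num", keeping the original order of the groups.
  • For each group, we convert the "review" string to a set of characters and intersect it with t/r; if the intersection is not empty, we pick the idxmax of "score". Note that this maximum is taken over all rows of the group, not only those whose token contains t/r.
  • Otherwise, we pick a random row.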
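
The per-group logic above can be adapted so that the maximum is taken only over tokens containing t/r, as rule 1 of the question requires. The following is a sketch under stated assumptions, not the answerer's code: pick_row is a hypothetical helper name, the groups are iterated explicitly instead of going through groupby.apply, and the first row is returned for reviews without a match (instead of a random one) so that the result is deterministic.

```python
import pandas as pd

df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})


def pick_row(g):
    """Return the index label of the row to keep for one review (hypothetical helper)."""
    # Rows of this review whose token contains a lowercase "t" or "r"
    matches = g[g["token"].str.contains("r|t")]
    if not matches.empty:
        # Rule 1: best score among the matching tokens only
        return matches["score"].idxmax()
    # Rule 2: no match, so keep a single row (here simply the first one)
    return g.index[0]


# sort=False yields the groups in order of first appearance (rule 3)
picked = pd.Series(
    {num: pick_row(g) for num, g in df.groupby("review_num", sort=False)}
)
print(picked)
```

On this data the mask-based answer and this per-group variant select the same rows. The mask-based version avoids calling a Python function per group, which tends to matter on larger frames.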

huangapple
  • Published on 2023-02-27 02:18:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75574092.html