Group by with conditions in Python, keeping all lines

Question

I have the following pandas dataframe:

import pandas as pd
df = pd.DataFrame({
    "review_num": [2,2,2,1,1,1,1,1,3],
    "review": ["The second review","The second review","The second review",
               "This is the first review","This is the first review",
               "This is the first review","This is the first review",
               "This is the first review",'No'],
    "token_num":[1,2,3,1,2,3,4,5,1],
    "token":["The","second","review","This","is","the","first","review","No"],
    "score":[0.3,-0.6,0.4,0.5,0.8,-0.7,0.6,0.4,0.3]
})

   review_num                    review  token_num   token  score
0           2         The second review          1     The    0.3
1           2         The second review          2  second   -0.6
2           2         The second review          3  review    0.4
3           1  This is the first review          1    This    0.5
4           1  This is the first review          2      is    0.8
5           1  This is the first review          3     the   -0.7
6           1  This is the first review          4   first    0.6
7           1  This is the first review          5  review    0.4
8           3                        No          1      No    0.3

I need to select rows as follows:

  1. If the review has at least one token containing "t" or "r": get the row with the max score, considering only the rows whose token contains "t" or "r"
  2. If the review has no token containing "t" or "r": get just one row of the review
  3. Keep the reviews in the same order as in the original table

The code below satisfies rules 1 and 3, but I don't see how to satisfy rule 2 without breaking rule 3.

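# rows whose token contains "t" or "r" (case-sensitive)
prelist = df['token'].str.contains('|'.join(['t', 'r']))

# per review (in original order), index label of the max-score row among the kept rows
token_max_score = df[prelist].groupby('review_num', sort=False)['score'].idxmax()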

Current result:

review_num
2    2
1    6

Expected result:

review_num
2    2
1    6
3    8
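
For reference, it can help to inspect the mask and see why review 3 drops out of the current result. The snippet below is only a minimal sketch: it rebuilds a reduced copy of the table (without the "review" column) so that it runs on its own, and the names mask, kept_tokens, kept_reviews and dropped_reviews are illustrative, not part of the original code. Note that str.contains is case-sensitive by default, so "The" and "This" do not match.

```python
import pandas as pd

# Reduced copy of the question's table (the "review" column is not needed here).
df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

# Case-sensitive regex match: True where the token contains a lowercase "t" or "r".
mask = df["token"].str.contains("t|r")

# Tokens that survive the filter.
kept_tokens = df.loc[mask, "token"].tolist()

# Reviews with at least one surviving row, in order of first appearance.
kept_reviews = df.loc[mask, "review_num"].unique().tolist()

# Reviews that vanish entirely because none of their tokens match.
all_reviews = df["review_num"].unique().tolist()
dropped_reviews = [r for r in all_reviews if r not in kept_reviews]

print(kept_tokens)
print(kept_reviews)
print(dropped_reviews)
```

Because the filter is applied before groupby, a review with no matching token never reaches the grouping step, which is why it is absent from the result.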

Answer 1

Score: 1

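Use two boolean masks so that reviews without any match keep all their rows, then take the idxmax per group (for such a review, idxmax simply returns its max-score row, which still gives exactly one row):

# rows whose token contains "t" or "r" (case-sensitive)
m = df['token'].str.contains('r|t')
# rows belonging to reviews in which no token matches
m2 = (~m).groupby(df['review_num']).transform('all')

# keep both kinds of rows, then get the idxmax of the score for each review
df[m | m2].groupby('review_num', sort=False)['score'].idxmax()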

Output:

review_num
2    2
1    6
3    8
Name: score, dtype: int64
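
If the rows themselves are needed rather than their index labels, the result can be passed to df.loc. The following is a self-contained sketch of that step, assuming the DataFrame index is unique (as it is with the default RangeIndex); the names idx and selected are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "review": ["The second review"] * 3 + ["This is the first review"] * 5 + ["No"],
    "token_num": [1, 2, 3, 1, 2, 3, 4, 5, 1],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

# Rows whose token contains "t" or "r" (case-sensitive).
m = df["token"].str.contains("r|t")

# Rows of reviews in which no token matches at all.
m2 = (~m).groupby(df["review_num"]).transform("all")

# Index label of the max-score row per review, in original review order.
idx = df[m | m2].groupby("review_num", sort=False)["score"].idxmax()

# Look up the full rows by label (assumes unique index labels).
selected = df.loc[idx]
print(selected)
```

Here selected holds one row per review, in the order the reviews first appear in df.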

Previous answer

You can use a custom groupby.apply:

(df.groupby('review_num', sort=False)
   .apply(lambda g: g['score'].idxmax()
          if set(g['review'].iloc[0]).intersection(['t', 'r'])
          else g.sample(n=1).index[0])
)

Example output (the row for a review without "t" or "r" is picked at random):

review_num
2    2
1    4
3    8
dtype: int64

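Logic:

  • we group by "review_num", keeping the original order of the groups
  • for each group, we convert the "review" string to a set of characters and intersect it with "t" and "r"; if the intersection is not empty, we pick the idxmax of the score
  • otherwise, we pick a random row

Note that idxmax here runs over all rows of the group, not only those whose token contains "t" or "r".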
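
The same group-wise logic can also be sketched with a named function that restricts idxmax to the rows whose token matches. This is a variation on the approach, not the answer's original code: pick_row is a hypothetical helper, and it falls back to the first row of the group instead of a random one so that the result is reproducible.

```python
import pandas as pd

df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})


def pick_row(group: pd.DataFrame):
    """Hypothetical helper: return the index label of the row to keep for one review."""
    # Rows of this review whose token contains "t" or "r" (case-sensitive).
    has_tr = group["token"].str.contains("r|t")
    if has_tr.any():
        # Max score among the matching rows only.
        return group.loc[has_tr, "score"].idxmax()
    # No matching token: keep the first row of the review (assumption: any row is fine).
    return group.index[0]


# Selecting the columns explicitly keeps the grouping column out of each group.
result = df.groupby("review_num", sort=False)[["token", "score"]].apply(pick_row)
print(result)
```

A Python-level function is called once per group, so on large tables the mask-based version is likely to be faster; the function form is mainly easier to extend with extra rules.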

huangapple
  • Posted on February 27, 2023, 02:18:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75574092.html