Group by with conditions in Python, keeping all lines
Question
I have the following pandas dataframe:
import pandas as pd

df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "review": ["The second review", "The second review", "The second review",
               "This is the first review", "This is the first review",
               "This is the first review", "This is the first review",
               "This is the first review", "No"],
    "token_num": [1, 2, 3, 1, 2, 3, 4, 5, 1],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3]
})
   review_num                    review  token_num   token  score
0           2         The second review          1     The    0.3
1           2         The second review          2  second   -0.6
2           2         The second review          3  review    0.4
3           1  This is the first review          1    This    0.5
4           1  This is the first review          2      is    0.8
5           1  This is the first review          3     the   -0.7
6           1  This is the first review          4   first    0.6
7           1  This is the first review          5  review    0.4
8           3                        No          1      No    0.3
I need to select rows as follows:
1. If the review has tokens containing "t" or "r": get the row with the max score, considering only the rows whose token contains "t" or "r"
2. If the review has no token containing "t" or "r": get just one row of the review
3. Keep the reviews in the same order as in the original table
The code below satisfies rules 1 and 3, but I don't see how to satisfy rule 2 without breaking rule 3.
prelist = df['token'].str.contains('|'.join(['t', 'r']))
token_max_score = df[prelist].groupby('review_num', sort=False)['score'].idxmax()
Current result:
review_num
2 2
1 6
Expected result:
review_num
2 2
1 6
3 8
Answer 1
Score: 1
Use:
# rows with t/r in token
m = df['token'].str.contains('r|t')
# identify reviews with no match
m2 = (~m).groupby(df['review_num']).transform('all')
# for each group get idxmax
df[m|m2].groupby('review_num', sort=False)['score'].idxmax()
Output:
review_num
2 2
1 6
3 8
Name: score, dtype: int64
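To see why the two masks are enough, here is a minimal, self-contained sketch of the same steps on a reduced copy of the question's data (the "review" and "token_num" columns are left out for brevity, and the names idx and result are only illustrative). It also uses the resulting index labels to fetch the full rows with df.loc, which is an assumption about what "keeping all lines" is meant to produce.

```python
import pandas as pd

# Reduced copy of the question's data: only the columns the masks need
df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

# True where the token itself contains a lowercase "t" or "r"
# (the match is case-sensitive, so "The" and "This" do not match)
m = df["token"].str.contains("r|t")

# True for every row of a review in which no token matched at all
m2 = (~m).groupby(df["review_num"]).transform("all")

# Keep the matching rows, plus all rows of reviews without any match,
# then take the label of the highest score per review in original order
idx = df[m | m2].groupby("review_num", sort=False)["score"].idxmax()

# Assumption: the full rows are wanted, so look them up by index label
result = df.loc[idx]
print(result)
```

Because m2 is only True for reviews with no matching token, review 3 survives the filter and still gets one row, while reviews 1 and 2 are restricted to their matching tokens.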
Previous answer
You can use a custom groupby.apply:
(df.groupby('review_num', sort=False)
   .apply(lambda g: g['score'].idxmax()
          if set(g['review'].iloc[0]).intersection(['t', 'r'])
          else g.sample(n=1).index[0])
)
Example output:
review_num
2 2
1 4
3 8
dtype: int64
Logic:
- we group by "review_num", keeping the original order of the groups
- for each group we convert the "review" string to a set of characters and intersect it with t/r; if the intersection is not empty, pick the idxmax over the whole group
- else pick a random row
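The logic above tests the characters of the whole review and takes the maximum over every row of the group. As a rough sketch of a variant that follows the question's rules more closely, the helper below (pick_row is a hypothetical name, not part of pandas) restricts the maximum to tokens containing "t" or "r" and, as an assumption, falls back to the first row of the group instead of a random one so the result is reproducible.

```python
import pandas as pd

# Reduced copy of the question's data
df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

def pick_row(g):
    """Return the index label of the row to keep for one review (hypothetical helper)."""
    # Rows of this review whose token contains a lowercase "t" or "r"
    matches = g[g["token"].str.contains("r|t")]
    if not matches.empty:
        # Rule 1: highest score among the matching tokens only
        return matches["score"].idxmax()
    # Rule 2: no matching token, so keep one row (here: the first, by assumption)
    return g.index[0]

# sort=False keeps the reviews in their original order (rule 3);
# selecting the two needed columns keeps the grouping key out of each group
picked = df.groupby("review_num", sort=False)[["token", "score"]].apply(pick_row)
print(picked)
```

This is typically slower than the mask-based approach on large frames, since the function runs once per group, but it keeps the three rules visible in one place.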