2023-02-27 02:18:24

Group by with conditions in Python, keeping all lines

Question

I have the following pandas dataframe:

import pandas as pd
df = pd.DataFrame({
    "review_num": [2,2,2,1,1,1,1,1,3],
    "review": ["The second review","The second review","The second review",
               "This is the first review","This is the first review",
               "This is the first review","This is the first review",
               "This is the first review","No"],
    "token_num":[1,2,3,1,2,3,4,5,1],
    "token":["The","second","review","This","is","the","first","review","No"],
    "score":[0.3,-0.6,0.4,0.5,0.8,-0.7,0.6,0.4,0.3]
})
   review_num                    review  token_num   token  score
0           2         The second review          1     The    0.3
1           2         The second review          2  second   -0.6
2           2         The second review          3  review    0.4
3           1  This is the first review          1    This    0.5
4           1  This is the first review          2      is    0.8
5           1  This is the first review          3     the   -0.7
6           1  This is the first review          4   first    0.6
7           1  This is the first review          5  review    0.4
8           3                        No          1      No    0.3

I need to get the lines as follows:

1. If the review contains "t" or "r": get the review line with the max score (only among lines whose token contains "t" or "r")
2. If the review doesn't contain "t" or "r": get just one line of the review
3. Keep the reviews in the same order as in the original table

This code satisfies rules 1 and 3, but I don't see how to satisfy rule 2 without breaking rule 3.

prelist = df['token'].str.contains('|'.join(['t', 'r']))
token_max_score = df[prelist].groupby('review_num', sort=False)['score'].idxmax()

Current result:

review_num
2    2
1    6

Expected result:

review_num
2    2
1    6
3    8

Answer 1

Score: 1

Use:

# rows with t/r in token
m = df['token'].str.contains('r|t')
# identify reviews with no match
m2 = (~m).groupby(df['review_num']).transform('all')
# for each group get idxmax
df[m | m2].groupby('review_num', sort=False)['score'].idxmax()

Output:

review_num
2    2
1    6
3    8
Name: score, dtype: int64
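
To see why this works, it can help to look at the two masks on their own. The sketch below rebuilds the question's DataFrame (the `review` column is left out for brevity, since the masks do not use it) and then uses `df.loc` to pull out the full rows. The names `idx` and `selected` are illustrative and not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token_num": [1, 2, 3, 1, 2, 3, 4, 5, 1],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

# True where the token itself contains a lowercase "t" or "r"
# (the match is case-sensitive, so "The" and "This" do not match)
m = df["token"].str.contains("r|t")

# True for every row of a review in which no token matched at all
m2 = (~m).groupby(df["review_num"]).transform("all")

# Best-scoring row per review among the rows that survive the filter
idx = df[m | m2].groupby("review_num", sort=False)["score"].idxmax()

# Full rows, in the order the reviews first appear in the table
selected = df.loc[idx]
print(selected)
```

For review 3, `m` is False on its only row, so filtering with `m` alone drops the whole group before `groupby` ever sees it. `m2` adds those rows back, which is what lets review 3 appear in the result without changing the order of the other groups.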

Previous answer

You can use a custom groupby.apply:

(df.groupby('review_num', sort=False)
   .apply(lambda g: g['score'].idxmax()
          if set(g['review'].iloc[0]).intersection(['t', 'r'])
          else g.sample(n=1).index[0])
)

Example output:

review_num
2    2
1    4
3    8
dtype: int64

Logic:

We group by "review_num", keeping the original order of the groups (sort=False).
For each group, we convert the "review" text to a set of characters and intersect it with 't' and 'r'; if the intersection is not empty, we pick the idxmax of "score".
Otherwise, we pick a random row.
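
The same logic can also be written as a named function, which may be easier to follow than the lambda. This is only a sketch: `pick_index` is a hypothetical helper name, `random_state=0` is an added assumption so that the random pick is reproducible, and the groups are iterated over directly instead of going through `groupby.apply`, which is a substitution and not the code above.

```python
import pandas as pd

df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "review": ["The second review"] * 3
              + ["This is the first review"] * 5
              + ["No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

def pick_index(g):
    """Return the index label of the row to keep for one review (hypothetical helper)."""
    # Characters of the review text, e.g. {"T", "h", "e", " ", ...}
    chars = set(g["review"].iloc[0])
    if chars & {"t", "r"}:
        # The review text contains "t" or "r": keep the best-scoring row
        return g["score"].idxmax()
    # Otherwise keep one row at random (seeded here for reproducibility)
    return g.sample(n=1, random_state=0).index[0]

# sort=False makes the groups come out in order of first appearance
result = pd.Series({
    key: pick_index(g) for key, g in df.groupby("review_num", sort=False)
})
print(result)
```

Note that this version tests the characters of the whole review and takes the maximum over every row of that review, so its result can differ from the mask-based approach, which only considers rows whose token matches.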
