使用基于另一行的条件来删除分组的行

huangapple go评论79阅读模式
英文:

removing rows of groups using a condition based on another row

问题

我有一个电影数据表格,看起来像这样:

    索引   电影                      烂番茄评分   导演                  排名
    1    蝙蝠侠崛起               87                克里斯托弗·诺兰    1
    2    星际穿越                  73                克里斯托弗·诺兰    2
    3    金发法宝                  71                罗伯特·卢克蒂克     1
    4    丑陋的真相               14                罗伯特·卢克蒂克     2
    5    (500)天与夏              85                马克·韦布           1
    6    超凡蜘蛛侠               71                马克·韦布           2
    7    醒来                       45                M·N·夏马兰        1
    8    最后的气宗                 5                   M·N·夏马兰        2

我已经添加了一个排名列,显示了基于烂番茄评分的排名顺序。这是按照电影的导演分组的。

我想要做的是,使用排名列,如果顶级电影的评分超过某个点(50%),则删除排名较低的电影。例如,对于马克·韦布,我只想显示他的电影(500)天与夏,但是我希望显示M·N·夏马兰的两部电影。理想的表格应该是这样的:

    索引   电影                      烂番茄评分   导演                  排名
    1    蝙蝠侠崛起               87                克里斯托弗·诺兰    1
    3    金发法宝                  71                罗伯特·卢克蒂克     1
    5    (500)天与夏              85                马克·韦布           1
    7    醒来                       45                M·N·夏马兰        1
    8    最后的气宗                 5                   M·N·夏马兰        2

我尝试过:

    movie_names = movie_names.groupby('Movie').filter(lambda g: (g.score <= 0.5).any())

但这样会删除M·N·夏马兰的两部电影。有人知道如何做到这一点吗?任何帮助将不胜感激!
英文:

I have a dataframe of movies that looks like this:

Index  Movie                   Rotten Tomato Score  Director             Ranking
1      Batman Rises            87                   Christopher Nolan    1
2      Interstellar            73                   Christopher Nolan    2
3      Legally Blonde          71                   Robert Luketic       1
4      The Ugly Truth          14                   Robert Luketic       2
5      (500) Days of Summer    85                   Marc Webb            1
6      The Amazing Spider-Man  71                   Marc Webb            2
7      Wide Awake              45                   M N Shyamalan        1
8      The Last Airbender       5                   M N Shyamalan        2

I have made a ranking column which shows the order of ranking based off of the rotten tomato percentage. This is grouped by Director of the movie.

What I am trying to do is to use the ranking column, to drop the lower ranked movies if the top ranked movie is scored above a certain point (50%). For example, For Marc Webb, the only film I want showing for him is (500) Days of Summer, however I would want to show both the movies for M N Shyamalan. The ideal table would look like:

Index  Movie                 Rotten Tomato Score  Director             Ranking  
    1  Batman Rises          87                   Christopher Nolan    1      
    3  Legally Blonde        71                   Robert Luketic       1
    5  (500) Days of Summer  85                   Marc Webb            1
    7  Wide Awake            45                   M N Shyamalan        1
    8  The Last Airbender    5                    M N Shyamalan        2

I have attempted:

movie_names = movie_names.groupby(&#39;Movie&#39;).filter(lambda g: (g.score &lt;= 0.5).any())

However that removed both of M N Shyamalan's movies.

Would anyone know how to do this? Any help would be greatly appreciated!

答案1

得分: 2

代码

# 电影评分是否大于50?
m = df['Rotten Tomato Score'] > 50

# 导演是否至少有一部电影的评分大于50?
cond1 = m.groupby(df['Director']).transform('any')

# 标记重复行并保留评分最高的电影
cond2 = df.sort_values('Rotten Tomato Score').duplicated('Director', keep='last')

# 当满足cond1和cond2时删除行
df[~(cond1 & cond2)]

结果

       Index                 Movie  Rotten Tomato Score           Director  Ranking
    0      1          Batman Rises                   87  Christopher Nolan        1
    2      3        Legally Blonde                   71     Robert Luketic        1
    4      5  (500) Days of Summer                   85          Marc Webb        1
    6      7            Wide Awake                   45      M N Shyamalan        1
    7      8    The Last Airbender                    5      M N Shyamalan        2
英文:

Code

# is movie score &gt; 50?
m = df[&#39;Rotten Tomato Score&#39;] &gt; 50

# Does director has at least one movie with score &gt; 50?
cond1 = m.groupby(df[&#39;Director&#39;]).transform(&#39;any&#39;)

# flag the duplicate rows and keep the movie with highest score
cond2 = df.sort_values(&#39;Rotten Tomato Score&#39;).duplicated(&#39;Director&#39;, keep=&#39;last&#39;)

# Drop rows when cond1 and cond2 is met
df[~(cond1 &amp; cond2)]

Result

   Index                 Movie  Rotten Tomato Score           Director  Ranking
0      1          Batman Rises                   87  Christopher Nolan        1
2      3        Legally Blonde                   71     Robert Luketic        1
4      5  (500) Days of Summer                   85          Marc Webb        1
6      7            Wide Awake                   45      M N Shyamalan        1
7      8    The Last Airbender                    5      M N Shyamalan        2

答案2

得分: 1

import pandas as pd

df=pd.DataFrame({'Index':[1, 2, 3, 4, 5, 6, 7, 8],
                 'Movie':['Batman Rises', 'Interstellar', 'Legally Blonde', 'The Ugly Truth', 
                          '(500) Days of Summer', 'The Amazing Spider-Man', 'Wide Awake', 'The Last Airbender'],
                 'Rotten Tomato Score':[87, 73, 71, 14, 85, 71, 45, 5],
                 'Director':['Christopher Nolan', 'Christopher Nolan', 'Robert Luketic', 'Robert Luketic', 
                             'Marc Webb', 'Marc Webb', 'M N Shyamalan', 'M N Shyamalan'],
                 'Ranking':[1, 2, 1, 2, 1, 2, 1, 2],
                 })
    
r = (df.groupby('Director')
       .apply(lambda g: (g.sort_values('Rotten Tomato Score')
                          .drop_duplicates(subset='Director', keep='last')) \
                          if g['Rotten Tomato Score'].max() > 50 else g
                         )
       .reset_index(drop=True)
       )

print(r)
英文:
import pandas as pd

df=pd.DataFrame({&#39;Index&#39;:[1, 2, 3, 4, 5, 6, 7, 8],
                 &#39;Movie&#39;:[&quot;Batman Rises&quot;, &quot;Interstellar&quot;, &quot;Legally Blonde&quot;, &quot;The Ugly Truth&quot;, 
                          &quot;(500) Days of Summer&quot;, &quot;The Amazing Spider-Man&quot;, &quot;Wide Awake&quot;, &quot;The Last Airbender&quot;],
                 &#39;Rotten Tomato Score&#39;:[87, 73, 71, 14, 85, 71, 45, 5],
                 &#39;Director&#39;:[&quot;Christopher Nolan&quot;, &quot;Christopher Nolan&quot;, &quot;Robert Luketic&quot;, &quot;Robert Luketic&quot;, 
                             &quot;Marc Webb&quot;, &quot;Marc Webb&quot;, &quot;M N Shyamalan&quot;, &quot;M N Shyamalan&quot;],
                 &#39;Ranking&#39;:[1, 2, 1, 2, 1, 2, 1, 2],
                 })
    
r = (df.groupby(&#39;Director&#39;)
       .apply(lambda g: (g.sort_values(&#39;Rotten Tomato Score&#39;)
                          .drop_duplicates(subset=&#39;Director&#39;, keep=&#39;last&#39;)) \
                          if g[&#39;Rotten Tomato Score&#39;].max() &gt; 50 else g
                         )
       .reset_index(drop=True)
       )

print(r)

Result

   Index                 Movie  Rotten Tomato Score           Director  Ranking
0      1          Batman Rises                   87  Christopher Nolan        1
1      7            Wide Awake                   45      M N Shyamalan        1
2      8    The Last Airbender                    5      M N Shyamalan        2
3      5  (500) Days of Summer                   85          Marc Webb        1
4      3        Legally Blonde                   71     Robert Luketic        1

huangapple
  • 本文由 发表于 2023年4月7日 01:46:27
  • 转载请务必保留本文链接:https://go.coder-hub.com/75952370.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定