2023-08-09 06:10:23

Pandas DataFrame groupby (count of movies and average of rating)

Question

I have a dataset, loaded from a CSV file, that looks like this:

viewer id      movie id       movie Name      rating
1              2              XXX             4
1              3              DDD             3
1              4              YYY             5
2              2              XXX             4
3              2              XXX             Not Available

I'm trying to find all movies that have at least 2 ratings and an average rating of 4. The rating column also contains 'Not Available' values. Using pandas, I'd like to show that the movie this query finds is XXX.

I tried using groupby, but I couldn't work the average rating into the same query. I converted the 'Not Available' ratings to NaN to get rid of the object dtype that was stopping me from calculating a mean.

Answer 1

Score: 0

You can use groupby.filter: for each movie, count how many valid ratings it has, compute their average, and check whether that average equals 4:

# Keep the rows of every movie with at least 2 valid ratings that average exactly 4.
# The walrus operator (:=) requires Python 3.8 or later.
result = df.groupby("movie id").filter(
    lambda g: (valid_ratings := g["rating"].ne("Not Available")).sum() >= 2
    and g.loc[valid_ratings, "rating"].astype(int).mean() == 4
)
print(result["movie Name"].unique())

Prints:

['XXX']
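
As a variation on the filter above, the ratings can be coerced to numbers first, so the predicate no longer has to compare against the 'Not Available' string. This is only a sketch: the sample DataFrame is rebuilt from the question's table, and the helper name `qualifies` is made up for illustration. Note that groupby.filter returns the original rows of each qualifying group (including the row with the missing rating), not an aggregated table.

```python
import pandas as pd

# Sample data rebuilt from the question; ratings are strings, as they
# would be after reading a CSV column that contains "Not Available".
df = pd.DataFrame({
    "viewer id": [1, 1, 1, 2, 3],
    "movie id": [2, 3, 4, 2, 2],
    "movie Name": ["XXX", "DDD", "YYY", "XXX", "XXX"],
    "rating": ["4", "3", "5", "4", "Not Available"],
})

# Coerce anything non-numeric to NaN once, up front.
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")


def qualifies(group: pd.DataFrame) -> bool:
    """Hypothetical helper: at least 2 valid ratings with a mean of exactly 4."""
    ratings = group["rating"].dropna()
    return bool(len(ratings) >= 2 and ratings.mean() == 4)


# filter keeps all rows of the groups for which the predicate is True.
kept = df.groupby("movie id").filter(qualifies)
movie_names = kept["movie Name"].unique().tolist()
print(movie_names)
```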

Answer 2

Score: 0

It seems you were on the right track. First, we need to normalise the data by converting the non-numeric values to NaN.

Then we can group the data using .groupby(...), as you mentioned, and use .agg(...) to apply our aggregate functions to the rating column:

count to get the number of valid (non-NaN) ratings per movie, and
mean to calculate the average rating
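
Both aggregations skip NaN, which is why the conversion step matters. The small sketch below uses the question's sample values (with the missing rating already converted to NaN) to contrast count, which ignores NaN, with size, which counts every row; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    "movie Name": ["XXX", "DDD", "YYY", "XXX", "XXX"],
    "rating": [4, 3, 5, 4, np.nan],  # the last XXX rating is missing
})

by_movie = ratings.groupby("movie Name")["rating"]

valid_counts = by_movie.count()  # non-NaN ratings per movie
row_counts = by_movie.size()     # all rows per movie, NaN included
averages = by_movie.mean()       # NaN is skipped when averaging

print(valid_counts["XXX"], row_counts["XXX"], averages["XXX"])
```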

Finally, we filter and print the results.

Here's the complete snippet:

import pandas as pd
data = {
    'viewer id': [1, 1, 1, 2, 3],
    'movie id': [2, 3, 4, 2, 2],
    'movie Name': ['XXX', 'DDD', 'YYY', 'XXX', 'XXX'],
    'rating': [4, 3, 5, 4, 'Not Available']
}
df = pd.DataFrame(data)
# Convert 'Not Available' to NaN and ensure ratings are numeric
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
# Group & aggregate
grouped = df.groupby('movie Name').agg(
    num_ratings=('rating', 'count'),
    avg_rating=('rating', 'mean'))
filtered = grouped[(grouped['num_ratings'] >= 2) & (grouped['avg_rating'] == 4)]
print(filtered)

Output:

              num_ratings  avg_rating
movie Name                         
XXX                  2         4.0
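
Since the question mentions that the data comes from a CSV file, the conversion to NaN could also happen at load time. The sketch below assumes a comma-separated file with the question's column names (the inline `csv_text` stands in for the real file) and uses the `na_values` parameter of `pd.read_csv`. It also swaps the exact `== 4` comparison for `np.isclose`, which may be safer when averages are not exactly representable as floats.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the real CSV file; the comma-separated layout is an assumption.
csv_text = """viewer id,movie id,movie Name,rating
1,2,XXX,4
1,3,DDD,3
1,4,YYY,5
2,2,XXX,4
3,2,XXX,Not Available
"""

# Treat "Not Available" as missing while parsing, so rating is numeric from the start.
df = pd.read_csv(io.StringIO(csv_text), na_values=["Not Available"])

stats = df.groupby("movie Name").agg(
    num_ratings=("rating", "count"),
    avg_rating=("rating", "mean"),
)

# Tolerant float comparison instead of exact equality.
mask = (stats["num_ratings"] >= 2) & np.isclose(stats["avg_rating"], 4)
result = stats[mask]
print(result.index.tolist())
```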
