Pandas DataFrame groupby (count of movies and average of rating)

Question

I have a dataset loaded from a CSV file that looks like this:

    viewer id  movie id  movie Name  rating
    1          2         XXX         4
    1          3         DDD         3
    1          4         YYY         5
    2          2         XXX         4
    3          2         XXX         Not Available

I'm trying to find all movies that have at least 2 ratings and an average rating of 4. The rating column also contains 'Not Available' values. Using pandas, I'd like to show that the movie matching this query here is XXX.

I tried using groupby but couldn't work out how to include the average rating as well. I converted the 'Not Available' ratings to NaN to get rid of the 'object' dtype issue that stopped me from calculating a mean.

Answer 1

Score: 0

You can use groupby.filter: for each movie, count how many valid ratings it has, compute their average, and check whether it equals 4:

    # Keep only the rows of movies with at least 2 valid ratings averaging exactly 4.
    # The walrus operator (:=) requires Python 3.8 or later.
    x = df.groupby("movie id").filter(
        lambda g: (valid_ratings := g["rating"].ne("Not Available")).sum() >= 2
        and g.loc[valid_ratings, "rating"].astype(int).mean() == 4
    )
    print(x["movie Name"].unique())

Prints:

    ['XXX']
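If the ratings have already been converted to NaN, as the question describes, the same filter idea can be written without the string comparison. The following is only a sketch under that assumption: the `rating_num` column and the helper name `has_two_ratings_averaging_four` are made up for illustration, and the sample frame is copied from the question with ratings stored as strings.

```python
import pandas as pd

# Sample data copied from the question; ratings are strings here, as they
# would be in a CSV column that mixes numbers and text.
df = pd.DataFrame({
    "viewer id": [1, 1, 1, 2, 3],
    "movie id": [2, 3, 4, 2, 2],
    "movie Name": ["XXX", "DDD", "YYY", "XXX", "XXX"],
    "rating": ["4", "3", "5", "4", "Not Available"],
})

# Assumption: anything that is not a number counts as "no rating".
# rating_num is a hypothetical helper column, not part of the original data.
df["rating_num"] = pd.to_numeric(df["rating"], errors="coerce")


def has_two_ratings_averaging_four(group: pd.DataFrame) -> bool:
    ratings = group["rating_num"]
    # count() skips NaN, so only real ratings are counted.
    return bool(ratings.count() >= 2 and ratings.mean() == 4)


# filter() keeps every row of the groups for which the function returns True.
kept = df.groupby("movie id").filter(has_two_ratings_averaging_four)
movie_names = kept["movie Name"].unique().tolist()
print(movie_names)
```

Because filter returns the original rows rather than one row per group, `kept` still contains all three rows for movie 2, including the one with the missing rating.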

Answer 2

Score: 0

It seems you were on the right track. The first step is indeed to normalise the data by converting non-numeric values to NaN.

Then we can group the data using .groupby(...), as you mentioned, and use .agg(...) to apply our aggregate functions to the rating column:

  • count to get the number of non-NaN ratings for each movie, and
  • mean to calculate the average rating
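It may help to see why count is the right aggregate once missing ratings are NaN. The small sketch below uses a made-up frame (not the question's full data) to contrast count with size; the column names `num_rows`, `num_ratings` and `avg_rating` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Made-up sample: XXX has three rows, one of which has no rating.
ratings = pd.DataFrame({
    "movie Name": ["XXX", "XXX", "XXX", "DDD"],
    "rating": [4.0, 4.0, np.nan, 3.0],
})

summary = ratings.groupby("movie Name").agg(
    num_rows=("rating", "size"),      # counts every row, NaN included
    num_ratings=("rating", "count"),  # counts only non-NaN values
    avg_rating=("rating", "mean"),    # mean skips NaN by default
)
print(summary)
```

For XXX this gives 3 rows but only 2 ratings, so a threshold such as "at least 2 ratings" is applied to real ratings only when count is used.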

Finally, we filter and print the results.

Here's the complete snippet:

    import pandas as pd

    data = {
        'viewer id': [1, 1, 1, 2, 3],
        'movie id': [2, 3, 4, 2, 2],
        'movie Name': ['XXX', 'DDD', 'YYY', 'XXX', 'XXX'],
        'rating': [4, 3, 5, 4, 'Not Available']
    }
    df = pd.DataFrame(data)

    # Convert 'Not Available' to NaN and ensure ratings are numeric
    df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

    # Group & aggregate
    grouped = df.groupby('movie Name').agg(
        num_ratings=('rating', 'count'),
        avg_rating=('rating', 'mean'))

    # Keep movies with at least 2 ratings and an average of exactly 4
    filtered = grouped[(grouped['num_ratings'] >= 2) & (grouped['avg_rating'] == 4)]
    print(filtered)

Output:

                num_ratings  avg_rating
    movie Name
    XXX                   2         4.0
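Since the question's data comes from a CSV file, the NaN conversion can also happen at load time. The sketch below is one possible variant, not the only way: the comma-separated layout and the in-memory `csv_text` are assumptions standing in for the real file, grouping by both movie id and movie Name guards against two different movies sharing a name, and np.isclose replaces the exact `== 4` comparison in case the averages carry floating-point noise.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content mirroring the question's table
# (the comma-separated layout is an assumption).
csv_text = """viewer id,movie id,movie Name,rating
1,2,XXX,4
1,3,DDD,3
1,4,YYY,5
2,2,XXX,4
3,2,XXX,Not Available
"""

# na_values turns the placeholder text into NaN while reading,
# so the rating column is numeric from the start.
df = pd.read_csv(io.StringIO(csv_text), na_values=["Not Available"])

stats = (
    df.groupby(["movie id", "movie Name"])
      .agg(num_ratings=("rating", "count"), avg_rating=("rating", "mean"))
      .reset_index()
)

# Tolerant comparison instead of ==, in case the mean is not exactly 4.0.
mask = (stats["num_ratings"] >= 2) & np.isclose(stats["avg_rating"], 4.0)
matches = stats.loc[mask, "movie Name"].tolist()
print(matches)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file path.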

huangapple
  • Published on August 9, 2023, 06:10:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76863488.html