Pandas DataFrame groupby (count of movies and average of rating)
Question
I have a dataset loaded from a CSV file that looks like this:
viewer id  movie id  movie Name  rating
1          2         XXX         4
1          3         DDD         3
1          4         YYY         5
2          2         XXX         4
3          2         XXX         Not Available
I'm trying to find all movies that have at least 2 ratings AND an average rating of 4. The rating column also contains 'Not Available' values. Using pandas, I'd like to show that the movie matching this query is XXX.
I tried using groupby but was not able to also include the average rating. I converted the 'Not Available' ratings to NaN to get rid of the 'object' dtype issue that was stopping me from calculating a mean.
Answer 1
Score: 0
You can use groupby.filter, where you check how many valid ratings the movie has and compute the average (and check whether it equals 4):
# Keep only the rows of movies that pass both checks (the walrus operator needs Python 3.8+).
x = df.groupby("movie id").filter(
    lambda x: (valid_ratings := x["rating"].ne("Not Available")).sum() >= 2
    and x.loc[valid_ratings, "rating"].astype(int).mean() == 4
)
print(x["movie Name"].unique())
Prints:
['XXX']
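If the ratings have already been converted to NaN, as the question mentions, comparing against the string 'Not Available' no longer finds anything to exclude. Here is a minimal sketch of the same groupby.filter idea that coerces the column with pd.to_numeric first. The sample DataFrame and the helper name has_enough_good_ratings (with its min_ratings and target_mean parameters) are assumptions made for illustration, not part of the original answer:

```python
import pandas as pd

# Sample data mirroring the question (assumed; the real data comes from a CSV).
df = pd.DataFrame({
    "viewer id": [1, 1, 1, 2, 3],
    "movie id": [2, 3, 4, 2, 2],
    "movie Name": ["XXX", "DDD", "YYY", "XXX", "XXX"],
    "rating": ["4", "3", "5", "4", "Not Available"],
})


def has_enough_good_ratings(group, min_ratings=2, target_mean=4):
    """Hypothetical helper: True if the group has at least min_ratings
    numeric ratings and their mean equals target_mean."""
    # Non-numeric entries become NaN; values that are already NaN stay NaN.
    ratings = pd.to_numeric(group["rating"], errors="coerce")
    # Both count() and mean() ignore NaN.
    return bool(ratings.count() >= min_ratings and ratings.mean() == target_mean)


# filter() keeps every row of each group for which the helper returns True.
kept = df.groupby("movie id").filter(has_enough_good_ratings)
movie_names = kept["movie Name"].unique().tolist()
print(movie_names)
```

On the sample data this keeps the three rows for movie id 2, so the only name left is XXX. Because the coercion happens inside the helper, it should behave the same whether the column still holds the raw strings or has already been converted.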
Answer 2
Score: 0
It seems you were on the right track. Indeed, first we need to normalise the data by converting non-numeric values to NaN.
Then we can group the data using .groupby(...), as you mentioned, and use .agg(...) to apply our aggregate functions to the rating column:
- count, to get the number of ratings per movie (NaN values are not counted), and
- mean, to calculate the average rating.
Finally, we filter and print the results.
Here's the complete snippet:
import pandas as pd

data = {
    'viewer id': [1, 1, 1, 2, 3],
    'movie id': [2, 3, 4, 2, 2],
    'movie Name': ['XXX', 'DDD', 'YYY', 'XXX', 'XXX'],
    'rating': [4, 3, 5, 4, 'Not Available']
}
df = pd.DataFrame(data)

# Convert 'Not Available' to NaN and ensure ratings are numeric
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

# Group and aggregate: 'count' skips NaN, so it counts valid ratings only
grouped = df.groupby('movie Name').agg(
    num_ratings=('rating', 'count'),
    avg_rating=('rating', 'mean'))

# Keep movies with at least 2 ratings and an average of exactly 4
filtered = grouped[(grouped['num_ratings'] >= 2) & (grouped['avg_rating'] == 4)]
print(filtered)
Output:
            num_ratings  avg_rating
movie Name
XXX                   2         4.0
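Two details of this approach may matter on real data: an exact == 4 comparison on a floating-point mean can miss averages that are only approximately 4, and grouping by 'movie Name' alone would merge different movies that happen to share a title. Below is a hedged sketch of one way to handle both, using numpy.isclose and grouping by both identifier columns. The sample data and the tolerance value are assumptions, not something stated in the question:

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question (assumed; the real data comes from a CSV).
df = pd.DataFrame({
    "viewer id": [1, 1, 1, 2, 3],
    "movie id": [2, 3, 4, 2, 2],
    "movie Name": ["XXX", "DDD", "YYY", "XXX", "XXX"],
    "rating": [4, 3, 5, 4, "Not Available"],
})
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

# Group by id and name together so two movies with the same title stay separate.
grouped = (
    df.groupby(["movie id", "movie Name"])
    .agg(num_ratings=("rating", "count"), avg_rating=("rating", "mean"))
    .reset_index()
)

# Compare the mean with a small tolerance instead of exact equality.
# atol=1e-9 is an assumed value; pick whatever precision suits the data.
is_match = (grouped["num_ratings"] >= 2) & np.isclose(
    grouped["avg_rating"], 4, atol=1e-9
)
result = grouped.loc[is_match, "movie Name"].tolist()
print(result)
```

With integer ratings like those in the sample, the exact comparison and the tolerant one should agree; the tolerance mostly helps when ratings are fractional or the target average is not exactly representable in floating point.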