英文:
Find top 2 venues based on top 3 years with most concerts held?
问题
-
我找到了2018年、2019年、2020年和2021年是举办最多音乐会的前三年,每年都有5场音乐会举行。
-
对于问题2,你可以使用以下方法来解决:
stadium_count = df.groupby(['year', 'venue']).size().reset_index(name='count')
result = stadium_count.groupby('year').apply(lambda x: x[x['count'] == x['count'].max()])
我期望问题2的答案会类似于这样:
Year Venue
2018 V2
2020 V3
英文:
I am not sure if the question is written well by the exercise but to answer this question I split it into 2 questions.
- What were the top 3 years with the most concerts held?
- Based on the top years, which 2 venues had the most concert held during this period?
this is my dataset
df = {"year":["2017","2018","2018","2018","2018","2018","2019","2019","2019",'2019',"2019","2020","2020","2020","2020","2020","2021","2021","2021","2021","2021","2022",'2022','2023'],
"venue":["V2","V1","V2","V2","V2","V2","V1","V1","V2","V2","V5","V1","V2","V2","V3","V3","V3","V3","V4","V4","V5","V1","V6","V3"]}
To answer Q1,
stadium_count_per_year= df.groupby('year').agg({'venue':'count'})
highest_concert_yr = stadium_count_per_year[stadium_count_per_year["venue"] >= stadium_count_per_year["venue"].max()]
I found that 2018,2019,2020,2021 were the top 3 years with all having 5 concerts held.
To answer Q2,
I thought I could do this:
stadium_count = df.groupby("venue","year").agg({"venue":'count'})
However, this is incorrect. What method should I use to solve this question?
I am expecting question 2 to be something like this:
Year Venue
2018 V2
2020 V3
答案1
得分: 1
问题有点模糊,因为有关年份的并列情况。我会在这里使用value_counts
,以及nlargest
,结合isin
进行年份的筛选:
# 找到演唱会最多的3个年份
top = df['year'].value_counts().nlargest(3, keep='all').index
# ['2018', '2019', '2020', '2021']
# 注意:所有的年份都有5场演唱会
# 然后只保留上面识别出的年份并重复相同的逻辑
df.loc[df['year'].isin(top), 'venue'].value_counts().nlargest(2, keep='all')
注意:如果您真的想考虑前N个,即使有并列,那么在nlargest
中使用keep='first'
,但这可能会根据行的顺序产生模糊的结果。
输出:
venue
V2 8
V1 4
V3 4
Name: count, dtype: int64
英文:
The question is a bit ambiguous because you have ties. I would use value_counts
here and nlargest
, with isin
slicing to filter the years:
# find the 3 years with most concerts
top = df['year'].value_counts().nlargest(3, keep='all').index
# ['2018', '2019', '2020', '2021']
# NB. all are equal with 5 concerts
# then only keep the years identified above and repeat the logic
df.loc[df['year'].isin(top), 'venue'].value_counts().nlargest(2, keep='all')
NB. if you really want to consider the top N, even if there are ties, then use keep='first'
in nlargest
, but this might give ambiguous results depending on the order of the rows.
Output:
venue
V2 8
V1 4
V3 4
Name: count, dtype: int64
答案2
得分: 0
尝试按照您的方法,您只需要在年份上使用isin
。之后,其余部分很容易。
import pandas as pd
df = pd.DataFrame({"year": ["2017", "2018", "2018", "2018", "2018", "2018", "2019", "2019", "2019", '2019', "2019", "2020", "2020",
"2020", "2020", "2020", "2021", "2021", "2021", "2021", "2021", "2022", '2022', '2023'],
"venue": ["V2", "V1", "V2", "V2", "V2", "V2", "V1", "V1", "V2", "V2", "V5", "V1", "V2", "V2", "V3", "V3", "V3",
"V3", "V4", "V4", "V5", "V1", "V6", "V3"]})
stadium_count_per_year = df.groupby('year').agg({'venue':'count'})['venue']
max_years = stadium_count_per_year.sort_values(ascending=False).iloc[2]
yrs = stadium_count_per_year[stadium_count_per_year >= max_years].index.values
venue_count = df[df['year'].isin(yrs)].groupby('venue').count()['year']
max_venues = venue_count.sort_values(ascending=False).iloc[1]
venues = venue_count[venue_count >= max_venues].index.values
print(yrs, venues)
输出结果为:
['2018' '2019' '2020' '2021'] ['V1' 'V2' 'V3']
英文:
Trying to follow your approach, you just need to use isin
with years. After that, the rest is easy.
import pandas as pd
df = pd.DataFrame({"year": ["2017", "2018", "2018", "2018", "2018", "2018", "2019", "2019", "2019", '2019', "2019", "2020", "2020",
"2020", "2020", "2020", "2021", "2021", "2021", "2021", "2021", "2022", '2022', '2023'],
"venue": ["V2", "V1", "V2", "V2", "V2", "V2", "V1", "V1", "V2", "V2", "V5", "V1", "V2", "V2", "V3", "V3", "V3",
"V3", "V4", "V4", "V5", "V1", "V6", "V3"]})
stadium_count_per_year = df.groupby('year').agg({'venue':'count'})['venue']
max_years = stadium_count_per_year.sort_values(ascending=False).iloc[2]
yrs = stadium_count_per_year[stadium_count_per_year >= max_years].index.values
venue_count = df[df['year'].isin(yrs)].groupby('venue').count()['year']
max_venues = venue_count.sort_values(ascending=False).iloc[1]
venues = venue_count[venue_count >= max_venues].index.values
print(yrs, venues)
Yields
['2018' '2019' '2020' '2021'] ['V1' 'V2' 'V3']
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论