找到基于举办最多音乐会的前3年的前2个场馆?

huangapple go评论72阅读模式
英文:

Find top 2 venues based on top 3 years with most concerts held?

问题

  1. 我找到了2018年、2019年、2020年和2021年是举办最多音乐会的前三年,每年都有5场音乐会举行。

  2. 对于问题2,你可以使用以下方法来解决:

stadium_count = df.groupby(['year', 'venue']).size().reset_index(name='count')
result = stadium_count.groupby('year').apply(lambda x: x[x['count'] == x['count'].max()])

我期望问题2的答案会类似于这样:

Year Venue

2018 V2

2020 V3

英文:

I am not sure if the question is written well by the exercise but to answer this question I split it into 2 questions.

  1. What were the top 3 years with the most concerts held?
  2. Based on the top years, which 2 venues had the most concert held during this period?

this is my dataset

  df = {"year":["2017","2018","2018","2018","2018","2018","2019","2019","2019",'2019',"2019","2020","2020","2020","2020","2020","2021","2021","2021","2021","2021","2022",'2022','2023'],
        "venue":["V2","V1","V2","V2","V2","V2","V1","V1","V2","V2","V5","V1","V2","V2","V3","V3","V3","V3","V4","V4","V5","V1","V6","V3"]}

To answer Q1,

stadium_count_per_year= df.groupby('year').agg({'venue':'count'})
highest_concert_yr = stadium_count_per_year[stadium_count_per_year["venue"] >= stadium_count_per_year["venue"].max()]

I found that 2018,2019,2020,2021 were the top 3 years with all having 5 concerts held.

To answer Q2,
I thought I could do this:

stadium_count = df.groupby("venue","year").agg({"venue":'count'})

However, this is incorrect. What method should I use to solve this question?

I am expecting question 2 to be something like this:

Year Venue

2018 V2

2020 V3

答案1

得分: 1

问题有点模糊,因为有关年份的并列情况。我会在这里使用value_counts,以及nlargest,结合isin进行年份的筛选:

# 找到演唱会最多的3个年份
top = df['year'].value_counts().nlargest(3, keep='all').index
# ['2018', '2019', '2020', '2021']
# 注意:所有的年份都有5场演唱会

# 然后只保留上面识别出的年份并重复相同的逻辑
df.loc[df['year'].isin(top), 'venue'].value_counts().nlargest(2, keep='all')

注意:如果您真的想考虑前N个,即使有并列,那么在nlargest中使用keep='first',但这可能会根据行的顺序产生模糊的结果。

输出:

venue
V2    8
V1    4
V3    4
Name: count, dtype: int64
英文:

The question is a bit ambiguous because you have ties. I would use value_counts here and nlargest, with isin slicing to filter the years:

# find the 3 years with most concerts
top = df['year'].value_counts().nlargest(3, keep='all').index
# ['2018', '2019', '2020', '2021']
# NB. all are equal with 5 concerts

# then only keep the years identified above and repeat the logic
df.loc[df['year'].isin(top), 'venue'].value_counts().nlargest(2, keep='all')

NB. if you really want to consider the top N, even if there are ties, then use keep='first' in nlargest, but this might give ambiguous results depending on the order of the rows.

Output:

venue
V2    8
V1    4
V3    4
Name: count, dtype: int64

答案2

得分: 0

尝试按照您的方法,您只需要在年份上使用isin。之后,其余部分很容易。

import pandas as pd
df = pd.DataFrame({"year": ["2017", "2018", "2018", "2018", "2018", "2018", "2019", "2019", "2019", '2019', "2019", "2020", "2020",
                            "2020", "2020", "2020", "2021", "2021", "2021", "2021", "2021", "2022", '2022', '2023'],
                   "venue": ["V2", "V1", "V2", "V2", "V2", "V2", "V1", "V1", "V2", "V2", "V5", "V1", "V2", "V2", "V3", "V3", "V3",
                             "V3", "V4", "V4", "V5", "V1", "V6", "V3"]})

stadium_count_per_year = df.groupby('year').agg({'venue':'count'})['venue']
max_years = stadium_count_per_year.sort_values(ascending=False).iloc[2]
yrs = stadium_count_per_year[stadium_count_per_year >= max_years].index.values

venue_count = df[df['year'].isin(yrs)].groupby('venue').count()['year']
max_venues = venue_count.sort_values(ascending=False).iloc[1]
venues = venue_count[venue_count >= max_venues].index.values
print(yrs, venues)

输出结果为:

['2018' '2019' '2020' '2021'] ['V1' 'V2' 'V3']
英文:

Trying to follow your approach, you just need to use isin with years. After that, the rest is easy.

import pandas as pd
df = pd.DataFrame({"year": ["2017", "2018", "2018", "2018", "2018", "2018", "2019", "2019", "2019", '2019', "2019", "2020", "2020",
                            "2020", "2020", "2020", "2021", "2021", "2021", "2021", "2021", "2022", '2022', '2023'],
                   "venue": ["V2", "V1", "V2", "V2", "V2", "V2", "V1", "V1", "V2", "V2", "V5", "V1", "V2", "V2", "V3", "V3", "V3",
                             "V3", "V4", "V4", "V5", "V1", "V6", "V3"]})



stadium_count_per_year = df.groupby('year').agg({'venue':'count'})['venue']
max_years = stadium_count_per_year.sort_values(ascending=False).iloc[2]
yrs = stadium_count_per_year[stadium_count_per_year >= max_years].index.values

venue_count = df[df['year'].isin(yrs)].groupby('venue').count()['year']
max_venues = venue_count.sort_values(ascending=False).iloc[1]
venues = venue_count[venue_count >= max_venues].index.values
print(yrs, venues)

Yields

['2018' '2019' '2020' '2021'] ['V1' 'V2' 'V3']

huangapple
  • 本文由 发表于 2023年6月5日 02:23:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/76401860.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定