I have a dataset (df) of the US presidential election results from 2000-2020 that breaks down how many votes each party got within each state (so, for argument's sake: 3 parties [Dem, Rep, Other], 51 voting districts [DC included], and 6 presidential elections = 918 rows of data).

I am trying to get a single line for each state/year combination that corresponds to the party that won that state (i.e., 51 districts * 6 years = 306 lines of data).

I grouped df by 'year' and 'state' and sorted by highest vote count, so the data is sorted the way I want. Now I want to extract the first line of each group (i.e., the party that got the highest vote in that state in that year) and append it to a new df2.

I wrote this line of code which shows me the values that I'm looking to extract, but I'm not sure how to append it to a new DF (screenshot of the output is attached).

df.groupby(by = ['year', 'state'])['candidatevotes'].max()

I cheated a bit and asked ChatGPT but it didn't provide me with a great output. Here's the code it gave me and I'll attach a screenshot of the output.

# Group the 'state_group' DataFrame by 'year'
grouped = state_group.groupby('year')
# Loop through each year group
for year, group_df in grouped:
    # Group the year group by 'state'
    state_grouped = group_df.groupby('state_po')
    # Loop through each state group
    for state, state_group_df in state_grouped:
        # Find the index of the row with the party having the highest vote
        max_vote_index = state_group_df['candidatevotes'].idxmax()
        # Extract the row with the party of the highest vote
        winner_row = state_group_df.loc[max_vote_index]
        # Append the winner row to the 'state_winners' DataFrame
        state_winners = state_winners.append(winner_row)

# Reset the index of the 'state_winners' DataFrame
state_winners.reset_index(drop=True, inplace=True)

# Print the 'state_winners' DataFrame
print(state_winners)

I'm just looking for a general starting point, and to know whether a for loop is even the right approach for this. I've tried running the loop code when the data was grouped, then when the data was ungrouped (the screenshots are from the ungrouped data), then when it was in a MultiIndex with year and state as the index levels, etc. If there's a simpler way to extract the data I'm looking for, that'd be even better.

This is what the data itself looks like

This is essentially the data I would like to extract and append to a new df, obviously with the entire row of data intact (i.e., all original columns, not just the raw data values)

This is the output for the code ChatGPT provided me with. I'm not even sure how the output was nearly 6x the number of rows as the original dataset.


Avoid using .apply where you can, as it is usually much slower than the built-in groupby methods. Instead, you can sort by "candidatevotes" first, then .groupby and take the highest-valued row in each group:

state_winners = df.sort_values("candidatevotes").groupby(["year", "state_po"], as_index=False).last()

As the default sorting order is ascending=True, using .last() will return the last row of the grouped data, which will be the row with the highest number of candidate votes. Similarly, you could use .first() if you were sorting in descending order.

Using as_index=False in the .groupby means that the two columns used for grouping are not used as the index, but are instead columns.
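To make the sort-then-group idea concrete, here is a minimal sketch on a toy frame. The column names `year`, `state_po` and `candidatevotes` come from the question; the `party` column and all of the values are invented for illustration.

```python
import pandas as pd

# Toy data: "party" and every number here are made up for illustration.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000, 2004, 2004],
    "state_po": ["AL", "AL", "AK", "AK", "AL", "AL"],
    "party": ["DEM", "REP", "DEM", "REP", "DEM", "REP"],
    "candidatevotes": [100, 150, 80, 60, 120, 90],
})

# Sort ascending by votes, then keep the last (highest) row of each group.
state_winners = (
    df.sort_values("candidatevotes")
      .groupby(["year", "state_po"], as_index=False)
      .last()
)
print(state_winners)
```

The result has one row per year/state pair, with `year` and `state_po` kept as ordinary columns. One caveat worth knowing: `.last()` takes the last non-null value in each column separately, so if the winning row could have missing values in some columns, calling `.tail(1)` on the sorted, grouped data may be a safer way to keep the row intact.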

Note: if there are two rows with equal "candidatevotes" then only one will be returned (similar to the answer by @eduardokapp). Although this is unlikely to occur with candidate votes, it is something to be aware of.
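If ties ever matter, one possible way to keep every tied row is to compare each row against its group's maximum. This is a sketch on invented data (the `party` column and the vote counts are assumptions), not part of the timing comparison in this answer.

```python
import pandas as pd

# Toy data with a deliberate tie in AL; all values are invented.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000],
    "state_po": ["AL", "AL", "AK", "AK"],
    "party": ["DEM", "REP", "DEM", "REP"],
    "candidatevotes": [100, 100, 80, 60],
})

# Broadcast each group's maximum back onto its rows, then filter.
group_max = df.groupby(["year", "state_po"])["candidatevotes"].transform("max")
tied_winners = df[df["candidatevotes"] == group_max]
print(tied_winners)
```

Here both AL rows survive the filter, so the output can have more than one row per year/state pair whenever there is a tie.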

Here are some time comparisons between the above and the answer by @eduardokapp (the version without .apply is over 7x faster when I run this):

# without `.apply`
%timeit df.sort_values("candidatevotes").groupby(["year", "state_po"], as_index=False).last()
# 3.67 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# with `.apply`
%timeit df.groupby(['year', 'state_po']).apply(lambda group: group.loc[group['candidatevotes'].idxmax()])
# 27.2 ms ± 338 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


You can use a more complex groupby, like so:

state_winners = df.groupby(['year', 'state_po']).apply(lambda group: group.loc[group['candidatevotes'].idxmax()])

This code groups by year and state_po, then applies an anonymous (lambda) function to each group. The function takes the group and returns only the row where group['candidatevotes'] is at its maximum.
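The same idxmax idea can also be written without a lambda, by asking the grouped column for the index label of each group's maximum and then selecting those rows from the original frame. The sketch below uses invented data (the `party` column and the values are assumptions) and assumes df has a unique index, since the rows are looked up by label.

```python
import pandas as pd

# Toy data: "party" and every number here are made up for illustration.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000, 2004, 2004],
    "state_po": ["AL", "AL", "AK", "AK", "AL", "AL"],
    "party": ["DEM", "REP", "DEM", "REP", "DEM", "REP"],
    "candidatevotes": [100, 150, 80, 60, 120, 90],
})

# Index label of the highest-vote row within each (year, state_po) group.
winner_idx = df.groupby(["year", "state_po"])["candidatevotes"].idxmax()

# Select those rows; every original column is kept.
state_winners = df.loc[winner_idx].reset_index(drop=True)
print(state_winners)
```

Because the rows are pulled straight from df, each winner comes back with all of its original columns intact.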

I recommend taking a look at the pandas groupby function documentation; it has some nice examples of its applications.

