For loop with Python to iterate through a grouped DF and extract the top value for each group and append to a new DF?
Question
I have a dataset (df) of US presidential election results from 2000 to 2020 that breaks down how many votes each party got within each state (so, for argument's sake: 3 parties [Dem, Rep, Other], 51 voting districts [DC included], and 6 presidential elections = 918 rows of data).
I am trying to get a single row for each state/year combination that corresponds to the party that won that state (i.e., 51 districts * 6 years = 306 rows of data).
I grouped df by 'year' and 'state' and sorted by highest vote count, so the data looks the way I want in terms of sorting. Now I want to extract the first row of each group (i.e., the party that got the most votes in that state in that year) and append it to a new df2.
I wrote this line of code, which shows me the values I'm looking to extract, but I'm not sure how to append them to a new DF (a screenshot of the output is attached).
    df.groupby(by=['year', 'state'])['candidatevotes'].max()
I cheated a bit and asked ChatGPT, but it didn't give me a great output. Here's the code it gave me; a screenshot of the output is attached.
    # Group the 'state_group' DataFrame by 'year'
    grouped = state_group.groupby('year')
    # Loop through each year group
    for year, group_df in grouped:
        # Group the year group by 'state'
        state_grouped = group_df.groupby('state_po')
        # Loop through each state group
        for state, state_group_df in state_grouped:
            # Find the index of the row with the party having the highest vote
            max_vote_index = state_group_df['candidatevotes'].idxmax()
            # Extract the row with the party of the highest vote
            winner_row = state_group_df.loc[max_vote_index]
            # Append the winner row to the 'state_winners' DataFrame
            state_winners = state_winners.append(winner_row)
    # Reset the index of the 'state_winners' DataFrame
    state_winners.reset_index(drop=True, inplace=True)
    # Print the 'state_winners' DataFrame
    print(state_winners)
I'm just looking for a general starting point, and whether a for loop is even the right approach for this. I've tried running the loop code when the data was grouped, then when the data was ungrouped (the screenshots are from when I ungrouped the data), then when it was in a MultiIndex with year and state as the index levels, etc. If there's a simpler way to extract the data I'm looking for, that would be even better.
Screenshot of the groupby output: this is basically the data I want to extract and append to a new df, obviously including the entire row (i.e., all of the original columns, not just the vote counts).
Screenshot of the ChatGPT code's output: I'm not even sure why the output has almost 6x as many rows as the original dataset.
Answer 1
Score: 1
Avoid using `.apply` where you can, as it calls a Python function once per group and is usually much slower than the built-in groupby methods. Instead, you can sort first, then `.groupby` and take the row with the highest value in "candidatevotes":
    state_winners = df.sort_values("candidatevotes").groupby(["year", "state_po"], as_index=False).last()
Since the default sort order is `ascending=True`, `.last()` returns the last row of each group, which is the row with the highest number of candidate votes. Similarly, you could use `.first()` if you sorted in descending order.
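As a minimal sketch of the sort-then-`.last()` pattern, here is a self-contained example on a small made-up frame. The vote counts and the `party` column name are invented for illustration; only `year`, `state_po` and `candidatevotes` come from the question. One caveat: `.last()` takes the last non-null value of each column, so if other columns contain missing values the result can mix values from different rows, and `.tail(1)` may be the safer choice there.

```python
import pandas as pd

# Made-up vote counts; "party" is an assumed column name.
df = pd.DataFrame({
    "year": [2000] * 6 + [2004] * 6,
    "state_po": ["AL", "AL", "AL", "AK", "AK", "AK"] * 2,
    "party": ["DEM", "REP", "OTHER"] * 4,
    "candidatevotes": [100, 150, 10, 80, 60, 5, 120, 110, 8, 50, 90, 7],
})

keys = ["year", "state_po"]

# Ascending sort: the winner is the last row of each group.
winners_last = (
    df.sort_values("candidatevotes")
      .groupby(keys, as_index=False)
      .last()
)

# Descending sort: the winner is the first row of each group.
winners_first = (
    df.sort_values("candidatevotes", ascending=False)
      .groupby(keys, as_index=False)
      .first()
)

print(winners_last)
```

Both variants should give one row per (year, state_po) pair, ordered by the group keys.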
Passing `as_index=False` to `.groupby` means that the two columns used for grouping are kept as regular columns instead of becoming the index.
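To make the effect of `as_index` concrete, the sketch below runs the same aggregation both ways on a tiny made-up frame (the `party` column and the numbers are assumptions for illustration only).

```python
import pandas as pd

# Made-up data; "party" is an assumed column name.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000],
    "state_po": ["AL", "AL", "AK", "AK"],
    "party": ["DEM", "REP", "DEM", "REP"],
    "candidatevotes": [100, 150, 80, 60],
})

keys = ["year", "state_po"]
sorted_df = df.sort_values("candidatevotes")

# Default (as_index=True): the group keys become a MultiIndex.
indexed = sorted_df.groupby(keys).last()

# as_index=False: the group keys stay as ordinary columns.
flat = sorted_df.groupby(keys, as_index=False).last()

print(list(indexed.index.names), list(indexed.columns))
print(list(flat.columns))
```

Calling `indexed.reset_index()` is another way to get the flat layout after the fact.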
**Note:** if two rows in a group have equal "candidatevotes", only one of them is returned (as with the answer by @eduardokapp). This is unlikely to happen with vote counts, but it is something to be aware of.
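If ties do matter, one possible way to keep every tied row is to compare each row against its group maximum with `transform("max")`. This is a sketch on made-up data with a deliberate tie, not part of the original answer; the `party` column is an assumed name.

```python
import pandas as pd

# Made-up data with a deliberate tie in 2000; "party" is an assumed column name.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2004, 2004, 2004],
    "state_po": ["AL"] * 6,
    "party": ["DEM", "REP", "OTHER"] * 2,
    "candidatevotes": [100, 100, 10, 120, 110, 8],
})

keys = ["year", "state_po"]

# Sort + last keeps exactly one row per group, even when there is a tie.
one_per_group = df.sort_values("candidatevotes").groupby(keys, as_index=False).last()

# Broadcast each group's maximum back onto its rows, then keep every row that matches it.
group_max = df.groupby(keys)["candidatevotes"].transform("max")
all_winners = df[df["candidatevotes"] == group_max]

print(all_winners)
```

Here `all_winners` keeps both tied rows for 2000, while `one_per_group` silently picks one of them.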
Here is a timing comparison between the approach above and the answer by @eduardokapp (more than 7x faster when I ran it):
    # without `.apply`
    %timeit df.sort_values("candidatevotes").groupby(["year", "state_po"], as_index=False).last()
    # 3.67 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    # with `.apply`
    %timeit df.groupby(['year', 'state_po']).apply(lambda group: group.loc[group['candidatevotes'].idxmax()])
    # 27.2 ms ± 338 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
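`%timeit` is an IPython magic, so outside a notebook the standard `timeit` module can be used instead. The sketch below times the sort-based approach against a second `.apply`-free variant based on `groupby(...).idxmax()`. That variant does not come from either answer, it assumes unique index labels, and no claim is made here about which one is faster; the data and the `party` column are made up.

```python
import timeit

import pandas as pd

# Made-up data; "party" is an assumed column name.
df = pd.DataFrame({
    "year": [2000] * 6 + [2004] * 6,
    "state_po": ["AL", "AL", "AL", "AK", "AK", "AK"] * 2,
    "party": ["DEM", "REP", "OTHER"] * 4,
    "candidatevotes": [100, 150, 10, 80, 60, 5, 120, 110, 8, 50, 90, 7],
})

keys = ["year", "state_po"]


def with_sort_last(frame):
    """Sort by votes, then take the last row of each group."""
    return frame.sort_values("candidatevotes").groupby(keys, as_index=False).last()


def with_idxmax(frame):
    """Look up the index label of each group's maximum (assumes unique labels)."""
    winner_labels = frame.groupby(keys)["candidatevotes"].idxmax()
    return frame.loc[winner_labels].reset_index(drop=True)


runs = 50
for func in (with_sort_last, with_idxmax):
    seconds = timeit.timeit(lambda: func(df), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e3:.3f} ms per call")
```

Timings on a 12-row toy frame say little about the real dataset, so it is worth rerunning this on the actual `df`.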
Answer 2
Score: 0
You can use a more complex groupby, like so:
    state_winners = df.groupby(['year', 'state_po']).apply(lambda group: group.loc[group['candidatevotes'].idxmax()])
This code groups by year and state_po, then applies an anonymous (lambda) function to each group. The function takes the group and returns only the row where `group['candidatevotes']` is at its maximum.
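The per-group function described above can also be written out as an explicit loop, which is roughly what the question was reaching for. This is an illustrative sketch on made-up data (the `party` column is an assumed name), not the answer's own code. Rows are collected in a list and turned into a DataFrame once at the end, since `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Made-up data; "party" is an assumed column name.
df = pd.DataFrame({
    "year": [2000] * 6 + [2004] * 6,
    "state_po": ["AL", "AL", "AL", "AK", "AK", "AK"] * 2,
    "party": ["DEM", "REP", "OTHER"] * 4,
    "candidatevotes": [100, 150, 10, 80, 60, 5, 120, 110, 8, 50, 90, 7],
})

winner_rows = []
# One pass over (year, state_po) groups instead of two nested loops.
for _, group in df.groupby(["year", "state_po"]):
    # Index label of the row with the most votes in this group.
    winner_label = group["candidatevotes"].idxmax()
    winner_rows.append(group.loc[winner_label])

# Build the result once, rather than growing a DataFrame inside the loop.
state_winners = pd.DataFrame(winner_rows).reset_index(drop=True)
print(state_winners)
```

This yields one full row per year/state pair, but for larger data the vectorised groupby calls are usually the better fit.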
I recommend taking a look at the pandas groupby documentation, which has some nice examples of its applications.