Iterating over rows and groups in dataframe

Question

Suppose I have the following data frame:

import pandas as pd

d = {'Date': ['2020-1-1', '2020-1-2', '2020-1-3', '2020-1-1', '2020-1-2', '2020-1-3',
              '2020-1-1', '2020-1-2', '2020-1-3'],
     'col2': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
     'col3': [0.01, 0.02, 0.03, 0.02, 0.03, 0.04, 0.05, 0.1, 0.01]}
d = pd.DataFrame(data=d)
d['Date'] = pd.to_datetime(d['Date'])
d

which gives:

        Date col2  col3
0 2020-01-01    A  0.01
1 2020-01-02    A  0.02
2 2020-01-03    A  0.03
3 2020-01-01    B  0.02
4 2020-01-02    B  0.03
5 2020-01-03    B  0.04
6 2020-01-01    C  0.05
7 2020-01-02    C  0.10
8 2020-01-03    C  0.01

How can I iterate over the rows so that, for each Date, I get the two highest values of col3 together with their group from col2?
For example, I should get:

      Date col2  col3
2020-01-01    B  0.02
2020-01-01    C  0.05
2020-01-02    B  0.03
2020-01-02    C  0.10
2020-01-03    A  0.03
2020-01-03    B  0.04

Finally, I want to sum col3 for each day:

      Date   sum
2020-01-01  0.07
2020-01-02  0.13
2020-01-03  0.07

Of course, the real data has many more groups (col2) and more dates.

Answer 1

Score: 1

I think this works:

    d[['Date', 'col3']].groupby('Date')['col3'].nlargest(2).groupby('Date').sum()

It takes the two largest col3 values within each Date and then sums them per Date.
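The one-liner above returns only the per-date sums. The question also asks for the rows behind those sums, including their col2 group. One possible sketch, assuming the sample frame from the question, looks up those rows through the index that `nlargest` produces. The names `top2`, `top2_rows` and `daily_sum` are illustrative, not from the original answer.

```python
import pandas as pd

# Sample data from the question
d = pd.DataFrame({
    "Date": ["2020-1-1", "2020-1-2", "2020-1-3"] * 3,
    "col2": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "col3": [0.01, 0.02, 0.03, 0.02, 0.03, 0.04, 0.05, 0.1, 0.01],
})
d["Date"] = pd.to_datetime(d["Date"])

# Step 1: the two largest col3 values per Date.
# The last index level of the result holds the original row labels of d.
top2 = d.groupby("Date")["col3"].nlargest(2)

# Step 2: use those row labels to recover the full rows, col2 included.
top2_rows = d.loc[top2.index.get_level_values(-1)].sort_values(["Date", "col3"])
print(top2_rows)

# Step 3: sum col3 per Date over the selected rows only.
daily_sum = (
    top2_rows.groupby("Date", as_index=False)["col3"].sum()
             .rename(columns={"col3": "sum"})
)
print(daily_sum)
```

On the sample data this should select rows 3, 6, 4, 7, 2 and 5, the same rows shown in the question, with daily sums of about 0.07, 0.13 and 0.07.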

Answer 2

Score: 1

# d is the DataFrame built in the question
grouped_df = d.groupby("Date")
results = []
for name, group in grouped_df:
    # nlargest already returns the rows ordered by col3 descending,
    # so no separate (in-place) sort of the group is needed
    top_2 = group.nlargest(2, "col3")
    top_2_sum = top_2["col3"].sum()
    results.append((name, top_2_sum))

sum_df = pd.DataFrame(results, columns=["Date", "Sum"])
print(sum_df)

Output:

        Date   Sum
0 2020-01-01  0.07
1 2020-01-02  0.13
2 2020-01-03  0.07
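Since the question mentions many more dates and groups, a loop-free variant may be worth considering. The following is only a sketch under the same assumptions as the sample data: `top_n_sum` and its parameter `n` are hypothetical names introduced here, and it swaps the explicit loop for `sort_values` followed by `groupby(...).head(n)`.

```python
import pandas as pd

def top_n_sum(df: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    """Sum the n largest col3 values for each Date (hypothetical helper)."""
    top = (
        df.sort_values("col3", ascending=False)  # largest col3 first
          .groupby("Date")
          .head(n)                               # first n rows of each Date
    )
    return (
        top.groupby("Date", as_index=False)["col3"].sum()
           .rename(columns={"col3": "Sum"})
    )

# Sample data from the question
d = pd.DataFrame({
    "Date": ["2020-1-1", "2020-1-2", "2020-1-3"] * 3,
    "col2": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "col3": [0.01, 0.02, 0.03, 0.02, 0.03, 0.04, 0.05, 0.1, 0.01],
})
d["Date"] = pd.to_datetime(d["Date"])

sum_df = top_n_sum(d, n=2)
print(sum_df)
```

When several rows tie on col3, this variant and `nlargest` may keep different rows, but the per-date sum is the same either way.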

huangapple
  • Posted on January 9, 2023 at 05:23:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75051412.html