Iterating over rows and groups in a DataFrame
Question

Suppose I have the following DataFrame:

    import pandas as pd

    d = {
        'Date': ['2020-1-1', '2020-1-2', '2020-1-3', '2020-1-1', '2020-1-2',
                 '2020-1-3', '2020-1-1', '2020-1-2', '2020-1-3'],
        'col2': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'col3': [0.01, 0.02, 0.03, 0.02, 0.03, 0.04, 0.05, 0.1, 0.01],
    }
    d = pd.DataFrame(data=d)
    d['Date'] = pd.to_datetime(d['Date'])  # parse the strings as datetimes
    d

and get:

            Date col2  col3
    0 2020-01-01    A  0.01
    1 2020-01-02    A  0.02
    2 2020-01-03    A  0.03
    3 2020-01-01    B  0.02
    4 2020-01-02    B  0.03
    5 2020-01-03    B  0.04
    6 2020-01-01    C  0.05
    7 2020-01-02    C  0.10
    8 2020-01-03    C  0.01

How can I iterate over the rows so that, for each Date, I get the two highest values of col3 together with their group from col2?
For example, I should get:

    2020-01-01    B  0.02
    2020-01-01    C  0.05
    2020-01-02    B  0.03
    2020-01-02    C  0.10
    2020-01-03    A  0.03
    2020-01-03    B  0.04

Finally, I want to sum those col3 values for each day:

          Date   sum
    2020-01-01  0.07
    2020-01-02  0.13
    2020-01-03  0.07

Of course, the real data has many more groups (col2) and more dates.

Answer 1

Score: 1

I think this does it:

    # take the 2 largest col3 values per Date, then sum them per Date
    d[['Date', 'col3']].groupby('Date')['col3'].nlargest(2).groupby('Date').sum()
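
The one-liner above returns only the per-day sums. If the intermediate table from the question (the two top rows per date, with col2 kept) is also needed, one possible sketch is to sort by col3 and take the first two rows of each date group. The variable names `top2` and `daily_sum` are illustrative and not from the original answer, and the sample data is rebuilt here only so the snippet runs on its own.

```python
import pandas as pd

# Sample data from the question, rebuilt so this snippet is self-contained.
d = pd.DataFrame({
    "Date": pd.to_datetime(["2020-1-1", "2020-1-2", "2020-1-3"] * 3),
    "col2": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "col3": [0.01, 0.02, 0.03, 0.02, 0.03, 0.04, 0.05, 0.1, 0.01],
})

# Sort so the largest col3 values come first, keep the first two rows
# of each Date, then order the result by Date and col3 for readability.
top2 = (
    d.sort_values("col3", ascending=False)
     .groupby("Date")
     .head(2)
     .sort_values(["Date", "col3"])
)

# Sum the kept rows per day; as_index=False keeps Date as a regular column.
daily_sum = top2.groupby("Date", as_index=False)["col3"].sum()

print(top2)
print(daily_sum)
```

Because `head(2)` keeps whole rows, col2 and the original row index survive, which the `nlargest` call on a single column does not provide.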

Answer 2

Score: 1

    # Assumes d is the DataFrame from the question and pandas is imported as pd.
    grouped_df = d.groupby("Date")
    results = []
    for name, group in grouped_df:
        # nlargest already picks the two biggest col3 rows, so the separate
        # in-place sort on the group (which can trigger a
        # SettingWithCopyWarning) is not needed.
        top_2 = group.nlargest(2, "col3")
        top_2_sum = top_2["col3"].sum()
        results.append((name, top_2_sum))
    sum_df = pd.DataFrame(results, columns=["Date", "Sum"])
    print(sum_df)

Output:

            Date   Sum
    0 2020-01-01  0.07
    1 2020-01-02  0.13
    2 2020-01-03  0.07
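
Since the question mentions many more groups and dates in the real data, the loop above could also be wrapped in a small reusable function. The following is only a sketch: `top_n_sum` and its parameters `n`, `date_col`, and `value_col` are hypothetical names introduced here, not part of the original answer or of pandas.

```python
import pandas as pd


def top_n_sum(df, n=2, date_col="Date", value_col="col3"):
    """Return one row per date with the sum of its n largest values.

    Hypothetical helper for illustration; the column names default to
    those used in the question.
    """
    return (
        df.groupby(date_col)[value_col]
          .apply(lambda s: s.nlargest(n).sum())  # sum of the n largest per date
          .reset_index(name="Sum")               # back to a flat Date/Sum table
    )


# Sample data from the question, rebuilt so this snippet is self-contained.
d = pd.DataFrame({
    "Date": pd.to_datetime(["2020-1-1", "2020-1-2", "2020-1-3"] * 3),
    "col2": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "col3": [0.01, 0.02, 0.03, 0.02, 0.03, 0.04, 0.05, 0.1, 0.01],
})

result = top_n_sum(d)          # two largest values per day
result_top1 = top_n_sum(d, n=1)  # only the single largest value per day
print(result)
```

This gives the same Date/Sum layout as the loop while letting the number of values per day vary.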

huangapple
  • Posted on January 9, 2023, 05:23:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75051412.html