Aggregate rows in pandas

Question

I have many similar rows in pandas like this:

Date                 Position
2023-08-01 12:01:00  A23
2023-08-01 12:20:00  A23
2023-08-01 13:10:10  A23
2023-08-02 12:00:00  B12
2023-08-02 12:01:00  A23
2023-08-02 12:05:00  A23

I need to aggregate the rows by "Position" and merge each run into a datetime range, like this:

Date                 Date2                Position
2023-08-01 12:01:00  2023-08-01 13:10:10  A23
2023-08-02 12:00:00  NaN                  B12
2023-08-02 12:01:00  2023-08-02 12:05:00  A23

Thank you

Answer 1

Score: 1

Assuming you want the min/max date for each group of identical successive positions, you can use a custom groupby.agg with some post-processing:

    import pandas as pd

    # ensure datetime
    df['Date'] = pd.to_datetime(df['Date'])

    # group successive positions: the counter increases each time Position changes
    group = df['Position'].ne(df['Position'].shift()).cumsum()

    out = (df
       .groupby(group, as_index=False)
       .agg(Date=('Date', 'min'),
            Date2=('Date', 'max'),
            Position=('Position', 'first'),
            n=('Position', 'count')
            )
       # hide Date2 if there was not more than 1 item in the group
       # you could also check that Date != Date2
       .assign(Date2=lambda d: d['Date2'].where(d.pop('n').gt(1)))
    )

Note: to group by position and day instead, use .groupby(['Position', df['Date'].dt.normalize()], as_index=False).

Output:

                     Date               Date2 Position
    0 2023-08-01 12:01:00 2023-08-01 13:10:10      A23
    1 2023-08-02 12:00:00                 NaT      B12
    2 2023-08-02 12:01:00 2023-08-02 12:05:00      A23
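
The note about grouping by position and day can be sketched as follows. This is only one cautious way to do it, not necessarily what the answer's author had in mind: instead of passing the normalized Series directly to groupby, it stores the day in a helper column (the name `Day` is an assumption made here for illustration) and drops it afterwards. Keep in mind that this variant merges all rows of a position that fall on the same calendar day, whether or not they are consecutive.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-08-01 12:01:00", "2023-08-01 12:20:00", "2023-08-01 13:10:10",
        "2023-08-02 12:00:00", "2023-08-02 12:01:00", "2023-08-02 12:05:00",
    ]),
    "Position": ["A23", "A23", "A23", "B12", "A23", "A23"],
})

by_day = (
    df.assign(Day=df["Date"].dt.normalize())       # 'Day' is a hypothetical helper column
      .groupby(["Position", "Day"], as_index=False)
      .agg(Date=("Date", "min"),
           Date2=("Date", "max"),
           n=("Date", "count"))
)

# Blank out Date2 for groups that contain a single row, then drop the helpers
by_day["Date2"] = by_day["Date2"].where(by_day["n"].gt(1))
by_day = by_day.drop(columns=["Day", "n"])

print(by_day)
```

With the default sort=True the rows come back ordered by Position and then by day, so the result is no longer in the chronological order of the input.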

Answer 2

Score: 1

    import pandas as pd
    from pandas import Timestamp

    df = pd.DataFrame(
        {'Date': {0: Timestamp('2023-08-01 12:01:00'),
                  1: Timestamp('2023-08-01 12:20:00'),
                  2: Timestamp('2023-08-01 13:10:10'),
                  3: Timestamp('2023-08-02 12:00:00'),
                  4: Timestamp('2023-08-02 12:01:00'),
                  5: Timestamp('2023-08-02 12:05:00')},
         'Position': {0: 'A23',
                      1: 'A23',
                      2: 'A23',
                      3: 'B12',
                      4: 'A23',
                      5: 'A23'}}
    )

    # start a new group whenever the Position value differs from the previous row's
    df['Group'] = (df['Position'] != df['Position'].shift()).cumsum()

    # group by the Group and Position columns, take the min and max of Date, then drop Group
    df = df.groupby(['Group', 'Position'])['Date'].agg(['min', 'max']).reset_index().drop('Group', axis=1)

    # if max == min, then max should be NaT
    df['max'] = df['max'].where(df['max'] != df['min'])

    # rename the columns to the desired names
    df.rename(columns={'min': 'Date1', 'max': 'Date2'}, inplace=True)

Output:

      Position               Date1               Date2
    0      A23 2023-08-01 12:01:00 2023-08-01 13:10:10
    1      B12 2023-08-02 12:00:00                 NaT
    2      A23 2023-08-02 12:01:00 2023-08-02 12:05:00
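
The two answers decide differently when to blank out Date2: one counts the rows in each group, the other compares max with min. On the question's data both give the same result, but they can diverge when a group holds several rows with the same timestamp. The sketch below uses made-up data (not from the question) purely to illustrate that edge case; the variable names are assumptions.

```python
import pandas as pd

# Hypothetical data: two consecutive A23 rows sharing the same timestamp
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-08-01 12:00:00", "2023-08-01 12:00:00", "2023-08-01 13:00:00",
    ]),
    "Position": ["A23", "A23", "B12"],
})

# Label each run of identical consecutive positions
run_id = df["Position"].ne(df["Position"].shift()).cumsum().rename("run")

agg = df.groupby(run_id).agg(
    Date=("Date", "min"),
    Date2=("Date", "max"),
    Position=("Position", "first"),
    n=("Position", "count"),
)

# Count-based test: keep Date2 when the group has more than one row
by_count = agg["Date2"].where(agg["n"].gt(1))

# Equality-based test: keep Date2 only when it differs from Date
by_equality = agg["Date2"].where(agg["Date2"].ne(agg["Date"]))

print(pd.DataFrame({"by_count": by_count, "by_equality": by_equality}))
```

For the A23 group the count-based test keeps Date2 (equal to Date), while the equality-based test turns it into NaT. Which behavior is preferable depends on whether a repeated timestamp should count as a range.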

huangapple
  • Posted on August 9, 2023, 14:42:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76865203-2.html