Aggregate rows in pandas

Question

I have many similar rows in a pandas DataFrame, like this:

Date                 Position
2023-08-01 12:01:00  A23
2023-08-01 12:20:00  A23
2023-08-01 13:10:10  A23
2023-08-02 12:00:00  B12
2023-08-02 12:01:00  A23
2023-08-02 12:05:00  A23

I need to aggregate the values by "Position" and merge the datetime range, like this:

Date                 Date2                Position
2023-08-01 12:01:00  2023-08-01 13:10:10  A23
2023-08-02 12:00:00  NaN                  B12
2023-08-02 12:01:00  2023-08-02 12:05:00  A23

Thank you.

Answer 1

Score: 1

Assuming you want the min/max date per group of identical successive positions, you can use a custom groupby.agg with post-processing:

# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])

# group successive positions
group = df['Position'].ne(df['Position'].shift()).cumsum()

out = (df
   .groupby(group, as_index=False)
   .agg(Date=('Date', 'min'),
        Date2=('Date', 'max'),
        Position=('Position', 'first'),
        n=('Position', 'count')
       )
   # hide Date2 when the group holds a single row
   # (you could instead check that Date != Date2)
   .assign(Date2=lambda d: d['Date2'].where(d.pop('n').gt(1)))
)

NB. to group by position and day, use .groupby(['Position', df['Date'].dt.normalize()], as_index=False).
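
The note above can be sketched as follows. This is a minimal illustration rather than the answerer's exact code: the sample data is copied from the question, and the key name `Day` and the use of `reset_index()` in place of `as_index=False` are assumptions made here so that the day key stays distinct from the `Date` output column.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-08-01 12:01:00", "2023-08-01 12:20:00", "2023-08-01 13:10:10",
        "2023-08-02 12:00:00", "2023-08-02 12:01:00", "2023-08-02 12:05:00",
    ]),
    "Position": ["A23", "A23", "A23", "B12", "A23", "A23"],
})

# Calendar day of each row; the name "Day" is an illustrative choice.
day = df["Date"].dt.normalize().rename("Day")

by_day = (
    df.groupby(["Position", day])
      .agg(Date=("Date", "min"), Date2=("Date", "max"), n=("Date", "count"))
      .reset_index()
)

# Blank out Date2 for single-row groups, then drop the helper columns.
by_day["Date2"] = by_day["Date2"].where(by_day["n"].gt(1))
by_day = by_day.drop(columns=["Day", "n"])
print(by_day)
```

Unlike the run-based grouping, this merges all rows that share a position and a day even when other positions appear between them, and the result is sorted by the group keys rather than by original row order.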

Output:

                 Date               Date2 Position
0 2023-08-01 12:01:00 2023-08-01 13:10:10      A23
1 2023-08-02 12:00:00                 NaT      B12
2 2023-08-02 12:01:00 2023-08-02 12:05:00      A23
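
To see why `ne(shift()).cumsum()` labels runs of successive identical positions, it may help to print the intermediate values. The sketch below uses only the `Position` column from the question; the names `changed` and `run_id` are illustrative and do not appear in the answer.

```python
import pandas as pd

pos = pd.Series(["A23", "A23", "A23", "B12", "A23", "A23"], name="Position")

# True wherever a row differs from the previous row
# (the first row is compared against NaN, so it is always True).
changed = pos.ne(pos.shift())

# A running count of the changes gives every run its own label.
run_id = changed.cumsum()

print(pd.DataFrame({"Position": pos, "changed": changed, "run_id": run_id}))
```

The two separate A23 runs receive different labels, which is what keeps them as separate rows in the output.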

Answer 2

Score: 1

import pandas as pd
from pandas import Timestamp

df = pd.DataFrame(
    {'Date': {0: Timestamp('2023-08-01 12:01:00'), 
              1: Timestamp('2023-08-01 12:20:00'), 
              2: Timestamp('2023-08-01 13:10:10'), 
              3: Timestamp('2023-08-02 12:00:00'), 
              4: Timestamp('2023-08-02 12:01:00'), 
              5: Timestamp('2023-08-02 12:05:00')}, 
    'Position': {0: 'A23', 
                 1: 'A23', 
                 2: 'A23', 
                 3: 'B12', 
                 4: 'A23', 
                 5: 'A23'}}
)


# check if the Position value is the same as the previous row's Position value
df['Group'] = (df['Position'] != df['Position'].shift()).cumsum()

# group by the group and position columns, then get the min and max of the date column, then drop the group column
df = df.groupby(['Group', 'Position'])['Date'].agg(['min', 'max']).reset_index().drop('Group', axis=1)

# if max == min, blank out max (for datetimes the missing value shows as NaT)
df['max'] = df['max'].where(df['max'] != df['min'])

# rename the columns to the desired names
df.rename(columns={'min': 'Date1', 'max': 'Date2'}, inplace=True)

# Output:
>>> df
  Position               Date1               Date2
0      A23 2023-08-01 12:01:00 2023-08-01 13:10:10
1      B12 2023-08-02 12:00:00                 NaT
2      A23 2023-08-02 12:01:00 2023-08-02 12:05:00
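
The two answers blank out Date2 in different ways: the first checks the row count of each group, the second checks whether max equals min. They agree on the question's data, but they can differ when a run contains several rows with the same timestamp. The sketch below uses hypothetical data (not from the question) to show the difference; `by_count` and `by_value` are illustrative names.

```python
import pandas as pd

# Hypothetical data: one run of two rows sharing the same timestamp.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-08-01 12:00:00", "2023-08-01 12:00:00"]),
    "Position": ["A23", "A23"],
})

run_id = df["Position"].ne(df["Position"].shift()).cumsum()
agg = df.groupby(run_id).agg(
    Date=("Date", "min"), Date2=("Date", "max"), n=("Date", "count")
)

# Count-based check: the run has two rows, so Date2 is kept.
by_count = agg["Date2"].where(agg["n"].gt(1))

# Value-based check: max equals min, so Date2 becomes NaT.
by_value = agg["Date2"].where(agg["Date2"].ne(agg["Date"]))

print(by_count)
print(by_value)
```

Which behaviour is preferable depends on whether a zero-length range should count as a range in your data.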


huangapple
  • Published on 2023-08-09 14:42:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76865203.html