Aggregate rows in pandas
Question
I have many similar rows in pandas like this:
Date | Position |
---|---|
2023-08-01 12:01:00 | A23 |
2023-08-01 12:20:00 | A23 |
2023-08-01 13:10:10 | A23 |
2023-08-02 12:00:00 | B12 |
2023-08-02 12:01:00 | A23 |
2023-08-02 12:05:00 | A23 |
and I need to aggregate the rows by "Position" and merge each run into a datetime range, like this:
Date | Date2 | Position |
---|---|---|
2023-08-01 12:01:00 | 2023-08-01 13:10:10 | A23 |
2023-08-02 12:00:00 | NaN | B12 |
2023-08-02 12:01:00 | 2023-08-02 12:05:00 | A23 |
Thank you
Answer 1
Score: 1
Assuming you want the min/max date for each group of identical successive positions, you can use a custom groupby.agg with post-processing:
import pandas as pd

# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])

# group successive positions: the label increases each time Position changes
group = df['Position'].ne(df['Position'].shift()).cumsum()

out = (df
   .groupby(group, as_index=False)
   .agg(Date=('Date', 'min'),
        Date2=('Date', 'max'),
        Position=('Position', 'first'),
        n=('Position', 'count')
       )
   # hide Date2 if there was not more than 1 item in the group
   # you could also check that Date ≠ Date2
   .assign(Date2=lambda d: d['Date2'].where(d.pop('n').gt(1)))
)
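To see what the `group` key above contains, here is a minimal sketch on the question's Position column. The names `positions`, `starts_new_run` and `run_id` are chosen for this illustration only and are not part of the answer.

```python
import pandas as pd

# Position column from the question's sample data
positions = pd.Series(["A23", "A23", "A23", "B12", "A23", "A23"], name="Position")

# True wherever a row differs from the row before it.
# The first row is always True because shift() places NaN in front of it.
starts_new_run = positions.ne(positions.shift())

# Running total of the True flags: rows in the same run share one label.
run_id = starts_new_run.cumsum()

print(starts_new_run.tolist())  # [True, False, False, True, True, False]
print(run_id.tolist())          # [1, 1, 1, 2, 3, 3]
```

Because the label only changes when Position changes, the two separate A23 runs get different labels (1 and 3) and are aggregated separately.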
NB. To group by position and day instead, use .groupby(['Position', df['Date'].dt.normalize()], as_index=False).
Output:
Date Date2 Position
0 2023-08-01 12:01:00 2023-08-01 13:10:10 A23
1 2023-08-02 12:00:00 NaT B12
2 2023-08-02 12:01:00 2023-08-02 12:05:00 A23
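For the per-day variant mentioned in the note, the following is one possible sketch rather than the answer's exact code. As an assumption made here, the normalized day key is renamed to `Day` (an illustrative name) so it cannot clash with the aggregated `Date` column, and the default index plus `reset_index()` is used in place of `as_index=False`.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2023-08-01 12:01:00", "2023-08-01 12:20:00", "2023-08-01 13:10:10",
        "2023-08-02 12:00:00", "2023-08-02 12:01:00", "2023-08-02 12:05:00",
    ]),
    "Position": ["A23", "A23", "A23", "B12", "A23", "A23"],
})

# Calendar day of each row (time set to midnight), under a name of its own
day = df["Date"].dt.normalize().rename("Day")

out = (
    df.groupby(["Position", day])["Date"]
      .agg(Date="min", Date2="max", n="count")
      .reset_index()
)

# Hide the end date for groups that contain a single row
out["Date2"] = out["Date2"].where(out["n"].gt(1))

# Drop the helper columns and restore the column order used above
out = out.drop(columns=["n", "Day"])[["Date", "Date2", "Position"]]
print(out)
```

With this grouping the rows come back sorted by position and day (A23, A23, B12) rather than in their original order, and non-adjacent rows for the same position on the same day would fall into one group.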
Answer 2
Score: 1
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame(
    {'Date': {0: Timestamp('2023-08-01 12:01:00'),
              1: Timestamp('2023-08-01 12:20:00'),
              2: Timestamp('2023-08-01 13:10:10'),
              3: Timestamp('2023-08-02 12:00:00'),
              4: Timestamp('2023-08-02 12:01:00'),
              5: Timestamp('2023-08-02 12:05:00')},
     'Position': {0: 'A23',
                  1: 'A23',
                  2: 'A23',
                  3: 'B12',
                  4: 'A23',
                  5: 'A23'}}
)
# check if the Position value is the same as the previous row's Position value
df['Group'] = (df['Position'] != df['Position'].shift()).cumsum()
# group by the group and position columns, then get the min and max of the date column, then drop the group column
df = df.groupby(['Group', 'Position'])['Date'].agg(['min', 'max']).reset_index().drop('Group', axis=1)
# if max == min, then max should be NaN
df['max'] = df['max'].where(df['max'] != df['min'])
# rename the columns to the desired names
df.rename(columns={'min': 'Date1', 'max': 'Date2'}, inplace=True)
# Output:
>>> df
Position Date1 Date2
0 A23 2023-08-01 12:01:00 2023-08-01 13:10:10
1 B12 2023-08-02 12:00:00 NaT
2 A23 2023-08-02 12:01:00 2023-08-02 12:05:00
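The two answers hide the end date by different rules: the first checks the row count of each run, the second checks whether max equals min. On the question's data they agree, but they can differ when a run holds several rows with the same timestamp. The sketch below uses hypothetical data (not from the question) to show that case; `by_count` and `by_equality` are illustrative names.

```python
import pandas as pd

# Hypothetical data: one run of two rows that share the same timestamp
df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-08-01 12:00:00", "2023-08-01 12:00:00"]),
    "Position": ["A23", "A23"],
})

run_id = df["Position"].ne(df["Position"].shift()).cumsum()
agg = df.groupby(run_id)["Date"].agg(["min", "max", "count"])

# Rule 1: hide the end date when the run has a single row
by_count = agg["max"].where(agg["count"].gt(1))

# Rule 2: hide the end date when it equals the start date
by_equality = agg["max"].where(agg["max"].ne(agg["min"]))

print(by_count.tolist())     # end date is kept
print(by_equality.tolist())  # end date is hidden (NaT)
```

Which rule is preferable depends on whether duplicated timestamps should count as a range; the question does not say.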
</details>