Pandas: finding dates earlier than the current row's date, grouped by type
Question
I have a pandas DataFrame with id, type and date columns. I want to create a new column as shown below: for each row, group by type and count the records whose dates are earlier than the current row's date but no more than two years before it.
id  type  date
1   a     2023-06-18
2   a     2022-06-18
3   a     2021-06-18
4   b     2023-06-18
5   b     2020-06-18
6   c     2023-06-18
id  type  date        past_records_in_2_years
1   a     2023-06-18  2
2   a     2022-06-18  1
3   a     2021-06-18  0
4   b     2023-06-18  0
5   b     2020-06-18  0
6   c     2023-06-18  0
I tried using a for loop, but I have 1M+ rows, so it takes too long.
for i in range(len(df)):
    temp = df[df['type'] == df.loc[i]['type']].reset_index(drop=True)
    if len(temp) > 1:
        past_dates = 0
        for j in range(len(temp)):
            if (temp.loc[j]['date'] - df.loc[i]['date']) / np.timedelta64(1, 'Y') < 3:
                past_dates += 1
        if past_dates >= 2:
            df[i]['date'] = 1
        else:
            df[i]['date'] = 0
    else:
        df[i]['date'] = 0
Answer 1
Score: 1
A pandas merge followed by a filter should suffice:
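import pandas as pd

# Assumes df['date'] is already a datetime64 column.
# For each row, keep its date ("present") and the date two years after it.
other = pd.DataFrame({'type': df['type'],
                      'present': df.date,
                      'two_yrs_ahead': df.date.add(pd.DateOffset(years=2))})
# Merging on type pairs every row with every row of the same type.
(df.merge(
    other,
    on='type'
 )
 # True where the paired record is strictly earlier and at most two years old.
 .assign(counts=lambda f: f.date.gt(f.present) &
                          f.date.le(f.two_yrs_ahead))
 .groupby(df.columns.tolist())
 .counts
 .sum()
)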
id  type  date
1   a     2023-06-18    2
2   a     2022-06-18    1
3   a     2021-06-18    0
4   b     2023-06-18    0
5   b     2020-06-18    0
6   c     2023-06-18    0
Name: counts, dtype: int64
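Note that the merge on type produces one row per pair of records within each type, so memory may become a concern at 1M+ rows if some types are large. As a possible alternative (not part of the original answer), here is a minimal sketch that sorts the dates within each type and counts with np.searchsorted. The function name past_records_within and its years parameter are hypothetical, and it assumes the date column is already datetime64. It subtracts the offset from the current row's date rather than adding it to the earlier date, so results could differ from the merge approach around leap days.

```python
import numpy as np
import pandas as pd


def past_records_within(df, years=2):
    """Count, per row, strictly earlier same-type records at most `years` years back."""
    counts = np.zeros(len(df), dtype=np.int64)
    # GroupBy.indices maps each type to the positional indices of its rows.
    for positions in df.groupby("type").indices.values():
        dates = df["date"].iloc[positions]
        sorted_dates = np.sort(dates.to_numpy())
        lower = (dates - pd.DateOffset(years=years)).to_numpy()
        # Number of dates d with lower <= d < current date.
        first = np.searchsorted(sorted_dates, lower, side="left")
        last = np.searchsorted(sorted_dates, dates.to_numpy(), side="left")
        counts[positions] = last - first
    return counts


# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "type": ["a", "a", "a", "b", "b", "c"],
    "date": pd.to_datetime(["2023-06-18", "2022-06-18", "2021-06-18",
                            "2023-06-18", "2020-06-18", "2023-06-18"]),
})
df["past_records_in_2_years"] = past_records_within(df)
print(df["past_records_in_2_years"].tolist())  # [2, 1, 0, 0, 0, 0]
```

Each type costs a sort plus two binary searches per row, so the work grows roughly as n log n per group instead of with the square of the group size.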