Pandas: finding dates lower than the current row's date, grouped by type

Question

I have a pandas DataFrame with id, type and date columns. I want to add a new column, past_records_in_2_years, as shown below: group by type and, for each row, count the other rows of the same type whose date is earlier than that row's date by no more than two years.

id	type	date
1	a	2023-06-18
2	a	2022-06-18
3	a	2021-06-18
4	b	2023-06-18
5	b	2020-06-18
6	c	2023-06-18

id	type	date	past_records_in_2_years
1	a	2023-06-18	2
2	a	2022-06-18	1
3	a	2021-06-18	0
4	b	2023-06-18	0
5	b	2020-06-18	0
6	c	2023-06-18	0
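
To make the example reproducible, the sample frame above could be built as follows. This is only a minimal sketch: it assumes date should be parsed to datetime64 with pd.to_datetime, which the question does not show.

```python
import pandas as pd

# Sample data copied from the first table in the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "type": ["a", "a", "a", "b", "b", "c"],
    "date": ["2023-06-18", "2022-06-18", "2021-06-18",
             "2023-06-18", "2020-06-18", "2023-06-18"],
})

# Assumption: dates are stored as datetime64 so they can be compared and offset
df["date"] = pd.to_datetime(df["date"])
```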

I tried using a for loop, but I have 1M+ rows, so it takes too long.

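import pandas as pd

# Assumes df has a default RangeIndex and a datetime64 date column
df['past_records_in_2_years'] = 0

for i in range(len(df)):
    # All rows sharing row i's type (rescanned on every iteration, hence slow)
    temp = df[df['type'] == df.loc[i, 'type']].reset_index(drop=True)
    current = df.loc[i, 'date']
    past_dates = 0
    for j in range(len(temp)):
        other_date = temp.loc[j, 'date']
        # Count strictly earlier dates that are at most two years back
        if current - pd.DateOffset(years=2) <= other_date < current:
            past_dates += 1
    # Write to the new column by row label instead of overwriting date
    df.loc[i, 'past_records_in_2_years'] = past_dates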

Answer 1

Score: 1

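A pandas self-merge on type followed by a filter should suffice. Pair each row with every date of the same type, flag the pairs where the paired date is strictly earlier and at most two years back, then sum the flags per original row:

import pandas as pd

# df is assumed to have columns id, type, date, with date as datetime64
# Lookup table: every date per type, plus the end of its two-year window
other = pd.DataFrame({'type': df['type'],
                      'present': df['date'],
                      'two_yrs_ahead': df['date'].add(pd.DateOffset(years=2))})
(df.merge(
     other,
     on='type'
    )
 # True when the paired date is strictly earlier and within two years
 .assign(counts=lambda f: f.date.gt(f.present) &
                          f.date.le(f.two_yrs_ahead))
 .groupby(df.columns.tolist())
 .counts
 .sum()
)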

id  type  date
1   a     2023-06-18    2
2   a     2022-06-18    1
3   a     2021-06-18    0
4   b     2023-06-18    0
5   b     2020-06-18    0
6   c     2023-06-18    0
Name: counts, dtype: int64
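
One caveat with the self-merge: it materializes every same-type pair of rows, so a type with n rows contributes n * n intermediate rows, which can get heavy at 1M+ rows if a few types dominate. A possible alternative, sketched here and not part of the original answer, sorts each group's dates and counts with np.searchsorted. The helper name count_past_two_years is hypothetical, and the sample frame is rebuilt so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Sample data from the question (rebuilt so this block is self-contained)
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "type": ["a", "a", "a", "b", "b", "c"],
    "date": pd.to_datetime(["2023-06-18", "2022-06-18", "2021-06-18",
                            "2023-06-18", "2020-06-18", "2023-06-18"]),
})


def count_past_two_years(dates: pd.Series) -> pd.Series:
    """Hypothetical helper: for each date in one group, count strictly
    earlier dates whose two-year window still covers it."""
    values = dates.to_numpy()
    # Sorted dates of the group, and the sorted ends of their two-year windows
    starts = np.sort(values)
    expiry = np.sort((dates + pd.DateOffset(years=2)).to_numpy())
    # Number of dates strictly earlier than each value
    earlier = np.searchsorted(starts, values, side="left")
    # Of those, the number whose window ended before each value
    expired = np.searchsorted(expiry, values, side="left")
    return pd.Series(earlier - expired, index=dates.index)


df["past_records_in_2_years"] = (
    df.groupby("type")["date"].transform(count_past_two_years)
)
print(df)
```

On the sample data this produces the same counts as the merge above (2, 1, 0, 0, 0, 0). Each group costs a sort plus two binary searches per row rather than a pairwise comparison, and the result lands directly in a new column of df.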

huangapple
  • Posted on June 19, 2023, 10:27:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76503279.html