June 19, 2023 10:27:28

Pandas: finding dates earlier than the current row's date, grouped by type

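Question

I have a pandas DataFrame with id, type, and date columns. I want to add a new column, as shown below: for each row, count the other rows of the same type whose date is earlier than that row's date by at most two years.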

id  type  date
1   a     2023-06-18
2   a     2022-06-18
3   a     2021-06-18
4   b     2023-06-18
5   b     2020-06-18
6   c     2023-06-18

id  type  date        past_records_in_2_years
1   a     2023-06-18  2
2   a     2022-06-18  1
3   a     2021-06-18  0
4   b     2023-06-18  0
5   b     2020-06-18  0
6   c     2023-06-18  0
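
To try the code in this thread, the sample input can be rebuilt as follows. This is a minimal sketch: it assumes the date column should be parsed into datetime64 values with pd.to_datetime, which the date arithmetic used later relies on.

```python
import pandas as pd

# Sample input from the question; dates are parsed so that
# comparisons and offsets work on real datetime values.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "type": ["a", "a", "a", "b", "b", "c"],
    "date": pd.to_datetime([
        "2023-06-18", "2022-06-18", "2021-06-18",
        "2023-06-18", "2020-06-18", "2023-06-18",
    ]),
})
print(df.dtypes)
```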

I tried a for loop, but I have 1M+ rows, so it takes too long.

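import pandas as pd

# Slow row-by-row attempt; assumes a default RangeIndex and a datetime 'date' column.
df['past_records_in_2_years'] = 0
for i in range(len(df)):
    # All rows that share the current row's type
    temp = df[df['type'] == df.loc[i, 'type']].reset_index(drop=True)
    current = df.loc[i, 'date']
    past_dates = 0
    for j in range(len(temp)):
        # Count strictly earlier dates that are at most two years back
        if current - pd.DateOffset(years=2) <= temp.loc[j, 'date'] < current:
            past_dates += 1
    # Write with a single .loc call; df[i]['date'] would look up a column named i
    df.loc[i, 'past_records_in_2_years'] = past_dates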

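Note: writing results back into the DataFrame row by row inside a loop is slow and error-prone. It is usually better to compute the values with a vectorized operation (or apply) and assign them as a new column.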


Answer 1

Score: 1

A pandas merge followed by a filter should suffice:

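import pandas as pd

# Assumes df['date'] is a datetime64 column.
# Each same-type date ("present") and the end of its two-year window.
other = pd.DataFrame({'type': df['type'],
                      'present': df.date,
                      'two_yrs_ahead': df.date.add(pd.DateOffset(years=2))})
(df.merge(
     other,
     on='type'
    )
# True where the row's date is later than "present" but within two years of it
.assign(counts=lambda f: f.date.gt(f.present) &
                           f.date.le(f.two_yrs_ahead))
.groupby(df.columns.tolist())
.counts
.sum()
)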
id  type  date
1   a     2023-06-18    2
2   a     2022-06-18    1
3   a     2021-06-18    0
4   b     2023-06-18    0
5   b     2020-06-18    0
6   c     2023-06-18    0
Name: counts, dtype: int64
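
The merge above materializes every same-type pair of rows, so its intermediate frame can grow quickly when a single type has many rows. As a hedged alternative (not part of the original answer), the sketch below sorts each group's dates once and uses numpy.searchsorted to count the dates that fall inside the two-year window. The helper name count_past_records is hypothetical, and the code assumes that the date column is datetime64 and that the DataFrame index is unique.

```python
import numpy as np
import pandas as pd


def count_past_records(dates: pd.Series, years: int = 2) -> pd.Series:
    """Count, per date, the strictly earlier dates at most `years` years older."""
    current = dates.to_numpy()
    sorted_dates = np.sort(current)
    window_start = (dates - pd.DateOffset(years=years)).to_numpy()
    # Position of the first date >= window start (the window start is inclusive).
    lo = np.searchsorted(sorted_dates, window_start, side="left")
    # Position of the first date >= current date (equal dates are excluded).
    hi = np.searchsorted(sorted_dates, current, side="left")
    return pd.Series(hi - lo, index=dates.index)


# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "type": ["a", "a", "a", "b", "b", "c"],
    "date": pd.to_datetime([
        "2023-06-18", "2022-06-18", "2021-06-18",
        "2023-06-18", "2020-06-18", "2023-06-18",
    ]),
})

# Apply the helper to each type separately; assignment aligns on the index.
df["past_records_in_2_years"] = pd.concat(
    [count_past_records(g) for _, g in df.groupby("type")["date"]]
)
print(df)
```

Each group costs one sort plus two binary searches per row, so the work is roughly O(n log n) per group rather than quadratic in the group size.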

