2023年6月12日 08:59:18go评论169阅读模式

英文:

How to compare dates from two different datasets in python

问题

我有2个数据集，需要从数据集1中获取一个日期列，以与数据集2中的另一个日期列进行比较。

df1：

REQUEST_ID   NAME      CREATED_DATE   PRIOR_DATE   STATUS
100          ADAM      1/24/2022      10/24/2021  Approved
101          GRACE     4/12/2022      1/12/2022   Approved
102          BLAKE D.  9/21/2022      6/21/2022   Pending
103          FRANK     5/18/2022      2/18/2022   Approved

df2：

ID      Name     Start_Date    End_Date    Team
10000   Michael  11/23/2021    1/23/2022   Sales
10000   Michael  1/23/2022     5/2/2022    Sales
10001   Adam     9/24/2021     12/22/2021  Tech
10001   Adam     12/22/2021    4/5/2022    HR
10001   Adam     4/5/2022      9/21/2022   HR
10002   Grace    7/24/2021     12/31/2021  Finance
10002   Grace    12/31/2021    3/5/2022    Finance
10002   Grace    3/5/2022      9/23/2022   Tech
...
...
10025   Blake    11/22/2021    3/12/2022   Sales
10025   Blake    3/12/2022     6/30/2022   Sales
10025   Blake    6/30/2022     9/12/2022   Sales

df2按数字顺序继续下去，直到Blake，所以df1中的名称在df2中。我需要找出df2中具有在CREATED_DATE和PRIOR_DATE范围内的Start_Date的ID，以及在那个时间段内的Team。唯一的问题是，不是所有的名称都匹配或格式相同，因此无法正确合并。我以前从未做过这样的事情，所以不知道该如何继续。下面是期望的输出：

期望的输出：

 ID      Name     Start_Date    End_Date    Status     Team
10000   Michael  11/23/2021    1/23/2022   Approved   Sales
10000   Michael  1/23/2022     5/2/2022    Approved   Sales
10001   Adam     9/24/2022     12/22/2021  Approved   HR
10001   Adam     12/22/2021    4/5/2022    Approved   Finance
10002   Grace    3/5/2022      9/23/2022   Approved   Tech
...
...
10025   Blake    6/30/2022     9/12/2022   Pending    Sales

如果有人知道如何做到这一点，真的需要帮助。谢谢！

英文:

I have 2 datasets and I need to somehow get a date column from dataset 1 to be compared to another date column in dataset 2.

df1

REQUEST_ID      NAME       CREATED_DATE      PRIOR_DATE      STATUS 
100             ADAM        1/24/2022        10/24/2021     Approved
101             GRACE       4/12/2022        1/12/2022      Approved
102             BLAKE D.    9/21/2022        6/21/2022      Pending 
103             FRANK       5/18/2022        2/18/2022      Approved

df2

  ID      Name      Start_Date       End_Date       Team 
10000    Michael    11/23/2021       1/23/2022      Sales 
10000    Michael    1/23/2022        5/2/2022       Sales 
10001    Adam       9/24/2021        12/22/2021     Tech 
10001    Adam       12/22/2021       4/5/2022       HR 
10001    Adam       4/5/2022         9/21/2022      HR 
10002    Grace      7/24/2021        12/31/2021     Finance
10002    Grace      12/31/2021       3/5/2022       Finance
10002    Grace      3/5/2022         9/23/2022      Tech 
.
.
.
10025    Blake      11/22/2021       3/12/2022      Sales 
10025    Blake      3/12/2022        6/30/2022      Sales
10025    Blake      6/30/2022        9/12/2022      Sales

df2 continues down in numeric order until Blake, so the names above in df1 are in df2. I need to find what ID from df2 has a Start_Date that falls in range of the CREATED_DATE and PRIOR_DATE and what team do they align with at time. Only issue is that not all names match or have same formatting so merging can't be done correctly. I have never done anything like this before so I am kind of lost on how to proceed. Below is a desired look.

Desired Output

 ID      Name     Start_Date     End_Date     Status      Team 
10000   Michael   11/23/2021     1/23/2022    Approved    Sales
10000   Michael   1/23/2022      5/2/2022     Approved    Sales 
10001   Adam      9/24/2022      12/22/2021   Approved    HR
10001   Adam      12/22/2021     4/5/2022     Approved    Finance
10002   Grace     3/5/2022       9/23/2022    Approved    Tech 
.
.
.
10025   Blake     6/30/2022      9/12/2022    Pending     Sales

If anyone knows a way to do it could really use the help. Thank you

答案1

得分: 1

以下是您要翻译的内容：

"It looks like a possible combination of a 'fuzzy merge' and a 'nearest date merge'."

有许多帖子已经讨论了这些主题，但也许这个具体的例子有用：

[tag:rapidfuzz] 的 cdist() 用于快速模糊匹配
pandas.merge_asof 用于日期合并

import rapidfuzz

scores = rapidfuzz.process.cdist(
   df2['Name'], df1['NAME'], workers=-1, scorer=rapidfuzz.distance.JaroWinkler.similarity
)

# pick closest match (max score)
NAME = df1['NAME'].loc[scores.argmax(axis=1)].set_axis(df2.index)
# discard non-matches
NAME = NAME[scores.max(axis=1) > 0]

# merge_asof requires sorted keys
pd.merge_asof(
   df2.assign(NAME=NAME).sort_values('Start_Date'),
   df1.sort_values('PRIOR_DATE'),
   by='NAME',
   left_on='Start_Date',
   right_on='PRIOR_DATE',
)

       ID     Name Start_Date   End_Date     Team      NAME  REQUEST_ID CREATED_DATE PRIOR_DATE    STATUS
0   10002    Grace 2021-07-24 2021-12-31  Finance     GRACE         NaN          NaT        NaT       NaN
1   10001     Adam 2021-09-24 2021-12-22     Tech      ADAM         NaN          NaT        NaT       NaN
2   10025    Blake 2021-11-22 2022-03-12    Sales  BLAKE D.         NaN          NaT        NaT       NaN
3   10000  Michael 2021-11-23 2022-01-23    Sales       NaN         NaN          NaT        NaT       NaN
4   10001     Adam 2021-12-22 2022-04-05       HR      ADAM       100.0   2022-01-24 2021-10-24  Approved  # <- not valid
5   10002    Grace 2021-12-31 2022-03-05  Finance     GRACE         NaN          NaT        NaT       NaN
6   10000  Michael 2022-01-23 2022-05-02    Sales       NaN         NaN          NaT        NaT       NaN
7   10002    Grace 2022-03-05 2022-09-23     Tech     GRACE       101.0   2022-04-12 2022-01-12  Approved
8   10025    Blake 2022-03-12 2022-06-30    Sales  BLAKE D.         NaN          NaT        NaT       NaN
9   10001     Adam 2022-04-05 2022-09-21       HR      ADAM       100.0   2022-01-24 2021-10-24  Approved
10  10025    Blake 2022-06-30 2022-09-12    Sales  BLAKE D.       102.0   2022-09-21 2022-06-21   Pending

The asof merge doesn't apply the range check, e.g. row 4 here needs to be "set to NaN".

e.g.

df.loc[
   ~df['Start_Date'].between(df['PRIOR_DATE'], df['CREATED_DATE']), 
   'STATUS'
] = float('nan')

As Michael has no match in df1 it ends up being matched to ADAM with a score of 0.

0         ADAM
1         ADAM
2         ADAM
3         ADAM
4         ADAM
5        GRACE
6        GRACE
7        GRACE
8     BLAKE D.
9     BLAKE D.
10    BLAKE D.
Name: NAME, dtype: object

Which is why we have the > 0 filter:

2         ADAM
3         ADAM
4         ADAM
5        GRACE
6        GRACE
7        GRACE
8     BLAKE D.
9     BLAKE D.
10    BLAKE D.

英文:

It looks like a possible combination of a "fuzzy merge" and a "nearest date merge".

There are many posts already on those topics but perhaps this specific example is useful:

[tag:rapidfuzz]'s .cdist() for fast fuzzy matching
pandas.merge_asof for the date merge

import rapidfuzz

scores = rapidfuzz.process.cdist(
   df2[&#39;Name&#39;], df1[&#39;NAME&#39;], workers=-1, scorer=rapidfuzz.distance.JaroWinkler.similarity
)

# pick closest match (max score)
NAME = df1[&#39;NAME&#39;].loc[scores.argmax(axis=1)].set_axis(df2.index)
# discard non-matches
NAME = NAME[scores.max(axis=1) &gt; 0]

# merge_asof requires sorted keys
pd.merge_asof(
   df2.assign(NAME=NAME).sort_values(&#39;Start_Date&#39;),
   df1.sort_values(&#39;PRIOR_DATE&#39;),
   by=&#39;NAME&#39;,
   left_on=&#39;Start_Date&#39;,
   right_on=&#39;PRIOR_DATE&#39;,
)

       ID     Name Start_Date   End_Date     Team      NAME  REQUEST_ID CREATED_DATE PRIOR_DATE    STATUS
0   10002    Grace 2021-07-24 2021-12-31  Finance     GRACE         NaN          NaT        NaT       NaN
1   10001     Adam 2021-09-24 2021-12-22     Tech      ADAM         NaN          NaT        NaT       NaN
2   10025    Blake 2021-11-22 2022-03-12    Sales  BLAKE D.         NaN          NaT        NaT       NaN
3   10000  Michael 2021-11-23 2022-01-23    Sales       NaN         NaN          NaT        NaT       NaN
4   10001     Adam 2021-12-22 2022-04-05       HR      ADAM       100.0   2022-01-24 2021-10-24  Approved  # &lt;- not valid
5   10002    Grace 2021-12-31 2022-03-05  Finance     GRACE         NaN          NaT        NaT       NaN
6   10000  Michael 2022-01-23 2022-05-02    Sales       NaN         NaN          NaT        NaT       NaN
7   10002    Grace 2022-03-05 2022-09-23     Tech     GRACE       101.0   2022-04-12 2022-01-12  Approved
8   10025    Blake 2022-03-12 2022-06-30    Sales  BLAKE D.         NaN          NaT        NaT       NaN
9   10001     Adam 2022-04-05 2022-09-21       HR      ADAM       100.0   2022-01-24 2021-10-24  Approved
10  10025    Blake 2022-06-30 2022-09-12    Sales  BLAKE D.       102.0   2022-09-21 2022-06-21   Pending

The asof merge doesn't apply the range check, e.g. row 4 here needs to be "set to NaN".

e.g.

df.loc[
   ~df[&#39;Start_Date&#39;].between(df[&#39;PRIOR_DATE&#39;], df[&#39;CREATED_DATE&#39;]), 
   &#39;STATUS&#39;
] = float(&#39;nan&#39;)

As Michael has no match in df1 it ends up being matched to ADAM with a score of 0.

0         ADAM
1         ADAM
2         ADAM
3         ADAM
4         ADAM
5        GRACE
6        GRACE
7        GRACE
8     BLAKE D.
9     BLAKE D.
10    BLAKE D.
Name: NAME, dtype: object

Which is why we have the > 0 filter:

2         ADAM
3         ADAM
4         ADAM
5        GRACE
6        GRACE
7        GRACE
8     BLAKE D.
9     BLAKE D.
10    BLAKE D.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何在Python中比较来自两个不同数据集的日期

问题

答案1

无法在磁盘上创建/保存/加载非常大的数组

使用”for”来重复pandas中的行数。

如何避免Graphene在查询响应时对OrderedDict进行排序？（Python，Django）

How to plot a contour plot (density) of a data file with 3 columns (x, y, density) with the script automatically picking the data array dimension?

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论