如何在Python中比较来自两个不同数据集的日期

huangapple go评论70阅读模式
英文:

How to compare dates from two different datasets in python

问题

我有2个数据集,需要从数据集1中获取一个日期列,以与数据集2中的另一个日期列进行比较。

df1:

REQUEST_ID   NAME      CREATED_DATE   PRIOR_DATE   STATUS
100          ADAM      1/24/2022      10/24/2021  Approved
101          GRACE     4/12/2022      1/12/2022   Approved
102          BLAKE D.  9/21/2022      6/21/2022   Pending
103          FRANK     5/18/2022      2/18/2022   Approved

df2:

ID      Name     Start_Date    End_Date    Team
10000   Michael  11/23/2021    1/23/2022   Sales
10000   Michael  1/23/2022     5/2/2022    Sales
10001   Adam     9/24/2021     12/22/2021  Tech
10001   Adam     12/22/2021    4/5/2022    HR
10001   Adam     4/5/2022      9/21/2022   HR
10002   Grace    7/24/2021     12/31/2021  Finance
10002   Grace    12/31/2021    3/5/2022    Finance
10002   Grace    3/5/2022      9/23/2022   Tech
...
...
10025   Blake    11/22/2021    3/12/2022   Sales
10025   Blake    3/12/2022     6/30/2022   Sales
10025   Blake    6/30/2022     9/12/2022   Sales

df2按数字顺序继续下去,直到Blake,所以df1中的名称在df2中。我需要找出df2中具有在CREATED_DATEPRIOR_DATE范围内的Start_DateID,以及在那个时间段内的Team。唯一的问题是,不是所有的名称都匹配或格式相同,因此无法正确合并。我以前从未做过这样的事情,所以不知道该如何继续。下面是期望的输出:

期望的输出:

 ID      Name     Start_Date    End_Date    Status     Team
10000   Michael  11/23/2021    1/23/2022   Approved   Sales
10000   Michael  1/23/2022     5/2/2022    Approved   Sales
10001   Adam     9/24/2022     12/22/2021  Approved   HR
10001   Adam     12/22/2021    4/5/2022    Approved   Finance
10002   Grace    3/5/2022      9/23/2022   Approved   Tech
...
...
10025   Blake    6/30/2022     9/12/2022   Pending    Sales

如果有人知道如何做到这一点,真的需要帮助。谢谢!

英文:

I have 2 datasets and I need to somehow get a date column from dataset 1 to be compared to another date column in dataset 2.

df1

REQUEST_ID      NAME       CREATED_DATE      PRIOR_DATE      STATUS 
100             ADAM        1/24/2022        10/24/2021     Approved
101             GRACE       4/12/2022        1/12/2022      Approved
102             BLAKE D.    9/21/2022        6/21/2022      Pending 
103             FRANK       5/18/2022        2/18/2022      Approved 

df2

  ID      Name      Start_Date       End_Date       Team 
10000    Michael    11/23/2021       1/23/2022      Sales 
10000    Michael    1/23/2022        5/2/2022       Sales 
10001    Adam       9/24/2021        12/22/2021     Tech 
10001    Adam       12/22/2021       4/5/2022       HR 
10001    Adam       4/5/2022         9/21/2022      HR 
10002    Grace      7/24/2021        12/31/2021     Finance
10002    Grace      12/31/2021       3/5/2022       Finance
10002    Grace      3/5/2022         9/23/2022      Tech 
.
.
.
10025    Blake      11/22/2021       3/12/2022      Sales 
10025    Blake      3/12/2022        6/30/2022      Sales
10025    Blake      6/30/2022        9/12/2022      Sales

df2 continues down in numeric order until Blake, so the names above in df1 are in df2. I need to find what ID from df2 has a Start_Date that falls in range of the CREATED_DATE and PRIOR_DATE and what team do they align with at time. Only issue is that not all names match or have same formatting so merging can't be done correctly. I have never done anything like this before so I am kind of lost on how to proceed. Below is a desired look.

Desired Output

 ID      Name     Start_Date     End_Date     Status      Team 
10000   Michael   11/23/2021     1/23/2022    Approved    Sales
10000   Michael   1/23/2022      5/2/2022     Approved    Sales 
10001   Adam      9/24/2022      12/22/2021   Approved    HR
10001   Adam      12/22/2021     4/5/2022     Approved    Finance
10002   Grace     3/5/2022       9/23/2022    Approved    Tech 
.
.
.
10025   Blake     6/30/2022      9/12/2022    Pending     Sales   

If anyone knows a way to do it could really use the help. Thank you

答案1

得分: 1

以下是您要翻译的内容:

"It looks like a possible combination of a 'fuzzy merge' and a 'nearest date merge'."

有许多帖子已经讨论了这些主题,但也许这个具体的例子有用:

import rapidfuzz

scores = rapidfuzz.process.cdist(
   df2['Name'], df1['NAME'], workers=-1, scorer=rapidfuzz.distance.JaroWinkler.similarity
)

# pick closest match (max score)
NAME = df1['NAME'].loc[scores.argmax(axis=1)].set_axis(df2.index)
# discard non-matches
NAME = NAME[scores.max(axis=1) > 0]

# merge_asof requires sorted keys
pd.merge_asof(
   df2.assign(NAME=NAME).sort_values('Start_Date'),
   df1.sort_values('PRIOR_DATE'),
   by='NAME',
   left_on='Start_Date',
   right_on='PRIOR_DATE',
)
       ID     Name Start_Date   End_Date     Team      NAME  REQUEST_ID CREATED_DATE PRIOR_DATE    STATUS
0   10002    Grace 2021-07-24 2021-12-31  Finance     GRACE         NaN          NaT        NaT       NaN
1   10001     Adam 2021-09-24 2021-12-22     Tech      ADAM         NaN          NaT        NaT       NaN
2   10025    Blake 2021-11-22 2022-03-12    Sales  BLAKE D.         NaN          NaT        NaT       NaN
3   10000  Michael 2021-11-23 2022-01-23    Sales       NaN         NaN          NaT        NaT       NaN
4   10001     Adam 2021-12-22 2022-04-05       HR      ADAM       100.0   2022-01-24 2021-10-24  Approved  # <- not valid
5   10002    Grace 2021-12-31 2022-03-05  Finance     GRACE         NaN          NaT        NaT       NaN
6   10000  Michael 2022-01-23 2022-05-02    Sales       NaN         NaN          NaT        NaT       NaN
7   10002    Grace 2022-03-05 2022-09-23     Tech     GRACE       101.0   2022-04-12 2022-01-12  Approved
8   10025    Blake 2022-03-12 2022-06-30    Sales  BLAKE D.         NaN          NaT        NaT       NaN
9   10001     Adam 2022-04-05 2022-09-21       HR      ADAM       100.0   2022-01-24 2021-10-24  Approved
10  10025    Blake 2022-06-30 2022-09-12    Sales  BLAKE D.       102.0   2022-09-21 2022-06-21   Pending

The asof merge doesn't apply the range check, e.g. row 4 here needs to be "set to NaN".

e.g.

df.loc[
   ~df['Start_Date'].between(df['PRIOR_DATE'], df['CREATED_DATE']), 
   'STATUS'
] = float('nan')

As Michael has no match in df1 it ends up being matched to ADAM with a score of 0.

0         ADAM
1         ADAM
2         ADAM
3         ADAM
4         ADAM
5        GRACE
6        GRACE
7        GRACE
8     BLAKE D.
9     BLAKE D.
10    BLAKE D.
Name: NAME, dtype: object

Which is why we have the > 0 filter:

2         ADAM
3         ADAM
4         ADAM
5        GRACE
6        GRACE
7        GRACE
8     BLAKE D.
9     BLAKE D.
10    BLAKE D.
英文:

It looks like a possible combination of a "fuzzy merge" and a "nearest date merge".

There are many posts already on those topics but perhaps this specific example is useful:

import rapidfuzz

scores = rapidfuzz.process.cdist(
   df2[&#39;Name&#39;], df1[&#39;NAME&#39;], workers=-1, scorer=rapidfuzz.distance.JaroWinkler.similarity
)

# pick closest match (max score)
NAME = df1[&#39;NAME&#39;].loc[scores.argmax(axis=1)].set_axis(df2.index)
# discard non-matches
NAME = NAME[scores.max(axis=1) &gt; 0]

# merge_asof requires sorted keys
pd.merge_asof(
   df2.assign(NAME=NAME).sort_values(&#39;Start_Date&#39;),
   df1.sort_values(&#39;PRIOR_DATE&#39;),
   by=&#39;NAME&#39;,
   left_on=&#39;Start_Date&#39;,
   right_on=&#39;PRIOR_DATE&#39;,
)
       ID     Name Start_Date   End_Date     Team      NAME  REQUEST_ID CREATED_DATE PRIOR_DATE    STATUS
0   10002    Grace 2021-07-24 2021-12-31  Finance     GRACE         NaN          NaT        NaT       NaN
1   10001     Adam 2021-09-24 2021-12-22     Tech      ADAM         NaN          NaT        NaT       NaN
2   10025    Blake 2021-11-22 2022-03-12    Sales  BLAKE D.         NaN          NaT        NaT       NaN
3   10000  Michael 2021-11-23 2022-01-23    Sales       NaN         NaN          NaT        NaT       NaN
4   10001     Adam 2021-12-22 2022-04-05       HR      ADAM       100.0   2022-01-24 2021-10-24  Approved  # &lt;- not valid
5   10002    Grace 2021-12-31 2022-03-05  Finance     GRACE         NaN          NaT        NaT       NaN
6   10000  Michael 2022-01-23 2022-05-02    Sales       NaN         NaN          NaT        NaT       NaN
7   10002    Grace 2022-03-05 2022-09-23     Tech     GRACE       101.0   2022-04-12 2022-01-12  Approved
8   10025    Blake 2022-03-12 2022-06-30    Sales  BLAKE D.         NaN          NaT        NaT       NaN
9   10001     Adam 2022-04-05 2022-09-21       HR      ADAM       100.0   2022-01-24 2021-10-24  Approved
10  10025    Blake 2022-06-30 2022-09-12    Sales  BLAKE D.       102.0   2022-09-21 2022-06-21   Pending

The asof merge doesn't apply the range check, e.g. row 4 here needs to be "set to NaN".

e.g.

df.loc[
   ~df[&#39;Start_Date&#39;].between(df[&#39;PRIOR_DATE&#39;], df[&#39;CREATED_DATE&#39;]), 
   &#39;STATUS&#39;
] = float(&#39;nan&#39;)

As Michael has no match in df1 it ends up being matched to ADAM with a score of 0.

0         ADAM
1         ADAM
2         ADAM
3         ADAM
4         ADAM
5        GRACE
6        GRACE
7        GRACE
8     BLAKE D.
9     BLAKE D.
10    BLAKE D.
Name: NAME, dtype: object

Which is why we have the &gt; 0 filter:

2         ADAM
3         ADAM
4         ADAM
5        GRACE
6        GRACE
7        GRACE
8     BLAKE D.
9     BLAKE D.
10    BLAKE D.

huangapple
  • 本文由 发表于 2023年6月12日 08:59:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76453107.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定