How to compare dates from two different datasets in Python
Question
I have two datasets, and I need to compare a date column from dataset 1 against a date column in dataset 2.
df1
REQUEST_ID      NAME       CREATED_DATE      PRIOR_DATE      STATUS 
100             ADAM        1/24/2022        10/24/2021     Approved
101             GRACE       4/12/2022        1/12/2022      Approved
102             BLAKE D.    9/21/2022        6/21/2022      Pending 
103             FRANK       5/18/2022        2/18/2022      Approved 
df2
  ID      Name      Start_Date       End_Date       Team 
10000    Michael    11/23/2021       1/23/2022      Sales 
10000    Michael    1/23/2022        5/2/2022       Sales 
10001    Adam       9/24/2021        12/22/2021     Tech 
10001    Adam       12/22/2021       4/5/2022       HR 
10001    Adam       4/5/2022         9/21/2022      HR 
10002    Grace      7/24/2021        12/31/2021     Finance
10002    Grace      12/31/2021       3/5/2022       Finance
10002    Grace      3/5/2022         9/23/2022      Tech 
.
.
.
10025    Blake      11/22/2021       3/12/2022      Sales 
10025    Blake      3/12/2022        6/30/2022      Sales
10025    Blake      6/30/2022        9/12/2022      Sales
df2 continues in numeric ID order down to Blake, so the names above in df1 are in df2. I need to find which ID in df2 has a Start_Date that falls between the PRIOR_DATE and CREATED_DATE from df1, and which team that person was on at the time. The only issue is that not all names match or share the same formatting, so a plain merge does not work correctly. I have never done anything like this before, so I am not sure how to proceed. Below is the desired result.
Desired Output
 ID      Name     Start_Date     End_Date     Status      Team 
10000   Michael   11/23/2021     1/23/2022    Approved    Sales
10000   Michael   1/23/2022      5/2/2022     Approved    Sales 
10001   Adam      9/24/2022      12/22/2021   Approved    HR
10001   Adam      12/22/2021     4/5/2022     Approved    Finance
10002   Grace     3/5/2022       9/23/2022    Approved    Tech 
.
.
.
10025   Blake     6/30/2022      9/12/2022    Pending     Sales   
If anyone knows a way to do this, I could really use the help. Thank you.
Answer 1
Score: 1
It looks like a possible combination of a "fuzzy merge" and a "nearest date merge".
There are many posts already on those topics but perhaps this specific example is useful:
- rapidfuzz's process.cdist() for fast fuzzy matching
- pandas.merge_asof for the date merge
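import pandas as pd
import rapidfuzz

# Assumes the date columns in df1 and df2 are already datetime64
# (e.g. converted with pd.to_datetime).

# score every df2 name against every df1 name (rows: df2, columns: df1)
scores = rapidfuzz.process.cdist(
   df2['Name'], df1['NAME'], workers=-1, scorer=rapidfuzz.distance.JaroWinkler.similarity
)
# pick closest match (max score); argmax returns positions, hence .iloc
NAME = df1['NAME'].iloc[scores.argmax(axis=1)].set_axis(df2.index)
# discard non-matches
NAME = NAME[scores.max(axis=1) > 0]
# merge_asof requires sorted keys
df = pd.merge_asof(
   df2.assign(NAME=NAME).sort_values('Start_Date'),
   df1.sort_values('PRIOR_DATE'),
   by='NAME',
   left_on='Start_Date',
   right_on='PRIOR_DATE',
)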
       ID     Name Start_Date   End_Date     Team      NAME  REQUEST_ID CREATED_DATE PRIOR_DATE    STATUS
0   10002    Grace 2021-07-24 2021-12-31  Finance     GRACE         NaN          NaT        NaT       NaN
1   10001     Adam 2021-09-24 2021-12-22     Tech      ADAM         NaN          NaT        NaT       NaN
2   10025    Blake 2021-11-22 2022-03-12    Sales  BLAKE D.         NaN          NaT        NaT       NaN
3   10000  Michael 2021-11-23 2022-01-23    Sales       NaN         NaN          NaT        NaT       NaN
4   10001     Adam 2021-12-22 2022-04-05       HR      ADAM       100.0   2022-01-24 2021-10-24  Approved
5   10002    Grace 2021-12-31 2022-03-05  Finance     GRACE         NaN          NaT        NaT       NaN
6   10000  Michael 2022-01-23 2022-05-02    Sales       NaN         NaN          NaT        NaT       NaN
7   10002    Grace 2022-03-05 2022-09-23     Tech     GRACE       101.0   2022-04-12 2022-01-12  Approved
8   10025    Blake 2022-03-12 2022-06-30    Sales  BLAKE D.         NaN          NaT        NaT       NaN
9   10001     Adam 2022-04-05 2022-09-21       HR      ADAM       100.0   2022-01-24 2021-10-24  Approved  # <- not valid
10  10025    Blake 2022-06-30 2022-09-12    Sales  BLAKE D.       102.0   2022-09-21 2022-06-21   Pending
The asof merge only finds the most recent PRIOR_DATE on or before each Start_Date; it does not check the upper bound (CREATED_DATE). Row 9 here, for example, starts after its CREATED_DATE, so its match needs to be "set to NaN" afterwards, e.g.:
df.loc[
   ~df['Start_Date'].between(df['PRIOR_DATE'], df['CREATED_DATE']), 
   'STATUS'
] = float('nan')
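To see the as-of merge and the range check working together in isolation, here is a minimal sketch on three rows taken from the tables above. It assumes the names have already been matched into a shared NAME column and that the dates parse with the "%m/%d/%Y" format; the variable names left, right, merged and in_range are only illustrative.

```python
import pandas as pd

# Three df2 rows, already carrying the matched NAME from df1.
left = pd.DataFrame({
    "NAME": ["GRACE", "ADAM", "BLAKE D."],
    "Start_Date": pd.to_datetime(
        ["3/5/2022", "4/5/2022", "6/30/2022"], format="%m/%d/%Y"
    ),
})

# The corresponding df1 rows.
right = pd.DataFrame({
    "NAME": ["ADAM", "GRACE", "BLAKE D."],
    "PRIOR_DATE": pd.to_datetime(
        ["10/24/2021", "1/12/2022", "6/21/2022"], format="%m/%d/%Y"
    ),
    "CREATED_DATE": pd.to_datetime(
        ["1/24/2022", "4/12/2022", "9/21/2022"], format="%m/%d/%Y"
    ),
    "STATUS": ["Approved", "Approved", "Pending"],
})

# Default direction is backward: for each Start_Date, take the last
# PRIOR_DATE (within the same NAME) that is on or before it.
merged = pd.merge_asof(
    left.sort_values("Start_Date"),
    right.sort_values("PRIOR_DATE"),
    by="NAME",
    left_on="Start_Date",
    right_on="PRIOR_DATE",
)

# Upper-bound check that merge_asof does not perform on its own.
in_range = merged["Start_Date"].between(
    merged["PRIOR_DATE"], merged["CREATED_DATE"]
)
# Keep STATUS only where the start date is inside the request window.
merged["STATUS"] = merged["STATUS"].where(in_range)
```

All three rows find a PRIOR_DATE to attach to, but ADAM's 4/5/2022 start falls after his CREATED_DATE of 1/24/2022, so only his STATUS is blanked out.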
Michael has no match in df1, so every score in his rows is 0 and argmax falls back to the first candidate, ADAM:
0         ADAM
1         ADAM
2         ADAM
3         ADAM
4         ADAM
5        GRACE
6        GRACE
7        GRACE
8     BLAKE D.
9     BLAKE D.
10    BLAKE D.
Name: NAME, dtype: object
This is why the > 0 filter is needed. After it, only the real matches remain:
2         ADAM
3         ADAM
4         ADAM
5        GRACE
6        GRACE
7        GRACE
8     BLAKE D.
9     BLAKE D.
10    BLAKE D.
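A score threshold of exactly 0 is quite permissive, and the names here also differ in case. As a rough illustration of the same idea with a stricter cutoff, the sketch below swaps rapidfuzz for the standard library's difflib.get_close_matches and lowercases names before comparing. This is a stand-in technique, not what the code above does; the helper match_name and the 0.6 cutoff are assumptions chosen for this example.

```python
from difflib import get_close_matches


def match_name(name, candidates, cutoff=0.6):
    """Return the candidate closest to `name`, ignoring case, or None."""
    # Map normalized spellings back to the original candidate strings.
    normalized = {c.casefold().strip(): c for c in candidates}
    hits = get_close_matches(
        name.casefold().strip(), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[hits[0]] if hits else None


df1_names = ["ADAM", "GRACE", "BLAKE D.", "FRANK"]
df2_names = ["Michael", "Adam", "Grace", "Blake"]

# Names with no sufficiently similar candidate map to None.
matches = {n: match_name(n, df1_names) for n in df2_names}
```

With these inputs, Michael is left unmatched rather than being assigned to the first candidate, while Blake still resolves to "BLAKE D.". The right cutoff depends on how noisy the real names are, so it is worth checking a sample of matches by hand.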