英文:
How to compare dates from two different datasets in python
问题
我有2个数据集,需要从数据集1中获取一个日期列,以与数据集2中的另一个日期列进行比较。
df1:
REQUEST_ID NAME CREATED_DATE PRIOR_DATE STATUS
100 ADAM 1/24/2022 10/24/2021 Approved
101 GRACE 4/12/2022 1/12/2022 Approved
102 BLAKE D. 9/21/2022 6/21/2022 Pending
103 FRANK 5/18/2022 2/18/2022 Approved
df2:
ID Name Start_Date End_Date Team
10000 Michael 11/23/2021 1/23/2022 Sales
10000 Michael 1/23/2022 5/2/2022 Sales
10001 Adam 9/24/2021 12/22/2021 Tech
10001 Adam 12/22/2021 4/5/2022 HR
10001 Adam 4/5/2022 9/21/2022 HR
10002 Grace 7/24/2021 12/31/2021 Finance
10002 Grace 12/31/2021 3/5/2022 Finance
10002 Grace 3/5/2022 9/23/2022 Tech
...
...
10025 Blake 11/22/2021 3/12/2022 Sales
10025 Blake 3/12/2022 6/30/2022 Sales
10025 Blake 6/30/2022 9/12/2022 Sales
df2按数字顺序继续下去,直到Blake,所以df1中的名称在df2中。我需要找出df2中具有在CREATED_DATE
和PRIOR_DATE
范围内的Start_Date
的ID
,以及在那个时间段内的Team
。唯一的问题是,不是所有的名称都匹配或格式相同,因此无法正确合并。我以前从未做过这样的事情,所以不知道该如何继续。下面是期望的输出:
期望的输出:
ID Name Start_Date End_Date Status Team
10000 Michael 11/23/2021 1/23/2022 Approved Sales
10000 Michael 1/23/2022 5/2/2022 Approved Sales
10001 Adam 9/24/2022 12/22/2021 Approved HR
10001 Adam 12/22/2021 4/5/2022 Approved Finance
10002 Grace 3/5/2022 9/23/2022 Approved Tech
...
...
10025 Blake 6/30/2022 9/12/2022 Pending Sales
如果有人知道如何做到这一点,真的需要帮助。谢谢!
英文:
I have 2 datasets and I need to somehow get a date column from dataset 1 to be compared to another date column in dataset 2.
df1
REQUEST_ID NAME CREATED_DATE PRIOR_DATE STATUS
100 ADAM 1/24/2022 10/24/2021 Approved
101 GRACE 4/12/2022 1/12/2022 Approved
102 BLAKE D. 9/21/2022 6/21/2022 Pending
103 FRANK 5/18/2022 2/18/2022 Approved
df2
ID Name Start_Date End_Date Team
10000 Michael 11/23/2021 1/23/2022 Sales
10000 Michael 1/23/2022 5/2/2022 Sales
10001 Adam 9/24/2021 12/22/2021 Tech
10001 Adam 12/22/2021 4/5/2022 HR
10001 Adam 4/5/2022 9/21/2022 HR
10002 Grace 7/24/2021 12/31/2021 Finance
10002 Grace 12/31/2021 3/5/2022 Finance
10002 Grace 3/5/2022 9/23/2022 Tech
.
.
.
10025 Blake 11/22/2021 3/12/2022 Sales
10025 Blake 3/12/2022 6/30/2022 Sales
10025 Blake 6/30/2022 9/12/2022 Sales
df2 continues down in numeric order until Blake, so the names above in df1 are in df2. I need to find what ID
from df2 has a Start_Date
that falls in range of the CREATED_DATE
and PRIOR_DATE
and what team do they align with at time. Only issue is that not all names match or have same formatting so merging can't be done correctly. I have never done anything like this before so I am kind of lost on how to proceed. Below is a desired look.
Desired Output
ID Name Start_Date End_Date Status Team
10000 Michael 11/23/2021 1/23/2022 Approved Sales
10000 Michael 1/23/2022 5/2/2022 Approved Sales
10001 Adam 9/24/2022 12/22/2021 Approved HR
10001 Adam 12/22/2021 4/5/2022 Approved Finance
10002 Grace 3/5/2022 9/23/2022 Approved Tech
.
.
.
10025 Blake 6/30/2022 9/12/2022 Pending Sales
If anyone knows a way to do it could really use the help. Thank you
答案1
得分: 1
以下是您要翻译的内容:
"It looks like a possible combination of a 'fuzzy merge' and a 'nearest date merge'."
有许多帖子已经讨论了这些主题,但也许这个具体的例子有用:
- [tag:rapidfuzz] 的
cdist()
用于快速模糊匹配 pandas.merge_asof
用于日期合并
import rapidfuzz
scores = rapidfuzz.process.cdist(
df2['Name'], df1['NAME'], workers=-1, scorer=rapidfuzz.distance.JaroWinkler.similarity
)
# pick closest match (max score)
NAME = df1['NAME'].loc[scores.argmax(axis=1)].set_axis(df2.index)
# discard non-matches
NAME = NAME[scores.max(axis=1) > 0]
# merge_asof requires sorted keys
pd.merge_asof(
df2.assign(NAME=NAME).sort_values('Start_Date'),
df1.sort_values('PRIOR_DATE'),
by='NAME',
left_on='Start_Date',
right_on='PRIOR_DATE',
)
ID Name Start_Date End_Date Team NAME REQUEST_ID CREATED_DATE PRIOR_DATE STATUS
0 10002 Grace 2021-07-24 2021-12-31 Finance GRACE NaN NaT NaT NaN
1 10001 Adam 2021-09-24 2021-12-22 Tech ADAM NaN NaT NaT NaN
2 10025 Blake 2021-11-22 2022-03-12 Sales BLAKE D. NaN NaT NaT NaN
3 10000 Michael 2021-11-23 2022-01-23 Sales NaN NaN NaT NaT NaN
4 10001 Adam 2021-12-22 2022-04-05 HR ADAM 100.0 2022-01-24 2021-10-24 Approved # <- not valid
5 10002 Grace 2021-12-31 2022-03-05 Finance GRACE NaN NaT NaT NaN
6 10000 Michael 2022-01-23 2022-05-02 Sales NaN NaN NaT NaT NaN
7 10002 Grace 2022-03-05 2022-09-23 Tech GRACE 101.0 2022-04-12 2022-01-12 Approved
8 10025 Blake 2022-03-12 2022-06-30 Sales BLAKE D. NaN NaT NaT NaN
9 10001 Adam 2022-04-05 2022-09-21 HR ADAM 100.0 2022-01-24 2021-10-24 Approved
10 10025 Blake 2022-06-30 2022-09-12 Sales BLAKE D. 102.0 2022-09-21 2022-06-21 Pending
The asof merge doesn't apply the range check, e.g. row 4
here needs to be "set to NaN".
e.g.
df.loc[
~df['Start_Date'].between(df['PRIOR_DATE'], df['CREATED_DATE']),
'STATUS'
] = float('nan')
As Michael
has no match in df1
it ends up being matched to ADAM
with a score of 0
.
0 ADAM
1 ADAM
2 ADAM
3 ADAM
4 ADAM
5 GRACE
6 GRACE
7 GRACE
8 BLAKE D.
9 BLAKE D.
10 BLAKE D.
Name: NAME, dtype: object
Which is why we have the > 0
filter:
2 ADAM
3 ADAM
4 ADAM
5 GRACE
6 GRACE
7 GRACE
8 BLAKE D.
9 BLAKE D.
10 BLAKE D.
英文:
It looks like a possible combination of a "fuzzy merge" and a "nearest date merge".
There are many posts already on those topics but perhaps this specific example is useful:
- [tag:rapidfuzz]'s
.cdist()
for fast fuzzy matching pandas.merge_asof
for the date merge
import rapidfuzz
scores = rapidfuzz.process.cdist(
df2['Name'], df1['NAME'], workers=-1, scorer=rapidfuzz.distance.JaroWinkler.similarity
)
# pick closest match (max score)
NAME = df1['NAME'].loc[scores.argmax(axis=1)].set_axis(df2.index)
# discard non-matches
NAME = NAME[scores.max(axis=1) > 0]
# merge_asof requires sorted keys
pd.merge_asof(
df2.assign(NAME=NAME).sort_values('Start_Date'),
df1.sort_values('PRIOR_DATE'),
by='NAME',
left_on='Start_Date',
right_on='PRIOR_DATE',
)
ID Name Start_Date End_Date Team NAME REQUEST_ID CREATED_DATE PRIOR_DATE STATUS
0 10002 Grace 2021-07-24 2021-12-31 Finance GRACE NaN NaT NaT NaN
1 10001 Adam 2021-09-24 2021-12-22 Tech ADAM NaN NaT NaT NaN
2 10025 Blake 2021-11-22 2022-03-12 Sales BLAKE D. NaN NaT NaT NaN
3 10000 Michael 2021-11-23 2022-01-23 Sales NaN NaN NaT NaT NaN
4 10001 Adam 2021-12-22 2022-04-05 HR ADAM 100.0 2022-01-24 2021-10-24 Approved # <- not valid
5 10002 Grace 2021-12-31 2022-03-05 Finance GRACE NaN NaT NaT NaN
6 10000 Michael 2022-01-23 2022-05-02 Sales NaN NaN NaT NaT NaN
7 10002 Grace 2022-03-05 2022-09-23 Tech GRACE 101.0 2022-04-12 2022-01-12 Approved
8 10025 Blake 2022-03-12 2022-06-30 Sales BLAKE D. NaN NaT NaT NaN
9 10001 Adam 2022-04-05 2022-09-21 HR ADAM 100.0 2022-01-24 2021-10-24 Approved
10 10025 Blake 2022-06-30 2022-09-12 Sales BLAKE D. 102.0 2022-09-21 2022-06-21 Pending
The asof merge doesn't apply the range check, e.g. row 4
here needs to be "set to NaN".
e.g.
df.loc[
~df['Start_Date'].between(df['PRIOR_DATE'], df['CREATED_DATE']),
'STATUS'
] = float('nan')
As Michael
has no match in df1
it ends up being matched to ADAM
with a score of 0
.
0 ADAM
1 ADAM
2 ADAM
3 ADAM
4 ADAM
5 GRACE
6 GRACE
7 GRACE
8 BLAKE D.
9 BLAKE D.
10 BLAKE D.
Name: NAME, dtype: object
Which is why we have the > 0
filter:
2 ADAM
3 ADAM
4 ADAM
5 GRACE
6 GRACE
7 GRACE
8 BLAKE D.
9 BLAKE D.
10 BLAKE D.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论