如何在Python中比较来自两个不同数据集的日期

huangapple go评论104阅读模式
英文:

How to compare dates from two different datasets in python

问题

我有2个数据集,需要从数据集1中获取一个日期列,以与数据集2中的另一个日期列进行比较。

df1:

  1. REQUEST_ID NAME CREATED_DATE PRIOR_DATE STATUS
  2. 100 ADAM 1/24/2022 10/24/2021 Approved
  3. 101 GRACE 4/12/2022 1/12/2022 Approved
  4. 102 BLAKE D. 9/21/2022 6/21/2022 Pending
  5. 103 FRANK 5/18/2022 2/18/2022 Approved

df2:

  1. ID Name Start_Date End_Date Team
  2. 10000 Michael 11/23/2021 1/23/2022 Sales
  3. 10000 Michael 1/23/2022 5/2/2022 Sales
  4. 10001 Adam 9/24/2021 12/22/2021 Tech
  5. 10001 Adam 12/22/2021 4/5/2022 HR
  6. 10001 Adam 4/5/2022 9/21/2022 HR
  7. 10002 Grace 7/24/2021 12/31/2021 Finance
  8. 10002 Grace 12/31/2021 3/5/2022 Finance
  9. 10002 Grace 3/5/2022 9/23/2022 Tech
  10. ...
  11. ...
  12. 10025 Blake 11/22/2021 3/12/2022 Sales
  13. 10025 Blake 3/12/2022 6/30/2022 Sales
  14. 10025 Blake 6/30/2022 9/12/2022 Sales

df2按数字顺序继续下去,直到Blake,所以df1中的名称在df2中。我需要找出df2中具有在CREATED_DATEPRIOR_DATE范围内的Start_DateID,以及在那个时间段内的Team。唯一的问题是,不是所有的名称都匹配或格式相同,因此无法正确合并。我以前从未做过这样的事情,所以不知道该如何继续。下面是期望的输出:

期望的输出:

  1. ID Name Start_Date End_Date Status Team
  2. 10000 Michael 11/23/2021 1/23/2022 Approved Sales
  3. 10000 Michael 1/23/2022 5/2/2022 Approved Sales
  4. 10001 Adam 9/24/2022 12/22/2021 Approved HR
  5. 10001 Adam 12/22/2021 4/5/2022 Approved Finance
  6. 10002 Grace 3/5/2022 9/23/2022 Approved Tech
  7. ...
  8. ...
  9. 10025 Blake 6/30/2022 9/12/2022 Pending Sales

如果有人知道如何做到这一点,真的需要帮助。谢谢!

英文:

I have 2 datasets and I need to somehow get a date column from dataset 1 to be compared to another date column in dataset 2.

df1

  1. REQUEST_ID NAME CREATED_DATE PRIOR_DATE STATUS
  2. 100 ADAM 1/24/2022 10/24/2021 Approved
  3. 101 GRACE 4/12/2022 1/12/2022 Approved
  4. 102 BLAKE D. 9/21/2022 6/21/2022 Pending
  5. 103 FRANK 5/18/2022 2/18/2022 Approved

df2

  1. ID Name Start_Date End_Date Team
  2. 10000 Michael 11/23/2021 1/23/2022 Sales
  3. 10000 Michael 1/23/2022 5/2/2022 Sales
  4. 10001 Adam 9/24/2021 12/22/2021 Tech
  5. 10001 Adam 12/22/2021 4/5/2022 HR
  6. 10001 Adam 4/5/2022 9/21/2022 HR
  7. 10002 Grace 7/24/2021 12/31/2021 Finance
  8. 10002 Grace 12/31/2021 3/5/2022 Finance
  9. 10002 Grace 3/5/2022 9/23/2022 Tech
  10. .
  11. .
  12. .
  13. 10025 Blake 11/22/2021 3/12/2022 Sales
  14. 10025 Blake 3/12/2022 6/30/2022 Sales
  15. 10025 Blake 6/30/2022 9/12/2022 Sales

df2 continues down in numeric order until Blake, so the names above in df1 are in df2. I need to find what ID from df2 has a Start_Date that falls in range of the CREATED_DATE and PRIOR_DATE and what team do they align with at time. Only issue is that not all names match or have same formatting so merging can't be done correctly. I have never done anything like this before so I am kind of lost on how to proceed. Below is a desired look.

Desired Output

  1. ID Name Start_Date End_Date Status Team
  2. 10000 Michael 11/23/2021 1/23/2022 Approved Sales
  3. 10000 Michael 1/23/2022 5/2/2022 Approved Sales
  4. 10001 Adam 9/24/2022 12/22/2021 Approved HR
  5. 10001 Adam 12/22/2021 4/5/2022 Approved Finance
  6. 10002 Grace 3/5/2022 9/23/2022 Approved Tech
  7. .
  8. .
  9. .
  10. 10025 Blake 6/30/2022 9/12/2022 Pending Sales

If anyone knows a way to do it could really use the help. Thank you

答案1

得分: 1

以下是您要翻译的内容:

"It looks like a possible combination of a 'fuzzy merge' and a 'nearest date merge'."

有许多帖子已经讨论了这些主题,但也许这个具体的例子有用:

  1. import rapidfuzz
  2. scores = rapidfuzz.process.cdist(
  3. df2['Name'], df1['NAME'], workers=-1, scorer=rapidfuzz.distance.JaroWinkler.similarity
  4. )
  5. # pick closest match (max score)
  6. NAME = df1['NAME'].loc[scores.argmax(axis=1)].set_axis(df2.index)
  7. # discard non-matches
  8. NAME = NAME[scores.max(axis=1) > 0]
  9. # merge_asof requires sorted keys
  10. pd.merge_asof(
  11. df2.assign(NAME=NAME).sort_values('Start_Date'),
  12. df1.sort_values('PRIOR_DATE'),
  13. by='NAME',
  14. left_on='Start_Date',
  15. right_on='PRIOR_DATE',
  16. )
  1. ID Name Start_Date End_Date Team NAME REQUEST_ID CREATED_DATE PRIOR_DATE STATUS
  2. 0 10002 Grace 2021-07-24 2021-12-31 Finance GRACE NaN NaT NaT NaN
  3. 1 10001 Adam 2021-09-24 2021-12-22 Tech ADAM NaN NaT NaT NaN
  4. 2 10025 Blake 2021-11-22 2022-03-12 Sales BLAKE D. NaN NaT NaT NaN
  5. 3 10000 Michael 2021-11-23 2022-01-23 Sales NaN NaN NaT NaT NaN
  6. 4 10001 Adam 2021-12-22 2022-04-05 HR ADAM 100.0 2022-01-24 2021-10-24 Approved # <- not valid
  7. 5 10002 Grace 2021-12-31 2022-03-05 Finance GRACE NaN NaT NaT NaN
  8. 6 10000 Michael 2022-01-23 2022-05-02 Sales NaN NaN NaT NaT NaN
  9. 7 10002 Grace 2022-03-05 2022-09-23 Tech GRACE 101.0 2022-04-12 2022-01-12 Approved
  10. 8 10025 Blake 2022-03-12 2022-06-30 Sales BLAKE D. NaN NaT NaT NaN
  11. 9 10001 Adam 2022-04-05 2022-09-21 HR ADAM 100.0 2022-01-24 2021-10-24 Approved
  12. 10 10025 Blake 2022-06-30 2022-09-12 Sales BLAKE D. 102.0 2022-09-21 2022-06-21 Pending

The asof merge doesn't apply the range check, e.g. row 4 here needs to be "set to NaN".

e.g.

  1. df.loc[
  2. ~df['Start_Date'].between(df['PRIOR_DATE'], df['CREATED_DATE']),
  3. 'STATUS'
  4. ] = float('nan')

As Michael has no match in df1 it ends up being matched to ADAM with a score of 0.

  1. 0 ADAM
  2. 1 ADAM
  3. 2 ADAM
  4. 3 ADAM
  5. 4 ADAM
  6. 5 GRACE
  7. 6 GRACE
  8. 7 GRACE
  9. 8 BLAKE D.
  10. 9 BLAKE D.
  11. 10 BLAKE D.
  12. Name: NAME, dtype: object

Which is why we have the > 0 filter:

  1. 2 ADAM
  2. 3 ADAM
  3. 4 ADAM
  4. 5 GRACE
  5. 6 GRACE
  6. 7 GRACE
  7. 8 BLAKE D.
  8. 9 BLAKE D.
  9. 10 BLAKE D.
英文:

It looks like a possible combination of a "fuzzy merge" and a "nearest date merge".

There are many posts already on those topics but perhaps this specific example is useful:

  1. import rapidfuzz
  2. scores = rapidfuzz.process.cdist(
  3. df2[&#39;Name&#39;], df1[&#39;NAME&#39;], workers=-1, scorer=rapidfuzz.distance.JaroWinkler.similarity
  4. )
  5. # pick closest match (max score)
  6. NAME = df1[&#39;NAME&#39;].loc[scores.argmax(axis=1)].set_axis(df2.index)
  7. # discard non-matches
  8. NAME = NAME[scores.max(axis=1) &gt; 0]
  9. # merge_asof requires sorted keys
  10. pd.merge_asof(
  11. df2.assign(NAME=NAME).sort_values(&#39;Start_Date&#39;),
  12. df1.sort_values(&#39;PRIOR_DATE&#39;),
  13. by=&#39;NAME&#39;,
  14. left_on=&#39;Start_Date&#39;,
  15. right_on=&#39;PRIOR_DATE&#39;,
  16. )
  1. ID Name Start_Date End_Date Team NAME REQUEST_ID CREATED_DATE PRIOR_DATE STATUS
  2. 0 10002 Grace 2021-07-24 2021-12-31 Finance GRACE NaN NaT NaT NaN
  3. 1 10001 Adam 2021-09-24 2021-12-22 Tech ADAM NaN NaT NaT NaN
  4. 2 10025 Blake 2021-11-22 2022-03-12 Sales BLAKE D. NaN NaT NaT NaN
  5. 3 10000 Michael 2021-11-23 2022-01-23 Sales NaN NaN NaT NaT NaN
  6. 4 10001 Adam 2021-12-22 2022-04-05 HR ADAM 100.0 2022-01-24 2021-10-24 Approved # &lt;- not valid
  7. 5 10002 Grace 2021-12-31 2022-03-05 Finance GRACE NaN NaT NaT NaN
  8. 6 10000 Michael 2022-01-23 2022-05-02 Sales NaN NaN NaT NaT NaN
  9. 7 10002 Grace 2022-03-05 2022-09-23 Tech GRACE 101.0 2022-04-12 2022-01-12 Approved
  10. 8 10025 Blake 2022-03-12 2022-06-30 Sales BLAKE D. NaN NaT NaT NaN
  11. 9 10001 Adam 2022-04-05 2022-09-21 HR ADAM 100.0 2022-01-24 2021-10-24 Approved
  12. 10 10025 Blake 2022-06-30 2022-09-12 Sales BLAKE D. 102.0 2022-09-21 2022-06-21 Pending

The asof merge doesn't apply the range check, e.g. row 4 here needs to be "set to NaN".

e.g.

  1. df.loc[
  2. ~df[&#39;Start_Date&#39;].between(df[&#39;PRIOR_DATE&#39;], df[&#39;CREATED_DATE&#39;]),
  3. &#39;STATUS&#39;
  4. ] = float(&#39;nan&#39;)

As Michael has no match in df1 it ends up being matched to ADAM with a score of 0.

  1. 0 ADAM
  2. 1 ADAM
  3. 2 ADAM
  4. 3 ADAM
  5. 4 ADAM
  6. 5 GRACE
  7. 6 GRACE
  8. 7 GRACE
  9. 8 BLAKE D.
  10. 9 BLAKE D.
  11. 10 BLAKE D.
  12. Name: NAME, dtype: object

Which is why we have the &gt; 0 filter:

  1. 2 ADAM
  2. 3 ADAM
  3. 4 ADAM
  4. 5 GRACE
  5. 6 GRACE
  6. 7 GRACE
  7. 8 BLAKE D.
  8. 9 BLAKE D.
  9. 10 BLAKE D.

huangapple
  • 本文由 发表于 2023年6月12日 08:59:18
  • 转载请务必保留本文链接:https://go.coder-hub.com/76453107.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定