Dataframe merge on multiple conditions in date range

Question

I have two dataframes:

    import pandas as pd

    df = pd.DataFrame({
        'orderID': [10, 11, 12, 13, 14],
        'Sales': [100, 110, 120, 140, 150],
        'Name': ['John', 'Maria', 'Maria', 'John', 'Cesar'],
        'Date': ['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07'],
    })
    df2 = pd.DataFrame({
        'Negotiation': [100, 110, 121, 134, 141],
        'Sales': [100, 110, 120, 140, 150],
        'Name': ['John', 'Maria', 'Maria', 'John', 'Ricardo'],
        'Date': ['2022-01-01', '2022-01-20', '2022-01-30', '2022-02-01', '2022-09-01'],
    })

I need to merge them on 'Name' and date, but the dates are not identical, so the match has to work over a date range and yield a dataframe as follows:

    df_m = pd.DataFrame({
        'orderID': [10, 11, 12, 13, 14],
        'Sales': [100, 110, 120, 140, 150],
        'Name_x': ['John', 'Maria', 'Maria', 'John', 'Cesar'],
        'Date_x': ['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07'],
        'Negotiation': [100, 110, 121, 134, 141],
        'Name_y': ['John', 'Maria', 'Maria', 'John', 'Null'],
        'Date_y': ['2022-01-01', '2022-01-20', '2022-01-30', '2022-02-01', 'Null'],
    })

I need to avoid merging with the wrong dates, as in the following:

    df_m_wrong_date = pd.DataFrame({
        'orderID': [10, 11, 12, 13, 14],
        'Sales': [100, 110, 120, 140, 150],
        'Name_x': ['John', 'Maria', 'Maria', 'John', 'Cesar'],
        'Date_x': ['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07'],
        'Negotiation': [100, 110, 121, 134, 141],
        'Name_y': ['John', 'Maria', 'Maria', 'John', 'Null'],
        'Date_y': ['2022-02-01', '2022-01-30', '2022-01-20', '2022-01-01', 'Null'],
    })
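
To make the problem concrete, here is a minimal sketch (not part of the original question) of what an exact-key merge on 'Name' alone does with the sample data. Because the dates are ignored, each order is paired with every negotiation for the same name. The variable name `naive` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'orderID': [10, 11, 12, 13, 14],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Cesar'],
    'Date': ['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07'],
})
df2 = pd.DataFrame({
    'Negotiation': [100, 110, 121, 134, 141],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Ricardo'],
    'Date': ['2022-01-01', '2022-01-20', '2022-01-30', '2022-02-01', '2022-09-01'],
})

# Exact-key left merge on 'Name' only: the overlapping 'Sales' and 'Date'
# columns get the default _x/_y suffixes, and dates play no role in matching.
naive = df.merge(df2, on='Name', how='left')

print(naive[['orderID', 'Name', 'Date_x', 'Negotiation', 'Date_y']])
# John: 2 orders x 2 negotiations, Maria: 2 x 2, Cesar: 1 unmatched row
print(len(naive))
```

The row count grows from 5 to 9, which is why a date-aware join is needed rather than a plain merge.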

Answer 1

Score: 2

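You can use pd.merge_asof. With direction='backward' it pairs each order with the most recent negotiation dated on or before the order date, within each (Sales, Name) group. Both frames must be sorted on the date key first:

    # merge_asof needs real datetimes, not strings
    df['Date'] = pd.to_datetime(df['Date'])
    df2['Date'] = pd.to_datetime(df2['Date'])

    out = (pd.merge_asof(df.sort_values('Date'),
                         # rename so the negotiation date survives as its own column
                         df2.sort_values('Date').rename(columns={'Date': 'NegDate'}),
                         by=['Sales', 'Name'],
                         left_on='Date', right_on='NegDate', direction='backward')
             .sort_values('orderID'))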

Output:

    >>> out
       orderID  Sales   Name       Date  Negotiation    NegDate
    0       10    100   John 2022-01-08        100.0 2022-01-01
    2       11    110  Maria 2022-02-10        110.0 2022-01-20
    3       12    120  Maria 2022-02-15        121.0 2022-01-30
    1       13    140   John 2022-02-05        134.0 2022-02-01
    4       14    150  Cesar 2022-05-07          NaN        NaT
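
If the "date range" should also be bounded, merge_asof accepts a tolerance argument. The sketch below is an illustration only: the 14-day window is an assumption that does not come from the question, and the name `out_tol` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'orderID': [10, 11, 12, 13, 14],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Cesar'],
    'Date': ['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07'],
})
df2 = pd.DataFrame({
    'Negotiation': [100, 110, 121, 134, 141],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Ricardo'],
    'Date': ['2022-01-01', '2022-01-20', '2022-01-30', '2022-02-01', '2022-09-01'],
})

df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

# Assumption (not from the question): a negotiation only counts if it took
# place at most 14 days before the order.
out_tol = (
    pd.merge_asof(
        df.sort_values('Date'),
        df2.sort_values('Date').rename(columns={'Date': 'NegDate'}),
        by=['Sales', 'Name'],
        left_on='Date', right_on='NegDate',
        direction='backward',
        tolerance=pd.Timedelta(days=14),
    )
    .sort_values('orderID')
)

print(out_tol)
```

With this window, orders 10 and 13 keep their match (7 and 4 days apart), while orders 11 and 12 fall outside it (21 and 16 days) and come back with NaN/NaT, like the unmatched Cesar row.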

Answer 2

Score: 0

You can concatenate the two dataframes like this. drop_duplicates then removes rows whose Name and Date are the same. I don't think merging them is logical. If you want to drop the wrong dates, you can do that with pd.to_datetime.

    data = [df, df2]
    df = pd.concat(data)  # stack the rows of both frames
    print(df.drop_duplicates(subset=['Name', 'Date']))
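
For comparison, here is a small self-contained sketch of what concatenation followed by drop_duplicates produces on the sample data from the question. The names `stacked` and `deduped` are illustrative, and `ignore_index=True` is added here only to get a clean index.

```python
import pandas as pd

df = pd.DataFrame({
    'orderID': [10, 11, 12, 13, 14],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Cesar'],
    'Date': ['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07'],
})
df2 = pd.DataFrame({
    'Negotiation': [100, 110, 121, 134, 141],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Ricardo'],
    'Date': ['2022-01-01', '2022-01-20', '2022-01-30', '2022-02-01', '2022-09-01'],
})

# concat stacks rows; columns missing from one frame are filled with NaN
stacked = pd.concat([df, df2], ignore_index=True)

# No (Name, Date) pair repeats across the two frames, so nothing is dropped
deduped = stacked.drop_duplicates(subset=['Name', 'Date'])

print(deduped)
print(deduped.shape)
```

On this data the result is ten stacked rows rather than five paired ones: the order rows have NaN in Negotiation and the negotiation rows have NaN in orderID, so the orders are not lined up with their negotiations.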

huangapple
  • Posted on March 9, 2023 at 21:15:12
  • Please keep the link to this article when reposting: https://go.coder-hub.com/75685131.html