Dataframe merge on multiple conditions in date range

Question

I have two dataframes:

df = pd.DataFrame({'orderID': [10, 11, 12, 13, 14], 'Sales': [100, 110, 120, 140, 150], 'Name': ['John', "Maria", "Maria", "John", "Cesar"],
                   'Date':['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07']})

df2 = pd.DataFrame({'Negotiation': [100, 110, 121, 134, 141], 'Sales': [100, 110, 120, 140, 150], 'Name': ['John', "Maria", "Maria", "John", "Ricardo"],
                   'Date':['2022-01-01', '2022-01-20', '2022-01-30', '2022-02-01', '2022-09-01']})

I need to merge them on 'Name' and date, but the dates aren't identical, so the match has to be on a date range, yielding a dataframe as follows:

df_m = pd.DataFrame({'orderID': [10, 11, 12, 13, 14], 'Sales': [100, 110, 120, 140, 150], 'Name_x': ['John', "Maria", "Maria", "John", "Cesar"],
                   'Date_X':['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07'], 'Negotiation': [100, 110, 121, 134, 141], 'Sales': [100, 110, 120, 140, 150], 'Name_y': ['John', "Maria", "Maria", 'John',  "Null"], 'Date_y':['2022-01-01', '2022-01-20', '2022-01-30', '2022-02-01', 'Null']})

I need to avoid merging with the wrong dates, as in the following:

df_m_wrong_date = pd.DataFrame({'orderID': [10, 11, 12, 13, 14], 'Sales': [100, 110, 120, 140, 150], 'Name_x': ['John', "Maria", "Maria", "John", "Cesar"],
                   'Date_X':['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07'], 'Negotiation': [100, 110, 121, 134, 141], 'Sales': [100, 110, 120, 140, 150], 'Name_y': ['John', "Maria", "Maria", 'John',  "Null"], 'Date_y':['2022-02-01', '2022-01-30', '2022-01-20', '2022-01-01', 'Null']})

Answer 1

Score: 2

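You can use pd.merge_asof. With direction='backward', each row of df is paired with the latest row of df2 that has the same 'Sales' and 'Name' and a date on or before the order date: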

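# Convert the date strings to datetime so they can be compared and sorted
df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

# merge_asof requires both frames to be sorted by the key they are matched on
out = (pd.merge_asof(df.sort_values('Date'),
                    df2.sort_values('Date').rename(columns={'Date': 'NegDate'}),
                    by=['Sales', 'Name'],  # columns that must match exactly
                    left_on='Date', right_on='NegDate', direction='backward')
         .sort_values('orderID'))  # restore the original row order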

Output:

>>> out
   orderID  Sales   Name       Date  Negotiation    NegDate
0       10    100   John 2022-01-08        100.0 2022-01-01
2       11    110  Maria 2022-02-10        110.0 2022-01-20
3       12    120  Maria 2022-02-15        121.0 2022-01-30
1       13    140   John 2022-02-05        134.0 2022-02-01
4       14    150  Cesar 2022-05-07          NaN        NaT
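
If the "date range" should also have an upper bound on how old a negotiation may be, merge_asof accepts a tolerance argument. The sketch below is self-contained and reuses the sample data from the question; the 20-day window and the names window and out_window are my assumptions for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'orderID': [10, 11, 12, 13, 14],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Cesar'],
    'Date': pd.to_datetime(['2022-01-08', '2022-02-10', '2022-02-15',
                            '2022-02-05', '2022-05-07']),
})
df2 = pd.DataFrame({
    'Negotiation': [100, 110, 121, 134, 141],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Ricardo'],
    'Date': pd.to_datetime(['2022-01-01', '2022-01-20', '2022-01-30',
                            '2022-02-01', '2022-09-01']),
})

# Assumption: a negotiation only counts if it happened
# at most 20 days before the order.
window = pd.Timedelta(days=20)

out_window = (pd.merge_asof(df.sort_values('Date'),
                            df2.sort_values('Date').rename(columns={'Date': 'NegDate'}),
                            by=['Sales', 'Name'],
                            left_on='Date', right_on='NegDate',
                            direction='backward', tolerance=window)
                .sort_values('orderID'))
print(out_window)
```

With this window, order 11 (2022-02-10) no longer matches the negotiation dated 2022-01-20, because the two are 21 days apart; its Negotiation and NegDate come back as NaN and NaT. The other rows match as before.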

Answer 2

Score: 0

You can concatenate them like this. drop_duplicates then removes rows whose name and date are both the same, keeping the first occurrence. I don't think merging is the logical approach here. If you want to drop rows with wrong dates, you can do that with pd.to_datetime.

data = [df, df2]
df = pd.concat(data)
print(df.drop_duplicates(subset=['Name', 'Date']))
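
To see what the concatenation produces on the sample data, here is a small self-contained sketch. The names combined and deduped are mine, and ignore_index=True is an addition to keep the row labels unique; neither is in the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'orderID': [10, 11, 12, 13, 14],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Cesar'],
    'Date': ['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07'],
})
df2 = pd.DataFrame({
    'Negotiation': [100, 110, 121, 134, 141],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Ricardo'],
    'Date': ['2022-01-01', '2022-01-20', '2022-01-30', '2022-02-01', '2022-09-01'],
})

# Stack the two frames vertically; columns missing from one frame are filled with NaN.
combined = pd.concat([df, df2], ignore_index=True)

# Drop rows whose (Name, Date) pair already appeared earlier in the stacked frame.
deduped = combined.drop_duplicates(subset=['Name', 'Date'])

print(deduped.shape)
print(deduped)
```

On this data no (Name, Date) pair repeats, so all ten rows remain, stacked one under the other. Each row carries either an orderID or a Negotiation, not both, so the orders and negotiations are not paired side by side as in the layout the question asks for.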

huangapple
  • Posted on March 9, 2023 at 21:15:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75685131.html