2023-03-09 21:15:12

Dataframe merge on multiple conditions in date range

Question

I have two dataframes:

df = pd.DataFrame({'orderID': [10, 11, 12, 13, 14], 'Sales': [100, 110, 120, 140, 150], 'Name': ['John', "Maria", "Maria", "John", "Cesar"],
                   'Date':['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07']})
df2 = pd.DataFrame({'Negotiation': [100, 110, 121, 134, 141], 'Sales': [100, 110, 120, 140, 150], 'Name': ['John', "Maria", "Maria", "John", "Ricardo"],
                   'Date':['2022-01-01', '2022-01-20', '2022-01-30', '2022-02-01', '2022-09-01']})

I need to merge them on 'Name' and date, but the dates are not identical, so the match has to work over a date range and yield a dataframe as follows:

df_m = pd.DataFrame({'orderID': [10, 11, 12, 13, 14], 'Sales': [100, 110, 120, 140, 150], 'Name_x': ['John', "Maria", "Maria", "John", "Cesar"],
                   'Date_X':['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07'], 'Negotiation': [100, 110, 121, 134, 141], 'Sales': [100, 110, 120, 140, 150], 'Name_y': ['John', "Maria", "Maria", 'John',  "Null"], 'Date_y':['2022-01-01', '2022-01-20', '2022-01-30', '2022-02-01', 'Null']})

I need to avoid merging with the wrong dates, as in this example:

df_m_wrong_date = pd.DataFrame({'orderID': [10, 11, 12, 13, 14], 'Sales': [100, 110, 120, 140, 150], 'Name_x': ['John', "Maria", "Maria", "John", "Cesar"],
                   'Date_X':['2022-01-08', '2022-02-10', '2022-02-15', '2022-02-05', '2022-05-07'], 'Negotiation': [100, 110, 121, 134, 141], 'Sales': [100, 110, 120, 140, 150], 'Name_y': ['John', "Maria", "Maria", 'John',  "Null"], 'Date_y':['2022-02-01', '2022-01-30', '2022-01-20', '2022-01-01', 'Null']})
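To illustrate the problem described above, here is a minimal sketch of what a plain merge on 'Name' alone might produce. The names `orders`, `negotiations`, `naive` and `late` are made up for this illustration, and the 'Sales' column is left out to keep it short. Every order is paired with every negotiation of the same person, including one that happened after the order.

```python
import pandas as pd

# Hypothetical, trimmed-down copies of the two dataframes (no 'Sales' column).
orders = pd.DataFrame({
    'orderID': [10, 11, 12, 13, 14],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Cesar'],
    'Date': pd.to_datetime(['2022-01-08', '2022-02-10', '2022-02-15',
                            '2022-02-05', '2022-05-07']),
})
negotiations = pd.DataFrame({
    'Negotiation': [100, 110, 121, 134, 141],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Ricardo'],
    'Date': pd.to_datetime(['2022-01-01', '2022-01-20', '2022-01-30',
                            '2022-02-01', '2022-09-01']),
})

# A plain left merge on 'Name' pairs each order with every negotiation
# of the same person, regardless of the dates.
naive = orders.merge(negotiations, on='Name', how='left')

# Rows where the negotiation is dated after the order: the "wrong date" pairs.
late = naive[naive['Date_y'] > naive['Date_x']]
print(len(naive))
print(late[['orderID', 'Negotiation', 'Date_x', 'Date_y']])
```

John and Maria each have two orders and two negotiations, so the merge multiplies their rows, which is why a date-aware join is needed.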

Answer 1

Score: 2

You can use merge_asof:

# merge_asof needs real datetimes and both frames sorted on the date key
df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
# For each order, take the latest negotiation dated on or before the order
# date that has the same Sales and Name
out = (pd.merge_asof(df.sort_values('Date'),
                    df2.sort_values('Date').rename(columns={'Date': 'NegDate'}),
                    by=['Sales', 'Name'],
                    left_on='Date', right_on='NegDate', direction='backward')
         .sort_values('orderID'))

Output:

>>> out
   orderID  Sales   Name       Date  Negotiation    NegDate
0       10    100   John 2022-01-08        100.0 2022-01-01
2       11    110  Maria 2022-02-10        110.0 2022-01-20
3       12    120  Maria 2022-02-15        121.0 2022-01-30
1       13    140   John 2022-02-05        134.0 2022-02-01
4       14    150  Cesar 2022-05-07          NaN        NaT
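If the date range should also have an upper bound, merge_asof accepts a tolerance argument. The sketch below is an extension of the answer, not part of it: the helper `merge_within_window`, the frame names `orders` and `negotiations`, and the 30-day and 10-day windows are all assumptions chosen for illustration.

```python
import pandas as pd

# Same data as in the question, rebuilt here so the example runs on its own.
orders = pd.DataFrame({
    'orderID': [10, 11, 12, 13, 14],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Cesar'],
    'Date': pd.to_datetime(['2022-01-08', '2022-02-10', '2022-02-15',
                            '2022-02-05', '2022-05-07']),
})
negotiations = pd.DataFrame({
    'Negotiation': [100, 110, 121, 134, 141],
    'Sales': [100, 110, 120, 140, 150],
    'Name': ['John', 'Maria', 'Maria', 'John', 'Ricardo'],
    'NegDate': pd.to_datetime(['2022-01-01', '2022-01-20', '2022-01-30',
                               '2022-02-01', '2022-09-01']),
})


def merge_within_window(left, right, days):
    """Hypothetical helper: match each order to the latest negotiation
    that is at most `days` days older than the order."""
    merged = pd.merge_asof(
        left.sort_values('Date'),
        right.sort_values('NegDate'),
        by=['Sales', 'Name'],
        left_on='Date', right_on='NegDate',
        direction='backward',
        tolerance=pd.Timedelta(days=days),  # upper bound of the date range
    )
    return merged.sort_values('orderID').reset_index(drop=True)


out_30 = merge_within_window(orders, negotiations, days=30)
out_10 = merge_within_window(orders, negotiations, days=10)
print(out_30)
print(out_10)
```

With a 30-day window the matches are the same as in the output above. With a 10-day window, orders 11 and 12 lose their match because their negotiations are 21 and 16 days older than the order.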

Answer 2

Score: 0

You can concatenate the two dataframes like this; rows that share the same name and date are dropped. I don't think merging them makes sense here. If you want to drop the wrong dates, you can do that with pd.to_datetime.

data = [df, df2]
# Stack both frames vertically (note that this overwrites df)
df = pd.concat(data)
# Keep only the first row for each ('Name', 'Date') pair
print(df.drop_duplicates(subset=['Name', 'Date']))
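The answer does not show how pd.to_datetime would drop wrong dates. One possible reading, sketched here under that assumption, is to coerce unparseable values to NaT and then drop them. The frame `frame` and its 'Null' placeholder are made up for this example.

```python
import pandas as pd

# Hypothetical frame with one date that cannot be parsed.
frame = pd.DataFrame({
    'Name': ['John', 'Maria', 'Cesar'],
    'Date': ['2022-01-01', '2022-01-20', 'Null'],
})

# errors='coerce' turns anything that is not a valid date into NaT
frame['Date'] = pd.to_datetime(frame['Date'], errors='coerce')

# Drop the rows whose date could not be parsed
cleaned = frame.dropna(subset=['Date'])
print(cleaned)
```

This only removes values that are not valid dates; it does not pair orders with negotiations, so it does not replace the date-aware merge.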
