2023年7月27日 22:14:47go评论72阅读模式

英文:

How to join two dataframes with different column sets in pandas

问题

我有两个pandas数据框：df1和df2。

data_df1 = {
    'seller': ['John', 'Alice', 'Bob', 'Alice'],
    'date': ['2023-07-25', '2023-07-26', '2023-07-27', '2023-07-28']
}

df1 = pd.DataFrame(data_df1)

data_df2 = {
    'seller': ['Alice Perry', 'John Russell', 'Bob Foster', 'Alice Mcgee'],
    'id': [101, 102, 103, 104],
    'date': ['2023-07-26', '2023-07-25', '2023-07-27', '2023-07-26']
}

df2 = pd.DataFrame(data_df2)

我想要在这些数据框之间执行一个基于'seller'和'date'列的连接。值得一提的是，卖方列在两个数据框中不同，一个只有名字（df1），另一个有全名（df2），日期也不同，因此合并必须是基于接近的日期，而不是精确日期。

我应该如何执行连接并获得所需的结果？我是否应该在连接之前将'date'列转换为通用格式，如果是的话，我该如何做到这一点？

我欣赏任何关于如何正确完成此连接的见解或示例。谢谢！

我已经成功地将日期转换为连接正常工作，但我不知道如何处理名称。

英文:

I have two pandas dataframes: df1 and df2.

data_df1 = {
    &#39;seller&#39;: [&#39;John&#39;, &#39;Alice&#39;, &#39;Bob&#39;, &#39;Alice&#39;],
    &#39;date&#39;: [&#39;2023-07-25&#39;, &#39;2023-07-26&#39;, &#39;2023-07-27&#39;, &#39;2023-07-28&#39;]
}

df1 = pd.DataFrame(data_df1)

data_df2 = {
    &#39;seller&#39;: [&#39;Alice Perry&#39;, &#39;John Russell&#39;, &#39;Bob Foster&#39;, &#39;Alice Mcgee&#39;],
    &#39;id&#39;: [101, 102, 103, 104],
    &#39;date&#39;: [&#39;2023-07-26&#39;, &#39;2023-07-25&#39;, &#39;2023-07-27&#39;, &#39;2023-07-26&#39;]
}

df2 = pd.DataFrame(data_df2)

I want to perform a join between these dataframes based on the 'seller' and 'date' columns. It is worth mentioning that the seller column is not the same in both dataframes, one has only the name (df1) and the other has the full name (df2) and the date is not the same, so the merge must be by proximity, not exact dates.

How can I perform the join and get the desired result? Should I convert the 'date' columns to a common format before joining, and if so, how can I do that?

I appreciate any insights or examples on how to accomplish this join correctly. Thank you!

I already managed to make the conversion of dates for the join to work, but I don't know how to do this with the names

答案1

得分: 1

标准化 date 列：

df1['date'] = pd.to_datetime(df1['date']).dt.date
df2['date'] = pd.to_datetime(df2['date']).dt.date

提取 df2 的姓氏：

df2['first_name'] = df2['seller'].str.split(' ').str[0]

使用 .merge() 进行合并：

result_df = df1.merge(df2, how='inner', left_on=['seller', 'date'], right_on=['first_name', 'date'])

输出：

>       seller_x        date      seller_y   id first_name
>     0     John  2023-07-25  John Russell  102       John
>     1    Alice  2023-07-26   Alice Perry  101      Alice
>     2    Alice  2023-07-26   Alice Mcgee  104      Alice
>     3      Bob  2023-07-27    Bob Foster  103        Bob

英文:

Standardize the date columns:

df1[&#39;date&#39;] = pd.to_datetime(df1[&#39;date&#39;]).dt.date
df2[&#39;date&#39;] = pd.to_datetime(df2[&#39;date&#39;]).dt.date

Extract only the first name for df2:

df2[&#39;first_name&#39;] = df2[&#39;seller&#39;].str.split(&#39; &#39;).str[0]

Merge using .merge():

result_df = df1.merge(df2, how=&#39;inner&#39;, left_on=[&#39;seller&#39;, &#39;date&#39;], right_on=[&#39;first_name&#39;, &#39;date&#39;])

Output:

> seller_x date seller_y id first_name
> 0 John 2023-07-25 John Russell 102 John
> 1 Alice 2023-07-26 Alice Perry 101 Alice
> 2 Alice 2023-07-26 Alice Mcgee 104 Alice
> 3 Bob 2023-07-27 Bob Foster 103 Bob

答案2

得分: 0

你可以使用以下代码：

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

df1.merge(df2, left_on=['seller', 'date'],
          right_on=[df2['seller'].str.extract(r'(\w+)', expand=False), df2['date'].dt.normalize()],
          suffixes=(None, '_'))

输出：

 seller       date       seller_   id               date_
0   John 2023-07-25  John Russell  102 2023-07-25 11:45:00
1  Alice 2023-07-26   Alice Perry  101 2023-07-26 12:30:00
2  Alice 2023-07-26   Alice Mcgee  104 2023-07-26 16:20:00
3    Bob 2023-07-27    Bob Foster  103 2023-07-27 08:15:00

如果你只想保留其中一个键：

out = df1.merge(df2, left_on=['seller', 'date'],
                right_on=[df2['seller'].str.extract(r'(\w+)', expand=False),
                          df2['date'].dt.normalize()],
               suffixes=(None, '_')
               ).drop(columns=['seller_', 'date_'])

输出：

  seller       date   id
0   John 2023-07-25  102
1  Alice 2023-07-26  101
2  Alice 2023-07-26  104
3    Bob 2023-07-27  103

英文:

You can use:

df1[&#39;date&#39;] = pd.to_datetime(df1[&#39;date&#39;])
df2[&#39;date&#39;] = pd.to_datetime(df2[&#39;date&#39;])

df1.merge(df2, left_on=[&#39;seller&#39;, &#39;date&#39;],
          right_on=[df2[&#39;seller&#39;].str.extract(r&#39;(\w+)&#39;, expand=False), df2[&#39;date&#39;].dt.normalize()],
          suffixes=(None, &#39;_&#39;))

Output:

 seller       date       seller_   id               date_
0   John 2023-07-25  John Russell  102 2023-07-25 11:45:00
1  Alice 2023-07-26   Alice Perry  101 2023-07-26 12:30:00
2  Alice 2023-07-26   Alice Mcgee  104 2023-07-26 16:20:00
3    Bob 2023-07-27    Bob Foster  103 2023-07-27 08:15:00

If you only want to keep one of the keys:

out = df1.merge(df2, left_on=[&#39;seller&#39;, &#39;date&#39;],
                right_on=[df2[&#39;seller&#39;].str.extract(r&#39;(\w+)&#39;, expand=False),
                          df2[&#39;date&#39;].dt.normalize()],
               suffixes=(None, &#39;_&#39;)
               ).drop(columns=[&#39;seller_&#39;, &#39;date_&#39;])

Output:

  seller       date   id
0   John 2023-07-25  102
1  Alice 2023-07-26  101
2  Alice 2023-07-26  104
3    Bob 2023-07-27  103

答案3

得分: 0

好的，

以下是翻译好的部分：

日期上的连接需要具有相同的数据类型，所以将它们都更改为日期而不是日期时间。 df["date"] = df["date"].astype(date)
在卖方方面，连接可能会有点复杂。在df1中，您只有名字，但在df2中，您有全名。您可以使用类似以下方式将它们分开为seller_first和seller_last：df2["seller_first"] = df2["seller"].partition(" ")[0]，但真正的问题在于您的数据。 df1中有两个Alice的条目，df2中有两个以Alice开头的条目。由于df2中的Alice代表不同的人，您的连接将不知道如何正确解决这个问题。

英文:

Good Day,

There are few things to consider in your example.

Joins on the dates needs to be the same data type, so make them both dates instead of datetimes. df["date"] = df["date"].astype(date)
Join in the seller might be a bit more complicated. In df1 you have only the first name, but in df2 you have the full name. You can separate them into seller_first and seller_last using something like: df2["seller_first"] = df2["seller"].partition(" ")[0], but the real problem is in your data. You have two entries in df1 with Alice and two entries in df2 that begin with Alice. Your join will not know how to resolve this correctly since the Alice in df2 are different people.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何在 pandas 中连接具有不同列集的两个数据框。

问题

答案1

答案2

答案3

如何在静态类型语言中复制？

运行守护进程服务器还是Shell命令？

获取亚马逊评论的CSV文件，使用Python和AWS。

如何在转置后保持两列的值对齐？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论