英文:
How to join two dataframes with different column sets in pandas
问题
我有两个pandas数据框:df1和df2。
data_df1 = {
'seller': ['John', 'Alice', 'Bob', 'Alice'],
'date': ['2023-07-25', '2023-07-26', '2023-07-27', '2023-07-28']
}
df1 = pd.DataFrame(data_df1)
data_df2 = {
'seller': ['Alice Perry', 'John Russell', 'Bob Foster', 'Alice Mcgee'],
'id': [101, 102, 103, 104],
'date': ['2023-07-26', '2023-07-25', '2023-07-27', '2023-07-26']
}
df2 = pd.DataFrame(data_df2)
我想要在这些数据框之间执行一个基于'seller'和'date'列的连接。值得一提的是,卖方列在两个数据框中不同,一个只有名字(df1),另一个有全名(df2),日期也不同,因此合并必须是基于接近的日期,而不是精确日期。
我应该如何执行连接并获得所需的结果?我是否应该在连接之前将'date'列转换为通用格式,如果是的话,我该如何做到这一点?
我欣赏任何关于如何正确完成此连接的见解或示例。谢谢!
我已经成功地将日期转换为连接正常工作,但我不知道如何处理名称。
英文:
I have two pandas dataframes: df1 and df2.
data_df1 = {
'seller': ['John', 'Alice', 'Bob', 'Alice'],
'date': ['2023-07-25', '2023-07-26', '2023-07-27', '2023-07-28']
}
df1 = pd.DataFrame(data_df1)
data_df2 = {
'seller': ['Alice Perry', 'John Russell', 'Bob Foster', 'Alice Mcgee'],
'id': [101, 102, 103, 104],
'date': ['2023-07-26', '2023-07-25', '2023-07-27', '2023-07-26']
}
df2 = pd.DataFrame(data_df2)
I want to perform a join between these dataframes based on the 'seller' and 'date' columns. It is worth mentioning that the seller column is not the same in both dataframes, one has only the name (df1) and the other has the full name (df2) and the date is not the same, so the merge must be by proximity, not exact dates.
How can I perform the join and get the desired result? Should I convert the 'date' columns to a common format before joining, and if so, how can I do that?
I appreciate any insights or examples on how to accomplish this join correctly. Thank you!
I already managed to make the conversion of dates for the join to work, but I don't know how to do this with the names
答案1
得分: 1
标准化 date
列:
df1['date'] = pd.to_datetime(df1['date']).dt.date
df2['date'] = pd.to_datetime(df2['date']).dt.date
提取 df2
的姓氏:
df2['first_name'] = df2['seller'].str.split(' ').str[0]
使用 .merge()
进行合并:
result_df = df1.merge(df2, how='inner', left_on=['seller', 'date'], right_on=['first_name', 'date'])
输出:
> seller_x date seller_y id first_name
> 0 John 2023-07-25 John Russell 102 John
> 1 Alice 2023-07-26 Alice Perry 101 Alice
> 2 Alice 2023-07-26 Alice Mcgee 104 Alice
> 3 Bob 2023-07-27 Bob Foster 103 Bob
英文:
Standardize the date
columns:
df1['date'] = pd.to_datetime(df1['date']).dt.date
df2['date'] = pd.to_datetime(df2['date']).dt.date
Extract only the first name for df2
:
df2['first_name'] = df2['seller'].str.split(' ').str[0]
Merge using .merge()
:
result_df = df1.merge(df2, how='inner', left_on=['seller', 'date'], right_on=['first_name', 'date'])
Output:
> seller_x date seller_y id first_name
> 0 John 2023-07-25 John Russell 102 John
> 1 Alice 2023-07-26 Alice Perry 101 Alice
> 2 Alice 2023-07-26 Alice Mcgee 104 Alice
> 3 Bob 2023-07-27 Bob Foster 103 Bob
答案2
得分: 0
你可以使用以下代码:
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
df1.merge(df2, left_on=['seller', 'date'],
right_on=[df2['seller'].str.extract(r'(\w+)', expand=False), df2['date'].dt.normalize()],
suffixes=(None, '_'))
输出:
seller date seller_ id date_
0 John 2023-07-25 John Russell 102 2023-07-25 11:45:00
1 Alice 2023-07-26 Alice Perry 101 2023-07-26 12:30:00
2 Alice 2023-07-26 Alice Mcgee 104 2023-07-26 16:20:00
3 Bob 2023-07-27 Bob Foster 103 2023-07-27 08:15:00
如果你只想保留其中一个键:
out = df1.merge(df2, left_on=['seller', 'date'],
right_on=[df2['seller'].str.extract(r'(\w+)', expand=False),
df2['date'].dt.normalize()],
suffixes=(None, '_')
).drop(columns=['seller_', 'date_'])
输出:
seller date id
0 John 2023-07-25 102
1 Alice 2023-07-26 101
2 Alice 2023-07-26 104
3 Bob 2023-07-27 103
英文:
You can use:
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
df1.merge(df2, left_on=['seller', 'date'],
right_on=[df2['seller'].str.extract(r'(\w+)', expand=False), df2['date'].dt.normalize()],
suffixes=(None, '_'))
Output:
seller date seller_ id date_
0 John 2023-07-25 John Russell 102 2023-07-25 11:45:00
1 Alice 2023-07-26 Alice Perry 101 2023-07-26 12:30:00
2 Alice 2023-07-26 Alice Mcgee 104 2023-07-26 16:20:00
3 Bob 2023-07-27 Bob Foster 103 2023-07-27 08:15:00
If you only want to keep one of the keys:
out = df1.merge(df2, left_on=['seller', 'date'],
right_on=[df2['seller'].str.extract(r'(\w+)', expand=False),
df2['date'].dt.normalize()],
suffixes=(None, '_')
).drop(columns=['seller_', 'date_'])
Output:
seller date id
0 John 2023-07-25 102
1 Alice 2023-07-26 101
2 Alice 2023-07-26 104
3 Bob 2023-07-27 103
答案3
得分: 0
好的,
以下是翻译好的部分:
- 日期上的连接需要具有相同的数据类型,所以将它们都更改为日期而不是日期时间。 df["date"] = df["date"].astype(date)
- 在卖方方面,连接可能会有点复杂。 在df1中,您只有名字,但在df2中,您有全名。 您可以使用类似以下方式将它们分开为seller_first和seller_last:df2["seller_first"] = df2["seller"].partition(" ")[0],但真正的问题在于您的数据。 df1中有两个Alice的条目,df2中有两个以Alice开头的条目。 由于df2中的Alice代表不同的人,您的连接将不知道如何正确解决这个问题。
英文:
Good Day,
There are few things to consider in your example.
- Joins on the dates needs to be the same data type, so make them both dates instead of datetimes. df["date"] = df["date"].astype(date)
- Join in the seller might be a bit more complicated. In df1 you have only the first name, but in df2 you have the full name. You can separate them into seller_first and seller_last using something like: df2["seller_first"] = df2["seller"].partition(" ")[0], but the real problem is in your data. You have two entries in df1 with Alice and two entries in df2 that begin with Alice. Your join will not know how to resolve this correctly since the Alice in df2 are different people.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论