如何在 pandas 中连接具有不同列集的两个数据框。

huangapple go评论72阅读模式
英文:

How to join two dataframes with different column sets in pandas

问题

我有两个pandas数据框:df1和df2。

data_df1 = {
    'seller': ['John', 'Alice', 'Bob', 'Alice'],
    'date': ['2023-07-25', '2023-07-26', '2023-07-27', '2023-07-28']
}

df1 = pd.DataFrame(data_df1)

data_df2 = {
    'seller': ['Alice Perry', 'John Russell', 'Bob Foster', 'Alice Mcgee'],
    'id': [101, 102, 103, 104],
    'date': ['2023-07-26', '2023-07-25', '2023-07-27', '2023-07-26']
}

df2 = pd.DataFrame(data_df2)

我想要在这些数据框之间执行一个基于'seller'和'date'列的连接。值得一提的是,卖方列在两个数据框中不同,一个只有名字(df1),另一个有全名(df2),日期也不同,因此合并必须是基于接近的日期,而不是精确日期。

我应该如何执行连接并获得所需的结果?我是否应该在连接之前将'date'列转换为通用格式,如果是的话,我该如何做到这一点?

我欣赏任何关于如何正确完成此连接的见解或示例。谢谢!

我已经成功地将日期转换为连接正常工作,但我不知道如何处理名称。

英文:

I have two pandas dataframes: df1 and df2.

data_df1 = {
    'seller': ['John', 'Alice', 'Bob', 'Alice'],
    'date': ['2023-07-25', '2023-07-26', '2023-07-27', '2023-07-28']
}

df1 = pd.DataFrame(data_df1)

data_df2 = {
    'seller': ['Alice Perry', 'John Russell', 'Bob Foster', 'Alice Mcgee'],
    'id': [101, 102, 103, 104],
    'date': ['2023-07-26', '2023-07-25', '2023-07-27', '2023-07-26']
}

df2 = pd.DataFrame(data_df2)

I want to perform a join between these dataframes based on the 'seller' and 'date' columns. It is worth mentioning that the seller column is not the same in both dataframes, one has only the name (df1) and the other has the full name (df2) and the date is not the same, so the merge must be by proximity, not exact dates.

How can I perform the join and get the desired result? Should I convert the 'date' columns to a common format before joining, and if so, how can I do that?

I appreciate any insights or examples on how to accomplish this join correctly. Thank you!

I already managed to make the conversion of dates for the join to work, but I don't know how to do this with the names

答案1

得分: 1

标准化 date 列:

df1['date'] = pd.to_datetime(df1['date']).dt.date
df2['date'] = pd.to_datetime(df2['date']).dt.date

提取 df2 的姓氏:

df2['first_name'] = df2['seller'].str.split(' ').str[0]

使用 .merge() 进行合并:

result_df = df1.merge(df2, how='inner', left_on=['seller', 'date'], right_on=['first_name', 'date'])

输出:

>       seller_x        date      seller_y   id first_name
>     0     John  2023-07-25  John Russell  102       John
>     1    Alice  2023-07-26   Alice Perry  101      Alice
>     2    Alice  2023-07-26   Alice Mcgee  104      Alice
>     3      Bob  2023-07-27    Bob Foster  103        Bob
英文:

Standardize the date columns:

df1['date'] = pd.to_datetime(df1['date']).dt.date
df2['date'] = pd.to_datetime(df2['date']).dt.date

Extract only the first name for df2:

df2['first_name'] = df2['seller'].str.split(' ').str[0]

Merge using .merge():

result_df = df1.merge(df2, how='inner', left_on=['seller', 'date'], right_on=['first_name', 'date'])

Output:

> seller_x date seller_y id first_name
> 0 John 2023-07-25 John Russell 102 John
> 1 Alice 2023-07-26 Alice Perry 101 Alice
> 2 Alice 2023-07-26 Alice Mcgee 104 Alice
> 3 Bob 2023-07-27 Bob Foster 103 Bob

答案2

得分: 0

你可以使用以下代码:

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

df1.merge(df2, left_on=['seller', 'date'],
          right_on=[df2['seller'].str.extract(r'(\w+)', expand=False), df2['date'].dt.normalize()],
          suffixes=(None, '_'))

输出:

 seller       date       seller_   id               date_
0   John 2023-07-25  John Russell  102 2023-07-25 11:45:00
1  Alice 2023-07-26   Alice Perry  101 2023-07-26 12:30:00
2  Alice 2023-07-26   Alice Mcgee  104 2023-07-26 16:20:00
3    Bob 2023-07-27    Bob Foster  103 2023-07-27 08:15:00

如果你只想保留其中一个键:

out = df1.merge(df2, left_on=['seller', 'date'],
                right_on=[df2['seller'].str.extract(r'(\w+)', expand=False),
                          df2['date'].dt.normalize()],
               suffixes=(None, '_')
               ).drop(columns=['seller_', 'date_'])

输出:

  seller       date   id
0   John 2023-07-25  102
1  Alice 2023-07-26  101
2  Alice 2023-07-26  104
3    Bob 2023-07-27  103
英文:

You can use:

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

df1.merge(df2, left_on=['seller', 'date'],
          right_on=[df2['seller'].str.extract(r'(\w+)', expand=False), df2['date'].dt.normalize()],
          suffixes=(None, '_'))

Output:

 seller       date       seller_   id               date_
0   John 2023-07-25  John Russell  102 2023-07-25 11:45:00
1  Alice 2023-07-26   Alice Perry  101 2023-07-26 12:30:00
2  Alice 2023-07-26   Alice Mcgee  104 2023-07-26 16:20:00
3    Bob 2023-07-27    Bob Foster  103 2023-07-27 08:15:00

If you only want to keep one of the keys:

out = df1.merge(df2, left_on=['seller', 'date'],
                right_on=[df2['seller'].str.extract(r'(\w+)', expand=False),
                          df2['date'].dt.normalize()],
               suffixes=(None, '_')
               ).drop(columns=['seller_', 'date_'])

Output:

  seller       date   id
0   John 2023-07-25  102
1  Alice 2023-07-26  101
2  Alice 2023-07-26  104
3    Bob 2023-07-27  103

答案3

得分: 0

好的,

以下是翻译好的部分:

  1. 日期上的连接需要具有相同的数据类型,所以将它们都更改为日期而不是日期时间。 df["date"] = df["date"].astype(date)
  2. 在卖方方面,连接可能会有点复杂。 在df1中,您只有名字,但在df2中,您有全名。 您可以使用类似以下方式将它们分开为seller_first和seller_last:df2["seller_first"] = df2["seller"].partition(" ")[0],但真正的问题在于您的数据。 df1中有两个Alice的条目,df2中有两个以Alice开头的条目。 由于df2中的Alice代表不同的人,您的连接将不知道如何正确解决这个问题。
英文:

Good Day,

There are few things to consider in your example.

  1. Joins on the dates needs to be the same data type, so make them both dates instead of datetimes. df["date"] = df["date"].astype(date)
  2. Join in the seller might be a bit more complicated. In df1 you have only the first name, but in df2 you have the full name. You can separate them into seller_first and seller_last using something like: df2["seller_first"] = df2["seller"].partition(" ")[0], but the real problem is in your data. You have two entries in df1 with Alice and two entries in df2 that begin with Alice. Your join will not know how to resolve this correctly since the Alice in df2 are different people.

huangapple
  • 本文由 发表于 2023年7月27日 22:14:47
  • 转载请务必保留本文链接:https://go.coder-hub.com/76780631.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定