How to join two pandas DataFrames on the trailing part of a path / filename

Question

I have two DataFrames as follows:

df1 = pd.DataFrame({'PATH':[r'C:\FODLER\Test1.jpg',
                            r'C:\A\FODLER\Test2.jpg',
                            r'C:\A\FODLER\Test3.jpg',
                            r'C:\A\FODLER\Test4.jpg'],
                    'VALUE':[45,23,45,2]})

df2 = pd.DataFrame({'F_NAME': [r'FODLER\Test1.jpg',
                               r'FODLER\Test2.jpg',
                               r'FODLER\Test6.jpg',
                               r'FODLER\Test3.jpg',
                               r'FODLER\Test4.jpg',
                               r'FODLER\Test9.jpg'],
                    'VALUE_X': ['12', '25', '97', '33', '123', '0'],
                    'CORDS': ['1', '2', '3', '4', '5', '6']})

I want to join df2 to df1 wherever PATH contains F_NAME (i.e. PATH.Contains(F_NAME)),
so that the resulting DataFrame looks like this:

df3 = pd.DataFrame({'PATH':[r'C:\FODLER\Test1.jpg',
                            r'C:\A\FODLER\Test2.jpg',
                            r'C:\A\FODLER\Test3.jpg',
                            r'C:\A\FODLER\Test4.jpg'],
                    'F_NAME': [r'FODLER\Test1.jpg',
                               r'FODLER\Test2.jpg',
                               r'FODLER\Test3.jpg',
                               r'FODLER\Test4.jpg'],
                    'VALUE_X': ['12', '25', '33', '123'],
                    'CORDS': ['1', '2', '4', '5'],
                    'VALUE':[45,23,45,2]})

How do I write the pandas merge statement for this join?

Answer 1

Score: 2

You can merge on a key built with str.extract, using a regex that matches any F_NAME anchored at the end of the path:

import re

pattern = f"({'|'.join(df2['F_NAME'].apply(re.escape))})$"

df3 = df1.merge(df2, left_on=df1['PATH'].str.extract(pattern, expand=False),
                right_on='F_NAME', how='left')

Output:

                    PATH  VALUE            F_NAME VALUE_X CORDS
0    C:\FODLER\Test1.jpg     45  FODLER\Test1.jpg      12     1
1  C:\A\FODLER\Test2.jpg     23  FODLER\Test2.jpg      25     2
2  C:\A\FODLER\Test3.jpg     45  FODLER\Test3.jpg      33     4
3  C:\A\FODLER\Test4.jpg      2  FODLER\Test4.jpg     123     5
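
In the sample data every PATH has a counterpart in df2, so how='left' and how='inner' give the same rows. As a rough sketch of where they differ, the made-up frames below (an assumption for illustration, not the question's data) include one path with no match. The indicator=True flag adds a _merge column that marks which rows found a partner.

```python
import re
import pandas as pd

# Hypothetical data: the second PATH has no matching F_NAME
df1 = pd.DataFrame({'PATH': [r'C:\FODLER\Test1.jpg', r'C:\A\FODLER\Test7.jpg'],
                    'VALUE': [45, 23]})
df2 = pd.DataFrame({'F_NAME': [r'FODLER\Test1.jpg', r'FODLER\Test2.jpg'],
                    'VALUE_X': ['12', '25']})

# Same key construction as in the answer above
pattern = f"({'|'.join(df2['F_NAME'].apply(re.escape))})$"
key = df1['PATH'].str.extract(pattern, expand=False)

# Left join keeps the unmatched row; inner join drops it
left = df1.merge(df2, left_on=key, right_on='F_NAME', how='left', indicator=True)
inner = df1.merge(df2, left_on=key, right_on='F_NAME', how='inner')

print(left[['PATH', 'VALUE_X', '_merge']])
print(len(inner))
```

Switching to how='inner' is one way to drop rows without a match, if that is the desired result.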

The generated pattern:

(FODLER\\Test1\.jpg|FODLER\\Test2\.jpg|FODLER\\Test6\.jpg|FODLER\\Test3\.jpg|FODLER\\Test4\.jpg|FODLER\\Test9\.jpg)$
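
To illustrate why each name goes through re.escape and why the pattern ends with $, here is a small sketch on made-up strings (the names and paths are assumptions, not from the question). A raw name contains a backslash followed by a letter, which the re module rejects, and without the $ anchor a name appearing in the middle of a path would also match.

```python
import re
import pandas as pd

# Hypothetical sample data, not taken from the question
names = pd.Series([r'FODLER\Test1.jpg', r'FODLER\Test2.jpg'])
paths = pd.Series([r'C:\FODLER\Test1.jpg',       # ends with a known name
                   r'C:\FODLER\Test1.jpg.bak',   # contains a known name, but not at the end
                   r'C:\A\FODLER\Test2.jpg'])    # ends with a known name

# A raw name is not a valid regex: "\T" is an unknown escape sequence
try:
    re.compile(names[0])
    unescaped_compiles = True
except re.error:
    unescaped_compiles = False

# Escaped alternatives, anchored at the end of the string
pattern = f"({'|'.join(names.apply(re.escape))})$"
extracted = paths.str.extract(pattern, expand=False)

print(unescaped_compiles)
print(extracted.tolist())
```

The second path yields NaN because the name is not at the end of the string, so it would not join to any row.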

Alternatively, if F_NAME always consists of exactly two components (folder\filename.ext), you can add a column holding the last two components of PATH and merge on it:

df3 = (df1
    .assign(F_NAME=df1['PATH'].str.extract(r'([^\\]+\\[^\\]+)$', expand=False))
    .merge(df2, how='left')
)
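
The same key could also be built without a regex. The sketch below uses pathlib.PureWindowsPath from the standard library; it assumes Windows-style separators and that every PATH has at least a folder and a file name. The helper last_two_parts and the shortened frames are hypothetical, made up for illustration.

```python
from pathlib import PureWindowsPath
import pandas as pd

# Shortened, hypothetical versions of the frames in the question
df1 = pd.DataFrame({'PATH': [r'C:\FODLER\Test1.jpg', r'C:\A\FODLER\Test2.jpg'],
                    'VALUE': [45, 23]})
df2 = pd.DataFrame({'F_NAME': [r'FODLER\Test1.jpg', r'FODLER\Test2.jpg', r'FODLER\Test6.jpg'],
                    'VALUE_X': ['12', '25', '97']})

def last_two_parts(path: str) -> str:
    """Return the last folder and file name joined by a backslash (hypothetical helper)."""
    parts = PureWindowsPath(path).parts[-2:]
    return '\\'.join(parts)

# Build the join key from PATH, then merge on it
df3 = (df1
    .assign(F_NAME=df1['PATH'].map(last_two_parts))
    .merge(df2, on='F_NAME', how='left')
)
print(df3)
```

PureWindowsPath parses backslash-separated paths on any operating system, which is why it is used here instead of Path.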

Answer 2

Score: 1

You can try this:

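import re

# Escape each F_NAME so that backslashes and dots match literally
pattern = '|'.join(df2['F_NAME'].map(re.escape))
matched = df1[df1['PATH'].str.contains(pattern)]
# Join on the matched substring of PATH, not on the full PATH
df3 = matched.merge(df2, left_on=matched['PATH'].str.extract(f'({pattern})', expand=False), right_on='F_NAME', how='left')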

print(df3)

huangapple
  • Posted on May 25, 2023 at 14:54:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76329602.html