Extracting Data From Pandas DataFrame

Question

I have two pandas DataFrames, df1 and df2, each with a single column of file paths. I want to take the file names from df1, match them against df2 (df2 has more files than df1), and put the matched paths in two columns of one DataFrame. The common key is the alphanumeric ID that starts with the letter s (s537061 and s586955 in the samples below). The two DataFrames have to be matched on that ID.

df1["Text_File_Location"] =
0 /home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt
1 /home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt

df2["Image_File_Location"] =
0 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg
1 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg

Answer 1

Score: 1

In Python 3.4+, you can use pathlib to work with file paths conveniently. Extract the file name without its extension (the "stem") from each path in df1, and the name of the parent folder from each path in df2. Both give the same s-prefixed ID, so you can then do an inner merge on those names.

import pandas as pd
from pathlib import Path

df1 = pd.DataFrame(
    {
        "Text_File_Location": [
            "/home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt",
            "/home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt",
        ]
    }
)
df2 = pd.DataFrame(
    {
        "Image_File_Location": [
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/foo/bar.jpg",
        ]
    }
)

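# Key for df1: the file name without its extension, e.g. "s537061".
df1["name"] = df1["Text_File_Location"].apply(lambda x: Path(str(x)).stem)
# Key for df2: the name of the folder that contains the image, e.g. "s537061".
df2["name"] = df2["Image_File_Location"].apply(lambda x: Path(str(x)).parent.name)

# An inner merge keeps only keys present in both frames, so the "foo" row is dropped.
df3 = pd.merge(df1, df2, on="name", how="inner")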
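
If the paths do not always follow the same layout, a regular expression can pick out the key more defensively. The following is only a sketch of that alternative, not part of the original answer: it assumes the key is always the letter s followed by digits, that it is the file name in df1 and the parent folder in df2, and the variable name matched is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Text_File_Location": [
            "/home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt",
            "/home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt",
        ]
    }
)
df2 = pd.DataFrame(
    {
        "Image_File_Location": [
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/foo/bar.jpg",
        ]
    }
)

# Assumed key format: "s" followed by digits.
# In df1 the key is the file name before ".txt"; in df2 it is the parent folder.
df1["name"] = df1["Text_File_Location"].str.extract(r"/(s\d+)\.txt$", expand=False)
df2["name"] = df2["Image_File_Location"].str.extract(r"/(s\d+)/[^/]+$", expand=False)

# Paths that do not match the pattern get NaN as their key. Drop them before
# merging, because pandas matches NaN keys with each other in a merge.
matched = pd.merge(
    df1.dropna(subset=["name"]),
    df2.dropna(subset=["name"]),
    on="name",
    how="inner",
)

print(matched[["Text_File_Location", "Image_File_Location"]])
```

Compared with the pathlib version, this skips the row-by-row apply and leaves rows such as foo/bar.jpg without a key instead of giving them the key "foo". If df2 holds several images for the same ID, either version returns one row per image.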

huangapple
  • Posted on January 9, 2023, 06:44:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75051811.html