Compare pandas dataframe columns ignoring number in front of text

Question

I am creating two dataframes, concatenating them, and swapping the column levels so that the matching column names sit next to each other. I am then comparing the columns. The data in my columns starts with a number in a format like 1-1, followed by text. I would like to compare the text and ignore the number before it.

Example of what I have so far:

import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']
    })

    data2 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']
    })

    df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
    df_both = df_both.swaplevel(axis='columns')

    for column in df_both.columns.levels[0]:
        print(df_both.index[np.where(df_both[column]['Data1'] != df_both[column]['Data2'])])

Result: Int64Index([2, 3, 4], dtype='int64', name='ID')

Currently, IDs 2, 3, and 4 are returned because the full strings are not equal. I would like to find a way to ignore the number before the text (e.g. 1-1, 1-2, 1-3, 1-4) and compare only what follows it.

Answer 1

Score: 0

To ignore the number before the text, you can split the 'Text' column on the first space character and keep only the second part. Use the str.split() method of a pandas Series with n=1, then take the second element of each resulting list through the str accessor (.str[1]). You can try this:

import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']
    })

    data2 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']
    })

    # Split 'Text' column on the first space character and keep only the second part
    data1['Text'] = data1['Text'].str.split(' ', n=1).str[1]
    data2['Text'] = data2['Text'].str.split(' ', n=1).str[1]

    df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
    df_both = df_both.swaplevel(axis='columns')

    for column in df_both.columns.levels[0]:
        # The prefixes were already stripped above, so compare the two sides directly.
        # Splitting a second time would also drop the first word of the text.
        print(df_both.index[np.where(df_both[column]['Data1'] != df_both[column]['Data2'])])
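
The same comparison can also be done without overwriting the 'Text' columns of data1 and data2. The following is a minimal sketch, assuming every value starts with a prefix of the form digits-digits followed by whitespace. The helper name strip_prefix, the regular expression, and the differing dictionary are illustrative choices and are not part of the original answer.

```python
import pandas as pd


def strip_prefix(series: pd.Series) -> pd.Series:
    # Hypothetical helper: drop a leading "digits-digits" prefix and the
    # whitespace after it (assumed format, e.g. "1-4 Text here" -> "Text here").
    return series.str.replace(r"^\d+-\d+\s+", "", regex=True)


data1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']
})

data2 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']
})

df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
df_both = df_both.swaplevel(axis='columns')

# Collect, per column name, the IDs whose text differs once the prefix is ignored
differing = {}
for column in df_both.columns.levels[0]:
    left = strip_prefix(df_both[column]['Data1'])
    right = strip_prefix(df_both[column]['Data2'])
    differing[column] = df_both.index[(left != right).to_numpy()].tolist()

print(differing)  # {'Text': [2, 4]}
```

Because the prefix is stripped inside the loop, the original strings stay intact in df_both and can still be displayed next to each other.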

Answer 2

Score: 0

import pandas as pd

data1 = pd.DataFrame({
"ID": [1, 2, 3, 4],
"Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']})

data2 = pd.DataFrame({
"ID": [1, 2, 3, 4],
"Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']})

df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
df_both = df_both.swaplevel(axis='columns')

# Raw strings keep \d from being treated as an invalid Python escape sequence
s1 = df_both.loc[:, ('Text', 'Data1')].str.extract(r'(\d-\d)(.+)')[1]
s2 = df_both.loc[:, ('Text', 'Data2')].str.extract(r'(\d-\d)(.+)')[1]

m = s1.eq(s2)

print(df_both[m])
              Text                
             Data1           Data2
ID                                
1   1-1 Text here1  1-1 Text here1
3   1-3 Text here3  1-5 Text here3
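
The output above lists the rows whose text matches. To see the rows that differ instead, which is what the question asks for, the mask can be inverted. This is a minimal sketch, assuming the prefix is always digits, a hyphen, and digits. The anchored pattern with `\d+` (so that prefixes such as 1-10 are also handled) and the name mismatches are illustrative choices, not part of the original answer.

```python
import pandas as pd

data1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']})

data2 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']})

df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
df_both = df_both.swaplevel(axis='columns')

# Capture only the text after the numeric prefix (assumed format: digits-digits)
pattern = r'^\d+-\d+\s*(.+)'
s1 = df_both[('Text', 'Data1')].str.extract(pattern)[0]
s2 = df_both[('Text', 'Data2')].str.extract(pattern)[0]

# ~ inverts the equality mask, keeping the rows where the text differs
mismatches = df_both[~s1.eq(s2)]

print(mismatches.index.tolist())  # [2, 4]
```

Printing mismatches itself shows both original strings side by side for IDs 2 and 4.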

huangapple
  • Posted on March 7, 2023, 07:20:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75656704.html