Compare pandas dataframe columns ignoring number in front of text
Question
I am creating two dataframes, concatenating them, and swapping the column levels so that matching column names sit next to each other. I then compare the columns. The values in my columns start with a number in a format like 1-1, followed by text. I would like to compare the text and ignore the number before it.
Example of what I have so far:
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']})
    data2 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']})

    df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
    df_both = df_both.swaplevel(axis='columns')

    for column in df_both.columns.levels[0]:
        print(df_both.index[np.where(df_both[column]['Data1'] != df_both[column]['Data2'])])
Result: Int64Index([2, 3, 4], dtype='int64', name='ID')
Currently, IDs 2, 3, and 4 are returned because their values are not equal. I would like to find a way to ignore the number before the text (e.g. 1-1, 1-2, 1-3, 1-4) and compare only what follows it.
Answer 1
Score: 0
To ignore the number before the text, split each value of the 'Text' column on the first space and keep only the part after it. Use the pandas Series method str.split() with n=1, then take the second element of each resulting list with .str[1]. You can try this:
import pandas as pd
import numpy as np

if __name__ == "__main__":
    data1 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']
    })
    data2 = pd.DataFrame({
        "ID": [1, 2, 3, 4],
        "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']
    })

    df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
    df_both = df_both.swaplevel(axis='columns')

    for column in df_both.columns.levels[0]:
        # Split each value once, on the first space, and keep only the part after it.
        # Splitting here (and only here) leaves the original values in df_both untouched.
        data1_col = df_both[column]['Data1'].str.split(' ', n=1).str[1]
        data2_col = df_both[column]['Data2'].str.split(' ', n=1).str[1]
        # np.where returns a tuple; [0] holds the positions where the texts differ (IDs 2 and 4 here)
        print(df_both.index[np.where(data1_col != data2_col)[0]])
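As a small illustration of what the split step above does on its own, here is a minimal sketch on a standalone Series. The sample values are hypothetical and not taken from the question's data; the last one deliberately contains no space, to show the edge case where there is no second part to keep.

```python
import pandas as pd

# Hypothetical sample values; the last one has no space at all.
s = pd.Series(['1-1 Text here1', '12-3 Text here', 'NoPrefix'])

# n=1 limits the split to the first space, so the rest of the text stays in one piece.
parts = s.str.split(' ', n=1)

# .str[1] picks the second element of each list; it yields NaN when that element is missing.
text_only = parts.str[1]

print(text_only.tolist())
```

Because NaN never compares equal to NaN, two values that both lack a space would be reported as different by a `!=` comparison, so it may be worth checking for missing values before comparing.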
Answer 2
Score: 0
import pandas as pd

data1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']})
data2 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']})

df_both = pd.concat([data1.set_index('ID'), data2.set_index('ID')], axis=1, keys=['Data1', 'Data2'])
df_both = df_both.swaplevel(axis='columns')

# Group 0 captures the single-digit "d-d" prefix, group 1 the text after it.
# Raw strings keep \d from being treated as a Python string escape.
s1 = df_both.loc[:, ('Text', 'Data1')].str.extract(r'(\d-\d)(.+)')[1]
s2 = df_both.loc[:, ('Text', 'Data2')].str.extract(r'(\d-\d)(.+)')[1]

# m is True where the text matches once the prefix is ignored; use ~m for the rows that differ
m = s1.eq(s2)
print(df_both[m])
              Text
             Data1           Data2
ID
1   1-1 Text here1  1-1 Text here1
3   1-3 Text here3  1-5 Text here3
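The pattern above matches only single-digit prefixes, and the mask selects the rows that are equal. As a variation, here is a minimal sketch that strips the prefix with str.replace and lists the IDs that differ, which is what the question asks for. The prefix pattern (one or more digits, a hyphen, one or more digits, optional whitespace) and the names PREFIX, t1, t2, and differing_ids are assumptions made for illustration.

```python
import pandas as pd

data1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-2 Text here2', '1-3 Text here3', '1-4 Text here4']}).set_index('ID')
data2 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ['1-1 Text here1', '1-4 Text here', '1-5 Text here3', '1-6 Text here']}).set_index('ID')

# Assumed prefix format: digits, a hyphen, digits, then optional whitespace.
PREFIX = r'^\d+-\d+\s*'

# Remove the prefix from each value; the original frames are left unchanged.
t1 = data1['Text'].str.replace(PREFIX, '', regex=True)
t2 = data2['Text'].str.replace(PREFIX, '', regex=True)

# Keep the IDs whose remaining text differs between the two frames.
differing_ids = t1[t1.ne(t2)].index.tolist()
print(differing_ids)  # [2, 4]
```

Values that do not start with the assumed prefix pass through str.replace unchanged, so they are compared as-is rather than turning into missing values.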