How to correctly determine if a Pandas dataframe has replaced values in a column based on a string in another column

Question

I have a very large Pandas dataframe in Python with three important columns: 'file', 'comment', and 'number'. It lists many different files with assigned id-numbers, but some of these files replace older files and should share the id-number of the file they replace instead of having a separate one. An example:

import pandas as pd

# Example data: every '_v2' file replaces the file named in its comment
df_test = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2'],
    'comment': ['none', 'Replacing: file_1', 'none', 'Replacing: file_2', 'none', 'Replacing: file_3'],
    'number': ['12', '12', '15', '16', '18', '18'],
})

What I want is to check whether the 'number' column shows the same id-number for each original file and for the file that replaces it. A replacing file has a comment that starts with 'Replacing: ' and ends with the name of the file it replaces, so its number should be compared with the number of that file. In this example, I would want something like a list or a new column in the dataframe that reads 'True', 'True', 'False', 'False', 'True', 'True', since the second and last files have been assigned the same id-number as the files they replace, but the fourth file has not. I can't figure out how to check this, and any help is appreciated. Thanks!

Answer 1

Score: 1

If every 'none' row always comes directly before the file that replaces it, you can replace 'none' with missing values and back-fill them to build a helper Series of group labels, then test whether each group contains a single unique number with GroupBy.transform and DataFrameGroupBy.nunique:

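# Turn 'none' into NaN, then back-fill so each original row takes the comment of the row after it
s = df_test['comment'].mask(df_test['comment'].eq('none')).bfill()
# True where all rows in a group share a single 'number'
df_test['test'] = df_test.groupby(s)['number'].transform('nunique').eq(1)
print(df_test)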

        file            comment number   test
0     file_1               none     12   True
1  file_1_v2  Replacing: file_1     12   True
2     file_2               none     15  False
3  file_2_v2  Replacing: file_2     16  False
4     file_3               none     18   True
5  file_3_v2  Replacing: file_3     18   True
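
To see why this works, it can help to look at the helper Series itself. The following is a minimal sketch that rebuilds the example data from the question and prints the group labels along with the count of distinct numbers per group. It assumes, as the answer does, that each 'none' row sits directly above its replacement.

```python
import pandas as pd

# Example data from the question
df_test = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2'],
    'comment': ['none', 'Replacing: file_1', 'none', 'Replacing: file_2', 'none', 'Replacing: file_3'],
    'number': ['12', '12', '15', '16', '18', '18'],
})

# 'none' becomes NaN and is then filled from the next row's comment
s = df_test['comment'].mask(df_test['comment'].eq('none')).bfill()
print(s.tolist())

# Number of distinct id-numbers in each group; 1 means original and replacement agree
distinct_per_group = df_test.groupby(s)['number'].nunique()
print(distinct_per_group)
```

Each original row ends up with the same label as the replacement row below it, so a group with more than one distinct number marks a mismatch.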

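Another way to build the groups is to extract the file name from the comment with Series.str.extract (the regex captures everything after the first whitespace), replace the non-matching values with the row's own file name using Series.fillna, and then test uniqueness per group as in the previous solution: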

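# Capture the text after the first whitespace in the comment (the replaced file name);
# comments without whitespace ('none') give NaN and fall back to the row's own file name
s = df_test['comment'].str.extract(r'\s(.*)$', expand=False).fillna(df_test['file'])
# True where all rows in a group share a single 'number'
df_test['test'] = df_test.groupby(s)['number'].transform('nunique').eq(1)
print(df_test)

        file            comment number   test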
0     file_1               none     12   True
1  file_1_v2  Replacing: file_1     12   True
2     file_2               none     15  False
3  file_2_v2  Replacing: file_2     16  False
4     file_3               none     18   True
5  file_3_v2  Replacing: file_3     18   True
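
If the row order cannot be relied on, one possible alternative (not part of the original answer) is to look up the number of the replaced file by name. The sketch below assumes that file names are unique and that every replacement comment has the form 'Replacing: <file name>'. The column name matches_original and the stricter regex are illustrative choices, not something from the question. Only replacement rows get a verdict here; original rows are left as missing.

```python
import pandas as pd

# Example data from the question
df_test = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2'],
    'comment': ['none', 'Replacing: file_1', 'none', 'Replacing: file_2', 'none', 'Replacing: file_3'],
    'number': ['12', '12', '15', '16', '18', '18'],
})

# Name of the replaced file, or NaN when the comment is not a replacement note
replaced = df_test['comment'].str.extract(r'^Replacing:\s*(.+)$', expand=False)

# Lookup table from file name to id-number (assumes unique file names)
number_by_file = df_test.set_index('file')['number']

# Id-number of the file being replaced, aligned with each replacement row
original_number = replaced.map(number_by_file)

# Compare on replacement rows only; original rows stay missing
df_test['matches_original'] = original_number.eq(df_test['number']).where(replaced.notna())
print(df_test)
```

Unlike the grouped solutions, this does not flag the original row itself, so it answers a slightly narrower question: whether each replacement carries the number of the file it names.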

huangapple
  • This article was published on May 10, 2023, 18:19:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76217268.html