How to correctly determine if a Pandas dataframe has replaced values in a column based on a string in another column
Question
I have a really large Pandas dataframe in Python with three important columns: 'file', 'comment', and 'number'. It is a list of many different files with assigned id-numbers, but some of these files replace old files and should have the same id-number as the file they replace instead of a separate one. An example is:
import pandas as pd

df_test = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2'],
    'comment': ['none', 'Replacing: file_1', 'none', 'Replacing: file_2', 'none', 'Replacing: file_3'],
    'number': ['12', '12', '15', '16', '18', '18'],
})
What I want is to check whether the 'number' column shows the same id-number for an original file and for the file that replaces it. A replacing file has a comment that starts with 'Replacing: ', and the name of the file it replaces is given at the end of that comment. In this example, I would want something like a list or a new column in the dataframe which reads: True, True, False, False, True, True, since the second and last files have been assigned the same id-number as the files they replace, but the fourth file has not. I can't really figure out how to check this, and any help is appreciated! Thanks!
Answer 1
Score: 1
If the 'none' rows always come directly before the files that replace them, you can replace 'none' with missing values and back-fill them to build a helper Series of groups, then test whether each group contains a single unique number using GroupBy.transform with DataFrameGroupBy.nunique:
s = df_test['comment'].mask(df_test['comment'].eq('none')).bfill()
df_test['test'] = df_test.groupby(s)['number'].transform('nunique').eq(1)
print(df_test)
        file            comment number   test
0     file_1               none     12   True
1  file_1_v2  Replacing: file_1     12   True
2     file_2               none     15  False
3  file_2_v2  Replacing: file_2     16  False
4     file_3               none     18   True
5  file_3_v2  Replacing: file_3     18   True
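To make the grouping easier to follow, here is a minimal sketch that splits the one-liner into its intermediate steps on the question's sample data. The names masked, n_unique and result are only illustrative. It assumes, as above, that every original row is directly followed by its replacement; a 'none' row with no replacement after it would have nothing to borrow from the back-fill, so such rows would need to be checked separately.

```python
import pandas as pd

df_test = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2'],
    'comment': ['none', 'Replacing: file_1', 'none', 'Replacing: file_2', 'none', 'Replacing: file_3'],
    'number': ['12', '12', '15', '16', '18', '18'],
})

# Step 1: turn the 'none' placeholders into missing values (NaN).
masked = df_test['comment'].mask(df_test['comment'].eq('none'))

# Step 2: back-fill, so each original row borrows the comment of the
# replacing row directly below it. Both rows now share one group key.
s = masked.bfill()

# Step 3: count the distinct numbers inside each group and broadcast
# that count back to every row of the group.
n_unique = df_test.groupby(s)['number'].transform('nunique')

# Step 4: a group is consistent when it holds exactly one distinct number.
result = n_unique.eq(1)

print(s.tolist())
print(result.tolist())
```

Printing s shows each 'Replacing: ...' comment repeated on the row above it, which is what lets groupby pair an original with its replacement.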
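Another idea for the groups is to extract the file name from the comment with Series.str.extract (the pattern captures everything after the first whitespace), replace the non-matched values with the row's own file name using Series.fillna, and test uniqueness per group as in the previous solution: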
s = df_test['comment'].str.extract(r'\s(.*)$', expand=False).fillna(df_test['file'])
df_test['test'] = df_test.groupby(s)['number'].transform('nunique').eq(1)
print(df_test)
        file            comment number   test
0     file_1               none     12   True
1  file_1_v2  Replacing: file_1     12   True
2     file_2               none     15  False
3  file_2_v2  Replacing: file_2     16  False
4     file_3               none     18   True
5  file_3_v2  Replacing: file_3     18   True
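Because this second approach builds the group key from the file name itself, it should not depend on row order the way the back-fill does. The sketch below illustrates that on the same sample data after a deliberate reordering; the ordering, the name shuffled and the name key are made up for illustration and are not part of the question.

```python
import pandas as pd

df_test = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2'],
    'comment': ['none', 'Replacing: file_1', 'none', 'Replacing: file_2', 'none', 'Replacing: file_3'],
    'number': ['12', '12', '15', '16', '18', '18'],
})

# Hypothetical ordering: all replacing files first, then all originals,
# so a replacement no longer sits directly below its original.
shuffled = df_test.iloc[[1, 3, 5, 0, 2, 4]].reset_index(drop=True)

# Group key: the file name captured from the comment, or the row's own
# file name when the comment has no whitespace to match ('none').
key = shuffled['comment'].str.extract(r'\s(.*)$', expand=False).fillna(shuffled['file'])

# Same uniqueness test as before, now keyed by file name.
shuffled['test'] = shuffled.groupby(key)['number'].transform('nunique').eq(1)

print(shuffled)
```

Here file_2 and file_2_v2 are three rows apart but still land in the same group, so both are flagged False.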