May 10, 2023 18:19:40

How to correctly determine if a Pandas dataframe has replaced values in a column based on a string in another column

Question

I have a very large Pandas dataframe in Python with three important columns: 'file', 'comment', and 'number'. It lists many different files with assigned id-numbers, but some of these files replace old files and should have the same id-number as the file they replace instead of a separate one. An example is:

import pandas as pd

df_test = pd.DataFrame(data=None, columns=['file', 'comment', 'number'])
df_test.file = ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2']
df_test.comment = ['none', 'Replacing: file_1', 'none', 'Replacing: file_2', 'none', 'Replacing: file_3']
df_test.number = ['12', '12', '15', '16', '18', '18']

What I want is to check whether the 'number' column shows the same id-number for an original file and for the file that replaces it. A replacing file has a comment that starts with 'Replacing: ', and the file it replaces is named at the end of that comment. In this example, I would want something like a list or a new column in the dataframe which reads: 'True', 'True', 'False', 'False', 'True', 'True', since the second and last files have been assigned the same id-number as the file they are replacing, but the fourth file has not. I can't really figure out how to check it and any help is appreciated! Thanks!

Answer 1

Score: 1

If each 'none' row always comes directly before the row of the file that replaces it, you can replace 'none' with missing values and back-fill them to build a helper Series of group keys, then test whether each group contains a single unique number with GroupBy.transform and DataFrameGroupBy.nunique:

s = df_test['comment'].mask(df_test['comment'].eq('none')).bfill()
df_test['test'] = df_test.groupby(s)['number'].transform('nunique').eq(1)
print(df_test)
        file            comment number   test
0     file_1               none     12   True
1  file_1_v2  Replacing: file_1     12   True
2     file_2               none     15  False
3  file_2_v2  Replacing: file_2     16  False
4     file_3               none     18   True
5  file_3_v2  Replacing: file_3     18   True
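
To make the grouping step easier to follow, here is a minimal sketch that rebuilds the example data and keeps the intermediate values visible. The names `masked`, `group_key`, `distinct_numbers` and `same_number` are only illustrative and do not appear in the answer above.

```python
import pandas as pd

df_test = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2'],
    'comment': ['none', 'Replacing: file_1', 'none',
                'Replacing: file_2', 'none', 'Replacing: file_3'],
    'number': ['12', '12', '15', '16', '18', '18'],
})

# Turn 'none' into a missing value, then pull the next non-missing comment
# upwards, so each original row shares a key with the row that replaces it.
masked = df_test['comment'].mask(df_test['comment'].eq('none'))
group_key = masked.bfill()

# Count the distinct numbers in each group and broadcast the count to every row.
distinct_numbers = df_test.groupby(group_key)['number'].transform('nunique')

# A group is consistent when it holds exactly one distinct number.
same_number = distinct_numbers.eq(1)

print(pd.DataFrame({
    'group_key': group_key,
    'distinct_numbers': distinct_numbers,
    'same_number': same_number,
}))
```

Note that this relies on the row order: the back-fill only pairs an original with its replacement when the replacement is the next row that carries a comment.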

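Another idea for building the groups is to extract the file name that follows the first whitespace character in the comment with Series.str.extract, fill the rows that do not match with the row's own file name using Series.fillna, and test uniqueness per group as in the previous solution: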

s = df_test['comment'].str.extract(r'\s(.*)$', expand=False).fillna(df_test['file'])
df_test['test'] = df_test.groupby(s)['number'].transform('nunique').eq(1)
print(df_test)
        file            comment number   test
0     file_1               none     12   True
1  file_1_v2  Replacing: file_1     12   True
2     file_2               none     15  False
3  file_2_v2  Replacing: file_2     16  False
4     file_3               none     18   True
5  file_3_v2  Replacing: file_3     18   True
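
One practical difference between the two approaches is how they react to row order. The sketch below uses a hypothetical frame, `df_shuffled` (not part of the question), in which the replacement rows do not directly follow their originals, and runs both groupings on it. All variable names here are illustrative.

```python
import pandas as pd

# Hypothetical data: replacement rows are not adjacent to the originals.
df_shuffled = pd.DataFrame({
    'file': ['file_1', 'file_2', 'file_2_v2', 'file_1_v2'],
    'comment': ['none', 'none', 'Replacing: file_2', 'Replacing: file_1'],
    'number': ['12', '15', '16', '12'],
})

# Extract-based key: the replaced file's name when the comment names one,
# otherwise the row's own file name ('none' contains no whitespace, so it
# does not match and is filled from the 'file' column).
key_extract = (
    df_shuffled['comment']
    .str.extract(r'\s(.*)$', expand=False)
    .fillna(df_shuffled['file'])
)
by_extract = df_shuffled.groupby(key_extract)['number'].transform('nunique').eq(1)

# Back-fill key: depends on each original sitting right before its replacement.
key_bfill = df_shuffled['comment'].mask(df_shuffled['comment'].eq('none')).bfill()
by_bfill = df_shuffled.groupby(key_bfill)['number'].transform('nunique').eq(1)

print(by_extract.tolist())  # [True, False, False, True]
print(by_bfill.tolist())    # [False, False, False, True]
```

On this reordered data the extract-based key still pairs file_1 with file_1_v2, while the back-fill key puts file_1 in the same group as the file_2 rows, so the first row is flagged incorrectly.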
