May 10, 2023 22:16:35

Replace a value in a column in a Pandas dataframe if another column contains a certain string

Question


I have a very long and complicated Pandas dataframe in Python that looks something like this:

import pandas as pd

df_test = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2'],
    'comment': ['none: 5', 'Replacing: file_1', 'old', 'Replacing: file_2', '', 'Replacing: file_3'],
    'number': ['12', '15', '13', '16', '14', '14'],
})

The frame contains certain data files that have a number associated with them. However, there are also updated versions of those files that should have the same number as the old file but some have been assigned a new number instead.

What I want to do is check the 'comment' column for each file. If the comment starts with the string 'Replacing: ' and the value in the 'number' column differs from the 'number' of the file named after 'Replacing: ', the number should be set to that of the original file.

In this example, it means that the 'number' column should be changed to read:

['12', '12', '13', '13', '14', '14']

There are also some exceptions in the dataframe, such as other comments that include a colon, or NaN values, which must be handled as well. I can extract the files that should have the number replaced with the line below, but I'm not sure where to go from there. Any help is appreciated, thanks!

df_test_replace = df_test.loc[df_test.comment.str.startswith('Replacing: ')]

Answer 1

Score: 1


You can use a regex to extract the filename, then map and fillna:

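df_test['number'] = (df_test['comment']
 # file name after the prefix, NaN where the comment does not match
 .str.extract('Replacing: (.*)', expand=False)
 # look up the number of the replaced file
 .map(df_test.set_index('file')['number'])
 # keep the current number for all other rows
 .fillna(df_test['number'])
)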

Alternative with indexing:

s = df_test['comment'].str.extract('Replacing: (.*)', expand=False).dropna()
df_test.loc[s.index, 'number'] = s.map(df_test.set_index('file')['number'])

Output:

        file            comment number
0     file_1            none: 5     12
1  file_1_v2  Replacing: file_1     12
2     file_2                old     13
3  file_2_v2  Replacing: file_2     13
4     file_3                        14
5  file_3_v2  Replacing: file_3     14
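
The question also mentions NaN comments and other comments containing a colon. The following is a minimal sketch of how the extract/map/fillna approach behaves in those cases. The sample rows are hypothetical and not from the original data, and the `^` anchor is an addition here so that only comments that start with the prefix match, as the question asks.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the question's layout plus a NaN comment,
# a comment that mentions the prefix mid-string, and an unknown file.
df = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_4'],
    'comment': ['none: 5', 'Replacing: file_1', np.nan, 'Replacing: file_2',
                'see Replacing: file_1', 'Replacing: file_9'],
    'number': ['12', '15', '13', '16', '14', '17'],
})

# Anchored pattern: NaN comments and non-matching comments both give NaN.
replaced = df['comment'].str.extract(r'^Replacing: (.*)', expand=False)

# Look up the number of the replaced file; unknown files map to NaN.
lookup = df.set_index('file')['number']

# Rows without a usable lookup keep their current number.
df['number'] = replaced.map(lookup).fillna(df['number'])

print(df['number'].tolist())
```

Note that `map` with a Series requires the 'file' values to be unique, so duplicated file names would need to be resolved first.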

Answer 2

Score: 1


If you want to steer clear of regexes, you can manually parse your strings as well. Here I created a new column new_file to make it easy for you to debug the old/new files.

# na=False keeps NaN comments out of the boolean mask
df_test['new_file'] = (
    df_test.loc[df_test.comment.str.startswith('Replacing: ', na=False), 'comment']
    .str.removeprefix('Replacing: ')
)
# Files that replace nothing form their own group
df_test['new_file'] = df_test['new_file'].fillna(df_test['file'])
# Every file in a group gets the group's smallest number
df_test['number'] = df_test.groupby('new_file')['number'].transform('min')

        file            comment number
0     file_1            none: 5     12
1  file_1_v2  Replacing: file_1     12
2     file_2                old     13
3  file_2_v2  Replacing: file_2     13
4     file_3                        14
5  file_3_v2  Replacing: file_3     14
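
One caveat with `transform('min')`: it only gives the intended result when the original file holds the smallest number in its group. Below is a regex-free sketch that instead copies the number from the original file's own row. The sample numbers are hypothetical, chosen so that the updated file has the smaller number, and `Series.str.removeprefix` is assumed to be available (pandas 1.4 or later).

```python
import pandas as pd

df = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2'],
    'comment': ['none: 5', 'Replacing: file_1', None, 'Replacing: file_2'],
    # Hypothetical numbers: file_1_v2 got a smaller number than file_1,
    # so taking the group minimum would pick '12' instead of '15'.
    'number': ['15', '12', '13', '16'],
})

prefix = 'Replacing: '

# na=False treats missing comments as "does not start with the prefix".
is_replacement = df['comment'].str.startswith(prefix, na=False)

# Original file name for replacement rows, the row's own name otherwise.
original = df['file'].where(~is_replacement,
                            df['comment'].str.removeprefix(prefix))

# Copy the number from the original file's row; keep the current
# number if the original file is not present in the frame.
lookup = df.set_index('file')['number']
df['number'] = original.map(lookup).fillna(df['number'])

print(df['number'].tolist())
```

The design choice here is to treat the original file's row as the source of truth rather than relying on the ordering of the numbers.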
