Replace a value in a column in a Pandas dataframe if another column contains a certain string
Question
I have a very long and complicated Pandas dataframe in Python that looks something like this:
import pandas as pd

df_test = pd.DataFrame(data = None, columns = ['file','comment','number'])
df_test.file = ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2']
df_test.comment = ['none: 5', 'Replacing: file_1', 'old', 'Replacing: file_2', '', 'Replacing: file_3']
df_test.number = ['12', '15', '13', '16', '14', '14']
The frame contains certain data files that have a number associated with them. However, there are also updated versions of those files that should have the same number as the old file but some have been assigned a new number instead.
What I want to do is check the 'comment' column for each file. If it starts with the string 'Replacing: ' and the value in the 'number' column differs from the 'number' of the file named after 'Replacing: ', the number should be set to that of the original file.
In this example, it means that the 'number' column should be changed to read:
['12', '12', '13', '13', '14', '14']
There are also some exceptions in the dataframe, such as other comments that include a colon, or NaN values, which must be handled as well. I can extract the files that should have the number replaced with the line below, but I'm not sure where to go from there. Any help is appreciated, thanks!
df_test_replace = df_test.loc[df_test.comment.str.startswith('Replacing: ')]
Answer 1
Score: 1
You can use a regex to extract the filename, then map and fillna:
df_test['number'] = (df_test['comment']
                     .str.extract('Replacing: (.*)', expand=False)
                     .map(df_test.set_index('file')['number'])
                     .fillna(df_test['number'])
                    )
Alternative with indexing:
s = df_test['comment'].str.extract('Replacing: (.*)', expand=False).dropna()
df_test.loc[s.index, 'number'] = s.map(df_test.set_index('file')['number'])
Output:
        file            comment number
0     file_1            none: 5     12
1  file_1_v2  Replacing: file_1     12
2     file_2                old     13
3  file_2_v2  Replacing: file_2     13
4     file_3                        14
5  file_3_v2  Replacing: file_3     14
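For readers who want to try the map/fillna approach end to end, here is a minimal self-contained sketch. It makes a few assumptions that are not in the original post: the frame is rebuilt under the name df, an extra row file_4 with a NaN comment is added to exercise the missing-value case, and the pattern is anchored with ^ so that only comments that start with the prefix match. It also assumes file names are unique, since mapping through a Series with a duplicated index would fail.

```python
import numpy as np
import pandas as pd

# Same data as the question, plus a hypothetical 'file_4' row with a NaN comment.
df = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2',
             'file_3', 'file_3_v2', 'file_4'],
    'comment': ['none: 5', 'Replacing: file_1', 'old', 'Replacing: file_2',
                '', 'Replacing: file_3', np.nan],
    'number': ['12', '15', '13', '16', '14', '14', '17'],
})

# Name of the replaced file, or NaN when the comment does not start with the prefix.
originals = df['comment'].str.extract(r'^Replacing: (.*)', expand=False)

# Lookup table from file name to its number (assumes unique file names).
lookup = df.set_index('file')['number']

# Take the original file's number where there is one, else keep the current number.
df['number'] = originals.map(lookup).fillna(df['number'])

print(df['number'].tolist())
# ['12', '12', '13', '13', '14', '14', '17']
```

Rows whose comment is NaN, empty, or contains a colon without the prefix simply yield NaN from str.extract, so fillna leaves their numbers untouched.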
Answer 2
Score: 1
If you want to steer clear of regexes, you can manually parse your strings as well. Here I created a new column new_file to make it easy for you to debug the old/new files.
# na=False treats missing (NaN) comments as non-matching
df_test['new_file'] = (df_test.loc[df_test.comment.str.startswith('Replacing: ', na=False),
                                   'comment']
                       .str.removeprefix('Replacing: '))
df_test['new_file'] = df_test['new_file'].fillna(df_test['file'])
df_test['number'] = df_test.groupby('new_file')['number'].transform('min')
        file            comment number
0     file_1            none: 5     12
1  file_1_v2  Replacing: file_1     12
2     file_2                old     13
3  file_2_v2  Replacing: file_2     13
4     file_3                        14
5  file_3_v2  Replacing: file_3     14
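One caveat worth checking before relying on transform('min'): the numbers in the question are strings, so 'min' compares them lexicographically, and it only gives the intended result when the original file happens to hold the smallest value. The sketch below uses made-up numbers ('9', '10', '13', '8') under the name df, which are not from the original post, to show the difference, and then takes the number from the original row instead. That second variant is a suggested alternative, not part of the answer above.

```python
import pandas as pd

# Hypothetical numbers chosen to expose the string-comparison issue.
df = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2'],
    'comment': ['none: 5', 'Replacing: file_1', 'old', 'Replacing: file_2'],
    'number': ['9', '10', '13', '8'],
})

# na=False treats missing comments as non-matching.
is_replacement = df['comment'].str.startswith('Replacing: ', na=False)

# Name of the original file for every row (its own name if it replaces nothing).
df['new_file'] = (df.loc[is_replacement, 'comment']
                  .str.removeprefix('Replacing: '))
df['new_file'] = df['new_file'].fillna(df['file'])

# String 'min' is lexicographic, so '10' sorts before '9'.
by_min = df.groupby('new_file')['number'].transform('min')

# Alternative: look up the number stored on the original (non-replacement) row.
original_numbers = df.loc[~is_replacement].set_index('file')['number']
by_original = df['new_file'].map(original_numbers)

print(by_min.tolist())       # ['10', '10', '13', '13']
print(by_original.tolist())  # ['9', '9', '13', '13']
```

If the original file is missing from the frame, the lookup variant returns NaN for that row, which can be filled from the existing number column with fillna.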