Replace a value in a column in a Pandas dataframe if another column contains a certain string

Question

I have a very long and complicated Pandas dataframe in Python that looks something like this:

import pandas as pd

# Sample data: each file has a comment and a number (stored as strings)
df_test = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_3_v2'],
    'comment': ['none: 5', 'Replacing: file_1', 'old', 'Replacing: file_2', '', 'Replacing: file_3'],
    'number': ['12', '15', '13', '16', '14', '14'],
})

The frame contains certain data files that have a number associated with them. However, there are also updated versions of those files that should have the same number as the old file but some have been assigned a new number instead.

What I want to do is check the 'comment' column for each file. If the comment starts with the string 'Replacing: ' and the value in the 'number' column differs from the 'number' of the file named after 'Replacing: ', the number should be set to that of the original file.

In this example, it means that the 'number' column should be changed to read:

['12', '12', '13', '13', '14', '14']

There are also some exceptions in the dataframe, such as other comments that include a colon, or NaN values, which must be handled as well. I can extract the files that should have the number replaced with the line below, but I'm not sure where to go from there. Any help is appreciated, thanks!

df_test_replace = df_test.loc[df_test.comment.str.startswith('Replacing: ')]

Answer 1

Score: 1

You can use a regex to extract the filename, then map and fillna:

df_test['number'] = (df_test['comment']
 .str.extract('Replacing: (.*)', expand=False)
 .map(df_test.set_index('file')['number'])
 .fillna(df_test['number'])
)

Alternative with indexing:

s = df_test['comment'].str.extract('Replacing: (.*)', expand=False).dropna()

df_test.loc[s.index, 'number'] = s.map(df_test.set_index('file')['number'])

Output:

        file            comment number
0     file_1            none: 5     12
1  file_1_v2  Replacing: file_1     12
2     file_2                old     13
3  file_2_v2  Replacing: file_2     13
4     file_3                        14
5  file_3_v2  Replacing: file_3     14
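
The question also mentions NaN comments and other comments that contain a colon. Below is a minimal sketch of how the first approach might be hardened for those cases. The sample rows with a NaN comment and with a reference to a file that is not in the frame (file_9) are hypothetical and not part of the original data; anchoring the pattern with ^ is an assumption about what "starts with" should mean here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the question's data plus a NaN comment and a
# reference to a file that does not exist in the frame (assumptions).
df = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2', 'file_2_v2', 'file_3', 'file_4_v2'],
    'comment': ['none: 5', 'Replacing: file_1', np.nan,
                'Replacing: file_2', '', 'Replacing: file_9'],
    'number': ['12', '15', '13', '16', '14', '17'],
})

# Anchor the pattern at the start so only comments that begin with
# 'Replacing: ' match; NaN comments and other text yield NaN.
original = df['comment'].str.extract(r'^Replacing: (.*)', expand=False)

# Look up the number of the original file. Rows without a match, and
# references to unknown files, give NaN, which fillna replaces with the
# row's current number.
lookup = df.set_index('file')['number']
df['number'] = original.map(lookup).fillna(df['number'])

print(df['number'].tolist())
# ['12', '12', '13', '13', '14', '17']
```

Note that map with a Series as the lookup needs unique file names in the index; if the real data contains duplicate file names, they would have to be deduplicated first.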

Answer 2

Score: 1

If you want to steer clear of regexes, you can manually parse your strings as well. Here I created a new column new_file to make it easy for you to debug the old/new files.

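# Extract the original file name from comments that start with 'Replacing: '
df_test['new_file'] = (
    df_test.loc[df_test.comment.str.startswith('Replacing: '), 'comment']
    .str.removeprefix('Replacing: ')
)
# Rows without a 'Replacing: ' comment refer to themselves
df_test['new_file'] = df_test['new_file'].fillna(df_test['file'])
# Give every file in a group the smallest number found in that group
df_test['number'] = df_test.groupby('new_file')['number'].transform('min')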

        file            comment number
0     file_1            none: 5     12
1  file_1_v2  Replacing: file_1     12
2     file_2                old     13
3  file_2_v2  Replacing: file_2     13
4     file_3                        14
5  file_3_v2  Replacing: file_3     14
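
One caveat with this approach: 'number' holds strings, so transform('min') compares them lexicographically, and it picks the smallest value in each group rather than the original file's number. The sketch below shows a regex-free variant that instead copies the number from the row whose 'file' equals 'new_file'. The values '9' and '10' and the None comment are hypothetical, chosen only to make the difference visible; str.removeprefix is assumed to be available (it was added in pandas 1.4, as far as I know).

```python
import pandas as pd

# Hypothetical values: '9' and '10' are chosen so that string comparison
# ('10' < '9') differs from numeric comparison.
df = pd.DataFrame({
    'file': ['file_1', 'file_1_v2', 'file_2'],
    'comment': ['old', 'Replacing: file_1', None],
    'number': ['9', '10', '13'],
})

# na=False treats missing comments as "does not start with the prefix".
is_replacement = df['comment'].str.startswith('Replacing: ', na=False)

# new_file is the original file's name for replacement rows,
# otherwise the row's own file name.
df['new_file'] = df['file'].mask(
    is_replacement, df['comment'].str.removeprefix('Replacing: ')
)

# For comparison only: the lexicographic minimum per group.
lexicographic_min = df.groupby('new_file')['number'].transform('min')

# Copy the number from the row whose file equals new_file instead.
lookup = df.set_index('file')['number']
df['number'] = df['new_file'].map(lookup).fillna(df['number'])

print(lexicographic_min.tolist())  # ['10', '10', '13']
print(df['number'].tolist())       # ['9', '9', '13']
```

If the numbers are always numeric, converting the column with pd.to_numeric before taking the minimum would be another option, though it still assumes the original file has the smallest number.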

huangapple
  • Posted on May 10, 2023, 22:16:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76219507.html