Regex - removing everything after first word following a comma
Question
I have a column of name variations that I'd like to clean up. I'm having trouble writing a regex that removes everything after the first word following a comma.
import re
import pandas as pd
d = {'names': ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']}
x = pd.DataFrame(d)
Tried:
x['names'] = [re.sub(r'/.\s+[^\s,]+/','', str(x)) for x in x['names']]
Desired Output:
['smith,john','smith, john', 'brown, bob', 'brown, bob']
Not sure why my regex isn't working, but any help would be appreciated.
Answer 1
Score: 1
You could use a regex that captures everything up to the comma, any whitespace after it, and the first word that follows, then replaces the whole string with just that captured group:
x["names"].str.replace(r"^([^,]*,\s*[^\s]*).*", r"\1", regex=True)
0     smith,john
1    smith, john
2     brown, bob
3     brown, bob
Name: names, dtype: object
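For reference, here is a minimal, self-contained sketch of this approach using the sample data from the question. It assumes a pandas version in which str.replace accepts the regex keyword, and it writes the backreference as \1 so that the captured group is kept.

```python
import pandas as pd

d = {"names": ["smith,john s", "smith, john", "brown, bob s", "brown, bob"]}
x = pd.DataFrame(d)

# Group 1: everything up to the first comma, the comma itself, any
# whitespace after it, and the first run of non-whitespace characters.
# The trailing .* matches whatever is left so it can be discarded.
pattern = r"^([^,]*,\s*[^\s]*).*"

# regex=True is passed explicitly; without it, newer pandas versions
# treat the pattern as a literal string.
x["names"] = x["names"].str.replace(pattern, r"\1", regex=True)

print(x["names"].tolist())
# ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```

Rows that contain no comma do not match the pattern at all, so they are left unchanged.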
Answer 2
Score: 0
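Try re.sub(r'(,\s*\w+).*$', r'\1', str(x)). Note that Python patterns take no / delimiters, and the backreference is written \1 rather than $1.
Capture the part you want to keep (the comma, optional whitespace, and the first word) in group 1, then put it back in the replacement so that only the rest of the string is dropped.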
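As a rough illustration, the sketch below applies this to a plain list with only the standard re module. It also shows why the attempt in the question changes nothing: the slashes are matched as literal characters. The variable names (names, unchanged, cleaned) are made up for this example.

```python
import re

names = ["smith,john s", "smith, john", "brown, bob s", "brown, bob"]

# The attempt from the question: in Python the slashes are ordinary
# characters, and none of the names contain "/", so nothing matches.
unchanged = [re.sub(r"/.\s+[^\s,]+/", "", n) for n in names]

# Capture the comma, optional whitespace, and the first word in group 1,
# match the rest of the string with .*$, and keep only group 1.
cleaned = [re.sub(r"(,\s*\w+).*$", r"\1", n) for n in names]

print(unchanged == names)  # True
print(cleaned)
# ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```

Because \w+ only matches letters, digits, and underscores, a first name such as "mary-ann" would be cut at the hyphen; the [^\s]* form from Answer 1 would keep it whole.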