Regex - removing everything after first word following a comma

Question

I have a column of name variations that I'd like to clean up. I'm having trouble writing a regular expression that removes everything after the first word following a comma.

import re
import pandas as pd

d = {'names': ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']}
x = pd.DataFrame(d)

Tried:

x['names'] = [re.sub(r'/.\s+[^\s,]+/', '', str(x)) for x in x['names']]

Desired output:

['smith,john', 'smith, john', 'brown, bob', 'brown, bob']

I'm not sure why my regex isn't working, but any help would be appreciated.

Answer 1

Score: 1

You could use a regex that captures everything up to the comma, any optional whitespace, and the first word after it, then replaces the whole string with that captured group:

x["names"].str.replace(r"^([^,]*,\s*[^\s]*).*", r"\1", regex=True)

0     smith,john
1    smith, john
2     brown, bob
3     brown, bob
Name: names, dtype: object
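
To see what each piece of that pattern does, here is a minimal sketch that spells it out with re.VERBOSE and applies it to the sample names using only the standard library. The names KEEP_FIRST_WORD and cleaned are made up for illustration; the pattern itself should behave the same when passed to Series.str.replace.

```python
import re

# Verbose form of the pattern from the answer, one piece per line.
# KEEP_FIRST_WORD is a hypothetical name used only in this sketch.
KEEP_FIRST_WORD = re.compile(
    r"""
    ^(          # start of string, open capture group 1
      [^,]*     # everything before the first comma
      ,         # the comma itself
      \s*       # optional whitespace after the comma
      [^\s]*    # the first word after the comma (run of non-whitespace)
    )           # close capture group 1
    .*          # whatever follows, which gets discarded
    """,
    re.VERBOSE,
)

names = ["smith,john s", "smith, john", "brown, bob s", "brown, bob"]

# Replace the whole match with the contents of group 1 only.
cleaned = [KEEP_FIRST_WORD.sub(r"\1", name) for name in names]
print(cleaned)
# ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```

A value with no comma does not match the pattern at all, so it is returned unchanged.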

Answer 2

Score: 0

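Try re.sub(r'(,\s*\w+).*$', r'\1', str(x))

Put the part you want to keep (the comma and the first word after it) into capture group 1, then restore it in the replacement. Python's re module takes the pattern without /.../ delimiters and refers to the group as \1 rather than $1.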
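
As a rough illustration of the capture-group approach in Python, the sketch below runs the pattern both with and without surrounding slashes on the sample names from the question. In Python's re module a leading or trailing / is matched literally, which would also explain why the attempt in the question changed nothing. The variable names with_slashes and trimmed are made up for this example.

```python
import re

names = ["smith,john s", "smith, john", "brown, bob s", "brown, bob"]

# With the slashes left in, the pattern needs a literal "/" in the text,
# so nothing matches and every name comes back unchanged.
with_slashes = [re.sub(r"/(,\s*\w+).*$/", r"\1", n) for n in names]
print(with_slashes == names)  # True

# Without delimiters, and with \1 as the backreference, everything after
# the first word following the comma is dropped.
trimmed = [re.sub(r"(,\s*\w+).*$", r"\1", n) for n in names]
print(trimmed)
# ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```

One thing to keep in mind: \w+ stops at hyphens and apostrophes, so "o'brien, mary-ann k" would become "o'brien, mary". If such names are possible, [^\s,]+ in place of \w+ keeps the full first word.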

Posted by huangapple on February 6, 2023 at 10:27:11
Source: https://go.coder-hub.com/75356848.html