Python regex to extract hashtag from within larger string

Question

I have a pandas DataFrame with a column of social media captions. Where hashtags have been used, they appear in the format {hashtag|\#|WorldWaterDay}. I want to loop through this column and reformat these hashtag strings as #WorldWaterDay.

I am quite rusty on regex. I can easily find the strings (assuming they all start and end with {}) using ^{.*}$, but I am looking for an efficient way to use regex to find and reformat these hashtags. I could find each hashtag, split on it, remove the | characters, and then prepend the # to the hashtag text in several steps, but I was hoping someone could suggest a pure regex solution.

Answer 1

Score: 3

You just need a regex that matches the existing format:

\{hashtag\|\\#\|([^}]+)}

which matches:

  • \{hashtag\|\\#\| : the literal text {hashtag|\#|
  • ([^}]+) : one or more non-} characters, captured in group 1
  • } : a closing } character

You can then replace each match with #\1, that is, a # followed by the contents of group 1. In Python:

df['Caption'] = df['Caption'].str.replace(r'\{hashtag\|\\#\|([^}]+)}', r'#\1', regex=True)
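To try the substitution outside a DataFrame, here is a minimal sketch using only the standard `re` module. The helper name `reformat_hashtags` and the sample captions are made up for illustration; the pattern itself is the one from the answer, with a `\1` backreference in the replacement.

```python
import re

# Pattern from the answer: literal "{hashtag|\#|", then the tag text
# (one or more non-"}" characters) captured in group 1, then "}".
HASHTAG_PATTERN = re.compile(r"\{hashtag\|\\#\|([^}]+)\}")


def reformat_hashtags(caption: str) -> str:
    """Rewrite every {hashtag|\\#|Tag} occurrence in caption as #Tag.

    Hypothetical helper for illustration; not part of pandas or re.
    """
    # r"#\1" inserts a "#" followed by whatever group 1 captured.
    return HASHTAG_PATTERN.sub(r"#\1", caption)


# Made-up sample captions (raw strings keep the backslash literal).
samples = [
    r"Happy {hashtag|\#|WorldWaterDay} everyone!",
    r"{hashtag|\#|Water} and {hashtag|\#|Climate} in one caption",
    "No hashtags here",
]

for caption in samples:
    print(reformat_hashtags(caption))
```

Because the pattern is not anchored with ^ and $, it rewrites every hashtag inside a longer caption and leaves captions without hashtags unchanged. The same pattern and replacement string should behave the same way when passed to the pandas `str.replace` call with `regex=True`.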

huangapple
  • Posted on May 17, 2023 at 18:22:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76271063.html