Python regex to extract hashtag from within larger string
Question
I have a pandas dataframe that contains a column of social media captions. Where hashtags have been used, they appear in the following format: {hashtag|\#|WorldWaterDay}. I want to loop through this column and reformat these hashtag strings in the format #WorldWaterDay.
I am quite rusty on my regex. I can easily find the strings (assuming they all start and end with {}) using ^{.*}$, but I am looking for an efficient use of regex to find and reformat these hashtags. I can find and split on the hashtag, remove the |, then append the hashtag to the hashtag text in several steps, but I was hoping someone could provide some advice on a pure regex solution.
Answer 1
Score: 3
You just need a regex that will match the existing format:
\{hashtag\|\\#\|([^}]+)}
which matches:
\{hashtag\|\\#\| : literally {hashtag|\#|
([^}]+) : one or more non-} characters, captured in group 1
} : a } character
You can then replace that with #\1, which is a # followed by the contents of group 1. In Python:
df['Caption'] = df['Caption'].str.replace(r'\{hashtag\|\\#\|([^}]+)}', r'#\1', regex=True)
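To check the pattern without a dataframe, here is a minimal sketch using only the standard-library re module. The helper name reformat_hashtags and the sample captions are made up for illustration; the pattern and the #\1 replacement are the ones described above.

```python
import re

# Same pattern as in the answer: the literal text "{hashtag|\#|", then the
# tag text (one or more characters that are not "}") captured in group 1,
# then the closing "}".
HASHTAG_PATTERN = re.compile(r'\{hashtag\|\\#\|([^}]+)}')


def reformat_hashtags(caption: str) -> str:
    # Hypothetical helper: rewrite every marker in the caption as "#" + tag.
    return HASHTAG_PATTERN.sub(r'#\1', caption)


# Made-up sample captions (raw strings so the backslash is kept literally).
samples = [
    r"Save water on {hashtag|\#|WorldWaterDay}!",
    r"{hashtag|\#|Ocean} and {hashtag|\#|Rivers} matter",
    "No hashtags here",
]
cleaned = [reformat_hashtags(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Because the pattern is not anchored with ^ and $, it handles several markers in one caption and leaves captions without markers unchanged. The same pattern and replacement string should carry over to the str.replace call with regex=True shown in the answer.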