2023-05-17 18:22:46

Python regex to extract hashtag from within larger string

Question

I have a pandas DataFrame with a column of social media captions. Wherever a hashtag was used, it appears in the format {hashtag|\#|WorldWaterDay}. I want to go through this column and reformat these hashtag strings as #WorldWaterDay.

I am quite rusty on regex. I can easily find the strings (assuming they all start with { and end with }) using ^{.*}$, but I am looking for an efficient regex to find and reformat these hashtags. I could do it in several steps (find the marker, split on |, drop the separators, then prepend # to the tag text), but I was hoping someone could suggest a pure regex solution.

Answer 1

Score: 3

You just need a regex that matches the existing format:

\{hashtag\|\\#\|([^}]+)}

which matches:

\{hashtag\|\\#\| : the literal text {hashtag|\#|
([^}]+) : one or more non-} characters, captured in group 1
} : a } character

You can then replace each match with #\1. In Python:

df['Caption'] = df['Caption'].str.replace(r'\{hashtag\|\\#\|([^}]+)}', r'#\1', regex=True)
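
To sanity-check the pattern outside pandas, here is a minimal sketch that uses only the standard library's re module. The helper name reformat_hashtags and the sample captions are made up for illustration and are not part of the original question or answer.

```python
import re

# Pattern from the answer: the literal text "{hashtag|\#|", then the tag
# text (one or more non-"}" characters, captured in group 1), then "}".
HASHTAG_PATTERN = re.compile(r'\{hashtag\|\\#\|([^}]+)}')


def reformat_hashtags(caption: str) -> str:
    # Hypothetical helper: rewrite every hashtag marker in the caption
    # as "#" followed by the captured tag text.
    return HASHTAG_PATTERN.sub(r'#\1', caption)


# Hypothetical sample captions (assumed data, for illustration only).
captions = [
    r'Save water today {hashtag|\#|WorldWaterDay}',
    r'{hashtag|\#|Ocean} and {hashtag|\#|Rivers} matter',
    'No hashtags here',
]

for caption in captions:
    print(reformat_hashtags(caption))
```

Because the pattern is not anchored with ^ and $, it also handles markers embedded in a longer caption, and several markers in one caption. The same pattern and replacement strings can be passed to the pandas Series.str.replace method with regex=True.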
