Remove a substring occurrence and everything that comes after it

Question

I have a DataFrame with a 'text' column that contains tweets. Each tweet ends with a short URL, and I want to remove that URL from all rows using a regex.

Example string: 'This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://(short URL here; I cannot post it)'

I need a regular expression that deletes 'http' and everything that comes after it. This is what I tried:

archive_clean['text'] = archive_clean['text'].replace('https.', '', regex=True)

Output:

This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 //(the rest of the url)

Answer 1

Score: 0

It should be as simple as adding a star after the period wildcard you already have. The star matches zero or more repetitions of the preceding item (see the Python re docs).

Change the code to

archive_clean['cleaned_text'] = archive_clean['text'].replace('http.*', '', regex=True)

to remove the "http" substring and everything after it.
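To make the difference between the two patterns concrete, here is a minimal sketch using the standard-library re module on a single string instead of a DataFrame column. The short URL is made up for illustration, since the real one was not posted.

```python
import re

# Made-up short URL; the real one was not included in the question.
tweet = "the whole bit. 13/10 https://t.co/abc123"

# 'https.' matches "https" plus exactly one more character (the colon here),
# so the rest of the URL is left behind.
partial = re.sub(r"https.", "", tweet)

# 'http.*' matches "http" plus everything up to the end of the line.
full = re.sub(r"http.*", "", tweet)

print(repr(partial))  # 'the whole bit. 13/10 //t.co/abc123'
print(repr(full))     # 'the whole bit. 13/10 '
```

The same pattern strings can be passed to the pandas replace call with regex=True, which treats them as regular expressions. Note that '.' does not match a newline by default, so a tweet containing line breaks would only be trimmed up to the end of the line holding the URL.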

That being said, with regex there are almost always exceptions.

  • Do you want to strip the whitespace before the "http"? The solution above leaves your example string as "...boops, the whole bit. 13/10 " with a trailing space.

  • Are you going to have some links with no leading "http" at all?

  • Will there ever be another link in the middle of the text that should not be removed? Example:

    "This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. Check her out at https://tillythepup.com. 13/10 https://twitter.com/post/xxxx"

    In this case, change the regex to "http\S*\Z" (written as a raw string in Python, r'http\S*\Z'), which removes a URL only when it is anchored at the end of the string.

    (Note: the URL must run continuously, with no whitespace, right up to the end of the string; otherwise it will not be removed. You can account for trailing whitespace by stripping the column first with .str.strip().)

    Thanks to mozway in the comments for this suggestion.
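One possible way to combine the points above (whitespace before the URL, a link in the middle of the text, trailing whitespace) is sketched below with the standard-library re module. The names TRAILING_URL and strip_trailing_url are hypothetical helpers, not part of pandas or of the original answer, and the leading \s* is an added assumption about wanting the space before the URL removed as well.

```python
import re

# Hypothetical pattern: optional whitespace, then "http" followed by
# non-whitespace characters running to the very end of the string (\Z).
TRAILING_URL = re.compile(r"\s*http\S*\Z")

def strip_trailing_url(text: str) -> str:
    """Remove a final http(s) URL and the whitespace in front of it."""
    # rstrip() first, so trailing spaces or newlines do not stop \Z from matching.
    return TRAILING_URL.sub("", text.rstrip())

middle_link = ("the whole bit. Check her out at https://tillythepup.com. "
               "13/10 https://twitter.com/post/xxxx")

# Only the URL at the end is removed; the one in the middle is kept.
print(strip_trailing_url(middle_link))
```

On a DataFrame column, the same function could be applied row by row, for example with archive_clean['text'].apply(strip_trailing_url), assuming every value in the column is a string.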

There are many ways to handle these edge cases, and you may already have thought of them, but the simple case outlined in your question is fairly straightforward.

Hope this helps!

huangapple
  • Published on June 1, 2023, 08:30:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76378014.html