Regular expression works in Regex101 but not in Jupyter Notebook

Question

import re
with open('names.txt') as f:
    data = f.readlines()

twitter_pattern = re.compile(r"\s{1}[@]\w+")

twitter_match = twitter_pattern.findall(str(data))
print(twitter_match)

names.txt is a list of full names, email addresses, phone numbers and Twitter handles.
\s{1}[@]\w+ should return only the Twitter handles, but it returns an empty list. Everything seems to work fine in Regex101, but not when I run this in Jupyter Notebook.

The content of the file is identical to the data provided in the Regex101 link:

Osterberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Butz, Ryan	ryanb@codingtemple.com	(555) 555-5543	CEO, Coding Temple	@ryanbutz
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Pael, Ripal	ripalp@codingtemple.com	(555) 555-5553	Teacher, Coding Temple	@ripalp

Answer 1

Score: -2

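readlines() reads the file as a list of strings, one per line, and keeps the line break at the end of each line.

The file

Hello
World

gives the list ["Hello\n", "World\n"] (assuming the file ends with a line break).

str(data) is the textual representation (the repr) of that list. In Python that's the text ['Hello\n', 'World\n'], where each \n is no longer a line break but two literal characters: a backslash followed by n. Tabs are escaped the same way, as a backslash followed by t.

In your case that means you get [ and ] as well as a lot of additional quotes and commas. More importantly, assuming the columns in names.txt are separated by tabs (as they appear to be), the tab in front of each Twitter handle becomes a literal \t. There is then no whitespace character directly before the @, so \s has nothing to match and findall() returns an empty list.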
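The effect can be reproduced without the file. The following is a minimal sketch, assuming tab-separated rows like those in the question; the two sample rows and the variable names are illustrative only.

```python
import re

# Two sample rows in the tab-separated layout the question's data appears to use
# (the tab separators are an assumption about names.txt).
lines = [
    "Butz, Ryan\tryanb@codingtemple.com\t(555) 555-5543\tCEO, Coding Temple\t@ryanbutz\n",
    "Doctor, The\tdoctor+companion@tardis.co.uk\t\tTime Lord, Gallifrey\n",
]

pattern = re.compile(r"\s@\w+")

as_repr = str(lines)      # what the question passes to findall()
as_text = "".join(lines)  # the text as it is stored in the file

print(as_repr)                   # tabs show up as backslash + t, not as whitespace
print(pattern.findall(as_repr))  # []
print(pattern.findall(as_text))  # ['\t@ryanbutz']
```

In the repr, the only whitespace left is ordinary spaces, and none of them is directly followed by an @, which is why the first search comes back empty.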

To fix your code, don't read the file as a list and convert it with str(); read it as a single string and pass that to findall() directly.

with open('names.txt') as f:
    data = f.read()              # instead of readlines()
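A self-contained way to try this out is sketched below. It writes three of the question's rows to a temporary names.txt (the temporary directory and the tab separators are assumptions made so the example runs anywhere) and then reads the file back with read().

```python
import os
import re
import tempfile

# Three rows from the question, assumed to be tab-separated.
sample = (
    "Osterberg, Sven-Erik\tgovernor@norrbotten.co.se\t\tGovernor, Norrbotten\t@sverik\n"
    ", Tim\ttim@killerrabbit.com\t\tEnchanter, Killer Rabbit Cave\n"
    "Pael, Ripal\tripalp@codingtemple.com\t(555) 555-5553\tTeacher, Coding Temple\t@ripalp\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Read the whole file as one string; tabs and line breaks stay real whitespace.
    with open(path, encoding="utf-8") as f:
        data = f.read()

twitter_match = re.findall(r"\s@\w+", data)
print(twitter_match)  # ['\t@sverik', '\t@ripalp']
```

The email addresses are not picked up because the @ in them is preceded by a letter rather than by whitespace.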

Also, please don't make your regex more complicated than necessary. \s@\w+ is equivalent but less confusing. Note that either pattern includes the leading whitespace character in each match.
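As a quick check of that equivalence, here is a short sketch on two hypothetical sample lines. The capture-group variant at the end is not part of the original answer; it is just one possible way to return the handle without the whitespace in front of it.

```python
import re

# Hypothetical sample text with a tab before each handle.
text = "CEO, Coding Temple\t@ryanbutz\nExample, Example Co.\t@example\n"

verbose = re.findall(r"\s{1}[@]\w+", text)  # pattern from the question
simple = re.findall(r"\s@\w+", text)        # simplified pattern
print(verbose == simple)  # True
print(simple)             # ['\t@ryanbutz', '\t@example']

# With a capture group, findall() returns only the group, so the tab is dropped.
handles = re.findall(r"\s(@\w+)", text)
print(handles)            # ['@ryanbutz', '@example']
```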

huangapple
  • Posted on June 12, 2023, 01:43:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76451727.html