Regular Expression works in Regex101 but not Jupyter notebook
Question
import re
with open('names.txt') as f:
    data = f.readlines()
twitter_pattern = re.compile(r"\s{1}[@]\w+")
twitter_match = twitter_pattern.findall(str(data))
print(twitter_match)
names.txt is a list of full names, phone numbers and Twitter handles. The pattern \s{1}[@]\w+ should return only the Twitter handles, but it returns an empty list. Everything works fine in Regex101, but not when I run the code in Jupyter Notebook.
The content of the file is identical to the data provided in the Regex101 link:
Osterberg, Sven-Erik governor@norrbotten.co.se Governor, Norrbotten @sverik
, Tim tim@killerrabbit.com Enchanter, Killer Rabbit Cave
Butz, Ryan ryanb@codingtemple.com (555) 555-5543 CEO, Coding Temple @ryanbutz
Doctor, The doctor+companion@tardis.co.uk Time Lord, Gallifrey
Exampleson, Example me@example.com 555-555-5552 Example, Example Co. @example
Pael, Ripal ripalp@codingtemple.com (555) 555-5553 Teacher, Coding Temple @ripalp
Answer 1
Score: -2
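readlines() reads the file as a list of strings, one per line, and each string keeps its trailing line break.
The file
Hello
World
gives the list ['Hello\n', 'World\n'].
str(data) is the textual representation of that list, built from the repr() of each item. Printed, it looks the same as the list above, but each \n in it is no longer a line break: it is two literal characters, a backslash followed by n.
In your case that means the string gains [ and ] as well as many additional quotes and commas, and every whitespace character other than a plain space is replaced by its escape sequence. If the fields in names.txt are separated by tabs (an assumption; the data shown in the question does not make the separator visible), the tab in front of each Twitter handle becomes the two characters \t, so \s no longer finds any whitespace before the @ and findall() returns an empty list.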
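The effect can be reproduced without a file. The sketch below uses a made-up two-line sample (the tab separators are an assumption about how names.txt is laid out) and compares searching the string form of the list with searching the plain text:

```python
import re

# Hypothetical sample: fields separated by tabs, as an assumption
# about how names.txt is laid out.
lines = [
    "Butz, Ryan\tryanb@codingtemple.com\t@ryanbutz\n",
    "Doctor, The\tdoctor+companion@tardis.co.uk\t\n",
]

pattern = re.compile(r"\s@\w+")

as_repr = str(lines)      # repr of the list: each tab becomes backslash + "t"
as_text = "".join(lines)  # the original text, with real tab characters

print(as_repr)
print(pattern.findall(as_repr))  # []
print(pattern.findall(as_text))  # ['\t@ryanbutz']
```

In the string form, the character directly before @ryanbutz is the letter t of the escape sequence, which \s does not match.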
To fix your code, don't read the file as a list of lines; read it as a single string instead, and pass that string to findall().
with open('names.txt') as f:
    data = f.read()  # instead of readlines()
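Put together, the corrected script might look like the sketch below. To stay self-contained it first writes a small stand-in for names.txt into a temporary directory; the sample rows and their tab separators are assumptions, not the real file.

```python
import re
import tempfile
from pathlib import Path

# Stand-in for names.txt (assumed layout: tab-separated fields).
sample = (
    "Osterberg, Sven-Erik\tgovernor@norrbotten.co.se\t\tGovernor, Norrbotten\t@sverik\n"
    "Doctor, The\tdoctor+companion@tardis.co.uk\t\tTime Lord, Gallifrey\t\n"
    "Pael, Ripal\tripalp@codingtemple.com\t(555) 555-5553\tTeacher, Coding Temple\t@ripalp\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "names.txt"
    path.write_text(sample, encoding="utf-8")

    with open(path, encoding="utf-8") as f:
        data = f.read()  # one string, whitespace left intact

twitter_pattern = re.compile(r"\s@\w+")
# strip() drops the leading whitespace character that \s matched
twitter_match = [m.strip() for m in twitter_pattern.findall(data)]
print(twitter_match)  # ['@sverik', '@ripalp']
```

The e-mail addresses are skipped because the character before their @ is a letter, not whitespace.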
Also, don't make your regex more complicated than necessary: \s@\w+ matches exactly the same strings and is less confusing.
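A quick check on a made-up sample string suggests the two spellings behave the same, since {1} repeats the preceding item exactly once and [@] is a character class with a single member:

```python
import re

# Hypothetical sample text, space-separated for readability.
sample = (
    "Butz, Ryan ryanb@codingtemple.com CEO @ryanbutz\n"
    "Pael, Ripal ripalp@codingtemple.com @ripalp\n"
)

verbose = re.findall(r"\s{1}[@]\w+", sample)
simple = re.findall(r"\s@\w+", sample)

print(verbose == simple)  # True
print(simple)             # [' @ryanbutz', ' @ripalp']
```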