英文:
In Python, how to use re.sub() to replace all literal Unicode spaces?
问题
在Python中,当我使用readlines()从文本文件中读取内容时,原本是空格的部分会变成Unicode字符,如下所示。其中,\u2009 在原始文本文件中表示空格。
因此,我正在使用re.sub()来将这些Unicode文字中的空格替换为普通空格。
我的代码如下:
x = "在感染的未经治疗小鼠中观察到所有脂蛋白分数显著增加,与正常对照小鼠相比。用100和250毫克/千克的灵芝提取物治疗与用500毫克/千克的灵芝提取物和CQ治疗相比,显著降低了血清总胆固醇(TC)和低密度胆固醇(LDL-C)含量。"
x = re.sub(r'[\x0b\x0c\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]', " ", x)
我不确定我是否正确?
虽然程序看起来正常,但我不确定,因为我不太了解正则表达式。
英文:
In Python, when I use readlines() to read from a text file, something that was originally a space will become a literal Unicode character, as shown follows. Where \u2009 is a space in the original text file.
So, I'm using re.sub() to replace these Unicode literal spaces with a normal space.
My code is as follows:
x = "Significant increases in all the lipoprotein fractions were observed in infected untreated mice compared with normal control mice. Treatment with 100 and 250\u2009mg/kg G. lucidum extract produced significant reduction in serum total cholesterol (TC) and low-density cholesterol (LDL-C) contents compared with 500\u2009mg/kg G. lucidum and CQ."
x = re.sub(r'[\x0b\x0c\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]', " ", x)
I don't know if I'm right?
Although the program looks normal, I'm not sure because I don't understand regular expressions well enough.
答案1
得分: 2
re.sub("[^\S \t\n\r\f\v]",' ',x)
可以解决问题(基于 docs.python.org: re
— 正则表达式操作):
你知道 []
用于表示一组字符,不在范围内的字符可以通过补集来匹配。如果集合的第一个字符是 '^'
,那么将匹配不在集合中的所有字符。
正则表达式模式 [^\S \t\n\r\f\v]
可以解读为:
^
(U+005E,插入符号) 非 (\S
(非空白字符) 或- (空格) 或
\t
(制表符) 或\n
(换行符 (LF)) 或\r
(回车符 (CR)) 或\f
(换页符 (FF)) 或\v
(纵向制表符)
<BR>)
将外部的非(也就是字符类中的补集 ^
)与德摩根定律结合,等效于“空白字符 除外 [ \t\n\r\f\v]
之外的任何字符”。
在模式中同时包括 \r
和 \n
可以正确处理所有 Unix(LF)、经典 Mac OS(CR)和类似 Windows 的(CR+LF)换行符约定。<BR>
包括一个空格本身(我们不需要将空格翻译为空格)…
\s
<BR>
对于 Unicode (str
) 模式:<BR>
匹配 Unicode 空白字符(包括 [ \t\n\r\f\v]
,以及许多其他字符,例如许多语言中印刷规则规定的不间断空格)…
部分应用了以下答案:Regex – 匹配空白字符但不包括换行符
英文:
re.sub("[^\S \t\n\r\f\v]",' ',x)
should do the trick (based on docs.python.org: re
— Regular expression operations):
You know that []
is used to indicate a set of characters, and characters that are not within a range can be matched by complementing the set. If the first character of the set is '^'
, all the characters that are not in the set will be matched.
The regex pattern [^\S \t\n\r\f\v]
reads as
^
(U+005E, Circumflex Accent) Not (\S
(not a whitespace) or- (Space) or
\t
(Character Tabulation) or\n
(Line Feed (LF)) or\r
(Carriage Return (CR)) or\f
(Form Feed (FF)) or\v
(Line Tabulation)
<BR>)
Distributing the outer not (i.e., the complementing ^
in the character class) with De Morgan's law, this is equivalent to “whitespace except any of [ \t\n\r\f\v]
.”
Including both \r
and \n
in the pattern correctly handles all of Unix (LF), classic Mac OS (CR), and Windows-ish (CR+LF) newline conventions.<BR>
Included a space itself (we do not need translate a space to space)…
\s
<BR>
For Unicode (str
) patterns:<BR>
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v]
, and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages)…
Partially applied the following answer: Regex – Match whitespace but not newlines
答案2
得分: 0
x = " ".join(x.split())
英文:
quick solution:
x = " ".join(x.split())
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论