2023-01-09 16:47:07

In Python, how to use re.sub() to replace all literal Unicode spaces?

Question

In Python, when I use readlines() to read from a text file, characters that were spaces in the original file show up as literal Unicode characters, as shown below, where \u2009 stands for what was a space in the original text file.

So, I'm using re.sub() to replace these Unicode literal spaces with a normal space.

My code is as follows:

import re

x = "Significant increases in all the lipoprotein fractions were observed in infected untreated mice compared with normal control mice. Treatment with 100 and 250\u2009mg/kg G. lucidum extract produced significant reduction in serum total cholesterol (TC) and low-density cholesterol (LDL-C) contents compared with 500\u2009mg/kg G. lucidum and CQ."

x = re.sub(r'[\x0b\x0c\x1c\x1d\x1e\x1f\x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]', " ", x)

Is this correct?

The program appears to work, but I'm not sure, because I don't understand regular expressions well enough.

Answer 1

Score: 2

re.sub(r"[^\S \t\n\r\f\v]", " ", x)

should do the trick (based on docs.python.org: re — Regular expression operations):

You know that [] is used to indicate a set of characters, and characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched.

The regex pattern [^\S \t\n\r\f\v] reads as

^ (U+005E, Circumflex Accent) Not (
\S (not a whitespace character) or
' ' (Space) or
\t (Character Tabulation) or
\n (Line Feed (LF)) or
\r (Carriage Return (CR)) or
\f (Form Feed (FF)) or
\v (Line Tabulation)
)

Distributing the outer not (i.e., the complementing ^ in the character class) with De Morgan's law, this is equivalent to “whitespace except any of [ \t\n\r\f\v].”
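
The equivalence can be checked empirically. The following is a minimal sketch, not part of the original answer: it scans every code point and compares the characters the pattern matches with the characters for which str.isspace() is true, minus the six excluded ones. The names EXCLUDED, PATTERN, matched, and expected are illustrative, and the check assumes that \s in a str pattern agrees with str.isspace(), as it does in CPython.

```python
import re
import sys

# Characters the pattern deliberately leaves alone.
EXCLUDED = set(" \t\n\r\f\v")

# Raw string, so that \S reaches the regex engine unchanged.
PATTERN = re.compile(r"[^\S \t\n\r\f\v]")

# One string containing every code point exactly once.
all_chars = "".join(chr(cp) for cp in range(sys.maxunicode + 1))

# Characters the pattern actually matches.
matched = set(PATTERN.findall(all_chars))

# "Whitespace except any of [ \t\n\r\f\v]", computed without a regex.
expected = {ch for ch in all_chars if ch.isspace()} - EXCLUDED

print(matched == expected)
print(sorted(f"U+{ord(ch):04X}" for ch in matched))
```

The second print lists the matched code points, which makes it easy to compare them with a hand-written character class.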

Including both \r and \n in the excluded set leaves line breaks untouched under all of the Unix (LF), classic Mac OS (CR), and Windows (CR+LF) newline conventions.
The space itself is excluded as well, since there is no need to replace a space with a space…

\s 
For Unicode (str) patterns: 
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages)…

This partially applies the following answer: Regex – Match whitespace but not newlines
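
As an illustration of how this pattern behaves, here is a small sketch. The helper name normalize_spaces and the sample string are hypothetical and not part of the original answer. The pattern is written as a raw string so that \S is passed to the regex engine unchanged.

```python
import re

# Unicode whitespace other than space, tab, LF, CR, FF, and VT.
ODD_SPACE = re.compile(r"[^\S \t\n\r\f\v]")


def normalize_spaces(text: str) -> str:
    """Replace each unusual whitespace character with one ASCII space."""
    return ODD_SPACE.sub(" ", text)


# Thin space (U+2009) and no-break space (U+00A0), plus a newline and a tab.
sample = "100 and 250\u2009mg/kg\u00a0extract\n\tnext line"
cleaned = normalize_spaces(sample)
print(repr(cleaned))  # '100 and 250 mg/kg extract\n\tnext line'
```

The thin space and the no-break space each become an ordinary space, while the newline and the tab are left alone. Compiling the pattern once is a design choice that helps when the function is called for every line returned by readlines().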

Answer 2

Score: 0

Quick solution:

x = " ".join(x.split())
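
Note that this approach does more than swap unusual spaces for ordinary ones. A brief sketch with a made-up sample string shows the difference: str.split() with no arguments splits on runs of any whitespace, so newlines and tabs are replaced too, repeated whitespace collapses to a single space, and leading and trailing whitespace is dropped.

```python
# Leading/trailing spaces, a thin space, two no-break spaces, a newline, a tab.
text = "  250\u2009mg/kg\u00a0\u00a0extract\nnext\tline  "

# split() with no arguments splits on runs of any whitespace and
# discards leading and trailing whitespace.
collapsed = " ".join(text.split())
print(repr(collapsed))  # '250 mg/kg extract next line'
```

This is convenient when each string is a single line of running text, but it is probably not what is wanted if line breaks or indentation must be preserved.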
