February 14, 2023, 03:31:07

Proper regex (re) pattern in python

Question


I'm trying to come up with a proper regex pattern (and I am very bad at it) for the strings that I have. Each time I end up with something that only works partly. I'll show the pattern that I made later below, but first, I want to specify what I want to extract out of a text.

Data:

Company Fragile9 Closes €9M Series B Funding
Appplle21 Receives CAD$17.5K in Equity Financing
Cat Raises $10.8 Millions in Series A Funding
Sun Raises EUR35M in Funding at a $1 Billion Valuation
Japan1337 Announces JPY 1.78 Billion Funding Round

From that data I need to extract only the amount of money a company receives (including $/€ etc., and the currency specification if it's there, such as CAD for Canadian dollars).

So, in result, I expect to receive this:

€9M
CAD$17.5K
$10.8 Millions
EUR35M
JPY 1.78 Billion

The pattern that I use (throw rotten tomatoes at me):

try:
    pattern = '(\bAU|\bUSD|\bUS|\bCHF)*\s*[$\€\£\¥\₣\₹\?]\s*\d*\.?\d*\s*(K|M)*[(B|M)illion]*'
    raises = re.search(pattern, text, re.IGNORECASE) # text: a row of the data mentioned above
    raises = raises.group().upper().strip()
    print(raises)
except:
    raises = '???'
    print(raises)

Also, sometimes a pattern that works in an online Python regex editor does not work in the actual script.

Answer 1

Score: 1

Some issues in your regex:

The list of currency acronyms (AU USD US CHF) is too limited. It will not match JPY, nor many other acronyms. Maybe allow any word of 2-3 capitals.
Not a problem, but there is no need to escape the currency symbols with a backslash.
The \? in the currency list is not a currency symbol.
The regex requires a currency symbol in every match, even when a currency acronym is present. Maybe you intended to make the currency symbol optional with \?, but then the ? should appear unescaped after the character class, and there should still be a possibility to have only the symbol and no acronym.
The number part \d*\.?\d* makes every piece optional, so it can match with no digits at all. At least one digit should be required, with only the decimal part optional.
(K|M)* will allow KKKKKKK. You don't want a * here.
[(B|M)illion]* is a character class, so it will allow the letters B, M, i, l, o, n, a literal pipe and literal parentheses to occur in any order and any number of times. For example, it will match "in", "non" and "(BooM)".
The previous two mentioned patterns are put in sequence, while they should be mutually exclusive.
The regex does not provide for matching the final "s" in "millions".
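
To make the points about (K|M)* and [(B|M)illion]* concrete, here is a small sketch that tests those two fragments in isolation. The tighter suffix `(?:K|[BM](?:illions?)?)\b` and the variable names are only illustrative, and `re.IGNORECASE` is applied to the character class because the question's code uses it.

```python
import re

# Flaw 1: (K|M)* repeats freely, so a run of K's is accepted in full.
repeated_k = re.fullmatch(r"(K|M)*", "KKKKKKK") is not None

# Flaw 2: [(B|M)illion]* is a character class, not a group, so any mix of
# the characters ( B | M ) i l o n is accepted (case-insensitively here).
loose = re.compile(r"[(B|M)illion]*", re.IGNORECASE)
loose_results = {s: loose.fullmatch(s) is not None for s in ["in", "non", "(BooM)"]}

# A tighter suffix (illustrative): either K, or B/M optionally followed by
# "illion" or "illions", ending at a word boundary.
strict = re.compile(r"(?:K|[BM](?:illions?)?)\b")
strict_results = {
    s: strict.fullmatch(s) is not None
    for s in ["K", "M", "Billion", "Millions", "KKKKKKK", "non"]
}

print(repeated_k)
print(loose_results)
print(strict_results)
```

The loose fragments accept every sample, while the tighter suffix accepts only the real multipliers and rejects "KKKKKKK" and "non".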

Here is a correction:

(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?

On regex101

In Python syntax:

pattern = r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?"
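
As a rough usage sketch, the pattern can be run against the five sample headlines from the question. The helper name `extract_amount` and the `"???"` fallback (borrowed from the question's code) are assumptions for illustration. The pattern is compiled without `re.IGNORECASE` so that `[A-Z]{2,3}` only matches capital letters. The `r` prefix matters: in an ordinary string literal `\b` is a backspace character rather than a word boundary, which is one plausible reason a pattern can work in an online tester yet fail in a script.

```python
import re

# The corrected pattern, split across raw-string pieces for readability.
PATTERN = re.compile(
    r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])"  # acronym (+ optional symbol), or symbol alone
    r"\s*\d+(?:\.\d+)?"                        # digits with an optional decimal part
    r"(?:\s*(?:K|[BM](?:illions?)?)\b)?"       # optional K / M / B / Million(s) / Billion(s)
)

HEADLINES = [
    "Company Fragile9 Closes €9M Series B Funding",
    "Appplle21 Receives CAD$17.5K in Equity Financing",
    "Cat Raises $10.8 Millions in Series A Funding",
    "Sun Raises EUR35M in Funding at a $1 Billion Valuation",
    "Japan1337 Announces JPY 1.78 Billion Funding Round",
]


def extract_amount(text, default="???"):
    """Return the first amount found in text, or default if there is none."""
    match = PATTERN.search(text)
    return match.group() if match else default


amounts = [extract_amount(headline) for headline in HEADLINES]
for amount in amounts:
    print(amount)
```

Checking the match for `None` explicitly avoids the bare `except` in the question, which would also hide unrelated errors such as a misspelled variable name.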
