Proper regex (re) pattern in python
Question
I'm trying to come up with a proper regex pattern (and I am very bad at it) for the strings that I have. Each time I end up with something that only works partly. I'll show the pattern that I made later below, but first, I want to specify what I want to extract out of a text.
Data:
- Company Fragile9 Closes €9M Series B Funding
- Appplle21 Receives CAD$17.5K in Equity Financing
- Cat Raises $10.8 Millions in Series A Funding
- Sun Raises EUR35M in Funding at a $1 Billion Valuation
- Japan1337 Announces JPY 1.78 Billion Funding Round
From that data I need to extract only the amount of money a company receives (including $/€ etc., and the currency code if it's there, such as CAD for Canadian dollars).
So, as a result, I expect to get this:
- €9M
- CAD$17.5K
- $10.8 Millions
- EUR35M
- JPY 1.78 Billion
The pattern that I use (throw rotten tomatoes at me):
try:
    pattern = '(\bAU|\bUSD|\bUS|\bCHF)*\s*[$\€\£\¥\₣\₹\?]\s*\d*\.?\d*\s*(K|M)*[(B|M)illion]*'
    raises = re.search(pattern, text, re.IGNORECASE) # text – a row of data mentioned above
    raises = raises.group().upper().strip()
    print(raises)
except:
    raises = '???'
    print(raises)
Also, sometimes a pattern that works in an online Python regex editor does not work in the actual script.
Answer 1
Score: 1
Some issues in your regex:
- The list of currency acronyms (AU USD US CHF) is too limited. It will not match JPY, nor many other acronyms. Maybe allow any word of 2-3 capitals.
- Not a problem, but there is no need to escape the currency symbols with a backslash.
- The \? in the currency list is not a currency symbol.
- The regex always requires a currency symbol, so amounts written with only an acronym (EUR35M, JPY 1.78 Billion) cannot match. Maybe you intended to make the currency symbol optional with \?, but then the ? should appear unescaped after the character class, and it should still be possible to have only the symbol without the acronym.
- The number part \d*\.?\d* is optional in its entirety, so the regex can match without any digits. Require at least one digit and make only the decimal part optional.
- (K|M)* will allow KKKKKKK. You don't want a * here.
- [(B|M)illion]* is a character class, so it will allow the letters B, M, i, l, o, n, a literal pipe and literal parentheses to occur in any order and any number. For example, it will match "in" and "non" and "(BooM)".
- The previous two patterns are put in sequence, while they should be mutually exclusive.
- The regex does not provide for matching the final "s" in "millions".
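The quantifier and character-class points above can be checked directly with re.fullmatch. The following is only a small sketch: the variable names are illustrative, and re.IGNORECASE mirrors the flag used in the question.

```python
import re

# Character class from the question: it is a set of single characters,
# not an alternation between "Billion" and "Million".
char_class = re.compile(r"[(B|M)illion]*", re.IGNORECASE)

# Strings that should not count as a magnitude suffix, yet match in full.
for junk in ["in", "non", "(BooM)", "|||"]:
    assert char_class.fullmatch(junk) is not None

# (K|M)* repeats without limit, so a run of K's is accepted as well.
repeated = re.compile(r"(K|M)*")
assert repeated.fullmatch("KKKKKKK") is not None

# A group with explicit alternatives accepts only the intended suffixes.
suffix = re.compile(r"(?:K|[BM](?:illions?)?)")
candidates = ["K", "M", "B", "Million", "Billions", "non", "KKK"]
accepted = [s for s in candidates if suffix.fullmatch(s)]
print(accepted)
```

Only the five intended suffixes survive the last check; "non" and "KKK" are rejected because the group spells out each alternative instead of listing loose characters.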
Here is a correction:
(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?
On regex101
In Python syntax:
pattern = r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?"
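To check the corrected pattern against the five sample headlines from the question, a minimal sketch follows. The function name extract_amount and the '???' fallback (borrowed from the question's code) are illustrative choices, not part of the answer.

```python
import re

# Corrected pattern from the answer. The r prefix keeps \b as a regex word
# boundary; in a plain string literal "\b" is a backspace character.
PATTERN = re.compile(
    r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?"
    r"(?:\s*(?:K|[BM](?:illions?)?)\b)?"
)


def extract_amount(text):
    """Return the first money amount found in text, or '???' if none."""
    match = PATTERN.search(text)
    return match.group() if match else "???"


headlines = [
    "Company Fragile9 Closes €9M Series B Funding",
    "Appplle21 Receives CAD$17.5K in Equity Financing",
    "Cat Raises $10.8 Millions in Series A Funding",
    "Sun Raises EUR35M in Funding at a $1 Billion Valuation",
    "Japan1337 Announces JPY 1.78 Billion Funding Round",
]

results = [extract_amount(line) for line in headlines]
for amount in results:
    print(amount)

# Why the raw string matters: without it, "\b" is one backspace character.
assert "\b" == "\x08" and r"\b" == "\\b"
```

Because re.search returns the leftmost match, the "$1 Billion" valuation in the fourth headline is not reported; EUR35M comes first. A pattern written as a plain string instead of a raw string is one possible reason, though not one confirmed here, for a pattern behaving differently in a script than in an online editor.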