Proper regex (re) pattern in python

Question

I'm trying to come up with a proper regex pattern (and I am very bad at it) for the strings that I have. Each time I end up with something that only works partly. I'll show the pattern that I made later below, but first, I want to specify what I want to extract out of a text.

Data:

  • Company Fragile9 Closes €9M Series B Funding
  • Appplle21 Receives CAD$17.5K in Equity Financing
  • Cat Raises $10.8 Millions in Series A Funding
  • Sun Raises EUR35M in Funding at a $1 Billion Valuation
  • Japan1337 Announces JPY 1.78 Billion Funding Round

From that data I need to extract only the amount of money a company receives (including $/€ etc., and the currency specification if it's there, like CAD for Canadian dollars).

So, as a result, I expect to get this:

  • €9M
  • CAD$17.5K
  • $10.8 Millions
  • EUR35M
  • JPY 1.78 Billion

The pattern that I use (throw rotten tomatoes at me):

try:
    pattern = '(\bAU|\bUSD|\bUS|\bCHF)*\s*[$\€\£\¥\₣\₹\?]\s*\d*\.?\d*\s*(K|M)*[(B|M)illion]*'
    raises = re.search(pattern, text, re.IGNORECASE) # text – a row of data mentioned above
    raises = raises.group().upper().strip()
    print(raises)
except:
    raises = '???'
    print(raises)

Also, sometimes a pattern that works in an online Python regex editor will not work in the actual script.

Answer 1

Score: 1

Some issues in your regex:

  • The list of currency acronyms (AU USD US CHF) is too limited. It will not match JPY, nor many other acronyms. Maybe allow any word of 2-3 capitals.

  • Not a problem, but there is no need to escape the currency symbols with a backslash.

  • The \? in the currency list is not a currency symbol.

  • The regex always requires a currency symbol, so amounts written with only an acronym (EUR35M, JPY 1.78 Billion) cannot match. Maybe you intended to make the currency symbol optional with \? but then the ? should appear unescaped after the character class, and it should still be possible to have only the symbol without the acronym.

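  • The number part \d*\.?\d* makes every piece optional, including the digits, so it can match an empty string. At least one digit should be required, with only the decimal part optional.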

  • (K|M)* will allow KKKKKKK. You don't want a * here.

  • [(B|M)illion]* is a character class, so it will allow the letters BMilon, a literal pipe and literal parentheses to occur in any order and any number. For example, it will match "in" and "non" and "(BooM)".

  • The previous two mentioned patterns are put in sequence, while they should be mutually exclusive.

  • The regex does not provide for matching the final "s" in "millions".
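The points about (K|M)* and [(B|M)illion]* can be checked in isolation. The sketch below tests those two fragments on their own with re.fullmatch, using the same IGNORECASE flag as the question for the character class. The names char_class, suffix_star, intended and fully_matches are made up for this illustration, and the intended fragment is only one possible reading of what the suffix was meant to be.

```python
import re

# Fragments of the original pattern, tested on their own.
char_class = re.compile(r"[(B|M)illion]*", re.IGNORECASE)
suffix_star = re.compile(r"(K|M)*")

# One possible intended suffix: a single K, or B/M optionally
# followed by "illion" or "illions".
intended = re.compile(r"(?:K|[BM](?:illions?)?)")


def fully_matches(regex, text):
    """Return True if the whole of text is matched by regex."""
    return regex.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["Billion", "in", "non", "(BooM)", "KKKKKKK", "Millions"]:
        print(
            f"{sample!r}: class={fully_matches(char_class, sample)} "
            f"star={fully_matches(suffix_star, sample)} "
            f"intended={fully_matches(intended, sample)}"
        )
```

The character class accepts any mix of its member characters, which is why strings such as "non" pass, while the grouped alternative only accepts a single well-formed suffix.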

Here is a correction:

(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?

On regex101

In Python syntax:

pattern = r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?"
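As a quick check, here is a minimal sketch that runs the corrected pattern over the five sample headlines from the question. The helper extract_amount and the names PATTERN and HEADLINES are made up for this example; the '???' fallback mirrors the question's code, but an explicit None check is used instead of a bare except.

```python
import re

# The corrected pattern, split over two raw-string literals for readability.
# It is compiled without re.IGNORECASE, so acronyms must be upper case.
PATTERN = re.compile(
    r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?"
    r"(?:\s*(?:K|[BM](?:illions?)?)\b)?"
)

HEADLINES = [
    "Company Fragile9 Closes €9M Series B Funding",
    "Appplle21 Receives CAD$17.5K in Equity Financing",
    "Cat Raises $10.8 Millions in Series A Funding",
    "Sun Raises EUR35M in Funding at a $1 Billion Valuation",
    "Japan1337 Announces JPY 1.78 Billion Funding Round",
]


def extract_amount(text, default="???"):
    """Return the first money amount found in text, or default if none."""
    match = PATTERN.search(text)
    return match.group().strip() if match else default


if __name__ == "__main__":
    for headline in HEADLINES:
        print(extract_amount(headline))
```

Because re.search returns the first match, the fourth headline yields EUR35M rather than the $1 Billion valuation. The r prefix also matters: in a regular string literal Python turns \b into a backspace character before the regex engine sees it, which is one plausible reason a pattern can behave differently in a script than in an online tester.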

huangapple
  • Posted on February 14, 2023, 03:31:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75440432.html