How can I fix this regex pattern?

Question

I'm trying to create a regex pattern to capture information from my credit card invoice, which I can only get as a PDF. I copy the text into a text editor and then use the Replace tool in Notepad++ to convert it to CSV.

I'm having a problem with negative values.

Given this piece of text:

16/04 RC GRACAS 2 - 0,02
SAÚDE .RECIFE

16/04 RC GRACAS 2 02/03 45,97
SAÚDE .RECIFE

The text above contains two bill entries. The regex should capture the following groups in the first entry:

"16/04": date
"RC GRACAS 2": description
"-": value sign
"0,02": value
"SAÚDE .RECIFE": categories

As well as the following groups in the second entry:

"16/04": date
"RC GRACAS 2 02/03": description
"": value sign
"45,97": value
"SAÚDE .RECIFE": categories

The current regex I have is this: ^(\d{2}/\d{2})\s+(.*)\s+([-+]?)\s?(\d{1,3}(?:\.\d{3})*(?:,\d+)?)\s+(.*)?

The problem is that in the first purchase the regex doesn't capture the minus sign in its own group; it becomes part of the second group (description).

How can I change this regex to capture that sign in its own group?
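
The behaviour described above can be reproduced outside Notepad++. The following is a minimal sketch using Python's re module (an assumption; the question itself only uses Notepad++), with the first entry typed in with single spaces. The names ORIGINAL and entry are made up for the illustration.

```python
import re

# The pattern from the question, unchanged (split over two lines for width).
ORIGINAL = re.compile(
    r"^(\d{2}/\d{2})\s+(.*)\s+([-+]?)\s?"
    r"(\d{1,3}(?:\.\d{3})*(?:,\d+)?)\s+(.*)?"
)

# First entry from the question; single spaces between fields are assumed.
entry = "16/04 RC GRACAS 2 - 0,02\nSAÚDE .RECIFE"

match = ORIGINAL.search(entry)
description, sign = match.group(2), match.group(3)

# The greedy (.*) backtracks only as far as it has to, so it keeps the "-"
# and the optional sign group is left empty.
print(repr(description), repr(sign))
```

The greedy description group gives back just enough characters for the rest of the pattern to match, and since the sign group is optional, nothing forces the minus sign out of the description.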

Answer 1

Score: 1

. matches any character, including - and +. If your descriptions are guaranteed to contain neither hyphens nor pluses, you can stop the second group from matching those two characters by changing it to ([^-+]*):

^
(\d{2}/\d{2})\s+
([^-+]*)\s+
([-+]?)\s?
(\d{1,3}(?:\.\d{3})*(?:,\d+)?)\s+
(.*)?

Try it on regex101.com.

Alternatively, here's my suggestion:

^                                # Match at the start of line
(?<date>\d{2}/\d{2})             # a date,
(?:                              # a description
  \s+(?<description>.*?)         # consists of at least some spaces
)??                              # (optional, lazily matched)
(?:                              # 
  \s+                            # some other spaces
  (?:(?<value_sign>[-+])\s)?     # then a sign and a space, collectively optional,
  (?<value>                      # followed by a value
    \d{1,3}(?:\.\d{3})*(?:,\d+)? # (which is a number)
  )                              # 
)                                # 
\s*$                             # right before the end of line,
\n                               # after which is a new line
(?<categories>.*)                # containing categories.

Try it on regex101.com.
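
For readers who want to try the named-group idea in a script, here is a rough Python sketch. It is an adaptation, not the answer's exact pattern: Python's re module spells named groups (?P<name>...), the outer non-capturing wrapper is dropped, and the names ENTRY, TEXT and rows are made up for the example. The sample text assumes single spaces between fields.

```python
import re

ENTRY = re.compile(
    r"""
    ^(?P<date>\d{2}/\d{2})                   # date at the start of a line
    (?:\s+(?P<description>.*?))??            # optional, lazily matched description
    \s+(?:(?P<value_sign>[-+])\s)?           # optional sign followed by one space
    (?P<value>\d{1,3}(?:\.\d{3})*(?:,\d+)?)  # amount such as 1.234,56
    \s*$\n                                   # end of the first line
    (?P<categories>.*)                       # second line holds the categories
    """,
    re.VERBOSE | re.MULTILINE,
)

TEXT = (
    "16/04 RC GRACAS 2 - 0,02\n"
    "SAÚDE .RECIFE\n"
    "\n"
    "16/04 RC GRACAS 2 02/03 45,97\n"
    "SAÚDE .RECIFE\n"
)

rows = [
    (
        m.group("date"),
        m.group("description"),
        m.group("value_sign") or "",  # an unmatched group is None in Python
        m.group("value"),
        m.group("categories"),
    )
    for m in ENTRY.finditer(TEXT)
]

for row in rows:
    print(row)
```

Because the sign and its trailing space are optional as a unit, an entry without a sign leaves value_sign unset, which the sketch maps to an empty string to match what the question asks for.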

Answer 2

Score: 1

The issue is that the description group is too general (and matches the next group's pattern as well) and the sign group is optional so it gets captured by the description. What really makes it a problem is that you have another .* group that is optional to the right of the sign group.

You can solve this by making two simple changes to your regular expression. The first is to make the description group lazy (by adding a ? after the *). The second is to add an end-of-line anchor $ to the expression:

^(\d{2}/\d{2})\s+(.*?)\s+([-+]?)\s?(\d{1,3}(?:\.\d{3})*(?:,\d+)?)\s+(.*)?$
                    ^                                                    ^

Making the group lazy keeps the description field from running into the next group, and the end-of-line anchor adds more structure to the expression, which allows the laziness to work.
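
How this pattern behaves appears to depend on what $ means in the tool being used. The sketch below uses Python's re module (an assumption; the question uses Notepad++) to compare the two interpretations on the first entry, typed with single spaces. The names PATTERN, entry, whole_string and per_line are made up for the illustration.

```python
import re

# The pattern from this answer, unchanged (split over two lines for width).
PATTERN = (
    r"^(\d{2}/\d{2})\s+(.*?)\s+([-+]?)\s?"
    r"(\d{1,3}(?:\.\d{3})*(?:,\d+)?)\s+(.*)?$"
)

# A single two-line entry; single spaces between fields are assumed.
entry = "16/04 RC GRACAS 2 - 0,02\nSAÚDE .RECIFE"

# Default flags: $ matches only at the very end of the string, so the lazy
# description has to grow until the category line can be consumed as well.
whole_string = re.search(PATTERN, entry).groups()

# re.MULTILINE: $ also matches at the end of each line, so the match can
# stop on the first line with "2" as the value and "- 0,02" as the last group.
per_line = re.search(PATTERN, entry, re.MULTILINE).groups()

print(whole_string)
print(per_line)
```

In this sketch the groups come out as intended only when $ is anchored to the end of the whole input. If the editor treats $ as end of line, it may be worth matching the line break explicitly instead of relying on \s+ and $.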

Answer 3

Score: 1

Use reluctant quantifier .*?, which matches as few characters as possible, and use [\r\n] to match the new line:

^(\d\d\/\d\d)\s+(.*?)\s+([-+])?\s?(\d{1,3}(?:\.\d{3})*(?:,\d+)?)[\r\n]+(.*)?

See live demo.
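
Since the end goal in the question is a CSV file, here is a small Python sketch that applies this pattern to both entries and writes semicolon-separated rows. Python, the semicolon delimiter, and the names LINE, STATEMENT and to_csv are assumptions made for the example; the sample text uses single spaces between fields.

```python
import csv
import io
import re

# The pattern from this answer, compiled so that ^ matches at every line start.
LINE = re.compile(
    r"^(\d\d\/\d\d)\s+(.*?)\s+([-+])?\s?"
    r"(\d{1,3}(?:\.\d{3})*(?:,\d+)?)[\r\n]+(.*)?",
    re.MULTILINE,
)

STATEMENT = (
    "16/04 RC GRACAS 2 - 0,02\n"
    "SAÚDE .RECIFE\n"
    "\n"
    "16/04 RC GRACAS 2 02/03 45,97\n"
    "SAÚDE .RECIFE\n"
)


def to_csv(text: str) -> str:
    """Return one CSV row per statement entry found in text."""
    out = io.StringIO()
    # ";" as delimiter so that decimal commas need no quoting.
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    for m in LINE.finditer(text):
        # An unmatched optional group is None; write it as an empty field.
        writer.writerow([g or "" for g in m.groups()])
    return out.getvalue()


print(to_csv(STATEMENT))
```

Requiring a line break right after the value is what stops the lazy description from ending too early: a candidate such as "2" followed by a space is rejected, so the description keeps growing until the real amount is reached.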

Answer 4

Score: 1

You can use the following regular expression.

(?x)                        # invoke free-spacing mode
^                           # match beginning of line
(?<date>\d{2}\/\d{2})       # match 2 digits, '/', 2 digits and save to capture
                            # group 'date'
[ ]+                        # match 1 or more spaces, as many as possible

(?<description>[\w ]*\w(?:[ ]+\d{2}\/\d{2})?)
                            # match zero or more word chars or spaces, as many as
                            # possible, followed by a word char, optionally
                            # followed by one or more spaces, 2 digits, '/', 2 digits,
                            # save to capture group 'description'
[ ]+                        # match 1 or more spaces, as many as possible
(?<value_sign>[-+]|)        # match '-', '+' or the empty string, save to capture
                            # group 'value_sign'
[ ]+                        # match 1 or more spaces, as many as possible
(?<value>\d+,\d{2})         # match 1 or more digits, ',', 2 digits, save to capture
                            # group 'value'
\r?\n                       # match a line feed optionally preceded by a carriage
                            # return (for Windows support)
(?<categories>\S.*)         # match a non-whitespace character followed by
                            # zero or more characters other than line
                            # terminators, as many as possible, save to
                            # capture group 'categories'

Demo: https://regex101.com/r/NVI6fy/1

If free-spacing mode were not specified this would be

^(?<date>\d{2}\/\d{2}) +(?<description>[\w ]*\w(?: +\d{2}\/\d{2})?) +(?<value_sign>[-+]|) +(?<value>\d+,\d{2})\r?\n(?<categories>\S.*)

If numbered capture groups were used this would be

^(\d{2}\/\d{2}) +([\w ]*\w(?: +\d{2}\/\d{2})?) +([-+]|) +(\d+,\d{2})\r?\n(\S.*)

huangapple
  • Posted on May 29, 2023 at 03:48:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76353332.html