Regex statement to match only parts of a string for comparison - Python

Question

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.

I am reading each file into a list and manipulating the lists based on different regex patterns I have created. Everything works, except when it comes to this type of value:

V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24

In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:' can vary and doesn't matter for the comparison, as long as the first and last parts of the string match.

This is the Regex I am using:

re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")

Once I filter down the list, I use the following function to return the difference between the two sets:

funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))

Is there a way to take the list of differences and filter out the ones that have matching values after the ':', as mentioned above?

I apologize if this is an obvious answer, I'm new to Python and regex!

The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.

Answer 1

Score: 0

Before discussing your question, regex101 is an incredibly useful tool for this type of thing.

Your problem stems from two things:

1.) The way you used .*

2.) Greedy vs. Nongreedy matches

.* kinda sucks

.* is a pattern that is very rarely what you actually want.
> As a quick aside, a useful alternative is [^c]* or [^c]+. These expressions match any character except the letter c, with the first matching 0 or more and the second matching 1 or more.
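As a rough illustration of that aside, here is a minimal sketch that uses a negated character class with ':' in place of c, applied to one of the sample values from the question. The variable names are made up for the example.

```python
import re

# Sample value taken from the question.
value = r"V-1\ZDS\R\EMBO-20-1:24"

# [^:]+ matches one or more characters that are not a colon, so it stops
# at the first ':' instead of running to the end of the string like .* would.
before_colon = re.match(r"[^:]+", value).group(0)

# The same idea, anchored to the end of the string, isolates the trailing value.
after_colon = re.search(r":([^:]+)$", value).group(1)

print(before_colon)  # V-1\ZDS\R\EMBO-20-1
print(after_colon)   # 24
```

Because the class excludes the delimiter, there is nothing for the engine to backtrack over, which tends to make the pattern easier to reason about than .*.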

.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.

> Another quick aside: it's likely that you are mixing up re.match and re.search. match only returns a match that begins at the start of the string, while search returns a match anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
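In Python's re module, the function that scans the whole string is called search (there is no re.find). The short sketch below shows the difference on one of the question's sample values; the shortened pattern is only for demonstration.

```python
import re

pattern = re.compile(r"EMBO-20-\d")
value = r"V-1\ZDS\R\EMBO-20-1:24"

# match() only succeeds if the pattern matches starting at position 0.
print(pattern.match(value))             # None

# search() scans forward and returns the first match anywhere in the string.
print(pattern.search(value).group(0))   # EMBO-20-1

# fullmatch() requires the pattern to cover the entire string.
print(pattern.fullmatch(value))         # None
```

With search, a leading .* is no longer needed just to reach text in the middle of the string.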

Lookbehind Expressions

There are more complete explanations online, but in short, regex patterns like:

(?<=test)foo

will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:

foo
test-foo
test foo

but the following string will match:

testfoo

Note that the match itself contains only the text foo; the lookbehind checks for test but does not include it in the match.
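The four strings above can be checked directly. This is a small sketch using re.search; the results dictionary is just a convenient way to collect the outcome.

```python
import re

pattern = re.compile(r"(?<=test)foo")

results = {}
for text in ["foo", "test-foo", "test foo", "testfoo"]:
    m = pattern.search(text)
    # Store the matched text, or None when the lookbehind condition fails.
    results[text] = m.group(0) if m else None

print(results)
# {'foo': None, 'test-foo': None, 'test foo': None, 'testfoo': 'foo'}
```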

Anchors

Another option is anchors. ^ and $ are special characters that match the start and end of the input (or of each line, when Python's re.MULTILINE flag is set). If you know your regex pattern should match exactly one line of text, start it with ^ and end it with $.
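One detail worth seeing in Python: by default ^ and $ anchor to the whole string, and they only apply per line when re.MULTILINE is passed. The sketch below assumes the file contents were read into a single string; if you already have a list of lines, the flag is not needed.

```python
import re

lines = [r"V-1\ZDS\R\EMBO-20-1:24", r"V-1\ZDS\R\EMBO-20-6:24"]
text = "\n".join(lines)

pattern = r"^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$"

# Without MULTILINE, $ cannot match in the middle of a two-line string,
# so the anchored pattern finds nothing.
no_flag = re.findall(pattern, text)

# With MULTILINE, ^ and $ match at the start and end of every line.
per_line = re.findall(pattern, text, re.MULTILINE)

print(no_flag)   # []
print(per_line)  # ['24', '24']
```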

Starting a pattern with .* and ending it with .* is likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to search.

In regex, . matches any character (except a newline, by default), and * means 0 or more times. This means that once the literal part of your pattern is found, the surrounding .* swallows everything else on the line, so the match covers the entire string.

Greedy vs. Non-Greedy quantifiers

The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matching several lines of text as one match, hiding multiple records in a single .*.
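To make the greedy and non-greedy behaviour concrete, here is a small sketch that splits one of the sample values on a hyphen in both modes. The two patterns are invented for the demonstration and are not taken from the question.

```python
import re

value = r"V-1\ZDS\R\EMBO-20-1:24"

# Greedy: (.*) grabs as much as possible, so the split happens at the LAST hyphen.
greedy = re.match(r"(.*)-(.*)", value)

# Non-greedy: (.*?) grabs as little as possible, so the split happens at the FIRST hyphen.
lazy = re.match(r"(.*?)-(.*)", value)

print(greedy.groups())  # ('V-1\\ZDS\\R\\EMBO-20', '1:24')
print(lazy.groups())    # ('V', '1\\ZDS\\R\\EMBO-20-1:24')
```

The doubled backslashes in the printed tuples are only how Python displays a backslash inside a string literal; each one is a single character.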

The Actual Answer

Taking all of this into consideration, let's assume that your input data looks like this:

V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28

A better regular expression would be:

^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$

To see this regex in action, paste it and the sample data into a tester such as regex101.
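The same check can be done in Python. This sketch assumes each record is already a separate string in a list, as described in the question; RECORD and trailing are names chosen for the example.

```python
import re

RECORD = re.compile(r"^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$")

lines = [
    r"V-1\ZDS\R\EMBO-20-1:24",
    r"V-1\ZDS\R\EMBO-20-6:24",
    r"V-1\ZDS\R\EMBO-20-3:93",
    r"V-1\ZDS\R\EMBO-20-6:22309",
    r"V-1\ZDS\R\EMBO-20-8:2238",
    r"V-1\ZDS\R\EMBO-20-3:28",
]

# Collect the captured number after ':' for every line that matches.
trailing = []
for line in lines:
    m = RECORD.match(line)
    if m:
        trailing.append(m.group(1))

print(trailing)  # ['24', '24', '93', '22309', '2238', '28']
```

Note that \d matches exactly one digit, so a value such as EMBO-20-12:24 would not match; use \d+ there if multi-digit numbers can occur.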

There are several differences I would like to highlight:

  • Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.

  • No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture its match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing.

  • Only capturing the number. This makes it easier to extract the value of the capture groups.

  • No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.
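Putting the pieces together for the original question, one possible approach is to reduce every value to a comparison key made of the fixed prefix and the number after ':', and then take the difference of the keys rather than of the whole strings. This is only a sketch under assumptions: the names KEY, comparison_key and diff_by_key are hypothetical, the two input lists are invented, and \d+ is used for the variable number so that multi-digit values are also accepted.

```python
import re

# Group 1: the fixed prefix. Group 2: the value after ':'.
# The variable number between them is matched but not captured.
KEY = re.compile(r"^(V-1\\ZDS\\R\\EMBO-20)-\d+:(\d+)$")


def comparison_key(value):
    """Return (prefix, trailing value), or the whole string if it does not match."""
    m = KEY.match(value)
    return (m.group(1), m.group(2)) if m else value


def diff_by_key(x, y):
    """Return the values whose comparison key appears in only one of the two lists."""
    keys_x = {comparison_key(v) for v in x}
    keys_y = {comparison_key(v) for v in y}
    only_one_side = keys_x ^ keys_y  # symmetric difference of the key sets
    return [v for v in list(x) + list(y) if comparison_key(v) in only_one_side]


# Invented sample data for the two files.
file_a = [r"V-1\ZDS\R\EMBO-20-1:24", r"V-1\ZDS\R\EMBO-20-3:93"]
file_b = [r"V-1\ZDS\R\EMBO-20-6:24", r"V-1\ZDS\R\EMBO-20-3:28"]

print(diff_by_key(file_a, file_b))
# ['V-1\\ZDS\\R\\EMBO-20-3:93', 'V-1\\ZDS\\R\\EMBO-20-3:28']
```

The ^ operator on sets is the symmetric difference, which is what the funcDiff lambda in the question computes on whole strings. Here the ...-1:24 and ...-6:24 entries share a key, so they are no longer reported as different.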

huangapple
  • Posted on February 9, 2023, 02:23:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75390158.html