February 16, 2023, 18:16:53

Writing a proper regular expression that handles multiple spaces

Question

I'm struggling to write a proper regexp for PHP 7.4 that extracts the required information from a string.

Here are the sample strings:

Numer właściciela: NOWAKOWSKA                                              01-234 Warsaw
Numer właściciela: NOWAK_S6_2
Numer właściciela: KOWALSKA_S6_                                            01-234 Warsaw
Numer właściciela: NOWACKI S6_                                             01-234 Warsaw

What I want to extract from each line, respectively:

NOWAKOWSKA
NOWAK_S6_2
KOWALSKA_S6_
NOWACKI S6_

So far I have been using %^Numer właściciela:[[:space:]](?<owner_id>.+)$%imu, which worked fine with the example from row #2. However, it turns out that the other cases (#1, #3, #4) appeared during the roll-out phase, and our text extraction is not accurate enough for them.

The problem is with spaces: the value I want to extract may itself contain a single space, and that space must be included in the result. A run of repeated spaces, however, must not be included.

I tried playing around with conditionals and negative lookaheads to exclude multiple spaces, but could not get it to work.

Would really appreciate any help here.

Answer 1

Score: 1

In the general case, when you want to match sequences of non-whitespace chars separated by a single whitespace char, you can use

/^Numer właściciela:\h*(?<owner_id>\S+(?:\h\S+)*)/imu

\h is preferred to \s here since you are extracting data from lines in a longer text, not from standalone strings: \s also matches line break chars, so a match could run on into the next line.

If the strings you extract are all short, you may also use

/^Numer właściciela:\h*(?<owner_id>.*?)(?:\h{2}|$)/imu

This should be even more efficient, but only if the strings are as short as those in the question. In strings of arbitrary length, .*? is usually as expensive as .*.

Pattern details:

^ - start of a line (due to m flag)
Numer właściciela: - a literal string (replace the space with \h to match any horizontal whitespace)
\h* - zero or more horizontal whitespace chars
(?<owner_id>\S+(?:\h\S+)*) - Group "owner_id": one or more non-whitespace chars, followed by zero or more sequences of a single horizontal whitespace char and one or more non-whitespace chars.
(?<owner_id>.*?)(?:\h{2}|$) - Group "owner_id", which captures zero or more chars other than line break chars, as few as possible, followed by either two horizontal whitespace chars or the end of a line.
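As a rough check, both patterns can be run over the sample lines with preg_match_all. This is only a sketch: the helper extractOwnerIds and the variable names are made up for illustration, and the width of the column padding (40 spaces) is an assumption, since any run of two or more spaces behaves the same way for these two patterns.

```php
<?php
// Column padding between the owner id and the address. The exact width
// is an assumption; both patterns only care that it is 2+ spaces.
$gap = str_repeat(' ', 40);

// The four sample lines from the question, joined into one multi-line text.
$text = implode("\n", [
    "Numer właściciela: NOWAKOWSKA{$gap}01-234 Warsaw",
    "Numer właściciela: NOWAK_S6_2",
    "Numer właściciela: KOWALSKA_S6_{$gap}01-234 Warsaw",
    "Numer właściciela: NOWACKI S6_{$gap}01-234 Warsaw",
]);

// Pattern 1: runs of non-whitespace joined by single horizontal spaces.
$singleSpaced = '/^Numer właściciela:\h*(?<owner_id>\S+(?:\h\S+)*)/imu';

// Pattern 2: lazy capture, stopped by two horizontal spaces or end of line.
$lazy = '/^Numer właściciela:\h*(?<owner_id>.*?)(?:\h{2}|$)/imu';

// Hypothetical helper: returns every "owner_id" capture found in $text.
function extractOwnerIds(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches['owner_id'];
}

// Both calls print NOWAKOWSKA, NOWAK_S6_2, KOWALSKA_S6_ and "NOWACKI S6_".
print_r(extractOwnerIds($singleSpaced, $text));
print_r(extractOwnerIds($lazy, $text));
```

The source file needs to be saved as UTF-8 for the u flag to accept the pattern and the subject.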

Answer 2

Score: 1

This regex:

/^Numer właściciela:\s+(?<owner_id>.*?)(?=\s{20,}|$)/imu
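The answer gives no explanation, so here is a small sketch of how this pattern behaves. The lookahead only ends the capture at a run of 20 or more whitespace chars (or at the end of a line), so it depends on the column padding being wide. The helper ownerId and the two test lines are hypothetical; the gap widths of 40 and 5 spaces are assumptions chosen to show both outcomes.

```php
<?php
// Pattern from this answer: lazy capture, ended by a lookahead for
// 20+ whitespace chars or the end of a line.
$pattern = '/^Numer właściciela:\s+(?<owner_id>.*?)(?=\s{20,}|$)/imu';

// Hypothetical helper: the captured id, or null if the line does not match.
function ownerId(string $pattern, string $line): ?string
{
    return preg_match($pattern, $line, $m) === 1 ? $m['owner_id'] : null;
}

// A wide gap (assumed 40 spaces) and a narrow one (assumed 5 spaces).
$wide   = 'Numer właściciela: NOWACKI S6_' . str_repeat(' ', 40) . '01-234 Warsaw';
$narrow = 'Numer właściciela: NOWACKI S6_' . str_repeat(' ', 5) . '01-234 Warsaw';

var_dump(ownerId($pattern, $wide));   // "NOWACKI S6_"
var_dump(ownerId($pattern, $narrow)); // capture runs on to the end of the line
```

If the padding in the real data can be narrower than 20 spaces, the \h{2} variant from Answer 1 is probably the safer choice.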

