March 15, 2023, 21:32:19

Regular expression to identify text between semi-colons that contains comma and spaces

Question

I am trying to identify text that contains a comma (,) and whitespace (\s+) in a CSV file that is semicolon (;) separated. Sample CSV entries are as follows:

09/03/2023;13;P;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;(UNSC RESOLUTION 1483);;;;;;;;;;;;;;;;;;;;;;;;;;;14;13;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;1937-04-28;al-Awja, near Tikrit;IRQ;;;;;;;;;;;;;;;;EU.27.28
09/03/2023;20;P;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;(Saddam's second son);26;20;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;Hussein Al-Tikriti;Qusay;Saddam;Qusay Saddam Hussein Al-Tikriti;M;;Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;EU.39.56

From the sample data, I am trying to extract the following text:

al-Awja, near Tikrit
Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard

Both target texts contain a comma (,), which causes a problem when converting the semicolon (;) separated file into a comma (,) separated file: each existing comma inside a string adds an extra column.

So far I have the following regular expression, which gets me to the required text. However, I am unable to retrieve the entire string with it.

Regex: ([A-Za-z0-9-]+)([,])(\s+)([A-Za-z0-9-]+)

Please help.

Answer 1

Score: 1

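It is simpler to split your input lines into fields and then use a relatively simple regex to filter those fields by the characters of interest:

    # Assume that $lines contains the lines of the input file,
    # such as obtained via Get-Content
    $lines -split ';' -match ' .*,|,.* '

This outputs those fields that contain _both_ a space _and_ a comma (`,`), yielding the output shown in your question.

If you only care about commas, `-match ','` will do.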
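
As a minimal sketch of the split-then-filter approach, the snippet below runs the command against two shortened stand-ins for the sample rows. The abbreviated rows and the `$hits` variable name are assumptions made for illustration; a real script would fill `$lines` with `Get-Content`.

```powershell
# Shortened stand-ins for the semicolon-separated sample rows (assumption:
# the real input would come from Get-Content on the actual file).
$lines = @(
    '09/03/2023;13;P;IRQ;1937-04-28;al-Awja, near Tikrit;IRQ;;EU.27.28'
    '09/03/2023;20;P;IRQ;Qusay;M;;Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard;;EU.39.56'
)

# With an array on the left, -split splits every line and returns one flat
# array of fields; -match then acts as a filter and keeps only the fields
# in which a space and a comma both occur (in either order).
$hits = $lines -split ';' -match ' .*,|,.* '

# Emit the matching fields, one per line.
$hits
```

Because the fields from all lines end up in one flat array, this form tells you which fields match but not which row they came from.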

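Taking a step back:

You can read your file with [`Import-Csv`](https://learn.microsoft.com/powershell/module/microsoft.powershell.utility/import-csv) `-Delimiter ';'` and export it to a regular, `,`-separated CSV with [`Export-Csv`](https://learn.microsoft.com/powershell/module/microsoft.powershell.utility/export-csv). This does _not_ require special handling of fields with embedded `,` characters, because these cmdlets automatically enclose the field values in `"..."` (double quotes), which allows the fields to contain `,`.

If your input file happens to lack a _header_ row (a first line that contains column names), you'll have to supply one yourself, via the `-Header` parameter.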
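
The conversion could look roughly like the sketch below. The temporary file names, the three-column layout, and the column names `Field1` to `Field3` are all hypothetical placeholders; the real file has many more columns, and `-Header` would need one name per column.

```powershell
# Hypothetical file locations, for illustration only.
$inFile  = Join-Path ([IO.Path]::GetTempPath()) 'sample-semicolon.csv'
$outFile = Join-Path ([IO.Path]::GetTempPath()) 'sample-comma.csv'

# A tiny header-less, semicolon-separated input built from the sample values.
Set-Content -Path $inFile -Value @(
    '1937-04-28;al-Awja, near Tikrit;IRQ'
    'M;Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard;IRQ'
)

# Read with ';' as the delimiter, supplying placeholder column names because
# the input has no header row, then write a regular comma-separated file.
# -NoTypeInformation suppresses the "#TYPE" line that Windows PowerShell
# would otherwise write first.
Import-Csv -Path $inFile -Delimiter ';' -Header 'Field1', 'Field2', 'Field3' |
    Export-Csv -Path $outFile -NoTypeInformation

# Show the converted file; fields with embedded commas stay intact
# because they are written inside double quotes.
Get-Content -Path $outFile
```

Reading the output back with `Import-Csv` (default `,` delimiter) should return the comma-containing values as single fields.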


