Regular expression to identify text between semi-colons that contains comma and spaces
Question
I am trying to identify text that contains a comma (,) and whitespace (\s+) in a CSV file that is semicolon (;) separated. Sample CSV entries are as follows:
09/03/2023;13;P;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;(UNSC RESOLUTION 1483);;;;;;;;;;;;;;;;;;;;;;;;;;;14;13;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;1937-04-28;al-Awja, near Tikrit;IRQ;;;;;;;;;;;;;;;;EU.27.28
09/03/2023;20;P;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;(Saddam's second son);26;20;1210/2003 (OJ L169);2003-07-08;http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:169:0006:0023:EN:PDF;IRQ;Hussein Al-Tikriti;Qusay;Saddam;Qusay Saddam Hussein Al-Tikriti;M;;Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;EU.39.56
In the sample data, I am trying to extract the following texts:
al-Awja, near Tikrit
Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard
Both target texts contain a comma (,), which causes a problem when converting the semicolon (;) separated file into a comma (,) separated file: each existing comma in the string adds an extra column.
So far I have the following regular expression, which gets me to the required texts. However, I am unable to retrieve the entire string with it.
Regex: ([A-Za-z0-9-]+)([,])(\s+)([A-Za-z0-9-]+)
Please help.
Answer 1
Score: 1
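It is simpler to split your input lines into fields and then use a relatively simple regex to filter those fields by the characters of interest:
# Assume that $lines contains the lines of the input file, e.g. as obtained via Get-Content
$lines -split ';' -match ' .*,|,.* '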
This outputs the fields that contain both a space and a comma (,), yielding the output shown in your question.
If you only care about commas, -match ',' will do.
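As a minimal sketch of the filtering described above, the following runs the same split-then-match approach against two shortened sample lines. The variable names and the abbreviated field layout are assumptions made for illustration; only the `-split` and `-match` expressions come from the answer.

```powershell
# Two shortened sample lines; the field layout is abbreviated for illustration only.
$lines = @(
    '09/03/2023;13;P;1210/2003 (OJ L169);IRQ;1937-04-28;al-Awja, near Tikrit;IRQ;;EU.27.28'
    '09/03/2023;20;P;1210/2003 (OJ L169);IRQ;Qusay;M;;Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard;;EU.39.56'
)

# With an array on the left, -split splits every line and returns one flat array of fields.
$fields = $lines -split ';'

# With an array on the left, -match acts as a filter and returns the matching elements.
$fieldsWithSpaceAndComma = $fields -match ' .*,|,.* '   # a space and a comma, in either order
$fieldsWithComma         = $fields -match ','           # any comma at all

# Emit the fields that contain both a space and a comma.
$fieldsWithSpaceAndComma
```

Fields such as 1210/2003 (OJ L169) contain a space but no comma, so they are filtered out; only the two target texts remain.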
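Taking a step back:
You can read your file with Import-Csv -Delimiter ';' and export it to a regular, comma-separated CSV with Export-Csv. This does not require special handling of fields with embedded , characters, because Export-Csv automatically encloses field values in double quotes ("..."), which allows the fields to contain ,.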
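If your input file happens to lack a header row (a first line that contains column names), you will have to supply one yourself via the -Header parameter.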
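The conversion could look roughly like the sketch below. The file names, the column names passed to -Header, and the shortened four-column sample are all hypothetical; a real file would need one header name per column.

```powershell
# Hypothetical file names, chosen for illustration only.
$inFile  = Join-Path ([IO.Path]::GetTempPath()) 'sample-semicolon.csv'
$outFile = Join-Path ([IO.Path]::GetTempPath()) 'sample-comma.csv'

# A shortened, header-less sample in which the third field contains commas.
Set-Content -Path $inFile -Value @(
    '13;1937-04-28;al-Awja, near Tikrit;IRQ'
    '20;2003-07-08;Oversaw Special Republican Guard, Special Security Organisation, and Republican Guard;IRQ'
)

# The file has no header row, so column names (assumed here) are supplied via -Header.
$header = 'Id', 'Date', 'Remark', 'Country'

# Read with ';' as the delimiter, then write a regular comma-separated file.
# -NoTypeInformation suppresses the #TYPE line that Windows PowerShell would otherwise add.
Import-Csv -Path $inFile -Delimiter ';' -Header $header |
    Export-Csv -Path $outFile -NoTypeInformation

# Show the converted file; fields with embedded commas are enclosed in double quotes.
Get-Content -Path $outFile
```

Reading the result back with Import-Csv -Path $outFile should return the Remark values with their commas intact, since the quoting keeps each one in a single column.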