Regex matching strings with mixture of Japanese and English characters

Question

I have this script in PowerShell which I am going to use eventually to translate an XML file with some Japanese words and replace with the English. For now this is a simple regex matching example:

$pattern = "(?<=\>)[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+(?=\<\/)"
$text = 'tag3>日本語</tag>漢字</tag>.'

$matches = $text | Select-String -Pattern $pattern -AllMatches | ForEach-Object { $_.Matches.Value }

$matches

This works fine and will return the following:

日本語
漢字

However, I also want it to grab one or more English characters before or after the Japanese characters, with the whole match enclosed between > and </.

For this string:

tag3>Some text before 日本語 and some text after</tag><Before text 漢字</tag>

It should grab these:

Some text before 日本語 and some text after
Before text 漢字

Answer 1

Score: 1

The obligatory general recommendation:

  • String parsing of XML text is best avoided, because it is inherently limited and brittle; it's always preferable to use a dedicated XML parser, such as .NET's System.Xml.XmlDocument class, which PowerShell provides easy access to via its [xml] type accelerator and the property-based adaptation of the XML DOM; see this answer for an example.
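
As a rough illustration of the parser-based approach recommended above, the following sketch loads a small XML string with the [xml] accelerator and lists the text nodes that contain at least one Japanese character. The sample document, its element names, and the variable names ($xmlText, $doc, $japanese, $found) are assumptions made for this example and do not come from the original question.

```powershell
# Hypothetical sample input; assumed to be well-formed XML.
$xmlText = '<root><tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag><tag>English only</tag></root>'
$doc = [xml] $xmlText

# Same Unicode blocks as in the question: Hiragana, Katakana, CJK ideographs.
$japanese = '[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]'

# Select every text node, then keep only those with a Japanese character.
$found = $doc.SelectNodes('//text()') |
  Where-Object { $_.Value -match $japanese } |
  ForEach-Object { $_.Value }

$found
```

Because the parser hands back the text nodes directly, no lookarounds for > and </ are needed, and the "English only" node is simply filtered out. In Windows PowerShell 5.1, a script file containing Japanese literals generally has to be saved as UTF-8 with a BOM to be read correctly.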

You can refine your regex as follows:

$pattern = '(?<=[^/]>)[^>\P{IsBasicLatin}]*[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+[^>\P{IsBasicLatin}]*(?=</)'

$text = '<tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag>.'

# Outputs directly to the console for diagnostic purposes.
$text |
  Select-String -Pattern $pattern -AllMatches |
  ForEach-Object { $_.Matches.Value } 

Output:

Some text before 日本語 and some text after
Before text 漢字

For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
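
In case the linked page is unavailable, here is a sketch of the same pattern spelled out with inline comments, using the (?x) option so that whitespace and # comments inside the pattern are ignored. The comments are an interpretation of the pattern rather than the content of the linked page, and $annotated and $result are names chosen for this example.

```powershell
# Annotated form of the pattern; (?x) enables free-spacing mode with comments.
$annotated = @'
(?x)
(?<=[^/]>)               # preceded by ">" that does not close a self-closing tag
[^>\P{IsBasicLatin}]*    # optional Basic Latin characters other than ">"
[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+   # one or more Japanese characters
[^>\P{IsBasicLatin}]*    # optional trailing Basic Latin characters other than ">"
(?=</)                   # followed by the start of a closing tag
'@

$text = '<tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag>.'

$result = $text |
  Select-String -Pattern $annotated -AllMatches |
  ForEach-Object { $_.Matches.Value }

$result
```

The class [^>\P{IsBasicLatin}] is a double negation: it excludes ">" and everything outside the Basic Latin block, which leaves Basic Latin characters other than ">". Note that the pattern allows only a single run of Japanese characters per element, so text with two Japanese runs separated by English words would not match as written.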

  • Posted on June 12, 2023, 05:10:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76452529.html