Regex matching strings with mixture of Japanese and English characters
Question
I have this PowerShell script, which I will eventually use to translate an XML file by replacing some Japanese words with English. For now, it is just a simple regex matching example:
$pattern = "(?<=\>)[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+(?=\<\/)"
$text = 'tag3>日本語</tag>漢字</tag>.'
$matches = $text | Select-String -Pattern $pattern -AllMatches | ForEach-Object { $_.Matches.Value }
$matches
This works fine and will return the following:
日本語
漢字
However, I also want it to grab one or more English characters before or after the Japanese characters, with the whole thing enclosed between > and </.
For this string:
tag3>Some text before 日本語 and some text after</tag><Before text 漢字</tag>
It should grab these:
Some text before 日本語 and some text after
Before text 漢字
Answer 1
Score: 1
The obligatory general recommendation:
- String parsing of XML text is best avoided, because it is inherently limited and brittle; it's always preferable to use a dedicated XML parser, such as .NET's System.Xml.XmlDocument class, which PowerShell provides easy access to via its [xml] type accelerator and the property-based adaptation of the XML DOM; see this answer for an example.
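As a rough illustration of the parser-based approach recommended above, the following is a minimal sketch rather than a definitive implementation. It assumes the text is wrapped in a well-formed document; the root element and the third, English-only element are made up for this example, as are the variable names $xmlText, $doc, $japanese, and $found.

```powershell
# Hypothetical sample document; the <root> wrapper and element names are assumptions.
$xmlText = '<root><tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag><tag>English only</tag></root>'

# Parse the text into a System.Xml.XmlDocument via the [xml] type accelerator.
[xml] $doc = $xmlText

# Matches any single Hiragana, Katakana, or CJK Unified Ideographs character.
$japanese = '[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]'

# Select every text node, then keep only those that contain Japanese characters.
$found = $doc.SelectNodes('//text()') |
  Where-Object { $_.InnerText -match $japanese } |
  ForEach-Object { $_.InnerText }

$found
```

Because the parser hands back complete text nodes, no lookbehind or lookahead for > and </ is needed; the regex only has to decide whether a node contains Japanese characters. The English-only node is expected to be filtered out.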
You can refine your regex as follows:
$pattern = '(?<=[^/]>)[^>\P{IsBasicLatin}]*[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+[^>\P{IsBasicLatin}]*(?=</)'
$text = '<tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag>.'
# Outputs directly to the console for diagnostic purposes.
$text |
Select-String -Pattern $pattern -AllMatches |
ForEach-Object { $_.Matches.Value }
Output:
Some text before 日本語 and some text after
Before text 漢字
For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
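For readers who prefer an inline breakdown, here is the same pattern written with the (?x) inline option, which makes the .NET regex engine ignore unescaped whitespace outside character classes and treat # as the start of a comment. This is only a sketch: the comments are one reading of the pattern, and the use of [regex]::Matches() and the variable name $result are choices made for this example.

```powershell
$pattern = @'
(?x)                                                       # ignore pattern whitespace, allow comments
(?<=[^/]>)                                                 # preceded by ">" whose own predecessor is not "/"
[^>\P{IsBasicLatin}]*                                      # zero or more Basic Latin characters other than ">"
[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+  # one or more Japanese characters
[^>\P{IsBasicLatin}]*                                      # zero or more Basic Latin characters other than ">"
(?=</)                                                     # followed by the start of a closing tag
'@

$text = '<tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag>.'

# Collect the matched substrings; $result is used instead of the automatic $Matches variable.
$result = [regex]::Matches($text, $pattern) | ForEach-Object { $_.Value }
$result
```

The double negative [^>\P{IsBasicLatin}] reads as "neither > nor a character outside the Basic Latin block", which is what restricts the surrounding text to ASCII-range characters while keeping the match inside a single element.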