June 12, 2023, 05:10:03

Regex matching strings with mixture of Japanese and English characters

Question

I have this PowerShell script, which I eventually want to use to translate an XML file by replacing some Japanese words with English ones. For now, it is a simple regex matching example:

$pattern = "(?<=\>)[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+(?=\<\/)"
$text = 'tag3>日本語</tag>漢字</tag>.'

$matches = $text | Select-String -Pattern $pattern -AllMatches | ForEach-Object { $_.Matches.Value }

$matches

This works fine and will return the following:

日本語
漢字

However, I want it to also grab one or more English characters before or after the Japanese characters, with the whole thing wrapped between > and </.

For this string:

tag3>Some text before 日本語 and some text after</tag><Before text 漢字</tag>

It should grab these:

Some text before 日本語 and some text after
Before text 漢字

Answer 1

Score: 1

The obligatory general recommendation:

String parsing of XML text is best avoided, because it is inherently limited and brittle; it's always preferable to use a dedicated XML parser, such as .NET's System.Xml.XmlDocument class, which PowerShell provides easy access to via its [xml] type accelerator and the property-based adaptation of the XML DOM; see this answer for an example.
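
As a rough illustration of the parser-based approach recommended above, here is a minimal sketch. It assumes the input is well-formed XML with a single root element; the <root> wrapper, the third "English only" element, and the variable names $doc, $japanese and $hits are made up for this example and are not part of the original question.

```powershell
# Hypothetical sample document: the <root> wrapper and the last element
# are assumptions added so that the text is well-formed XML.
[xml]$doc = '<root><tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag><tag>English only</tag></root>'

# Same Unicode blocks as in the question's character class.
$japanese = '[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]'

# Select every text node, then keep only those that contain
# at least one Japanese character.
$hits = $doc.SelectNodes('//text()') |
  Where-Object { $_.Value -match $japanese } |
  ForEach-Object { $_.Value }

$hits
```

Because the parser hands back the text nodes directly, no lookarounds for > and </ are needed. If the script is saved to a file and run in Windows PowerShell, the file should be saved as UTF-8 with a BOM so that the Japanese literals are read correctly.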

You can refine your regex as follows:

$pattern = '(?<=[^/]>)[^>\P{IsBasicLatin}]*[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+[^>\P{IsBasicLatin}]*(?=</)'

$text = '<tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag>.'

# Outputs directly to the console for diagnostic purposes.
$text |
  Select-String -Pattern $pattern -AllMatches |
  ForEach-Object { $_.Matches.Value }

Output:

Some text before 日本語 and some text after
Before text 漢字

For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
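
For readers without access to the linked page, the same pattern can be written with inline comments. This is only a sketch of one way to read the regex: it uses the (?x) inline option (ignore pattern whitespace), calls [regex]::Matches() instead of Select-String, and the comments are an interpretation rather than the answerer's own wording. The variable name $results is made up for this example.

```powershell
# The same pattern as above, spread over several lines with (?x).
$pattern = @'
(?x)                     # ignore whitespace in the pattern and allow comments
(?<=[^/]>)               # preceded by ">" whose previous character is not "/"
[^>\P{IsBasicLatin}]*    # zero or more Basic Latin characters other than ">"
[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]+   # one or more Japanese characters
[^>\P{IsBasicLatin}]*    # zero or more Basic Latin characters other than ">"
(?=</)                   # followed by the start of a closing tag
'@

$text = '<tag3>Some text before 日本語 and some text after</tag3><tag>Before text 漢字</tag>.'

# Collect the matched substrings.
$results = [regex]::Matches($text, $pattern) | ForEach-Object { $_.Value }
$results
```

The double negative [^>\P{IsBasicLatin}] reads as "not > and not outside the Basic Latin block", which is how the pattern picks up the English text on either side of the Japanese run without crossing a tag boundary.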
