How to handle combining characters along with the \p{L} pattern for Thai strings?

# Question

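I need to detect text with Unicode characters, restricting it to letters only (e.g. no symbols, emojis, etc., just something that can be used in a person's name in any Unicode language). The `\p{L}` category seems to do the trick, but it does not recognize Thai strings. I do not speak Thai, so I got a few common Thai names from ChatGPT, and they all fail in my test. I tried it at [RegExr][1] (see the Tests tab) and also wrote a simple test program: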

```csharp
using System.Text.RegularExpressions;

Console.OutputEncoding = System.Text.Encoding.UTF8;

string pattern = @"^[\p{L}\s]+$";

string englishText = "Mary";
Console.Write($"{englishText}: ");
Console.WriteLine(Regex.IsMatch(englishText, pattern, RegexOptions.IgnoreCase).ToString()); // true

string germanText = "RöschenÜmit";
Console.Write($"{germanText}: ");
Console.WriteLine(Regex.IsMatch(germanText, pattern, RegexOptions.IgnoreCase).ToString()); // true

string thaiText = "อรุณรัตน์";
Console.Write($"{thaiText}: ");
Console.WriteLine(Regex.IsMatch(thaiText, pattern, RegexOptions.IgnoreCase).ToString()); // false

string japaneseText = "タクミたくみく";
Console.Write($"{japaneseText}: ");
Console.WriteLine(Regex.IsMatch(japaneseText, pattern, RegexOptions.IgnoreCase).ToString()); // true
```

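I noticed that when I test each character of the Thai string individually, it seems to be recognized as a valid Unicode letter, but the string as a whole fails. To make sure there are no hidden characters, I checked the [raw values][2] and did not see anything suspicious. Any ideas what's going on here?

P.S. I know that some of the characters in the test come from different sets and that names may include spaces, dashes, etc., but that is not the point. I'm just trying to solve the Thai string issue here.

COMMENT:
The Thai string contains combining characters, which I guess causes the problem in detecting letters even though a letter plus its mark looks like a single letter (e.g. U+0E23 followed by U+0E38 renders as "รุ").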
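
To see why such a pair reads as a single letter, it can help to compare the number of UTF-16 `char` values with the number of text elements (user-perceived characters). This is a minimal sketch, assuming `System.Globalization.StringInfo` is available; the names `TextElementDemo`, `CountChars`, and `CountTextElements` are made up for illustration:

```csharp
using System.Globalization;

public static class TextElementDemo
{
    // Number of UTF-16 code units; each combining mark is counted separately.
    public static int CountChars(string text) => text.Length;

    // Number of text elements; a base letter and its combining marks count once.
    public static int CountTextElements(string text) =>
        new StringInfo(text).LengthInTextElements;
}
```

For the pair `"\u0e23\u0e38"`, `CountChars` gives 2 while `CountTextElements` gives 1: the pair displays as one glyph, yet a regex character class checks it as two separate characters.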


  [1]: https://regexr.com/7agft
  [2]: https://qaz.wtf/u/show.cgi?show=%E0%B8%AD%E0%B8%A3%E0%B8%B8%E0%B8%93%E0%B8%A3%E0%B8%B1%E0%B8%95%E0%B8%99%E0%B9%8C&type=string


# Answer 1
**Score**: 3

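If we dump the characters of `thaiText`:

```csharp
using System;
using System.Linq;

string thaiText = "อรุณรัตน์";

// One line per UTF-16 char: the character, its code point, and its Unicode category
var report = string.Join(Environment.NewLine, thaiText
  .Select(c => $"{c} : \\u{(int)c:x4} : {char.GetUnicodeCategory(c)}"));

Console.WriteLine(report);
```

we get the cause of the misbehaviour: `NonSpacingMark` characters sit between the `OtherLetter` characters, and `\p{L}` does not match them:

```
อ : \u0e2d : OtherLetter
ร : \u0e23 : OtherLetter
ุ : \u0e38 : NonSpacingMark <- doesn't match
ณ : \u0e13 : OtherLetter
ร : \u0e23 : OtherLetter
ั : \u0e31 : NonSpacingMark <- doesn't match
ต : \u0e15 : OtherLetter
น : \u0e19 : OtherLetter
์ : \u0e4c : NonSpacingMark <- doesn't match
```

Technically, we could try to get rid of these marks with normalization:

```csharp
// The idea is to compose each letter and its marks into a single letter, which should then match
thaiText = thaiText.Normalize(NormalizationForm.FormC);
```

but it doesn't work on my workstation. (Unicode defines no precomposed code points for Thai letter-plus-mark sequences, so normalization has nothing to compose.)

So if normalization doesn't work in your case either (or you want to be on the safe side), you can match Thai characters explicitly. Note that `\p{IsThai}` covers the whole Thai block, including Thai digits and the baht sign. Either allow Thai only:

```csharp
string pattern = @"^[\p{IsThai}\s]+$";
```

or Thai together with all other letters:

```csharp
string pattern = @"^[\p{L}\p{IsThai}\s]+$";
```

or allow both letters (`\p{L}`) and these non-spacing marks (`\p{Mn}`):

```csharp
string pattern = @"^[\p{L}\p{Mn}\s]+$";
```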
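
One caveat with a character class such as `[\p{L}\p{Mn}\s]` is that it also accepts a string made of marks alone. A stricter variant, sketched here as an assumption rather than as part of the original answer, requires every run of marks to follow a base letter. The names `NameValidator` and `IsLetterOnlyName` are hypothetical:

```csharp
using System.Text.RegularExpressions;

public static class NameValidator
{
    // Either a letter followed by any number of combining marks, or whitespace; repeated.
    // \p{M} covers all mark categories (Mn, Mc, Me), not only non-spacing marks.
    private static readonly Regex LetterName =
        new Regex(@"^(?:\p{L}\p{M}*|\s)+$", RegexOptions.CultureInvariant);

    // True when the text consists only of letters (with their marks) and whitespace.
    public static bool IsLetterOnlyName(string text) =>
        !string.IsNullOrEmpty(text) && LetterName.IsMatch(text);
}
```

With this sketch, `NameValidator.IsLetterOnlyName("อรุณรัตน์")` is accepted, while a lone mark such as `"\u0e38"` is rejected. Whitespace-only input still passes, so trimming and any further rules (dashes, apostrophes) are left to the caller.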

# Answer 2
**Score**: 2

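This happens because there are "mark" characters, which have to be matched separately from letters. Thai is not the only script that uses them; Tamil does too, for example. This regex matches the Thai string:

`^[\p{L}\p{M}\s]+$`

About `\p{M}`, from regular-expressions.info:

> `\p{M}` or `\p{Mark}`: a character intended to be combined with another
> character (e.g. accents, umlauts, enclosing boxes, etc.).

For comparison, here is the string with its mark characters, อรุณรัตน์, and the same string without them, อรณรตน. The latter is matched by `\p{L}` alone.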
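
The difference between `\p{Mn}` and `\p{M}` matters for scripts that use spacing combining marks. The following is a small sketch, assuming the Tamil syllable U+0BA4 U+0BBE (whose vowel sign has category `Mc`, a spacing mark) as a test input; `MarkCategoryDemo` and its members are illustrative names only:

```csharp
using System.Text.RegularExpressions;

public static class MarkCategoryDemo
{
    // Letters plus non-spacing marks only.
    public const string LettersAndNonSpacingMarks = @"^[\p{L}\p{Mn}\s]+$";

    // Letters plus every kind of mark (Mn, Mc, Me).
    public const string LettersAndAllMarks = @"^[\p{L}\p{M}\s]+$";

    // Thin wrapper so both patterns can be tried against the same input.
    public static bool Matches(string text, string pattern) =>
        Regex.IsMatch(text, pattern);
}
```

The Thai name from the question should pass both patterns, whereas `"\u0BA4\u0BBE"` should pass only `LettersAndAllMarks`, since `\p{Mn}` does not include spacing marks. For names in arbitrary scripts, `\p{M}` is therefore the more forgiving choice.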


huangapple
  • Posted on March 21, 2023, 00:45:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75793047.html