How to Check If The Rune is Chinese Punctuation Character in Go

huangapple go评论130阅读模式
英文:

How to Check If The Rune is Chinese Punctuation Character in Go

问题

对于像 这样的中文标点符号,如何在Go中进行检测?

我尝试使用unicode包中的字符范围表,像下面的代码,但是Han并不包括这些标点符号。

请告诉我应该使用哪个字符范围表来完成这个任务?(请避免使用regex,因为它性能较低。)

  1. for _, r := range strToDetect {
  2. if unicode.Is(unicode.Han, r) {
  3. return true
  4. }
  5. }
英文:

For Chinese punctuation chars like , how to detect via Go?

I tried with range table of package unicode like the code below, but Han doesn't include those punctuation chars.

Can you please tell me which range table should I use for this task? (Please refraining from using regex because it's low performance.)

  1. for _, r := range strToDetect {
  2. if unicode.Is(unicode.Han, r) {
  3. return true
  4. }
  5. }

答案1

得分: 1

标点符号散布在不同的Unicode代码块中。


Unicode®标准
版本14.0 - 核心规范

第6章
写作系统和标点符号
https://www.unicode.org/versions/latest/ch06.pdf

标点符号。本章的其余部分涉及一个特殊情况:标点符号,它们往往散布在不同的块中,并且可能被许多脚本共同使用。标点字符出现在块中的几个广泛分散的位置,包括基本拉丁文、拉丁-1补充、常规标点、补充标点和CJK符号和标点。特定脚本的块中也偶尔会有标点字符。


以下是你的两个例子,

〜波浪线U+301C

。表意句号U+3002


  1. package main
  2. import (
  3. "fmt"
  4. "unicode"
  5. )
  6. func main() {
  7. // CJK符号和标点Unicode块
  8. for r := rune(' '); r <= '〿'; r++ {
  9. if unicode.IsPunct(r) {
  10. fmt.Printf("%[1]U\t%[1]c\n", r)
  11. }
  12. }
  13. }

https://go.dev/play/p/WoJjM6JKTYR

  1. U+3001
  2. U+3002
  3. U+3003
  4. U+3008
  5. U+3009
  6. U+300A
  7. U+300B
  8. U+300C
  9. U+300D
  10. U+300E
  11. U+300F
  12. U+3010
  13. U+3011
  14. U+3014
  15. U+3015
  16. U+3016
  17. U+3017
  18. U+3018
  19. U+3019
  20. U+301A
  21. U+301B
  22. U+301C
  23. U+301D
  24. U+301E
  25. U+301F
  26. U+3030
  27. U+303D
英文:

Puctuation marks are scattered about in different Unicode code blocks.


> The Unicode® Standard
> Version 14.0 – Core Specification
>
> Chapter 6
> Writing Systems and Punctuation
> https://www.unicode.org/versions/latest/ch06.pdf
>
> Punctuation. The rest of this chapter deals with a special case: punctuation marks, which tend to be scattered about in different blocks and which may be used in common by many scripts. Punctuation characters occur in several widely separated places in the blocks, including Basic Latin, Latin-1 Supplement, General Punctuation, Supplemental Punctuation, and CJK Symbols and Punctuation. There are also occasional punctuation characters in blocks for specific scripts.


Here are two of your examples,

〜 Wave Dash U+301C

。Ideographic Full Stop U+3002


  1. package main
  2. import (
  3. &quot;fmt&quot;
  4. &quot;unicode&quot;
  5. )
  6. func main() {
  7. // CJK Symbols and Punctuation Unicode block
  8. for r := rune(&#39;\u3000&#39;); r &lt;= &#39;\u303F&#39;; r++ {
  9. if unicode.IsPunct(r) {
  10. fmt.Printf(&quot;%[1]U\t%[1]c\n&quot;, r)
  11. }
  12. }
  13. }

https://go.dev/play/p/WoJjM6JKTYR

  1. U+3001
  2. U+3002
  3. U+3003
  4. U+3008
  5. U+3009
  6. U+300A
  7. U+300B
  8. U+300C
  9. U+300D
  10. U+300E
  11. U+300F
  12. U+3010
  13. U+3011
  14. U+3014
  15. U+3015
  16. U+3016
  17. U+3017
  18. U+3018
  19. U+3019
  20. U+301A
  21. U+301B
  22. U+301C
  23. U+301D
  24. U+301E
  25. U+301F
  26. U+3030
  27. U+303D

huangapple
  • 本文由 发表于 2022年2月3日 21:13:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/70971932.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定