How to Check If The Rune is Chinese Punctuation Character in Go

huangapple go评论91阅读模式
英文:

How to Check If The Rune is Chinese Punctuation Character in Go

问题

对于像 这样的中文标点符号,如何在Go中进行检测?

我尝试使用unicode包中的字符范围表,像下面的代码,但是Han并不包括这些标点符号。

请告诉我应该使用哪个字符范围表来完成这个任务?(请避免使用regex,因为它性能较低。)

for _, r := range strToDetect {
    if unicode.Is(unicode.Han, r) {
        return true
    }
}
英文:

For Chinese punctuation chars like , how to detect via Go?

I tried with range table of package unicode like the code below, but Han doesn't include those punctuation chars.

Can you please tell me which range table should I use for this task? (Please refraining from using regex because it's low performance.)

for _, r := range strToDetect {
	if unicode.Is(unicode.Han, r) {
		return true
	}
}

答案1

得分: 1

标点符号散布在不同的Unicode代码块中。


Unicode®标准
版本14.0 - 核心规范

第6章
写作系统和标点符号
https://www.unicode.org/versions/latest/ch06.pdf

标点符号。本章的其余部分涉及一个特殊情况:标点符号,它们往往散布在不同的块中,并且可能被许多脚本共同使用。标点字符出现在块中的几个广泛分散的位置,包括基本拉丁文、拉丁-1补充、常规标点、补充标点和CJK符号和标点。特定脚本的块中也偶尔会有标点字符。


以下是你的两个例子,

〜波浪线U+301C

。表意句号U+3002


package main

import (
	"fmt"
	"unicode"
)

func main() {
	// CJK符号和标点Unicode块
	for r := rune(' '); r <= '〿'; r++ {
		if unicode.IsPunct(r) {
			fmt.Printf("%[1]U\t%[1]c\n", r)
		}
	}
}

https://go.dev/play/p/WoJjM6JKTYR

U+3001	、
U+3002	。
U+3003	〃
U+3008	〈
U+3009	〉
U+300A	《
U+300B	》
U+300C	「
U+300D	」
U+300E	『
U+300F	』
U+3010	【
U+3011	】
U+3014	〔
U+3015	〕
U+3016	〖
U+3017	〗
U+3018	〘
U+3019	〙
U+301A	〚
U+301B	〛
U+301C	〜
U+301D	〝
U+301E	〞
U+301F	〟
U+3030	〰
U+303D	〽
英文:

Puctuation marks are scattered about in different Unicode code blocks.


> The Unicode® Standard
> Version 14.0 – Core Specification
>
> Chapter 6
> Writing Systems and Punctuation
> https://www.unicode.org/versions/latest/ch06.pdf
>
> Punctuation. The rest of this chapter deals with a special case: punctuation marks, which tend to be scattered about in different blocks and which may be used in common by many scripts. Punctuation characters occur in several widely separated places in the blocks, including Basic Latin, Latin-1 Supplement, General Punctuation, Supplemental Punctuation, and CJK Symbols and Punctuation. There are also occasional punctuation characters in blocks for specific scripts.


Here are two of your examples,

〜 Wave Dash U+301C

。Ideographic Full Stop U+3002


package main

import (
	&quot;fmt&quot;
	&quot;unicode&quot;
)

func main() {
	// CJK Symbols and Punctuation Unicode block
	for r := rune(&#39;\u3000&#39;); r &lt;= &#39;\u303F&#39;; r++ {
		if unicode.IsPunct(r) {
			fmt.Printf(&quot;%[1]U\t%[1]c\n&quot;, r)
		}
	}
}

https://go.dev/play/p/WoJjM6JKTYR

U+3001	、
U+3002	。
U+3003	〃
U+3008	〈
U+3009	〉
U+300A	《
U+300B	》
U+300C	「
U+300D	」
U+300E	『
U+300F	』
U+3010	【
U+3011	】
U+3014	〔
U+3015	〕
U+3016	〖
U+3017	〗
U+3018	〘
U+3019	〙
U+301A	〚
U+301B	〛
U+301C	〜
U+301D	〝
U+301E	〞
U+301F	〟
U+3030	〰
U+303D	〽

huangapple
  • 本文由 发表于 2022年2月3日 21:13:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/70971932.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定