英文:
How to Check If The Rune is Chinese Punctuation Character in Go
问题
对于像~
,
。
这样的中文标点符号,如何在Go中进行检测?
我尝试使用unicode
包中的字符范围表,像下面的代码,但是Han
并不包括这些标点符号。
请告诉我应该使用哪个字符范围表来完成这个任务?(请避免使用regex
,因为它性能较低。)
for _, r := range strToDetect {
if unicode.Is(unicode.Han, r) {
return true
}
}
英文:
For Chinese punctuation chars like ~
,
。
, how to detect via Go?
I tried with range table of package unicode
like the code below, but Han
doesn't include those punctuation chars.
Can you please tell me which range table should I use for this task? (Please refraining from using regex
because it's low performance.)
for _, r := range strToDetect {
if unicode.Is(unicode.Han, r) {
return true
}
}
答案1
得分: 1
标点符号散布在不同的Unicode代码块中。
Unicode®标准
版本14.0 - 核心规范第6章
写作系统和标点符号
https://www.unicode.org/versions/latest/ch06.pdf标点符号。本章的其余部分涉及一个特殊情况:标点符号,它们往往散布在不同的块中,并且可能被许多脚本共同使用。标点字符出现在块中的几个广泛分散的位置,包括基本拉丁文、拉丁-1补充、常规标点、补充标点和CJK符号和标点。特定脚本的块中也偶尔会有标点字符。
以下是你的两个例子,
〜波浪线U+301C
。表意句号U+3002
package main
import (
"fmt"
"unicode"
)
func main() {
// CJK符号和标点Unicode块
for r := rune(' '); r <= '〿'; r++ {
if unicode.IsPunct(r) {
fmt.Printf("%[1]U\t%[1]c\n", r)
}
}
}
https://go.dev/play/p/WoJjM6JKTYR
U+3001 、
U+3002 。
U+3003 〃
U+3008 〈
U+3009 〉
U+300A 《
U+300B 》
U+300C 「
U+300D 」
U+300E 『
U+300F 』
U+3010 【
U+3011 】
U+3014 〔
U+3015 〕
U+3016 〖
U+3017 〗
U+3018 〘
U+3019 〙
U+301A 〚
U+301B 〛
U+301C 〜
U+301D 〝
U+301E 〞
U+301F 〟
U+3030 〰
U+303D 〽
英文:
Puctuation marks are scattered about in different Unicode code blocks.
> The Unicode® Standard
> Version 14.0 – Core Specification
>
> Chapter 6
> Writing Systems and Punctuation
> https://www.unicode.org/versions/latest/ch06.pdf
>
> Punctuation. The rest of this chapter deals with a special case: punctuation marks, which tend to be scattered about in different blocks and which may be used in common by many scripts. Punctuation characters occur in several widely separated places in the blocks, including Basic Latin, Latin-1 Supplement, General Punctuation, Supplemental Punctuation, and CJK Symbols and Punctuation. There are also occasional punctuation characters in blocks for specific scripts.
Here are two of your examples,
〜 Wave Dash U+301C
。Ideographic Full Stop U+3002
package main
import (
"fmt"
"unicode"
)
func main() {
// CJK Symbols and Punctuation Unicode block
for r := rune('\u3000'); r <= '\u303F'; r++ {
if unicode.IsPunct(r) {
fmt.Printf("%[1]U\t%[1]c\n", r)
}
}
}
https://go.dev/play/p/WoJjM6JKTYR
U+3001 、
U+3002 。
U+3003 〃
U+3008 〈
U+3009 〉
U+300A 《
U+300B 》
U+300C 「
U+300D 」
U+300E 『
U+300F 』
U+3010 【
U+3011 】
U+3014 〔
U+3015 〕
U+3016 〖
U+3017 〗
U+3018 〘
U+3019 〙
U+301A 〚
U+301B 〛
U+301C 〜
U+301D 〝
U+301E 〞
U+301F 〟
U+3030 〰
U+303D 〽
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论