How to retrieve the first “complete” character of a []rune?
Question
I am trying to write a function
    func Anonymize(name string) string
that anonymizes names. Here are some example input/output pairs to show what it is supposed to do:
Müller → M.
von der Linden → v. d. L.
Meyer-Schulze → M.-S.
This function is supposed to work with names composed of arbitrary characters. While implementing it, I ran into the following question:
Given a []rune or string, how do I figure out how many runes I have to take to get a complete character, complete in the sense that all modifiers and combining accents belonging to the character are included? For instance, if the input is []rune{0x0041, 0x0308, 0x0066, 0x0067} (the string Äfg, where Ä is represented as an A followed by a combining diaeresis), the function should return 2, because the first two runes form the first character, Ä. If I took only the first rune, I would get A, which is incorrect.
I need an answer to this question because the name I want to anonymize might begin with an accented character, and I don't want to drop the accent.
Answer 1
Score: 2
You can try the following function (inspired by "Go language string length"):
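// Requires the "regexp" and "unicode/utf8" packages.
// graphemeRE matches one non-mark rune plus any following marks, or else any single rune.
var graphemeRE = regexp.MustCompile(`(?s)\PM\pM*|.`)

// FirstGraphemeLen returns the number of runes in the first grapheme of str (0 for an empty string).
func FirstGraphemeLen(str string) int {
	return utf8.RuneCountInString(graphemeRE.FindString(str))
}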
See this example:
r := []rune{0x0041, 0x0308, 0x0066, 0x0041, 0x0308, 0x0067}
s := string(r)
fmt.Println(s, len(r), FirstGraphemeLen(s))
Output:
ÄfÄg 6 2
The string consists of 6 runes, but its first grapheme takes up only 2.
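The same pattern can also split an entire string into base-plus-marks clusters, which may help when more than the first character is needed. The following is a minimal sketch; `SplitClusters` and `clusterRE` are hypothetical names that do not appear in the original answer. Note that the pattern only groups a rune with its trailing combining marks (category M). It is an approximation and does not implement full Unicode grapheme cluster segmentation (for example, emoji joined with a zero-width joiner are not kept together).

```go
package main

import (
	"fmt"
	"regexp"
)

// clusterRE matches one non-mark rune followed by any number of marks,
// or, failing that, any single rune (for example a stray leading mark).
// The (?s) flag lets "." match a newline as well.
var clusterRE = regexp.MustCompile(`(?s)\PM\pM*|.`)

// SplitClusters is a hypothetical helper that splits s into
// base+combining-mark clusters using clusterRE.
func SplitClusters(s string) []string {
	return clusterRE.FindAllString(s, -1)
}

func main() {
	// A + combining diaeresis, f, A + combining diaeresis, g
	s := string([]rune{0x0041, 0x0308, 0x0066, 0x0041, 0x0308, 0x0067})
	for _, c := range SplitClusters(s) {
		fmt.Printf("%s: %d rune(s)\n", c, len([]rune(c)))
	}
}
```

For the six-rune example string this yields four clusters, the first and third of which contain two runes each.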
The OP, FUZxxl, used another approach based on unicode.IsMark(r):
> IsMark reports whether the rune is a mark character (category M).
The source (from FUZxxl's play.golang.org) includes:
	// take one character including all modifiers from the last name
	r, _, err := ln.ReadRune()
	if err != nil {
		/* ... */
	}
	aln = append(aln, r)
	for {
		r, _, err = ln.ReadRune()
		if err != nil {
			goto done
		}
		if !unicode.IsMark(r) {
			break
		}
		aln = append(aln, r)
	}
	aln = append(aln, '.')
	/* ... */
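To show how the unicode.IsMark idea might fit into the Anonymize function from the question, here is a self-contained sketch. It is not FUZxxl's actual code, whose remaining parts are elided above. The set of word separators (whitespace and '-') is an assumption derived from the three examples in the question, and the state names are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Anonymize abbreviates every word of name to its first character
// (one base rune plus any combining marks that follow it) and a period.
// Assumption: whitespace and '-' separate words and are copied unchanged.
func Anonymize(name string) string {
	const (
		wordStart = iota // the next non-separator rune begins a new word
		inFirst          // base rune written; combining marks still accepted
		skipping         // first character complete; rest of the word is dropped
	)
	var b strings.Builder
	state := wordStart
	for _, r := range name {
		isSep := r == '-' || unicode.IsSpace(r)
		switch {
		case isSep:
			if state != wordStart {
				b.WriteByte('.') // close the abbreviated word
			}
			b.WriteRune(r)
			state = wordStart
		case state == wordStart:
			b.WriteRune(r)
			state = inFirst
		case state == inFirst && unicode.IsMark(r):
			b.WriteRune(r) // keep accents attached to the base rune
		default:
			state = skipping
		}
	}
	if state != wordStart {
		b.WriteByte('.')
	}
	return b.String()
}

func main() {
	fmt.Println(Anonymize("Müller"))         // M.
	fmt.Println(Anonymize("von der Linden")) // v. d. L.
	fmt.Println(Anonymize("Meyer-Schulze"))  // M.-S.
}
```

Ranging over the string decodes it rune by rune, so no regular expression is needed. A name that starts with A followed by U+0308 keeps both runes before the period.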