How to retrieve the first “complete” character of a []rune?

Question

I am trying to write a function

func Anonymize(name string) string

that anonymizes names. Here are some examples of pairs of input and output so you get an idea of what it is supposed to do:

Müller → M.
von der Linden → v. d. L.
Meyer-Schulze → M.-S.

This function is supposed to work with names composed of arbitrary characters. While implementing it, I ran into the following question:

Given a []rune or string, how do I figure out how many runes I have to take to get a complete character, where "complete" means that all modifiers and combining accents belonging to the character are included? For instance, if the input is []rune{0x0041, 0x0308, 0x0066, 0x0067} (corresponding to the string Äfg, where Ä is represented as an A followed by a combining diaeresis), the function should return 2, because the first two runes form the first character, Ä. If I took only the first rune, I would get A, which is incorrect.

I need an answer to this question because the name I want to anonymize might begin with an accented character and I don't want to remove the accent.

Answer 1

Score: 2

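You can try the following function (inspired by "Go language string length"). It needs the regexp and unicode/utf8 packages:

// Compiled once: a non-mark rune plus trailing marks (category M), else any rune.
var graphemeRE = regexp.MustCompile(`\PM\pM*|.`)

// FirstGraphemeLen returns the rune count of the first grapheme (0 if str is empty).
func FirstGraphemeLen(str string) int {
	return utf8.RuneCountInString(graphemeRE.FindString(str))
}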

See this example:

r := []rune{0x0041, 0x0308, 0x0066, 0x0041, 0x0308, 0x0067}
s := string(r)
fmt.Println(s, len(r), FirstGraphemeLen(s))

Output:

ÄfÄg 6 2

The string has 6 runes, but its first grapheme uses only 2.
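
To illustrate what the pattern does and does not cover, here is a minimal sketch that returns the first grapheme itself rather than its length. The names firstGraphemeRE and firstGrapheme are hypothetical and not part of the original answer. Treating a grapheme as "one base rune plus the marks that follow it" is only an approximation: full grapheme cluster segmentation as defined by Unicode also covers cases such as emoji joined with zero-width joiners, which this pattern does not handle.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstGraphemeRE (hypothetical name) matches a non-mark rune followed by any
// number of marks (category M), or failing that, any single rune.
var firstGraphemeRE = regexp.MustCompile(`\PM\pM*|.`)

// firstGrapheme is a hypothetical helper: it returns the first grapheme of s
// under the "base rune plus marks" approximation, or "" for empty input.
func firstGrapheme(s string) string {
	return firstGraphemeRE.FindString(s)
}

func main() {
	inputs := []string{
		"A\u0308fg",       // decomposed: A + U+0308 COMBINING DIAERESIS
		"\u00C4fg",        // precomposed: U+00C4 is a single rune
		"a\u0308\u0301b", // two combining marks stacked on one base rune
	}
	for _, s := range inputs {
		g := firstGrapheme(s)
		fmt.Printf("%s: %d rune(s)\n", g, len([]rune(g)))
	}
}
```

The three inputs yield first graphemes of 2, 1, and 3 runes respectively, so the same visible letter can have different rune counts depending on how the name was encoded.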


The OP, FUZxxl, took another approach based on unicode.IsMark(r):

> IsMark reports whether the rune is a mark character (category M).
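
As a sketch of how unicode.IsMark can answer the original question directly for a []rune, the following counts the first rune plus every mark that immediately follows it. The helper name firstClusterLen is hypothetical, and the choice to count a leading mark as a character of its own is an assumption, not something the question specifies.

```go
package main

import (
	"fmt"
	"unicode"
)

// firstClusterLen is a hypothetical helper: it returns how many runes make up
// the first rune of rs plus the combining marks (category M) that follow it.
// It returns 0 for an empty slice.
func firstClusterLen(rs []rune) int {
	if len(rs) == 0 {
		return 0
	}
	n := 1 // always take the first rune, even if it is itself a mark (assumption)
	for n < len(rs) && unicode.IsMark(rs[n]) {
		n++
	}
	return n
}

func main() {
	// A + combining diaeresis, then f and g: the example from the question.
	rs := []rune{0x0041, 0x0308, 0x0066, 0x0067}
	n := firstClusterLen(rs)
	fmt.Println(n, string(rs[:n]))
}
```

For the question's input this reports 2, and rs[:n] holds the A together with its diaeresis.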

The source (from FUZxxl's play.golang.org) includes:

// take one character including all modifiers from the last name
r, _, err := ln.ReadRune()
if err != nil {
	/* ... */
}

aln = append(aln, r)

for {
	r, _, err = ln.ReadRune()
	if err != nil {
		goto done
	}

	if !unicode.IsMark(r) {
		break
	}

	aln = append(aln, r)
}

aln = append(aln, '.')
/* ... */
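
Putting the pieces together, one possible shape for the whole function is sketched below. This is not FUZxxl's actual code, which is only excerpted above. It assumes that words are separated only by ASCII spaces and hyphens, which is enough for the three examples in the question but is a simplification for real names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isSep reports whether r separates words. Treating only ' ' and '-' as
// separators is an assumption made for this sketch.
func isSep(r rune) bool {
	return r == ' ' || r == '-'
}

// Anonymize abbreviates every word of name to its first character (the first
// rune plus any combining marks after it) followed by a period. Separators
// are copied through unchanged.
func Anonymize(name string) string {
	var b strings.Builder
	rs := []rune(name)
	for i := 0; i < len(rs); {
		if isSep(rs[i]) {
			b.WriteRune(rs[i])
			i++
			continue
		}
		// Keep the first rune of the word and the marks attached to it.
		b.WriteRune(rs[i])
		i++
		for i < len(rs) && unicode.IsMark(rs[i]) {
			b.WriteRune(rs[i])
			i++
		}
		b.WriteByte('.')
		// Skip the remainder of the word.
		for i < len(rs) && !isSep(rs[i]) {
			i++
		}
	}
	return b.String()
}

func main() {
	names := []string{"Müller", "von der Linden", "Meyer-Schulze", "A\u0308rmel"}
	for _, name := range names {
		fmt.Println(Anonymize(name))
	}
}
```

Working on a []rune instead of a reader avoids the need to put back the first non-mark rune that ends the inner loop, at the cost of converting the whole name up front.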

Posted by huangapple on December 24, 2014, 05:54:52
Link: https://go.coder-hub.com/27628574.html