Convert regexp.FindStringIndex results to character indices

Question

The (*regexp.Regexp).FindStringIndex(s string) []int method returns the byte indices of the leftmost match. In simple cases, these locations correspond to "character positions" in the string. However, certain characters foil this assumption. For example:

package main

import (
	"fmt"
	"regexp"
)

var (
	re   = regexp.MustCompile(`bbb`)
	str1 = "aaa bbb ccc"
	str2 = "aaa✌️bbb ccc"
)

func main() {
	fmt.Println(str1, re.FindStringIndex(str1))
	fmt.Println(str2, re.FindStringIndex(str2))
}

Result:

aaa bbb ccc [4 7]
aaa✌️bbb ccc [9 12]

Why is this, and how can one convert the FindStringIndex result so that it locates characters within a string rather than bytes?

EDIT: To be clear, in my specific use case these character indices are sent to JavaScript to manipulate HTML, and the JS needs to know the offsets of substrings in terms of characters, not bytes. If further manipulation were happening in Go, it would be easy to slice into the strings using the raw results of FindStringIndex, but that is not the case here.

Answer 1

Score: 1

This is because strings in Go are (by default/convention) encoded in UTF-8, and the character you wrote occupies more than one byte in UTF-8 encoding.

This follows the normal convention for Go, where offsets into strings act the same as they do for byte slices (i.e. they are byte offsets, not character offsets). This is not specific to the regexp package; it's how strings work in Go in general.
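To make the byte layout concrete, here is a minimal sketch that lists the byte offset at which each code point of the question's second string begins. The helper name runeStarts is made up for this illustration, and the string is written with escapes (assuming the emoji is U+270C followed by the variation selector U+FE0F, which matches the byte offsets in the question).

```go
package main

import "fmt"

// runeStarts (hypothetical helper) returns the byte offset at which each
// code point of s begins.
func runeStarts(s string) []int {
	var starts []int
	// Ranging over a string yields the byte index of each code point.
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	// Same content as str2 in the question, written with escapes.
	s := "aaa\u270c\ufe0fbbb ccc"

	fmt.Println(len(s))        // length in bytes: 16
	fmt.Println(runeStarts(s)) // [0 1 2 3 6 9 10 11 12 13 14 15]
	fmt.Println(s[9:12])       // byte offsets slice the string directly: bbb
}
```

The jump from 3 to 6 to 9 shows the two three-byte code points that sit in front of "bbb", which is why the match starts at byte 9.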

If you really wish to determine the offset in characters, you can use one of the functions from the unicode/utf8 package to count each character. Alternatively, a range loop over a string does this for you as part of its built-in behavior, since it steps through the string one code point at a time. This snippet determines the character (code point) offset in a string, given a byte offset:

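// charOffset converts a byte offset into s to a code point offset.
func charOffset(s string, byteOffset int) int {
	cc := 0
	for i := range s {
		if i >= byteOffset {
			return cc
		}
		cc++
	}
	// byteOffset is at or past the end of s.
	return cc
}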
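The unicode/utf8 route mentioned above could look roughly like the following sketch, which counts the code points in front of each end of the match with utf8.RuneCountInString. The helper name runeIndex is an assumption made for this example, and it expects a non-nil result from FindStringIndex.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// runeIndex (hypothetical helper) converts a [start, end) byte range into
// code point offsets. It assumes loc is a non-nil FindStringIndex result for s.
func runeIndex(s string, loc []int) []int {
	start := utf8.RuneCountInString(s[:loc[0]])
	end := start + utf8.RuneCountInString(s[loc[0]:loc[1]])
	return []int{start, end}
}

func main() {
	re := regexp.MustCompile(`bbb`)
	// Same content as str2 in the question, written with escapes.
	s := "aaa\u270c\ufe0fbbb ccc"

	loc := re.FindStringIndex(s)
	fmt.Println(loc, runeIndex(s, loc)) // [9 12] [5 8]
}
```

Note that the result is [5 8] rather than [4 7]: the emoji consists of two code points, so counting code points is still not the same as counting what a reader sees as characters.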

However, it is important to understand that you normally don't need to count characters. The general idea is that strings in Go are treated as opaque byte sequences for as long as possible, and UTF-8 decoding is done "lazily", only for the specific string operations that require it. Odds are that whatever code comes after this and requires a character offset can be refactored, to good or better effect, to use a byte offset instead.

Answer 2

Score: 0

Without repeating lengthy articles on the subject of character encodings, the short version is that some characters have more complex data representations than others and require more bytes.

Although Go has the concept of a rune, which is a single Unicode code point, that's not necessarily equivalent to a "character". The correct term for a "user-perceived character" is actually a grapheme or grapheme cluster.
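As a small standard-library illustration of that distinction, the sketch below lists the code points that make up the emoji from the question. It assumes the emoji is U+270C followed by the variation selector U+FE0F, and the helper name codePoints is made up for this example.

```go
package main

import "fmt"

// codePoints (hypothetical helper) returns the code points of s in U+XXXX form.
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	// One user-perceived character: victory hand plus variation selector-16.
	s := "\u270c\ufe0f"

	fmt.Println(len(s), codePoints(s)) // 6 [U+270C U+FE0F]
}
```

A single visible symbol therefore spans six bytes and two runes, so neither a byte count nor a rune count lines up with what a reader would call one character.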

Now that we have our terminology straight, the task is to map the byte indices from FindStringIndex into grapheme cluster indices. I don't know of a way to do this in the Go standard library, but I found a package called Uniseg that allows us to identify grapheme clusters within a string. From the readme:

> In Go, strings are read-only slices of bytes. They can be turned into Unicode code points using the for loop or by casting: []rune(str). However, multiple code points may be combined into one user-perceived character or what the Unicode specification calls "grapheme cluster".
>
> This package provides a tool to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.

The readme also contains excellent examples of strings vs. code points vs. graphemes that help demystify the subject.

So how do we use this package to solve our problem?

package main

import (
	"fmt"
	"regexp"

	"github.com/rivo/uniseg"
)

var (
	re   = regexp.MustCompile(`bbb`)
	str1 = "aaa bbb ccc"
	str2 = "aaa✌️bbb ccc"
)

func main() {
	fmt.Println(str1, re.FindStringIndex(str1), mapCoords(str1, re.FindStringIndex(str1)))
	fmt.Println(str2, re.FindStringIndex(str2), mapCoords(str2, re.FindStringIndex(str2)))
}

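// mapCoords converts a [start, end) byte range, as returned by FindStringIndex,
// into grapheme cluster indices. It assumes a non-empty match whose ends both
// fall on grapheme cluster boundaries, and returns nil when there is no match.
func mapCoords(s string, byteCoords []int) (graphemeCoords []int) {
	if byteCoords == nil {
		return nil
	}
	graphemeCoords = make([]int, 2)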
	gr := uniseg.NewGraphemes(s)
	graphemeIndex := -1
	for gr.Next() {
		graphemeIndex++
		a, b := gr.Positions()
		if a == byteCoords[0] {
			graphemeCoords[0] = graphemeIndex
		}
		if b == byteCoords[1] {
			graphemeCoords[1] = graphemeIndex + 1
			break
		}
	}
	return
}

Result:

aaa bbb ccc [4 7] [4 7]
aaa✌️bbb ccc [9 12] [4 7]
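Since the question mentions handing the offsets to JavaScript, it may be worth checking which unit the receiving code really indexes by: JavaScript string indices (for example in String.prototype.slice) count UTF-16 code units, which match neither bytes nor grapheme clusters in general. If that is the unit needed, a standard-library sketch along these lines might do; the helper name utf16Offset is an assumption made for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf16"
)

// utf16Offset (hypothetical helper) converts a byte offset into s to an
// offset counted in UTF-16 code units, the unit JavaScript strings index by.
func utf16Offset(s string, byteOffset int) int {
	return len(utf16.Encode([]rune(s[:byteOffset])))
}

func main() {
	re := regexp.MustCompile(`bbb`)
	// Same content as str2 in the question, written with escapes.
	s := "aaa\u270c\ufe0fbbb ccc"

	loc := re.FindStringIndex(s)
	fmt.Println(utf16Offset(s, loc[0]), utf16Offset(s, loc[1])) // 5 8
}
```

For this particular string the UTF-16 offsets happen to equal the code point offsets, because both code points of the emoji lie in the Basic Multilingual Plane. A code point above U+FFFF would take two UTF-16 code units and make the two counts diverge.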


huangapple
  • Posted on May 5, 2022 at 03:28:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/72118471.html