2012-04-13 06:59:36 · go


Shared GAE datastore, Go <-> Java, regexp.FindStringIndex index shifting (byte-index vs utf-8-char-index)

Question

Short version:
This prints 3, which makes sense: in Go a string is essentially a slice of bytes, and this character takes three bytes to encode. How can I get len and the regexp functions to work in terms of characters rather than bytes?

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	fmt.Println(len("ウ"))                    // prints 3 (bytes)
	fmt.Println(utf8.RuneCountInString("ウ")) // prints 1 (runes)
}

Background:

I'm saving text into the GAE datastore using JDO (Java).

Then I'm processing the text in Go: I call regexp.FindStringIndex and save the resulting indexes to the datastore.

Then, back in Java land, I send the unmodified text and the indexes to the GWT client as JSON.

Somewhere along the way the indexes are 'shifting', so by the time they reach the client they are off.

The issue seems to be character encoding: I'm assuming Java and Go interpret the indexes into the text differently (UTF-8 characters vs. bytes). I see references to runes in the regexp package.

I think I can either make regexp.FindStringIndex return character indexes in Go, or make the GWT client understand the UTF-8 byte indexes.

Any suggestions? I should be using UTF-8 in case I need to internationalize the app in the future, right?

Thanks

EDIT:

Also, when I was finding the index using Java on the server, things just worked.

On the client (GWT) I'm using text.substring(start, end).

TEST:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Index [1] is the end of the match, measured in bytes.
	fmt.Print(regexp.MustCompile(`a`).FindStringIndex("ウィキa")[1])
}

The code outputs 10, not 4.

The plan is to get FindStringIndex to return 4. Any ideas?

Update 2: Position Conversion

package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

func main() {
	s := "ab日aba本語ba"
	byteIndex := regexp.MustCompile(`a`).FindAllStringIndex(s, -1)
	fmt.Println(byteIndex) // [[0 1] [5 6] [7 8] [15 16]]

	offset := 0
	posMap := make([]int, len(s)) // for each rune's starting byte: how much to subtract to get its char position
	for pos, char := range s {
		fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
		posMap[pos] = offset
		offset += utf8.RuneLen(char) - 1
	}
	fmt.Println("posMap =", posMap)
	for pos, value := range byteIndex {
		fmt.Printf("pos:%d value:%d subtract %d\n", pos, value, posMap[value[0]])
		// The end index reuses the start's offset, which is only correct for single-byte matches.
		value[1] -= posMap[value[0]]
		value[0] -= posMap[value[0]]
	}
	fmt.Println(byteIndex) // [[0 1] [3 4] [5 6] [9 10]]
}

*** Update 2 ***

	lastPos := -1
	for pos, char := range s {
		offset += pos - lastPos - 1
		fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
		posMap[pos] = offset
		lastPos = pos
	}

Answer 1

Score: 4

As you've probably gathered, Go and Java treat strings differently. In Java, a string is a sequence of UTF-16 code units (char values), so String indexes count code units, which coincide with characters for text in the Basic Multilingual Plane; in Go, a string is a sequence of bytes. Text manipulation functions in Go understand UTF-8 code points when necessary, but since the string is represented as bytes, the indexes they return and work with are byte indexes, not character indexes.

As you observe in the comments, you can try a RuneReader with FindReaderIndex to get indexes in characters rather than bytes. strings.Reader implements RuneReader, so you can use strings.NewReader to wrap a string in one. Note that ReadRune reports each rune's size in bytes, so check the indexes you get back against a multi-byte test string before relying on this.

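Another option is to take the substring you want the length of in characters and pass it to utf8.RuneCountInString, which returns the number of characters in the UTF-8 string. (utf8.RuneLen is a different function: it returns the number of bytes needed to encode a single rune.) Using a RuneReader is probably more efficient, however.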
