Shared GAE datastore, Go <-> Java, regexp.FindStringIndex index shifting (byte-index vs utf-8-char-index)


Question


Short version:
The first call below prints 3, which makes sense: in Go a string is basically a slice of bytes, and it takes three bytes to represent this character in UTF-8. How can I get len and the regexp functions to work in terms of characters, not bytes?

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	fmt.Println(len("ウ"))                    // prints 3 (bytes)
	fmt.Println(utf8.RuneCountInString("ウ")) // prints 1 (characters)
}

Background:

I'm saving text into the GAE datastore using JDO (Java).

Then I process the text using Go. Specifically, I use regexp.FindStringIndex and save the resulting indexes to the datastore.

Then, back in Java land, I send the unmodified text and the indexes to the GWT client via JSON.

Somewhere along the way the indexes are 'shifting', so by the time they reach the client, they are off.

The issue seems to be related to character encoding. I assume Java and Go interpret the indexes differently (UTF-8 characters vs. bytes). I see references to runes in the regexp package.

I think I can either make regexp.FindStringIndex return character indexes in Go, or make the GWT client understand the UTF-8 byte indexes.

Any suggestions? I should be using UTF-8 in case I need to internationalize the app in the future, right?

Thanks

EDIT:

Also, when I was finding the index using Java on the server, things just worked.

On the client (GWT) I'm using text.substring(start, end).

TEST:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// FindStringIndex returns byte offsets; "a" starts at byte 9 here.
	fmt.Print(regexp.MustCompile(`a`).FindStringIndex("ウィキa")[1])
}

The code outputs 10, not 4.

The plan is to get 4 from FindStringIndex instead. Any ideas?

Update 2: Position Conversion

package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

func main() {
	s := "ab日aba本語ba"
	byteIndex := regexp.MustCompile(`a`).FindAllStringIndex(s, -1)
	fmt.Println(byteIndex) // [[0 1] [5 6] [7 8] [15 16]]

	offset := 0
	// posMap maps the starting byte position of each character to the
	// amount to subtract to get its character position.
	posMap := make([]int, len(s))
	for pos, char := range s {
		fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
		posMap[pos] = offset
		offset += utf8.RuneLen(char) - 1
	}
	fmt.Println("posMap =", posMap)
	for pos, value := range byteIndex {
		fmt.Printf("pos:%d value:%d subtract %d\n", pos, value, posMap[value[0]])
		// Both ends use the start offset, which is only correct while the
		// match itself is single-byte (ASCII), as it is for `a`.
		value[1] -= posMap[value[0]]
		value[0] -= posMap[value[0]]
	}
	fmt.Println(byteIndex) // [[0 1] [3 4] [5 6] [9 10]]
}

*** Update 2 ***

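	// Alternative to the first loop above: derive the offset from the gap
	// between consecutive character start positions.
	lastPos := -1
	for pos, char := range s {
		offset += pos - lastPos - 1
		fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
		posMap[pos] = offset
		lastPos = pos
	}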

Answer 1

Score: 4


As you've probably gathered, Go and Java treat strings differently. In Java, a string is a sequence of UTF-16 code units (char values), so for text in the Basic Multilingual Plane its indexes are effectively character indexes; in Go, a string is a sequence of bytes, conventionally UTF-8. Text manipulation functions in Go decode UTF-8 code points when necessary, but since the string is represented as bytes, the indexes they return and accept are byte indexes, not character indexes.
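To make the byte/character distinction concrete, here is a minimal sketch using the test string from the question. The helper name runeStarts is made up for this illustration; it relies only on the fact that ranging over a Go string yields the byte offset at which each character starts.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStarts (a hypothetical helper for this example) returns the byte
// offset at which each character (rune) of s begins.
func runeStarts(s string) []int {
	starts := make([]int, 0, utf8.RuneCountInString(s))
	for i := range s { // range over a string yields byte offsets of rune starts
		starts = append(starts, i)
	}
	return starts
}

func main() {
	s := "ウィキa"
	fmt.Println(len(s))                    // 10: length in bytes
	fmt.Println(utf8.RuneCountInString(s)) // 4: length in characters
	fmt.Println(runeStarts(s))             // [0 3 6 9]
}
```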

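As you observe in the comments, you can try a RuneReader with FindReaderIndex to get indexes in characters rather than bytes. strings.Reader implements RuneReader, so you can use strings.NewReader to wrap a string in one. Verify the result on your Go version, though: as far as I can tell, current releases advance match positions by the byte width that ReadRune reports, in which case FindReaderIndex also yields byte indexes.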
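A quick way to check what FindReaderIndex reports on a given Go version is to print its result next to FindStringIndex for the same input. This is only a sketch; readerIndex is a hypothetical wrapper name, and no expected output is shown because the point is to compare the two lines yourself.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// readerIndex (hypothetical wrapper) runs the pattern against s through a
// strings.Reader, which implements io.RuneReader.
func readerIndex(pattern, s string) []int {
	re := regexp.MustCompile(pattern)
	return re.FindReaderIndex(strings.NewReader(s))
}

func main() {
	s := "ウィキa"
	fmt.Println("FindStringIndex:", regexp.MustCompile(`a`).FindStringIndex(s))
	fmt.Println("FindReaderIndex:", readerIndex(`a`, s))
}
```

If both lines print the same pair, the reader variant is counting bytes as well, and the conversion has to be done separately.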

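Another option is to take the substring you want the length of in characters and pass it to utf8.RuneCountInString, which returns the number of characters in a UTF-8 string. (utf8.RuneLen is a different function: it returns the number of bytes needed to encode a single rune.) If FindReaderIndex does give you character indexes directly, that is probably more efficient, however.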
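The substring-counting approach could look roughly like the following. It is a sketch rather than a definitive implementation; runeIndex is a hypothetical helper name, and it assumes loc comes from FindStringIndex on the same string, so both offsets fall on character boundaries.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// runeIndex (hypothetical helper) converts a byte-based match location, as
// returned by FindStringIndex, into character-based indexes.
func runeIndex(s string, loc []int) []int {
	if loc == nil {
		return nil // no match
	}
	start := utf8.RuneCountInString(s[:loc[0]])             // characters before the match
	end := start + utf8.RuneCountInString(s[loc[0]:loc[1]]) // plus characters in the match
	return []int{start, end}
}

func main() {
	s := "ウィキa"
	loc := regexp.MustCompile(`a`).FindStringIndex(s)
	fmt.Println(loc)               // [9 10]: byte indexes
	fmt.Println(runeIndex(s, loc)) // [3 4]: character indexes
}
```

Counting the prefix for every match rescans the string each time, so for many matches a single pass that builds a position map, like the one in the question, scales better. Also note that Java's substring counts UTF-16 code units, so characters outside the Basic Multilingual Plane, which occupy two code units, would need a further adjustment on top of this character count.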

Posted by huangapple on April 13, 2012, 06:59:36
Source: https://go.coder-hub.com/10133044.html