Umlauts and slices

Question

I'm having some trouble while reading a file which has a fixed column length format. Some columns may contain umlauts.

Umlauts seem to use 2 bytes instead of one. This is not the behaviour I was expecting. Is there any kind of function which returns a substring? Slice does not seem to work in this case.

Here's some sample code:

http://play.golang.org/p/ZJ1axy7UXe

umlautsString := "Rhön"
fmt.Println(len(umlautsString))
fmt.Println(umlautsString[0:4])

Prints:

5
Rhö

Answer 1

Score: 12

In Go, slicing a string counts bytes, not runes. This is why "Rhön"[0:3] gives you Rh and only the first byte of ö.

Go represents characters as runes (Unicode code points) because UTF-8 encodes a single character in one to four bytes, which lets it cover a far bigger range of characters than a one-byte encoding.

If you want to slice a string with the [] syntax, convert the string to []rune first.
Example (on play):

umlautsString := "Rhön"
runes := []rune(umlautsString) // one element per code point, not per byte
fmt.Println(string(runes[0:3])) // Rhö

Noteworthy: the Go blog has a post about string representation in Go.
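Applied to the fixed-width file from the question, the []rune conversion could be wrapped in a small helper. This is only a sketch: substr is a made-up name, not a standard library function, and the record layout (6 columns for a name, 4 for a code) is invented for the example.

```go
package main

import "fmt"

// substr is a hypothetical helper, not part of the standard library.
// It returns the runes of s in the half-open range [start, end) and
// clamps both indices instead of panicking on out-of-range values.
func substr(s string, start, end int) string {
	runes := []rune(s)
	if start < 0 {
		start = 0
	}
	if end > len(runes) {
		end = len(runes)
	}
	if start >= end {
		return ""
	}
	return string(runes[start:end])
}

func main() {
	// Assumed layout: columns 0-5 hold the name, columns 6-9 hold the code.
	line := "Rhön  0042"

	fmt.Printf("%q\n", substr(line, 0, 6))  // "Rhön  "
	fmt.Printf("%q\n", substr(line, 6, 10)) // "0042"
}
```

This only works if the file's column widths are defined in characters. If they are defined in bytes, plain byte slicing is the right tool.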

Answer 2

Score: 3

You can convert the string to []rune and work with that:

package main

import "fmt"

func main() {
	umlautsString := "Rhön"

	// len on a string counts bytes.
	fmt.Println(len(umlautsString)) // 5

	subStrRunes := []rune(umlautsString)

	// len on a []rune counts code points.
	fmt.Println(len(subStrRunes)) // 4

	fmt.Println(string(subStrRunes[0:4])) // Rhön
}

http://play.golang.org/p/__WfitzMOJ

Hope that helps!
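Converting to []rune copies the whole string. If that matters, one possible alternative is to walk the string once with range, which yields the byte offset of each rune, and then slice the original string at those offsets. The function name sliceRunes is made up for this sketch.

```go
package main

import "fmt"

// sliceRunes is a hypothetical helper. It returns the runes of s in the
// half-open range [start, end) without building a []rune first.
func sliceRunes(s string, start, end int) string {
	if start < 0 {
		start = 0
	}
	if end <= start {
		return ""
	}

	// Default both offsets to the end of the string so that indices
	// past the last rune yield an empty or shortened result.
	from, to := len(s), len(s)

	i := 0 // index of the current rune
	for off := range s {
		if i == start {
			from = off
		}
		if i == end {
			to = off
			break
		}
		i++
	}
	return s[from:to]
}

func main() {
	fmt.Println(sliceRunes("Rhön", 0, 3)) // Rhö
	fmt.Println(sliceRunes("Rhön", 2, 4)) // ön
}
```

For short column values the []rune conversion is simpler and usually fast enough, so this variant is mainly of interest for long lines.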

Answer 3

Score: 0

Another option is the utf8string package:

package main

import "golang.org/x/exp/utf8string"

func main() {
	s := utf8string.NewString("🧡💛💚💙💜")
	// example 1: count runes, not bytes
	n := s.RuneCount()
	println(n == 5)
	// example 2: slice by rune index
	t := s.Slice(0, 2)
	println(t == "🧡💛")
}

https://pkg.go.dev/golang.org/x/exp/utf8string

Posted by huangapple on 2013-10-18 00:33:52
Link to this article: https://go.coder-hub.com/19432380.html