String prefix of a requested length in Go, working with UTF-8 symbols
Question
Is there an elegant way to crop a string and create pretty string prefixes in Go? I have this function as a starting point:
	func prettyCrop(in string, cropLength int) string {
		if len(in) < cropLength {
			return in
		} else {
			in = in[0:cropLength]
			in = strings.TrimRightFunc(in, func(r rune) bool {
				if r == ' ' {
					return true
				}
				return false
			})
			return in + "…"
		}
	}
It works well enough for English text, but has problems with anything more complicated. See this example:
	prettyCrop("čřč čřč", 8) // čř?…
TrimRightFunc is not working as I expect here; I expect it to return čřč. Why is this function not returning a valid UTF-8 string? Is there a library for this? How can I fix it? Is there a better solution?
Answer 1
Score: 2
The problem is that slicing a string slices the UTF-8 encoded byte sequence that represents the string, not its characters (runes). This also means that if the string contains characters that are represented by multiple bytes in UTF-8, slicing the string may cut a character in half and result in an invalid UTF-8 sequence.
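To make the byte-versus-rune difference concrete, here is a minimal sketch. The helper names cutBytes and cutRunes are made up for this illustration, and both assume n does not exceed the length of the input (otherwise the slice expression panics). It also assumes the letters in the sample string are stored as precomposed code points, two bytes each.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// cutBytes (hypothetical helper) slices s at a byte offset, exactly as s[:n] does.
func cutBytes(s string, n int) string { return s[:n] }

// cutRunes (hypothetical helper) keeps the first n runes (code points) of s.
func cutRunes(s string, n int) string { return string([]rune(s)[:n]) }

func main() {
	s := "čřč čřč"

	// Each accented letter takes 2 bytes in UTF-8, the space takes 1.
	fmt.Println(len(s), utf8.RuneCountInString(s)) // 13 7

	// Byte offset 5 falls in the middle of the third letter.
	fmt.Println(utf8.ValidString(cutBytes(s, 5))) // false

	// Cutting after 5 runes always lands on a rune boundary.
	fmt.Println(utf8.ValidString(cutRunes(s, 5))) // true
}
```

The trailing half-character left by the byte slice is decoded as an invalid rune, not as a space, which is why TrimRightFunc in the question leaves it in place.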
Assuming cropLength is meant to be a character limit (and not a byte-count limit), you should first convert the string to a []rune and operate on that:
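	func prettyCrop(in string, cropLength int) string {
		// Work on runes so the cut never lands inside a multi-byte character.
		in2 := []rune(in)
		if len(in2) < cropLength {
			return in
		} else {
			in2 = in2[:cropLength]
			// Drop trailing spaces left at the cut point before adding the ellipsis.
			in = strings.TrimRightFunc(string(in2), func(r rune) bool {
				if r == ' ' {
					return true
				}
				return false
			})
			return in + "…"
		}
	}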
Testing it:
	for i := 0; i < 7; i++ {
		fmt.Println(prettyCrop("čřč čřč", i))
	}
Output (try it on the Go Playground):
…
č…
čř…
čřč…
čřč…
čřč č…
čřč čř…
Performance notes:
The above example is not performance-friendly, because:
- It converts the whole in string to []rune; it would be enough to get just its first cropLength runes with a for range.
- Calling strings.TrimRightFunc() requires converting the []rune back to string, and then a string concatenation is performed to generate the result. This could be avoided by manually looping over the []rune and creating only the single string that is returned.
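One way the two notes above could be applied is sketched below. The name prettyCropRange is made up for this illustration and is not part of the original answer. It walks the string with for range, which yields the byte offset of each rune, so the cut can be made on the original string without building a []rune. Note two deliberate differences from the version above, both assumptions on my part: a string with exactly cropLength runes is returned unchanged (nothing was cropped, so no ellipsis is added), and a negative cropLength is treated as 0.

```go
package main

import (
	"fmt"
	"strings"
)

// prettyCropRange (hypothetical name) keeps at most cropLength runes of in,
// trims trailing spaces at the cut, and appends an ellipsis if it cropped.
func prettyCropRange(in string, cropLength int) string {
	if cropLength < 0 {
		cropLength = 0 // assumption: treat negative limits as zero
	}
	count := 0
	for i := range in {
		if count == cropLength {
			// i is the byte offset of the first rune past the limit,
			// so in[:i] is a valid prefix that shares memory with in.
			return strings.TrimRight(in[:i], " ") + "…"
		}
		count++
	}
	// The string has cropLength runes or fewer: nothing to crop.
	return in
}

func main() {
	for i := 0; i < 7; i++ {
		fmt.Println(prettyCropRange("čřč čřč", i))
	}
}
```

For the limits 0 through 6 this should print the same seven lines as the test above. The loop stops as soon as the limit is reached, and the only allocation is the final concatenation with the ellipsis.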