Go's LeftStr, RightStr, SubStr
Question
I believe Go has no LeftStr(str, n) (take at most the first n characters), RightStr(str, n) (take at most the last n characters), or SubStr(str, pos, n) (take the first n characters after pos) functions, so I tried to write my own:
// Left takes at most the first num characters (bytes, as written).
func Left(str string, num int) string {
	if num <= 0 {
		return ``
	}
	if num > len(str) {
		num = len(str)
	}
	return str[:num]
}

// Right takes at most the last num characters (bytes, as written).
func Right(str string, num int) string {
	if num <= 0 {
		return ``
	}
	max := len(str)
	if num > max {
		num = max
	}
	num = max - num
	return str[num:]
}
But I believe these functions will give incorrect output when the string contains Unicode characters. What is the fastest solution for these functions? Is using a for range loop the only way?
Answer 1
Score: 2
As already mentioned in the comments, combining characters, modifying runes, and other multi-rune "characters" can cause difficulties.
Anyone interested in Unicode handling in Go should probably read the Go Blog articles "Strings, bytes, runes and characters in Go" and "Text normalization in Go". In particular, the latter discusses the golang.org/x/text/unicode/norm package, which can help with some of this.
You can consider several increasingly accurate (or increasingly Unicode-aware) levels of splitting the first (or last) "n characters" from a string.
- Just use n bytes. This may split in the middle of a rune, but it is O(1), very simple, and in many cases you know the input consists only of single-byte runes. E.g. str[:n].
- Split after n runes. This may split in the middle of a character. It can be done easily, at the expense of copying and converting, with just string([]rune(str)[:n]). You can avoid the conversion and copying by using the unicode/utf8 package's DecodeRuneInString (and DecodeLastRuneInString) functions to get the length of each of the first n runes in turn and then returning str[:sum] (O(n), no allocation).
- Split after the n'th "boundary". One way to do this is to use norm.NFC.FirstBoundaryInString(str) repeatedly, or norm.Iter, to find the byte position to split at and then return str[:pos].
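The second level (splitting after n runes without allocating) could be sketched roughly as follows. The names LeftRunes, RightRunes, and SubRunes are made up for illustration and are not part of any package; only utf8.DecodeRuneInString and utf8.DecodeLastRuneInString come from the standard library. This is one possible shape under those assumptions, and it still splits combining sequences, exactly as described for this level.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// LeftRunes (hypothetical name) returns at most the first n runes of s.
// It walks the string rune by rune and slices it, so it does not allocate.
func LeftRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	end := 0
	for ; n > 0 && end < len(s); n-- {
		// Invalid bytes decode as utf8.RuneError with size 1,
		// so the loop always makes progress.
		_, size := utf8.DecodeRuneInString(s[end:])
		end += size
	}
	return s[:end]
}

// RightRunes (hypothetical name) returns at most the last n runes of s.
func RightRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	start := len(s)
	for ; n > 0 && start > 0; n-- {
		_, size := utf8.DecodeLastRuneInString(s[:start])
		start -= size
	}
	return s[start:]
}

// SubRunes (hypothetical name) skips the first pos runes of s and
// returns at most the next n runes.
func SubRunes(s string, pos, n int) string {
	skipped := LeftRunes(s, pos)
	return LeftRunes(s[len(skipped):], n)
}

func main() {
	s := "h\u00e9llo, 世界"
	fmt.Println(LeftRunes(s, 2))   // first two runes
	fmt.Println(RightRunes(s, 2))  // last two runes
	fmt.Println(SubRunes(s, 1, 3)) // three runes after the first
}
```

Each function is O(n) in the number of runes requested rather than in the length of the string, which is why RightRunes decodes from the end instead of counting from the start.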
Consider the displayed string "cafés", which could be represented in Go code as "cafés", "caf\u00E9s", or "caf\xc3\xa9s", all of which result in the identical six bytes. Alternatively, it could be represented as "cafe\u0301s" or "cafe\xcc\x81s", which both result in the identical seven bytes.
The first "method" above may split those into "caf\xc3"+"\xa9s" and "cafe\xcc"+"\x81s".
The second may split them into "caf\u00E9"+"s" ("café"+"s") and "cafe"+"\u0301s" ("cafe"+"́s").
The third should split them into "caf\u00E9"+"s" and "cafe\u0301"+"s" (both shown as "café"+"s").