Go's LeftStr, RightStr, SubStr

Question

I believe Go has no LeftStr(str,n) (take at most the first n characters), RightStr(str,n) (take at most the last n characters), or SubStr(str,pos,n) (take the first n characters after pos) functions, so I tried to write my own:

// take at most n first characters
func Left(str string, num int) string {
	if num <= 0 {
		return ``
	}
	if num > len(str) {
		num = len(str)
	}
	return str[:num]
}

// take at most last n characters
func Right(str string, num int) string {
	if num <= 0 {
		return ``
	}
	max := len(str)
	if num > max {
		num = max
	}
	num = max - num
	return str[num:]
}

But I believe these functions will give incorrect output when the string contains Unicode characters. What is the fastest solution for these functions? Is a for range loop the only way?

Answer 1

Score: 2

As already mentioned in the comments, combining characters, modifying runes, and other multi-rune "characters" can cause difficulties.

Anyone interested in Unicode handling in Go should probably read the Go Blog articles
"Strings, bytes, runes and characters in Go"
and "Text normalization in Go".
In particular, the latter discusses the golang.org/x/text/unicode/norm package, which can help with some of this.

You can consider several increasingly accurate (that is, increasingly Unicode-aware) ways of splitting the first (or last) "n characters" from a string.

  1. Just use n bytes.
    This may split in the middle of a rune, but it is O(1), very simple, and in many cases you know the input consists only of single-byte runes.
    E.g. str[:n].

  2. Split after n runes.
    This may split in the middle of a character. It is easy to do with string([]rune(str)[:n]), but at the cost of converting and copying the string.
    You can avoid the conversion and copying by using the unicode/utf8 package's DecodeRuneInString (and DecodeLastRuneInString) functions to get the length of each of the first n runes in turn and then return str[:sum] (O(n), no allocation).

  3. Split after the nth "boundary".
    One way to do this is to call
    norm.NFC.FirstBoundaryInString(str) repeatedly,
    or to use norm.Iter, to find the byte position to split at and then return str[:pos].
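
The second level can be sketched with only the standard library. The following is a minimal sketch, not a definitive implementation: the names LeftRunes, RightRunes, and SubRunes are hypothetical helpers chosen for illustration, and they count runes (code points), so they can still split a base letter from a following combining mark.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// LeftRunes returns at most the first n runes of s.
// It only slices s, so it does not allocate.
func LeftRunes(s string, n int) string {
	end := 0
	for ; n > 0 && end < len(s); n-- {
		// size is the byte length of the rune starting at s[end].
		_, size := utf8.DecodeRuneInString(s[end:])
		end += size
	}
	return s[:end]
}

// RightRunes returns at most the last n runes of s.
func RightRunes(s string, n int) string {
	start := len(s)
	for ; n > 0 && start > 0; n-- {
		// size is the byte length of the rune ending at s[start].
		_, size := utf8.DecodeLastRuneInString(s[:start])
		start -= size
	}
	return s[start:]
}

// SubRunes skips the first pos runes of s and returns at most the next n runes.
func SubRunes(s string, pos, n int) string {
	skipped := LeftRunes(s, pos)
	return LeftRunes(s[len(skipped):], n)
}

func main() {
	s := "héllo, 世界"
	fmt.Println(LeftRunes(s, 2))   // hé
	fmt.Println(RightRunes(s, 2))  // 世界
	fmt.Println(SubRunes(s, 1, 4)) // éllo
}
```

Each helper is O(n) in the number of runes requested rather than in the length of the string, and a non-positive count yields the empty string.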

Consider the displayed string "cafés", which could be written in Go code as "cafés", "caf\u00E9s", or "caf\xc3\xa9s"; all three result in the identical six bytes. Alternatively, it could be written as "cafe\u0301s" or "cafe\xcc\x81s", which both result in the identical seven bytes.

The first "method" above may split those into "caf\xc3"+"\xa9s" and "cafe\xcc"+"\x81s".

The second may split them into "caf\u00E9"+"s" ("café"+"s") and "cafe"+"\u0301s" ("cafe"+"́s").

The third should split them into "caf\u00E9"+"s" and "cafe\u0301"+"s" (both shown as "café"+"s").
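
For this particular example, the behaviour of the third level can be approximated with the standard library alone by never splitting in front of a combining mark. The sketch below is an assumption-laden simplification, not the norm package's boundary logic and not full grapheme-cluster segmentation; LeftMarked is a hypothetical name used only for illustration.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// LeftMarked returns the first n non-mark runes of s, each together with
// any combining marks (Unicode category M) that directly follow it.
// Simplification: it ignores other multi-rune sequences such as emoji
// joined with zero-width joiners.
func LeftMarked(s string, n int) string {
	if n <= 0 {
		return ""
	}
	end := 0
	for end < len(s) {
		r, size := utf8.DecodeRuneInString(s[end:])
		if !unicode.IsMark(r) { // r starts a new base character
			if n == 0 {
				break
			}
			n--
		}
		end += size
	}
	return s[:end]
}

func main() {
	// %+q prints the result with non-ASCII runes escaped.
	fmt.Printf("%+q\n", LeftMarked("caf\u00E9s", 4)) // "caf\u00e9"
	fmt.Printf("%+q\n", LeftMarked("cafe\u0301s", 4)) // "cafe\u0301"
}
```

Both calls return the text displayed as "café", which matches the split described for the third method.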

huangapple
  • Posted on April 2, 2015 at 14:24:14
  • When reposting, please keep the link to this article: https://go.coder-hub.com/29406316.html