String prefix of requested length in golang working with utf-8 symbols


Question

Is there an elegant way to crop a string and create a pretty string prefix in Go? I have this function as a starting point:

func prettyCrop(in string, cropLength int) string {
	if len(in) < cropLength {
		return in
	} else {
		in = in[0:cropLength]
		in = strings.TrimRightFunc(in, func(r rune) bool {
			if r == ' ' {
				return true
			}
			return false
		})
		return in + "…"
	}
}

It works well enough for English text, but has problems with anything more complicated. See this example:

prettyCrop("čřč čřč", 8) // čř?…

TrimRightFunc does not work as I expect here; I expect the result to be čřč. Why does this function not return a valid UTF-8 string? Is there a library for this? How can I fix it? Is there a better solution?

Answer 1

Score: 2

The problem is that slicing a string slices the UTF-8 encoded bytes that represent it, not its characters (runes). If the string contains characters that take multiple bytes in UTF-8, cutting it at an arbitrary byte index can split a character and produce an invalid UTF-8 sequence.
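To make the byte-versus-rune difference concrete, here is a minimal sketch using the string from the question. The helper name isValidPrefix is made up for this illustration; it only wraps the standard utf8.ValidString check around a byte slice of the input.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidPrefix (hypothetical helper) reports whether the first n bytes
// of s still form valid UTF-8. It slices by byte index, like in[0:cropLength].
func isValidPrefix(s string, n int) bool {
	return utf8.ValidString(s[:n])
}

func main() {
	s := "čřč čřč"

	// len counts bytes; č and ř each take 2 bytes in UTF-8.
	fmt.Println(len(s))                    // 13
	fmt.Println(utf8.RuneCountInString(s)) // 7

	// Byte index 7 is a rune boundary (just after the space).
	fmt.Println(isValidPrefix(s, 7)) // true

	// Byte index 8 falls in the middle of the fourth "č".
	fmt.Println(isValidPrefix(s, 8)) // false
	fmt.Printf("%q\n", s[:8])        // "čřč \xc4"
}
```

The dangling \xc4 byte is not a space, so TrimRightFunc leaves it in place and the result is no longer valid UTF-8.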

Assuming cropLength is meant to be a character limit (and not a byte-count limit), you should first convert the string to a []rune and operate on that:

func prettyCrop(in string, cropLength int) string {
	in2 := []rune(in)
	if len(in2) < cropLength {
		return in
	} else {
		in2 = in2[:cropLength]
		in = strings.TrimRightFunc(string(in2), func(r rune) bool {
			if r == ' ' {
				return true
			}
			return false
		})
		return in + "…"
	}
}

Testing it:

for i := 0; i < 7; i++ {
	fmt.Println(prettyCrop("čřč čřč", i))
}

Output (try it on the Go Playground):

…
č…
čř…
čřč…
čřč…
čřč č…
čřč čř…

Performance notes:

The example above is not tuned for performance, because:

  • It converts the whole in string to []rune, when it would be enough to find the first cropLength runes with a for range loop.
  • Calling strings.TrimRightFunc() requires converting the []rune back to a string, and a string concatenation is then performed to build the result. Both conversions can be avoided by looping manually and creating only the single string that is returned.
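One possible way to act on these notes is sketched below. It is not the answer's original code: the name prettyCropFast is made up, and it iterates over the string with for range instead of a []rune, since the loop index is always a byte offset on a rune boundary. Note one behavioral difference, which is an assumption about what is wanted: a string of exactly cropLength runes is returned unchanged, whereas the function above appends "…" to it.

```go
package main

import (
	"fmt"
	"strings"
)

// prettyCropFast (hypothetical name) keeps at most cropLength runes of in,
// trims trailing spaces, and appends "…" only if something was cut off.
func prettyCropFast(in string, cropLength int) string {
	count := 0
	for i := range in {
		if count == cropLength {
			// i is the byte offset where rune number cropLength starts,
			// so in[:i] always ends on a rune boundary.
			return strings.TrimRight(in[:i], " ") + "…"
		}
		count++
	}
	// The string has at most cropLength runes: nothing to crop.
	return in
}

func main() {
	for i := 0; i <= 7; i++ {
		fmt.Println(prettyCropFast("čřč čřč", i))
	}
}
```

The only allocation here is the final concatenation; slicing and strings.TrimRight both return substrings of the input without copying.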

  • Posted by huangapple on April 10, 2017 at 21:44:01
  • Source: https://go.coder-hub.com/43324908.html