Split a String into 10kb chunks in Go
Question
I have a large string in Go and I'd like to split it up into smaller chunks. Each chunk should be at most 10kb. The chunks should be split on runes (not in the middle of a rune).
What is the idiomatic way to do this in Go? Should I just be looping over the range of the string bytes? Am I missing some helpful stdlib packages?
Answer 1
Score: 8
Use RuneStart to scan for a rune boundary. Slice the string at the boundary.
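	// Requires: import "unicode/utf8"
	var chunks []string
	for len(s) > 10000 {
		i := 10000
		// Back up to the nearest rune start, but by no more than a few
		// bytes, so the loop still terminates on invalid UTF-8.
		for i >= 10000-utf8.UTFMax && !utf8.RuneStart(s[i]) {
			i--
		}
		chunks = append(chunks, s[:i])
		s = s[i:]
	}
	// Whatever remains fits in a single chunk.
	if len(s) > 0 {
		chunks = append(chunks, s)
	}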
Using this approach, the application examines only a few bytes at the chunk boundaries instead of the entire string.
The code is written to guarantee progress when the string is not a valid UTF-8 encoding. You might want to handle this situation as an error or split the string in a different way.
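For reference, the loop above can be wrapped in a reusable function. The following is only a minimal sketch: the name chunkString and the limit parameter are assumptions introduced here, not part of the answer, and limit is assumed to be well above utf8.UTFMax (4 bytes), as 10000 is.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// chunkString splits s into chunks of at most limit bytes, cutting only at
// rune boundaries when s is valid UTF-8. The name and the limit parameter
// are hypothetical; limit is assumed to be well above utf8.UTFMax.
func chunkString(s string, limit int) []string {
	var chunks []string
	for len(s) > limit {
		i := limit
		// Back up until s[i] starts a rune, giving up after a few bytes
		// so that invalid UTF-8 cannot stall the loop.
		for i >= limit-utf8.UTFMax && !utf8.RuneStart(s[i]) {
			i--
		}
		chunks = append(chunks, s[:i])
		s = s[i:]
	}
	if len(s) > 0 {
		chunks = append(chunks, s)
	}
	return chunks
}

func main() {
	// Five copies of U+00E9 (2 bytes each) make a 10-byte string.
	s := "\u00e9\u00e9\u00e9\u00e9\u00e9"
	for _, c := range chunkString(s, 7) {
		fmt.Printf("%q (%d bytes)\n", c, len(c))
	}
}
```

With a limit of 7 bytes, byte 7 falls in the middle of a rune, so the first cut moves back to byte 6 and the string is split into a 6-byte chunk and a 4-byte chunk.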
Answer 2
Score: 3
The idiomatic way to split a string (or any slice or array) is by using slicing. Since you want to split by rune you'd have to loop through the entire string since you don't know in advance how many bytes each slice will contain.
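	slices := []string{}
	count := 0     // runes seen in the current chunk
	lastIndex := 0 // byte offset where the current chunk starts
	for i := range longString {
		if count == 10000 {
			// i is a byte offset, so this slice holds exactly 10000 runes.
			slices = append(slices, longString[lastIndex:i])
			lastIndex = i
			count = 0
		}
		count++
	}
	// The runes from lastIndex to the end are not appended here.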
Warning: I have not run or tested this code, but it conveys the general principles. Looping over a string loops over the runes and not the bytes, automatically decoding the UTF-8 for you. Using the slice operator [] represents your new strings as subslices of longString, which means that no bytes from the string need to be copied.
Note that i is the byte index in the string and may be incremented by more than 1 in each loop iteration.
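To make the point about the index concrete, here is a small sketch that records the index values a range loop produces. The helper name runeOffsets is hypothetical and exists only for illustration.

```go
package main

import "fmt"

// runeOffsets returns the byte offset at which each rune of s starts,
// exactly as reported by the index variable of a range loop.
// runeOffsets is a hypothetical helper used only for illustration.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	// 'a' is 1 byte, U+00E9 is 2 bytes, U+4E16 is 3 bytes, 'b' is 1 byte.
	fmt.Println(runeOffsets("a\u00e9\u4e16b")) // [0 1 3 6]
}
```

The index jumps by the encoded width of each rune, which is why slicing at i never cuts through a rune.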
EDIT:
Sorry, I didn't see you wanted to limit the number of bytes, not Unicode code points. You can implement that as well relatively easily.
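	slices := []string{}
	lastIndex := 0 // byte offset where the current chunk starts
	lastI := 0     // byte offset of the previous rune
	for i := range longString {
		if i-lastIndex > 10000 {
			// The rune starting at lastI no longer fits, so cut before it.
			slices = append(slices, longString[lastIndex:lastI])
			lastIndex = lastI
		}
		lastI = i
	}
	// The bytes from lastIndex to the end are not appended here.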
A working example is available at play.golang.org; it also takes care of the leftover bytes at the end of the string.
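Since the playground link is not reproduced here, the following is one possible sketch of a complete version, including the leftover bytes. It is not the linked example: the name splitByBytes, its signature, and the inner cut helper are assumptions, and limit is assumed to be at least 4 so that any single rune fits in a chunk.

```go
package main

import "fmt"

// splitByBytes splits s into chunks of at most limit bytes without cutting
// through a rune. The name and signature are hypothetical; limit is assumed
// to be at least 4 so that any single rune fits in a chunk.
func splitByBytes(s string, limit int) []string {
	var chunks []string
	start, prev := 0, 0 // start of the current chunk, previous rune boundary
	cut := func(boundary int) {
		// If the chunk would exceed limit at this boundary, close it at
		// the previous boundary, which is known to fit.
		if boundary-start > limit {
			chunks = append(chunks, s[start:prev])
			start = prev
		}
		prev = boundary
	}
	for i := range s { // i visits the byte offset of every rune start
		cut(i)
	}
	cut(len(s)) // the end of the string is a boundary too
	if start < len(s) {
		chunks = append(chunks, s[start:]) // leftover bytes
	}
	return chunks
}

func main() {
	// 1 + 2 + 3 + 1 = 7 bytes in total.
	for _, c := range splitByBytes("a\u00e9\u4e16b", 4) {
		fmt.Printf("%q (%d bytes)\n", c, len(c))
	}
}
```

Treating len(s) as a final boundary covers the case where the last rune would push the tail past the limit.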
Answer 3
Score: 1
Check out this code:
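	package main

	import (
		"fmt"
		"math/rand"
		"time"
	)

	func init() {
		// Explicit seeding; since Go 1.20 the global source is seeded
		// automatically and rand.Seed is deprecated.
		rand.Seed(time.Now().UnixNano())
	}

	var alphabet = []rune{
		'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
		'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'æ', 'ø', 'å', 'A', 'B', 'C',
		'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S',
		'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'Æ', 'Ø', 'Å',
	}

	// randomString returns a string of n runes drawn at random from alphabet.
	func randomString(n int) string {
		b := make([]rune, n)
		for k := range b {
			b[k] = alphabet[rand.Intn(len(alphabet))]
		}
		return string(b)
	}

	// For each mask, the comment gives the value of b&mask that identifies
	// the byte class. Only lead1Mask and trailMask are used below.
	const (
		chunkSize int  = 100
		lead4Mask byte = 0xF8 // b&mask == 0xF0: lead byte of a 4-byte rune
		lead3Mask byte = 0xF0 // b&mask == 0xE0: lead byte of a 3-byte rune
		lead2Mask byte = 0xE0 // b&mask == 0xC0: lead byte of a 2-byte rune
		lead1Mask byte = 0x80 // b&mask == 0x00: single-byte (ASCII) rune
		trailMask byte = 0xC0 // b&mask == 0x80: continuation byte
	)

	// longestPrefix returns a cut position <= n that does not split a rune.
	// It assumes s is valid UTF-8 with len(s) >= n. A multi-byte rune that
	// ends exactly at byte n-1 is left for the next chunk.
	func longestPrefix(s string, n int) int {
		for i := n - 1; ; i-- {
			if (s[i] & lead1Mask) == 0x00 {
				return i + 1 // ASCII byte: cut right after it
			}
			if (s[i] & trailMask) != 0x80 {
				return i // lead byte of a multi-byte rune: cut before it
			}
		}
	}

	func main() {
		s := randomString(100000)
		for len(s) > chunkSize {
			cut := longestPrefix(s, chunkSize)
			fmt.Println(s[:cut])
			s = s[cut:]
		}
		fmt.Println(s)
	}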
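I'm using the Danish/Norwegian alphabet to generate a random string of 100000 runes.
Then, the "magic" lies in longestPrefix. To help with the bit-masking part, recall the UTF-8 byte layout: single-byte (ASCII) runes have the form 0xxxxxxx, the lead bytes of 2-, 3-, and 4-byte runes have the forms 110xxxxx, 1110xxxx, and 11110xxx, and continuation bytes have the form 10xxxxxx.
The program prints the resulting chunks, each at most chunkSize bytes long, one per line.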
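Note that longestPrefix only inspects bytes up to index n-1, so a multi-byte rune that ends exactly at the limit is pushed to the next chunk. As a possible alternative, the byte just past the limit can be tested with the standard library's utf8.RuneStart instead of hand-written masks. This is a sketch under the assumptions that s is valid UTF-8 and len(s) > n; the name cutPoint is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// cutPoint returns the largest i <= n at which s can be cut without
// splitting a rune. It assumes s is valid UTF-8 and len(s) > n.
// cutPoint is a hypothetical name, not part of the original program.
func cutPoint(s string, n int) int {
	i := n
	// s[:i] ends on a rune boundary exactly when s[i] starts a rune.
	for i > 0 && !utf8.RuneStart(s[i]) {
		i--
	}
	return i
}

func main() {
	s := "a\u00e6\u00f8\u00e5" // "aæøå": 1 + 2 + 2 + 2 = 7 bytes
	fmt.Println(cutPoint(s, 3)) // 3: "aæ" fits exactly
	fmt.Println(cutPoint(s, 4)) // 3: byte 4 is in the middle of 'ø'
}
```

For the same input and n = 3, longestPrefix returns 1, because it stops at the lead byte of 'æ' without checking whether the whole rune fits.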