Split a String into 10kb chunks in Go

Question

I have a large string in Go and I'd like to split it up into smaller chunks. Each chunk should be at most 10kb. The chunks should be split on runes (not in the middle of a rune).

What is the idiomatic way to do this in Go? Should I just loop over the bytes of the string? Am I missing some helpful stdlib packages?

Answer 1

Score: 8

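Use utf8.RuneStart to scan backward for a rune boundary, then slice the string at that boundary.

// s is the input string; requires import "unicode/utf8".
var chunks []string
for len(s) > 10000 {
	i := 10000
	// Step back until s[i] starts a rune; the utf8.UTFMax bound
	// stops the scan when the string is not valid UTF-8.
	for i >= 10000 - utf8.UTFMax && !utf8.RuneStart(s[i]) {
		i--
	}
	chunks = append(chunks, s[:i])
	s = s[i:]
}
if len(s) > 0 {
	chunks = append(chunks, s) // remaining tail, at most 10000 bytes
}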

With this approach, the application examines only a few bytes at each chunk boundary instead of the entire string.

The code is written to guarantee progress when the string is not a valid UTF-8 encoding. You might want to handle this situation as an error or split the string in a different way.
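If you would rather treat invalid UTF-8 as an error, as suggested above, the same idea can be wrapped in a function. This is only a sketch: the name splitChunks, its signature, and the error messages are assumptions made for illustration, not part of the original answer. Note that utf8.ValidString scans the whole string once, so it gives up the "few bytes per boundary" property in exchange for a clear failure mode.

```go
package main

import (
	"errors"
	"unicode/utf8"
)

// splitChunks is a hypothetical helper. It splits s into pieces of at most
// maxBytes bytes, cutting only at rune boundaries, and rejects invalid UTF-8.
func splitChunks(s string, maxBytes int) ([]string, error) {
	if maxBytes < utf8.UTFMax {
		// Assumption: every rune must be able to fit into a single chunk.
		return nil, errors.New("maxBytes must be at least utf8.UTFMax")
	}
	if !utf8.ValidString(s) {
		return nil, errors.New("input is not valid UTF-8")
	}
	var chunks []string
	for len(s) > maxBytes {
		i := maxBytes
		// s is valid UTF-8, so a rune start is found within three steps back.
		for !utf8.RuneStart(s[i]) {
			i--
		}
		chunks = append(chunks, s[:i])
		s = s[i:]
	}
	if len(s) > 0 {
		chunks = append(chunks, s)
	}
	return chunks, nil
}
```

For the question's use case you would call splitChunks(text, 10000) and check the returned error before using the chunks.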

playground example

Answer 2

Score: 3

The idiomatic way to split a string (or any slice or array) is by slicing. Since you want to split by rune, you have to loop through the entire string, because you don't know in advance how many bytes each slice will contain.

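slices := []string{}
count := 0     // runes seen since the last cut
lastIndex := 0 // byte index where the current chunk starts
for i := range longString {
    if count == 10000 {
        // i is the byte index of the rune that starts the next chunk.
        slices = append(slices, longString[lastIndex:i])
        lastIndex = i
        count = 0
    }
    count++
}
if lastIndex < len(longString) {
    slices = append(slices, longString[lastIndex:]) // leftover runes
}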

Warning: I have not run or tested this code, but it conveys the general principles. Ranging over a string iterates over its runes, not its bytes, automatically decoding the UTF-8 for you. Using the slice operator [] represents your new strings as substrings of longString, which means that no bytes from the string need to be copied.

Note that i is the byte index in the string and may be incremented by more than 1 in each loop iteration.
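To make the point about byte indices concrete, here is a small sketch. The helper name runeOffsets is made up for this illustration; it simply records the index that range reports for each rune.

```go
package main

// runeOffsets is a hypothetical helper that returns the byte index at which
// each rune of s starts, as reported by a range loop over the string.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}
```

For the string "aæb" this yields the offsets 0, 1 and 3, because æ occupies two bytes in UTF-8.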

EDIT:

Sorry, I didn't see you wanted to limit the number of bytes, not Unicode code points. You can implement that as well relatively easily.

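slices := []string{}
lastIndex := 0 // byte index where the current chunk starts
lastI := 0     // byte index of the previous rune
for i := range longString {
    if i-lastIndex > 10000 {
        // The previous rune pushed the chunk past 10000 bytes, so cut before it.
        slices = append(slices, longString[lastIndex:lastI])
        lastIndex = lastI
    }
    lastI = i
}
// The last rune is not followed by another iteration, so check it here.
if len(longString)-lastIndex > 10000 {
    slices = append(slices, longString[lastIndex:lastI])
    lastIndex = lastI
}
if lastIndex < len(longString) {
    slices = append(slices, longString[lastIndex:])
}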

A working example at play.golang.org, which also takes care of the leftover bytes at the end of the string.

Answer 3

Score: 1

Check out this code:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func init() {
	rand.Seed(time.Now().UnixNano())
}

var alphabet = []rune{
	'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
	'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'æ', 'ø', 'å', 'A', 'B', 'C',
	'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S',
	'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'Æ', 'Ø', 'Å',
}

func randomString(n int) string {
	b := make([]rune, n)
	for k := range b {
		b[k] = alphabet[rand.Intn(len(alphabet))]
	}
	return string(b)
}

const (
	chunkSize int  = 100
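	lead4Mask byte = 0xF8 // lead byte of a 4-byte rune: b & lead4Mask == 0xF0
	lead3Mask byte = 0xF0 // lead byte of a 3-byte rune: b & lead3Mask == 0xE0
	lead2Mask byte = 0xE0 // lead byte of a 2-byte rune: b & lead2Mask == 0xC0
	lead1Mask byte = 0x80 // single-byte (ASCII) rune: b & lead1Mask == 0x00
	trailMask byte = 0xC0 // trail (continuation) byte: b & trailMask == 0x80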
)


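// longestPrefix returns the largest cut <= n at which s can be split without
// breaking a rune. The caller must ensure len(s) > n; s is assumed to be
// valid UTF-8.
func longestPrefix(s string, n int) int {
	// Start at s[n], the first byte after the limit: if it begins a rune,
	// the whole n-byte prefix is usable.
	for i := n; i > 0; i-- {
		// Any byte that is not a trail byte (10xxxxxx) starts a rune,
		// so s[:i] ends on a rune boundary.
		if (s[i] & trailMask) != 0x80 {
			return i
		}
	}
	return n // only reached for invalid UTF-8; cut at n to keep making progress
}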

func main() {
	s := randomString(100000)
	for len(s) > chunkSize {
		cut := longestPrefix(s, chunkSize)
		fmt.Println(s[:cut])
		s = s[cut:]
	}
	fmt.Println(s)
}

I'm using the Danish/Norwegian alphabet to generate a random string of 100000 runes.

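Then, the "magic" lies in longestPrefix. To help with the bit-masking part, recall how UTF-8 lays out its bytes: a single-byte (ASCII) rune is 0xxxxxxx, a trail byte is 10xxxxxx, and the lead bytes of 2-, 3-, and 4-byte runes are 110xxxxx, 1110xxxx, and 11110xxx.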

The program prints out the respective longest possible chunks <= chunkSize, one per line.
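The mask constants from the program can also be tried in isolation. The sketch below repeats them and adds a helper, classify, which is a made-up name for this illustration. It reports which bit pattern a byte matches; it does not check that the byte is legal in UTF-8 (some lead-byte values, such as 0xC0, match a pattern but never occur in valid text).

```go
package main

const (
	lead4Mask byte = 0xF8 // b & lead4Mask == 0xF0 for a 4-byte lead
	lead3Mask byte = 0xF0 // b & lead3Mask == 0xE0 for a 3-byte lead
	lead2Mask byte = 0xE0 // b & lead2Mask == 0xC0 for a 2-byte lead
	lead1Mask byte = 0x80 // b & lead1Mask == 0x00 for ASCII
	trailMask byte = 0xC0 // b & trailMask == 0x80 for a trail byte
)

// classify is a hypothetical helper that names the UTF-8 bit pattern of b.
func classify(b byte) string {
	switch {
	case b&lead1Mask == 0x00:
		return "ascii"
	case b&trailMask == 0x80:
		return "trail"
	case b&lead2Mask == 0xC0:
		return "lead2"
	case b&lead3Mask == 0xE0:
		return "lead3"
	case b&lead4Mask == 0xF0:
		return "lead4"
	}
	return "invalid"
}
```

Only the "trail" case matters for finding a cut point: any byte that is not a trail byte begins a new rune, which is the same test utf8.RuneStart performs.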

huangapple
  • Posted on July 20, 2015, 16:32:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/31511952.html