July 20, 2015, 16:32:18 · go

Split a String into 10kb chunks in Go

Question

I have a large string in Go and I'd like to split it up into smaller chunks. Each chunk should be at most 10kb. The chunks should be split on runes (not in the middle of a rune).

What is the idiomatic way to do this in go? Should I just be looping over the range of the string bytes? Am I missing some helpful stdlib packages?

Answer 1

Score: 8

Use utf8.RuneStart (package unicode/utf8) to scan backward for a rune boundary, then slice the string at that boundary.

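var chunks []string
for len(s) > 10000 {
	i := 10000
	// Step back from the limit until s[i] starts a rune. The lower
	// bound stops the scan if the bytes are not valid UTF-8.
	for i >= 10000-utf8.UTFMax && !utf8.RuneStart(s[i]) {
		i--
	}
	chunks = append(chunks, s[:i])
	s = s[i:]
}
// Whatever is left fits in a single chunk.
if len(s) > 0 {
	chunks = append(chunks, s)
}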

With this approach, the application examines only a few bytes at each chunk boundary instead of the entire string.

The code is written to guarantee progress when the string is not a valid UTF-8 encoding. You might want to handle this situation as an error or split the string in a different way.

playground example

Answer 2

Score: 3

The idiomatic way to split a string (or any slice or array) is by using slicing. Since you want to split by rune you'd have to loop through the entire string since you don't know in advance how many bytes each slice will contain.

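slices := []string{}
count := 0     // runes in the current chunk
lastIndex := 0 // byte offset where the current chunk starts
for i := range longString {
	if count == 10000 {
		slices = append(slices, longString[lastIndex:i])
		lastIndex = i
		count = 0
	}
	count++
}
// Append the remaining runes, if any.
if lastIndex < len(longString) {
	slices = append(slices, longString[lastIndex:])
}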

Warning: I have not run or tested this code, but it conveys the general principles. Looping over a string with range iterates over the runes, not the bytes, automatically decoding the UTF-8 for you. Using the slice operator [] represents your new strings as substrings of longString, which means that no bytes from the string need to be copied.

Note that i is the byte index in the string and may be incremented by more than 1 in each loop iteration.

EDIT:

Sorry, I didn't see you wanted to limit the number of bytes, not Unicode code points. You can implement that as well relatively easily.

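slices := []string{}
lastIndex := 0 // byte offset where the current chunk starts
lastI := 0     // byte offset of the previous rune
for i := range longString {
	if i-lastIndex > 10000 {
		slices = append(slices, longString[lastIndex:lastI])
		lastIndex = lastI
	}
	lastI = i
}
// The tail can still exceed the limit by part of the final rune.
if len(longString)-lastIndex > 10000 {
	slices = append(slices, longString[lastIndex:lastI])
	lastIndex = lastI
}
if lastIndex < len(longString) {
	slices = append(slices, longString[lastIndex:])
}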

A working example at play.golang.org, which also takes care of the leftover bytes at the end of the string.

Answer 3

Score: 1

Check out this code:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func init() {
	rand.Seed(time.Now().UnixNano())
}

var alphabet = []rune{
	'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
	'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'æ', 'ø', 'å', 'A', 'B', 'C',
	'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S',
	'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'Æ', 'Ø', 'Å',
}

// randomString returns a string of n runes drawn from alphabet.
func randomString(n int) string {
	b := make([]rune, n)
	for k := range b {
		b[k] = alphabet[rand.Intn(len(alphabet))]
	}
	return string(b)
}

// Each comment gives the value a byte must have after masking.
const (
	chunkSize int  = 100
	lead4Mask byte = 0xF8 // must equal 0xF0
	lead3Mask byte = 0xF0 // must equal 0xE0
	lead2Mask byte = 0xE0 // must equal 0xC0
	lead1Mask byte = 0x80 // must equal 0x00
	trailMask byte = 0xC0 // must equal 0x80
)

// longestPrefix returns a cut position <= n that lies on a rune
// boundary. It assumes s is valid UTF-8 and len(s) >= n.
func longestPrefix(s string, n int) int {
	for i := n - 1; ; i-- {
		// A single-byte (ASCII) rune: cut right after it.
		if (s[i] & lead1Mask) == 0x00 {
			return i + 1
		}
		// A lead byte of a multi-byte rune: cut right before it.
		if (s[i] & trailMask) != 0x80 {
			return i
		}
	}
}

func main() {
	s := randomString(100000)
	for len(s) > chunkSize {
		cut := longestPrefix(s, chunkSize)
		fmt.Println(s[:cut])
		s = s[cut:]
	}
	fmt.Println(s)
}

I'm using the Danish/Norwegian alphabet to generate a random string of 100000 runes.

Then, the "magic" lies in longestPrefix. To help you with the bit-masking part, refer to the following graphic:

[Image not preserved]

The program prints the resulting chunks, each at most chunkSize bytes long, one per line. Note that a multi-byte rune ending exactly at the limit is pushed to the next chunk, so a chunk can be slightly shorter than the longest possible prefix.
