Efficient appending to a variable-length container of strings (Golang)

Question

The problem:

I need to apply multiple regexes to each line of a big log file (like several GB long), gather the non-empty matches and put them all in an array (for serialization and sending it over the network).

Slices are not much help if the answer to this question holds:

> If the slice does not have sufficient capacity, append will need to allocate new memory and copy the old one over. For slices with <1024 elements, it will double the capacity, for slices with >1024 elements it will increase it by factor 1.25.

Since there can be literally hundreds of thousands of regex matches, I can't really predict the length / capacity of a slice. I can't make it too big either, "just in case", because this will waste memory (or will it? If the memory allocator is smart enough not to allocate too much memory that is never written to, maybe I could use a huge slice capacity without much harm?).

So I'm thinking about the following alternative (roughly sketched right after the list):

  1. have a doubly-linked list of matches (http://golang.org/pkg/container/list/)
  2. calc its length (will len() work?)
  3. preallocate a slice of this capacity
  4. copy string pointers to this slice
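
A minimal sketch of that alternative, just to make the idea concrete (the hard-coded strings here are only placeholders standing in for regex matches):

package main

import (
	"container/list"
	"fmt"
)

func main() {
	// 1. collect matches in a doubly-linked list
	l := list.New()
	for _, m := range []string{"match-1", "match-2", "match-3"} {
		l.PushBack(m)
	}

	// 2. list.Len() is O(1), so sizing the slice is cheap
	out := make([]string, 0, l.Len())

	// 3. + 4. copy the string headers into the preallocated slice
	for e := l.Front(); e != nil; e = e.Next() {
		out = append(out, e.Value.(string))
	}

	fmt.Println(len(out), cap(out), out)
}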

Is there a less laborious way of achieving this goal in Go (with ~O(1) append complexity)?

(golang newbie here of course)

Answer 1

Score: 15

append()'s average (amortized) cost is already O(1) because it grows the array by a percentage each time. As the array gets larger, growing it gets costlier but proportionately rarer. A 10M-item slice will be 10x more expensive to grow than a 1M-item slice, but since the extra capacity we're allocating is proportional to the size, it'll also be 10x as many append(slice, item) calls until the next time it grows. The increasing cost and decreasing frequency of reallocations cancel out, leaving the average cost constant, i.e., O(1).

The same idea applies to other languages' dynamically-sized arrays, too: Microsoft's std::vector implementation apparently grows the array by 50% each time, for example. Amortized O(1) doesn't mean you pay nothing for allocations, only that you continue paying at the same average rate as your array gets bigger.
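
If you want to watch that growth happen, a few lines of Go will print the capacity each time append reallocates; the exact growth factors depend on your Go version, but the proportional pattern is what matters:

package main

import "fmt"

func main() {
	var s []string
	oldCap := 0
	for i := 0; i < 2000000; i++ {
		s = append(s, "x")
		if c := cap(s); c != oldCap {
			// append reallocated: report the new capacity and the growth ratio
			if oldCap > 0 {
				fmt.Printf("len=%d cap=%d (grew %.2fx)\n", len(s), c, float64(c)/float64(oldCap))
			}
			oldCap = c
		}
	}
}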

On my laptop, I could run a million slice = append(slice, someStaticString)s in 77ms. One reason it's quick, which siritinga noted below, is that "copying" the string to enlarge the array is really just copying the string header (a pointer/length pair), not copying the contents. 100,000 string headers is still under 2MB to copy, not a big deal compared to the other quantities of data you're working with.

container/list turned out ~3x slower for me in a microbenchmark; linked-list append is also constant time, of course, but I imagine append has a lower constant because it can usually just write to a couple of words of memory and doesn't have to allocate a list item, etc. The timing code won't work in the Playground, but you can copy it locally and run it to see for yourself: http://play.golang.org/p/uYyMScmOjX

Sometimes, you can usefully pre-allocate space to avoid reallocations/copies (in this example, using make([]string, 0, 1000000) takes the runtime from ~77ms to ~10ms), but, of course, often you just don't have enough info about the expected data size and so on to eke out worthwhile gains, and you're better off leaving it to the built-in algorithm.
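
A rough micro-benchmark along those lines, using the standard testing package, might look like this (absolute numbers will vary by machine and Go version; it's the relative ordering that is interesting):

package appendbench

import (
	"container/list"
	"testing"
)

var someStaticString = "some static string"

func BenchmarkAppend(b *testing.B) {
	var matches []string
	for i := 0; i < b.N; i++ {
		matches = append(matches, someStaticString)
	}
	_ = matches
}

func BenchmarkAppendPrealloc(b *testing.B) {
	matches := make([]string, 0, b.N)
	for i := 0; i < b.N; i++ {
		matches = append(matches, someStaticString)
	}
	_ = matches
}

func BenchmarkList(b *testing.B) {
	l := list.New()
	for i := 0; i < b.N; i++ {
		l.PushBack(someStaticString)
	}
}

Run it with go test -bench . in a package of its own.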


But you're asking a more specific question here about a grep-like application (and thanks for asking a detailed question with context). For that, the bottom-line recommendation is that if you're searching gigs of logs, it's probably best to avoid buffering the whole output in RAM at all.

You could write something to stream the results as a single function: logparser.Grep(in io.Reader, out io.Writer, patterns []regexp.Regexp); you could alternatively make the out argument a chan []byte or a func(match []byte) (err error) if you don't want the code that sends results to be too enmeshed with the grep code.
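
A bare-bones sketch of that streaming interface could look like the following; the logparser.Grep name and shape come from the suggestion above, and I've used []*regexp.Regexp (pointers) since that's the form regexp.MustCompile hands back:

package logparser

import (
	"bufio"
	"io"
	"regexp"
)

// Grep scans in line by line and streams every non-empty match of any
// pattern to out, one match per line, without buffering all results in RAM.
func Grep(in io.Reader, out io.Writer, patterns []*regexp.Regexp) error {
	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		line := scanner.Bytes()
		for _, p := range patterns {
			for _, m := range p.FindAll(line, -1) {
				if len(m) == 0 {
					continue
				}
				if _, err := out.Write(m); err != nil {
					return err
				}
				if _, err := out.Write([]byte{'\n'}); err != nil {
					return err
				}
			}
		}
	}
	return scanner.Err()
}

(For very long log lines you would also want scanner.Buffer to raise bufio.Scanner's default 64 KB line limit; the same loop works just as well if out is a channel or a callback instead of an io.Writer.)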

(On []byte vs. string: a []byte seems to do the job here and avoids []byte<=>string conversions when you do I/O, so I'd prefer that. I don't know what all you're doing, though, and if you need string it's fine.)

If you do keep the whole match list in RAM, be aware that keeping around a reference to part of a big string or byte slice keeps the whole source string/slice from being garbage collected. So if you go that route, then counterintuitively you may actually want to copy matches to avoid keeping all of the source log data in RAM.
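
Here's a tiny, self-contained illustration of that pitfall; the 1 MB "line" is a stand-in for a big chunk of log data:

package main

import "fmt"

func main() {
	// Pretend this is one huge buffer of log data held in memory.
	line := make([]byte, 1<<20)
	copy(line, "xERRORx lots more log text follows...")

	// A sub-slice keeps the whole 1 MB backing array reachable.
	pinned := line[1:6] // "ERROR"

	// A copy retains only the few bytes we actually need.
	copied := append([]byte(nil), line[1:6]...)

	fmt.Println(string(pinned), cap(pinned)) // "ERROR", cap ~1M: the big array can't be collected
	fmt.Println(string(copied), cap(copied)) // "ERROR", small cap: its own independent allocation
}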

Answer 2

Score: 3

I tried to distill your question into a very simple example.

Since there can be "hundreds of thousands of regex matches", I did a large initial allocation of 1 M (1024 * 1024) entries for the matches slice capacity. A slice is a reference type. A slice header 'struct' has a length, a capacity, and a pointer, for a total of 24 (3 * 8) bytes on a 64-bit OS. The initial allocation for a slice of 1 M entries is therefore only 24 MB (24 bytes * 1 M). If there are more than 1 M entries, a new slice with a capacity of 1.25 (1 + 1/4) M entries will be allocated and the existing 1 M slice header entries (24 MB) will be copied to it.

In summary, you can avoid much of the overhead of many appends by initially over-allocating the slice capacity. The bigger memory problem is all the data that is saved and referenced for each match. The far bigger CPU-time problem is the time taken to perform the regexp.FindAll's.

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

var searches = []*regexp.Regexp{
	regexp.MustCompile("configure"),
	regexp.MustCompile("unknown"),
	regexp.MustCompile("PATH"),
}

// Preallocate capacity for 1 M match headers (about 24 MB) up front.
var matches = make([][]byte, 0, 1024*1024)

func main() {
	logName := "config.log"
	log, err := os.Open(logName)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer log.Close()
	scanner := bufio.NewScanner(log)
	for scanner.Scan() {
		line := scanner.Bytes()
		for _, s := range searches {
			for _, m := range s.FindAll(line, -1) {
				// Copy each match out of the scanner's reusable buffer;
				// scanner.Bytes() is overwritten by the next Scan.
				matches = append(matches, append([]byte(nil), m...))
			}
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	// Output matches
	fmt.Println(len(matches))
	for i, m := range matches {
		fmt.Println(string(m))
		if i >= 16 {
			break
		}
	}
}
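
A quick way to confirm the 24-byte slice-header figure used in the explanation above (on a 64-bit platform) is unsafe.Sizeof, which measures the header itself rather than the backing array:

package main

import (
	"fmt"
	"unsafe"
)

func main() {
	matches := make([][]byte, 0, 1024*1024)
	fmt.Println(unsafe.Sizeof(matches)) // 24 on a 64-bit platform: pointer + len + cap
}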
