Efficient appending to a variable-length container of strings (Golang)

Question

The problem:

I need to apply multiple regexes to each line of a big log file (like several GB long), gather the non-empty matches and put them all in an array (for serialization and sending it over the network).

Slices are not much help if the answer to this question holds:

> If the slice does not have sufficient capacity, append will need to allocate new memory and copy the old one over. For slices with <1024 elements, it will double the capacity, for slices with >1024 elements it will increase it by factor 1.25.

Since there can be literally hundreds of thousands of regex matches, I can't really predict the length / capacity of a slice. I can't make it too big either, "just in case", because this will waste memory (or will it? If the memory allocator is smart enough not to allocate too much memory that is never written to, maybe I could use a huge slice capacity without much harm?).

So I'm thinking about the following alternative (roughly sketched right after the list):

  1. have a doubly-linked list of matches (http://golang.org/pkg/container/list/)
  2. calc its length (will len() work?)
  3. preallocate a slice of this capacity
  4. copy string pointers to this slice
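
A minimal sketch of that alternative, just to make the idea concrete (the hard-coded strings here are only placeholders standing in for regex matches):

package main

import (
	"container/list"
	"fmt"
)

func main() {
	// 1. collect matches in a doubly-linked list
	l := list.New()
	for _, m := range []string{"match-1", "match-2", "match-3"} {
		l.PushBack(m)
	}

	// 2. list.Len() is O(1), so sizing the slice is cheap
	out := make([]string, 0, l.Len())

	// 3. + 4. copy the string headers into the preallocated slice
	for e := l.Front(); e != nil; e = e.Next() {
		out = append(out, e.Value.(string))
	}

	fmt.Println(len(out), cap(out), out)
}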

Is there a less laborious way of achieving this goal in Go (with ~O(1) append complexity)?

(golang newbie here of course)

Answer 1

Score: 15

append()'s average (amortized) cost is already O(1) because it grows the array by a percentage each time. As the array gets larger, growing it gets costlier but proportionately rarer. A 10M-item slice will be 10x more expensive to grow than a 1M-item slice, but since the extra capacity we're allocating is proportional to the size, it'll also be 10x as many append(slice, item) calls until the next time it grows. The increasing cost and decreasing frequency of reallocations cancel out, leaving the average cost constant, i.e., O(1).

The same idea applies to other languages' dynamically-sized arrays, too: Microsoft's std::vector implementation apparently grows the array by 50% each time, for example. Amortized O(1) doesn't mean you pay nothing for allocations, only that you continue paying at the same average rate as your array gets bigger.
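
If you want to watch that growth happen, a few lines of Go will print the capacity each time append reallocates; the exact growth factors depend on your Go version, but the proportional pattern is what matters:

package main

import "fmt"

func main() {
	var s []string
	oldCap := 0
	for i := 0; i < 2000000; i++ {
		s = append(s, "x")
		if c := cap(s); c != oldCap {
			// append reallocated: report the new capacity and the growth ratio
			if oldCap > 0 {
				fmt.Printf("len=%d cap=%d (grew %.2fx)\n", len(s), c, float64(c)/float64(oldCap))
			}
			oldCap = c
		}
	}
}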

On my laptop, I could run a million slice = append(slice, someStaticString)s in 77ms. One reason it's quick, which siritinga noted below, is that "copying" the string to enlarge the array is really just copying the string header (a pointer/length pair), not copying the contents. 100,000 string headers is still under 2MB to copy, not a big deal compared to the other quantities of data you're working with.

container/list turned out ~3x slower for me in a microbenchmark; linked-list append is also constant time, of course, but I imagine append has a lower constant because it can usually just write to a couple of words of memory and doesn't have to allocate a list item, etc. The timing code won't work in the Playground, but you can copy it locally and run it to see for yourself: http://play.golang.org/p/uYyMScmOjX

Sometimes, you can usefully pre-allocate space to avoid reallocations/copies (in this example, using make([]string, 0, 1000000) takes the runtime from ~77ms to ~10ms), but, of course, often you just don't have enough info about the expected data size and so on to eke out worthwhile gains, and you're better off leaving it to the built-in algorithm.
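
A rough micro-benchmark along those lines, using the standard testing package, might look like this (absolute numbers will vary by machine and Go version; it's the relative ordering that is interesting):

package appendbench

import (
	"container/list"
	"testing"
)

var someStaticString = "some static string"

func BenchmarkAppend(b *testing.B) {
	var matches []string
	for i := 0; i < b.N; i++ {
		matches = append(matches, someStaticString)
	}
	_ = matches
}

func BenchmarkAppendPrealloc(b *testing.B) {
	matches := make([]string, 0, b.N)
	for i := 0; i < b.N; i++ {
		matches = append(matches, someStaticString)
	}
	_ = matches
}

func BenchmarkList(b *testing.B) {
	l := list.New()
	for i := 0; i < b.N; i++ {
		l.PushBack(someStaticString)
	}
}

Run it with go test -bench . in a package of its own.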


But you're asking a more specific question here about a grep-like application (and thanks for asking a detailed question with context). For that, the bottom-line recommendation is that if you're searching gigs of logs, it's probably best to avoid buffering the whole output in RAM at all.

You could write something to stream the results as a single function: logparser.Grep(in io.Reader, out io.Writer, patterns []regexp.Regexp); you could alternatively make the out argument a chan []byte or a func(match []byte) (err error) if you don't want the code that sends results to be too enmeshed with the grep code.
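
A bare-bones sketch of that streaming interface could look like the following; the logparser.Grep name and shape come from the suggestion above, and I've used []*regexp.Regexp (pointers) since that's the form regexp.MustCompile hands back:

package logparser

import (
	"bufio"
	"io"
	"regexp"
)

// Grep scans in line by line and streams every non-empty match of any
// pattern to out, one match per line, without buffering all results in RAM.
func Grep(in io.Reader, out io.Writer, patterns []*regexp.Regexp) error {
	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		line := scanner.Bytes()
		for _, p := range patterns {
			for _, m := range p.FindAll(line, -1) {
				if len(m) == 0 {
					continue
				}
				if _, err := out.Write(m); err != nil {
					return err
				}
				if _, err := out.Write([]byte{'\n'}); err != nil {
					return err
				}
			}
		}
	}
	return scanner.Err()
}

(For very long log lines you would also want scanner.Buffer to raise bufio.Scanner's default 64 KB line limit; the same loop works just as well if out is a channel or a callback instead of an io.Writer.)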

(On []byte vs. string: a []byte seems to do the job here and avoids []byte<=>string conversions when you do I/O, so I'd prefer that. I don't know what all you're doing, though, and if you need string it's fine.)

If you do keep the whole match list in RAM, be aware that keeping around a reference to part of a big string or byte slice keeps the whole source string/slice from being garbage collected. So if you go that route, then counterintuitively you may actually want to copy matches to avoid keeping all of the source log data in RAM.
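
Here's a tiny, self-contained illustration of that pitfall; the 1 MB "line" is a stand-in for a big chunk of log data:

package main

import "fmt"

func main() {
	// Pretend this is one huge buffer of log data held in memory.
	line := make([]byte, 1<<20)
	copy(line, "xERRORx lots more log text follows...")

	// A sub-slice keeps the whole 1 MB backing array reachable.
	pinned := line[1:6] // "ERROR"

	// A copy retains only the few bytes we actually need.
	copied := append([]byte(nil), line[1:6]...)

	fmt.Println(string(pinned), cap(pinned)) // "ERROR", cap ~1M: the big array can't be collected
	fmt.Println(string(copied), cap(copied)) // "ERROR", small cap: its own independent allocation
}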

Answer 2

Score: 3

I tried to distill your question into a very simple example.

Since there can be "hundreds of thousands of regex matches", I did a large initial allocation of 1 M (1024 * 1024) entries for the matches slice capacity. A slice is a reference type. A slice header 'struct' has a length, a capacity, and a pointer, for a total of 24 (3 * 8) bytes on a 64-bit OS. The initial allocation for a slice of 1 M entries is therefore only 24 MB (24 bytes * 1 M). If there are more than 1 M entries, a new slice with a capacity of 1.25 (1 + 1/4) M entries will be allocated and the existing 1 M slice header entries (24 MB) will be copied to it.

In summary, you can avoid much of the overhead of many appends by initially over-allocating the slice capacity. The bigger memory problem is all the data that is saved and referenced for each match. The far bigger CPU-time problem is the time taken to perform the regexp.FindAll's.

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

var searches = []*regexp.Regexp{
	regexp.MustCompile("configure"),
	regexp.MustCompile("unknown"),
	regexp.MustCompile("PATH"),
}

// Preallocate capacity for 1 M match headers (about 24 MB) up front.
var matches = make([][]byte, 0, 1024*1024)

func main() {
	logName := "config.log"
	log, err := os.Open(logName)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer log.Close()
	scanner := bufio.NewScanner(log)
	for scanner.Scan() {
		line := scanner.Bytes()
		for _, s := range searches {
			for _, m := range s.FindAll(line, -1) {
				// Copy each match out of the scanner's reusable buffer;
				// scanner.Bytes() is overwritten by the next Scan.
				matches = append(matches, append([]byte(nil), m...))
			}
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	// Output matches
	fmt.Println(len(matches))
	for i, m := range matches {
		fmt.Println(string(m))
		if i >= 16 {
			break
		}
	}
}
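
A quick way to confirm the 24-byte slice-header figure used in the explanation above (on a 64-bit platform) is unsafe.Sizeof, which measures the header itself rather than the backing array:

package main

import (
	"fmt"
	"unsafe"
)

func main() {
	matches := make([][]byte, 0, 1024*1024)
	fmt.Println(unsafe.Sizeof(matches)) // 24 on a 64-bit platform: pointer + len + cap
}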
