内存中的字符串去重

huangapple go评论91阅读模式
英文:

in-memory string deduplication

问题

上下文:我正在编写一个处理日志数据的程序,涉及将几GB的数据加载到内存中,交叉检查各种内容,找到数据之间的相关性,并将结果写入另一个文件。(这实际上是在将数据加载到Druid.io集群之前的一个烹饪/去规范化步骤。)我希望避免将信息写入数据库,这样可以提高性能并简化代码 - 可以预见的将来,一次处理的数据量可以通过增加机器的内存来处理。

我的问题是,是否有必要在代码中明确去重字符串;如果是的话,有什么好的方法。这些日志文件中的许多值都是完全相同的文本(大约有文件中总文本值的25%是唯一的,粗略估计)。

由于我们谈论的是几GB的数据,虽然内存便宜且可以使用交换空间,但仍然有限制,如果我不小心,很可能会达到限制。如果我这样做:

strstore := make(map[string]string)

// 做一些涉及切片和处理文本的工作,结果是:
var a = "我们找出的某个字符串,大约有75%的可能性是重复的"

// 注意,这里还有大约10个类似的变量,这里只展示了"a",为了简单起见

if _, ok := strstore[a]; !ok {
    strstore[a] = a
} else {
    a = strstore[a]
}

// 现在对"a"做更多的操作,并将其保存在某个映射中的结构体中

在我看来,这将具有“重用”现有字符串的效果,但会增加哈希查找和比较的成本。看起来是一个不错的权衡。

然而,如果实际上是新的字符串导致内存碎片化,并留下各种未被使用的空洞,那么这种方法可能并不那么有用。

我还可以尝试保持一个大的字节数组/切片,其中包含字符数据,并对其进行索引,但这将使代码难以编写(尤其是在[]byte和字符串之间进行转换时,这涉及到它自己的分配),而且我可能只是在做一件本来应该由Go运行时处理的糟糕工作。

希望能得到关于解决这个问题的任何建议,或者如果有人在这方面有特别有用的机制的经验。

英文:

Context: I'm writing something to process log data which involves loading several GB of data into memory and cross checking various things, finding correlations in data and writing the results out to another file. (This is essentially a cooking/denormalization step before loading into a Druid.io cluster.) I want to avoid having to write the information to a database for both performance and code simplicity - it is assumed that in the foreseeable future the volume of data processed at one time can be handled by adding memory to the machine.

My question is if it is a good idea to attempt to explicitly deduplicate strings in my code; and if so, what is a good approach. Many of the values in these log files are the same exact pieces of text (probably about 25% of the total text values in the file are unique, rough guess).

Since we're talking about gigs of data, and while ram is cheap and swap is possible, there is still a limit and if I'm careless I will very likely hit it. If I do something like this:

strstore := make(map[string]string)

// do some work that involves slicing and dicing some text, resulting in:
var a = "some string that we figured out that has about a 75% chance of being duplicate"

// note that there are about 10 other such variables that are calculated here, only "a" shown for simplicity

if _, ok := strstore[a]; !ok {
    strstore[a] = a
} else {
    a = strstore[a]
}

// now do some more stuff with "a" and keep it in a struct which is in
// some map some place

It would seem to me that this would have the effect of "reusing" existing strings, at the cost of a hash lookup and compare. Seemingly a good trade off.

However, this might not be that helpful if the strings that are in fact new cause memory to be fragmented and have various holes that are left unclaimed.

I could also try to keep one big byte array/slice which has the character data and index into that, but it would make the code hard to write (esp having to mess around with conversion between []byte and strings, which involves it's own allocation) and I would probably just be doing a poor job of something that is really the Go runtime's domain anyway.

Looking for any advice on approaches to this problem, or if anyone's experience with this sort of thing has yielded particularly useful mechanisms to address this.

答案1

得分: 1

这里有很多有趣的数据结构和算法可以在这里使用。这取决于你在统计和处理阶段想要做什么。

我不确定你的日志有多可压缩,但你可以预处理数据,具体取决于你的用例:https://github.com/alecthomas/mph/blob/master/README.md

还可以查看一些背景资料中的这些数据结构:

https://github.com/Workiva/go-datastructures

英文:

There are a lot of interesting data structures and algorithms that you could use here. It depends on what you are trying do in the stats and processing stages.

I am not sure how compressible your logs are but you could pre-process the data, again depending on your uses cases : https://github.com/alecthomas/mph/blob/master/README.md

Take a look at some of these data structures as well for background :

https://github.com/Workiva/go-datastructures

huangapple
  • 本文由 发表于 2015年9月20日 04:31:47
  • 转载请务必保留本文链接:https://go.coder-hub.com/32672911.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定