Compact data structure for storing parsed log lines in Go (i.e. compact data structure for multiple enums in Go)
Question
I'm working on a script that parses and graphs information from a database logfile. Some example loglines might be:
Tue Dec 2 03:21:09.543 [rsHealthPoll] DBClientCursor::init call() failed
Tue Dec 2 03:21:09.543 [rsHealthPoll] replset info example.com:27017 heartbeat failed, retrying
Thu Nov 20 00:05:13.189 [conn1264369] insert foobar.fs.chunks ninserted:1 keyUpdates:0 locks(micros) w:110298 110ms
Thu Nov 20 00:06:19.136 [conn1263135] update foobar.fs.chunks query: { files_id: ObjectId('54661657b23a225c1e4b00ac'), n: 0 } update: { $set: { data: BinData } } nscanned:1 nupdated:1 keyUpdates:0 locks(micros) w:675 137ms
Thu Nov 20 00:06:19.136 [conn1258266] update foobar.fs.chunks query: { files_id: ObjectId('54661657ae3a22741e0132df'), n: 0 } update: { $set: { data: BinData } } nscanned:1 nupdated:1 keyUpdates:0 locks(micros) w:687 186ms
Thu Nov 20 00:12:14.859 [conn1113639] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1416453003000|74 } } cursorid:7965836327322142721 ntoreturn:0 keyUpdates:0 numYields: 15 locks(micros) r:351042 nreturned:3311 reslen:56307 188ms
Not every logline contains all fields, but some of the fields we parse out include:
- Datetime
- Query Duration
- Name of Thread
- Connection Number (e.g. 1234, 532434, 53433)
- Logging Level (e.g. Warning, Error, Info, Debug etc.)
- Logging Component (e.g. Storage, Journal, Commands, Indexing etc.)
- Type of operation (e.g. Query, Insert, Delete etc.)
- Namespace
The total logfile can often be fairly large (several hundred MBs up to a couple of GBs). Currently the script is in Python and, in addition to the parsed fields, it also stores the original raw logline as well as a tokenised version - the resulting memory consumption is actually several multiples of the original logfile size. Hence, memory consumption is one of the main things I'd like to improve.
For fun/learning, I thought I might try re-doing this in Go, and look at whether we could use a more compact data structure.
Many of the fields are enumerations (enums) - for some of them the set of values is known in advance (e.g. logging level, logging component). For others (e.g. name of thread, connection number, namespace), we'll work out the set at runtime as we parse the logfile.
Planned Changes
Firstly, many of these enums are stored as strings. So I'm guessing one improvement will be to move to using something like a uint8 to store them, and then either using consts (for the values we know in advance) or having some kind of mapping table back to the original string (for the ones we work out at runtime). Or are there any other reasons I'd prefer consts versus some kind of mapping structure?
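To make that concrete for the runtime-determined fields, here's a rough sketch of the kind of mapping table I have in mind (the Interner name and API are just illustrative, and I've used uint32 rather than uint8 since connection numbers alone could exceed 256 distinct values):
package main

import "fmt"

// Interner assigns small integer IDs to dynamically discovered
// strings (thread names, namespaces, ...) and maps them back.
type Interner struct {
	ids  map[string]uint32
	strs []string
}

func NewInterner() *Interner {
	return &Interner{ids: make(map[string]uint32)}
}

// ID returns the existing ID for s, or assigns the next free one.
func (in *Interner) ID(s string) uint32 {
	if id, ok := in.ids[s]; ok {
		return id
	}
	id := uint32(len(in.strs))
	in.ids[s] = id
	in.strs = append(in.strs, s)
	return id
}

// String maps an ID back to the original string.
func (in *Interner) String(id uint32) string { return in.strs[id] }

func main() {
	threads := NewInterner()
	a := threads.ID("rsHealthPoll")
	b := threads.ID("conn1264369")
	c := threads.ID("rsHealthPoll") // repeated name gets the same ID as a
	fmt.Println(a, b, c, threads.String(b)) // 0 1 0 conn1264369
}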
Secondly, rather than storing the original logline as a string, we can probably store an offset back to the original file on disk.
Questions
- Do you see any issues with either of the two planned changes above? Are these a good starting point?
- Do you have any other tips/suggestions for optimising the memory consumption of how we store the loglines?
- I know for bitmaps, there are things like Roaring Bitmaps (http://roaringbitmap.org/), which are compressed bitmaps you can still access/modify normally whilst compressed. Apparently the overall term for things like this is succinct data structures. However, are there any equivalents to Roaring Bitmaps but for enumerations? Or any other clever way of storing this compactly?
- I also thought of bloom filters, and maybe using those to store whether each logline was in a set (i.e. logging level warning, logging level error) - however, a given logline can only be in one of those sets, so I don't know if that makes sense. Also, I'm not sure how to handle the false positives.
Thoughts?
Answer 1
Score: 0
Do you see any issues with either of the two planned changes above? Are these a good starting point?
No problems with either. If the logs are definitely line-delimited you can just store the line number, but it may be more robust to store the byte offset. The standard io.Reader interface returns the number of bytes read, so you can use that to keep track of the offset.
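For illustration, a minimal sketch of recording each line's starting byte offset while scanning (the filename is just a placeholder):
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("db.log") // placeholder filename
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r := bufio.NewReader(f)
	var offset int64 // byte offset of the line about to be read
	for {
		line, err := r.ReadString('\n')
		if len(line) > 0 {
			fmt.Printf("offset %d: %q\n", offset, line)
			// ...parse the line here and store only offset, not the raw string...
			offset += int64(len(line))
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
	}
}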
Do you have any other tips/suggestions for optimising the memory consumption of how we store the loglines?
It depends on what you want to use them for, but once they've been tokenized (and you've got the data you want from the line), why hold onto the line in memory? It's already in the file, and you've now got an offset to look it up again quickly.
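A sketch of the lookup side, re-reading a line on demand from its recorded offset (again, the filename is a placeholder):
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// readLineAt re-reads a single logline given its recorded byte offset.
func readLineAt(f *os.File, offset int64) (string, error) {
	if _, err := f.Seek(offset, io.SeekStart); err != nil {
		return "", err
	}
	line, err := bufio.NewReader(f).ReadString('\n')
	if err != nil && err != io.EOF {
		return "", err
	}
	return strings.TrimSuffix(line, "\n"), nil
}

func main() {
	f, err := os.Open("db.log") // placeholder filename
	if err != nil {
		panic(err)
	}
	defer f.Close()

	line, err := readLineAt(f, 0) // an offset recorded during parsing
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}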
are there any equivalents to roaring bitmaps but for enumerations? Or any other clever way of storing this compactly?
I'd tend to just define each enum type as an int, and use iota. Something like:
package main

import (
	"fmt"
	"time"
)

type LogLevel int
type LogComponent int
type Operation int

// Values known in advance can simply be consts.
const (
	Info LogLevel = iota
	Warning
	Debug
	Error
)

const (
	Storage LogComponent = iota
	Journal
	Commands
	Indexing
)

const (
	Query Operation = iota
	Insert
	Delete
)

type LogLine struct {
	DateTime      time.Time
	QueryDuration time.Duration
	ThreadName    string
	ConNum        uint
	Level         LogLevel
	Comp          LogComponent
	Op            Operation
	Namespace     string
}

func main() {
	l := &LogLine{
		time.Now(),
		10 * time.Second,
		"query1",
		1000,
		Info,
		Journal,
		Delete,
		"ns1",
	}
	fmt.Printf("%v\n", l)
}
Produces &{2009-11-10 23:00:00 +0000 UTC 10s query1 1000 0 1 2 ns1}.
You could pack some of the struct fields, but then you need to define bit-ranges for each field and you lose some open-endedness. For example, define LogLevel as the first 2 bits, Component as the next 2 bits, etc.
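As a rough sketch of that packing, reusing the iota values from the example above (note 2 bits leaves room for only four values per field, which is exactly why you lose open-endedness):
package main

import "fmt"

type LogLevel int
type LogComponent int
type Operation int

// Packed squeezes level (bits 0-1), component (bits 2-3) and
// operation (bits 4-5) into a single byte.
type Packed uint8

func Pack(l LogLevel, c LogComponent, o Operation) Packed {
	return Packed(uint8(l) | uint8(c)<<2 | uint8(o)<<4)
}

func (p Packed) Level() LogLevel         { return LogLevel(p & 0x3) }
func (p Packed) Component() LogComponent { return LogComponent(p >> 2 & 0x3) }
func (p Packed) Operation() Operation    { return Operation(p >> 4 & 0x3) }

func main() {
	// 1, 2, 1 correspond to Warning, Commands, Insert in the consts above.
	p := Pack(1, 2, 1)
	fmt.Println(p.Level(), p.Component(), p.Operation()) // 1 2 1
}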
I also thought of bloom filters, and maybe using those to store whether each logline was in a set (i.e. logging level warning, logging level error) - however, it can only be in one of those sets, so I don't know if that makes sense. Also, not sure how to handle the false positives.
For your current example, bloom filters may be overkill. It may be easier to have a []int for each enum, or some other master "index" that keeps track of line-number to (for example) log-level relationships. As you said, each log line can only be in one set. In fact, depending on the number of enum fields, it may be easier to use the packed enums as an identifier for something like a map[int][]int.
Set := make(map[int][]int)
Set[int(Delete)<<4 + int(Journal)<<2 + int(Debug)] = []int{7, 45, 900} // Line numbers in this set.
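For completeness, a runnable sketch of building and querying such an index (the const values match the iota blocks above; the line numbers are invented):
package main

import "fmt"

type LogLevel int
type LogComponent int
type Operation int

// The relevant values from the iota blocks above.
const (
	Debug   LogLevel     = 2
	Journal LogComponent = 1
	Delete  Operation    = 2
)

// key packs the three enums into a single int, as in the snippet above.
func key(o Operation, c LogComponent, l LogLevel) int {
	return int(o)<<4 + int(c)<<2 + int(l)
}

func main() {
	set := make(map[int][]int)

	// While parsing, append each line's number to its bucket.
	k := key(Delete, Journal, Debug)
	set[k] = append(set[k], 7, 45, 900)

	// Later: every line that was a Journal delete logged at Debug level.
	fmt.Println(set[k]) // [7 45 900]
}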
See here for a complete, although hackish example.