2022年3月17日 18:08:26go评论112阅读模式

英文:

Performance issue in Golang's key-value store (Badger DB)

问题

在badgerDB中，我们有数十亿个键，类型为string，值的类型为HashValueList。在我们的用例中，HashValueList的长度可能达到数百万。在将其插入BadgerDb之前，我们必须将键和值编码为[]byte，我们使用encoding/gob包进行编码。因此，每当我们需要值时，我们都必须再次解码它们。这个解码过程在我们的情况下会导致开销。

为了减少解码开销，我们将设计更改为前缀迭代。通过前缀迭代，我们将集合中的每个值都存储为一个独立的Badger KV对，而不是一个具有大值的单个键。键的前缀将是原始哈希值键。然后，我们需要添加一个后缀以确保在原始集合的值集合中具有唯一性。

因此，在您的原始方案中，类似于：

k1 -> [v1, v2, v3, ..., vn]
...
km -> [w1, ..., wm]

现在变成了：

k1@1 -> v1
k1@2 -> v2
k1@3 -> v2
...
k1@n -> vn
...
km@1 -> w1
...
km@m -> wm

为了从数据库中获取值，我们有n个goroutine正在读取KeyChan通道，并将值写入ValChan。

func Get(db *badger.DB, KeyChan <-chan string, ValChan chan []byte) {
	
	var val []byte
	for key := range KeyChan {
	
		txn := db.NewTransaction(false)
		opts := badger.DefaultIteratorOptions
		opts.Prefix = []byte(key)
		it := txn.NewIterator(opts)
	
		prefix := []byte(key)
		for it.Rewind(); it.ValidForPrefix(prefix); it.Next() {
			item := it.Item()
			val, err := item.ValueCopy(val[:])
			ValChan <- val
			item = nil
			if err != nil {
				fmt.Println(err)
			}
		}
		it.Close()
		txn.Discard()
	}
}

在前缀迭代中，Get函数在一段时间后变得非常慢。我们收集了一个5秒的执行跟踪，结果如下：

这里的github.com/dgraph-io/badger/v3.(*Iterator).fill.func1 N=2033535在内部创建了大量的Goroutine。

在调度程序等待中花费了大量的时间。

我们如何提高前缀迭代的性能？对于我们的用例，是否有更好的方法？

谢谢。

英文:

In badgerDB, we have billions of keys of type string and values of type HashValueList. In our use case length of HashValueList might be in millions. We have to encode key and value in []byte before inserting into BadgerDb; we are using the encoding/gob package. So, whenever we need values, we have to decode them again. This decoding process is causing overhead in our case.

type HashValue struct {
    Row_id  string
    Address string
}
type HashValueList []HashValue

To mitigate decoding overhead, we change our design to prefix iteration. With prefix iteration, we stored each of the values from our collection as a distinct Badger KV pair, rather than a single key with a large value. The prefix of the key would be the original hash value key. we'd then need to add a suffix to provide uniqueness over the collection of values from the original collection.
So in your original scheme have something like:

k1 -&gt; [v1, v2, v3, ..., vn]
...
km -&gt; [w1, ..., wm]

Now have something like:

k1@1 -&gt; v1
k1@2 -&gt; v2
k1@3 -&gt; v2
...
k1@n -&gt; vn
...
km@1 -&gt; w1
...
km@m -&gt; wm

To find the values from DB, we have n goroutines that are reading the KeyChan channel and writing values to ValChan.

func Get(db *badger.DB, KeyChan &lt;-chan string, ValChan chan []byte) {
	
	var val []byte
	for key := range KeyChan {
	
		txn := db.NewTransaction(false)
		opts := badger.DefaultIteratorOptions
		opts.Prefix = []byte(key)
		it := txn.NewIterator(opts)
	
		prefix := []byte(key)
		for it.Rewind(); it.ValidForPrefix(prefix); it.Next() {
			item := it.Item()
			val, err := item.ValueCopy(val[:])
			ValChan &lt;- val
			item = nil
			if err != nil {
				fmt.Println(err)
			}
		}
		it.Close()
		txn.Discard()
	}
}

In Prefix iteration, Get func to gets very slow after a while. We collected a 5-second execution trace, and the results are below:

Here github.com/dgraph-io/badger/v3.(*Iterator).fill.func1 N=2033535 Creating huge number of Goroutine internally.

The significant time spent in Scheduler Wait.

How do we improve the prefix iteration performance? Is there any better approach for our use case?

Thanks

答案1

得分: 2

如果你想要快速迭代一个前缀并收集结果，你应该考虑在Badger中使用Stream框架。它使用多个goroutine来使迭代速度达到磁盘允许的最快速度。

https://pkg.go.dev/github.com/outcaste-io/badger/v3#Stream

另外，一般来说，使用Go通道来收集可能有数百万个结果的数据会非常慢。通道最好用于批量处理结果并减少与其交互的次数。

Stream框架通过提供一个串行执行的Stream.Send函数来解决所有这些问题。你甚至可以保持编码数据的原样，在Send函数中进行解码，或者从Send函数中复制数据。

英文:

If you want to iterate over a prefix really fast and collect the results, you should consider using the Stream framework within Badger. It uses many goroutines to make iteration as fast as disk would allow.

https://pkg.go.dev/github.com/outcaste-io/badger/v3#Stream

Also, in general, using a Go channel to collect what could be millions of results would be really slow. Channels are best used by batching up the results and decreasing how many times you interact with them.

Stream Framework takes care of all of that by giving you a serially executed Stream.Send function. You can even keep the encoded data as it is, and decode it within the Send, or copy it over from Send.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Golang的键值存储（Badger DB）中的性能问题

问题

答案1

golang的`range`关键字是否会破坏uint类型的信息？

Merging two JSON strings in golang

将Unix纪元作为字符串转换为Go中的time.Time。

获取嵌套对象的方法是使用一个查询。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。