Elements incorrectly evicted from eBPF LRU hash map
Question
I observe that elements are incorrectly evicted in an eBPF LRU hash map (BPF_MAP_TYPE_LRU_HASH). In the following code I insert into an LRU hash map of size 8 and print its contents every second:
package main

import (
	"fmt"
	"github.com/cilium/ebpf"
	"log"
	"time"
)

func main() {
	spec := ebpf.MapSpec{
		Name:       "test_map",
		Type:       ebpf.LRUHash,
		KeySize:    4,
		ValueSize:  8,
		MaxEntries: 8,
	}
	hashMap, err := ebpf.NewMap(&spec)
	if err != nil {
		log.Fatalln("Could not create map:", err)
	}
	var insertKey uint32
	for range time.Tick(time.Second) {
		// Insert the next key; the value is the key widened to uint64.
		err = hashMap.Update(insertKey, uint64(insertKey), ebpf.UpdateAny)
		if err != nil {
			log.Printf("Update failed. insertKey=%d|value=%d|err=%s", insertKey, insertKey, err)
		}

		// Walk the map and print every key/value pair it currently holds.
		var key uint32
		var value uint64
		count := 0
		elementsStr := ""
		iter := hashMap.Iterate()
		for iter.Next(&key, &value) {
			elementsStr += fmt.Sprintf("(%d, %d) ", key, value)
			count++
		}
		log.Printf("Total elements: %d, elements: %s", count, elementsStr)
		insertKey++
	}
}
When I run the above program, I see this:
2023/03/29 17:32:29 Total elements: 1, elements: (0, 0)
2023/03/29 17:32:30 Total elements: 2, elements: (1, 1) (0, 0)
2023/03/29 17:32:31 Total elements: 3, elements: (1, 1) (0, 0) (2, 2)
2023/03/29 17:32:32 Total elements: 3, elements: (3, 3) (0, 0) (2, 2)
...
Since the map has eight entries, I would expect the fourth line to show four values, but it shows only three because entry (1, 1) was evicted.
If I change max_entries to 1024, I notice this problem happens after inserting the 200th element, though sometimes it happens later; it's not consistent.
This issue is not limited to creating/inserting the map from user space, because I observe the same problem in my XDP program that creates the map and inserts into it; the code above reproduces the issue I observe in my real program. In my real program, which also had 1024 entries, I noticed this problem happened after inserting the 16th element.
I tested this on our production servers that run Linux kernel 5.16.7.
I do my testing on a Linux VM. After upgrading its kernel to 6.2.8, I observe that the eviction policy is different. For example, when max_entries is 8, I observe this:
2023/03/29 20:38:02 Total elements: 1, elements: (0, 0)
2023/03/29 20:38:03 Total elements: 2, elements: (0, 0) (1, 1)
2023/03/29 20:38:04 Total elements: 3, elements: (0, 0) (2, 2) (1, 1)
2023/03/29 20:38:05 Total elements: 4, elements: (0, 0) (2, 2) (1, 1) (3, 3)
2023/03/29 20:38:06 Total elements: 5, elements: (4, 4) (0, 0) (2, 2) (1, 1) (3, 3)
2023/03/29 20:38:07 Total elements: 6, elements: (4, 4) (0, 0) (2, 2) (1, 1) (5, 5) (3, 3)
2023/03/29 20:38:08 Total elements: 7, elements: (4, 4) (0, 0) (2, 2) (1, 1) (6, 6) (5, 5) (3, 3)
2023/03/29 20:38:09 Total elements: 8, elements: (7, 7) (4, 4) (0, 0) (2, 2) (1, 1) (6, 6) (5, 5) (3, 3)
2023/03/29 20:38:10 Total elements: 1, elements: (8, 8)
...
When max_entries is 1024, I notice that after the 1025th element is added, the total element count is 897. I can't test with kernel 6.2.8 on our production servers.
Answer 1
Score: 2
The LRU hashmap doesn't guarantee that there are exactly the maximum number of items, and the implementation is clearly geared towards providing good performance with far more than 8 items. What I see from a fairly quick glance at the code:
- The LRU is separated into two parts, an "active list" and an "inactive list", with a task that moves elements from one to the other periodically depending on whether or not they've been accessed recently. It's not a true LRU (items don't get moved to the head every single time they're accessed). A simplified Go model of this two-list scheme is sketched after this list.
- When the map is full, and something needs to be evicted in order to insert a new item, the code will evict up to 128 items from the inactive list in a single pass; only if the inactive list is empty does it evict a single item from the active list.
- There is also a per-CPU "local free list" of allocated items waiting to be filled with data; when that runs empty, it attempts to pull from the global free list, and if that is empty it goes to the eviction path. The target size of the local free list is 4 items.
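To make the batch behavior concrete, here is a rough, simplified Go model of the two-list scheme described above. It is not the kernel code: the names are invented for this sketch, the rotation happens in one shot on overflow instead of periodically, and the global/per-CPU free lists are collapsed into a single free counter; evictBatch mirrors the "up to 128 items" figure mentioned above.

package main

import (
	"container/list"
	"fmt"
)

const (
	mapSize    = 8   // MaxEntries of the map being modeled
	evictBatch = 128 // up to this many inactive entries are evicted in one pass
)

// entry is one element; referenced plays the role of a reference bit.
type entry struct {
	key        uint32
	referenced bool
}

type toyLRU struct {
	active   *list.List // recently used elements
	inactive *list.List // eviction candidates
	byKey    map[uint32]*list.Element
	free     int // free elements left (free pools are not modeled separately)
}

func newToyLRU() *toyLRU {
	return &toyLRU{
		active:   list.New(),
		inactive: list.New(),
		byKey:    make(map[uint32]*list.Element),
		free:     mapSize,
	}
}

// access marks a key as recently used; the element is not moved, only its
// reference bit is set, which is honored at the next rotation.
func (l *toyLRU) access(key uint32) {
	if e, ok := l.byKey[key]; ok {
		e.Value.(*entry).referenced = true
	}
}

// rotate moves elements that were not referenced since the last rotation
// from the active list to the inactive list.
func (l *toyLRU) rotate() {
	for e := l.active.Back(); e != nil; {
		prev := e.Prev()
		ent := e.Value.(*entry)
		if ent.referenced {
			ent.referenced = false // stays active for one more round
		} else {
			l.active.Remove(e)
			l.byKey[ent.key] = l.inactive.PushFront(ent)
		}
		e = prev
	}
}

// insert adds a new (unique) key, evicting a whole batch from the inactive
// list when no free elements remain; that batch eviction is why several keys
// can disappear even though only one slot was needed.
func (l *toyLRU) insert(key uint32) {
	if l.free == 0 {
		l.rotate()
		evicted := 0
		for e := l.inactive.Back(); e != nil && evicted < evictBatch; e = l.inactive.Back() {
			delete(l.byKey, e.Value.(*entry).key)
			l.inactive.Remove(e)
			l.free++
			evicted++
		}
		if evicted == 0 { // only then is a single active element evicted
			if e := l.active.Back(); e != nil {
				delete(l.byKey, e.Value.(*entry).key)
				l.active.Remove(e)
				l.free++
			}
		}
	}
	l.free--
	l.byKey[key] = l.active.PushFront(&entry{key: key})
}

func main() {
	lru := newToyLRU()
	for k := uint32(0); k < 8; k++ { // fill the map to MaxEntries
		lru.insert(k)
	}
	lru.access(3) // touch one key so it survives the next rotation
	for k := uint32(8); k < 12; k++ {
		lru.insert(k)
		fmt.Printf("after inserting %d the model holds %d keys\n", k, len(lru.byKey))
	}
}

With evictBatch much larger than the map, the first overflowing insert drains the whole inactive list, so the model drops from 8 keys to 2 (key 3 survives only because it was touched); remove the access(3) call and it drops from 8 straight to 1, the same shape as the 6.2.8 output above.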
So the behavior in 6.2.8 seems straightforward and consistent: presumably all of your keys are on the "inactive list" (not too surprising for a scanning-type access pattern, or perhaps it's just that none of them had a chance to get promoted yet), and all of them get tossed out. I'm less clear about 5.16, but it probably relates to the local free list and all of the updates running from the same CPU.
Basically I think that the data type wasn't meant to be used in the way you're using it, and the bug is in your expectations. If you disagree, I think you'll have to take it up with the kernel developers.