OOM for no reason (Arch/Raspberry)

Question

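My Raspberry Pi 4B crashes whenever certain jobs run (for example, when a backup job starts). It runs Arch Linux (`armv7l`). Memory usage is *always* below 15%.

Below is the log, including the output of `free -hw`, captured 7 seconds before the OOM.

`net-restart.sh` is a simple bash script. The most complicated thing it does is `ping`, so it has no reason to cause an OOM while more than 3 GiB is free. Sometimes the OOM is triggered by the PostgreSQL vacuum service, sometimes by the `rsync`-based backup. Once the OOM killer starts, it kills one process after another until the system dies completely.

I have upgraded the kernel (and other packages) a few times since this started happening, and there was no software change before it started. Could it be a hardware problem?

I have also tried adding swap (2 GiB), but it didn't help.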

    23:00:02 free[10890]:                total        used        free      shared     buffers       cache   available
    23:00:02 free[10890]: Mem:           3,7Gi        82Mi       3,2Gi       2,0Mi       0,0Ki       442Mi       3,6Gi
    23:00:02 free[10890]: Swap:             0B          0B          0B

    23:00:09 kernel: oom_kill_process: 13 callbacks suppressed
    23:00:09 kernel: net-restart.sh invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
    23:00:09 kernel: CPU: 2 PID: 10992 Comm: net-restart.sh Tainted: G         C         6.1.14-1-rpi-ARCH #1
    23:00:09 kernel: Hardware name: BCM2711
    23:00:09 kernel:  unwind_backtrace from show_stack+0x18/0x1c
    23:00:09 kernel:  show_stack from dump_stack_lvl+0x90/0xac
    23:00:09 kernel:  dump_stack_lvl from dump_header+0x54/0x1fc
    23:00:09 kernel:  dump_header from oom_kill_process+0x23c/0x248
    23:00:09 kernel:  oom_kill_process from out_of_memory+0x218/0x34c
    23:00:09 kernel:  out_of_memory from __alloc_pages+0xa98/0x1044
    23:00:09 kernel:  __alloc_pages from __pmd_alloc+0x3c/0x1d8
    23:00:09 kernel:  __pmd_alloc from copy_page_range+0xcac/0xcc4
    23:00:09 kernel:  copy_page_range from dup_mm+0x440/0x5a4
    23:00:09 kernel:  dup_mm from copy_process+0xda0/0x164c
    23:00:09 kernel:  copy_process from kernel_clone+0xac/0x3a8
    23:00:09 kernel:  kernel_clone from sys_clone+0x78/0x9c
    23:00:09 kernel:  sys_clone from ret_fast_syscall+0x0/0x1c
    23:00:09 kernel: Exception stack(0xf08b1fa8 to 0xf08b1ff0)
    23:00:09 kernel: 1fa0:                   b6fd0088 00000001 01200011 00000000 00000000 00000000
    23:00:09 kernel: 1fc0: b6fd0088 00000001 b6efae58 00000078 bea210fc 0055d2bc bea2107c 005844e0
    23:00:09 kernel: 1fe0: b6fd05a0 bea20f08 b6e2d260 b6e2d684
    23:00:09 kernel: Mem-Info:
    23:00:09 kernel: active_anon:7451 inactive_anon:603 isolated_anon:0
                      active_file:39567 inactive_file:70065 isolated_file:0
                      unevictable:0 dirty:143 writeback:0
                      slab_reclaimable:3166 slab_unreclaimable:6791
                      mapped:23163 shmem:594 pagetables:267
                      sec_pagetables:0 bounce:0
                      kernel_misc_reclaimable:0
                      free:848488 free_pcp:30 free_cma:80063
    23:00:09 kernel: Node 0 active_anon:29804kB inactive_anon:2412kB active_file:158268kB inactive_file:280260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:92652kB dirty:572kB writeback:0kB shmem:2376kB writeback_tmp:0kB kernel_stack:2360kB pagetables:1068kB sec_pagetab>
    23:00:09 kernel: DMA free:323468kB boost:0kB min:3236kB low:4044kB high:4852kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:8076kB inactive_file:279068kB unevictable:0kB writepending:0kB present:786432kB managed:664228kB mlocked:0kB bounce:0kB free_pcp:120kB >
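
One possible reading of the Mem-Info numbers, offered as a hedged interpretation and not a confirmed diagnosis: the failing request is a `GFP_KERNEL_ACCOUNT` allocation, which on a 32-bit kernel has to come from low memory (the `DMA` zone here) and not from the HighMem pages that hold most of the free RAM. The sketch below redoes the arithmetic with values copied from the dump. It assumes 4 KiB pages, assumes that all free CMA pages sit in the `DMA` zone (the `(C)` tags on the free-list line of the full log suggest this), and assumes that this kind of allocation cannot be served from CMA pages.

```shell
#!/bin/sh
# Values copied from the Mem-Info dump in the question.
page_kb=4              # assumption: 4 KiB pages on armv7l
free_pages=848488      # "free:" (all zones, in pages)
free_cma_pages=80063   # "free_cma:" (in pages)
dma_free_kb=323468     # "DMA free:"
dma_min_kb=3236        # "DMA ... min:" watermark

# Total free memory across all zones, in MiB.
total_free_mib=$(( free_pages * page_kb / 1024 ))

# Free CMA memory in kB, assumed to lie entirely in the DMA zone.
cma_free_kb=$(( free_cma_pages * page_kb ))

# What is left in the DMA zone once CMA pages are excluded.
dma_non_cma_kb=$(( dma_free_kb - cma_free_kb ))

echo "total free:           ${total_free_mib} MiB"
echo "free CMA:             ${cma_free_kb} kB"
echo "DMA free without CMA: ${dma_non_cma_kb} kB (min watermark ${dma_min_kb} kB)"
```

Under those assumptions the low-memory zone has only 3216 kB free outside CMA, just under its 3236 kB `min` watermark, even though about 3.2 GiB is free overall. That would be consistent with an OOM while `free` reports plenty of memory.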

<details>
<summary>English:</summary>

My Raspberry Pi 4B dies every time it does *something* (for example, when a backup job starts). I'm running Arch Linux (`armv7l`) on it. The memory usage is *always* below 15%.

Below is the log, including the output of `free -hw`, which was logged 7 seconds before the OOM.

`net-restart.sh` is a simple bash script. The most *complicated* thing it does is `ping`, so there's no reason for it to cause an OOM when more than 3 GiB is free. Sometimes the OOM is triggered by the PostgreSQL vacuum service, sometimes by the `rsync`-based backup. When it goes OOM, it just starts killing one process after another until it dies completely.

I have upgraded the kernel (and other packages) a few times since this started to happen, and there was no software change before it started. Could it be a hardware problem?

By the way, I have also tried adding swap (2 GiB), but it didn't help.

23:00:02 free[10890]: total used free shared buffers cache available
23:00:02 free[10890]: Mem: 3,7Gi 82Mi 3,2Gi 2,0Mi 0,0Ki 442Mi 3,6Gi
23:00:02 free[10890]: Swap: 0B 0B 0B

23:00:09 kernel: oom_kill_process: 13 callbacks suppressed
23:00:09 kernel: net-restart.sh invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
23:00:09 kernel: CPU: 2 PID: 10992 Comm: net-restart.sh Tainted: G C 6.1.14-1-rpi-ARCH #1
23:00:09 kernel: Hardware name: BCM2711
23:00:09 kernel: unwind_backtrace from show_stack+0x18/0x1c
23:00:09 kernel: show_stack from dump_stack_lvl+0x90/0xac
23:00:09 kernel: dump_stack_lvl from dump_header+0x54/0x1fc
23:00:09 kernel: dump_header from oom_kill_process+0x23c/0x248
23:00:09 kernel: oom_kill_process from out_of_memory+0x218/0x34c
23:00:09 kernel: out_of_memory from __alloc_pages+0xa98/0x1044
23:00:09 kernel: __alloc_pages from __pmd_alloc+0x3c/0x1d8
23:00:09 kernel: __pmd_alloc from copy_page_range+0xcac/0xcc4
23:00:09 kernel: copy_page_range from dup_mm+0x440/0x5a4
23:00:09 kernel: dup_mm from copy_process+0xda0/0x164c
23:00:09 kernel: copy_process from kernel_clone+0xac/0x3a8
23:00:09 kernel: kernel_clone from sys_clone+0x78/0x9c
23:00:09 kernel: sys_clone from ret_fast_syscall+0x0/0x1c
23:00:09 kernel: Exception stack(0xf08b1fa8 to 0xf08b1ff0)
23:00:09 kernel: 1fa0: b6fd0088 00000001 01200011 00000000 00000000 00000000
23:00:09 kernel: 1fc0: b6fd0088 00000001 b6efae58 00000078 bea210fc 0055d2bc bea2107c 005844e0
23:00:09 kernel: 1fe0: b6fd05a0 bea20f08 b6e2d260 b6e2d684
23:00:09 kernel: Mem-Info:
23:00:09 kernel: active_anon:7451 inactive_anon:603 isolated_anon:0
active_file:39567 inactive_file:70065 isolated_file:0
unevictable:0 dirty:143 writeback:0
slab_reclaimable:3166 slab_unreclaimable:6791
mapped:23163 shmem:594 pagetables:267
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:848488 free_pcp:30 free_cma:80063
23:00:09 kernel: Node 0 active_anon:29804kB inactive_anon:2412kB active_file:158268kB inactive_file:280260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:92652kB dirty:572kB writeback:0kB shmem:2376kB writeback_tmp:0kB kernel_stack:2360kB pagetables:1068kB sec_pagetab>
23:00:09 kernel: DMA free:323468kB boost:0kB min:3236kB low:4044kB high:4852kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:8076kB inactive_file:279068kB unevictable:0kB writepending:0kB present:786432kB managed:664228kB mlocked:0kB bounce:0kB free_pcp:120kB >
23:00:09 kernel: lowmem_reserve[]: 0 0 3188 3188
23:00:09 kernel: DMA: 143*4kB (UMEC) 119*8kB (UMEC) 68*16kB (UMEC) 23*32kB (UEC) 1*64kB (C) 1*128kB (C) 0*256kB 1*512kB (C) 0*1024kB 0*2048kB 78*4096kB (C) = 323540kB
23:00:09 kernel: 110236 total pagecache pages
23:00:09 kernel: 0 pages in swap cache
23:00:09 kernel: Free swap = 0kB
23:00:09 kernel: Total swap = 0kB
23:00:09 kernel: 1012736 pages RAM
23:00:09 kernel: 816128 pages HighMem/MovableOnly
23:00:09 kernel: 30551 pages reserved
23:00:09 kernel: 81920 pages cma reserved
23:00:09 kernel: Tasks state (memory values in pages):
23:00:09 kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
23:00:09 kernel: [ 242] 0 242 12050 4296 98304 0 -250 systemd-journal
23:00:09 kernel: [ 243] 0 243 7022 1837 61440 0 -1000 systemd-udevd
23:00:09 kernel: [ 516] 81 516 2843 1047 49152 0 -900 dbus-daemon
23:00:09 kernel: [ 550] 0 550 2422 1664 45056 0 -1000 sshd
23:00:09 kernel: [ 554] 0 554 196576 7435 167936 0 -999 containerd
23:00:09 kernel: [ 651] 0 651 203978 13307 245760 0 -500 dockerd
23:00:09 kernel: [ 10882] 978 10882 4543 2764 65536 0 0 systemd-resolve
23:00:09 kernel: [ 10888] 0 10888 1097 201 36864 0 0 agetty
23:00:09 kernel: [ 10889] 977 10889 6022 965 65536 0 0 systemd-timesyn
23:00:09 kernel: [ 10890] 0 10890 2676 341 49152 0 0 free
23:00:09 kernel: [ 10897] 0 10897 3543 1493 57344 0 0 systemd-logind
23:00:09 kernel: [ 10992] 0 10992 2169 824 40960 0 0 net-restart.sh
23:00:09 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=net-restart.service,mems_allowed=0,global_oom,task_memcg=/,task=systemd-resolve,pid=10882,uid=978
23:00:09 kernel: Out of memory: Killed process 10882 (systemd-resolve) total-vm:18172kB, anon-rss:1548kB, file-rss:9508kB, shmem-rss:0kB, UID:978 pgtables:64kB oom_score_adj:0



I've tried reducing the memory usage of my `rsync` backup, I've added a service that logs memory stats to see what's going on, and I've tried adding swap. I'm still puzzled.

</details>


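# Answer 1
**Score**: 0

I have had the same problem with rsync for about a month. (I just saw processes dying; it took a while to pin it on rsync.) I have also sometimes seen a similar effect when calling `du -hs`. The rsync runs from an SD card to a USB3 drive. Both drives seem healthy.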

    rsync invoked oom-killer: gfp_mask=0xcd0(GFP_KERNEL|__GFP_RECLAIMABLE), order=0, oom_score_adj=0
    Mar 19 04:08:56 rasp4 kernel: CPU: 1 PID: 1164 Comm: rsync Tainted: G C 6.1.19-2-rpi-ARCH #1
    Mar 19 04:08:56 rasp4 kernel: Hardware name: BCM2711
    Mar 19 04:08:56 rasp4 kernel: unwind_backtrace from show_stack+0x18/0x1c
    Mar 19 04:08:56 rasp4 kernel: show_stack from dump_stack_lvl+0x90/0xac
    Mar 19 04:08:56 rasp4 kernel: dump_stack_lvl from dump_header+0x54/0x1fc
    Mar 19 04:08:56 rasp4 kernel: dump_header from oom_kill_process+0x23c/0x248
    Mar 19 04:08:56 rasp4 kernel: oom_kill_process from out_of_memory+0x218/0x34c
    Mar 19 04:08:56 rasp4 kernel: out_of_memory from __alloc_pages+0xa98/0x1044
    Mar 19 04:08:56 rasp4 kernel: __alloc_pages from new_slab+0x384/0x43c
    Mar 19 04:08:56 rasp4 kernel: new_slab from ___slab_alloc+0x3e8/0xa0c
    Mar 19 04:08:56 rasp4 kernel: ___slab_alloc from kmem_cache_alloc_lru+0x4fc/0x640
    Mar 19 04:08:56 rasp4 kernel: kmem_cache_alloc_lru from __d_alloc+0x2c/0x1bc
    Mar 19 04:08:56 rasp4 kernel: __d_alloc from d_alloc+0x18/0x74
    Mar 19 04:08:56 rasp4 kernel: d_alloc from d_alloc_parallel+0x50/0x3b8
    Mar 19 04:08:56 rasp4 kernel: d_alloc_parallel from __lookup_slow+0x60/0x138
    Mar 19 04:08:56 rasp4 kernel: __lookup_slow from walk_component+0xf4/0x164
    Mar 19 04:08:56 rasp4 kernel: walk_component from path_lookupat+0x7c/0x1a4
    Mar 19 04:08:56 rasp4 kernel: path_lookupat from filename_lookup+0xc0/0x190
    Mar 19 04:08:56 rasp4 kernel: filename_lookup from vfs_statx+0x7c/0x168
    Mar 19 04:08:56 rasp4 kernel: vfs_statx from do_statx+0x70/0xb0


<details>
<summary>English:</summary>

I have had the same problem with rsync for about a month.
(I just saw processes dying; it took a while to pin it on rsync.)
I have also sometimes seen a similar effect when calling `du -hs`.
The rsync runs from an SD card to a USB3 drive. Both drives seem healthy.

rsync invoked oom-killer: gfp_mask=0xcd0(GFP_KERNEL|__GFP_RECLAIMABLE), order=0, oom_score_adj=0
Mar 19 04:08:56 rasp4 kernel: CPU: 1 PID: 1164 Comm: rsync Tainted: G C 6.1.19-2-rpi-ARCH #1
Mar 19 04:08:56 rasp4 kernel: Hardware name: BCM2711
Mar 19 04:08:56 rasp4 kernel: unwind_backtrace from show_stack+0x18/0x1c
Mar 19 04:08:56 rasp4 kernel: show_stack from dump_stack_lvl+0x90/0xac
Mar 19 04:08:56 rasp4 kernel: dump_stack_lvl from dump_header+0x54/0x1fc
Mar 19 04:08:56 rasp4 kernel: dump_header from oom_kill_process+0x23c/0x248
Mar 19 04:08:56 rasp4 kernel: oom_kill_process from out_of_memory+0x218/0x34c
Mar 19 04:08:56 rasp4 kernel: out_of_memory from __alloc_pages+0xa98/0x1044
Mar 19 04:08:56 rasp4 kernel: __alloc_pages from new_slab+0x384/0x43c
Mar 19 04:08:56 rasp4 kernel: new_slab from ___slab_alloc+0x3e8/0xa0c
Mar 19 04:08:56 rasp4 kernel: ___slab_alloc from kmem_cache_alloc_lru+0x4fc/0x640
Mar 19 04:08:56 rasp4 kernel: kmem_cache_alloc_lru from __d_alloc+0x2c/0x1bc
Mar 19 04:08:56 rasp4 kernel: __d_alloc from d_alloc+0x18/0x74
Mar 19 04:08:56 rasp4 kernel: d_alloc from d_alloc_parallel+0x50/0x3b8
Mar 19 04:08:56 rasp4 kernel: d_alloc_parallel from __lookup_slow+0x60/0x138
Mar 19 04:08:56 rasp4 kernel: __lookup_slow from walk_component+0xf4/0x164
Mar 19 04:08:56 rasp4 kernel: walk_component from path_lookupat+0x7c/0x1a4
Mar 19 04:08:56 rasp4 kernel: path_lookupat from filename_lookup+0xc0/0x190
Mar 19 04:08:56 rasp4 kernel: filename_lookup from vfs_statx+0x7c/0x168
Mar 19 04:08:56 rasp4 kernel: vfs_statx from do_statx+0x70/0xb0


</details>



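# Answer 2
**Score**: 0

According to https://archlinuxarm.org/forum/viewtopic.php?f=23&t=16377, this issue has been solved (or rather bypassed) in 6.1.21-2.

I finally found time to test it (with the current version, 6.1.21-3), and it seems to work fine.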

<details>
<summary>English:</summary>

According to https://archlinuxarm.org/forum/viewtopic.php?f=23&t=16377, this issue has been solved (or rather bypassed) in 6.1.21-2.

I finally found time to test it (with the current version, 6.1.21-3), and it seems to work fine.

</details>



# Answer 3
**Score**: 0

The issue is this one: https://github.com/raspberrypi/linux/issues/5395

It was fixed by this upstream Linux commit: https://github.com/torvalds/linux/commit/669281ee7ef731fb5204df9d948669bf32a5e68d

That commit was released in kernel 6.6 and backported to the 6.1 branch in 6.1.54.

If you can't easily update the kernel, a workaround is to disable MGLRU:

    $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/enabled
    0

Alternatively, build the kernel with `CONFIG_LRU_GEN` disabled.
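
To decide whether the workaround is still needed, a small check along these lines may help. It is only a sketch: the helper names `version_ge` and `kernel_has_fix` are made up for illustration, the comparison relies on `sort -V` (available in GNU coreutils), and the version thresholds are the ones quoted in this answer. Kernels between 6.2 and 6.5 are reported as not fixed simply because this answer gives no information about them, and distribution packages may carry their own workaround earlier than these versions.

```shell
#!/bin/sh
# version_ge A B: succeeds if version A >= version B (uses sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# kernel_has_fix RELEASE: succeeds if RELEASE (as printed by `uname -r`)
# is at or past the versions named in the answer: 6.1.54 on the 6.1
# branch, 6.6 otherwise. Hypothetical helper, not an official check.
kernel_has_fix() {
    v=${1%%-*}    # strip the package suffix, e.g. "-1-rpi-ARCH"
    case "$v" in
        6.1.*) version_ge "$v" 6.1.54 ;;
        *)     version_ge "$v" 6.6 ;;
    esac
}

release=$(uname -r)
if kernel_has_fix "$release"; then
    echo "$release: at or past the fixed versions"
else
    echo "$release: may still need the MGLRU workaround"
fi

# Show the current MGLRU state if the kernel exposes it.
lru_file=/sys/kernel/mm/lru_gen/enabled
if [ -r "$lru_file" ]; then
    echo "MGLRU enabled mask: $(cat "$lru_file")"
else
    echo "MGLRU control file not present ($lru_file)"
fi
```

Note that a value written to `/sys/kernel/mm/lru_gen/enabled` does not survive a reboot, so the `echo 0` workaround has to be reapplied at boot until the kernel is updated.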

<details>
<summary>English:</summary>

The issue is this one: https://github.com/raspberrypi/linux/issues/5395

It was fixed by this upstream Linux commit: https://github.com/torvalds/linux/commit/669281ee7ef731fb5204df9d948669bf32a5e68d

That commit was released in kernel 6.6 and backported to the 6.1 branch in 6.1.54.

If you can't easily update the kernel, a workaround is to disable MGLRU:

    $ echo 0 | sudo tee /sys/kernel/mm/lru_gen/enabled
    0

Alternatively, build the kernel with `CONFIG_LRU_GEN` disabled.

</details>



huangapple
  • Published on 2023-03-15 19:01:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75743833.html