Introduction to Linux OOM - Xie Xianbin's Blog

Published: 2024-02-15  Updated: 2024-07-14  Word count: 1870  Reading time: 4m  Author: Xie Xianbin  IP location: Shanghai

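OOM (out of memory): when Linux is under memory pressure, the kernel kills some less important processes. It decides which ones using the files described below.

Symptoms

OOM logs usually appear in /var/log/syslog, /var/log/messages, or the output of dmesg -T | grep -E -i -B100 'killed process'. For example: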

kernel: [xxx] Out of memory: Kill process 888 (python) score 888 or sacrifice child
kernel: [xxx] Killed process 888 (python) total-vm:1200000kB, anon-rss:600000kB, file-rss:0kB
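
The log lines above follow a regular pattern, so the interesting fields can be pulled out with sed. This is a minimal sketch: parse_oom_kills is a hypothetical helper name, and it assumes the "Killed process PID (name) ... anon-rss:NkB" layout shown in the sample, which may differ between kernel versions.

```shell
#!/bin/sh
# Sketch: summarize "Killed process" lines from kernel log text on stdin.
# parse_oom_kills is a hypothetical helper name, not a standard tool.
parse_oom_kills() {
  sed -n -E 's/.*Killed process ([0-9]+) \(([^)]*)\).*anon-rss:([0-9]+)kB.*/pid=\1 comm=\2 anon_rss_kb=\3/p'
}

# Demo with the sample line from this article.
echo 'kernel: [xxx] Killed process 888 (python) total-vm:1200000kB, anon-rss:600000kB, file-rss:0kB' | parse_oom_kills
# prints: pid=888 comm=python anon_rss_kb=600000
```

On a live system the same function can be fed from the kernel log, for example dmesg -T | parse_oom_kills.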

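Linux memory terminology

In Linux:

total-vm (total virtual memory): the total virtual memory used by the process
RSS (Resident Set Size): the physical memory the process actually occupies in RAM. It excludes memory that has been swapped out, but includes shared libraries as long as their pages are resident
anon-rss (anonymous RSS): resident physical memory backing anonymous mappings, such as memory obtained with malloc
file-rss: resident physical memory backing pages mapped from files or devices
shmem-rss: resident physical memory backing shared memory (tmpfs, System V shared memory, shared anonymous mappings)
VSZ: the virtual memory allocated to the process. It covers all memory the process can access, including swapped-out content and shared libraries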
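
These terms can be checked against a running process through its /proc status file. This is a minimal sketch: rss_breakdown is a hypothetical helper name, and it assumes the kernel exports the RssAnon, RssFile and RssShmem fields (older kernels only show VmRSS).

```shell
#!/bin/sh
# Sketch: print the virtual size and the RSS breakdown from a status file.
# rss_breakdown is a hypothetical helper; it takes the path to a
# /proc/<pid>/status file (or a saved copy of one).
rss_breakdown() {
  awk '/^(VmSize|VmRSS|RssAnon|RssFile|RssShmem):/ {printf "%s %s kB\n", $1, $2}' "$1"
}

# Inspect the current shell when /proc is available.
if [ -r "/proc/$$/status" ]; then
  rss_breakdown "/proc/$$/status"
fi
```

VmSize corresponds to total-vm/VSZ, and VmRSS should be roughly the sum of the three Rss* lines.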

Other notes:

cgroups v1 controls memory through /sys/fs/cgroup/memory
man top describes these fields in detail

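Parameters that control system OOM behavior

/proc/sys/vm/panic_on_oom controls whether the kernel panics when an OOM occurs. Possible values:
- 0: run the OOM killer; /proc/sys/vm/oom_kill_allocating_task decides which process is killed
- 1: usually trigger a kernel panic, but if the OOM is confined to a cpuset or memory policy, the OOM killer runs instead
- 2: always trigger a kernel panic
To reboot automatically 10 seconds after a kernel panic: echo "kernel.panic=10" >> /etc/sysctl.conf
/proc/sys/vm/oom_kill_allocating_task
- 0: kill the process with the highest score
- non-zero: kill the process whose allocation triggered the OOM, but never system processes (such as init) or processes the user has protected through oom_score_adj
/proc/sys/vm/oom_dump_tasks controls whether an OOM logs per-process information: process identifiers, total virtual memory, physical memory, page table usage, and so on
- 0: do not print this information
  - On large systems with thousands of processes, dumping memory usage for each one can cause performance problems
- non-zero: print per-process memory usage in three cases
  - when the OOM causes a kernel panic
  - when no eligible process is found to kill
  - when an eligible process is found and killed
/proc/sys/vm/overcommit_memory
- Linux lets processes overcommit, that is, request more memory than is physically available
- overcommit_memory has three policies
  - 0: heuristic policy
    - An obviously excessive overcommit fails, for example a sudden request for 128 TB
    - A modest overcommit is allowed
  - 1: always allow overcommit
  - 2: never overcommit
- The Linux malloc man page describes it as follows: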

By default, Linux follows an optimistic memory allocation strategy.
This means that when malloc() returns non-NULL there is no guarantee
that the memory really is available. This is a really bad bug. In
case it turns out that the system is out of memory, one or more processes
will be killed by the infamous OOM killer.  In case Linux is employed
under circumstances where it would be less desirable to suddenly lose
some randomly picked processes, and moreover the kernel version is
sufficiently recent, one can switch off this overcommitting behavior
using a command like:

# echo 2  > /proc/sys/vm/overcommit_memory

See also the kernel Documentation directory, files vm/overcommit-accounting and sysctl/vm.txt.
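
With overcommit_memory set to 2, the kernel refuses allocations beyond a commit limit. The sketch below is a simplified version of that calculation: it assumes the limit is swap plus vm.overcommit_ratio percent of RAM, and it ignores huge pages and the vm.overcommit_kbytes alternative. commit_limit_kb is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: approximate commit limit under overcommit_memory=2, in kB.
#   limit_kb = swap_kb + ram_kb * overcommit_ratio / 100
# commit_limit_kb is a hypothetical helper; huge pages are ignored.
commit_limit_kb() {
  ram_kb=$1
  swap_kb=$2
  ratio=${3:-50}   # 50 is the kernel default for vm.overcommit_ratio
  echo $(( swap_kb + ram_kb * ratio / 100 ))
}

# MemTotal and SwapTotal taken from the /proc/meminfo sample in this article.
commit_limit_kb 3961028 3867644 50   # prints 5848158
```

On a live system the result can be compared with the CommitLimit line in /proc/meminfo.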

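OOM score parameters

/proc/${pid}/oom_score is the score the kernel assigns to a process
- The more memory a process consumes, the higher its OOM score and the more likely it is to be killed
- The score reflects only the share of the system's available resources that the process uses; it carries no notion of how important the process is
- On older kernels, a process that has run for a long time and used a lot of CPU time usually gets a lower oom_score
/proc/${pid}/oom_score_adj is the user-supplied adjustment; it lets the user steer which processes are killed when memory runs short
- The legacy parameter is oom_adj
  - Range [-17, +15], default 0; the larger the value, the more likely the process is to be killed
  - Setting it to -17 (echo -17 > /proc/$PID/oom_adj) exempts the process from the OOM killer
- oom_score_adj ranges from -1000 (OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX)
  - -1000 prevents the OOM killer from killing the process
  - echo "-1000" > /proc/$(pidof nginx)/oom_score_adj
- Formulas:
  - Kubernetes oom_score_adj for Burstable pods: min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
  - oom = oom_score + oom_score_adj
    - oom_score = memory consumed (resident memory RSS + page tables + swap) / total memory (physical memory + swap) * 1000
- Special cases
  - root processes get a 3% memory bonus; see man choom
  - Niced processes are most likely less important, so double their badness points
  - Processes with CAP_SYS_ADMIN and CAP_SYS_RAWIO, points /= 4
- Tools
  - pidstat -r -p <pid>
- Kernel code
  - Compute the score: unsigned long oom_badness()
  - Find the process with the highest score: static struct task_struct *select_bad_process()
  - Send the kill signal to the process: oom_kill_process()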
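
The scoring formula above can be tried out with plain shell arithmetic. This is a rough sketch of that formula only, not the kernel's oom_badness(): estimate_oom is a hypothetical helper name, all sizes are in kB, and integer division truncates the result.

```shell
#!/bin/sh
# Sketch: oom = (RSS + page tables + swap used) * 1000 / (RAM + swap) + oom_score_adj
# estimate_oom is a hypothetical helper; all sizes are in kB.
estimate_oom() {
  rss=$1
  pgtables=$2
  swap_used=$3
  ram_total=$4
  swap_total=$5
  adj=${6:-0}      # oom_score_adj, defaults to 0
  echo $(( (rss + pgtables + swap_used) * 1000 / (ram_total + swap_total) + adj ))
}

# A process with 600000 kB RSS and an assumed 2000 kB of page tables, on a
# machine with 3961028 kB of RAM and 3867644 kB of swap.
estimate_oom 600000 2000 0 3961028 3867644 0     # prints 76
estimate_oom 600000 2000 0 3961028 3867644 500   # prints 576
```

The second call shows how a positive oom_score_adj pushes a process toward the top of the kill list without changing its memory use.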

Use cases
- docker run --help | grep oom
  - --oom-kill-disable Disable OOM Killer
  - --oom-score-adj int Tune host's OOM preferences (-1000 to 1000)
- k8s
  - Guaranteed: high priority
  - BestEffort: very low priority
  - Burstable: computed with the formula above
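
The Burstable formula, min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999), is easy to evaluate by hand. The sketch below does so with shell integer arithmetic; burstable_oom_score_adj is a hypothetical helper name, and it assumes a 64-bit shell so that 1000 times the request does not overflow.

```shell
#!/bin/sh
# Sketch: oom_score_adj for a Burstable pod, clamped to [2, 999].
# burstable_oom_score_adj is a hypothetical helper name.
burstable_oom_score_adj() {
  request=$1     # memory request in bytes
  capacity=$2    # machine memory capacity in bytes
  adj=$(( 1000 - 1000 * request / capacity ))
  if [ "$adj" -lt 2 ]; then adj=2; fi
  if [ "$adj" -gt 999 ]; then adj=999; fi
  echo "$adj"
}

burstable_oom_score_adj 1073741824 4294967296   # 1 GiB of 4 GiB, prints 750
burstable_oom_score_adj 4294967296 4294967296   # whole machine, prints 2
```

A pod that requests more memory therefore gets a lower adjustment and is less likely to be picked first.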

Useful scripts

Print all processes whose oom_score is not 0

#!/bin/bash
# Displays running processes in descending order of OOM score
printf 'PID\tOOM Score\tOOM Adj\tCommand\n'
while read -r pid comm; do
  # A process may exit between ps and the reads below, so skip unreadable entries
  score=$(cat "/proc/$pid/oom_score" 2>/dev/null) || continue
  adj=$(cat "/proc/$pid/oom_score_adj" 2>/dev/null) || continue
  [ "$score" != 0 ] && printf '%d\t%d\t\t%d\t%s\n' "$pid" "$score" "$adj" "$comm"
done < <(ps -e -o pid= -o comm=) | sort -k 2nr

Get per-process swap usage

for i in $(ls /proc | grep "^[0-9]" | awk '$0>100'); do awk '/Swap:/{a=a+$2}END{print '"$i"',a/1024"M"}' /proc/$i/smaps;done| sort -k2nr | head

Periodically drop caches

echo 1 > /proc/sys/vm/drop_caches

choom

$ dpkg -S `which choom`
util-linux: /usr/bin/choom

$ apt install util-linux

help

man choom

$ choom --help

Usage:
 choom [options] -p pid
 choom [options] -n number -p pid
 choom [options] -n number [--] command [args...]]

Display and adjust OOM-killer score.

Options:
 -n, --adjust <num>     specify the adjust score value
 -p, --pid <num>        process ID

 -h, --help             display this help
 -V, --version          display version

For more details see choom(1).

Or run man choom for the full description.

Example

Setting the command parameter in a Supervisord configuration

command = sudo choom -n -500 -- sudo -u root /bin/abc

vmstat 1/iostat -x -k 1

vmstat 1 monitors memory in real time

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st gu
 1  0      0 2336712  51388 993864    0    0    56    21  159    1  0  0 100  0  0  0
 0  0      0 2336712  51388 993904    0    0     0     0  358  280  0  0 100  0  0  0
...
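
In the vmstat output, si and so are the swap-in and swap-out columns; sustained non-zero values there are a common sign of memory pressure. The sketch below filters for such samples. swap_activity is a hypothetical helper name, and it assumes the column layout shown above, where si and so are the 7th and 8th fields.

```shell
#!/bin/sh
# Sketch: report vmstat samples that show swap traffic.
# swap_activity is a hypothetical helper; it reads vmstat output on stdin
# and skips the two header lines because their first field is not a number.
swap_activity() {
  awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0) {print "swapping: si=" $7 " so=" $8}'
}

# Demo: the first data line is from this article, the second is made up
# for illustration.
printf '%s\n' \
  ' r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st gu' \
  ' 1  0      0 2336712  51388 993864    0    0    56    21  159    1  0  0 100  0  0  0' \
  ' 2  1  10240  120000  51388 993904  512 2048     0     0  358  280  0  0 100  0  0  0' \
  | swap_activity
# prints: swapping: si=512 so=2048
```

For live monitoring the function can be attached to the command directly, for example vmstat 1 | swap_activity.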

dstat

View OOM scores with dstat

$ dstat --top-oom
--out-of-memory---
    kill score
wireplumber    801
wireplumber    801
wireplumber    801^C

oomctl

man oomctl
systemctl status systemd-oomd.service

/proc/meminfo

$ cat /proc/meminfo |grep -E "Buffer|Cache|Swap|Mem|Shmem|Slab|SReclaimable|SUnreclaim"
MemTotal:        3961028 kB
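MemFree:         2327412 kB  # free physical memory
MemAvailable:    3119948 kB  # available physical memory, ~= MemFree+Buffers+Cached
Buffers:           51452 kB  # buffer cache: cached data from disk block devices
Cached:           924680 kB  # page cache: cached file data from file systems; the cache column of vmstat ~= Cached+SReclaimable
SwapCached:            0 kB
SwapTotal:       3867644 kB  # swap space: disk space used as an extension of physical memory
SwapFree:        3867644 kB
Shmem:             13684 kB  # shared memory used across processes
Slab:             204820 kB  # kernel slab allocator memory
SReclaimable:      69244 kB  # reclaimable part of Slab
SUnreclaim:       135576 kB  # unreclaimable part of Slab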
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
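
Two figures are often derived from these fields: how much memory is effectively in use, and how much is held as reclaimable cache. The sketch below assumes "used" means MemTotal minus MemAvailable and "cache" means Cached plus SReclaimable; meminfo_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: summarize /proc/meminfo-style text from a file argument or stdin.
# meminfo_summary is a hypothetical helper name.
meminfo_summary() {
  awk '
    $1 == "MemTotal:"     {total = $2}
    $1 == "MemAvailable:" {avail = $2}
    $1 == "Cached:"       {cached = $2}
    $1 == "SReclaimable:" {srec = $2}
    END {
      printf "used_pct=%d cache_kb=%d\n", (total - avail) * 100 / total, cached + srec
    }' "$@"
}

# Demo with the values from the sample output above.
printf '%s\n' \
  'MemTotal:        3961028 kB' \
  'MemAvailable:    3119948 kB' \
  'Cached:           924680 kB' \
  'SReclaimable:      69244 kB' \
  | meminfo_summary
# prints: used_pct=21 cache_kb=993924
```

On a live system, meminfo_summary /proc/meminfo gives the current figures.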

Limiting memory with cgroups

yum install -y libcgroup libcgroup-tools

# Limit the program to at most 30 MB of memory
cd /sys/fs/cgroup/memory/
mkdir memory_limit
cd memory_limit
mkdir 30m_limit
cd 30m_limit
echo 30m > memory.limit_in_bytes

# Run the program
cgexec -g memory:memory_limit/30m_limit python xxx.py
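
The cgroup v1 memory controller accepts k, m and g suffixes as binary multiples, but reading memory.limit_in_bytes back returns plain bytes. The sketch below converts a suffixed value so the two can be compared; to_bytes is a hypothetical helper name and handles only integer values with a single suffix.

```shell
#!/bin/sh
# Sketch: convert a size such as 30m into bytes using binary multiples.
# to_bytes is a hypothetical helper name.
to_bytes() {
  num=${1%[kKmMgG]}   # strip one trailing suffix letter, if any
  case "$1" in
    *[kK]) echo $(( num * 1024 )) ;;
    *[mM]) echo $(( num * 1024 * 1024 )) ;;
    *[gG]) echo $(( num * 1024 * 1024 * 1024 )) ;;
    *)     echo "$num" ;;
  esac
}

to_bytes 30m   # prints 31457280
```

After the echo 30m step above, cat memory.limit_in_bytes in the same directory should show this value.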
