nvidia-smi (NVIDIA System Management Interface) is a command-line tool built on top of the NVIDIA Management Library (NVML). It is mainly used to manage NVIDIA GPUs and monitor their status.
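Because nvidia-smi is only a front end to NVML, the same information can be read programmatically. Below is a minimal sketch using the third-party pynvml bindings (installing them via pip install nvidia-ml-py is an assumption, not something the original mentions) to query roughly the fields shown in the table that follows.

import pynvml

pynvml.nvmlInit()
try:
    # May return bytes on older pynvml versions.
    print("Driver:", pynvml.nvmlSystemGetDriverVersion())
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(h)
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)          # .total/.used/.free in bytes
        util = pynvml.nvmlDeviceGetUtilizationRates(h)   # .gpu/.memory in percent
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        print(f"GPU {i}: {name}, {mem.used // 2**20}/{mem.total // 2**20} MiB, "
              f"util {util.gpu}%, {temp}C")
finally:
    pynvml.nvmlShutdown()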
nvidia-smi output explained
$ nvidia-smi
Thu Sep 12 11:34:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67 Driver Version: 460.67 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:00:06.0 Off | N/A |
| 32% 55C P2 168W / 250W | 4693MiB / 11178MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:00:07.0 Off | N/A |
| 31% 55C P2 170W / 250W | 4693MiB / 11178MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 11973 C /root/chrome 4691MiB |
| 1 N/A N/A 11973 C /root/chrome 4691MiB |
+-----------------------------------------------------------------------------+
Field descriptions:
NVIDIA-SMI
Version of the nvidia-smi tool
Driver Version
Version of the NVIDIA driver
CUDA Version
Highest CUDA version supported by the installed driver (not necessarily the version of the CUDA toolkit that is installed)
GPU
Index (ID) of the GPU in this machine
Fan
Fan speed as a percentage (0 to 100); shows N/A if the card is not fan-cooled or the fan has failed
Name
GPU model name, here GeForce GTX 1080 Ti (truncated in the table)
Temp
GPU core temperature, in degrees Celsius
Perf
Performance state (P-state), ranging from P0 to P12; P0 is the highest-performance state and P12 the lowest
Persistence-M
Short for Persistence Mode
- With persistence mode enabled, the NVIDIA driver stays loaded even when no client is active, which minimizes the driver-load latency seen by applications that depend on it (e.g., CUDA programs)
- A value of On means persistence mode is enabled: sudo nvidia-smi -pm 1
- A value of Off means it is disabled: sudo nvidia-smi -pm 0
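The same setting can be read, and in principle toggled, through NVML. A hedged pynvml sketch follows; setting the mode requires root, and the exact constant name is an assumption based on common pynvml releases.

import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
# Read the current persistence mode (0 = disabled, 1 = enabled).
print("Persistence mode:", pynvml.nvmlDeviceGetPersistenceMode(h))
# Enabling it needs root privileges, same as `sudo nvidia-smi -pm 1`:
# pynvml.nvmlDeviceSetPersistenceMode(h, pynvml.NVML_FEATURE_ENABLED)
pynvml.nvmlShutdown()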
Pwr:Usage/Cap
Current power draw / power cap, in watts
Bus-Id
PCIe bus address of the GPU (domain:bus:device.function)
Disp.A
Display Active
Whether a display is connected to and driven by this GPU's output
Memory-Usage
Used GPU memory / total GPU memory
Volatile Uncorr. ECC
Count of volatile (since the last driver reload) uncorrected ECC (Error Correction Code) memory errors, used to detect memory errors
nvidia-smi -e 1
Enables ECC mode (takes effect after a reboot). With ECC enabled memory errors can be caught, but performance drops by roughly 15-25% and part of the GPU memory is reserved for the ECC bits
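For monitoring, the same volatile uncorrected-error counter can be read through NVML. A sketch with pynvml; the constant names are assumptions based on common pynvml releases and may differ in yours.

import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    # Volatile counter: uncorrected ECC errors since the last driver reload.
    errs = pynvml.nvmlDeviceGetTotalEccErrors(
        h, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_VOLATILE_ECC)
    print("Volatile uncorrected ECC errors:", errs)
except pynvml.NVMLError:
    # Consumer GPUs (e.g., the GTX 1080 Ti above) report N/A: ECC is unsupported.
    print("ECC not supported on this GPU")
pynvml.nvmlShutdown()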
GPU-Util
GPU utilization, as a percentage (see "A common misconception about GPU-Util" below)
Compute M.
Compute mode, set with nvidia-smi -c <mode> (e.g., nvidia-smi -c 0). There are three modes (see the sketch after this list):
- 0/Default: multiple processes share the GPU, with contention and queuing; this is the default
- 2/Prohibited: no compute processes are allowed on the GPU
- 3/Exclusive Process: a single process gets exclusive access
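A hedged pynvml sketch for reading the compute mode; the constant names are assumptions based on common pynvml releases, and changing the mode requires root.

import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
mode = pynvml.nvmlDeviceGetComputeMode(h)
names = {
    pynvml.NVML_COMPUTEMODE_DEFAULT: "Default",
    pynvml.NVML_COMPUTEMODE_PROHIBITED: "Prohibited",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: "Exclusive Process",
}
print("Compute mode:", names.get(mode, mode))
# Equivalent to `sudo nvidia-smi -c 3` (root required):
# pynvml.nvmlDeviceSetComputeMode(h, pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS)
pynvml.nvmlShutdown()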
MIG M.
MIG (Multi-Instance GPU) mode, which partitions one physical GPU into multiple independent, isolated GPU instances
Processes
The process list: for every process using a GPU it shows the GPU index, GI/CI IDs (only meaningful with MIG), PID, type (C = compute, G = graphics), process name, and GPU memory usage
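The same per-GPU process list can be fetched through NVML. A minimal pynvml sketch; it only lists compute ("C") processes, and usedGpuMemory may be unavailable on some setups.

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    # Compute processes only; graphics processes have a separate NVML call.
    for p in pynvml.nvmlDeviceGetComputeRunningProcesses(h):
        used = p.usedGpuMemory  # bytes, or None when not available
        mib = used // 2**20 if isinstance(used, int) else used
        print(f"GPU {i} PID {p.pid} uses {mib} MiB")
pynvml.nvmlShutdown()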
Other output formats:
# Refresh continuously
$ nvidia-smi -l
$ nvidia-smi -l 1 # refresh every second, similar to watch -n 1 nvidia-smi
$ nvidia-smi -l 5 # refresh every 5 seconds
# List GPU models and UUIDs
$ nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-8c46c14a-b9f8-76cc-1ca0-578288422e7d)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-ea7ca19c-8e5e-82e2-3730-77eb4fbb5331)
# JSON output, via https://github.com/Cheedoong/xml2json
$ nvidia-smi -x -q | xml2json | jq
# Show the current GPU clock frequencies
nvidia-smi -q -d CLOCK
# Show each GPU's current state and the reasons why its clocks are being slowed down
# If HW Slowdown or Unknown shows Active, the cause is most likely the power supply or the cooling system
nvidia-smi -q -d PERFORMANCE
# Show memory usage for each GPU
nvidia-smi -q -d MEMORY
# help
$ nvidia-smi --help-query-gpu
# Query data in CSV format
$ nvidia-smi --query-gpu=index,name,uuid,memory.used,memory.total --format=csv,noheader
0, GeForce GTX 1080 Ti, GPU-8c46c14a-b9f8-76cc-1ca0-578288422e7d, 4693 MiB, 11178 MiB
1, GeForce GTX 1080 Ti, GPU-ea7ca19c-8e5e-82e2-3730-77eb4fbb5331, 4693 MiB, 11178 MiB
$ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv
timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2021/09/12 13:39:31.884, GeForce GTX 1080 Ti, 00000000:00:06.0, 460.67, P2, 3, 3, 55, 100 %, 73 %, 11178 MiB, 6485 MiB, 4693 MiB
2021/09/12 13:39:31.885, GeForce GTX 1080 Ti, 00000000:00:07.0, 460.67, P2, 3, 3, 55, 100 %, 73 %, 11178 MiB, 6485 MiB, 4693 MiB
$ nvidia-smi --query-gpu=index,name,serial,uuid,pci.bus_id,fan.speed,temperature.gpu,pstate,power.draw,power.limit,memory.used,memory.total,utilization.gpu --format=csv,noheader
0, GeForce GTX 1080 Ti, [N/A], GPU-8c46c14a-b9f8-76cc-1ca0-578288422e7d, 00000000:00:06.0, 32 %, 55, P2, 168.90 W, 250.00 W, 4693 MiB, 11178 MiB, 100 %
1, GeForce GTX 1080 Ti, [N/A], GPU-ea7ca19c-8e5e-82e2-3730-77eb4fbb5331, 00000000:00:07.0, 32 %, 55, P2, 170.57 W, 250.00 W, 4693 MiB, 11178 MiB, 100 %
# Show detailed status for GPU 0
$ nvidia-smi -q -i 0
$ nvidia-smi -q -x # XML output
# Show the clock combinations supported by GPU 0
$ nvidia-smi -q -d SUPPORTED_CLOCKS -i 0
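When the CSV queries above need to be consumed by a script, a thin wrapper around nvidia-smi is often enough. The sketch below is one way to do it; the field list is an arbitrary choice, not from the original.

import csv
import subprocess

FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]

def query_gpus():
    """Run nvidia-smi --query-gpu and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={','.join(FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return [dict(zip(FIELDS, (c.strip() for c in row)))
            for row in csv.reader(out.splitlines())]

for gpu in query_gpus():
    print(f"GPU {gpu['index']}: {gpu['name']}, "
          f"{gpu['utilization.gpu']} % util, "
          f"{gpu['memory.used']}/{gpu['memory.total']} MiB")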
A common misconception about GPU-Util
GPU-Util reports the percentage of time during the sampling period in which at least one kernel was executing on the GPU. It says nothing about how many SMs were busy, so 100% utilization does not necessarily mean the GPU's compute capacity is saturated.
nvidia-smi topo
# Show the peer-to-peer status between GPUs for a given capability (n = NVLink)
$ nvidia-smi topo -p2p n
# Show the GPUDirect communication matrix, PCIe only
$ nvidia-smi topo -mp
GPU0 mlx5_0 mlx5_1 CPU Affinity
GPU0 X SYS SYS 0-23
mlx5_0 SYS X PIX
mlx5_1 SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
- Notes (for reference; see the sketch after this list):
  - Between GPUs: NV#, meaning the GPUs are connected by # bonded NVLinks (NV12 in the matrix above)
  - Between NICs:
    - On the same CPU: NODE, i.e. no NUMA hop is needed, but the path crosses PCIe Host Bridges
    - On different CPUs: SYS, i.e. the path crosses the inter-NUMA interconnect
  - Between a GPU and a NIC:
    - On the same CPU and under the same PCIe switch: PXB, i.e. the path only crosses PCIe switches
    - On the same CPU but not under the same PCIe switch: NODE, i.e. the path crosses PCIe switches and a PCIe Host Bridge
    - On different CPUs: SYS, i.e. the path crosses the inter-NUMA interconnect and PCIe switches; this is the longest path
- Further reading:
  - PCI (Peripheral Component Interconnect) was for a long time the most common expansion interface in personal computers, found on almost every motherboard; the topology above uses its successor, PCIe (PCI Express)
  - NVLink is a bus and communication protocol developed by NVIDIA
    - NVLink uses a point-to-point topology with serial transmission; it is used to connect a CPU to a GPU and can also interconnect multiple GPUs
  - NVSwitch is a high-speed interconnect fabric that allows many GPUs to communicate with one another at NVLink speed
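The PIX/PXB/PHB/NODE/SYS relationship between two GPUs can also be queried through NVML. A hedged sketch with pynvml; the topology constant names below are assumptions based on common pynvml releases and may be named slightly differently in yours.

import pynvml

pynvml.nvmlInit()
g0 = pynvml.nvmlDeviceGetHandleByIndex(0)
g1 = pynvml.nvmlDeviceGetHandleByIndex(1)
# Common ancestor on the PCIe/NUMA path, mirroring the topo -m legend.
level = pynvml.nvmlDeviceGetTopologyCommonAncestor(g0, g1)
names = {
    pynvml.NVML_TOPOLOGY_INTERNAL: "X (same board)",
    pynvml.NVML_TOPOLOGY_SINGLE: "PIX (single PCIe bridge)",
    pynvml.NVML_TOPOLOGY_MULTIPLE: "PXB (multiple PCIe bridges)",
    pynvml.NVML_TOPOLOGY_HOSTBRIDGE: "PHB (PCIe host bridge)",
    pynvml.NVML_TOPOLOGY_NODE: "NODE (same NUMA node)",
    pynvml.NVML_TOPOLOGY_SYSTEM: "SYS (across NUMA nodes)",
}
print("GPU0 <-> GPU1:", names.get(level, level))
pynvml.nvmlShutdown()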
nvidia-smi nvlink
$ nvidia-smi nvlink -h
nvlink -- Display NvLink information.
Usage: nvidia-smi nvlink [options]
Options include:
[-h | --help]: Display help information
[-i | --id]: Enumeration index, PCI bus ID or UUID.
[-l | --link]: Limit a command to a specific link. Without this flag, all link information is displayed.
[-s | --status]: Display link state (active/inactive).
[-c | --capabilities]: Display link capabilities.
[-p | --pcibusid]: Display remote node PCI bus ID for a link.
[-R | --remotelinkinfo]: Display remote device PCI bus ID and NvLink ID for a link.
[-sc | --setcontrol]: Setting counter control is deprecated!
[-gc | --getcontrol]: Getting counter control is deprecated!
[-g | --getcounters]: Getting counters using option -g is deprecated.
Please use option -gt/--getthroughput instead.
[-r | --resetcounters]: Resetting counters is deprecated!
[-e | --errorcounters]: Display error counters for a link.
[-ec | --crcerrorcounters]: Display per-lane CRC error counters for a link.
[-re | --reseterrorcounters]: Reset all error counters to zero.
[-gt | --getthroughput]: Display link throughput counters for specified counter type
The arguments consist of character string representing the type of traffic counted:
d: Display tx and rx data payload in KiB
r: Display tx and rx data payload and protocol overhead in KiB if supported
Commonly used commands
# Check the NVLink status of GPU 0
$ nvidia-smi nvlink -s -i 0
$ nvidia-smi nvlink --status -i 0
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxxxxxx)
Link 0: 25 GB/s
Link 1: 25 GB/s
...
Link 10: 25 GB/s
Link 11: 25 GB/s
# Show the NVLink capabilities of GPU 0
$ nvidia-smi nvlink -c -i 0
$ nvidia-smi nvlink --capabilities -i 0
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxxxxxx)
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false
...
# Check NVLink data transfer on GPU 0, i.e. monitor the NVLink throughput counters
# --getthroughput argument: d = data payload actually transferred (KiB), with protocol overhead stripped; r = total traffic including both protocol overhead and data payload (KiB)
$ nvidia-smi nvlink -gt d -i 0
$ nvidia-smi nvlink --getthroughput d -i 0
GPU 0: NVIDIA A800-SXM4-80GB (UUID: GPU-xxxxxxx)
Link 0: Data Tx: 18217690481 KiB
Link 0: Data Rx: 14208675690 KiB
Link 1: Data Tx: 18217788589 KiB
Link 1: Data Rx: 14208268861 KiB
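NVLink state can also be polled through NVML. A minimal pynvml sketch; the link count of 12 is taken from the A100/A800 examples above and is an assumption, since older GPUs expose fewer links.

import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
for link in range(12):  # A100 exposes 12 NVLinks; adjust for your GPU
    try:
        state = pynvml.nvmlDeviceGetNvLinkState(h, link)
        active = state == pynvml.NVML_FEATURE_ENABLED
        print(f"Link {link}: {'active' if active else 'inactive'}")
    except pynvml.NVMLError:
        # Link index not present on this GPU.
        break
pynvml.nvmlShutdown()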
# Show how the GPUs are interconnected
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 CPU Affinity NUMA Affinity
GPU0 X NV2 NV2 NV2 NODE NODE 0,2,4,6,8,10 0
GPU1 NV2 X NV2 NV2 NODE NODE 0,2,4,6,8,10 0
GPU2 NV2 NV2 X NV2 NODE NODE 0,2,4,6,8,10 0
GPU3 NV2 NV2 NV2 X NODE NODE 0,2,4,6,8,10 0
NIC0 NODE NODE NODE NODE X PIX
NIC1 NODE NODE NODE NODE PIX X