Using GPU-related commands on Linux
lspci
VGA
$ lspci | grep -i vga
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:06.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
00:07.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
NVIDIA GPU
$ lspci | grep -i nvidia
00:06.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
00:07.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
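When the list of GPUs is needed inside a script rather than on screen, the same filtering can be done in Python. The following is a minimal sketch, assuming the plain `lspci` line layout shown above (address, device class, description); the helper names `parse_lspci` and `nvidia_vga_addresses` are made up for this illustration.

```python
import re

# Matches lines such as:
# 00:06.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
LSPCI_LINE = re.compile(
    r"^(?P<address>\S+)\s+(?P<device_class>[^:]+):\s+(?P<description>.+)$"
)


def parse_lspci(output):
    """Return one dict per device line found in plain `lspci` output."""
    devices = []
    for line in output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match:
            devices.append(match.groupdict())
    return devices


def nvidia_vga_addresses(output):
    """Return the PCI addresses of VGA controllers whose description mentions NVIDIA."""
    return [
        device["address"]
        for device in parse_lspci(output)
        if "vga" in device["device_class"].lower()
        and "nvidia" in device["description"].lower()
    ]


# Sample text copied from the lspci output above.
SAMPLE = """\
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:06.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
00:07.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
"""

print(nvidia_vga_addresses(SAMPLE))  # ['00:06.0', '00:07.0']
```

On a real machine, the sample string could be replaced with `subprocess.run(["lspci"], capture_output=True, text=True, check=True).stdout`.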
Graphics card details
$ lspci -v -s 00:06.0 # 00:06.0 is the device's PCI address (bus:device.function)
00:06.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Dell GP102 [GeForce GTX 1080 Ti]
Physical Slot: 6
Flags: bus master, fast devsel, latency 0, IRQ 33
Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at f0000000 (64-bit, prefetchable) [size=32M]
I/O ports at c000 [size=128]
[virtual] Expansion ROM at fe000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
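The last two lines of this output are often the most useful ones: they show which kernel driver is actually bound to the card (`nvidia` here, rather than `nouveau`). A small sketch of pulling those fields out in Python follows; it assumes the `Key: value` layout shown above, and `parse_lspci_verbose` is a hypothetical helper name.

```python
WANTED_KEYS = ("Subsystem", "Kernel driver in use", "Kernel modules")


def parse_lspci_verbose(block):
    """Extract a few 'Key: value' fields from one device block of `lspci -v`."""
    info = {}
    for line in block.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep and key in WANTED_KEYS:
            info[key] = value
    # The module list is comma-separated; split it into a Python list.
    if "Kernel modules" in info:
        info["Kernel modules"] = info["Kernel modules"].split(", ")
    return info


# Abbreviated copy of the `lspci -v -s 00:06.0` output above.
SAMPLE = """\
00:06.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: Dell GP102 [GeForce GTX 1080 Ti]
    Flags: bus master, fast devsel, latency 0, IRQ 33
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
"""

info = parse_lspci_verbose(SAMPLE)
if info.get("Kernel driver in use") != "nvidia":
    print("The proprietary nvidia driver is not bound to this device")
print(info)
```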
NVIDIA tools
nvidia-smi (NVIDIA System Management Interface) is a command-line utility built on the NVIDIA Management Library (NVML). It is used mainly to manage GPUs and monitor their status.
$ nvidia-smi
Thu Sep 12 11:34:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67 Driver Version: 460.67 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:00:06.0 Off | N/A |
| 32% 55C P2 168W / 250W | 4693MiB / 11178MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:00:07.0 Off | N/A |
| 31% 55C P2 170W / 250W | 4693MiB / 11178MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 11973 C /root/chrome 4691MiB |
| 1 N/A N/A 11973 C /root/chrome 4691MiB |
+-----------------------------------------------------------------------------+
# Refresh continuously (loops at the default interval)
$ nvidia-smi -l
# List each GPU's model and UUID
$ nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-8c46c14a-b9f8-76cc-1ca0-578288422e7d)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-ea7ca19c-8e5e-82e2-3730-77eb4fbb5331)
# JSON output, using xml2json from https://github.com/Cheedoong/xml2json
$ nvidia-smi -x -q | xml2json | jq
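If installing xml2json is not an option, the XML can also be converted with Python's standard library. This is only a sketch: `element_to_dict` is a hypothetical helper, and the XML snippet is a heavily abbreviated stand-in whose element names are assumptions, so check them against the real `nvidia-smi -x -q` output of your driver version.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an XML element to plain dicts, lists and strings."""
    children = list(element)
    if not children:
        # Leaf element: keep only its text (attributes on leaves are dropped).
        return (element.text or "").strip()
    result = dict(element.attrib)
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tag (for example several GPUs): collect values in a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Illustrative snippet only; real output contains many more elements.
SAMPLE_XML = """<nvidia_smi_log>
  <driver_version>460.67</driver_version>
  <gpu id="00000000:00:06.0"><product_name>GeForce GTX 1080 Ti</product_name></gpu>
  <gpu id="00000000:00:07.0"><product_name>GeForce GTX 1080 Ti</product_name></gpu>
</nvidia_smi_log>"""

root = ET.fromstring(SAMPLE_XML)
doc = {root.tag: element_to_dict(root)}
print(json.dumps(doc, indent=2))
```

One design caveat: a tag that appears once becomes a single value and a tag that appears several times becomes a list, so a single-GPU machine would yield a dict under `gpu` rather than a one-element list.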
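Field descriptions for the default nvidia-smi table:
- GPU: index of the GPU
- Fan: fan speed as a percentage (0 to 100) of the maximum; shows N/A if the GPU is not fan-cooled or the fan has failed
- Name: GPU model name, here GeForce GTX 1080 Ti (truncated in the table to "GeForce GTX 108...")
- Temp: GPU core temperature in degrees Celsius
- Perf: performance state, from P0 (maximum performance) to P12 (minimum performance)
- Persistence-M: whether persistence mode is enabled; Off in this output
- Pwr:Usage/Cap: current power draw / power limit
- Bus-Id: PCI bus ID, in domain:bus:device.function form
- Disp.A: Display Active, whether a display has been initialized on the GPU
- Memory-Usage: GPU memory used / total
- GPU-Util: GPU utilization, the percentage of the last sample period during which the GPU was busy
- Volatile Uncorr. ECC: number of uncorrected ECC errors since the driver was last loaded; N/A on GPUs without ECC support
- Compute M.: compute mode (Default here)
- MIG M.: MIG (Multi-Instance GPU) mode; N/A on GPUs that do not support it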
Other output formats:
# List the fields that --query-gpu accepts
$ nvidia-smi --help-query-gpu
# Get data in CSV format
$ nvidia-smi --query-gpu=index,name,uuid,memory.used,memory.total --format=csv,noheader
0, GeForce GTX 1080 Ti, GPU-8c46c14a-b9f8-76cc-1ca0-578288422e7d, 4693 MiB, 11178 MiB
1, GeForce GTX 1080 Ti, GPU-ea7ca19c-8e5e-82e2-3730-77eb4fbb5331, 4693 MiB, 11178 MiB
$ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv
timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2021/09/12 13:39:31.884, GeForce GTX 1080 Ti, 00000000:00:06.0, 460.67, P2, 3, 3, 55, 100 %, 73 %, 11178 MiB, 6485 MiB, 4693 MiB
2021/09/12 13:39:31.885, GeForce GTX 1080 Ti, 00000000:00:07.0, 460.67, P2, 3, 3, 55, 100 %, 73 %, 11178 MiB, 6485 MiB, 4693 MiB
$ nvidia-smi --query-gpu=index,name,serial,uuid,pci.bus_id,fan.speed,temperature.gpu,pstate,power.draw,power.limit,memory.used,memory.total,utilization.gpu --format=csv,noheader
0, GeForce GTX 1080 Ti, [N/A], GPU-8c46c14a-b9f8-76cc-1ca0-578288422e7d, 00000000:00:06.0, 32 %, 55, P2, 168.90 W, 250.00 W, 4693 MiB, 11178 MiB, 100 %
1, GeForce GTX 1080 Ti, [N/A], GPU-ea7ca19c-8e5e-82e2-3730-77eb4fbb5331, 00000000:00:07.0, 32 %, 55, P2, 170.57 W, 250.00 W, 4693 MiB, 11178 MiB, 100 %
# Print once per second, similar to watch -n 1 nvidia-smi
$ nvidia-smi -l 1
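The CSV output of `--query-gpu` is convenient to consume from a script. Below is a minimal sketch that parses the `index,name,uuid,memory.used,memory.total` query shown earlier; the function names `parse_query_gpu` and `mib` are invented for this example, and the sample rows are copied from the output above.

```python
import csv
import io

# Must match the field list passed to --query-gpu, in the same order.
FIELDS = ["index", "name", "uuid", "memory.used", "memory.total"]


def parse_query_gpu(csv_text, fields=FIELDS):
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    # skipinitialspace drops the blank that nvidia-smi prints after each comma.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(fields, row)) for row in reader if row]


def mib(value):
    """Convert a string such as '4693 MiB' to the integer 4693."""
    number, unit = value.split()
    if unit != "MiB":
        raise ValueError(f"unexpected unit in {value!r}")
    return int(number)


SAMPLE = """\
0, GeForce GTX 1080 Ti, GPU-8c46c14a-b9f8-76cc-1ca0-578288422e7d, 4693 MiB, 11178 MiB
1, GeForce GTX 1080 Ti, GPU-ea7ca19c-8e5e-82e2-3730-77eb4fbb5331, 4693 MiB, 11178 MiB
"""

gpus = parse_query_gpu(SAMPLE)
for gpu in gpus:
    used, total = mib(gpu["memory.used"]), mib(gpu["memory.total"])
    print(f'GPU {gpu["index"]}: {used} / {total} MiB used ({used / total:.0%})')
```

To read live data, the sample string could be replaced with the standard output of `subprocess.run(["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"], capture_output=True, text=True, check=True)`.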
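FAQ
nvidia-smi runs slowly
Enabling persistence mode keeps the driver initialized between invocations, which usually makes nvidia-smi respond much faster: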
sudo nvidia-persistenced --persistence-mode
No CMAKE_CUDA_COMPILER could be found
Building a C++ project that depends on CUDA fails with the following error:
CMake Error at CMakeLists.txt:4 (project):
No CMAKE_CUDA_COMPILER could be found.
Tell CMake where to find the compiler by setting either the environment
variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
path to the compiler, or to the compiler name if it is in the PATH.
Fix: add the CUDA toolkit's bin directory, which contains nvcc, to PATH:
export PATH=/usr/local/cuda/bin:$PATH
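Before rerunning CMake, it can help to confirm where `nvcc` actually lives. The sketch below searches PATH first and then a fallback directory; `find_nvcc` is a hypothetical helper, and `/usr/local/cuda/bin` is only the conventional install location, so adjust it if the toolkit is installed elsewhere.

```python
import os
import shutil


def find_nvcc(search_path=None, extra_dirs=("/usr/local/cuda/bin",)):
    """Return the full path to nvcc, or None if it cannot be found.

    search_path=None means "use the PATH environment variable".
    extra_dirs are checked afterwards, in order.
    """
    found = shutil.which("nvcc", path=search_path)
    if found:
        return found
    for directory in extra_dirs:
        found = shutil.which("nvcc", path=directory)
        if found:
            return found
    return None


nvcc = find_nvcc()
if nvcc is None:
    print("nvcc not found: install the CUDA toolkit or pass its bin directory in extra_dirs")
else:
    # Either variable named in the CMake error message can point at this path.
    print(f"export CUDACXX={nvcc}")
    print(f"export PATH={os.path.dirname(nvcc)}:$PATH")
```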
Selecting GPUs
When training a model, set the following environment variable to choose which GPU indices the process can use:
export CUDA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=0,1
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0,1' # usually set at the top of the program, before any CUDA-using library is initialized
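Two details are easy to miss here. CUDA renumbers the visible devices from 0 in the order they are listed, and the variable can be set per child process instead of globally. The sketch below illustrates both; `visible_device_map` and `run_on_gpus` are hypothetical helper names, and the child command only echoes the variable rather than running a real training job.

```python
import os
import subprocess
import sys


def visible_device_map(value):
    """Map the index a program sees to the physical GPU index listed in value."""
    physical = [item.strip() for item in value.split(",") if item.strip()]
    return dict(enumerate(physical))


def run_on_gpus(command, gpu_ids):
    """Run command with CUDA_VISIBLE_DEVICES restricted to gpu_ids."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(i) for i in gpu_ids))
    return subprocess.run(command, env=env, capture_output=True, text=True, check=True)


# With "1,0", device 0 inside the program is physical GPU 1.
print(visible_device_map("1,0"))  # {0: '1', 1: '0'}

# Stand-in for a training script: the child just prints what it was given.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = run_on_gpus(child, [1])
print(result.stdout.strip())  # 1
```

Passing the variable through `env` keeps the parent process unchanged, which is useful when launching several jobs that should each see a different GPU.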