Using GPU-Related Commands


A quick reference for GPU-related commands on Linux.

lspci

VGA

$ lspci | grep -i vga
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:06.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
00:07.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
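
If a card shows up only as a raw ID, as with Device 1234:1111 above (here likely a virtual VGA adapter), the local PCI ID database may be out of date; refreshing it with the pciutils helper often resolves more device names:

# Refresh the PCI ID database (requires network access)
$ sudo update-pciids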

NVIDIA GPU

$ lspci | grep -i nvidia
00:06.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
00:07.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

Detailed GPU information

$ lspci -v -s 00:06.0  # 00:06.0 is the device's bus address from the listing above
00:06.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Dell GP102 [GeForce GTX 1080 Ti]
	Physical Slot: 6
	Flags: bus master, fast devsel, latency 0, IRQ 33
	Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Memory at f0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at c000 [size=128]
	[virtual] Expansion ROM at fe000000 [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
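
To see at a glance which kernel driver each NVIDIA device is actually bound to (the proprietary nvidia driver versus the open-source nouveau), lspci -k covers all devices at once:

# -k shows the driver in use and the available kernel modules per device
$ lspci -k | grep -A 3 -i nvidia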

NVIDIA tools

nvidia-smi (NVIDIA System Management Interface) is a command-line tool built on the NVIDIA Management Library (NVML), used mainly to manage NVIDIA GPUs and monitor their state.

$ nvidia-smi
Thu Sep 12 11:34:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.67       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:00:06.0 Off |                  N/A |
| 32%   55C    P2   168W / 250W |   4693MiB / 11178MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:00:07.0 Off |                  N/A |
| 31%   55C    P2   170W / 250W |   4693MiB / 11178MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11973      C   /root/chrome                     4691MiB |
|    1   N/A  N/A     11973      C   /root/chrome                     4691MiB |
+-----------------------------------------------------------------------------+

# Refresh in a loop (default: every 5 seconds)
$ nvidia-smi -l

# List GPU models and UUIDs
$ nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-8c46c14a-b9f8-76cc-1ca0-578288422e7d)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-ea7ca19c-8e5e-82e2-3730-77eb4fbb5331)

# JSON output, via https://github.com/Cheedoong/xml2json
$ nvidia-smi -x -q | xml2json | jq
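
The XML root element is nvidia_smi_log, so individual fields can be picked out with jq. A minimal sketch, assuming xml2json keeps the XML element names as JSON keys:

# Per-GPU utilization pulled from the JSON tree
$ nvidia-smi -x -q | xml2json | jq '.nvidia_smi_log.gpu[].utilization.gpu_util'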

Field descriptions:

  • GPU: GPU index
  • Fan: fan speed as a percentage (0-100); shows N/A if the card is not fan-cooled or the fan has failed
  • Name: GPU model name, here GeForce GTX 1080 Ti (truncated in the table)
  • Temp: GPU core temperature in degrees Celsius
  • Perf: performance state, from P0 to P12; P0 is maximum performance, P12 is the minimum
  • Persistence-M: persistence mode, Off in this output
  • Pwr:Usage/Cap: current power draw / power limit
  • Bus-Id: PCI bus address
  • Disp.A: Display Active, whether a display output is initialized on the GPU
  • Memory-Usage: memory used / total memory
  • Volatile GPU-Util: GPU utilization since the last sample
  • Uncorr. ECC: uncorrectable ECC error count; N/A on consumer cards without ECC
  • Compute M.: compute mode
  • MIG M.: Multi-Instance GPU mode; N/A on GPUs that do not support MIG
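
Most of these fields can also be dumped in a verbose, human-readable form with nvidia-smi -q, optionally filtered to specific sections:

# Full report for GPU 0 only
$ nvidia-smi -q -i 0

# Just the memory and utilization sections
$ nvidia-smi -q -d MEMORY,UTILIZATION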

Other output formats:

# help
$ nvidia-smi --help-query-gpu

# CSV-format output
$ nvidia-smi --query-gpu=index,name,uuid,memory.used,memory.total --format=csv,noheader
0, GeForce GTX 1080 Ti, GPU-8c46c14a-b9f8-76cc-1ca0-578288422e7d, 4693 MiB, 11178 MiB
1, GeForce GTX 1080 Ti, GPU-ea7ca19c-8e5e-82e2-3730-77eb4fbb5331, 4693 MiB, 11178 MiB

$ nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv
timestamp, name, pci.bus_id, driver_version, pstate, pcie.link.gen.max, pcie.link.gen.current, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2021/09/12 13:39:31.884, GeForce GTX 1080 Ti, 00000000:00:06.0, 460.67, P2, 3, 3, 55, 100 %, 73 %, 11178 MiB, 6485 MiB, 4693 MiB
2021/09/12 13:39:31.885, GeForce GTX 1080 Ti, 00000000:00:07.0, 460.67, P2, 3, 3, 55, 100 %, 73 %, 11178 MiB, 6485 MiB, 4693 MiB

$ nvidia-smi --query-gpu=index,name,serial,uuid,pci.bus_id,fan.speed,temperature.gpu,pstate,power.draw,power.limit,memory.used,memory.total,utilization.gpu --format=csv,noheader
0, GeForce GTX 1080 Ti, [N/A], GPU-8c46c14a-b9f8-76cc-1ca0-578288422e7d, 00000000:00:06.0, 32 %, 55, P2, 168.90 W, 250.00 W, 4693 MiB, 11178 MiB, 100 %
1, GeForce GTX 1080 Ti, [N/A], GPU-ea7ca19c-8e5e-82e2-3730-77eb4fbb5331, 00000000:00:07.0, 32 %, 55, P2, 170.57 W, 250.00 W, 4693 MiB, 11178 MiB, 100 %

# Print once per second, equivalent to watch -n 1 nvidia-smi
$ nvidia-smi -l 1
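
The query and loop flags combine, which makes a simple CSV logger; the output path below is just an example:

# Append one sample per second to a log file
$ nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv,noheader -l 1 >> /tmp/gpu.csv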

FAQ

nvidia-smi is slow to run

Enabling persistence mode keeps the driver initialized between invocations, which speeds nvidia-smi up considerably:

sudo nvidia-persistenced --persistence-mode
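
To confirm the setting took effect, query the persistence_mode field; on older drivers the per-GPU legacy flag does the same job:

# Should now report Enabled
$ nvidia-smi --query-gpu=persistence_mode --format=csv

# Legacy per-GPU alternative (requires root)
$ sudo nvidia-smi -i 0 -pm 1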

No CMAKE_CUDA_COMPILER could be found

When compiling C++ code that depends on CUDA, CMake fails with:

CMake Error at CMakeLists.txt:4 (project):
  No CMAKE_CUDA_COMPILER could be found.

  Tell CMake where to find the compiler by setting either the environment
  variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
  path to the compiler, or to the compiler name if it is in the PATH.

Fix: put the CUDA toolkit's bin directory (which contains nvcc) on the PATH:

export PATH=/usr/local/cuda/bin:$PATH
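
Alternatively, pass the compiler to CMake directly via the cache entry the error message mentions (the path below assumes the default CUDA install location):

$ cmake -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc ..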

How to select specific GPUs

When training a model, set the following environment variable to choose which GPUs the process can use:

# Expose only GPU 0
export CUDA_VISIBLE_DEVICES=0
# Expose GPUs 0 and 1
export CUDA_VISIBLE_DEVICES=0,1

  • PyTorch example

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0,1'  # usually set at the top of the script, before CUDA is initialized
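
To verify the restriction from the shell (assuming PyTorch is installed):

# Should print 1, since only one GPU is visible to the process
$ CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.device_count())"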

Unable to determine the device handle for GPU 0000:06:00.0

Fixes:

  • Power the machine off and back on
  • If the NVIDIA driver is corrupted, remove and reinstall it (check the kernel log first, as shown below)
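
Before reinstalling, it is worth checking the kernel log for messages from the NVIDIA driver; Xid entries usually identify which GPU is failing and why:

# Driver fault reports show up as Xid / NVRM messages
$ dmesg | grep -iE 'xid|nvrm'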