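Configuring and Deploying GPU Device Scheduling in Kubernetes
Introduction
To schedule GPU devices, Kubernetes needs the following in place:
- Install the NVIDIA driver on each k8s node
- Install nvidia-docker2 on each k8s node
- Install NVIDIA/k8s-device-plugin in the k8s cluster
- Label the nodes
- Install NVIDIA/dcgm-exporter, which exposes monitoring metrics for Prometheus
All of the above can be automated with NVIDIA/gpu-operator; the manual deployment process is described below.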
nvidia-container-toolkit
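Its main purpose is to mount NVIDIA GPU devices into containers.
The NVIDIA Container Toolkit consists of:
- The nvidia-docker wrapper: nvidia-docker uses environment variables to specify which GPUs on the node a docker container should use (this is nvidia-docker2).
- The NVIDIA Container Runtime (nvidia-container-runtime): nvidia-container-runtime takes the container's runC spec as input, injects nvidia-container-toolkit into it as a prestart hook, and hands the modified runC spec to runC.
- The NVIDIA Container Runtime Hook (nvidia-container-toolkit / nvidia-container-runtime-hook): nvidia-container-toolkit implements the runC prestart hook interface. runC calls it after creating the container and before starting it; its main job is to modify the config.json associated with the container, injecting the information needed to use NVIDIA GPU devices inside the container.
- The NVIDIA Container Library and CLI (libnvidia-container, nvidia-container-cli): libnvidia-container is the library that implements GPU device support inside containers, and nvidia-container-cli is its CLI tool.
For diagrams covering docker, containerd, and lxc, see the reference.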
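As a rough illustration of the environment-variable mechanism described above, the sketch below wraps docker run and passes the GPU selection through NVIDIA_VISIBLE_DEVICES. The function name run_on_gpus, the DOCKER override variable, and the image tag in the usage comment are assumptions made for illustration; they do not come from the original text.

```shell
# Sketch: choose which GPUs a container sees via NVIDIA_VISIBLE_DEVICES.
# DOCKER is a hypothetical override so the command can be previewed
# without a GPU host, e.g. DOCKER=echo.
DOCKER="${DOCKER:-docker}"

# run_on_gpus DEVICES IMAGE [CMD...]
#   DEVICES: "all", or a comma-separated list of GPU indices, e.g. "0,1"
run_on_gpus() {
  devices="$1"
  image="$2"
  shift 2
  "$DOCKER" run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES="$devices" \
    "$image" "$@"
}

# Usage on a GPU node with the nvidia runtime configured
# (the image tag is an assumption; any CUDA base image should do):
#   run_on_gpus 0 nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
```

Passing the runtime explicitly keeps the example independent of whether "default-runtime" has already been set in daemon.json.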
Installation
NVIDIA driver
Download: https://www.nvidia.cn/Download/index.aspx?lang=cn
Install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
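The distribution variable used in these commands is the concatenation of ID and VERSION_ID from /etc/os-release (for example, ubuntu20.04 on Ubuntu 20.04), and it selects the matching repository list. A minimal sketch of that step, with a hypothetical helper name distribution_id and an optional file argument added only so it can be tried against a sample file:

```shell
# distribution_id [OS_RELEASE_FILE]
# Prints ID followed by VERSION_ID, e.g. "ubuntu20.04".
# Pass an absolute path when overriding the default file.
distribution_id() {
  file="${1:-/etc/os-release}"
  # Source the file in a subshell so ID and VERSION_ID do not leak
  # into the calling shell.
  ( . "$file" && echo "${ID}${VERSION_ID}" )
}

# Usage: preview the repository list URL before adding it.
#   echo "https://nvidia.github.io/libnvidia-container/$(distribution_id)/libnvidia-container.list"
```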
Edit /etc/docker/daemon.json:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
sudo systemctl restart docker
Install nvidia-docker2
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-docker2
sudo pkill -SIGHUP dockerd
- Check that /etc/docker/daemon.json contains "default-runtime": "nvidia"
- Restart Docker
sudo systemctl restart docker
docker version
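The daemon.json check in the list above can be scripted. The following is a minimal sketch using grep; the function name has_nvidia_default_runtime is hypothetical, and the pattern only looks for the key/value pair rather than parsing the JSON.

```shell
# has_nvidia_default_runtime [DAEMON_JSON]
# Succeeds (exit status 0) if the file sets "default-runtime" to "nvidia".
has_nvidia_default_runtime() {
  file="${1:-/etc/docker/daemon.json}"
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$file"
}

# Usage:
#   has_nvidia_default_runtime && echo "default runtime is nvidia"
# After restarting Docker, the running daemon can also be inspected:
#   docker info | grep -i 'default runtime'
```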
Label the nodes
For AMD GPUs, node-labeller can be installed to apply GPU-related labels to GPU nodes, for example beta.amd.com/gpu.family.AI=1
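The text only gives an AMD example. For NVIDIA nodes, one possible approach is to apply a label by hand with kubectl label. The sketch below assumes a made-up label, accelerator=nvidia-gpu, and a hypothetical KUBECTL override variable; neither appears in the original, so substitute whatever label your node selectors actually use.

```shell
# KUBECTL is a hypothetical override; KUBECTL=echo previews the command.
KUBECTL="${KUBECTL:-kubectl}"

# label_gpu_node NODE [KEY=VALUE]
# Applies a label to a node; the default label is an illustrative assumption.
label_gpu_node() {
  node="$1"
  label="${2:-accelerator=nvidia-gpu}"
  "$KUBECTL" label nodes "$node" "$label" --overwrite
}

# Usage:
#   label_gpu_node gpu-node-1
#   kubectl get nodes -l accelerator=nvidia-gpu
```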
Install NVIDIA/k8s-device-plugin
Install the nvidia-device-plugin-daemonset in the Kubernetes cluster:
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
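Once the device plugin pods are running, GPU nodes should advertise the nvidia.com/gpu resource. A quick way to check this, sketched here with a hypothetical helper name list_gpu_allocatable and the same KUBECTL override convention (an assumption, not part of the original):

```shell
# KUBECTL is a hypothetical override; KUBECTL=echo previews the command.
KUBECTL="${KUBECTL:-kubectl}"

# list_gpu_allocatable
# Prints each node with its allocatable nvidia.com/gpu count.
# The dot in "nvidia.com" is escaped so it is not read as a path separator.
list_gpu_allocatable() {
  "$KUBECTL" get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Usage:
#   list_gpu_allocatable
# Nodes without GPUs (or without a working device plugin) show <none>.
```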
Verification
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
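After the pod has run to completion, its logs indicate whether the GPU was usable. The sketch below assumes the vectorAdd sample prints a line containing "Test PASSED" on success, as in NVIDIA's published sample output; the helper name vectoradd_passed is hypothetical.

```shell
# vectoradd_passed
# Reads pod logs on stdin; succeeds if the sample reported success.
vectoradd_passed() {
  grep -q 'Test PASSED'
}

# Usage:
#   kubectl logs gpu-pod | vectoradd_passed && echo "GPU scheduling works"
# If the pod stays Pending, "kubectl describe pod gpu-pod" shows the
# scheduling events.
```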
Install NVIDIA/dcgm-exporter
kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml
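To confirm that the exporter is serving metrics for Prometheus, its /metrics endpoint can be queried directly. The sketch below filters one metric out of Prometheus text-format output. The helper name dcgm_metric is hypothetical; the pod label, port 9400, and metric name in the usage comments follow the upstream dcgm-exporter README as best recalled and should be treated as assumptions to verify against your deployment.

```shell
# dcgm_metric NAME
# Reads Prometheus text-format metrics on stdin and prints the sample
# lines for NAME, skipping "# HELP" / "# TYPE" comment lines.
dcgm_metric() {
  name="$1"
  grep -E "^${name}(\{| )"
}

# Usage (assumed pod label and port):
#   NAME=$(kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" \
#            -o "jsonpath={.items[0].metadata.name}")
#   kubectl port-forward "$NAME" 8080:9400 &
#   curl -sL http://127.0.0.1:8080/metrics | dcgm_metric DCGM_FI_DEV_GPU_UTIL
```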