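Introduction
The NVIDIA device plugin is deployed to a Kubernetes cluster as a DaemonSet. Once deployed, it makes it possible to:
- Expose the number of GPUs on each node in the cluster
- Track the health of the GPUs
- Run GPU-enabled containers in the Kubernetes cluster
Prerequisites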
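One way to see the per-node GPU count is to read each node's allocatable resources with kubectl. The sketch below assumes kubectl is installed and pointed at the cluster, and that GPUs are advertised under the resource name nvidia.com/gpu. The function name list_node_gpus is made up for this example.

```shell
#!/usr/bin/env bash
# List each node with the number of GPUs it advertises as allocatable.
# The dot inside the resource name "nvidia.com/gpu" must be escaped
# so kubectl does not treat it as a field separator.
list_node_gpus() {
    kubectl get nodes \
        -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Only query the cluster when kubectl is installed on this machine.
if command -v kubectl >/dev/null 2>&1; then
    list_node_gpus || echo "kubectl could not reach the cluster" >&2
fi
```

Nodes without GPUs, or nodes where the plugin is not running, would be expected to show an empty value in the GPUS column.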
- NVIDIA drivers ~= 384.81
- nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
- nvidia-container-runtime configured as the default low-level runtime
- Kubernetes version >= 1.10
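The version requirements above can be checked on a node with a small helper. This is a minimal sketch with three assumptions: `sort -V` (GNU coreutils version sort) is available, nvidia-smi is on the PATH of a GPU node, and the `~= 384.81` driver requirement is read as a lower bound. The helper names version_ge and check_min are made up for this example.

```shell
#!/usr/bin/env bash
# Succeed (exit status 0) when version $1 is greater than or equal to version $2.
# Uses `sort -V` so that 1.10 correctly sorts after 1.9.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print whether a detected version meets a required minimum.
check_min() {
    local name="$1" found="$2" required="$3"
    if version_ge "$found" "$required"; then
        echo "OK: $name $found >= $required"
    else
        echo "FAIL: $name $found < $required"
        return 1
    fi
}

# Check the driver only when nvidia-smi is present on this node.
# Assumption: 384.81 is treated as a minimum version.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    check_min "NVIDIA driver" "$driver" "384.81" || true
fi
```

The same check_min call can be reused for the toolkit and Kubernetes versions once their version strings have been obtained by whatever means the node provides.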
Quick Start
Preparing the GPU nodes
1. Install nvidia-container-toolkit on every node
2. Set nvidia-container-runtime as the default container runtime
# cat /etc/docker/daemon.json
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "data-root": "/data/docker",
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }