Unverified Commit b8026bc5 authored by Jörg Thalheim, committed by GitHub

k3s: update nvidia gpu docs (#445944)

The commit adds one member to the k3s maintainer team:

```
@@ -667,6 +667,7 @@ with lib.maintainers;
    members = [
      euank
      frederictobiasc
      heywoodlh
      marcusramberg
      mic92
      rorosen
```

The rest of the commit rewrites the k3s Nvidia GPU guide:

# Nvidia GPU Support

To use an Nvidia GPU in the cluster, the Nvidia driver and the Nvidia container toolkit are needed. Both can be enabled with a few lines of NixOS configuration.
> Note: this article assumes `services.k3s.enable = true;` is already set
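
For reference, a minimal sketch of that prerequisite; the `role` value is an assumption, adjust it for your cluster:

```
services.k3s = {
  enable = true;
  role = "server"; # assumption: single-node setup; worker nodes use "agent"
};
```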

## Enable the Nvidia driver

```
hardware.nvidia = {
  open = true;
  package = config.boot.kernelPackages.nvidiaPackages.stable; # change to match your kernel
  nvidiaSettings = true;
};

# Hack for getting the nvidia driver recognized
services.xserver = {
  enable = false;
  videoDrivers = [ "nvidia" ];
};

nixpkgs.config.allowUnfreePredicate = pkg: builtins.elem (lib.getName pkg) [
  "nvidia-x11"
  "nvidia-settings"
];
```

Also, enable the Nvidia container toolkit:

```
hardware.nvidia-container-toolkit.enable = true;
hardware.nvidia-container-toolkit.mount-nvidia-executables = true;

environment.systemPackages = with pkgs; [
  nvidia-container-toolkit
];
```

Rebuild your NixOS configuration.
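
A sketch of that step, assuming the machine is switched in place with `nixos-rebuild`:

```
sudo nixos-rebuild switch
```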

### Verify that the GPU is accessible

Use the following command to ensure the GPU is accessible:

```
nvidia-smi
```

If there is an error in the output, a reboot may be required for the driver to be assigned to the GPU.

Additionally, `lspci -k` can be used to ensure the driver has been assigned to the GPU:

```
# lspci -k | grep -i nvidia

01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 Rev. A] (rev a1)
  Kernel driver in use: nvidia
  Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
```
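
The container toolkit also generates a CDI spec, which the DaemonSet below consumes; as a sanity check, the file should exist after the rebuild (path as referenced in the DaemonSet manifest further down):

```
ls -l /var/run/cdi/nvidia-container-toolkit.json
```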

## Configure k3s

Recent k3s releases detect the Nvidia container runtime automatically and add it to the generated containerd configuration. Manually creating `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` with an `nvidia` runtime entry pointing at `/run/current-system/sw/bin/nvidia-container-runtime`, as older versions of this guide described, is no longer necessary; a duplicate runtime entry will even cause the server to refuse to start.

Restart k3s:

```
systemctl restart k3s.service
```

Ensure that the Nvidia runtime is detected by k3s:

```
grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```
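
Depending on the k3s version, a matching `nvidia` RuntimeClass may also be created automatically; the pod specs below assume it exists. Whether it does can be checked with:

```
kubectl get runtimeclass nvidia
```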

Apply the DaemonSet in the [generic-cdi-plugin README](https://github.com/OlfillasOdikno/generic-cdi-plugin):

```
apiVersion: v1
kind: Namespace
metadata:
  name: generic-cdi-plugin
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-cdi-plugin-daemonset
  namespace: generic-cdi-plugin
spec:
  selector:
    matchLabels:
      name: generic-cdi-plugin
  template:
    metadata:
      labels:
        name: generic-cdi-plugin
        app.kubernetes.io/component: generic-cdi-plugin
        app.kubernetes.io/name: generic-cdi-plugin
    spec:
      containers:
      - image: ghcr.io/olfillasodikno/generic-cdi-plugin:main
        name: generic-cdi-plugin
        command:
          - /generic-cdi-plugin
          - /var/run/cdi/nvidia-container-toolkit.json
        imagePullPolicy: Always
        securityContext:
          privileged: true
        tty: true
        volumeMounts:
        - name: kubelet
          mountPath: /var/lib/kubelet
        - name: nvidia-container-toolkit
          mountPath: /var/run/cdi/nvidia-container-toolkit.json
      volumes:
      - name: kubelet
        hostPath:
          path: /var/lib/kubelet
      - name: nvidia-container-toolkit
        hostPath:
          path: /var/run/cdi/nvidia-container-toolkit.json
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "nixos-nvidia-cdi"
                operator: In
                values:
                - "enabled"
```
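
Once the node label from the next step is in place, the plugin pod should be scheduled on the GPU node; a quick check:

```
kubectl get pods -n generic-cdi-plugin
```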

Apply the following node label (replace `#CHANGEME` with your node name):

```
kind: Node
apiVersion: v1
metadata:
  name: #CHANGEME
  labels:
    nixos-nvidia-cdi: enabled
```
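
Equivalently, the label can be applied with kubectl (again, substitute your node name):

```
kubectl label node CHANGEME nixos-nvidia-cdi=enabled
```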

Now, GPU-enabled pods can be run with this configuration:

```
spec:
  runtimeClassName: nvidia
  containers:
  # for each container that needs the GPU
  - resources:
      requests:
        nvidia.com/gpu-all: "1"
      limits:
        nvidia.com/gpu-all: "1"
```
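
Before scheduling workloads, it can be worth confirming that the plugin registered the resource on the node (node name is a placeholder):

```
kubectl describe node CHANGEME | grep nvidia.com/gpu-all
```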

### Test pod

This is a complete pod configuration for reference/testing:

```
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  namespace: default
spec:
  runtimeClassName: nvidia # <- THIS FOR GPU
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.6.3-base-ubuntu22.04
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
    env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    resources: # <- THIS FOR GPU
      requests:
        nvidia.com/gpu-all: "1"
      limits:
        nvidia.com/gpu-all: "1"
```
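
Assuming the manifest above was saved as `gpu-test.yaml` (the file name is arbitrary), apply it and wait for the pod to come up:

```
kubectl apply -f gpu-test.yaml
kubectl wait --for=condition=Ready pod/gpu-test -n default
```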

Once the pod is running, use the following command to test that the GPU was detected:

```
kubectl exec -n default -it pod/gpu-test -- nvidia-smi
```

If successful, the output will look like the following:

```
Thu Sep 25 04:17:42 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.09              Driver Version: 580.82.09      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060        Off |   00000000:01:00.0  On |                  N/A |
|  0%   36C    P8             10W /  190W |     104MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

For more configurability of Nvidia-related matters in k3s, see the [k3s docs](https://docs.k3s.io/advanced#nvidia-container-runtime-support).