By Joe Keegan and Anish Singh Walia

If you’ve trained anything at scale you’ve likely met Slurm, the workload manager that runs most of the world’s HPC clusters and a fair chunk of its AI training too. It’s the thing that takes sbatch my-job.sh, finds the right nodes, lines up the GPUs, runs your script, and gets out of the way. It’s been around for two decades and it’s not going anywhere.
The friction is that operating Slurm has traditionally meant babysitting bare-metal nodes: imaging them, keeping packages in sync, managing daemons, restarting services after hardware events. None of that is what you signed up for if your goal is to train models.
Slinky is SchedMD’s answer to that: a Kubernetes operator that runs Slurm on Kubernetes. The Slurm controller (slurmctld), the worker daemons (slurmd), the login nodes, and the optional accounting database all run as pods. Kubernetes handles lifecycle, scheduling onto hardware, restarts, and rolling upgrades. Slurm handles the job scheduling that users actually interact with.
In this tutorial we’ll get a working Slinky cluster running on DigitalOcean Kubernetes (DOKS) with NVIDIA B300 GPU nodes, validate that GPUs and the RDMA fabric work, and run a multi-node NCCL all-reduce across two nodes. By the end you’ll have a real cluster you can srun into.
- slurmctld, slurmd, and login pods run on DOKS while Slurm still owns job scheduling (sbatch, srun).
- Two node pools (mgmt + GPU workers) plus managed NFS (ReadWriteMany) match the usual HPC login-writes / worker-reads workflow.
- RDMA needs Multus: create NetworkAttachmentDefinition resources and request rdma/fabricN in the Slurm chart.
- A multi-node NCCL all-reduce confirms that NET/IB over mlx5_* is working, not just that pods schedule.

The architecture has three parts. Two node pools: a small mgmt pool of CPU nodes for the control plane, and a gpu pool of B300 droplets for actual work. Shared NFS, so login and worker pods see the same /shared for job scripts and outputs. Multus-attached RDMA fabric NICs on the workers, so NCCL can do collective ops across nodes at full bandwidth.
Accounting (slurmdbd) and Prometheus metrics are both supported and we’ll point at where to turn them on, but we’ll keep the default install minimal.
Five things to create before any helm install:
1. A VPC in a region that supports managed NFS; ric1 works. Create a VPC.
2. A DOKS cluster in that VPC with two node pools: a small mgmt pool of CPU nodes for the control plane and a gpu pool of B300 nodes. DOKS automatically taints GPU pools with nvidia.com/gpu:NoSchedule and labels them doks.digitalocean.com/gpu-brand=nvidia. That keeps non-GPU pods off the expensive nodes for you.
3. Managed NFS in the same VPC. Note the Mount Source; you’ll need it for the PV. Create managed NFS.
4. A DigitalOcean Container Registry (DOCR) for the custom slurmd and login images. The one-click DOKS integration wires the pull credentials into the cluster so you don’t have to manage Secrets. Create a registry if you don’t have one already.
5. NFS performance tuning for GPU nodes. B300 nodes support jumbo frames (MTU 9000), but pods that mount NFS before the interface is tuned negotiate at MTU 1500 and never renegotiate, capping throughput for the life of the mount. Follow Optimize NFS Performance on GPU Nodes before scheduling workloads on the GPU pool. This getting-started guide skips the step, but it is critical for high throughput with DigitalOcean’s managed NFS service.
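As a quick way to see whether a node interface is already at jumbo-frame MTU before any NFS mount happens, the kernel exposes the current value under /sys/class/net/<iface>/mtu. The sketch below is only an illustration: the function name is hypothetical, and which interface carries NFS traffic on your nodes is an assumption you need to confirm yourself. It does not replace the tuning guide.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an interface's MTU is at least a given value.
# $1 = interface name (assumption: the VPC-facing interface on your node)
# $2 = minimum acceptable MTU, e.g. 9000
# $3 = sysfs root, defaults to /sys (overridable so the check can be tested)
mtu_at_least() {
  mtu_file="${3:-/sys}/class/net/$1/mtu"
  [ -r "$mtu_file" ] || { echo "no such interface: $1" >&2; return 2; }
  mtu=$(cat "$mtu_file")
  if [ "$mtu" -ge "$2" ]; then
    echo "$1 mtu=$mtu ok"
  else
    echo "$1 mtu=$mtu below $2"
    return 1
  fi
}

# Usage on a GPU node (interface name is a placeholder):
#   mtu_at_least <iface> 9000
```

Run it on the node itself rather than inside a pod, since the mount negotiates against the node interface.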
Create the Slurm namespace once the cluster is up:
kubectl create namespace slurm
slurmd image

Workers need more than the upstream slurmd image ships with. They need the CUDA runtime, NCCL, the nccl-tests benchmarks (so we can validate the fabric), RDMA userspace, and MPI. This Dockerfile builds exactly that on top of the official Slinky base.
A few notes on the Dockerfile:
- The image targets the B300 GPU architecture (sm_103).
- nccl-tests is compiled with MPI=1 so it can be launched via srun --mpi=pmix.
- RDMA userspace (libibverbs, rdma-core, ibverbs-utils, perftest) is installed so ibv_devices works inside the pod for fabric debugging.

Build and push:
doctl registry login
docker build \
-t registry.digitalocean.com/<your-registry>/slurmd-cuda:25.11 \
docker/slurmd-cuda/
docker push registry.digitalocean.com/<your-registry>/slurmd-cuda:25.11
login image

The login pod is where your users actually live. It’s what they kubectl exec into to write job scripts, pull code, and run sbatch. But the upstream login image ships only the Slurm client commands (sinfo, srun, sbatch, …). There’s no editor, no git, no python3, no sudo, no curl. So a user logs in, runs sinfo, and then… can’t clone their repo, can’t edit a file, can’t run a script, can’t install anything. The pod is technically a login node and practically a dead end.
The fix is the same pattern as the worker image: a thin layer of developer tooling on top of the official Slinky base. This Dockerfile installs vim, nano, git, python3 + pip, sudo, curl, wget, and less on top of ghcr.io/slinkyproject/login, enough to actually get work done. (The sudo grant is broad for convenience here; tighten it before you hand the cluster to real users.)
Build and push it the same way:
docker build \
-t registry.digitalocean.com/<your-registry>/slurm-login:25.11 \
docker/login/
docker push registry.digitalocean.com/<your-registry>/slurm-login:25.11
This image is optional. Leave it out and the LoginSet falls back to the upstream login image with just the Slurm clients. But for any cluster real people will use, it’s the difference between a login node and a login node you can do something on.
Slinky requires cert-manager for its admission webhook TLS, so install this one. kube-prometheus-stack is optional: if you install it, Slinky will publish a ServiceMonitor and a built-in Grafana dashboard so Slurm metrics show up automatically.
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace \
--set crds.enabled=true
For monitoring, install kube-prometheus-stack into a prometheus namespace. If you do, set controller.metrics.enabled: true in the Slurm values below and Slinky will publish a ServiceMonitor for Prometheus to scrape.
Slinky doesn’t manage shared storage for you, but a login-pod-writes / worker-pod-reads workflow is the whole point of an HPC-style cluster, so we wire up a ReadWriteMany NFS volume. The server and path come from the Mount Source of your managed NFS.
# slurm-nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: slurm-nfs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  mountOptions:
    - vers=4.1
    - nconnect=8
  nfs:
    server: <nfs-private-ip>
    path: <nfs-mount-path>
---
# slurm-nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: slurm-nfs-pvc
  namespace: slurm
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: slurm-nfs-pv
  resources:
    requests:
      storage: 100Gi
kubectl apply -f slurm-nfs-pv.yaml
kubectl apply -f slurm-nfs-pvc.yaml
kubectl get pvc -n slurm slurm-nfs-pvc # should be Bound
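To fill in the server and path placeholders in the PV, the Mount Source has to be split into its two halves. The sketch below assumes the Mount Source has the usual NFS form <server>:/<export-path> with an IPv4 address or hostname as the server; check what your control panel actually shows. The function name and the sample value are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: split an NFS mount source of the assumed form
# "<server>:/<export-path>" into the two values the PersistentVolume needs.
split_mount_source() {
  src="$1"
  case "$src" in
    *:/*) ;;  # looks like server:/path
    *) echo "expected <server>:/<path>, got: $src" >&2; return 1 ;;
  esac
  nfs_server="${src%%:*}"   # everything before the first colon
  nfs_path="${src#*:}"      # everything after the first colon
  printf 'server: %s\npath: %s\n' "$nfs_server" "$nfs_path"
}

# Example with a made-up value; substitute your own Mount Source.
split_mount_source "10.0.0.5:/exports/slurm"
```

Splitting on the first colon would mis-handle a bare IPv6 address, so treat this as a convenience for the common case only.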
B300 nodes ship with 16 dedicated fabric NICs (fabric0–fabric15, two per GPU) for RoCE (RDMA over Converged Ethernet). These specialized network interfaces enable high-speed, low-latency data transfers between GPUs across different nodes—crucial for accelerating distributed AI, HPC, and machine learning workloads.
By default, Kubernetes pods don’t have access to these extra NICs, because each pod typically only connects to the cluster’s main network.
This is where Multus CNI comes in. Multus CNI is a network plugin for Kubernetes that allows pods to connect to multiple networks, not just the primary one.
In this setup, Multus attaches the B300’s RDMA NICs directly to selected pods by moving those interfaces into the pod’s network namespace. Pods that need fast GPU-to-GPU networking then use the hardware directly instead of sharing the single primary connection, which matters for distributed training and MPI (Message Passing Interface) jobs.
Install Multus:
kubectl apply -f \
https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml
kubectl rollout status daemonset/kube-multus-ds -n kube-system --timeout=120s
Each NIC needs a NetworkAttachmentDefinition that uses the host-device CNI to move it into the pod. NADs are namespace-scoped; Multus only resolves them from the same namespace as the pod, so the slurm worker pods need them in slurm. The pattern is identical for all 16, just swap the device:
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric0
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric0"
    }'
A ready-made bundle of all 16 NADs (NetworkAttachmentDefinitions) is available as fabric-nads.yaml. Applying it creates 16 NetworkAttachmentDefinition resources. Older NVIDIA GPU systems have only 8 fabric NICs, so on those you’d create only roce-net-fabric0 through roce-net-fabric7.
kubectl apply -n slurm -f manifests/fabric-nads.yaml
kubectl get net-attach-def -n slurm
# Expect 16 NADs: roce-net-fabric0 through roce-net-fabric15
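Since the 16 definitions differ only in the device index, they can also be generated instead of downloaded. The loop below is a minimal sketch that follows the single-NAD pattern shown earlier; the function name gen_fabric_nads is hypothetical, and the output is not guaranteed to be byte-identical to the published fabric-nads.yaml.

```shell
#!/bin/sh
# Hypothetical helper: print one host-device NetworkAttachmentDefinition
# per fabric NIC.
# $1 = number of fabric NICs (16 on B300, 8 on older 8-NIC systems).
gen_fabric_nads() {
  count="$1"
  i=0
  while [ "$i" -lt "$count" ]; do
    cat <<EOF
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric${i}
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric${i}"
    }'
EOF
    i=$((i + 1))
  done
}

# Usage:
#   gen_fabric_nads 16 | kubectl apply -n slurm -f -
```

Passing 8 instead of 16 produces roce-net-fabric0 through roce-net-fabric7 for the older 8-NIC systems.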
Next, install the Slinky slurm-operator chart:
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --set 'crds.enabled=true' \
  --namespace slurm
You should end up with slurm-operator and slurm-operator-webhook pods Running on the mgmt nodes (the webhook is a separate deployment in chart 1.1.0). The operator registers a slurm-operator-webhook ValidatingWebhookConfiguration:
kubectl get pods -n slurm -l app.kubernetes.io/instance=slurm-operator
kubectl get validatingwebhookconfigurations | grep slurm
This is the values file that pulls it all together. Save as slurm-values.yaml:
# Controller (slurmctld). Uncomment the metrics block only if you installed
# kube-prometheus-stack, otherwise the ServiceMonitor CRD won't exist and
# `helm install` will fail.
controller:
  extraConfMap:
    ReturnToService: 2
  # metrics:
  #   enabled: true
  #   serviceMonitor:
  #     enabled: true
  #     labels:
  #       release: prometheus

# Login nodes: a single login pod with the NFS share mounted at /shared.
loginsets:
  slinky:
    enabled: true
    login:
      # The custom login image with dev tools built above. Drop the image block
      # to fall back to the upstream login image (Slurm clients only).
      image:
        repository: registry.digitalocean.com/<your-registry>/slurm-login
        tag: "25.11"
      volumeMounts:
        - name: shared-nfs
          mountPath: /shared
    podSpec:
      volumes:
        - name: shared-nfs
          persistentVolumeClaim:
            claimName: slurm-nfs-pvc
    service:
      spec:
        type: ClusterIP

# GPU device paths. Slurm can't autodetect these from inside a container,
# so it needs to be told explicitly. On B300, GPUs are always at
# /dev/nvidia0 through /dev/nvidia7.
configFiles:
  gres.conf: |
    Name=gpu File=/dev/nvidia[0,1,2,3,4,5,6,7]

# GPU worker nodes. 8 GPUs and 16 fabric NICs per pod on B300.
nodesets:
  slinky:
    replicas: 2 # Match your B300 node count
    slurmd:
      image:
        repository: registry.digitalocean.com/<your-registry>/slurmd-cuda
        tag: "25.11"
      resources:
        requests:
          nvidia.com/gpu: 8
          rdma/fabric0: 1
          rdma/fabric1: 1
          rdma/fabric2: 1
          rdma/fabric3: 1
          rdma/fabric4: 1
          rdma/fabric5: 1
          rdma/fabric6: 1
          rdma/fabric7: 1
          rdma/fabric8: 1
          rdma/fabric9: 1
          rdma/fabric10: 1
          rdma/fabric11: 1
          rdma/fabric12: 1
          rdma/fabric13: 1
          rdma/fabric14: 1
          rdma/fabric15: 1
        limits:
          nvidia.com/gpu: 8
          rdma/fabric0: 1
          rdma/fabric1: 1
          rdma/fabric2: 1
          rdma/fabric3: 1
          rdma/fabric4: 1
          rdma/fabric5: 1
          rdma/fabric6: 1
          rdma/fabric7: 1
          rdma/fabric8: 1
          rdma/fabric9: 1
          rdma/fabric10: 1
          rdma/fabric11: 1
          rdma/fabric12: 1
          rdma/fabric13: 1
          rdma/fabric14: 1
          rdma/fabric15: 1
      securityContext:
        capabilities:
          add:
            - IPC_LOCK
      volumeMounts:
        - name: shared-nfs
          mountPath: /shared
        - name: shm
          mountPath: /dev/shm
    extraConfMap:
      Gres: "gpu:8"
    partition:
      enabled: true # Required, otherwise no `slinky` partition is created
      configMap:
        State: UP
        MaxTime: UNLIMITED
    # Multus annotation moves all 16 fabric NICs into the pod.
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: >-
          roce-net-fabric0@fabric0,
          roce-net-fabric1@fabric1,
          roce-net-fabric2@fabric2,
          roce-net-fabric3@fabric3,
          roce-net-fabric4@fabric4,
          roce-net-fabric5@fabric5,
          roce-net-fabric6@fabric6,
          roce-net-fabric7@fabric7,
          roce-net-fabric8@fabric8,
          roce-net-fabric9@fabric9,
          roce-net-fabric10@fabric10,
          roce-net-fabric11@fabric11,
          roce-net-fabric12@fabric12,
          roce-net-fabric13@fabric13,
          roce-net-fabric14@fabric14,
          roce-net-fabric15@fabric15
    podSpec:
      nodeSelector:
        doks.digitalocean.com/gpu-brand: nvidia
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      volumes:
        - name: shared-nfs
          persistentVolumeClaim:
            claimName: slurm-nfs-pvc
        # NCCL uses /dev/shm; give it room for large collectives.
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
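The rdma/fabricN entries and the Multus annotation in the values file are the parts that change if your nodes expose a different number of fabric NICs (for example, 8 on older systems). As a rough sketch, two small shell functions can print them for any NIC count so you can paste the result into slurm-values.yaml. Both function names are hypothetical, and the annotation is emitted as a plain comma-separated list of name@interface pairs.

```shell
#!/bin/sh
# Hypothetical helper: print the "rdma/fabricN: 1" lines that go under both
# resources.requests and resources.limits.
# $1 = number of fabric NICs per node.
gen_rdma_resources() {
  i=0
  while [ "$i" -lt "$1" ]; do
    printf 'rdma/fabric%d: 1\n' "$i"
    i=$((i + 1))
  done
}

# Hypothetical helper: print the value for the k8s.v1.cni.cncf.io/networks
# annotation as "roce-net-fabric0@fabric0,roce-net-fabric1@fabric1,...".
# $1 = number of fabric NICs per node.
gen_networks_annotation() {
  i=0
  out=""
  while [ "$i" -lt "$1" ]; do
    # ${out:+,} adds a comma only when out is already non-empty.
    out="${out}${out:+,}roce-net-fabric${i}@fabric${i}"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# Usage for an 8-NIC system:
#   gen_rdma_resources 8
#   gen_networks_annotation 8
```

Keep the count consistent with the number of NADs you created, since each annotation entry must name an existing NetworkAttachmentDefinition in the slurm namespace.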
Install it:
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
--namespace slurm \
--values slurm-values.yaml
A few moments later you should see the cluster come up:
kubectl get pods -n slurm
# slurm-controller-... Running
# slurm-login-slinky-... Running
# slurm-restapi-... Running
# slurm-worker-slinky-0 Running
# slurm-worker-slinky-1 Running
Hop into the login pod and confirm Slurm sees the workers:
kubectl exec -it -n slurm deploy/slurm-login-slinky -- bash
sinfo -N -l
# All workers should be idle in the `slinky` partition (and the default `all` partition).
scontrol show node slinky-0 | grep -i gres
# Gres=gpu:8
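The same readiness check can be scripted. The sketch below assumes you feed it node-oriented sinfo output in the form "<node> <gres> <state>", which sinfo can print with the %N, %G, and %T format fields; the function name check_workers is hypothetical, and the "idle" and "gpu:8" expectations simply mirror the values used in this guide.

```shell
#!/bin/sh
# Hypothetical helper: read "<node> <gres> <state>" lines on stdin and
# succeed only if at least one node is listed and every node is idle
# with 8 GPUs advertised.
check_workers() {
  awk '
    { seen++ }
    $2 !~ /^gpu:8([^0-9]|$)/ || $3 != "idle" { bad++; print "not ready: " $0 }
    END { if (seen == 0 || bad > 0) exit 1 }
  '
}

# Usage from the login pod:
#   sinfo -N -h -o '%N %G %T' | check_workers && echo "all workers ready"
```

Nodes appear once per partition in node-oriented output, so a worker that belongs to both slinky and all is simply checked twice.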
Because we built the custom login image, this is also a pod you can actually work in: git clone your job repo, edit scripts with vim, pip install a helper, all without leaving the login node.
The single most useful smoke test for a GPU cluster is a multi-node NCCL (NVIDIA Collective Communications Library) all-reduce. If it runs at hundreds of GB/s of bus bandwidth and NCCL_DEBUG=INFO reports NET/IB transport over the mlx5_* devices, your fabric is correctly attached and RoCE is being used end-to-end. If it collapses to a few GB/s, NCCL fell back to TCP and you’ve got fabric work to do.
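For context on the "bus bandwidth" figure, nccl-tests derives it from the measured time rather than reading it from the hardware. With S the message size in bytes, t the measured time, and n the number of ranks, the all-reduce case works out as below. The value n = 16 assumes two nodes with eight GPUs each and one rank per GPU.

```latex
\[
\text{algbw} = \frac{S}{t},
\qquad
\text{busbw} = \text{algbw} \cdot \frac{2(n-1)}{n}
\]
\[
n = 16 \;\Rightarrow\; \frac{2(n-1)}{n} = \frac{30}{16} = 1.875
\]
```

So at 16 ranks the reported bus bandwidth is 1.875 times the algorithm bandwidth, which is why the two columns in the benchmark output differ.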
From the login pod, write the job script to NFS so the workers can read it:
mkdir -p /shared/jobs /shared/output
cat > /shared/jobs/nccl-allreduce-2node.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=nccl-allreduce-2node
#SBATCH --partition=slinky
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --output=/shared/output/allreduce-2node-%j.out
#SBATCH --error=/shared/output/allreduce-2node-%j.err
#SBATCH --time=01:00:00
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}
export NCCL_DEBUG=INFO
# Keep MPI control traffic off the RDMA NICs.
export OMPI_MCA_btl=self,tcp
export OMPI_MCA_btl_tcp_if_include=eth0
srun --mpi=pmix \
/usr/local/bin/all_reduce_perf \
-b 1G -e 16G -f 2 -g 1 -c 1 -n 100
EOF
sbatch /shared/jobs/nccl-allreduce-2node.sh
squeue
Watch for the output file once the job finishes:
cat /shared/output/allreduce-2node-*.out
What you’re looking for in the output:
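- Bus bandwidth in the hundreds of GB/s as all_reduce_perf ramps from 1G to 16G.
- NCCL_DEBUG=INFO lines that mention NET/IB and list mlx5_0 through mlx5_15. That’s the fabric being used.
- If bandwidth is down in single-digit GB/s, NCCL likely fell back to TCP; re-check the Multus annotation and the rdma/fabricN resource requests.

If something doesn’t look right, the first place to look is inside a worker pod: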
kubectl exec -n slurm slurm-worker-slinky-0 -- ip -br link | grep fabric
# fabric0..fabric15, all UP
kubectl exec -n slurm slurm-worker-slinky-0 -- ibv_devices
# mlx5_0..mlx5_15
kubectl exec -n slurm slurm-worker-slinky-0 -- ibv_devinfo
# Port state PORT_ACTIVE, link layer Ethernet (= RoCE)
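The output checks described above can be condensed into a small script. This is only a sketch: it assumes the job ran with NCCL_DEBUG=INFO, counts the distinct mlx5_N device names that appear on lines mentioning NET/IB, and pulls the number from the "Avg bus bandwidth" summary line that nccl-tests prints. The function name is hypothetical, and the exact log wording can vary between NCCL versions.

```shell
#!/bin/sh
# Hypothetical helper: summarize an all_reduce_perf job output file.
# $1 = path to the .out file written by the job.
summarize_nccl_output() {
  out="$1"
  # Count distinct mlx5_N devices named on NET/IB log lines.
  # Zero devices suggests NCCL did not pick up the RDMA fabric.
  devices=$(grep 'NET/IB' "$out" | grep -o 'mlx5_[0-9]*' | sort -u | wc -l | tr -d ' ')
  echo "net_ib_devices: $devices"
  # nccl-tests ends with a line like "# Avg bus bandwidth    : <GB/s>".
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print "avg_busbw_gbps: " $2 }' "$out"
}

# Usage from the login pod (job ID is a placeholder):
#   summarize_nccl_output /shared/output/allreduce-2node-<jobid>.out
```

On B300 nodes with all fabric NICs attached you would expect 16 devices here; a count of 0 together with a low average points at the TCP fallback case.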
What is Slinky, and how does it differ from bare-metal Slurm?
Slinky is SchedMD’s Slurm operator for Kubernetes. The controller, workers, and login components run as pods; Kubernetes handles restarts and node lifecycle while users still submit jobs with sbatch and srun. Bare-metal Slurm means you operate the OS and daemons on each node yourself.

Why run Slurm on DOKS instead of using GPU Droplets alone?
DOKS gives you a managed control plane, node pools, and integrations (NFS, DOCR, GPU taints) in one place. GPU Droplets alone are simpler for one-off training. Slurm on Kubernetes pays off when you need multi-node scheduling, shared /shared storage, and repeatable job scripts across a fleet.

Do I need all 16 fabric NICs?
On B300 nodes, yes. Each GPU pair maps to dedicated fabricN interfaces, and the Helm values request rdma/fabric0 through rdma/fabric15. Older NVIDIA GPU shapes often expose 8 fabric NICs; use roce-net-fabric0 through roce-net-fabric7 only in that case.

Why does NFS MTU tuning matter?
B300 nodes support jumbo frames (MTU 9000). If a pod mounts NFS before the node interface is tuned, the mount can stay at MTU 1500 for its lifetime and cap throughput. For production training I/O, follow Optimize NFS Performance on GPU Nodes before heavy jobs.

How do I confirm NCCL is using RDMA rather than TCP?
Inside a worker pod, confirm fabric0–fabric15 are UP, ibv_devices lists mlx5_0–mlx5_15, and job output shows NET/IB in NCCL_DEBUG=INFO. Single-digit GB/s usually means TCP fallback; re-check the Multus annotation and rdma/fabricN resource requests in slurm-values.yaml.

Can I use this cluster for inference?
You can, but most production paths split training (Slurm/HPC) from inference (HTTP APIs). On DigitalOcean, teams often fine-tune on GPU Droplets or this Slurm cluster, then serve with Dedicated Inference and BYOM or Kubernetes tooling such as vLLM model loading on Kubernetes. See Serverless vs Dedicated vs Batch Inference for how serving modes compare.
You now have a Slurm cluster on Kubernetes that schedules jobs across B300 GPU nodes and does collective ops over RDMA. From here:
- Enable accounting if you need sacct, sreport, or fair-share scheduling. Uncomment the accounting: block in slurm-values.yaml. The Slinky chart can either deploy an in-cluster MariaDB automatically (handy for dev/test) or talk to a managed MySQL instance in the same VPC (recommended for production).
- Enable metrics by installing kube-prometheus-stack and setting controller.metrics.enabled: true in the values. Slinky publishes a Grafana dashboard out of the box.
- Onboard users: the login pod, the /shared volume, and the slinky partition are all that user-facing job scripts need to interact with.

Slinky lets you keep Kubernetes for what Kubernetes is good at (fleet management, hardware lifecycle, observability) while letting Slurm do what Slurm is good at: scheduling HPC jobs on big GPUs. On DOKS with B300 nodes you get that whole stack with a managed control plane, managed NFS, a managed registry, and a fabric that actually delivers RoCE bandwidth.

Related resources: fabric-nads.yaml and the sample manifests used in this guide.

Happy training.