Getting Started with Slinky on DigitalOcean Kubernetes

Published on May 22, 2026
Introduction

If you’ve trained anything at scale you’ve likely met Slurm, the workload manager that runs most of the world’s HPC clusters and a fair chunk of its AI training too. It’s the thing that takes sbatch my-job.sh, finds the right nodes, lines up the GPUs, runs your script, and gets out of the way. It’s been around for two decades and it’s not going anywhere.

The friction is that operating Slurm has traditionally meant babysitting bare-metal nodes: imaging them, keeping packages in sync, managing daemons, restarting services after hardware events. None of that is what you signed up for if your goal is to train models.

Slinky is SchedMD’s answer to that: a Kubernetes operator that runs Slurm on Kubernetes. The Slurm controller (slurmctld), the worker daemons (slurmd), the login nodes, and the optional accounting database all run as pods. Kubernetes handles lifecycle, scheduling onto hardware, restarts, and rolling upgrades. Slurm handles the job scheduling that users actually interact with.

In this tutorial we’ll get a working Slinky cluster running on DigitalOcean Kubernetes (DOKS) with NVIDIA B300 GPU nodes, validate that GPUs and the RDMA fabric work, and run a multi-node NCCL all-reduce across two nodes. By the end you’ll have a real cluster you can srun into.

Key takeaways

  • Slinky is SchedMD’s Kubernetes operator for Slurm: slurmctld, slurmd, and login pods run on DOKS while Slurm still owns job scheduling (sbatch, srun).
  • A split node pool (small CPU mgmt + GPU workers) plus managed NFS (ReadWriteMany) matches the usual HPC login-writes / worker-reads workflow.
  • B300 nodes expose 8 GPUs and 16 fabric NICs; you attach RoCE with Multus NetworkAttachmentDefinition resources and request rdma/fabricN in the Slurm chart.
  • The NCCL all-reduce job is the practical proof that multi-node NET/IB over mlx5_* is working—not just that pods schedule.
  • For serving fine-tuned weights after training, most teams move to Dedicated Inference with BYOM or a Kubernetes inference stack—not long-lived Slurm partitions.

What you’ll build

Two node pools: a small mgmt pool of CPU nodes for the control plane, and a gpu pool of B300 droplets for actual work. Shared NFS so login and worker pods see the same /shared for job scripts and outputs. Multus-attached RDMA fabric NICs on the workers so NCCL can do collective ops across nodes at full bandwidth.

Accounting (slurmdbd) and Prometheus metrics are both supported and we’ll point at where to turn them on, but we’ll keep the default install minimal.

Prerequisites

Five things to create before any helm install:

  1. A VPC in a region that supports managed NFS. ric1 works. Create a VPC.

  2. A DOKS cluster in that VPC with two node pools:

    • mgmt: 3 × CPU Optimized 4 vCPU / 8 GiB
    • gpu: 2+ × NVIDIA B300 droplets (8 GPUs + 16 fabric NICs per node)

    DOKS automatically taints GPU pools with nvidia.com/gpu:NoSchedule and labels them doks.digitalocean.com/gpu-brand=nvidia. That keeps non-GPU pods off the expensive nodes for you.

  3. Managed NFS in the same VPC. Note the Mount Source; you’ll need it for the PV. Create managed NFS.

  4. A DigitalOcean Container Registry (DOCR) for the custom slurmd and login images. The one-click DOKS integration wires the pull credentials into the cluster so you don’t have to manage Secrets. Create a registry if you don’t have one already.

  5. NFS performance tuning for GPU nodes. B300 nodes support jumbo frames (MTU 9000), but pods that mount NFS before the interface is tuned negotiate at MTU 1500 and never renegotiate, capping throughput for the life of the mount. This getting-started guide skips the tuning to keep the walkthrough short, but for production workloads follow Optimize NFS Performance on GPU Nodes before scheduling anything on the GPU pool; it is critical for high throughput with DigitalOcean’s managed NFS service.

Create the Slurm namespace once the cluster is up:

kubectl create namespace slurm

How to build and push the custom slurmd image

Workers need more than the upstream slurmd image ships with. They need the CUDA runtime, NCCL, the nccl-tests benchmarks (so we can validate the fabric), RDMA userspace, and MPI. This Dockerfile builds exactly that on top of the official Slinky base.

A few notes on the Dockerfile:

  • CUDA 12.9 is the floor because that’s when NVIDIA added native codegen for Blackwell Ultra (sm_103 / B300).
  • nccl-tests is compiled with MPI=1 so it can be launched via srun --mpi=pmix.
  • RDMA userspace (libibverbs, rdma-core, ibverbs-utils, perftest) is installed so ibv_devices works inside the pod for fabric debugging.

Build and push:

doctl registry login

docker build \
  -t registry.digitalocean.com/<your-registry>/slurmd-cuda:25.11 \
  docker/slurmd-cuda/

docker push registry.digitalocean.com/<your-registry>/slurmd-cuda:25.11

How to build and push the custom login image

The login pod is where your users actually live. It’s what they kubectl exec into to write job scripts, pull code, and run sbatch. But the upstream login image ships only the Slurm client commands (sinfo, srun, sbatch, …). There’s no editor, no git, no python3, no sudo, no curl. So a user logs in, runs sinfo, and then… can’t clone their repo, can’t edit a file, can’t run a script, can’t install anything. The pod is technically a login node and practically a dead end.

The fix is the same pattern as the worker image: a thin layer of developer tooling on top of the official Slinky base. This Dockerfile installs vim, nano, git, python3 + pip, sudo, curl, wget, and less on top of ghcr.io/slinkyproject/login, enough to actually get work done. (The sudo grant is broad for convenience here; tighten it before you hand the cluster to real users.)

Build and push it the same way:

docker build \
  -t registry.digitalocean.com/<your-registry>/slurm-login:25.11 \
  docker/login/

docker push registry.digitalocean.com/<your-registry>/slurm-login:25.11

This image is optional. Leave it out and the LoginSet falls back to the upstream login image with just the Slurm clients. But for any cluster real people will use, it’s the difference between a login node and a login node you can do something on.

How to install cert-manager (required for the Slinky operator)

Slinky requires cert-manager for its admission webhook TLS, so this install is mandatory. kube-prometheus-stack is optional: if you install it, Slinky will publish a ServiceMonitor and a built-in Grafana dashboard so Slurm metrics show up automatically.

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --set crds.enabled=true

For monitoring, install kube-prometheus-stack into a prometheus namespace. If you do, set controller.metrics.enabled: true in the Slurm values below and Slinky will publish a ServiceMonitor for Prometheus to scrape.

How to wire managed NFS as a ReadWriteMany volume for Slurm

Slinky doesn’t manage shared storage for you, but a login-pod-writes / worker-pod-reads workflow is the whole point of an HPC-style cluster, so we wire up a ReadWriteMany NFS volume. The server and path come from the Mount Source of your managed NFS.

# slurm-nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: slurm-nfs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  mountOptions:
    - vers=4.1
    - nconnect=8
  nfs:
    server: <nfs-private-ip>
    path: <nfs-mount-path>
---
# slurm-nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: slurm-nfs-pvc
  namespace: slurm
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: slurm-nfs-pv
  resources:
    requests:
      storage: 100Gi

kubectl apply -f slurm-nfs-pv.yaml
kubectl apply -f slurm-nfs-pvc.yaml
kubectl get pvc -n slurm slurm-nfs-pvc      # should be Bound
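
The two placeholders in the PV come from the Mount Source. As a minimal sketch, assuming the Mount Source has the usual NFS form `<server>:<export-path>` with an IPv4 private address (check the value shown for your share), a small helper can split it. The function name `split_mount_source` is hypothetical, not part of any DigitalOcean tooling.

```shell
#!/usr/bin/env bash
# Split a mount source of the ASSUMED form "<server>:/<export-path>" into the
# server and path values that the PersistentVolume's nfs block needs.
split_mount_source() {
  local src="$1"
  case "$src" in
    *:/*) ;;  # looks like server:/path
    *) echo "expected <server>:/<path>, got: $src" >&2; return 1 ;;
  esac
  # Everything before the first colon is the server; the rest is the path.
  printf 'server=%s\npath=%s\n' "${src%%:*}" "${src#*:}"
}
```

Calling `split_mount_source "<your-mount-source>"` prints the two values to paste into `nfs.server` and `nfs.path`. The sketch splits on the first colon, so it would not handle an IPv6 literal.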

How to attach B300 RDMA fabric NICs with Multus

B300 nodes ship with 16 dedicated fabric NICs (fabric0 through fabric15, two per GPU) for RoCE (RDMA over Converged Ethernet). These specialized network interfaces enable high-speed, low-latency data transfers between GPUs across different nodes, which is crucial for accelerating distributed AI, HPC, and machine learning workloads.

By default, Kubernetes pods don’t have access to these extra NICs, because each pod typically only connects to the cluster’s main network.

This is where Multus CNI comes in. Multus CNI is a network plugin for Kubernetes that allows pods to connect to multiple networks, not just the primary one.

In this setup, Multus enables you to attach one or more of the B300’s RDMA NICs directly to selected pods by moving those network interfaces into the pod’s network namespace. As a result, pods that need ultra-fast networking, like those doing GPU-to-GPU communication, can take direct advantage of the hardware, rather than sharing a single connection. This approach is essential for workloads that demand maximum network performance, such as distributed training or MPI (Message Passing Interface) jobs.

Install Multus:

kubectl apply -f \
  https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml

kubectl rollout status daemonset/kube-multus-ds -n kube-system --timeout=120s

Each NIC needs a NetworkAttachmentDefinition that uses the host-device CNI to move it into the pod. NADs are namespace-scoped; Multus only resolves them from the same namespace as the pod, so the slurm worker pods need them in slurm. The pattern is identical for all 16, just swap the device:

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric0
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric0"
    }'

A ready-made bundle of all 16 NADs (NetworkAttachmentDefinitions) is available as fabric-nads.yaml. Applying it creates 16 NetworkAttachmentDefinition resources. Older NVIDIA GPU systems only have 8 fabric NICs, so on those you’d only create roce-net-fabric0 through roce-net-fabric7.

kubectl apply -n slurm -f manifests/fabric-nads.yaml
kubectl get net-attach-def -n slurm
# Expect 16 NADs: roce-net-fabric0 through roce-net-fabric15
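
Because the NADs differ only in the device index, they can also be generated instead of copied by hand. The following is a sketch, not the contents of the bundled fabric-nads.yaml: the function name `gen_fabric_nads` is hypothetical, and it simply repeats the single-NAD pattern shown above for a given NIC count.

```shell
#!/usr/bin/env bash
# Emit one NetworkAttachmentDefinition per fabric NIC, following the
# roce-net-fabricN / fabricN naming used in this tutorial.
# Argument: number of fabric NICs (defaults to 16, the B300 count).
gen_fabric_nads() {
  local count="${1:-16}" i
  for ((i = 0; i < count; i++)); do
    cat <<EOF
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric${i}
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric${i}"
    }'
EOF
  done
}
```

To apply the generated manifests in one step, pipe them to kubectl: `gen_fabric_nads 16 | kubectl apply -n slurm -f -`. On a node shape with 8 fabric NICs, pass `8` instead.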

Install the SchedMD Slinky Slurm operator

helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --set 'crds.enabled=true' \
  --namespace slurm

You should end up with slurm-operator and slurm-operator-webhook pods Running on the mgmt nodes (the webhook is a separate deployment in chart 1.1.0). The operator registers a slurm-operator-webhook ValidatingWebhookConfiguration:

kubectl get pods -n slurm -l app.kubernetes.io/instance=slurm-operator
kubectl get validatingwebhookconfigurations | grep slurm

Deploy the Slurm cluster with Helm values

This is the values file that pulls it all together. Save as slurm-values.yaml:

# Controller (slurmctld). Uncomment the metrics block only if you installed
# kube-prometheus-stack, otherwise the ServiceMonitor CRD won't exist and
# `helm install` will fail.
controller:
  extraConfMap:
    ReturnToService: 2
  # metrics:
  #   enabled: true
  #   serviceMonitor:
  #     enabled: true
  #     labels:
  #       release: prometheus

# Login nodes: a single login pod with the NFS share mounted at /shared.
loginsets:
  slinky:
    enabled: true
    login:
      # The custom login image with dev tools built above. Drop the image block
      # to fall back to the upstream login image (Slurm clients only).
      image:
        repository: registry.digitalocean.com/<your-registry>/slurm-login
        tag: "25.11"
      volumeMounts:
        - name: shared-nfs
          mountPath: /shared
    podSpec:
      volumes:
        - name: shared-nfs
          persistentVolumeClaim:
            claimName: slurm-nfs-pvc
    service:
      spec:
        type: ClusterIP

# GPU device paths. Slurm can't autodetect these from inside a container,
# so it needs to be told explicitly. On B300, GPUs are always at
# /dev/nvidia0 through /dev/nvidia7.
configFiles:
  gres.conf: |
    Name=gpu File=/dev/nvidia[0,1,2,3,4,5,6,7]

# GPU worker nodes. 8 GPUs and 16 fabric NICs per pod on B300.
nodesets:
  slinky:
    replicas: 2                                # Match your B300 node count
    slurmd:
      image:
        repository: registry.digitalocean.com/<your-registry>/slurmd-cuda
        tag: "25.11"
      resources:
        requests:
          nvidia.com/gpu: 8
          rdma/fabric0: 1
          rdma/fabric1: 1
          rdma/fabric2: 1
          rdma/fabric3: 1
          rdma/fabric4: 1
          rdma/fabric5: 1
          rdma/fabric6: 1
          rdma/fabric7: 1
          rdma/fabric8: 1
          rdma/fabric9: 1
          rdma/fabric10: 1
          rdma/fabric11: 1
          rdma/fabric12: 1
          rdma/fabric13: 1
          rdma/fabric14: 1
          rdma/fabric15: 1
        limits:
          nvidia.com/gpu: 8
          rdma/fabric0: 1
          rdma/fabric1: 1
          rdma/fabric2: 1
          rdma/fabric3: 1
          rdma/fabric4: 1
          rdma/fabric5: 1
          rdma/fabric6: 1
          rdma/fabric7: 1
          rdma/fabric8: 1
          rdma/fabric9: 1
          rdma/fabric10: 1
          rdma/fabric11: 1
          rdma/fabric12: 1
          rdma/fabric13: 1
          rdma/fabric14: 1
          rdma/fabric15: 1
      securityContext:
        capabilities:
          add:
            - IPC_LOCK
      volumeMounts:
        - name: shared-nfs
          mountPath: /shared
        - name: shm
          mountPath: /dev/shm
    extraConfMap:
      Gres: "gpu:8"
    partition:
      enabled: true                              # Required, otherwise no `slinky` partition is created
      configMap:
        State: UP
        MaxTime: UNLIMITED
    # Multus annotation moves all 16 fabric NICs into the pod.
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: >-
          roce-net-fabric0@fabric0,
          roce-net-fabric1@fabric1,
          roce-net-fabric2@fabric2,
          roce-net-fabric3@fabric3,
          roce-net-fabric4@fabric4,
          roce-net-fabric5@fabric5,
          roce-net-fabric6@fabric6,
          roce-net-fabric7@fabric7,
          roce-net-fabric8@fabric8,
          roce-net-fabric9@fabric9,
          roce-net-fabric10@fabric10,
          roce-net-fabric11@fabric11,
          roce-net-fabric12@fabric12,
          roce-net-fabric13@fabric13,
          roce-net-fabric14@fabric14,
          roce-net-fabric15@fabric15
    podSpec:
      nodeSelector:
        doks.digitalocean.com/gpu-brand: nvidia
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      volumes:
        - name: shared-nfs
          persistentVolumeClaim:
            claimName: slurm-nfs-pvc
        # NCCL uses /dev/shm; give it room for large collectives.
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
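
The rdma/fabricN entries and the Multus annotation are the most repetitive parts of this file and the easiest to get out of sync. As a sketch for adapting the values to a different NIC count, the two helpers below print those fragments. The function names are hypothetical, and the indentation is an assumption chosen to match the requests/limits maps as laid out above.

```shell
#!/usr/bin/env bash
# Print the rdma/fabricN resource lines, indented to sit under
# resources.requests or resources.limits in the values file above.
gen_rdma_resources() {
  local count="${1:-16}" i
  for ((i = 0; i < count; i++)); do
    printf '          rdma/fabric%d: 1\n' "$i"
  done
}

# Print the comma-separated network list used as the value of the
# k8s.v1.cni.cncf.io/networks annotation (NAD name @ interface name).
gen_multus_networks() {
  local count="${1:-16}" i out=""
  for ((i = 0; i < count; i++)); do
    out+="${out:+,}roce-net-fabric${i}@fabric${i}"
  done
  printf '%s\n' "$out"
}
```

For example, `gen_multus_networks 8` prints the annotation value for an 8-NIC node shape on a single line, and `gen_rdma_resources 8` prints the matching resource entries to paste under both requests and limits.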

Install the chart with these values:

helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --namespace slurm \
  --values slurm-values.yaml

A few moments later you should see the cluster come up:

kubectl get pods -n slurm
# slurm-controller-...    Running
# slurm-login-slinky-...  Running
# slurm-restapi-...       Running
# slurm-worker-slinky-0   Running
# slurm-worker-slinky-1   Running

Hop into the login pod and confirm Slurm sees the workers:

kubectl exec -it -n slurm deploy/slurm-login-slinky -- bash

sinfo -N -l
# All workers should be idle in the `slinky` partition (and the default `all` partition).

scontrol show node slinky-0 | grep -i gres
# Gres=gpu:8

Because we built the custom login image, this is also a pod you can actually work in: git clone your job repo, edit scripts with vim, pip install a helper, all without leaving the login node.

Validate the fabric with a multi-node NCCL all-reduce

The single most useful smoke test for a GPU cluster is a multi-node NCCL (NVIDIA Collective Communications Library) all-reduce. If it runs at hundreds of GB/s of bus bandwidth and NCCL_DEBUG=INFO reports NET/IB transport over the mlx5_* devices, your fabric is correctly attached and RoCE is being used end-to-end. If it collapses to a few GB/s, NCCL fell back to TCP and you’ve got fabric work to do.

From the login pod, write the job script to NFS so the workers can read it:

mkdir -p /shared/jobs /shared/output

cat > /shared/jobs/nccl-allreduce-2node.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=nccl-allreduce-2node
#SBATCH --partition=slinky
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --output=/shared/output/allreduce-2node-%j.out
#SBATCH --error=/shared/output/allreduce-2node-%j.err
#SBATCH --time=01:00:00

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}
export NCCL_DEBUG=INFO

# Keep MPI control traffic off the RDMA NICs.
export OMPI_MCA_btl=self,tcp
export OMPI_MCA_btl_tcp_if_include=eth0

srun --mpi=pmix \
  /usr/local/bin/all_reduce_perf \
  -b 1G -e 16G -f 2 -g 1 -c 1 -n 100
EOF

sbatch /shared/jobs/nccl-allreduce-2node.sh
squeue
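
If your gpu pool has more than two nodes, the same test scales by changing only the node count. The sketch below wraps the job script above in a hypothetical helper, `gen_allreduce_job`, that takes the node count as its argument; everything else is copied from the two-node script, and it assumes the nodeset replicas already match the count you pass.

```shell
#!/usr/bin/env bash
# Print the all-reduce job script for a given number of nodes (default 2).
# Only the node count and the names derived from it change.
gen_allreduce_job() {
  local nodes="${1:-2}"
  cat <<EOF
#!/bin/bash
#SBATCH --job-name=nccl-allreduce-${nodes}node
#SBATCH --partition=slinky
#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --output=/shared/output/allreduce-${nodes}node-%j.out
#SBATCH --error=/shared/output/allreduce-${nodes}node-%j.err
#SBATCH --time=01:00:00
EOF
  # Quoted heredoc: the body is emitted verbatim, with no expansion here.
  cat <<'EOF'

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}
export NCCL_DEBUG=INFO

# Keep MPI control traffic off the RDMA NICs.
export OMPI_MCA_btl=self,tcp
export OMPI_MCA_btl_tcp_if_include=eth0

srun --mpi=pmix \
  /usr/local/bin/all_reduce_perf \
  -b 1G -e 16G -f 2 -g 1 -c 1 -n 100
EOF
}
```

From the login pod, `gen_allreduce_job 4 > /shared/jobs/nccl-allreduce-4node.sh` writes a four-node variant that can be submitted with sbatch in the same way.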

Watch for the output file once the job finishes:

cat /shared/output/allreduce-2node-*.out

What you’re looking for in the output:

  • A bandwidth table from all_reduce_perf ramping from 1G to 16G.
  • NCCL_DEBUG=INFO lines that mention NET/IB and list mlx5_0 through mlx5_15. That’s the fabric being used.
  • Inter-node bus bandwidth in the hundreds of GB/s at large message sizes. Anything that tops out in the single digits of GB/s means NCCL fell back to TCP. Sanity-check the Multus annotation and the rdma/fabricN resource requests.
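
The checks above can be scripted as a quick first pass over the output file. This is a rough sketch: the function name `check_nccl_log` is hypothetical, the 10 GB/s cutoff is just the "single digits" rule of thumb from this tutorial, and it relies on the "Avg bus bandwidth" summary line that nccl-tests prints at the end of a run.

```shell
#!/usr/bin/env bash
# Summarize an all_reduce_perf output file: was NET/IB reported, and what
# was the average bus bandwidth from the nccl-tests summary line?
check_nccl_log() {
  local log="$1" bw
  if grep -q 'NET/IB' "$log"; then
    echo "transport: NET/IB"
  else
    echo "transport: NET/IB not found"
  fi
  # The summary line ends with the value, so take the last field.
  bw="$(awk '/Avg bus bandwidth/ { v = $NF } END { print v }' "$log")"
  echo "avg_busbw_gbps: ${bw:-unknown}"
  # Assumed cutoff: below 10 GB/s suggests NCCL fell back to TCP.
  if [ -n "$bw" ] && awk -v bw="$bw" 'BEGIN { exit !(bw + 0 < 10) }'; then
    echo "warning: single-digit bus bandwidth, possible TCP fallback"
  fi
}
```

Run it against a finished job's output file under /shared/output. It is a triage aid only; the bandwidth table and the NCCL_DEBUG lines remain the source of truth.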

If something doesn’t look right, the first place to look is inside a worker pod:

kubectl exec -n slurm slurm-worker-slinky-0 -- ip -br link | grep fabric
# fabric0..fabric15, all UP

kubectl exec -n slurm slurm-worker-slinky-0 -- ibv_devices
# mlx5_0..mlx5_15

kubectl exec -n slurm slurm-worker-slinky-0 -- ibv_devinfo
# Port state PORT_ACTIVE, link layer Ethernet (= RoCE)
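
Eyeballing sixteen interfaces is error-prone, so it can help to count them instead. The helper below is a sketch with a hypothetical name, `count_fabric_up`; it reads `ip -br link` output on stdin and counts interfaces named fabricN whose state column is UP.

```shell
#!/usr/bin/env bash
# Count fabricN interfaces reported as UP in `ip -br link` output (stdin).
# Column 1 is the interface name (optionally with an @peer suffix),
# column 2 is the operational state.
count_fabric_up() {
  awk '$1 ~ /^fabric[0-9]+(@.*)?$/ && $2 == "UP" { n++ } END { print n + 0 }'
}
```

Usage: `kubectl exec -n slurm slurm-worker-slinky-0 -- ip -br link | count_fabric_up`. On a B300 worker the count should be 16; a lower number points at a missing NAD or annotation entry for the absent interfaces.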

FAQs

1. What is Slinky, and how is it different from Slurm on bare metal?

Slinky is SchedMD’s Slurm operator for Kubernetes. The controller, workers, and login components run as pods; Kubernetes handles restarts and node lifecycle while users still submit jobs with sbatch and srun. Bare-metal Slurm means you operate the OS and daemons on each node yourself.

2. Why run Slurm on DOKS instead of only GPU Droplets?

DOKS gives you a managed control plane, node pools, and integrations (NFS, DOCR, GPU taints) in one place. GPU Droplets alone are simpler for one-off training. Slurm on Kubernetes pays off when you need multi-node scheduling, shared /shared storage, and repeatable job scripts across a fleet.

3. Do I need all 16 fabric NetworkAttachmentDefinitions?

On B300 nodes, yes. Each GPU maps to a dedicated pair of fabricN interfaces, and the Helm values request rdma/fabric0 through rdma/fabric15. Older NVIDIA GPU shapes often expose 8 fabric NICs; use roce-net-fabric0 through roce-net-fabric7 only in that case.

4. Why does the tutorial mention NFS MTU tuning if we skip it here?

B300 nodes support jumbo frames (MTU 9000). If a pod mounts NFS before the node interface is tuned, the mount can stay at MTU 1500 for its lifetime and cap throughput. For production training I/O, follow Optimize NFS Performance on GPU Nodes before heavy jobs.

5. My NCCL job shows low GB/s—what should I check first?

Inside a worker pod, confirm fabric0 through fabric15 are UP, ibv_devices lists mlx5_0 through mlx5_15, and the job output shows NET/IB in the NCCL_DEBUG=INFO lines. Single-digit GB/s usually means TCP fallback; re-check the Multus annotation and rdma/fabricN resource requests in slurm-values.yaml.

6. Can I serve the model from this same cluster after training?

You can, but most production paths split training (Slurm/HPC) from inference (HTTP APIs). On DigitalOcean, teams often fine-tune on GPU Droplets or this Slurm cluster, then serve with Dedicated Inference and BYOM or Kubernetes tooling such as vLLM model loading on Kubernetes. See Serverless vs Dedicated vs Batch Inference for how serving modes compare.

Conclusion

You now have a Slurm cluster on Kubernetes that schedules jobs across B300 GPU nodes and does collective ops over RDMA. From here:

  • Turn on accounting if you want sacct, sreport, or fair-share scheduling. Enable the accounting: block in slurm-values.yaml (it is not part of the minimal values file shown above). The Slinky chart can either deploy an in-cluster MariaDB automatically (handy for dev/test) or talk to a managed MySQL instance in the same VPC (recommended for production).
  • Turn on metrics by installing kube-prometheus-stack and uncommenting the controller.metrics block in the values so that enabled: true takes effect. Slinky publishes a Grafana dashboard out of the box.
  • Submit real workloads. The login pod, the /shared volume, and the slinky partition are all that user-facing job scripts need to interact with.

Slinky lets you keep Kubernetes for what Kubernetes is good at (fleet management, hardware lifecycle, observability) while letting Slurm do what Slurm is good at: scheduling HPC jobs on big GPUs. On DOKS with B300 nodes you get that whole stack with managed control plane, managed NFS, managed registry, and a fabric that actually delivers RoCE bandwidth.

Happy training.

About the author(s)

Joe Keegan
Author
Sr. Solutions Architect

A Senior Solutions Architect at DigitalOcean focusing on Cloud Architecture, Kubernetes, Automation and Infrastructure-as-Code.

Anish Singh Walia
Author
Sr Technical Content Strategist and Team Lead

I help Businesses scale with AI x SEO x (authentic) Content that revives traffic and keeps leads flowing | 3,000,000+ Average monthly readers on Medium | Sr Technical Writer(Team Lead) @ DigitalOcean | Ex-Cloud Consultant @ AMEX | Ex-Site Reliability Engineer(DevOps)@Nutanix
