By Jeff Fan and Anish Singh Walia
Large language models (LLMs) are powering a new generation of AI applications, but running them efficiently at scale requires robust, distributed infrastructure. DigitalOcean Kubernetes (DOKS) provides a flexible, cloud-native platform for deploying and managing these workloads.
In this tutorial, you’ll learn how to deploy llm-d—a distributed LLM inference framework—on DigitalOcean Kubernetes using automated deployment scripts. Whether you’re a DevOps engineer, ML engineer, or platform architect, this tutorial will help you establish a scalable, production-ready LLM inference service on Kubernetes.
Estimated Deployment Time: 15-20 minutes
This tutorial focuses on basic llm-d deployment on DigitalOcean Kubernetes with automated scripts.
llm-d is an advanced, open-source distributed inference framework purpose-built for serving large language models (LLMs) at scale in Kubernetes environments. It is designed to maximize GPU utilization, throughput, and reliability for production AI workloads, and is especially well-suited for multi-node, multi-GPU clusters.
Key Features and Capabilities:
Disaggregated LLM Inference Pipeline
llm-d separates the LLM inference process into two distinct stages—prefill (context processing) and decode (token generation)—which can be distributed across different GPU nodes. This disaggregation enables highly parallelized prefill computation and efficient, sequential decode execution, allowing for better resource allocation and higher throughput compared to monolithic serving approaches.
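Once llm-d is deployed later in this tutorial, this split is visible in the cluster itself: prefill and decode run as separate pods that can be scaled and scheduled independently. As a quick preview, here is a minimal check, assuming the default llm-d namespace created by the deployer scripts:
# After deployment, list the prefill and decode workloads separately
kubectl get pods -n llm-d -o wide | grep -E 'prefill|decode'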
Intelligent GPU Resource Management
The framework automatically detects and allocates available GPU resources, supporting a range of NVIDIA GPUs (including RTX 4000 Ada, RTX 6000 Ada, and L40S). It dynamically assigns workloads based on GPU memory, compute requirements, and current cluster load, ensuring optimal utilization and minimizing bottlenecks.
Kubernetes-Native, Cloud-Ready Architecture
llm-d is designed from the ground up for Kubernetes, leveraging native constructs for service discovery, scaling, and fault tolerance. It supports automated deployment, rolling updates, and seamless integration with Kubernetes-native monitoring and logging tools.
OpenAI-Compatible API
llm-d exposes an OpenAI-compatible API endpoint, making it easy to integrate with existing AI applications, SDKs, and tools that expect the OpenAI API format.
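For example, once the inference gateway is port-forwarded (shown in the testing step later in this tutorial), any OpenAI-style client or a plain curl call can list the served models. A minimal sketch, assuming the gateway is forwarded to localhost:8080:
# List the models served by the gateway via the standard OpenAI-compatible endpoint
curl -s localhost:8080/v1/models | jq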
Scalability and High Availability
The architecture supports horizontal scaling of both prefill and decode nodes, enabling you to independently scale different parts of the inference pipeline based on workload patterns. Built-in health checks and failover mechanisms ensure robust, production-grade reliability.
Advanced KV Cache Management
llm-d implements efficient key-value (KV) cache sharing and management, which reduces redundant computation and memory usage across requests—critical for high-throughput, low-latency LLM serving.
Observability and Monitoring
The framework integrates with popular observability stacks (such as Prometheus and Grafana), providing real-time metrics on GPU utilization, request latency, throughput, and error rates.
llm-d represents a next-generation distributed LLM inference platform, specifically designed for Kubernetes environments. Unlike traditional single-node solutions, llm-d brings distributed computing capabilities to LLM inference.
Deploying llm-d on DigitalOcean Kubernetes (DOKS) allows you to take advantage of managed GPU infrastructure, automated scaling, and a developer-friendly cloud platform, so teams can focus on serving models rather than managing infrastructure.
In summary, llm-d brings modern, cloud-native distributed systems engineering to LLM inference, making it possible to deliver high-performance, scalable, and reliable AI services on Kubernetes with ease.
Think of the difference between fast fashion retail and bespoke tailoring: it captures the fundamental differences between traditional web applications and LLM inference.
Traditional Web Applications vs. LLM Inference:
| Comparison Aspect | Traditional Web Apps (Fast Fashion) | LLM Inference (Bespoke Tailoring Workshop) |
|---|---|---|
| Service Process | Store displays S·M·L standard sizes, customers grab and checkout | Measurement → Pattern Making → Fitting → Alterations → Delivery |
| Request Lifespan | Milliseconds to seconds (instant checkout) | Seconds to minutes (stitch by stitch execution) |
| Resource Requirements | Similar fabric and manufacturing time per item | Vastly different fabric usage and handcraft time per suit |
| Statefulness | Staff don't remember your previous purchases | Tailor remembers your measurements and preferences |
| Cost | Low unit price, mass production | High unit price, precision handcraft |
Traditional LLM Serving = “One-Person-Does-Everything Tailor”
Problems with this approach:
llm-d’s Disaggregated Approach = “Modern Bespoke Tailoring Production Line”
| Station | Process Analogy | Specialized Optimization |
|---|---|---|
| Prefill Station | Measurement + Pattern Making Room | Highly parallel computation, CPU/GPU collaboration |
| Decode Station | Sewing Room | Sequential output focus, maximum memory bandwidth |
| Smart Gateway | Master Tailor Manager | Dynamic order assignment based on KV Cache and load |
Benefits Achieved:
To summarize: fast fashion emphasizes "grab and go," while bespoke tailoring pursues "measured perfection." llm-d separates measurement (prefill) from sewing (decode), with an intelligent master tailor (the smart gateway) coordinating the two, making AI inference both personalized and efficient.
First, let’s get the llm-d deployer repository and set up our environment:
# Clone the llm-d deployer repository
git clone https://github.com/iambigmomma/llm-d-deployer.git
cd llm-d-deployer/quickstart/infra/doks-digitalocean
# Set your HuggingFace token (required for model downloads)
export HF_TOKEN=hf_your_token_here
# Verify doctl is authenticated
doctl auth list
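Before moving on, it helps to confirm the token is actually exported in your current shell; here is a quick sanity check that prints only the token's length, never its value:
# Verify HF_TOKEN is set without echoing the secret itself
[ -n "$HF_TOKEN" ] && echo "HF_TOKEN is set (${#HF_TOKEN} characters)" || echo "HF_TOKEN is NOT set"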
For Meta Llama Models (Llama-3.2-3B-Instruct):
The meta-llama/Llama-3.2-3B-Instruct model used in this tutorial requires special access: you must request access and accept Meta's license on the model's Hugging Face page, then use an HF token from the approved account.
Alternative Open Models (No License Required):
If you prefer to avoid the approval process, consider these open alternatives:
- google/gemma-2b-it - Google's open instruction-tuned model
- Qwen/Qwen2.5-3B-Instruct - Alibaba's multilingual model
- microsoft/Phi-3-mini-4k-instruct - Microsoft's efficient small model

To use alternative models, you'll need to modify the deployment configuration files accordingly.
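The exact files to change depend on the deployer version you cloned, so a safe first step is to search the repository for the default model name and update every reference you find (this assumes you are still inside the cloned llm-d-deployer directory):
# Locate deployer files that reference the default model (the file list varies by version)
grep -rl --include="*.yaml" --include="*.sh" "Llama-3.2-3B-Instruct" .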
Our automated script will create a complete DOKS cluster with both CPU and GPU nodes:
# Run the automated cluster setup script
./setup-gpu-cluster.sh -c
The script will:
When prompted, select your preferred GPU type:
# Check cluster status
kubectl get nodes
# Verify GPU nodes are ready
kubectl get nodes -l doks.digitalocean.com/gpu-brand=nvidia
# Check GPU resources are available
kubectl describe nodes -l doks.digitalocean.com/gpu-brand=nvidia | grep nvidia.com/gpu
You should see output similar to:
NAME STATUS ROLES AGE VERSION
pool-gpu-xxxxx Ready <none> 3m v1.31.1
pool-gpu-yyyyy Ready <none> 3m v1.31.1
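To confirm how many GPUs Kubernetes can actually schedule on each node (not just that the nodes are Ready), you can inspect the allocatable resources. This assumes the standard nvidia.com/gpu resource name advertised by the NVIDIA device plugin:
# Show the allocatable GPU count per GPU node
kubectl get nodes -l doks.digitalocean.com/gpu-brand=nvidia \
  -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'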
This is completely normal! DigitalOcean API calls may occasionally time out during node provisioning. If you see the script stop after creating the GPU node pool:
Wait 30 seconds for the API operations to complete
Re-run the same command:
./setup-gpu-cluster.sh
The script will automatically detect existing components and continue from where it left off
No duplicate resources will be created - the script is designed to be safely re-run
The script has intelligent state detection and will skip already completed steps, making it completely safe to re-run multiple times.
Now let’s deploy llm-d using our automated deployment scripts. This is a two-step process for better reliability and troubleshooting:
First, let’s deploy the core llm-d inference services:
# Deploy llm-d with your chosen GPU configuration
./deploy-llm-d.sh -g rtx-6000-ada -t your_hf_token
What Gets Deployed:
After llm-d is running, you can optionally set up comprehensive monitoring:
# Navigate to monitoring directory
cd monitoring
# Setup Prometheus, Grafana, and llm-d dashboards
./setup-monitoring.sh
Monitoring Components:
# Watch llm-d deployment progress
kubectl get pods -n llm-d -w
# Check all components are running
kubectl get all -n llm-d
Wait until all pods show Running status:
NAME READY STATUS RESTARTS AGE
meta-llama-llama-3-2-3b-instruct-decode-xxx 1/1 Running 0 3m
meta-llama-llama-3-2-3b-instruct-prefill-xxx 1/1 Running 0 3m
llm-d-inference-gateway-xxx 1/1 Running 0 3m
redis-xxx 1/1 Running 0 3m
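If you prefer not to watch the pods manually, you can also block until everything reports Ready; the 10-minute timeout below is an arbitrary choice, so adjust it to how long your model download typically takes:
# Block until all llm-d pods are Ready (or the timeout expires)
kubectl wait --for=condition=Ready pods --all -n llm-d --timeout=600s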
# Check monitoring stack status
kubectl get pods -n llm-d-monitoring
# Access Grafana dashboard
kubectl port-forward -n llm-d-monitoring svc/prometheus-grafana 3000:80
Now let’s test that everything is working correctly using our test script:
# Navigate to the test directory
cd /path/to/llm-d-deployer/quickstart
# Run the automated test
./test-request.sh
If you prefer to test manually:
# Port-forward to the gateway service
kubectl port-forward -n llm-d svc/llm-d-inference-gateway-istio 8080:80 &
# Test the API with a simple request
curl localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [
{"role": "user", "content": "Explain Kubernetes in simple terms"}
],
"max_tokens": 150,
"stream": false
}' | jq
You should see a successful JSON response like:
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"message": {
"content": "Kubernetes (also known as K8s) is an open-source container orchestration system for automating the deployment, scaling, and management of containerized applications...",
"reasoning_content": null,
"role": "assistant",
"tool_calls": []
},
"stop_reason": null
}
],
"created": 1752523066,
"id": "chatcmpl-76c2a86b-5460-4752-9f20-03c67ca5b0ba",
"kv_transfer_params": null,
"model": "meta-llama/Llama-3.2-3B-Instruct",
"object": "chat.completion",
"prompt_logprobs": null,
"usage": {
"completion_tokens": 150,
"prompt_tokens": 41,
"prompt_tokens_details": null,
"total_tokens": 191
}
}
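Because the API is OpenAI-compatible, you can also flip the stream flag to watch tokens arrive as server-sent events; this reuses the same port-forward from the previous step:
# Same request with streaming enabled; tokens arrive incrementally
curl -N localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Explain Kubernetes in simple terms"}],
    "max_tokens": 150,
    "stream": true
  }'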
If you completed Step 3B (monitoring setup), you can access the comprehensive monitoring dashboard:
# Port-forward to Grafana
kubectl port-forward -n llm-d-monitoring svc/prometheus-grafana 3000:80
# Get admin password
kubectl get secret prometheus-grafana -n llm-d-monitoring -o jsonpath="{.data.admin-password}" | base64 -d
Grafana Access: http://localhost:3000
Username: admin
Password: (from command above)
After monitoring setup, you’ll find:
The dashboard may take 1-2 minutes to appear as it’s loaded by Grafana’s sidecar.
Request Performance Metrics:
Resource Utilization Metrics:
llm-d Specific Metrics:
Kubernetes Metrics:
Performance Optimization Indicators:
These metrics help you:
Symptoms: Script terminates after "GPU node pool created successfully"
Cause: DigitalOcean API response delays during node provisioning (this is normal!)
Solution:
# Wait 30 seconds, then re-run the script
./setup-gpu-cluster.sh
# The script will automatically continue from where it left off
# No duplicate resources will be created
Symptoms: Pods stuck in Pending state
Solution: Check GPU node availability and resource requests
kubectl describe pods -n llm-d | grep -A 5 "Events:"
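A common cause is that pods request more nvidia.com/gpu than the nodes have free. Comparing GPU requests against allocatable capacity usually narrows this down (assuming the GPU node labels used earlier in this tutorial):
# Compare GPU requests on the GPU nodes against their allocatable capacity
kubectl describe nodes -l doks.digitalocean.com/gpu-brand=nvidia | grep -A 8 "Allocated resources"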
Symptoms: Pods showing download errors
Solution: Verify HF_TOKEN is set correctly
kubectl logs -n llm-d -l app=decode
Symptoms: API requests failing
Solution: Check all pods are running and services are available
kubectl get pods -n llm-d
kubectl get svc -n llm-d
Symptoms: llm-d dashboard not visible in Grafana after running monitoring setup
Solution: Check dashboard ConfigMap and Grafana sidecar
# Check if dashboard ConfigMap exists
kubectl get configmap llm-d-dashboard -n llm-d-monitoring
# Check ConfigMap labels
kubectl get configmap llm-d-dashboard -n llm-d-monitoring -o yaml | grep grafana_dashboard
# If missing, re-run monitoring setup
cd monitoring && ./setup-monitoring.sh
Congratulations! You now have a working llm-d deployment on DigitalOcean Kubernetes, including the prefill and decode inference workloads, the OpenAI-compatible inference gateway, Redis, and (if you completed Step 3B) the Prometheus/Grafana monitoring stack.
When you’re done experimenting, you have two cleanup options:
If you want to keep your DOKS cluster but remove llm-d components:
# Navigate back to the deployment directory
cd /path/to/llm-d-deployer/quickstart/infra/doks-digitalocean
# Remove llm-d components using the uninstall flag
./deploy-llm-d.sh -u
# Optionally remove monitoring (if installed)
# kubectl delete namespace llm-d-monitoring
This will remove the llm-d components from the cluster while keeping the DOKS cluster and its node pools available for other workloads.
If you want to remove everything including the cluster:
# Delete the cluster (this will remove all resources)
doctl kubernetes cluster delete llm-d-cluster
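To confirm the cluster (and its billable GPU node pool) is really gone, list the clusters remaining in your account:
# Verify the cluster no longer appears in your account
doctl kubernetes cluster list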
Tip: Use Option 1 if you plan to experiment with different llm-d configurations or other Kubernetes workloads on the same cluster. Use Option 2 for complete cleanup when you’re finished with all experiments.
By following this tutorial, you have learned how to deploy llm-d—a powerful, distributed LLM inference framework—on DigitalOcean Kubernetes (DOKS) with GPU support. You set up a production-ready cluster, configured GPU resources, deployed llm-d components, and validated distributed LLM inference using an OpenAI-compatible API. This approach enables you to efficiently serve large language models at scale, optimize GPU utilization, and build robust, scalable AI services on a cloud-native platform.
With your deployment in place, you can now scale resources, experiment with different models, monitor performance, and integrate LLM inference into your own applications. Whether you are building real-time generative AI products or supporting large-scale inference workloads, llm-d on DOKS provides a flexible and cost-effective foundation.
Happy deploying with llm-d on Kubernetes!