By Akshit Pratiush and Anish Singh Walia

A practical guide for developers and teams new to AI inference, comparing Serverless, Dedicated, Batch, and Inference Router so you can ship smarter from day one.
This DigitalOcean inference mode comparison explains how to choose the right serving model for your latency, cost, and scale needs. The DigitalOcean Inference Engine is the AI inference layer inside DigitalOcean’s cloud platform. It gives you one place to run large language models, from a quick experiment to a production application serving thousands of users, without managing GPU clusters, inference servers, or complex orchestration layers.
New to AI inference? Inference is the act of sending input to an AI model and receiving output. The complexity and cost come from what happens in between: which GPU runs the model, how many requests it handles at once, how long it takes, and how you pay for it. The Inference Engine gives you control over all of that.
The platform contains four distinct inference modes, each designed for a different stage of development or type of workload:
| Mode | The one-line summary |
|---|---|
| Serverless Inference | Fire a request, pay for tokens, repeat |
| Dedicated Inference | Provision a private GPU endpoint you fully control |
| Batch Inference | Process thousands of prompts as a single overnight job |
| Inference Router | Automatically route each request to the best model for cost and speed |
This guide explains what each mode does, when to use it, and how to get started, with real-world examples along the way.
Pricing: per token, per model.
Serverless Inference is the fastest way to get a model running. You call an API endpoint, you get a response, and you are billed only for the tokens consumed in that request and response. There is no GPU to provision, no infrastructure to maintain, and no idle cost when nobody is sending requests.
It supports a broad catalog of models, from open-source models like DeepSeek, Llama, Qwen, and Mistral to commercial models from Anthropic (Claude) and OpenAI (GPT). DigitalOcean launch messaging also highlights Day 0 access for select model releases. For recent platform launches from Deploy, see Introducing DigitalOcean AI-Native Cloud for Production AI Workloads and Powering the Inference Era.
Because it runs on shared infrastructure, latency can vary under load. For most early-stage applications and experiments, this is usually fine. When you need tighter latency control at scale, that is when teams often move to Dedicated Inference.
```bash
curl -X POST 'https://inference.do-ai.run/v1/chat/completions' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Summarize this document in 3 bullet points."}
    ]
  }'
```
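Because the endpoint is OpenAI-compatible (see the comparison table later in this guide), you can pull just the assistant reply out of the JSON response. A minimal sketch, assuming the standard `choices[0].message.content` response shape and that `jq` is installed:

```bash
# Call Serverless Inference and print only the assistant's reply.
# Assumes the standard OpenAI-style chat completions response shape.
curl -s -X POST 'https://inference.do-ai.run/v1/chat/completions' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Summarize this document in 3 bullet points."}
    ]
  }' | jq -r '.choices[0].message.content'
```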
A two-person startup is building an AI writing assistant. They do not know how many users they will have next month. Serverless Inference lets them ship today, paying only for what they use, without reserving GPUs that might sit idle. When traffic grows to a predictable baseline, they can migrate to Dedicated Inference for better unit economics. For deeper migration context, see Dedicated vs. Serverless Inference as You Scale and What’s New on DigitalOcean’s Inference Engine.
Pricing: per GPU hour of uptime.
Dedicated Inference gives you a private, single-tenant GPU endpoint, your own inference server in the cloud that DigitalOcean manages for you. Nobody else shares your GPU. You choose the GPU type (H100, H200, B300, MI300, MI325, and more), choose the model, and get a dedicated HTTPS endpoint that is yours alone.
Because it is single-tenant, there is no “noisy neighbor” effect. Latency is consistent and predictable. You can configure parameters like concurrency limits, max sequence length, speculative decoding, and LoRA adapters that are simply not tunable in Serverless. Most importantly, you can bring your own fine-tuned or custom models, which Serverless Inference does not support.
Based on current official documentation, Dedicated Inference deployments are available in ATL1, NYC2, and TOR1.
The pricing model flips from per-token to per-GPU-hour. At high steady-state traffic, Dedicated is often cheaper than Serverless because you are not paying a per-token markup. At low traffic, you still pay for idle time. As a rule of thumb, if inference runs continuously and predictably, Dedicated often wins on cost.
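To make that rule of thumb concrete, you can estimate your break-even point by comparing projected monthly token spend against the cost of one GPU running around the clock. The rates below are placeholders, not DigitalOcean list prices; substitute your own numbers from the pricing page:

```bash
# Rough break-even estimate: Serverless (per token) vs. Dedicated (per GPU hour).
# All rates are illustrative placeholders -- replace them with your actual pricing.
TOKENS_PER_MONTH=2000000000          # e.g. 2B tokens per month at steady state
PRICE_PER_1M_TOKENS=0.60             # USD per 1M tokens (placeholder)
GPU_HOURLY_RATE=3.50                 # USD per GPU hour (placeholder)
HOURS_PER_MONTH=730

serverless_cost=$(echo "$TOKENS_PER_MONTH / 1000000 * $PRICE_PER_1M_TOKENS" | bc -l)
dedicated_cost=$(echo "$GPU_HOURLY_RATE * $HOURS_PER_MONTH" | bc -l)

printf "Serverless: \$%.2f/month  Dedicated: \$%.2f/month\n" "$serverless_cost" "$dedicated_cost"
# If the serverless figure consistently exceeds the dedicated one, Dedicated likely wins.
```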
Seamless migration: You can move a model you tested on Serverless Inference into a Dedicated endpoint with a single click in the console. No schema changes to your API calls are required.
```bash
# Use your Dedicated Inference endpoint URL and DI access token.
# The endpoint URL is double-quoted so the shell expands the variable,
# and the JSON body briefly closes its single quotes to expand $MODEL_SLUG.
curl -X POST "$DEDICATED_INFERENCE_PUBLIC_ENDPOINT/v1/chat/completions" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"$MODEL_SLUG"'",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement simply."}
    ]
  }'
```
An AI-native SaaS company has 15,000 daily active users using their code completion product. Requests are constant and predictable throughout the day. Per-token pricing at that volume becomes expensive quickly. With Dedicated Inference, they provision H100s, host their fine-tuned code model trained on their proprietary codebase, and pay a flat hourly rate. They also configure concurrency limits to match their traffic profile closely, which is not possible in Serverless Inference. For architecture details, read DigitalOcean Dedicated Inference: A Technical Deep Dive and How to Deploy NVIDIA Dynamo for LLM Inference.
Pricing: significantly lower token cost than real-time inference.
Batch Inference is for when you have a lot of work to do and none of it needs to happen right now. Instead of sending requests one by one and waiting for each response, you package everything into a JSONL file, submit it as a single job, and come back when it is done, typically within 24 hours.
The trade-off is simple: you give up real-time responses, and in exchange you get lower token costs. Batch traffic runs on off-peak GPU capacity, isolated from real-time inference, so it does not affect your live applications. The system is also fault-tolerant. Transient errors are retried automatically, individual request failures do not fail the whole job, and if a job is cancelled midway through, all completed requests are saved and billed.
Current model support: Batch Inference is available for OpenAI and Anthropic text-only models. Open-source, DO-hosted, multimodal, and image generation workloads are not supported yet.
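A batch job reads its requests from a JSONL file, one request per line. The exact field names DigitalOcean expects are not shown in this guide; the sketch below follows the common OpenAI-style batch format (`custom_id`, `method`, `url`, `body`) as an illustrative assumption, and the model name is a placeholder, so check the Batch Inference documentation for the authoritative schema:

```bash
# Create a two-line example JSONL request file.
# Field names follow the OpenAI-style batch format and are assumptions here --
# confirm the exact schema in the Batch Inference docs before submitting.
cat > my_prompts.jsonl << 'EOF'
{"custom_id": "clause-0001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Classify the risk level of this contract clause: ..."}]}}
{"custom_id": "clause-0002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Classify the risk level of this contract clause: ..."}]}}
EOF
```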
```bash
# Step 1: Request a pre-signed upload URL and upload your JSONL file
curl -X POST 'https://inference.do-ai.run/v1/batches/files' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"file_name": "my_prompts.jsonl"}'

# Step 2: Create the batch job
curl -X POST 'https://inference.do-ai.run/v1/batches' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "file_id": "<file_id>",
    "completion_window": "24h",
    "parameters": {
      "temperature": 0.2,
      "max_tokens": 1024
    }
  }'

# Step 3: Poll for completion and download results
curl 'https://inference.do-ai.run/v1/batches/<batch_id>' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY"
```
A legal tech company needs to classify 200,000 contract clauses by risk category. Doing this in real time with a frontier model would be expensive and slow. They prepare a JSONL file with all 200,000 prompts, submit a batch job before end of business, and pull the fully processed results the next morning at a fraction of the real-time cost. If a handful of requests fail overnight, they run a continuation job on just those remaining records. For a practical implementation pattern, see Build a Bulk Inference Content Pipeline with DigitalOcean Serverless Inference.
Pricing: you pay for the underlying inference usage of the selected models.
The Inference Router sits in front of your models and automatically decides which model should handle each incoming request based on task type, current latency, and cost. Instead of wiring every request to your most expensive frontier model, the router matches complexity to capability.
It is built on Plano, an open-source AI-native proxy. When a request arrives, a lightweight routing model resolves the intent of the prompt in about 200ms, then ranks candidate models using live cost and latency data that is refreshed continuously. A simple FAQ question gets routed to a fast, cheap model. A complex multi-step reasoning task gets routed to a frontier model. You pay only for what each task actually needs.
You can start with preset routers for common workflows (Software Engineering, Writing, General Q&A, Knowledge Base, Document Intelligence) or define your own task categories using plain language descriptions, no code required. A real-time dashboard shows you which models are selected and how often, giving you visibility into routing decisions.
Model Affinity: You can send the X-Model-Affinity header with a session identifier to keep subsequent requests pinned to the same routed model for more consistent agent behavior.
```bash
# Just prefix your model name with "router:" and your router name
curl -X POST 'https://inference.do-ai.run/v1/chat/completions' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "router:my-custom-router",
    "messages": [
      {"role": "user", "content": "How do I reset my password?"}
    ]
  }'
# The router decides which model answers, not you.
```
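To pin a multi-turn agent session to whichever model the router selected first, add the X-Model-Affinity header described above. The session identifier value here is only illustrative; reuse whatever per-conversation ID your application already tracks:

```bash
# Keep a multi-turn agent session on the same routed model.
# The session ID value is illustrative -- substitute your own conversation identifier.
curl -X POST 'https://inference.do-ai.run/v1/chat/completions' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "X-Model-Affinity: session-8f21c3" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "router:my-custom-router",
    "messages": [
      {"role": "user", "content": "Now walk me through enabling two-factor authentication."}
    ]
  }'
```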
A developer tools company has an AI assistant that handles code generation, documentation writing, and simple FAQ answers, three very different tasks with different complexity levels. Without a router, every request hits a frontier model at premium prices. With the Inference Router, they define task categories in plain language. Complex code questions go to a strong coding model. FAQ lookups go to a small, fast open-source model. Their monthly inference bill drops with no changes to application code.
| Feature | Serverless | Dedicated | Batch | Router |
|---|---|---|---|---|
| Response type | Real-time | Real-time | Async (up to 24h) | Real-time |
| Pricing model | Per token | Per GPU hour | Discounted tokens | Underlying usage |
| Infrastructure setup | None required | Pick GPU and model | None required | Configure tasks |
| Custom / fine-tuned models | Not supported | Yes (BYOM) | Not supported | Via Dedicated endpoint |
| Latency profile | Variable on shared infrastructure | More predictable on isolated GPUs | Not applicable | Depends on selected target model |
| Scale to zero (no idle cost) | Yes | No (hourly billing) | Yes | Yes |
| Best cost scenario | Low or variable volume | High steady-state | Any high-volume offline work | Mixed task types |
| OpenAI-compatible API | Yes | Yes | Yes | Yes |
| Auth token | Model Access Key (or PAT) | Dedicated endpoint access token | Model Access Key | Model Access Key (or PAT) |
| OSS model support | Yes | Yes | OpenAI and Anthropic only | Yes |
Work through these questions to find your answer quickly.
Do you need results immediately, or can you wait hours? If you can wait → Batch Inference. It is the cheapest option for bulk work.
Are you prototyping, experimenting, or in early development? Start with → Serverless Inference. Zero setup, pay as you go, access to every model.
Do you have high, predictable production traffic and strict latency requirements? Upgrade to → Dedicated Inference. Private GPU resources and more predictable latency under steady load.
Do you need to host a fine-tuned or custom model? Only option → Dedicated Inference. BYOM is only supported here.
Are you sending a mix of simple and complex requests and want to cut costs automatically? Layer on → Inference Router. Intelligent routing reduces cost without changing your application code.
Most teams follow a natural path: prototype on Serverless Inference, move steady production traffic to Dedicated Inference, push offline work such as evaluations and bulk classification to Batch Inference, and layer on the Inference Router once requests vary in complexity.
You do not have to follow this path in order. Many teams run multiple modes simultaneously. A company might use Serverless for their chatbot, Dedicated for their fine-tuned internal model, Batch for their nightly data pipeline, and Router to orchestrate all of it.
To decide whether to stay on your current mode or switch, watch for a stable traffic baseline, repeated p95 latency misses, monthly token spend that outpaces the equivalent hourly GPU cost, and growing volumes of work that do not need real-time responses.
For design ideas that combine retrieval and tool calling with model serving, see Build an AI app with LLM tool calling and managed databases on DigitalOcean. For inference performance context, see How we built the most performant DeepSeek V3.2, MiniMax-M2.5 and Qwen 3.5 397B on DigitalOcean Serverless Inference.
All four inference modes are available in the DigitalOcean control panel today.
What is the difference between Serverless and Dedicated Inference? Serverless Inference is request-based and billed by token usage, so it is best for variable traffic and fast iteration. Dedicated Inference runs on isolated GPUs billed by the hour, so it is better for steady high-volume workloads, tighter latency control, and custom model hosting.
When should I move from Serverless to Dedicated Inference? Move when you see a stable traffic baseline, repeated p95 latency misses, or monthly token spend that is consistently higher than the projected hourly GPU cost. A short benchmark with production-like prompts is usually enough to find your break-even point.
Is Batch Inference cheaper than real-time inference? For non-interactive workloads, usually yes. Batch Inference runs asynchronously and is priced at up to a 50% discount for eligible OpenAI and Anthropic text workloads. Completed work is preserved if jobs are cancelled or expire.
How does the Inference Router choose a model? It evaluates incoming requests against task definitions and applies policies such as cost efficiency or speed optimization. If a selected model is unavailable or rate-limited, fallback models can take over automatically. The Router is currently in public preview.
Can I combine multiple inference modes? Yes. Many teams use a hybrid pattern: Serverless for burst traffic, Dedicated for baseline production traffic, Router for mixed prompt complexity, and Batch for offline pipelines such as evaluations, enrichment, and bulk classification.
Choosing the right inference mode is less about one permanent choice and more about matching each workload to the right operating model. Start simple, measure what matters, and evolve toward a hybrid architecture as your product and traffic mature. If you want the larger product vision behind this shift to inference-first systems, read Powering the Inference Era: Inside the DigitalOcean AI-Native Cloud.
Explore the Inference Engine documentation and launch your first workload in the DigitalOcean control panel. If you are ready to productionize, use Dedicated Inference for predictable traffic and Inference Router for automatic model selection.