By Akshit Pratiush and Anish Singh Walia

A practical guide for developers and teams new to AI inference, comparing Serverless, Dedicated, Batch, and Inference Router so you can ship smarter from day one.
This DigitalOcean inference mode comparison explains how to choose the right serving model for your latency, cost, and scale needs. The DigitalOcean Inference Engine is the AI inference layer inside DigitalOcean’s cloud platform. It gives you one place to run large language models, from a quick experiment to a production application serving thousands of users, without managing GPU clusters, inference servers, or complex orchestration layers.
New to AI inference? Inference is the act of sending input to an AI model and receiving output. The complexity and cost come from what happens in between: which GPU runs the model, how many requests it handles at once, how long it takes, and how you pay for it. The Inference Engine gives you control over all of that.
The platform contains four distinct inference modes, each designed for a different stage of development or type of workload:
| Mode | The one-line summary |
|---|---|
| Serverless Inference | Fire a request, pay for tokens, repeat |
| Dedicated Inference | Provision a private GPU endpoint you fully control |
| Batch Inference | Process thousands of prompts as a single overnight job |
| Inference Router | Automatically route each request to the best model for cost and speed |
This guide explains what each mode does, when to use it, and how to get started, with real-world examples along the way.
Pricing: per token, per model.
Serverless Inference is the fastest way to get a model running. You call an API endpoint, you get a response, and you are billed only for the tokens consumed in that request and response. There is no GPU to provision, no infrastructure to maintain, and no idle cost when nobody is sending requests.
It supports a broad catalog of models, from open-source models like DeepSeek, Llama, Qwen, and Mistral to commercial models from Anthropic (Claude) and OpenAI (GPT). DigitalOcean launch messaging also highlights Day 0 access for select model releases. For recent platform launches from Deploy, see Introducing DigitalOcean AI-Native Cloud for Production AI Workloads and Powering the Inference Era.
Because it runs on shared infrastructure, latency can vary under load. For most early-stage applications and experiments, this is usually fine. When you need tighter latency control at scale, that is when teams often move to Dedicated Inference.
```bash
curl -X POST 'https://inference.do-ai.run/v1/chat/completions' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Summarize this document in 3 bullet points."}
    ]
  }'
```
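Because the endpoint is OpenAI-compatible (see the comparison table later in this guide), you can pull just the assistant reply out of the JSON response. A minimal sketch, assuming the standard `choices[0].message.content` response shape and that `jq` is installed:

```bash
# Call Serverless Inference and print only the assistant's reply.
# Assumes the standard OpenAI-style chat completions response shape.
curl -s -X POST 'https://inference.do-ai.run/v1/chat/completions' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Summarize this document in 3 bullet points."}
    ]
  }' | jq -r '.choices[0].message.content'
```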
A two-person startup is building an AI writing assistant. They do not know how many users they will have next month. Serverless Inference lets them ship today, paying only for what they use, without reserving GPUs that might sit idle. When traffic grows to a predictable baseline, they can migrate to Dedicated Inference for better unit economics. For deeper migration context, see Dedicated vs. Serverless Inference as You Scale and What’s New on DigitalOcean’s Inference Engine.
Pricing: per GPU hour of uptime.
Dedicated Inference gives you a private, single-tenant GPU endpoint, your own inference server in the cloud that DigitalOcean manages for you. Nobody else shares your GPU. You choose the GPU type (H100, H200, B300, MI300, MI325, and more), choose the model, and get a dedicated HTTPS endpoint that is yours alone.
Because it is single-tenant, there is no “noisy neighbor” effect. Latency is consistent and predictable. You can configure parameters like concurrency limits, max sequence length, speculative decoding, and LoRA adapters that are simply not tunable in Serverless. Most importantly, you can bring your own fine-tuned or custom models, which Serverless Inference does not support.
Based on current official documentation, Dedicated Inference deployments are available in ATL1, NYC2, and TOR1.
The pricing model flips from per-token to per-GPU-hour. At high steady-state traffic, Dedicated is often cheaper than Serverless because you are not paying a per-token markup. At low traffic, you still pay for idle time. As a rule of thumb, if inference runs continuously and predictably, Dedicated often wins on cost.
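To make that rule of thumb concrete, you can estimate your break-even point by comparing projected monthly token spend against the cost of one GPU running around the clock. The rates below are placeholders, not DigitalOcean list prices; substitute your own numbers from the pricing page:

```bash
# Rough break-even estimate: Serverless (per token) vs. Dedicated (per GPU hour).
# All rates are illustrative placeholders -- replace them with your actual pricing.
TOKENS_PER_MONTH=2000000000          # e.g. 2B tokens per month at steady state
PRICE_PER_1M_TOKENS=0.60             # USD per 1M tokens (placeholder)
GPU_HOURLY_RATE=3.50                 # USD per GPU hour (placeholder)
HOURS_PER_MONTH=730

serverless_cost=$(echo "$TOKENS_PER_MONTH / 1000000 * $PRICE_PER_1M_TOKENS" | bc -l)
dedicated_cost=$(echo "$GPU_HOURLY_RATE * $HOURS_PER_MONTH" | bc -l)

printf "Serverless: \$%.2f/month  Dedicated: \$%.2f/month\n" "$serverless_cost" "$dedicated_cost"
# If the serverless figure consistently exceeds the dedicated one, Dedicated likely wins.
```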
Seamless migration: You can move a model you tested on Serverless Inference into a Dedicated endpoint with a single click in the console. No schema changes to your API calls are required.
```bash
# Use your Dedicated Inference endpoint URL and DI access token.
# The endpoint URL is double-quoted so the shell expands the variable,
# and the JSON body briefly closes its single quotes to expand $MODEL_SLUG.
curl -X POST "$DEDICATED_INFERENCE_PUBLIC_ENDPOINT/v1/chat/completions" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'"$MODEL_SLUG"'",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement simply."}
    ]
  }'
```
An AI-native SaaS company has 15,000 daily active users using their code completion product. Requests are constant and predictable throughout the day. Per-token pricing at that volume becomes expensive quickly. With Dedicated Inference, they provision H100s, host their fine-tuned code model trained on their proprietary codebase, and pay a flat hourly rate. They also configure concurrency limits to match their traffic profile closely, which is not possible in Serverless Inference. For architecture details, read DigitalOcean Dedicated Inference: A Technical Deep Dive and How to Deploy NVIDIA Dynamo for LLM Inference.
Pricing: significantly lower token cost than real-time inference.
Batch Inference is for when you have a lot of work to do and none of it needs to happen right now. Instead of sending requests one by one and waiting for each response, you package everything into a JSONL file, submit it as a single job, and come back when it is done, typically within 24 hours.
The trade-off is simple: you give up real-time responses, and in exchange you get lower token costs. Batch traffic runs on off-peak GPU capacity, isolated from real-time inference, so it does not affect your live applications. The system is also fault-tolerant. Transient errors are retried automatically, individual request failures do not fail the whole job, and if a job is cancelled midway through, all completed requests are saved and billed.
Current model support: Batch Inference is available for OpenAI and Anthropic text-only models. Open-source, DO-hosted, multimodal, and image generation workloads are not supported yet.
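A batch job reads its requests from a JSONL file, one request per line. The exact field names DigitalOcean expects are not shown in this guide; the sketch below follows the common OpenAI-style batch format (`custom_id`, `method`, `url`, `body`) as an illustrative assumption, and the model name is a placeholder, so check the Batch Inference documentation for the authoritative schema:

```bash
# Create a two-line example JSONL request file.
# Field names follow the OpenAI-style batch format and are assumptions here --
# confirm the exact schema in the Batch Inference docs before submitting.
cat > my_prompts.jsonl << 'EOF'
{"custom_id": "clause-0001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Classify the risk level of this contract clause: ..."}]}}
{"custom_id": "clause-0002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Classify the risk level of this contract clause: ..."}]}}
EOF
```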
```bash
# Step 1: Request a pre-signed upload URL and upload your JSONL file
curl -X POST 'https://inference.do-ai.run/v1/batches/files' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"file_name": "my_prompts.jsonl"}'

# Step 2: Create the batch job
curl -X POST 'https://inference.do-ai.run/v1/batches' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "file_id": "<file_id>",
    "completion_window": "24h",
    "parameters": {
      "temperature": 0.2,
      "max_tokens": 1024
    }
  }'

# Step 3: Poll for completion and download results
curl 'https://inference.do-ai.run/v1/batches/<batch_id>' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY"
```
A legal tech company needs to classify 200,000 contract clauses by risk category. Doing this in real time with a frontier model would be expensive and slow. They prepare a JSONL file with all 200,000 prompts, submit a batch job before end of business, and pull the fully processed results the next morning at a fraction of the real-time cost. If a handful of requests fail overnight, they run a continuation job on just those remaining records. For a practical implementation pattern, see Build a Bulk Inference Content Pipeline with DigitalOcean Serverless Inference.
Pricing: you pay for the underlying inference usage of the selected models.
The Inference Router sits in front of your models and automatically decides which model should handle each incoming request based on task type, current latency, and cost. Instead of wiring every request to your most expensive frontier model, the router matches complexity to capability.
It is built on Plano, an open-source AI-native proxy. When a request arrives, a lightweight routing model resolves the intent of the prompt in about 200ms, then ranks candidate models using live cost and latency data that is refreshed continuously. A simple FAQ question gets routed to a fast, cheap model. A complex multi-step reasoning task gets routed to a frontier model. You pay only for what each task actually needs.
You can start with preset routers for common workflows (Software Engineering, Writing, General Q&A, Knowledge Base, Document Intelligence) or define your own task categories using plain language descriptions, no code required. A real-time dashboard shows you which models are selected and how often, giving you visibility into routing decisions.
Model Affinity: You can send the X-Model-Affinity header with a session identifier to keep subsequent requests pinned to the same routed model for more consistent agent behavior.
```bash
# Just prefix your model name with "router:" and your router name
curl -X POST 'https://inference.do-ai.run/v1/chat/completions' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "router:my-custom-router",
    "messages": [
      {"role": "user", "content": "How do I reset my password?"}
    ]
  }'
# The router decides which model answers, not you.
```
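To pin a multi-turn agent session to whichever model the router selected first, add the X-Model-Affinity header described above. The session identifier value here is only illustrative; reuse whatever per-conversation ID your application already tracks:

```bash
# Keep a multi-turn agent session on the same routed model.
# The session ID value is illustrative -- substitute your own conversation identifier.
curl -X POST 'https://inference.do-ai.run/v1/chat/completions' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "X-Model-Affinity: session-8f21c3" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "router:my-custom-router",
    "messages": [
      {"role": "user", "content": "Now walk me through enabling two-factor authentication."}
    ]
  }'
```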
A developer tools company has an AI assistant that handles code generation, documentation writing, and simple FAQ answers, three very different tasks with different complexity levels. Without a router, every request hits a frontier model at premium prices. With the Inference Router, they define task categories in plain language. Complex code questions go to a strong coding model. FAQ lookups go to a small, fast open-source model. Their monthly inference bill drops with no changes to application code.
| Feature | Serverless | Dedicated | Batch | Router |
|---|---|---|---|---|
| Response type | Real-time | Real-time | Async (up to 24h) | Real-time |
| Pricing model | Per token | Per GPU hour | Discounted tokens | Underlying usage |
| Infrastructure setup | None required | Pick GPU and model | None required | Configure tasks |
| Custom / fine-tuned models | Not supported | Yes (BYOM) | Not supported | Via Dedicated endpoint |
| Latency profile | Variable on shared infrastructure | More predictable on isolated GPUs | Not applicable | Depends on selected target model |
| Scale to zero (no idle cost) | Yes | No (hourly billing) | Yes | Yes |
| Best cost scenario | Low or variable volume | High steady-state | Any high-volume offline work | Mixed task types |
| OpenAI-compatible API | Yes | Yes | Yes | Yes |
| Auth token | Model Access Key (or PAT) | Dedicated endpoint access token | Model Access Key | Model Access Key (or PAT) |
| OSS model support | Yes | Yes | OpenAI and Anthropic only | Yes |
Work through these questions to find your answer quickly.
Do you need results immediately, or can you wait hours? If you can wait → Batch Inference. It is the cheapest option for bulk work.
Are you prototyping, experimenting, or in early development? Start with → Serverless Inference. Zero setup, pay as you go, access to every model.
Do you have high, predictable production traffic and strict latency requirements? Upgrade to → Dedicated Inference. Private GPU resources and more predictable latency under steady load.
Do you need to host a fine-tuned or custom model? Only option → Dedicated Inference. BYOM is only supported here.
Are you sending a mix of simple and complex requests and want to cut costs automatically? Layer on → Inference Router. Intelligent routing reduces cost without changing your application code.
Most teams follow a natural path: prototype on Serverless Inference, move steady production traffic to Dedicated Inference, push offline work such as evaluations and bulk classification to Batch Inference, and layer on the Inference Router once requests vary in complexity.
You do not have to follow this path in order. Many teams run multiple modes simultaneously. A company might use Serverless for their chatbot, Dedicated for their fine-tuned internal model, Batch for their nightly data pipeline, and Router to orchestrate all of it.
To decide whether to stay on your current mode or switch, watch for a stable traffic baseline, repeated p95 latency misses, monthly token spend that outpaces the equivalent hourly GPU cost, and growing volumes of work that do not need real-time responses.
For design ideas that combine retrieval and tool calling with model serving, see Build an AI app with LLM tool calling and managed databases on DigitalOcean. For inference performance context, see How we built the most performant DeepSeek V3.2, MiniMax-M2.5 and Qwen 3.5 397B on DigitalOcean Serverless Inference.
All four inference modes are available in the DigitalOcean control panel today.
What is the difference between Serverless and Dedicated Inference? Serverless Inference is request-based and billed by token usage, so it is best for variable traffic and fast iteration. Dedicated Inference runs on isolated GPUs billed by the hour, so it is better for steady high-volume workloads, tighter latency control, and custom model hosting.
When should I move from Serverless to Dedicated Inference? Move when you see a stable traffic baseline, repeated p95 latency misses, or monthly token spend that is consistently higher than the projected hourly GPU cost. A short benchmark with production-like prompts is usually enough to find your break-even point.
Is Batch Inference cheaper than real-time inference? For non-interactive workloads, usually yes. Batch Inference runs asynchronously and is priced at up to a 50% discount for eligible OpenAI and Anthropic text workloads. Completed work is preserved if jobs are cancelled or expire.
How does the Inference Router choose a model? It evaluates incoming requests against task definitions and applies policies such as cost efficiency or speed optimization. If a selected model is unavailable or rate-limited, fallback models can take over automatically. The Router is currently in public preview.
Can I combine multiple inference modes? Yes. Many teams use a hybrid pattern: Serverless for burst traffic, Dedicated for baseline production traffic, Router for mixed prompt complexity, and Batch for offline pipelines such as evaluations, enrichment, and bulk classification.
Choosing the right inference mode is less about one permanent choice and more about matching each workload to the right operating model. Start simple, measure what matters, and evolve toward a hybrid architecture as your product and traffic mature. If you want the larger product vision behind this shift to inference-first systems, read Powering the Inference Era: Inside the DigitalOcean AI-Native Cloud.
Explore the Inference Engine documentation and launch your first workload in the DigitalOcean control panel. If you are ready to productionize, use Dedicated Inference for predictable traffic and Inference Router for automatic model selection.