DigitalOcean Inference Mode Comparison for Each Use Case

Published on May 8, 2026

A practical guide for developers and teams new to AI inference, comparing Serverless, Dedicated, Batch, and Inference Router so you can ship smarter from day one.

What is the DigitalOcean Inference Engine?

This DigitalOcean inference mode comparison explains how to choose the right serving model for your latency, cost, and scale needs. The DigitalOcean Inference Engine is the AI inference layer inside DigitalOcean’s cloud platform. It gives you one place to run large language models, from a quick experiment to a production application serving thousands of users, without managing GPU clusters, inference servers, or complex orchestration layers.

New to AI inference? Inference is the act of sending input to an AI model and receiving output. The complexity and cost come from what happens in between: which GPU runs the model, how many requests it handles at once, how long it takes, and how you pay for it. The Inference Engine gives you control over all of that.

The platform contains four distinct inference modes, each designed for a different stage of development or type of workload:

| Mode | The one-line summary |
| --- | --- |
| Serverless Inference | Fire a request, pay for tokens, repeat |
| Dedicated Inference | Provision a private GPU endpoint you fully control |
| Batch Inference | Process thousands of prompts as a single overnight job |
| Inference Router | Automatically route each request to the best model for cost and speed |

This guide explains what each mode does, when to use it, and how to get started, with real-world examples along the way.

Key takeaways

  • Start with Serverless Inference when traffic is uncertain and speed of shipping matters most.
  • Move to Dedicated Inference when your request volume is steady, latency SLOs are strict, or you need BYOM.
  • Use Batch Inference for large non-interactive jobs where lower cost matters more than immediate output.
  • Add Inference Router when your workload has mixed prompt complexity and you want automatic cost and latency balancing.
  • Track break-even decisions using p95 latency, tokens per day, and monthly cost per successful request.

Serverless Inference

Serverless pricing model

Per token, per model.

Serverless Inference is the fastest way to get a model running. You call an API endpoint, you get a response, and you are billed only for the tokens consumed in that request and response. There is no GPU to provision, no infrastructure to maintain, and no idle cost when nobody is sending requests.

It supports a broad catalog of models, from open-source models like DeepSeek, Llama, Qwen, and Mistral to commercial models from Anthropic (Claude) and OpenAI (GPT). DigitalOcean launch messaging also highlights Day 0 access for select model releases. For recent platform launches from Deploy, see Introducing DigitalOcean AI-Native Cloud for Production AI Workloads and Powering the Inference Era.

Because it runs on shared infrastructure, latency can vary under load. For most early-stage applications and experiments, this is usually fine. When you need tighter latency control at scale, that is when teams often move to Dedicated Inference.

Serverless best-fit use cases

  • Prototypes and proof-of-concept apps
  • Unpredictable or spiky traffic patterns
  • Low-to-medium volume production workloads
  • Rapid iteration across many models
  • Internal tools and demos

Serverless key characteristics

  • Zero infrastructure management
  • Scales to zero with no idle GPU costs
  • Access to a large model catalog through one API endpoint
  • OpenAI-compatible API
  • Integrated DigitalOcean billing with no separate vendor accounts

Serverless quick start

curl -X POST 'https://inference.do-ai.run/v1/chat/completions' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Summarize this document in 3 bullet points."}
    ]
  }'
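
Because the endpoint is OpenAI-compatible, you can also call it from the official openai Python SDK by overriding the base URL. A minimal sketch, reusing the endpoint and model slug from the curl example above (install with pip install openai; the key placeholder is yours to fill):

from openai import OpenAI

# Point the standard OpenAI client at the DigitalOcean inference endpoint
client = OpenAI(
    base_url="https://inference.do-ai.run/v1",
    api_key="<your model access key>",
)

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Summarize this document in 3 bullet points."}
    ],
)
print(response.choices[0].message.content)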

Serverless real-world example

A two-person startup is building an AI writing assistant. They do not know how many users they will have next month. Serverless Inference lets them ship today, paying only for what they use, without reserving GPUs that might sit idle. When traffic grows to a predictable baseline, they can migrate to Dedicated Inference for better unit economics. For deeper migration context, see Dedicated vs. Serverless Inference as You Scale and What’s New on DigitalOcean’s Inference Engine.

Dedicated Inference

Dedicated pricing model

Per GPU hour of uptime.

Dedicated Inference gives you a private, single-tenant GPU endpoint: your own inference server in the cloud, managed by DigitalOcean. Nobody else shares your GPU. You choose the GPU type (H100, H200, B300, MI300, MI325, and more), pick the model, and get a dedicated HTTPS endpoint that is yours alone.

Because it is single-tenant, there is no “noisy neighbor” effect. Latency is consistent and predictable. You can configure parameters like concurrency limits, max sequence length, speculative decoding, and LoRA adapters that are simply not tunable in Serverless. Most importantly, you can bring your own fine-tuned or custom models, which Serverless Inference does not support.

Based on current official documentation, Dedicated Inference deployments are available in ATL1, NYC2, and TOR1.

The pricing model flips from per-token to per-GPU-hour. At high steady-state traffic, Dedicated is often cheaper than Serverless because you are not paying a per-token markup. At low traffic, you still pay for idle time. As a rule of thumb, if inference runs continuously and predictably, Dedicated often wins on cost.
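
A quick back-of-the-envelope comparison makes the break-even point concrete. The sketch below uses placeholder prices, not DigitalOcean list prices; substitute your actual per-token and per-GPU-hour rates:

# Hypothetical rates -- replace with real prices for your model and GPU
SERVERLESS_PRICE_PER_1M_TOKENS = 0.60  # USD, blended input+output (placeholder)
DEDICATED_PRICE_PER_GPU_HOUR = 2.50    # USD per GPU (placeholder)
HOURS_PER_MONTH = 730

def monthly_costs(tokens_per_day: float) -> tuple[float, float]:
    """Return (serverless, dedicated) monthly cost in USD.

    Assumes one GPU can serve the volume; add GPUs as throughput requires.
    """
    serverless = tokens_per_day * 30 / 1_000_000 * SERVERLESS_PRICE_PER_1M_TOKENS
    dedicated = DEDICATED_PRICE_PER_GPU_HOUR * HOURS_PER_MONTH
    return serverless, dedicated

for tokens_per_day in (1e6, 10e6, 100e6, 1e9):
    s, d = monthly_costs(tokens_per_day)
    winner = "Dedicated" if d < s else "Serverless"
    print(f"{tokens_per_day:>13,.0f} tokens/day: serverless ${s:>9,.0f}/mo vs dedicated ${d:>6,.0f}/mo -> {winner}")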

Seamless migration: You can move a model you tested on Serverless Inference into a Dedicated endpoint with a single click in the console. No schema changes to your API calls are required.
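
In practice, "no schema changes" means the client-side migration can be as small as swapping the base URL and token. A sketch with placeholder values for the endpoint, token, and model slug:

from openai import OpenAI

# Identical request code as Serverless; only the endpoint and token change
client = OpenAI(
    base_url="<your dedicated endpoint>/v1",  # from the endpoint details page
    api_key="<your DI access token>",         # distinct from the model access key
)

response = client.chat.completions.create(
    model="<model-slug>",
    messages=[{"role": "user", "content": "Explain quantum entanglement simply."}],
)
print(response.choices[0].message.content)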

Dedicated best-fit use cases

  • High-volume, consistent production traffic
  • User-facing applications where latency SLAs matter
  • Custom or fine-tuned model hosting (Bring Your Own Model)
  • Data privacy requirements that demand an isolated execution environment
  • Full control over GPU type, region, and inference engine parameters

Dedicated key characteristics

  • Single-tenant, private GPU instance
  • No cold starts, always warm and ready
  • Bring Your Own Model (BYOM) support for fine-tuned LLMs
  • Scale behavior controlled through accelerator and node configuration
  • Separate DI access token (intentionally distinct from the general DO API token for security)

Dedicated quick start

# Use your Dedicated Inference endpoint URL and DI access token
curl -X POST "$DEDICATED_INFERENCE_PUBLIC_ENDPOINT/v1/chat/completions" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H 'Content-Type: application/json' \
  -d "{
    \"model\": \"$MODEL_SLUG\",
    \"messages\": [
      {\"role\": \"user\", \"content\": \"Explain quantum entanglement simply.\"}
    ]
  }"

Dedicated real-world example

An AI-native SaaS company has 15,000 daily active users using their code completion product. Requests are constant and predictable throughout the day. Per-token pricing at that volume becomes expensive quickly. With Dedicated Inference, they provision H100s, host their fine-tuned code model trained on their proprietary codebase, and pay a flat hourly rate. They also configure concurrency limits to match their traffic profile closely, which is not possible in Serverless Inference. For architecture details, read DigitalOcean Dedicated Inference: A Technical Deep Dive and How to Deploy NVIDIA Dynamo for LLM Inference.

Batch Inference

Batch pricing model

Significantly lower cost than real-time inference.

Batch Inference is for when you have a lot of work to do and none of it needs to happen right now. Instead of sending requests one by one and waiting for each response, you package everything into a JSONL file, submit it as a single job, and come back when it is done, typically within 24 hours.

The trade-off is simple: you give up real-time responses, and in exchange you get lower token costs. Batch traffic runs on off-peak GPU capacity, isolated from real-time inference, so it does not affect your live applications. The system is also fault-tolerant. Transient errors are retried automatically, individual request failures do not fail the whole job, and if a job is cancelled midway through, all completed requests are saved and billed.
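
Each line of the input file is one self-contained request. Here is a sketch of preparing that file; the per-line schema shown (custom_id, method, url, body) mirrors the common OpenAI-style batch format and is an assumption, so verify the exact field names against the Batch Inference docs:

import json

# In practice, prompts come from your database or document store
prompts = [
    "Classify the risk level of: 'Either party may terminate with 30 days notice.'",
    "Classify the risk level of: 'Licensee shall indemnify and hold harmless...'",
]

with open("my_prompts.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"clause-{i}",  # your key for joining results back
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "<supported-model-slug>",  # an OpenAI or Anthropic text model
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        f.write(json.dumps(request) + "\n")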

Current model support: Batch Inference is available for OpenAI and Anthropic text-only models. Open-source, DO-hosted, multimodal, and image generation workloads are not supported yet.

Batch best-fit use cases

  • Document classification at scale
  • Overnight data enrichment pipelines
  • Running model evaluation suites against large test sets
  • Building RAG knowledge base indexes
  • Sentiment analysis or entity extraction on large text corpora

Batch key characteristics

  • Up to 50,000 requests per job file
  • 200 MB maximum file size per job
  • Automatic retry on 429 and 5xx errors (up to 2 retries with exponential backoff)
  • Monitor job status via polling or configure webhooks
  • Continuation jobs for resuming partially completed work

How it works in three steps

# Step 1: Request a pre-signed upload URL, then PUT your JSONL file to it
curl -X POST 'https://inference.do-ai.run/v1/batches/files' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"file_name": "my_prompts.jsonl"}'

# Step 2: Create the batch job
curl -X POST 'https://inference.do-ai.run/v1/batches' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "file_id": "<file_id>",
    "completion_window": "24h",
    "parameters": {
      "temperature": 0.2,
      "max_tokens": 1024
    }
  }'

# Step 3: Poll for completion and download results
curl 'https://inference.do-ai.run/v1/batches/<batch_id>' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY"

Batch real-world example

A legal tech company needs to classify 200,000 contract clauses by risk category. Doing this in real time with a frontier model would be expensive and slow. They prepare a JSONL file with all 200,000 prompts, submit a batch job before end of business, and pull the fully processed results the next morning at a fraction of the real-time cost. If a handful of requests fail overnight, they run a continuation job on just those remaining records. For a practical implementation pattern, see Build a Bulk Inference Content Pipeline with DigitalOcean Serverless Inference.

Inference Router

Router pricing model

You pay for the underlying inference usage of the selected models.

The Inference Router sits in front of your models and automatically decides which model should handle each incoming request based on task type, current latency, and cost. Instead of wiring every request to your most expensive frontier model, the router matches complexity to capability.

It is built on Plano, an open-source AI-native proxy. When a request arrives, a lightweight routing model resolves the intent of the prompt in about 200ms, then ranks candidate models using live cost and latency data that is refreshed continuously. A simple FAQ question gets routed to a fast, cheap model. A complex multi-step reasoning task gets routed to a frontier model. You pay only for what each task actually needs.

You can start with preset routers for common workflows (Software Engineering, Writing, General Q&A, Knowledge Base, Document Intelligence) or define your own task categories using plain language descriptions, no code required. A real-time dashboard shows you which models are selected and how often, giving you visibility into routing decisions.

Model Affinity: You can send the X-Model-Affinity header with a session identifier to keep subsequent requests pinned to the same routed model for more consistent agent behavior.
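
With the OpenAI-compatible API, that header can be attached per request. A sketch using the openai Python SDK's extra_headers option; the router name and session ID are illustrative:

from openai import OpenAI

client = OpenAI(
    base_url="https://inference.do-ai.run/v1",
    api_key="<your model access key>",
)

# Pin every request in this session to whichever model the router first selects
response = client.chat.completions.create(
    model="router:my-custom-router",
    messages=[{"role": "user", "content": "Refactor this function for readability."}],
    extra_headers={"X-Model-Affinity": "session-1234"},  # illustrative session ID
)
print(response.choices[0].message.content)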

Router best-fit use cases

  • Applications that handle diverse, mixed-complexity task types
  • Reducing inference costs without sacrificing response quality
  • Teams that want automatic model selection without building routing logic themselves
  • High-volume applications with a mix of simple and complex prompts

Router key characteristics

  • Approximately 200ms routing overhead per request
  • Preset router configurations for common workflows
  • Custom task definitions using plain natural language
  • Live cost and latency metrics dashboard
  • Routes to both Serverless and Dedicated Inference endpoints
  • Uses a model access key, same as Serverless Inference
  • Inference Router is currently in public preview

Router quick start

# Set "model" to "router:" followed by your router name
curl -X POST 'https://inference.do-ai.run/v1/chat/completions' \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "router:my-custom-router",
    "messages": [
      {"role": "user", "content": "How do I reset my password?"}
    ]
  }'
# The router decides which model answers, not you.

Router real-world example

A developer tools company has an AI assistant that handles code generation, documentation writing, and simple FAQ answers, three very different tasks with different complexity levels. Without a router, every request hits a frontier model at premium prices. With the Inference Router, they define task categories in plain language. Complex code questions go to a strong coding model. FAQ lookups go to a small, fast open-source model. Their monthly inference bill drops with no changes to application code.

Side-by-Side Comparison

| Feature | Serverless | Dedicated | Batch | Router |
| --- | --- | --- | --- | --- |
| Response type | Real-time | Real-time | Async (up to 24h) | Real-time |
| Pricing model | Per token | Per GPU hour | Discounted tokens | Underlying usage |
| Infrastructure setup | None required | Pick GPU and model | None required | Configure tasks |
| Custom / fine-tuned models | Not supported | Yes (BYOM) | Not supported | Via Dedicated endpoint |
| Latency profile | Variable on shared infrastructure | More predictable on isolated GPUs | Not applicable | Depends on selected target model |
| Scale to zero (no idle cost) | Yes | No (hourly billing) | Yes | Yes |
| Best cost scenario | Low or variable volume | High steady-state | Any high-volume offline work | Mixed task types |
| OpenAI-compatible API | Yes | Yes | Yes | Yes |
| Auth token | Model Access Key (or PAT) | Dedicated endpoint access token | Model Access Key | Model Access Key (or PAT) |
| OSS model support | Yes | Yes | OpenAI and Anthropic only | Yes |

How to Decide Which One to Use

Work through these questions to find your answer quickly.

Do you need results immediately, or can you wait hours? If you can wait → Batch Inference. It is the cheapest option for bulk work.

Are you prototyping, experimenting, or in early development? Start with → Serverless Inference. Zero setup, pay as you go, access to every model.

Do you have high, predictable production traffic and strict latency requirements? Upgrade to → Dedicated Inference. Private GPU resources and more predictable latency under steady load.

Do you need to host a fine-tuned or custom model? Only option → Dedicated Inference. BYOM is only supported here.

Are you sending a mix of simple and complex requests and want to cut costs automatically? Layer on → Inference Router. Intelligent routing reduces cost without changing your application code.

The Typical Progression

Most teams follow a natural path:

  1. Start on Serverless Inference to validate the idea. Zero commitment, pay as you learn.
  2. Migrate to Dedicated Inference as traffic grows and becomes predictable. Better unit economics and tighter latency control.
  3. Add the Inference Router to reduce costs across a growing model portfolio.
  4. Introduce Batch Inference whenever offline data jobs appear, such as evals, enrichment pipelines, and classification at scale.

You do not have to follow this path in order. Many teams run multiple modes simultaneously. A company might use Serverless for their chatbot, Dedicated for their fine-tuned internal model, Batch for their nightly data pipeline, and Router to orchestrate all of it.

Metrics and break-even checklist

Use this checklist to decide whether to stay on your current mode or switch.

  • Latency target: If your p95 latency target is strict and misses are increasing, test Dedicated Inference first.
  • Traffic shape: If daily traffic is unpredictable or bursty, keep Serverless as your default entry point.
  • Cost per successful request: Compare serverless token spend versus hourly GPU spend at current and forecast traffic.
  • TTFT and throughput: If time to first token and tokens per second are not meeting product expectations, benchmark Dedicated and Router policies (see the measurement sketch after this list).
  • Workload type: Move offline workflows to Batch to reduce cost and reserve real-time capacity for user-facing requests.
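
One way to collect the TTFT and p95 numbers above is to stream a set of production-like prompts and time the responses. A minimal measurement sketch against the Serverless endpoint, reusing the model slug from the earlier examples:

import statistics
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.do-ai.run/v1",
    api_key="<your model access key>",
)

prompts = ["Summarize our refund policy."] * 20  # substitute production-like prompts

ttfts, totals = [], []
for prompt in prompts:
    start = time.perf_counter()
    first_token_at = None
    stream = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_at = time.perf_counter()  # time to first token
    end = time.perf_counter()
    totals.append(end - start)
    ttfts.append((first_token_at or end) - start)

print(f"TTFT p50:    {statistics.median(ttfts):.2f}s")
print(f"latency p95: {statistics.quantiles(totals, n=20)[18]:.2f}s")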

For design ideas that combine retrieval and tool calling with model serving, see Build an AI app with LLM tool calling and managed databases on DigitalOcean. For inference performance context, see How we built the most performant DeepSeek V3.2, MiniMax-M2.5 and Qwen 3.5 397B on DigitalOcean Serverless Inference.

Get Started

All four inference modes are available in the DigitalOcean control panel today.

FAQs

1. What is the difference between serverless and dedicated inference on DigitalOcean?

Serverless Inference is request-based and billed by token usage, so it is best for variable traffic and fast iteration. Dedicated Inference runs on isolated GPUs billed by the hour, so it is better for steady high-volume workloads, tighter latency control, and custom model hosting.

2. When should I move from serverless inference to dedicated inference?

Move when you see a stable traffic baseline, repeated p95 latency misses, or monthly token spend that is consistently higher than projected hourly GPU cost. A short benchmark with production-like prompts is usually enough to find your break-even point.

3. Is batch inference cheaper than real-time inference?

For non-interactive workloads, usually yes. Batch Inference runs asynchronously and is priced at up to a 50% discount for eligible OpenAI and Anthropic text workloads. Completed work is preserved if jobs are cancelled or expire.

4. How does Inference Router choose models, and does it support fallback?

Inference Router evaluates incoming requests against task definitions and applies policies such as cost efficiency or speed optimization. If a selected model is unavailable or rate-limited, fallback models can take over automatically. Router is currently in public preview.

5. Can I run multiple inference modes together in one architecture?

Yes. Many teams use a hybrid pattern: Serverless for burst traffic, Dedicated for baseline production traffic, Router for mixed prompt complexity, and Batch for offline pipelines such as evaluations, enrichment, and bulk classification.

Conclusion

Choosing the right inference mode is less about one permanent choice and more about matching each workload to the right operating model. Start simple, measure what matters, and evolve toward a hybrid architecture as your product and traffic mature. If you want the larger product vision behind this shift to inference-first systems, read Powering the Inference Era: Inside the DigitalOcean AI-Native Cloud.

Keep building on DigitalOcean

Explore the Inference Engine documentation and launch your first workload in the DigitalOcean control panel. If you are ready to productionize, use Dedicated Inference for predictable traffic and Inference Router for automatic model selection.

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.


About the author(s)

Akshit Pratiush
Senior Solutions Architect

Anish Singh Walia
Sr. Technical Content Strategist and Team Lead
