DigitalOcean Inference Engine

The Inference Engine for production AI: every model, every modality, on one platform, with the Inference Router optimizing every call.

Three problems with fragmented inference

Model obsolescence

Models evolve quickly, and the one you chose months ago is rarely the best option today. Without intelligent routing, staying current means repeated migrations, rewrites, and vendor churn.

Cost that compounds at every layer

Inference costs don't just grow with tokens. Requests pass through multiple vendors, each adding markup on compute and orchestration, and every hop between services incurs egress fees. Teams end up overpaying for simple workloads or building complex routing systems just to stay efficient.

Operational blind spots

When inference runs across fragmented services, observability becomes an afterthought. Teams lose end-to-end visibility into latency by model, cost per request, and error rates, and can't optimize what they can't measure.

DigitalOcean Inference Engine Layers

Control and Evaluate AI Inference

A unified control plane for AI inference. Define routing policies, evaluate model performance, and run experiments to continuously optimize how models behave in production.

Replace fragmented tooling with a single system for policy definition, output validation, and model testing.

Control plane for intelligence

Inference Router (Public Preview)

The Inference Router is the control plane for production AI systems, unifying how models are selected and optimized across every inference call. It replaces manual routing logic with policy-driven control that adapts in real time.

Teams define routing behavior with simple policies, written in natural language or as structured rules, for intent-based control over cost and latency without hardcoding model choices (see the example after the feature list below).

Auto-route by cost or latency

Override any request at runtime

Failover that just works

Pin trajectories for agent consistency

Runs on Serverless and Dedicated

Full Model Catalog, one endpoint

Evaluate routers like models
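
For illustration only, here is a minimal sketch of what policy-driven routing can look like from application code, assuming an OpenAI-compatible endpoint. The base URL, API key placeholder, and router preset name are assumptions made for the example, not the documented Router API.

    # Illustrative sketch: the base URL and router preset name below are
    # assumptions, not DigitalOcean's documented Inference Router API.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://inference.do-ai.run/v1",  # assumed OpenAI-compatible endpoint
        api_key="YOUR_INFERENCE_KEY",
    )

    # The request targets a routing policy (hypothetical preset name) instead of
    # a hardcoded model; the router picks a model per call based on cost and latency.
    response = client.chat.completions.create(
        model="router-cost-optimized",  # hypothetical router preset, not a fixed model
        messages=[{"role": "user", "content": "Summarize this support ticket in two sentences."}],
    )

    print(response.choices[0].message.content)
    print(response.model)  # which model the router actually selected for this call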

Evaluate models with real production data

Model Evaluations (Public Preview)

For teams validating model performance using real datasets before deploying to production.

Model Evaluations enables structured testing of catalog models, Bring Your Own Models, and inference routers. It uses LLM-as-a-Judge scoring to provide unified visibility into quality and latency (see the example after the list below).

Evaluate anything: catalog, BYOM, and routers

Real datasets and LLM-as-a-Judge scoring

Correctness, completeness, faithfulness, and safety

Latency, tokens, and cost per run

Compare everything side by side

Re-run as models evolve
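
As an illustration of the LLM-as-a-Judge pattern described above, here is a minimal scoring loop written against an assumed OpenAI-compatible endpoint. The base URL, model names, dataset, and rubric are examples only, not the managed Model Evaluations service itself.

    # Minimal LLM-as-a-Judge sketch; endpoint, model names, and rubric are examples.
    from openai import OpenAI

    client = OpenAI(base_url="https://inference.do-ai.run/v1", api_key="YOUR_INFERENCE_KEY")

    dataset = [
        {"prompt": "What is our refund window?", "reference": "30 days from delivery."},
    ]

    for row in dataset:
        # 1. Get the candidate model's answer for this prompt.
        answer = client.chat.completions.create(
            model="llama3.3-70b-instruct",  # example candidate model
            messages=[{"role": "user", "content": row["prompt"]}],
        ).choices[0].message.content

        # 2. Ask a judge model to score the answer against the reference.
        verdict = client.chat.completions.create(
            model="gpt-oss-120b",  # example judge model
            messages=[{
                "role": "user",
                "content": (
                    "Score the answer from 1-5 for correctness and faithfulness to the reference. "
                    f"Reference: {row['reference']} Answer: {answer} "
                    "Reply with only the number."
                ),
            }],
        ).choices[0].message.content

        print(row["prompt"], "->", verdict)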

Test, compare, and move faster from idea to production

Model Playground

For rapid experimentation and comparison across all model types.

The Model Playground lets teams test text, image, audio, and video models side by side and export production-ready API code directly from their configuration.

Every modality, side by side

Live parameter controls

Real-time inference with any catalog model

Zero code to test

Export curl or SDK instantly

Playground to production in one click

Run AI Inference in Production

The runtime for AI inference. Execute real-time, batch, and dedicated workloads through a single system that abstracts infrastructure complexity.

Real-time inference

Serverless Inference

For production APIs, agents, and applications that require real-time responses.

Real-time text generation, image generation, audio, and video inference

70+ curated open-source and frontier models

Day 0 access to select OpenAI and Anthropic model releases

Intelligent routing for cost and latency optimization

Built-in observability (tokens, latency, errors, spend)

Multimodal generation (text-to-image, text-to-video, text-to-speech)

Agentic workflows via Messages API
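
A minimal sketch of a real-time serverless call, assuming the OpenAI-compatible endpoint described in the FAQ below; the base URL and model name are illustrative. Swapping the model string is all it takes to change models.

    # Illustrative serverless call; the base URL and model name are assumptions.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://inference.do-ai.run/v1",  # assumed serverless endpoint
        api_key="YOUR_INFERENCE_KEY",
    )

    response = client.chat.completions.create(
        model="llama3.3-70b-instruct",  # example catalog model; change the string to switch models
        messages=[{"role": "user", "content": "Draft a welcome email for a new customer."}],
        temperature=0.7,
        max_tokens=300,
        stream=True,  # stream tokens as they are generated for real-time UIs and agents
    )

    for chunk in response:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)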

Asynchronous AI at scale

Batch Inference

For large-scale workloads that do not require real-time latency.

Async job-based inference via API or SDK

24-hour result delivery SLA

Up to 50% cost reduction vs. real-time inference

Isolated rate limits from production workloads

Transparent job lifecycle tracking (queued → processing → complete)

OpenAI- and Anthropic-compatible batch schema for easy migration (example below)

Large-scale evaluation, enrichment, and moderation pipelines
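
A minimal sketch of submitting and tracking a batch job through the OpenAI-compatible batch schema referenced above; the base URL, model name, and file contents are assumptions made for the example.

    # Illustrative batch submission via an OpenAI-compatible batch schema;
    # the base URL and model name are assumptions, not verified values.
    import json
    import time

    from openai import OpenAI

    client = OpenAI(base_url="https://inference.do-ai.run/v1", api_key="YOUR_INFERENCE_KEY")

    # 1. Write requests to a JSONL file, one request per line.
    with open("requests.jsonl", "w") as f:
        for i, text in enumerate(["Review A ...", "Review B ..."]):
            f.write(json.dumps({
                "custom_id": f"review-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "llama3.3-70b-instruct",  # example model
                    "messages": [{"role": "user", "content": f"Classify the sentiment: {text}"}],
                },
            }) + "\n")

    # 2. Upload the file and create the batch job (results delivered within 24 hours).
    batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
    job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )

    # 3. Poll the job lifecycle: queued -> processing -> complete.
    while job.status not in ("completed", "failed", "expired", "cancelled"):
        time.sleep(60)
        job = client.batches.retrieve(job.id)

    print(job.status, job.output_file_id)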

Controlled model hosting

Dedicated Inference

For sustained workloads requiring infrastructure-level control and performance guarantees.

Dedicated GPU endpoints in selected regions

Bring Your Own Model deployment

Custom GPU type and scaling configuration

Pre-tuned inference stack with optimized performance defaults

Managed orchestration and scaling without Kubernetes complexity

High-throughput production workloads and agent systems

Fine-tuned control over latency and performance profiles

Serverless Inference is fantastic because we can make as many calls as we need without worrying about provisioning infrastructure. It just scales automatically.

Carlo Ruiz

Infrastructure Engineer, Traversal

Built for the multimodal era

Modern AI applications are not text-only. The Inference Engine natively supports:

Text generation

Image generation

Video generation

Speech generation

Vision-language understanding

All through a single API key. No separate vendors. No fragmented billing. No additional infrastructure.
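
For illustration, here is a text-to-image call sketched with the OpenAI-compatible images API shape and the same key; the model name, size, and endpoint are assumptions, not the documented multimodal API.

    # Illustrative text-to-image sketch; endpoint, model name, and parameters are assumptions.
    import base64

    from openai import OpenAI

    client = OpenAI(base_url="https://inference.do-ai.run/v1", api_key="YOUR_INFERENCE_KEY")

    image = client.images.generate(
        model="flux-schnell",  # example image model
        prompt="A minimalist line drawing of a sailboat at dawn",
        size="1024x1024",
        response_format="b64_json",
    )

    with open("sailboat.png", "wb") as f:
        f.write(base64.b64decode(image.data[0].b64_json))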

The latest models — by design

Weekly open-source refreshes, one-line model switching, and Day 0 access to select frontier releases keep production teams moving without migrations.

Operational intelligence at every step

Cost optimization

Delivers 3x the throughput of AWS Bedrock, with serverless tokens at $0.65/M and dedicated inference at $6/hr.

Observability built in

Track token usage, time to first token (TTFT), latency, errors, spend, and batch lifecycle without external tooling. Ranked #1 on Artificial Analysis for performance efficiency across leading inference providers.

Platform native

One security model, one billing system, and one infrastructure layer from GPU to API.

Built into DigitalOcean AI-Native Cloud

Your inference layer is part of your stack

Run your inference workloads alongside your existing infrastructure with no stitched-together vendors, fragmented billing, or hidden complexity.

FAQs about the Inference Engine

What is the Inference Engine?

The Inference Engine is DigitalOcean's production system for serving AI models at scale. It brings together Serverless, Batch, and Dedicated Inference under a single OpenAI- and Anthropic-compatible endpoint so developers can run real-time, asynchronous, or reserved workloads without managing infrastructure.

How does the Inference Router (Public Preview) work?

Instead of manually choosing models for each request, developers can rely on system-level routing or presets that automatically match requests to the most appropriate model based on task type, cost, and performance needs. This reduces the need to hardcode model decisions and helps optimize inference in production.

What models are available on DigitalOcean?

DigitalOcean provides a curated catalog of 70+ open-source models alongside early access to select frontier models from providers like OpenAI and Anthropic. The catalog also includes models such as Qwen 3.5 397B, DeepSeek v3.2, and Gemma 4, all accessible through a single API key.

You can also import models from Hugging Face and bring your own custom models from Spaces into your Model Catalog, giving you a single pane of glass to manage and deploy everything.

What is Multimodal Inference?

Multimodal Inference allows developers to generate and process images, video, and audio directly through DigitalOcean’s API. It includes capabilities like text-to-image, text-to-video, and text-to-speech, all running natively within the same platform as text models.

How is Batch Inference different from real-time inference?

Batch Inference is designed for large, asynchronous workloads that do not require immediate responses. It allows developers to submit large job sets and receive results within 24 hours at significantly lower cost than real-time inference.

What is the Model Playground used for?

The Model Playground is an interactive environment for testing text, image, audio, and video models side by side. It allows developers to adjust parameters and export ready-to-use API code directly from their configurations.

How is pricing handled across the platform?

DigitalOcean uses a pay-as-you-go model with spend-based limits rather than fixed token caps. Certain workloads also benefit from features like off-peak discounts and batch pricing to reduce overall inference costs.

Who is the Inference Engine built for?

It is designed for AI engineers and technical teams building production AI applications at scale. This includes AI-native companies, enterprise teams modernizing workflows, and developers who need flexibility across models, modalities, and deployment types.

Start building with the Inference Engine

One platform for every model. One system for every workload. One engine for production AI.