DigitalOcean Evaluations

Evaluate any LLM, dedicated endpoint, or inference routing policy on DigitalOcean Inference Engine. Compare quality and performance in a single LLM-as-a-Judge workflow.

Benchmarks don't predict production. Your data does.

Shipping without validation

Generic benchmark scores show how a model performs on public datasets, not on your prompts, use case, or production endpoints. Without validation on real workloads, teams deploy on intuition and discover quality regressions only after they impact users.

No consistent way to compare

Base models, fine-tuned models, imported models, and routing policies often live in separate systems. Without a consistent evaluation framework, every comparison requires custom tooling, making results harder to trust and repeat.

Evaluation that doesn't scale

As models evolve, evaluation often remains a manual, one-time exercise. Without repeatable workflows, teams struggle to detect regressions, track progress across versions, or build evaluation into their deployment process.

Evaluate your entire inference stack

Serverless Inference models

Evaluate any model in the DigitalOcean Model Catalog against your own datasets.

Dedicated Inference endpoints

Test models deployed on reserved GPU capacity using the same endpoints you run in production.

BYOM models

Evaluate custom or fine-tuned models imported from Hugging Face or DigitalOcean Spaces.

Routing Policies

Evaluate any router configuration as a candidate alongside single models, compare quality and latency against your dataset, and catch misconfigurations before they reach production traffic.

Everything you need to evaluate with confidence

LLM-as-a-Judge with pre-built and custom rubrics

Score outputs against criteria that match your domain, not a generic rubric someone else defined.

  • Six pre-built metrics: correctness, completeness, faithfulness, PII, toxicity, bias
  • Custom rubrics: define your own judge instructions and scoring criteria
  • Per-item drill-down with full judge rationale
  • Configurable judge model from the Model Catalog

Evaluation presets

Save your full evaluation configuration and re-run it identically as models evolve.

  • Save judge model, metrics, system prompt, and model parameters as a reusable preset
  • One-click re-run across model versions
  • Eliminates configuration drift between runs
  • Compare v1 to v2 without rebuilding from scratch
  • Institutional memory across the model lifecycle
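One way to picture how a preset removes configuration drift is a fingerprint over the saved settings. The sketch below is illustrative only: `EvalPreset`, its field names, and `fingerprint` are hypothetical and do not describe the DigitalOcean API. It assumes a preset holds the four items listed above and shows that two runs share a fingerprint only when every setting matches.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class EvalPreset:
    """Hypothetical local model of a preset (not the platform's schema)."""
    judge_model: str
    metrics: tuple
    system_prompt: str
    model_params: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        # Serialize with sorted keys so key order does not change the hash.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


v1_run = EvalPreset(
    judge_model="Qwen3-32B",
    metrics=("correctness", "faithfulness"),
    system_prompt="Answer concisely.",
    model_params={"temperature": 0.0, "max_tokens": 512},
)
v2_run = EvalPreset(
    judge_model="Qwen3-32B",
    metrics=("correctness", "faithfulness"),
    system_prompt="Answer concisely.",
    model_params={"max_tokens": 512, "temperature": 0.0},
)

# Identical settings give identical fingerprints, so scores are comparable.
same_config = v1_run.fingerprint() == v2_run.fingerprint()
```

Recording a fingerprint like this next to each result makes it easy to spot a comparison where the judge, metrics, prompt, or parameters changed between runs.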

Quality and performance in one run

Stop stitching quality scores and serving metrics from separate tools.

  • Judge scores alongside TTFT, total latency, throughput, and tokens-per-request, all in the same dashboard
  • Aggregate and per-item views side by side
  • Answer "which candidate gives the best accuracy and performance?" without leaving the platform
  • Export results for deeper analysis
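For readers who export results, the serving metrics named above can be reproduced from per-item timestamps. This is a minimal sketch under stated assumptions: the `ItemTiming` fields are hypothetical and are not the platform's export schema, and throughput is defined here as total output tokens divided by total request time, which may differ from the dashboard's definition.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ItemTiming:
    """Hypothetical per-item record; times are in seconds."""
    request_sent: float
    first_token_at: float
    last_token_at: float
    output_tokens: int


def summarize(items):
    """Aggregate TTFT, total latency, tokens per request, and throughput."""
    ttft = [i.first_token_at - i.request_sent for i in items]
    total = [i.last_token_at - i.request_sent for i in items]
    tokens = [i.output_tokens for i in items]
    return {
        "mean_ttft_s": mean(ttft),
        "mean_total_latency_s": mean(total),
        "tokens_per_request": mean(tokens),
        # Assumed definition: all output tokens over all request time.
        "throughput_tok_per_s": sum(tokens) / sum(total),
    }


summary = summarize([
    ItemTiming(0.0, 0.5, 2.0, 100),
    ItemTiming(0.0, 1.5, 3.0, 200),
])
```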

MCP and programmatic access

Wire evaluation into your development workflow, not just before launch.

  • Trigger evaluation jobs via the MCP interface
  • Integrate quality checks into model registration events
  • API and SDK endpoints for automated pipelines
  • Evaluation as a first-class step in CI/CD
  • Trigger on schedule, on deployment, or from any agent workflow
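As an illustration of evaluation as a CI/CD step, a pipeline could compare a candidate's aggregate scores with a saved baseline and fail the build on a regression. The sketch below does not call any DigitalOcean endpoint. It assumes scores have already been exported as metric-to-score mappings on a 0 to 1 scale, and the `gate` function and its thresholds are hypothetical.

```python
def gate(candidate, baseline, min_score=0.8, max_drop=0.02):
    """Return failure messages; an empty list means the gate passes.

    candidate and baseline map metric name -> aggregate score (assumed 0..1).
    """
    failures = []
    for metric, score in candidate.items():
        if score < min_score:
            failures.append(f"{metric}: {score:.2f} is below the floor {min_score:.2f}")
        previous = baseline.get(metric)
        if previous is not None and previous - score > max_drop:
            failures.append(f"{metric}: dropped from {previous:.2f} to {score:.2f}")
    return failures


failures = gate(
    candidate={"correctness": 0.91, "faithfulness": 0.78},
    baseline={"correctness": 0.95, "faithfulness": 0.79},
)
for message in failures:
    print(message)
```

In a pipeline, a non-empty `failures` list would translate into a non-zero exit code so the deployment step does not run.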

Dataset management

Manage the datasets your evaluations depend on—versioned, reusable, and traceable.

  • Upload CSV or JSONL datasets (up to 1GB or 1,000 rows per dataset)
  • Dataset versioning with lineage tracking across evaluation runs
  • View, reuse, and delete datasets from a single surface
  • Upload via Console or cURL
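Checking a dataset locally before upload can catch format problems early. The following is a rough sketch, not an official validator: the column name `input` is an assumption, the 1 GB limit is interpreted as 1024^3 bytes, and both stated limits are treated as caps that must each be satisfied.

```python
import io
import json

MAX_ROWS = 1_000         # row limit stated on this page
MAX_BYTES = 1024 ** 3    # 1 GB limit, interpreted as GiB (assumption)


def validate_jsonl(stream, required=("input",)):
    """Return (row_count, problems) for a JSONL text stream."""
    problems = []
    rows = 0
    size = 0
    for line_no, line in enumerate(stream, start=1):
        size += len(line.encode("utf-8"))
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {line_no}: expected a JSON object")
            continue
        rows += 1
        for name in required:
            if name not in record:
                problems.append(f"line {line_no}: missing '{name}'")
    if rows > MAX_ROWS:
        problems.append(f"{rows} rows exceeds the {MAX_ROWS}-row limit")
    if size > MAX_BYTES:
        problems.append(f"{size} bytes exceeds the size limit")
    return rows, problems


sample = io.StringIO(
    '{"input": "What is a Droplet?", "ground_truth": "A virtual machine."}\n'
    '{"ground_truth": "No input here."}\n'
)
row_count, issues = validate_jsonl(sample)
```

The same function accepts an open file handle, for example `validate_jsonl(open("dataset.jsonl", encoding="utf-8"))`.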

Built for production evaluation

Go from one-off testing to repeatable, production-aligned evaluation workflows across your inference stack.

Repeatable comparisons

Save evaluation configurations and re-run them consistently as systems evolve. Same judge, same rubric, same dataset—so results are comparable across versions.

Everything in one place

Upload datasets, configure judges, run evaluations, and analyze results in a single platform across your inference stack. No custom pipelines or disconnected tools required.

Measure what you'll deploy

Evaluate against the same inference endpoints used in production. What you measure matches what your users actually experience.

Frequently Asked Questions

What models can I use as candidates?

Any model on Serverless Inference, any model deployed on a Dedicated Inference endpoint, any BYOM model you've imported, and any Inference Router configuration you've created. All candidates run against your actual production endpoints.

What models can I use as the judge?

Evaluations supports a range of judge models, including DeepSeek-V4-Pro and Qwen3-32B. Access to premium commercial models as candidates or judges requires a tier 2 account.

How is evaluation billed?

You pay for inference tokens consumed by the candidate model and the judge model. Dataset and result storage incurs no additional charge for the first 12 months; after that, we reserve the right to bill for storage.

What is the difference between an evaluation preset and a template?

A preset saves your full evaluation configuration (judge model, metrics, system prompt, and model parameters) so you can re-run it consistently across model versions.

Can I trigger evaluations automatically?

Yes. Via the MCP interface, evaluation jobs can be triggered programmatically — on model registration, on a schedule, or as part of a deployment pipeline. API and SDK endpoints are available for CI/CD integration.

What dataset formats are supported?

CSV and JSONL, up to 1GB or 1,000 rows per dataset. Datasets can include an optional ground truth column for faithfulness scoring.

How many concurrent evaluation runs can I run?

Up to 3 concurrent evaluation runs. Limits on datasets (10–100 depending on tier) and custom metrics (50) apply.
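When scripting many runs, a client-side cap keeps submissions within the concurrency limit. This sketch assumes a hypothetical async `start_run` callable that you supply (for example, a wrapper around whichever API or SDK call you use). It only shows the throttling pattern and does not reflect a real SDK function.

```python
import asyncio

MAX_CONCURRENT_RUNS = 3  # limit stated in the answer above


async def run_all(candidates, start_run):
    """Run start_run(candidate) for each candidate, at most 3 at a time."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_RUNS)

    async def guarded(candidate):
        async with semaphore:
            return await start_run(candidate)

    # gather preserves the order of the input candidates in its result list.
    return await asyncio.gather(*(guarded(c) for c in candidates))


async def fake_start_run(candidate):
    """Stand-in for a real evaluation call; sleeps briefly and echoes."""
    await asyncio.sleep(0.01)
    return f"{candidate}: done"


results = asyncio.run(run_all(["model-a", "model-b", "router-1", "model-c"], fake_start_run))
```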

Is my evaluation data used to train models?

No. Your inputs, outputs, and ground truth are sent to the judge model provider for scoring only. They are not stored outside DigitalOcean and are not used to train models.

Resources

Docs

How to run your first evaluation

Blog

Evaluations: Prove Your Routing Policy Actually Works

Articles

API Reference: Evaluations

Docs

Evaluations overview

Evaluate before you ship

Run your first evaluation in minutes. Compare quality and performance on real workloads—without managing infrastructure.

Get started