© 2026 DigitalOcean, LLC.Sitemap.
Product updates

DigitalOcean Evaluations: Production Model and Router Testing for the Inference Stack

By Grace Morgan

  • Updated: July 1, 2026
  • 3 min read

Choosing the right model or inference router for production means more than reading a leaderboard. It means validating any model or routing configuration on your own data using your prompts and your evaluation criteria before it ever reaches production, and comparing quality, latency, and cost in one place.

Evaluations, now available on the DigitalOcean Inference Engine, lets teams validate any model or inference router configuration on their own data before production. Run structured LLM-as-a-Judge evaluations across catalog models, fine-tuned models, BYOM imports, and router setups without stitching together a separate evaluation stack.

DigitalOcean Evaluations Capabilities

Evaluations provides everything teams need to validate model and router performance before production. LLM-as-a-Judge scoring runs across any candidate in your inference stack and returns per-item scores with judge rationale, plus latency, token, and cost tracking per run. Six pre-built metrics cover the most common evaluation needs out of the box. Teams that need full control also get custom rubrics, reusable presets, MCP support, and full dataset management, all on the same platform as the inference endpoints they use in production.

Pre-Built and Custom Rubrics: Score Against Criteria That Match Your Domain

The six pre-built metrics (correctness, completeness, faithfulness, PII, toxicity, and bias) cover common evaluation needs. For specialized domains, custom rubrics let teams define their own judge instructions and scoring criteria directly in the judge prompt.

The judge evaluates responses against these criteria and returns per-item scores with rationale. Custom rubrics can also adapt the built-in correctness metric to different data formats instead of relying on a default interpretation.
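To make the idea of a rubric "in the judge prompt" concrete, here is a minimal sketch of how a team might assemble a judge prompt from a custom rubric and validate the judge's reply. The `Rubric` fields, the 1 to 5 scale, and the JSON reply shape are all assumptions made for illustration; they are not the platform's schema.

```python
import json
from dataclasses import dataclass


@dataclass
class Rubric:
    # Hypothetical fields for illustration, not the platform's schema.
    name: str
    instructions: str
    min_score: int = 1
    max_score: int = 5


def build_judge_prompt(rubric: Rubric, prompt: str, response: str) -> str:
    """Embed the scoring criteria directly in the judge prompt."""
    return (
        f"You are grading a model response for '{rubric.name}'.\n"
        f"Criteria: {rubric.instructions}\n"
        f"Score from {rubric.min_score} to {rubric.max_score}.\n"
        'Reply with JSON: {"score": <int>, "rationale": "<one sentence>"}\n\n'
        f"PROMPT:\n{prompt}\n\nRESPONSE:\n{response}\n"
    )


def parse_judge_reply(rubric: Rubric, reply: str) -> tuple[int, str]:
    """Validate the judge's JSON reply and return (score, rationale)."""
    data = json.loads(reply)
    score = int(data["score"])
    if not rubric.min_score <= score <= rubric.max_score:
        raise ValueError(f"score {score} outside rubric range")
    return score, str(data["rationale"])


rubric = Rubric("clinical tone", "The answer must avoid giving a diagnosis.")
judge_prompt = build_judge_prompt(
    rubric, "Is this rash serious?", "Please see a doctor."
)
# A canned reply stands in for a real judge model call.
score, rationale = parse_judge_reply(
    rubric, '{"score": 4, "rationale": "Defers to a clinician."}'
)
```

Rejecting out-of-range scores up front keeps a malformed judge reply from silently skewing per-item results.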

Evaluation Presets: Save Configurations and Re-Run Without Rebuilding

Without saved configurations, every re-run becomes a rebuild, and small differences in judge model, parameters, or prompts make results hard to compare.

Evaluation presets store the full configuration of a run, including judge model, metrics, system prompt, and parameters, so teams can reuse them across model and router versions and compare results directly across v1, v2, and v3 fine-tunes.
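The reason a preset makes results comparable can be sketched in a few lines: if every run carries the same immutable configuration, ranking candidates is meaningful, and mixing configurations can be refused outright. The `EvalPreset` and `Run` types below are hypothetical and only mirror the fields named above (judge model, metrics, system prompt, parameters); the scores are made-up sample data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvalPreset:
    # Hypothetical shape mirroring the fields a preset is said to store.
    judge_model: str
    metrics: tuple[str, ...]
    system_prompt: str
    parameters: tuple[tuple[str, float], ...] = ()


@dataclass(frozen=True)
class Run:
    preset: EvalPreset
    candidate: str
    scores: tuple[float, ...]


def rank(runs: list[Run]) -> list[tuple[str, float]]:
    """Rank candidates by mean score, refusing to mix presets."""
    if len({run.preset for run in runs}) != 1:
        raise ValueError("runs were scored under different presets")
    ranked = [(run.candidate, round(mean(run.scores), 3)) for run in runs]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


preset = EvalPreset(
    judge_model="Qwen3-32B",
    metrics=("correctness", "faithfulness"),
    system_prompt="You are a strict grader.",
    parameters=(("temperature", 0.0),),
)
# Sample scores for three fine-tune versions, all under one preset.
runs = [
    Run(preset, "v1", (3, 4, 3)),
    Run(preset, "v2", (4, 4, 5)),
    Run(preset, "v3", (4, 5, 5)),
]
ranking = rank(runs)
```

Freezing the dataclass makes the preset hashable, which is what lets `rank` detect a mismatch with a simple set.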

MCP Support: Trigger Evaluations Programmatically

In agentic workflows and CI pipelines, evaluations cannot be a manual step. MCP support enables evaluation jobs to be triggered programmatically from model registration events, deployment triggers, or schedules.

API and SDK endpoints are also available for teams integrating evaluations into deployment workflows.
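Once an evaluation job has been triggered and its per-item scores retrieved, a pipeline still needs a pass or fail decision. The sketch below shows one way to write that gate. It makes no calls to the Evaluations API or SDK, and it assumes scores on a 0 to 1 scale where higher is better; the function name and the thresholds are hypothetical.

```python
from statistics import mean


def gate(
    scores_by_metric: dict[str, list[float]],
    thresholds: dict[str, float],
) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for metric, minimum in thresholds.items():
        scores = scores_by_metric.get(metric)
        if not scores:
            # A missing metric is treated as a failure, not a silent pass.
            failures.append(f"{metric}: no scores returned")
            continue
        average = mean(scores)
        if average < minimum:
            failures.append(f"{metric}: mean {average:.2f} below {minimum:.2f}")
    return failures


# Sample per-item scores standing in for a finished evaluation run.
failures = gate(
    {"correctness": [0.9, 0.8], "completeness": [0.5, 0.6]},
    {"correctness": 0.8, "completeness": 0.7},
)
exit_code = 1 if failures else 0  # a CI step would exit with this code
```

Returning the list of failures rather than a bare boolean lets the pipeline log which metric blocked the deployment.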

Dataset Management: Manage Evaluation Data as a First-Class Resource

Datasets can be uploaded, versioned, reused, and deleted in a single place. Each upload creates a versioned dataset linked to evaluation runs for traceability back to the source data.

Datasets can be uploaded in CSV or JSONL format, up to 1GB or 1,000 rows, via the Console or cURL. Optional ground truth columns can be included to enable faithfulness scoring.
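As a rough illustration of preparing such a file, the sketch below serializes rows to JSONL and checks the 1,000-row limit mentioned above before upload. The column names `prompt` and `ground_truth` are assumptions for this example; the required field names are defined in the product documentation, not here.

```python
import io
import json

MAX_ROWS = 1_000  # row limit quoted in this post


def to_jsonl(rows: list[dict]) -> str:
    """Serialize rows to JSONL: one JSON object per line."""
    if len(rows) > MAX_ROWS:
        raise ValueError(f"{len(rows)} rows exceeds the {MAX_ROWS}-row limit")
    buffer = io.StringIO()
    for index, row in enumerate(rows):
        # "prompt" is a hypothetical required column for this sketch.
        if not row.get("prompt"):
            raise ValueError(f"row {index} is missing a prompt")
        buffer.write(json.dumps(row, ensure_ascii=False) + "\n")
    return buffer.getvalue()


rows = [
    # The optional ground truth column is present on the first row only.
    {"prompt": "What is the capital of France?", "ground_truth": "Paris"},
    {"prompt": "Summarize our refund policy in one sentence."},
]
payload = to_jsonl(rows)
```

Validating locally before upload surfaces a malformed row by index instead of as a failed evaluation run later.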

How to Access Evaluations

Skip the standalone eval tools. Evaluations is natively integrated into your DigitalOcean stack, so you evaluate against the same endpoints you serve on, on infrastructure we run end to end.

Evaluations supports validating any model or inference router in your stack, including models from the DigitalOcean Model Catalog, Dedicated Inference endpoints, BYOM imports from Hugging Face or Spaces, and router configurations. All evaluations run against production-grade endpoints.

Evaluations supports a range of judge models, including DeepSeek-R1-Distill-Llama-70B and Qwen3-32B. Access to premium commercial models (OpenAI and Anthropic) as candidates or judges requires a tier 2 account. You can complete a pre-payment in the Console to move to tier 2 and unlock premium model access.

Billing is based on inference tokens consumed by the candidate and the judge model. Dataset and result storage is provided at no additional charge for the first 12 months.
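Because both the candidate and the judge consume tokens, a run's cost is the sum of two token bills. The sketch below works through that arithmetic. The per-million-token prices are placeholders rather than DigitalOcean pricing, and pricing input and output tokens separately is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Price:
    # USD per one million tokens; the values used below are placeholders.
    input_per_million: float
    output_per_million: float


def cost(usage: TokenUsage, price: Price) -> float:
    """Token bill for one model: tokens times price, scaled per million."""
    return (
        usage.input_tokens * price.input_per_million
        + usage.output_tokens * price.output_per_million
    ) / 1_000_000


def run_cost(
    candidate: TokenUsage,
    candidate_price: Price,
    judge: TokenUsage,
    judge_price: Price,
) -> float:
    """Total cost of an evaluation run: candidate bill plus judge bill."""
    return cost(candidate, candidate_price) + cost(judge, judge_price)


total = run_cost(
    candidate=TokenUsage(input_tokens=400_000, output_tokens=100_000),
    candidate_price=Price(input_per_million=0.50, output_per_million=1.50),
    judge=TokenUsage(input_tokens=600_000, output_tokens=50_000),
    judge_price=Price(input_per_million=1.00, output_per_million=2.00),
)
```

With these placeholder numbers the candidate accounts for about 0.35 USD and the judge for about 0.70 USD, which shows that the judge can dominate the bill when it reads long inputs.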

Your inputs, outputs, and ground truth are sent to the judge model provider for scoring only. They are not stored outside DigitalOcean and not used to train models.

Full documentation, including dataset formatting requirements, preset configuration, and MCP trigger setup, is available in the DigitalOcean product documentation.

Start Evaluating Before You Ship

Model and router decisions don’t stop after launch. Evaluations gives you a repeatable way to validate on your workloads, against your criteria, on the same endpoints your users hit, as your stack evolves. Run your first evaluation in the DigitalOcean Cloud Console today.
