Hermes Agent on DigitalOcean Serverless Inference

Published on May 11, 2026

By James Skelton
AI/ML Technical Content Strategist

Most self-hosted AI agents eventually run into the same problem: the model layer becomes infrastructure glue code. Coding tasks want one model, summarization wants another, vision needs multimodal support, and suddenly your “simple agent” is juggling API keys, routing logic, retries, and provider quirks before it even becomes useful.

Hermes Agent makes this especially obvious by combining coding, reasoning, memory, delegation, tool use, and shell access within a single agent loop. Instead of manually orchestrating multiple providers, DigitalOcean Inference lets Hermes communicate with a single OpenAI-compatible endpoint, while the platform handles model selection and routing behind the scenes.

In this tutorial, we’ll connect Hermes Agent to DigitalOcean Serverless Inference and then use the Inference Router to automatically optimize which models handle different agent workloads, without building custom routing infrastructure yourself.

Key Takeaways

  • One API key, one endpoint, many models. A single DigitalOcean model access key and the fixed base URL https://inference.do-ai.run/v1 gives Hermes Agent access to 70+ models — no per-vendor key juggling or provider-specific code changes required.
  • The Inference Router eliminates manual routing logic. Instead of hardcoding which model handles coding vs. summarization vs. vision, you define a task pool once in the Control Panel and let the router classify and dispatch each request automatically — optimizing for cost, speed, or a manual ranking you set.
  • Auxiliary task overrides are where the real cost savings hide. By default, Hermes routes vision, compression, session search, and web extraction through your main model. Pinning those to cheaper models in the auxiliary: config block can dramatically reduce token spend without sacrificing quality on tasks that don’t need frontier models.

Hermes with DigitalOcean: A Powerful Pairing

Hermes is model-agnostic by design. It supports about 19 first-class providers, plus any OpenAI-compatible custom endpoint like DigitalOcean’s, and you can switch models at any time with no code changes and no lock-in. The agent core handles tool calling, skills, memory, and sub-agent delegation, and it supports 18+ messaging gateways (e.g., Telegram, Discord, Slack, WhatsApp, Signal, Email, Home Assistant). The LLM is just a pluggable backend.

DigitalOcean’s Inference Engine bundles four things behind one OpenAI- and Anthropic-compatible endpoint:

  • Serverless Inference — pay-per-token access to 70+ open-source and frontier models (NVIDIA, DeepSeek, Qwen, MiniMax, Moonshot/Kimi, OpenAI, Anthropic, Mistral, and more) with scale-to-zero pricing.
  • Inference Router — middleware that classifies each prompt and routes it to the best model in a pool you define, optimizing for cost or latency. Public preview, no extra charge during preview; you pay only for the models that actually serve each request.
  • Dedicated Inference — reserved GPU endpoints for steady, high-throughput workloads.
  • Batch Inference — async jobs at up to 50% off real-time pricing.

For a Hermes setup, the two relevant pieces are Serverless Inference (your direct path to specific models) and the Inference Router (your way to get smart per-request model selection without writing routing logic yourself).

The fit:

  • One key, many models. A single DigitalOcean model access key (sk-do-…) gets Hermes access to the entire catalog without juggling per-vendor API keys.
  • Router as the brain. Hermes does heavy tool calling, sub-agent delegation, summarization, and vision — workloads with very different cost profiles. The Inference Router can send your delegate_task calls to a cheap fast model, your main turn to a frontier model, and your vision lookups to a multimodal one — all behind the same endpoint, all transparent.
  • Stable URL. Unlike DigitalOcean’s per-agent Agent Platform endpoints, Serverless Inference has a fixed base URL: https://inference.do-ai.run/v1. Nothing in Hermes’ config needs to change when you swap which router or model is selected on the DigitalOcean side.

Demo Prerequisites

You will need:

  • A DigitalOcean account with access to the Inference product (visible in the Control Panel sidebar under INFERENCE)
  • A model access key in sk-do-… format. Create one in the Control Panel under Inference → Model Access Keys. You can scope a key to specific foundation models, to inference routers, and optionally restrict it to a VPC
  • A working Hermes Agent install. On Linux, macOS, WSL2, or Android (Termux):

    curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

    The installer pulls uv, Python 3.11, Node.js, ripgrep, ffmpeg, and the Hermes source. Native Windows is supported via a PowerShell one-liner but is still early beta — WSL2 is the more battle-tested path.

  • A model with ≥64K context. Hermes refuses to start with anything smaller, because tool schemas plus system prompt plus working memory fill a smaller window before the conversation even begins. Most catalog models on DigitalOcean meet this easily.

The 60-second path: Serverless Inference as Hermes’ main provider

This is the minimum viable setup. You’ll point Hermes’ custom-endpoint provider at DigitalOcean and pick a single model.

Export the key

echo 'export MODEL_ACCESS_KEY="sk-do-..."' >> ~/.zshrc
source ~/.zshrc

Use ~/.bashrc (bash) or ~/.config/fish/config.fish (fish) as appropriate.

Find a model ID

DigitalOcean uses simple model slugs (e.g. llama3.3-70b-instruct, anthropic-claude-4.6-sonnet, openai-gpt-4.1). List what’s available with the key you just exported:

curl -s -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  https://inference.do-ai.run/v1/models \
  | jq '.data[].id'

Pick one. For a Hermes main model, prefer something agentic and ≥64K context — Claude Sonnet/Opus, GPT-4.1, Llama 3.3 70B, DeepSeek V3.2, Qwen 3.5, or MiniMax M2 are all solid choices.

Configure the provider in Hermes

hermes model

In the interactive picker:

  • Choose Custom endpoint (self-hosted / VLLM / etc.) — or, depending on your Hermes version, More providers… → Custom endpoint (enter URL manually).
  • API base URL: https://inference.do-ai.run/v1
  • API key: paste your sk-do-… key.
  • Model name: the model slug you picked, e.g. llama3.3-70b-instruct.
  • Context length: leave blank for auto-detect.

Hermes verifies the endpoint by hitting /v1/models and writes your choice to ~/.hermes/config.yaml.

The resulting model: section looks like this:

model:
  default: llama3.3-70b-instruct
  provider: custom
  base_url: https://inference.do-ai.run/v1
  # api_key is read from the key you entered (stored in ~/.hermes/.env as OPENAI_API_KEY
  # or in the custom-provider credential store, depending on Hermes version)

Confirm everything works with a smoke test

hermes

Then in the chat: Summarize this repo in 3 bullets and show me the main entrypoint. If you get a coherent answer and Hermes’ tool indicators light up, you’re done with the basics.
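
If the smoke test fails instead, take Hermes out of the equation and hit the endpoint directly. A minimal chat completion against the same base URL (using the model slug you configured) separates key or endpoint problems from agent-side configuration problems:

curl -s https://inference.do-ai.run/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -d '{
    "model": "llama3.3-70b-instruct",
    "messages": [{ "role": "user", "content": "Reply with one word: pong" }]
  }' | jq -r '.choices[0].message.content'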

A Step Further: Using the Inference Router as Hermes’ Engine

This is where it gets interesting. Instead of pinning Hermes to a single model, you let DigitalOcean route each request to the best-fit model in a pool.

What an Inference Router actually is

An Inference Router is a policy you configure once in the Control Panel and address by name. Every request to the router is analyzed against your task definitions; the router picks a model from the pool you specified for the matching task, applying your priority (cost, latency, or a manual ranking). If a chosen model rate-limits or 5xx’s, the router falls back to the next candidate automatically, with no dropped calls. Every response carries metadata about which model was selected and which task was detected, so you can see exactly what ran and why. For a running application, this means you can keep tightening cost and model fit as traffic grows without touching application code.

Create a router

In the Control Panel sidebar, select Inference → Inference Router, then click Create Router.

Two ways to start:

  • Default / Recommended router. Click See Default Routers on the Getting Started tab. DigitalOcean ships pre-configured routers for common agentic patterns: writing and content development, software engineering, and document intelligence. Pick one if you want a one-click setup; you can always inspect, duplicate, and edit later.
  • Custom router. Click Create Router to define a pool from scratch.

For a Hermes-shaped workload, a custom router is usually worth the few extra minutes: Hermes mixes coding, summarization, reasoning, planning, and tool dispatch in a single agent loop, and you’ll want explicit routes for each.

When defining a custom router:

  • Name: pick something stable; you’ll reference it from Hermes as router:<name>.
  • Description: this doubles as the routing prompt, so be specific. The router uses it to pick the right task for each request.
  • Tasks: each task gets:
    • a noun-centric name and clear description (the docs explicitly recommend noun-centric descriptors and minimal overlap between tasks for stable routing; see the examples after this list)
    • up to 3 models in the pool
    • a selection policy: cost optimization, speed optimization, or manual ranking
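
For example (wording here is illustrative, not from DigitalOcean’s docs), noun-centric task descriptions might look like:

  • coding: “source code generation, debugging, code review, and refactoring requests”
  • summarization: “condensed rewrites of long conversations, documents, or tool output”
  • general_chat: “casual questions and short conversational replies”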

A reasonable starter task layout for Hermes:

Task             Selection policy     Suggested pool
coding           Manual ranking       Claude Sonnet → GPT-4.1 → DeepSeek V3.2
reasoning        Manual ranking       Claude Opus → GPT-4.1 → Llama 3.3 70B
summarization    Cost optimization    Llama 3.3 70B → DeepSeek V3.2 → Qwen 3.5
general_chat     Speed optimization   Llama 3.3 70B → MiniMax M2 → Qwen 3.5
document_intel   Manual ranking       A multimodal model first (vision-capable)

Once you have finished, save it. The router becomes available under My Routers.

Point Hermes at the router

Using a router is a drop-in replacement for a model call: same endpoint, same auth, just a different value in the model field: router:<router-name> instead of a raw model slug. The simplest way is to re-run hermes model and update only the model name:

hermes model
# Custom endpoint
# API base URL:   https://inference.do-ai.run/v1   (same as before)
# API key:        sk-do-...                         (same as before)
# Model name:     router:my-hermes-router           ← the only change

Or edit ~/.hermes/config.yaml directly:

model:
  default: router:my-hermes-router
  provider: custom
  base_url: https://inference.do-ai.run/v1

Verify with hermes config get model and hermes status. Then start a session.

What you should see at runtime

Every chat completion from the router returns the actually-selected model in the response. With cURL, a router call looks like:

curl --location 'https://inference.do-ai.run/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $MODEL_ACCESS_KEY" \
  --data '{
    "model": "router:my-hermes-router",
    "messages": [
      { "role": "user", "content": "Are there any syntax issues in this Python code?" }
    ]
  }'

Hermes is doing exactly this under the hood; the body of the response will tell you which model actually served the request.
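
If you just want the selection without reading the full body, pipe the same call through jq. The model field is standard in OpenAI-compatible responses; where exactly the router surfaces its task-detection metadata can vary, so dump the raw JSON once to find it:

curl -s --location 'https://inference.do-ai.run/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $MODEL_ACCESS_KEY" \
  --data '{"model": "router:my-hermes-router", "messages": [{"role": "user", "content": "ping"}]}' \
  | jq '.model'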

Session pinning (the X-Model-Affinity header)

Agentic loops are weird for routers: the first turn might look like coding, the second like reasoning about tool output, the third like planning. Each one might get routed to a different model, which sometimes hurts cache hits and continuity.

DigitalOcean fixes this with the X-Model-Affinity header. The first call with a given affinity value runs through normal routing; subsequent calls with the same value skip routing and pin to whichever model the first call selected. Responses include “pinned”: true when the session is locked in.
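
From raw cURL, pinning looks like this (session-42 is an arbitrary example; the value just has to stay constant across the calls you want pinned to the same model):

curl --location 'https://inference.do-ai.run/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $MODEL_ACCESS_KEY" \
  --header 'X-Model-Affinity: session-42' \
  --data '{
    "model": "router:my-hermes-router",
    "messages": [
      { "role": "user", "content": "Continue the plan from the last message." }
    ]
  }'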

Hermes doesn’t expose this header in its UI yet, but you have two practical options:

  • Run a small proxy in front of https://inference.do-ai.run/v1 that injects X-Model-Affinity: <hermes-session-id> based on the session, and point Hermes at the proxy (see the sketch after this list). This is the cleanest path if you care about cache locality across long agent loops.
  • Skip it for now. The router’s per-request fallback already handles reliability, and prompt caching still works within a single turn even without pinning. Revisit if you see cost drift from cache misses.
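
If you go the proxy route, it does not need to be elaborate. Here is a minimal Python sketch that forwards Hermes’ requests upstream and injects the header. Deriving the affinity value from the client address is a stand-in (a real proxy would parse a Hermes session identifier out of the request), and streaming passthrough is omitted for brevity:

import http.server
import urllib.error
import urllib.request

UPSTREAM = "https://inference.do-ai.run"  # no /v1 here; the request path carries it

class AffinityProxy(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(UPSTREAM + self.path, data=body, method="POST")
        # Forward auth and content headers unchanged.
        for name in ("Authorization", "Content-Type"):
            if self.headers.get(name):
                req.add_header(name, self.headers[name])
        # Stand-in affinity key: one pin per client address. Swap in a real
        # per-session identifier if you run multiple Hermes sessions.
        req.add_header("X-Model-Affinity", f"hermes-{self.client_address[0]}")
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            resp = err  # HTTPError carries the upstream status and body too
        payload = resp.read()  # buffers the whole response; SSE streaming not preserved
        self.send_response(resp.getcode())
        self.send_header("Content-Type", resp.headers.get("Content-Type", "application/json"))
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    http.server.HTTPServer(("127.0.0.1", 8181), AffinityProxy).serve_forever()

Point Hermes’ API base URL at http://127.0.0.1:8181/v1 and leave everything else unchanged.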

A Full Hermes Config for a DigitalOcean-Only Setup

Below is a fully-populated ~/.hermes/config.yaml skeleton that uses DigitalOcean for everything — main model via the router, auxiliary tasks via cheaper models, and a fallback for when the router 5xx’s.

model:
  default: router:my-hermes-router
  provider: custom
  base_url: https://inference.do-ai.run/v1

# Auxiliary tasks: vision, web extraction, compression, session search, skills, MCP.
# Setting base_url here bypasses provider resolution and sends directly to DO.
auxiliary:
  vision:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}    # or hard-code; env-var interpolation depends on version
    model: openai-gpt-4.1           # any multimodal model in the DO catalog
  compression:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}
    model: llama3.3-70b-instruct    # cheap + ≥64K context (required for compression)
  session_search:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}
    model: llama3.3-70b-instruct
    max_concurrency: 2              # avoid 429 bursts on summarization fan-out
  web_extract:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}
    model: llama3.3-70b-instruct
  skills_hub:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}
    model: llama3.3-70b-instruct
  mcp:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}
    model: llama3.3-70b-instruct

# Subagent delegation — overrides what delegate_task uses
delegation:
  base_url: https://inference.do-ai.run/v1
  api_key: ${MODEL_ACCESS_KEY}
  model: llama3.3-70b-instruct      # cheaper than the main router for one-shot subtasks

# Fallback if the main provider returns 429/5xx after retries
fallback_model:
  provider: custom
  model: llama3.3-70b-instruct
  base_url: https://inference.do-ai.run/v1
  key_env: MODEL_ACCESS_KEY         # name of the env var holding the API key

A few notes on this:

  • Auxiliary model overrides matter. Without them, Hermes routes vision, compression, web extraction, MCP, and session-search calls back through your main model. With a router-as-main config, that means every summary of a giant conversation gets a routing decision — possibly to an expensive model. Pinning these to a cheap specific model is the right default.
  • The compression model must have ≥ your main model’s context length, because it receives the full middle section of the conversation to summarize. Don’t pick a 32K model for compression if your main is 200K.
  • Fallbacks activate on rate limits (429), server errors (5xx), and auth failures (401/403). Pointing a fallback at the same provider with a specific model means: “if the router itself ever fails, fall back to a known-good model on the same endpoint.” If you want true cross-vendor resilience, point your fallback at OpenRouter or Nous Portal instead.
  • API key precedence quirk. Hermes scopes provider env vars (OPENROUTER_API_KEY, etc.) to their own base URLs, but OPENAI_API_KEY is the fallback for custom endpoints. If you also have OpenRouter or other providers configured, make sure MODEL_ACCESS_KEY (your DO key) is what your custom-endpoint provider actually picks up — hermes doctor and hermes status will tell you which credential is in use.

Hermes Features and Capabilities

A few things worth knowing before you push this into a long-running setup.

  • Tool calling. Hermes depends heavily on tool/function calling. OpenAI and Anthropic models on DigitalOcean support this reliably, as do most large open-source models in the catalog. If tool calls suddenly stop working after a router change, check which model the router selected.
  • Built-in tools. DigitalOcean’s Chat Completions API supports server-side tools including RAG (knowledge_base_retrieval), MCP integration, and web search. Hermes does not expose these directly by default, but they can be injected through custom plugins.
  • Prompt caching. Supported models automatically cache repeated prompt prefixes, reducing cost for Hermes’ large recurring system prompts and skill definitions.
  • Reasoning controls. Reasoning-capable models support configurable reasoning effort levels, and Hermes’ agent.reasoning_effort setting maps directly onto them (see the snippet after this list).
  • Privacy and observability. Open-weight model traffic is not logged or used for training. DigitalOcean also exposes detailed metrics for latency, throughput, errors, token usage, and router decisions through the Control Panel.
  • Pricing. Serverless Inference is pay-per-token, and the Inference Router currently adds no additional cost during public preview. In practice, this lets Hermes route most requests to cheaper models while reserving frontier models for harder tasks.
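
For the reasoning-controls point above, a minimal sketch of the mapping (the exact key layout may differ across Hermes versions) is a one-line change in ~/.hermes/config.yaml:

agent:
  reasoning_effort: high    # passed through to reasoning-capable models on the endpoint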

Quick Troubleshooting Tips

If you receive any of these errors, this section outlines how to resolve them.

httpx.ReadError: Connection reset by peer immediately on first chat. Almost always a base-URL problem. Hermes expects the OpenAI-style URL with the /v1 suffix: https://inference.do-ai.run/v1, not https://inference.do-ai.run. Re-run hermes model and re-enter the URL with /v1.

401 Unauthorized from inference.do-ai.run. Your “sk-do-…” key is missing, expired, or scoped too narrowly. If you scoped the key to specific models, make sure either the model you’re using or the router you’re calling is in that scope. Check with:

curl -s -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  https://inference.do-ai.run/v1/models | jq '.data[].id'

Hermes complains about context length at startup. Pick a model ≥64K context. The catalog shows context windows next to each model.

Tool calls silently dropped. The router likely picked a model that doesn’t support tools for that request. Either pin a tool-capable model for that task in the router definition, or use manual ranking with a tool-capable model first.

Wrong API key being sent. Run hermes doctor. If you have OpenRouter or another provider configured alongside, Hermes’ key-scoping logic should send OPENROUTER_API_KEY only to OpenRouter and OPENAI_API_KEY (or whatever you stored for the custom endpoint) to DigitalOcean. If you see leakage, explicitly set api_key on the custom-endpoint config rather than relying on env-var fallback.
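
Concretely, that explicit override looks like this in ~/.hermes/config.yaml (same hedge as earlier: whether api_key and env-var interpolation are supported at this level depends on your Hermes version):

model:
  default: router:my-hermes-router
  provider: custom
  base_url: https://inference.do-ai.run/v1
  api_key: ${MODEL_ACCESS_KEY}    # explicit key wins over the OPENAI_API_KEY fallback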

Costs higher than expected. Check Serverless Inference Metrics → cost attribution by model. If a Hermes auxiliary task — usually compression or session search — is hitting an expensive model, override that task in auxiliary: with a cheap one, as shown in the DigitalOcean-only config above.

Closing Thoughts

Hermes Agent and DigitalOcean Serverless Inference make a natural pair: Hermes brings the agent harness — tool calling, persistent memory, skills, sub-agent delegation, and a dozen-plus messaging gateways — while DigitalOcean brings the model layer, with one API key, one stable URL, and a router that quietly picks the right model for each turn of the loop. The result is a self-hosted, multi-provider personal agent where you spend your time on what the agent should do, not on stitching providers together or hand-tuning which model handles which task.

Start with the 60-second setup against a single model to confirm everything works, then graduate to the Inference Router once you’ve seen where your real token spend is going — that’s usually when the cost story gets genuinely interesting, with frontier-quality outputs at open-source prices and the routing logic free during public preview. From there, Hermes’ auxiliary, delegation, and fallback hooks let you tune the system as deeply as you want, all behind the same endpoint. Build something, watch the metrics, and iterate.
