
Most self-hosted AI agents eventually run into the same problem: the model layer becomes infrastructure glue code. Coding tasks want one model, summarization wants another, vision needs multimodal support, and suddenly your “simple agent” is juggling API keys, routing logic, retries, and provider quirks before it even becomes useful.
Hermes Agent makes this especially obvious by combining coding, reasoning, memory, delegation, tool use, and shell access within a single agent loop. Instead of manually orchestrating multiple providers, DigitalOcean Inference lets Hermes communicate with a single OpenAI-compatible endpoint, while the platform handles model selection and routing behind the scenes.
In this tutorial, we’ll connect Hermes Agent to DigitalOcean Serverless Inference and then use the Inference Router to automatically optimize which models handle different agent workloads - without building custom routing infrastructure yourself.
A single base URL, https://inference.do-ai.run/v1, gives Hermes Agent access to 70+ models with no per-vendor key juggling and no provider-specific code changes. Hermes is model-agnostic by design: it supports about 19 first-class providers plus any OpenAI-compatible custom endpoint like DigitalOcean’s, and you can switch models at any time with the hermes model command, with no code changes and no lock-in. The agent core handles tool calling, skills, memory, and sub-agent delegation, and it supports 18+ messaging gateways (e.g., Telegram, Discord, Slack, WhatsApp, Signal, Email, Home Assistant). The LLM is just a pluggable backend.
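Because the endpoint speaks the OpenAI wire protocol, any OpenAI-compatible client can hit it directly. A minimal sketch with the openai Python package (it assumes your sk-do-… key is exported as MODEL_ACCESS_KEY, which the setup below walks through, and uses llama3.3-70b-instruct purely as an example slug):

import os
from openai import OpenAI

# One client, one key, any of the 70+ catalog models by slug.
client = OpenAI(
    api_key=os.environ["MODEL_ACCESS_KEY"],      # your sk-do-... model access key
    base_url="https://inference.do-ai.run/v1",   # DigitalOcean's OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="llama3.3-70b-instruct",               # swap in any catalog slug, or a router name later
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)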
DigitalOcean’s Inference Engine bundles four capabilities behind one OpenAI- and Anthropic-compatible endpoint.
For a Hermes setup, the two relevant pieces are Serverless Inference (your direct path to specific models) and the Inference Router (your way to get smart per-request model selection without writing routing logic yourself).
The fit: Hermes only ever talks to one base URL, https://inference.do-ai.run/v1, so nothing in Hermes’ config needs to change when you swap which router or model is selected on the DigitalOcean side.
To follow along, you will need a DigitalOcean model access key (an sk-do-… key for Serverless Inference), a Linux/macOS machine or WSL2, and Hermes Agent itself, installed with the one-liner below:
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
The installer pulls uv, Python 3.11, Node.js, ripgrep, ffmpeg, and the Hermes source. Native Windows is supported via a PowerShell one-liner but is still in early beta — WSL2 is the more battle-tested path.
You’ll also need a model with ≥64K context. Hermes refuses to start with anything smaller, because tool schemas plus system prompt plus working memory fill a smaller window before the conversation even begins. Most catalog models on DigitalOcean meet this easily.
This is the minimum viable setup. You’ll point Hermes’ custom-endpoint provider at DigitalOcean and pick a single model. Start by exporting your model access key:
echo 'export MODEL_ACCESS_KEY="sk-do-..."' >> ~/.zshrc
source ~/.zshrc
Use ~/.bashrc (bash) or ~/.config/fish/config.fish (fish) as appropriate.
DigitalOcean uses simple model slugs (e.g. llama3.3-70b-instruct, anthropic-claude-4.6-sonnet, openai-gpt-4.1). List what’s available with the key you just exported:
curl -s -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
https://inference.do-ai.run/v1/models \
| jq '.data[].id'
Pick one. For a Hermes main model, prefer something agentic and ≥64K context — Claude Sonnet/Opus, GPT-4.1, Llama 3.3 70B, DeepSeek V3.2, Qwen 3.5, or MiniMax M2 are all solid choices.
Now run the interactive model picker:
hermes model
In the interactive picker, choose Custom endpoint, enter the API base URL https://inference.do-ai.run/v1, paste your sk-do-… key, and enter the model name you picked. Hermes will verify the endpoint by hitting /v1/models and write the choice to ~/.hermes/config.yaml.
The resulting model: section looks like this:
model:
  default: llama3.3-70b-instruct
  provider: custom
  base_url: https://inference.do-ai.run/v1
  # api_key is read from the key you entered (stored in ~/.hermes/.env as OPENAI_API_KEY,
  # or in the custom-provider credential store, depending on Hermes version)
Start a session:
hermes
Then in the chat, try: “Summarize this repo in 3 bullets and show me the main entrypoint.” If you get a coherent answer and Hermes’ tool indicators light up, you’re done with the basics.
This is where it gets interesting. Instead of pinning Hermes to a single model, you let DigitalOcean route each request to the best-fit model in a pool.
An Inference Router is a policy you configure once in the Control Panel and address by name. Every request to the router is analyzed against your task definitions; the router picks a model from the pool you specified for the matching task, applying your priority (cost, latency, or a manual ranking). If a chosen model rate-limits or returns 5xx errors, the router falls back to the next candidate automatically, with no dropped calls. Every response carries metadata about which model was selected and which task was detected, so you can see exactly what ran and why. In practice, this means you can keep squeezing cost out of the system as usage scales and match each kind of request to a model that suits it, without building any of that logic into the agent.
In the Control Panel, select Inference → Inference Router in the left-hand sidebar, then click Create Router in the activity screen.
There are two ways to start: accept a quick default configuration, or define a custom router. For a Hermes-shaped workload, the custom router is usually worth the few extra minutes: Hermes mixes coding, summarization, reasoning, planning, and tool dispatch in a single agent loop, and you’ll want explicit routes for each.
When defining a custom router, you give it a name (you’ll call it later as router:<name>), define the tasks you expect the agent to generate, and for each task pick a selection policy and a pool of candidate models. A reasonable starter task layout for Hermes:
| Task | Selection policy | Suggested pool |
|---|---|---|
| coding | Manual ranking | Claude Sonnet → GPT-4.1 → DeepSeek V3.2 |
| reasoning | Manual ranking | Claude Opus → GPT-4.1 → Llama 3.3 70B |
| summarization | Cost optimization | Llama 3.3 70B → DeepSeek V3.2 → Qwen 3.5 |
| general_chat | Speed optimization | Llama 3.3 70B → MiniMax M2 → Qwen 3.5 |
| document_intel | Manual ranking | A multimodal model first (vision-capable) |
Once you have finished, save it. The router becomes available under My Routers.
Using a router is a drop-in replacement for a model call: same endpoint, same auth, just a different value in the model field: router:<router-name> instead of a raw model slug.
The simplest way is to re-run hermes model and update only the model name:
hermes model
# Custom endpoint
# API base URL: https://inference.do-ai.run/v1 (same as before)
# API key: sk-do-... (same as before)
# Model name: router:my-hermes-router ← the only change
Or edit ~/.hermes/config.yaml directly:
model:
  default: router:my-hermes-router
  provider: custom
  base_url: https://inference.do-ai.run/v1
Verify with hermes config get model and hermes status. Then start a session.
Every chat completion from the router returns the actually-selected model in the response. With cURL, a router call looks like:
curl --location 'https://inference.do-ai.run/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $MODEL_ACCESS_KEY" \
  --data '{
    "model": "router:my-hermes-router",
    "messages": [
      { "role": "user", "content": "Are there any syntax issues in this Python code?" }
    ]
  }'
Hermes is doing exactly this under the hood; the body of the response will tell you which model actually served the request.
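If you’d rather check this from a script than from cURL, the same call in Python (a sketch assuming the httpx package and a router named my-hermes-router) prints which model actually answered:

import os
import httpx

resp = httpx.post(
    "https://inference.do-ai.run/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MODEL_ACCESS_KEY']}"},
    json={
        "model": "router:my-hermes-router",
        "messages": [{"role": "user", "content": "Are there any syntax issues in this Python code?"}],
    },
    timeout=60,
)
data = resp.json()
# The OpenAI-compatible response reports the concrete model the router selected.
print("served by:", data["model"])
print(data["choices"][0]["message"]["content"])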
Agentic loops are weird for routers - the first turn might look like coding, the second like reasoning about tool output, the third like planning. Each one might get routed to a different model, which sometimes hurts cache hits and continuity.
DigitalOcean fixes this with the X-Model-Affinity header. The first call with a given affinity value runs through normal routing; subsequent calls with the same value skip routing and pin to whichever model the first call selected. Responses include "pinned": true when the session is locked in.
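To watch pinning happen outside Hermes, send two requests with the same affinity value and compare the reported model. A sketch, assuming the httpx package and the router from earlier (the affinity value is any stable string you choose):

import os
import httpx

HEADERS = {
    "Authorization": f"Bearer {os.environ['MODEL_ACCESS_KEY']}",
    "X-Model-Affinity": "demo-session-42",   # any stable identifier for the session
}

def ask(prompt: str) -> dict:
    return httpx.post(
        "https://inference.do-ai.run/v1/chat/completions",
        headers=HEADERS,
        json={"model": "router:my-hermes-router",
              "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    ).json()

first = ask("Refactor this function for readability.")        # routed normally
second = ask("Now explain the trade-offs of that refactor.")  # should pin to the same model
print(first["model"], second["model"])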
Hermes doesn’t expose this header in its UI yet. The practical workaround is to run a thin proxy in front of https://inference.do-ai.run/v1 that injects X-Model-Affinity: <hermes-session-id> based on the session, and point Hermes at the proxy; this is the cleanest path if you care about cache locality across long agent loops, and a sketch follows below.
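Here is a minimal sketch of such a proxy, assuming FastAPI, httpx, and uvicorn are installed; the X-Hermes-Session header name is a placeholder for however your setup identifies a session, and responses are buffered rather than streamed:

import os
import httpx
from fastapi import FastAPI, Request, Response

UPSTREAM = "https://inference.do-ai.run/v1"
app = FastAPI()

@app.get("/v1/models")
async def models(request: Request):
    # Pass the model list through so Hermes' endpoint verification still works.
    async with httpx.AsyncClient(timeout=30) as client:
        upstream = await client.get(
            f"{UPSTREAM}/models",
            headers={"Authorization": request.headers.get("Authorization", "")},
        )
    return Response(content=upstream.content, status_code=upstream.status_code,
                    media_type="application/json")

@app.post("/v1/chat/completions")
async def chat(request: Request):
    body = await request.body()
    headers = {
        "Authorization": request.headers.get(
            "Authorization", f"Bearer {os.environ.get('MODEL_ACCESS_KEY', '')}"),
        "Content-Type": "application/json",
        # Same affinity value for every turn of a session -> the router pins the model.
        "X-Model-Affinity": request.headers.get("X-Hermes-Session", "hermes-default"),
    }
    async with httpx.AsyncClient(timeout=120) as client:
        upstream = await client.post(f"{UPSTREAM}/chat/completions",
                                     content=body, headers=headers)
    return Response(content=upstream.content, status_code=upstream.status_code,
                    media_type=upstream.headers.get("content-type", "application/json"))

# Run with: uvicorn affinity_proxy:app --port 8800
# then set Hermes' base_url to http://localhost:8800/v1

This is deliberately dumb: it buffers responses and only proxies the two routes Hermes needs, which is enough to test whether affinity improves continuity for your workload.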
Below is a fully-populated ~/.hermes/config.yaml skeleton that uses DigitalOcean for everything — main model via the router, auxiliary tasks via cheaper models, and a fallback for when the router 5xx’s.
model:
  default: router:my-hermes-router
  provider: custom
  base_url: https://inference.do-ai.run/v1

# Auxiliary tasks: vision, web extraction, compression, session search, skills, MCP.
# Setting base_url here bypasses provider resolution and sends directly to DO.
auxiliary:
  vision:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}    # or hard-code; env-var interpolation depends on version
    model: openai-gpt-4.1           # any multimodal model in the DO catalog
  compression:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}
    model: llama3.3-70b-instruct    # cheap + ≥64K context (required for compression)
  session_search:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}
    model: llama3.3-70b-instruct
    max_concurrency: 2              # avoid 429 bursts on summarization fan-out
  web_extract:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}
    model: llama3.3-70b-instruct
  skills_hub:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}
    model: llama3.3-70b-instruct
  mcp:
    base_url: https://inference.do-ai.run/v1
    api_key: ${MODEL_ACCESS_KEY}
    model: llama3.3-70b-instruct

# Subagent delegation — overrides what delegate_task uses
delegation:
  base_url: https://inference.do-ai.run/v1
  api_key: ${MODEL_ACCESS_KEY}
  model: llama3.3-70b-instruct      # cheaper than the main router for one-shot subtasks

# Fallback if the main provider returns 429/5xx after retries
fallback_model:
  provider: custom
  model: llama3.3-70b-instruct
  base_url: https://inference.do-ai.run/v1
  key_env: MODEL_ACCESS_KEY         # name of the env var holding the API key
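Before pointing a long-running Hermes at this file, a quick sanity check catches indentation or base-URL typos. A sketch assuming PyYAML is installed and the structure above:

import os
import yaml

CONFIG = os.path.expanduser("~/.hermes/config.yaml")

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

def check(section: dict, name: str) -> None:
    # Every DO-bound entry should point at the /v1 endpoint.
    url = section.get("base_url", "")
    assert url.rstrip("/").endswith("/v1"), f"{name}: base_url should end in /v1, got {url!r}"

check(cfg["model"], "model")
for task, spec in (cfg.get("auxiliary") or {}).items():
    check(spec, f"auxiliary.{task}")
if "delegation" in cfg:
    check(cfg["delegation"], "delegation")
if "fallback_model" in cfg:
    check(cfg["fallback_model"], "fallback_model")

assert os.environ.get("MODEL_ACCESS_KEY"), "MODEL_ACCESS_KEY is not exported in this shell"
print("config looks sane")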
A few operational things are worth knowing before you push this into a long-running setup. If you hit any of the following errors, here is how to resolve them.
httpx.ReadError: Connection reset by peer immediately on first chat. Almost always a base-URL problem. Hermes expects the OpenAI-style URL with the /v1 suffix: https://inference.do-ai.run/v1, not https://inference.do-ai.run. Re-run hermes model and re-enter the URL with /v1.
401 Unauthorized from inference.do-ai.run. Your “sk-do-…” key is missing, expired, or scoped too narrowly. If you scoped the key to specific models, make sure either the model you’re using or the router you’re calling is in that scope. Check with:
curl -s -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
https://inference.do-ai.run/v1/models | jq '.data[].id'
Hermes complains about context length at startup. Pick a model ≥64K context. The catalog shows context windows next to each model.
Tool calls silently dropped. The router likely picked a model that doesn’t support tools for that request. Either pin a tool-capable model for that task in the router definition, or use manual ranking with a tool-capable model first.
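To check a route before blaming Hermes, you can send a minimal tool-call request through the router yourself and see whether the selected model returns tool_calls. A sketch assuming the httpx package, with a throwaway get_weather tool:

import os
import httpx

tool = {
    "type": "function",
    "function": {
        "name": "get_weather",                 # throwaway example tool
        "description": "Get the weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

resp = httpx.post(
    "https://inference.do-ai.run/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MODEL_ACCESS_KEY']}"},
    json={
        "model": "router:my-hermes-router",
        "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
        "tools": [tool],
    },
    timeout=60,
)
data = resp.json()
msg = data["choices"][0]["message"]
print("served by:", data["model"])
print("tool_calls:", msg.get("tool_calls"))    # None or empty means this model ignored the tool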
Wrong API key being sent. Run hermes doctor. If you have OpenRouter or another provider configured alongside, Hermes’ key-scoping logic should send OPENROUTER_API_KEY only to OpenRouter and OPENAI_API_KEY (or whatever you stored for the custom endpoint) to DigitalOcean. If you see leakage, explicitly set api_key on the custom-endpoint config rather than relying on env-var fallback.
Costs higher than expected. Check Serverless Inference Metrics → cost attribution by model. If a Hermes auxiliary task — usually compression or session search — is hitting an expensive model, override that task in auxiliary: with a cheap one (see the config skeleton above).
Hermes Agent and DigitalOcean Serverless Inference make a natural pair: Hermes brings the agent harness — tool calling, persistent memory, skills, sub-agent delegation, and a dozen-plus messaging gateways — while DigitalOcean brings the model layer, with one API key, one stable URL, and a router that quietly picks the right model for each turn of the loop. The result is a self-hosted, multi-provider personal agent where you spend your time on what the agent should do, not on stitching providers together or hand-tuning which model handles which task.

Start with the 60-second setup against a single model to confirm everything works, then graduate to the Inference Router once you’ve seen where your real token spend is going — that’s usually when the cost story gets genuinely interesting, with frontier-quality outputs at open-source prices and the router itself free to use during public preview. From there, Hermes’ auxiliary, delegation, and fallback hooks let you tune the system as deeply as you want, all behind the same endpoint. Build something, watch the metrics, and iterate.