By Amit Jotwani
DigitalOcean recently launched Serverless Inference - a hosted API that lets you run large language models like Claude, GPT-4o, LLaMA 3, and Mistral, without managing infrastructure.
It’s built to be simple: no GPU provisioning, no server configuration, no scaling logic to manage. You just send a request to an endpoint.
The API works with models from Anthropic, OpenAI, Meta, Mistral, DeepSeek, and others - all available behind a single base URL.
The best part is that it’s API-compatible with OpenAI.
That means if you’re already using tools like the OpenAI Python SDK, LangChain, or LlamaIndex with any of the supported models, they’ll just work. You can swap the backend without rewriting your application.
This same approach works with DigitalOcean Agents too — models that are connected to your own documents or knowledge base.
There are a few reasons this approach stands out—especially if you’re already building on DigitalOcean or want more flexibility in how you use large language models.
Under the hood, everything runs through the same OpenAI method: `client.chat.completions.create()`. The only thing you need to change is the `base_url` when you initialize the client:

- Inference: `https://inference.do-ai.run/v1/`
- Agents: `https://your-agent-id.agents.do-ai.run/api/v1/`
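Side by side, the swap looks something like this (a minimal sketch - the agent subdomain is a placeholder, and the environment variable names are simply the ones used in the examples below):

```python
from openai import OpenAI
import os

# Same SDK, same method - only the base_url changes.

# General-purpose Serverless Inference
inference_client = OpenAI(
    base_url="https://inference.do-ai.run/v1/",
    api_key=os.getenv("DIGITAL_OCEAN_MODEL_ACCESS_KEY")
)

# A DigitalOcean Agent (replace the subdomain with your agent's)
agent_client = OpenAI(
    base_url="https://your-agent-id.agents.do-ai.run/api/v1/",
    api_key=os.getenv("AJOT_AGENT_KEY")
)
```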
That’s it - let’s walk through how this looks in practice.
Run the following command to install the OpenAI SDK and python-dotenv on your machine:

```bash
pip install openai python-dotenv
```
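The examples below read the API key from environment variables via python-dotenv. Assuming you keep your keys in a local .env file (the variable names match what the code reads), it might look like this:

```
# .env - keep this file out of version control
DIGITAL_OCEAN_MODEL_ACCESS_KEY=your-model-access-key-here
# The Agent examples later read a separate key:
AJOT_AGENT_KEY=your-agent-access-key-here
```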
This lists all available models from DigitalOcean Inference. It’s the same code you’d use with OpenAI—you’re just pointing it at a DigitalOcean endpoint via `base_url`.
```python
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI(
    base_url="https://inference.do-ai.run/v1/",  # DO's Inference endpoint
    api_key=os.getenv("DIGITAL_OCEAN_MODEL_ACCESS_KEY")
)

# List all available models
try:
    models = client.models.list()
    print("Available models:")
    for model in models.data:
        print(f"- {model.id}")
except Exception as e:
    print(f"Error listing models: {e}")
```
We’re using the same `.chat.completions.create()` method. Only the `base_url` is different.
```python
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI(
    base_url="https://inference.do-ai.run/v1/",  # DO's Inference endpoint
    api_key=os.getenv("DIGITAL_OCEAN_MODEL_ACCESS_KEY")
)

# Run a simple chat completion
try:
    response = client.chat.completions.create(
        model="llama3-8b-instruct",  # Swap in any supported model
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a fun fact about octopuses."}
        ]
    )
    print(response.choices[0].message.content)
except Exception as e:
    print(f"Error during completion: {e}")
```
Want a response that streams back token by token? Just add `stream=True` and loop through the chunks.
```python
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI(
    base_url="https://inference.do-ai.run/v1/",  # DO's Inference endpoint
    api_key=os.getenv("DIGITAL_OCEAN_MODEL_ACCESS_KEY")
)

# Run a simple chat completion with streaming
try:
    stream = client.chat.completions.create(
        model="llama3-8b-instruct",  # Swap in any supported model
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a fun fact about octopuses."}
        ],
        stream=True
    )
    for event in stream:
        if event.choices[0].delta.content is not None:
            print(event.choices[0].delta.content, end='', flush=True)
    print()  # Add a newline at the end
except Exception as e:
    print(f"Error during completion: {e}")
```
DigitalOcean also offers Agents - these are LLMs paired with a custom knowledge base. You can upload docs, add URLs, include structured content, or connect a DigitalOcean Spaces bucket (or even an Amazon S3 bucket) as a data source. The agent will then respond with that context in mind. It’s great for internal tools, documentation bots, or domain-specific assistants.
You still use `.chat.completions.create()` — the only difference is the `base_url`. But now your responses are grounded in your own data.
Note: With Inference, the base URL is fixed. With Agents, it’s unique to your agent, and you append /api/v1.
```python
client = OpenAI(
    base_url="https://your-agent-id.agents.do-ai.run/api/v1/",
    api_key=os.getenv("DIGITAL_OCEAN_MODEL_ACCESS_KEY")
)
```
This is a standard Agent request using `.chat.completions.create()` - the same method as before. The only real change is the `base_url`, which points to your Agent’s unique endpoint (plus `/api/v1`), as noted above.
Here we’ve also added `include_retrieval_info=True` to the request body. This tells the API to return extra metadata about what the Agent pulled from your knowledge base to generate its response.
```python
from openai import OpenAI
from dotenv import load_dotenv
import os
import json

load_dotenv()

try:
    # Create a new client with the agents endpoint
    agents_client = OpenAI(
        base_url="https://rrp247s4dgv4xoexd2sk62yq.agents.do-ai.run/api/v1/",
        api_key=os.getenv("AJOT_AGENT_KEY")
    )

    # Try a simple text request with the agents endpoint
    response = agents_client.chat.completions.create(
        model="openai-gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Hello! Who is Amit?"
        }],
        extra_body={"include_retrieval_info": True}
    )

    print(f"\nAgents endpoint response: {response.choices[0].message.content}")

    # Inspect the retrieval metadata the Agent returned
    response_dict = response.to_dict()
    print("\nFull retrieval object:")
    print(json.dumps(response_dict["retrieval"], indent=2))
except Exception as e:
    print(f"Error with agents endpoint: {e}")
```
The only change here is that we’ve enabled `stream=True` to get the response back as it generates. Everything else is the same.
```python
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

try:
    # Create a new client with the agents endpoint
    agents_client = OpenAI(
        base_url="https://rrp247s4dgv4xoexd2sk62yq.agents.do-ai.run/api/v1/",
        api_key=os.getenv("AJOT_AGENT_KEY")
    )

    # Run the same Agent request, streaming the response
    stream = agents_client.chat.completions.create(
        model="openai-gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Hello! Who is Amit?"
        }],
        extra_body={"include_retrieval_info": True},
        stream=True,
    )

    for event in stream:
        if event.choices[0].delta.content is not None:
            print(event.choices[0].delta.content, end='', flush=True)
    print()  # Add a newline at the end
except Exception as e:
    print(f"Error with agents endpoint: {e}")
```
To recap:

- Use `https://inference.do-ai.run/v1/` for general-purpose models like LLaMA 3, GPT-4o, Claude, etc.
- Use your Agent’s unique endpoint (plus `/api/v1`) to connect to your own docs or knowledge base.
- Everything goes through `.chat.completions.create()` — no new methods to learn.
- Stream responses with `stream=True`, and get retrieval info with `include_retrieval_info=True`.

This makes it easy to test multiple models, switch backends, or ground a model in your own content—all without changing your existing code.
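To illustrate that last point, here’s a minimal sketch (the `ask` helper and model choices are just illustrative) that sends the same request to either backend by changing nothing but the `base_url` and key:

```python
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

def ask(base_url: str, api_key_env: str, model: str, prompt: str) -> str:
    """Send the same chat completion request to any OpenAI-compatible endpoint."""
    client = OpenAI(base_url=base_url, api_key=os.getenv(api_key_env))
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# General-purpose Inference endpoint
print(ask("https://inference.do-ai.run/v1/", "DIGITAL_OCEAN_MODEL_ACCESS_KEY",
          "llama3-8b-instruct", "Tell me a fun fact about octopuses."))

# An Agent endpoint (placeholder subdomain), grounded in your own data
print(ask("https://your-agent-id.agents.do-ai.run/api/v1/", "AJOT_AGENT_KEY",
          "openai-gpt-4o-mini", "Who is Amit?"))
```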