
AI infrastructure is no longer a single, uniform stack; it has evolved into distinct, purpose-built layers, each designed to simplify or control a different part of the AI lifecycle. The market now breaks into at least three tiers: abstraction-first platforms like Modal and Together AI, which hide infrastructure complexity entirely and let developers focus on models and deployment; general cloud providers like DigitalOcean, which started from an infrastructure foundation but are rapidly converging into full-stack AI platforms spanning compute, inference, data, and agents; and GPU-focused neoclouds like CoreWeave and Vultr, which remain true infrastructure-first providers offering raw compute ownership.
For example, instead of setting up a GPU server, installing frameworks, loading model weights, and building an API, a developer can simply call a function or API to generate text. This dramatically reduces the time required to build and test ideas, making abstraction-first platforms especially effective for prototyping and experimentation.
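To make that concrete, here is a minimal sketch of what such a call can look like, assuming a hypothetical OpenAI-compatible inference endpoint; the URL, model name, and API key are placeholders, not any specific provider’s values:

```python
import requests

# Hypothetical OpenAI-compatible chat completions endpoint; the URL, model
# name, and API key below are placeholders used purely for illustration.
API_URL = "https://api.example-inference.com/v1/chat/completions"
API_KEY = "YOUR_API_KEY"

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "example-8b-instruct",
        "messages": [{"role": "user", "content": "Summarize what a vector database does."}],
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```

No GPU provisioning, no framework installation, no weights to manage: the entire serving stack sits behind a single HTTP call.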
However, as AI systems become more advanced, the limitations of abstraction begin to surface. Modern applications are no longer just single inference calls; they often include retrieval pipelines (RAG), persistent memory, document storage, and orchestration across multiple components. For instance, a customer support chatbot might need to fetch data from a vector database, combine it with user history, and generate responses in real time. These systems require coordination between compute, storage, and networking, which abstraction layers typically hide or restrict. In contrast, infrastructure-first platforms provide direct control over these components, allowing teams to design systems with predictable performance and tighter integration, at the cost of greater setup effort.
Before we look at each platform in detail, there’s one concept worth getting clear on, because it quietly drives almost every architectural decision in modern AI systems.
What does stateless mean? A stateless system doesn’t remember anything between requests. Every call arrives fresh, gets processed, and leaves no trace. Think of it like a vending machine: put in your input, get your output, and the machine has no idea you were ever there.
A basic inference API works exactly this way. We send a prompt, get a response, done. Nothing is carried over.
What does stateful mean? A stateful system remembers. It holds onto context, such as a user’s session, conversation history, a loaded model, or a cached result, and uses that context to handle the next request. Think of it like a conversation with a colleague: they remember what we discussed last time, and we pick up where we left off.
The moment our AI application needs to do more than answer a single question, we’re dealing with state.
Why does this matter in practice? Modern AI systems almost always require some form of statefulness. Consider what’s actually happening inside a production-grade application:
- A retrieval pipeline (RAG) fetches documents from a vector database and combines them with a user’s query. That retrieval result is state, and it needs to be passed through the pipeline, not re-fetched on every token.
- Session memory means the model knows what a user said two messages ago. That history lives somewhere, and something has to manage it (a minimal sketch follows this list).
- Streaming responses keep the connection between the model and the user open while tokens are being generated. That open connection is state.
- Background jobs such as reindexing documents, re-ranking results, and running evals need to run asynchronously, and they often write results somewhere for later use.
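Here is that session-memory sketch, with the state handled by the application itself; the store and the call_model stub below are illustrative, not part of any platform:

```python
from collections import defaultdict

def call_model(messages):
    """Stand-in for any stateless inference API; returns a canned reply."""
    return f"(model reply to: {messages[-1]['content']})"

# Illustrative only: an in-memory store the application must manage itself,
# because a stateless inference API forgets everything between requests.
session_history = defaultdict(list)

def chat(session_id, user_message):
    history = session_history[session_id]
    history.append({"role": "user", "content": user_message})
    # The model only "remembers" earlier turns because we resend them each time.
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("user-42", "My order number is 1187."))
print(chat("user-42", "When will it arrive?"))  # history carries the order number
```

The state lives in our process, so something has to decide where it persists, how long it lives, and what happens when the process restarts.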
Abstraction-first platforms are optimized for stateless execution. They’re excellent at running isolated, fast inference calls, but coordinating state across multiple components typically falls outside what they’re designed to handle. We end up relying on external services, working around platform limits, or accepting unpredictable behavior when a session needs to persist longer than a single request.
Infrastructure-first platforms are built with statefulness in mind from the start. Compute, storage, and networking are exposed as separate, composable primitives, which means we can design exactly how state flows through our system, rather than inheriting someone else’s assumptions about it.
Modal gives us serverless GPU functions. We write Python code, decorate a function, and Modal handles provisioning a GPU container to run it. For batch workloads, fine-tuning runs, offline evaluations, and document processing, this model works very well.
Modal also supports stateful patterns. Using the @app.cls() decorator, we can define a class that initializes state once (for example, loading model weights) and reuses it across multiple requests. Combined with min_containers to keep a floor of warm containers running and buffer_containers to absorb traffic spikes, we can avoid most cold start penalties in a properly configured setup.
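A rough sketch of that pattern, using Modal’s documented @app.cls() decorator; the model, GPU type, and container counts are illustrative, and exact parameter names can vary between Modal versions:

```python
import modal

app = modal.App("warm-inference")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.cls(
    image=image,
    gpu="A100",
    min_containers=1,     # keep one warm container running (billed while idle)
    buffer_containers=1,  # pre-provision an extra container for traffic spikes
)
class Generator:
    @modal.enter()
    def load(self):
        # Runs once per container, not once per request: the model stays in
        # GPU memory and is reused by every subsequent call.
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="distilgpt2", device=0)

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=64)[0]["generated_text"]
```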
Memory Snapshots, currently in alpha, capture the full GPU state, including model weights and CUDA kernels, reducing initialization time by 3–10x on supported workloads.
The honest limitation is different: all of this requires deliberate configuration. Keeping containers warm costs money even when idle. Memory Snapshots need explicit opt-in and aren’t yet generally available. Managing stateful classes, warm pools, and persistent volumes adds architectural overhead that compounds as systems grow. We can build production-grade stateful systems on Modal, but we’re responsible for assembling and maintaining the components ourselves.
Together.ai’s default offering is a serverless inference API. We send a request, their infrastructure runs the model, and we get a response back. It’s fast, well-optimized, and requires almost no setup. For straightforward inference tasks such as generating text, embedding documents, and classifying inputs, it genuinely delivers.
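For reference, a minimal serverless call with Together’s official Python SDK might look like the sketch below; the model name is illustrative, and the client reads the TOGETHER_API_KEY environment variable:

```python
from together import Together  # pip install together

client = Together()  # picks up TOGETHER_API_KEY from the environment

# Serverless inference: shared infrastructure, billed per token, no servers
# to manage. The model name here is illustrative.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Classify this ticket: 'My invoice is wrong.'"}],
)
print(response.choices[0].message.content)
```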
The platform also goes beyond shared inference. Together.ai offers Dedicated Endpoints: reserved, isolated compute where a chosen model stays loaded persistently. We can deploy fine-tuned models from Hugging Face or S3 onto these endpoints, and the reserved GPU means the model is always warm. That’s a meaningful capability.
But here’s where the constraint shows up. Even with a dedicated endpoint, Together.ai manages the underlying infrastructure for us, and that means we’re working within their system, not building our own. The moment our application needs to pass state between a vector database, a model, and a user session, we’re stitching together external services on our own. Together.ai handles inference well. The surrounding system architecture is still our problem to solve.
DigitalOcean works from a different premise. Instead of abstracting infrastructure away and letting us opt back into control, it gives us independent building blocks from the start, such as compute, storage, and networking, and lets us wire them together the way our system actually needs.

A GPU Droplet is a persistent virtual machine. We deploy it, load our model into memory, and it stays there. No cold starts, no keep-warm configuration, no alpha features to enable. The model is live because the server is running, which is how servers work.

More importantly, the full infrastructure stack is available on the same platform. We can pair GPU compute with managed PostgreSQL for structured data, a managed vector database for embeddings and retrieval, Spaces for object storage, and a load balancer in front of everything.

The tradeoff is real and worth naming: we’re responsible for provisioning, monitoring, and scaling our own infrastructure. That requires more upfront work than calling an API or decorating a function. But the foundation we’re building on was designed for complexity, not designed to hide it until we grow past it.
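As a sketch of what “the model is live because the server is running” looks like in practice, here is a minimal persistent inference server you might run on a GPU Droplet; FastAPI and the model are illustrative choices, not DigitalOcean-specific APIs:

```python
# Minimal persistent server for a GPU Droplet: the model loads once at process
# start and stays in GPU memory for as long as the VM runs. FastAPI and the
# model name are illustrative choices, not DigitalOcean-specific APIs.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
pipe = pipeline("text-generation", model="distilgpt2", device=0)  # loaded once

class Prompt(BaseModel):
    text: str

@app.post("/generate")
def generate(prompt: Prompt):
    out = pipe(prompt.text, max_new_tokens=64)[0]["generated_text"]
    return {"completion": out}

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
```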
One storage detail worth noting back on Modal: persistent volumes (modal.Volume) are designed for write-once, read-many access. They work well for model weights and static datasets, but not for the kind of concurrent read/write patterns a live vector index requires.
Modal’s model is serverless by default, which means containers scale to zero when idle, so you’re not paying for GPU time when nothing is running. Cold starts do exist, and for GPU workloads they can be significant. Modal has been working to close this gap: in tested workloads, startup time dropped from around 45 seconds to about 5 seconds for some configurations. That said, GPU memory snapshots are still in alpha, subject to limitations, and most functions will need some code changes to take full advantage of them. It’s also worth noting that keeping containers warm longer is possible, but you’ll be billed for idle GPU reservation during that window.
Example: A batch document-processing job is a good fit; it runs, finishes, and the GPU scales back to zero. A real-time chatbot can work too, but cold start latency is still a real factor for traffic that’s unpredictable or bursty, even with snapshotting enabled.
Together AI gives you two modes. Serverless inference is shared infrastructure, fully managed, and billed per token. It automatically scales with request volume, with no infrastructure to manage and no long-term commitments. The trade-off is that since compute is shared, latency can vary under load.
For teams that need more predictability, Dedicated Endpoints provide isolated, single-tenant compute with reserved capacity, backed by Together’s inference engine. Benefits include predictable performance unaffected by serverless traffic and reliable capacity for spiky workloads. The endpoint runs continuously while active, so the cost model shifts from per-token to per-hour. Workloads averaging around 130,000 tokens per minute or more are typically where dedicated endpoints become more economical than serverless.
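To make the crossover concrete, here is a back-of-the-envelope comparison under hypothetical prices; the per-token and per-hour figures below are placeholders, not Together’s published rates:

```python
# Hypothetical prices, for illustration only -- not Together's actual rates.
serverless_price_per_million_tokens = 0.90  # USD
dedicated_price_per_hour = 7.00             # USD, one always-on endpoint

tokens_per_minute = 130_000
tokens_per_hour = tokens_per_minute * 60    # 7.8M tokens

serverless_cost_per_hour = (
    tokens_per_hour / 1_000_000 * serverless_price_per_million_tokens
)
print(f"Serverless: ${serverless_cost_per_hour:.2f}/hr vs dedicated: ${dedicated_price_per_hour:.2f}/hr")
# Beyond the throughput where these two costs cross, the fixed-price
# dedicated endpoint becomes the cheaper option.
```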
Neither option gives you direct access to the GPU or control over the underlying hardware, and that’s by design.
Example: A batch summarization job with no latency requirements fits serverless well, you pay for what you use. A high-traffic API serving real-time requests benefits from a dedicated endpoint to keep latency consistent and avoid contention with other users on shared infrastructure.
GPU Droplets are persistent virtual machines; the GPU is allocated to you and stays allocated. There’s no cold start in the serverless sense, because the machine is already running. The flip side is that billing begins when you create a Droplet and ends when you destroy it; however, you’re still billed for a powered-off Droplet, because the compute resources remain reserved on the hypervisor even when not in use. Scaling down means destroying or stopping resources manually, or building automation to do it for you.
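One way to build that automation is against DigitalOcean’s public REST API, which exposes a DELETE endpoint for Droplets. A hedged sketch is below; the Droplet ID is a placeholder, and in practice you would snapshot or drain the machine before destroying it:

```python
import os
import requests

# Sketch of scale-down automation using the DigitalOcean REST API
# (DELETE /v2/droplets/{droplet_id}). DROPLET_ID is a placeholder.
DO_TOKEN = os.environ["DIGITALOCEAN_TOKEN"]
DROPLET_ID = 123456789  # placeholder: the idle GPU Droplet to tear down

resp = requests.delete(
    f"https://api.digitalocean.com/v2/droplets/{DROPLET_ID}",
    headers={"Authorization": f"Bearer {DO_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()  # the API returns 204 No Content on success
```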
For teams who don’t want to manage individual machines, DigitalOcean’s Serverless Inference offers a different model within the same platform, billed per token, no GPU to provision. Both options share the same VPC, so they can coexist in the same stack without cross-cloud networking.
Example: A production inference API with steady, predictable traffic benefits from a GPU Droplet — consistent latency, no cold starts, no shared compute. A development environment or an infrequently used model endpoint is better served by Serverless Inference, since an idle GPU Droplet accrues cost whether or not it’s handling requests.
Each platform reflects a deliberate trade-off between control and convenience. Modal optimizes for cost efficiency on intermittent workloads and accepts cold start as a manageable constraint. Together AI removes GPU management entirely, with a pricing model that shifts from variable (serverless) to fixed (dedicated) as usage scales. DigitalOcean keeps the GPU under your control, which means consistent performance but also consistent cost, idle or not.
The right choice depends less on which platform is “better” and more on how your workload actually behaves: how often it runs, how latency-sensitive it is, and how much of the infrastructure you want to reason about.
| Aspect | Modal | Together AI | DigitalOcean |
|---|---|---|---|
| COMPUTE & GPU ACCESS | |||
| GPU access model | Serverless containers spin up per function call | Fully abstracted, but no direct GPU access | Persistent VMs whose full lifecycle you manage |
| Cold starts | Yes, mitigated by GPU memory snapshotting (alpha). Some workloads reduced from ~45s to ~5s | None on serverless. Dedicated endpoints are always-on | None, as the VM is already running once provisioned |
| Scale to zero | Yes, default behavior | Yes on serverless. Dedicated endpoints run continuously | No for GPU Droplets. The serverless inference option does scale to zero |
| GPU hardware control | Choose GPU type (A100, H100). | Dedicated: choose GPU type and count | Full control — H100, H200, B300, MI300X. Bare metal available |
| INFERENCE & MODEL SERVING | |||
| Open-source models | Any model, any framework (vLLM, SGLang, custom) | 200+ models via API | Serverless Inference + custom models on GPU Droplets |
| Custom / fine-tuned models | Full flexibility — bring any weights | Upload from Hugging Face or S3 to dedicated endpoints | Run on GPU Droplets or Dedicated Inference |
| Batch inference | Yes — built-in job queues | Yes — async batch up to 30B tokens at ~50% lower cost | Yes — Batch Inference tier with cost reduction |
| PRICING | |||
| Billing unit | Per second of GPU time | Per million tokens (serverless) / per hour (dedicated) | Per second for Droplets / per token for Serverless Inference |
| Idle cost | None — scale-to-zero means no cost when not running | None on serverless. Dedicated endpoints bill continuously | GPU Droplets bill even when powered off. Serverless Inference has no idle cost |
| Cost predictability | Variable — scales with usage, harder to forecast | Mixed — serverless varies, dedicated is fixed hourly | Predictable — fixed hourly/monthly rate for Droplets |
| DEVELOPER EXPERIENCE | |||
| Setup complexity | Low — deploy directly from Python, no cluster config | Very low — API key and a POST request | Medium — Serverless Inference is simple, GPU Droplets require VM setup |
| Infrastructure control | Moderate — GPU type, container env, concurrency | Minimal — not designed for server-level configuration | Full — root VM access, configure networking, storage, and runtime |
| Observability | Basic — logs and container view. Bring your own monitoring | Basic — built-in for dedicated containers, limited on serverless | Comprehensive — metrics, alerts, logs, Kubernetes monitoring |
| WORKLOAD FIT | |||
| Real-time inference | Workable with warm containers, but cold starts are the main risk | Strong — serverless stays warm, dedicated endpoints are single-tenant | Strong — persistent GPUs give consistent, predictable latency |
| Batch / async jobs | Strong — scale-to-zero makes it cost-efficient for bursty workloads | Strong — async batch at reduced pricing | Strong — dedicated batch tier with guaranteed completion window |
| Fine-tuning / training | Strong — any GPU, multi-node clusters in beta | Supported — fine-tuning API and GPU clusters available | Possible — GPU Droplets can run training |
| Agentic workflows | Partial — sandboxes for code execution, state is on the developer | Partial — code sandboxes, no native agent orchestration | Native — managed agents with durable state, sandboxes, and orchestration |
| Best suited for | Teams running GPU workloads from Python code without managing infrastructure. Good for batch jobs, model serving, and experimentation | Teams that want fast access to open-source models via API with no infrastructure work. Good for inference-heavy products at scale | Teams building full-stack AI applications that need compute, storage, databases, and networking in one place |
The most important shift happening in AI infrastructure right now is not about which platform is fastest or cheapest. It is about the fact that infrastructure decisions are becoming product decisions.
A year ago, the default move was to pick a cloud provider, provision a GPU, and build everything else yourself. That approach still works, but it is no longer the only option, and for many teams, it is no longer the right one. The stack has matured enough that the layer you build on top of directly shapes what you can ship, how fast you can move, and where you will hit walls.
What this comparison makes clear is that there is no universal answer. Modal, Together AI, and DigitalOcean are not competing for the same customer. They reflect three genuinely different philosophies about where developer time is best spent: on controlling infrastructure, on calling an API, or on co-locating every component of a system in one place.
The teams that will build the best AI systems in the next few years are not necessarily the ones with the most GPU access. They are the ones who understand the trade-offs clearly enough to match their infrastructure to their actual workload, not to a benchmark, a trend, or what another company chose.
Inference is a solved problem at the API level. What is not solved is everything around it: managing state, reducing latency across components, keeping costs predictable at scale, and maintaining enough control to debug what goes wrong in production. That is where infrastructure choices start to compound.
Choose the layer that removes the friction that actually slows your team down. Everything else is configuration.