AI/ML Technical Content Strategist


(Source)
A Mixture of Experts (MoE) model is a transformer where the dense feed-forward layers inside each block are replaced by a set of smaller parallel networks called experts. A small router (or gating network) looks at each token and picks which experts process it — typically two to eight out of dozens or hundreds.
Everything else looks like a normal transformer. The sparsity lives entirely in those swapped-out feed-forward layers.
A dense transformer is like a single experienced general practitioner who sees every patient personally. Every visit costs the same amount of their time.
A MoE model is a hospital with a triage nurse and a roster of specialists. The nurse (the router) sends each patient to the two or three specialists most relevant to their case. The hospital knows more than any single GP, but you have to pay every specialist on staff whether they’re seeing patients or not.
That’s MoE inference in one sentence: you pay for memory in total parameters, but you only get compute savings on active parameters.
What we’re witnessing is best described as the MoE-ification of the open model ecosystem: a coordinated architectural migration from dense transformers to sparse ones, happening across every major lab in roughly an 18-month window. Three forces are driving it.
Training economics. MoE models hit a given quality target with substantially fewer training FLOPs than dense equivalents. When frontier training runs cost hundreds of millions of dollars, this matters enormously.
The DeepSeek moment. DeepSeek V3 (late 2024) and R1 (January 2025) proved publicly that an MoE could match frontier dense models at a fraction of training cost. DeepSeek V4, released as a preview in April 2026, pushed this further with a 1.6T-parameter Pro variant that activates just 49B per token. Every major lab spent 2025 either shipping their own MoE or explaining why their next release would be one.
Hardware caught up. The H100, H200, and B200 generations finally have the memory capacity and interconnect bandwidth to make large MoE serving practical. You can run frontier MoE models cost-effectively today on DigitalOcean GPU Droplets with NVIDIA H100 and H200 hardware.
| Model | Total / Active | Released | Notes |
|---|---|---|---|
| Mixtral 8x7B | 47B / 13B | Dec 2023 | The model that brought MoE mainstream |
| DeepSeek V3 | 671B / 37B | Dec 2024 | First major frontier-quality open MoE |
| DeepSeek R1 | 671B / 37B | Jan 2025 | Reasoning-focused; the tipping-point release |
| Llama 4 Scout | 109B / 17B | Apr 2025 | Meta’s first MoE; fits on a single H100 |
| Llama 4 Maverick | 400B / 17B | Apr 2025 | 128 experts; high quality at low active count |
| Kimi K2 | 1T / 32B | Jul 2025 | Modified MIT license |
| gpt-oss-120b / 20b | sparse | Aug 2025 | OpenAI’s open-weight return |
| Qwen 3.5 | 397B / 17B | Feb 2026 | Multimodal, 262K context window |
| DeepSeek V4 Pro | 1.6T / 49B | Apr 2026 | Current frontier open MoE, 1M context |
| Cohere Command A+ | 218B / 25B | May 2026 | 128 experts, Apache 2.0, runs on 2× H100 |
The active-to-total ratio keeps dropping: ~25% in Mixtral, ~4% in Llama 4 Maverick and Qwen 3.5, ~3% in DeepSeek V4 Pro. Sparser models train more efficiently but require more memory per useful FLOP delivered.
MoE changes inference economics on three axes: compute, memory, and utilization. Each pulls the bill in a different direction.
Per-token FLOPs scale with active parameters, not total. A 400B MoE that activates 17B per token does roughly the same math as a 17B dense model — and significantly less than a dense 70B. If you’ve been comparing models by total parameter count, the intuition you built up about cost is now misleading you.
This is why inference providers can price MoE models so aggressively. Lower FLOPs per token means more tokens per GPU-second, which means lower cost per million tokens at the same hourly rate. Latency benefits follow: prefill is faster because it’s FLOP-bound, and generation throughput typically beats dense models of equivalent quality.
Every expert has to live in VRAM, whether the router calls on it or not. The router can only choose between experts that are loaded and ready — you can’t materialize them on demand without crushing latency.
That 400B MoE needs ~400B parameters of memory. The activation sparsity that saves compute does nothing for memory. You’re not fitting it on a single H100 in FP16 — you need multi-GPU deployments with expert parallelism, aggressive quantization (FP8, INT8, INT4), or both. Each costs you something: hardware spend, engineering time, or quality.
The AMD MI300X GPUs available on DigitalOcean — with 192 GB of HBM3 memory per card — are structurally well-suited to MoE workloads precisely because they push the memory-per-dollar curve in the right direction.
There’s a second memory tax most articles miss: the KV cache. Every token in a conversation’s context window stores its key and value activations in GPU memory, and that cache grows linearly with both context length and batch size. At short contexts the cache is rounding error against model weights. At long contexts — and several frontier MoEs now ship with 256K, 1M, or longer context windows — KV cache can match or exceed the weight footprint.
This hits MoE deployments harder than dense ones for two reasons. First, you’ve already spent your VRAM budget loading experts you’re not actively using, so there’s less headroom for cache. Second, when experts are split across GPUs via expert parallelism, the KV cache still has to follow each request, which adds cross-GPU memory pressure and bandwidth costs that don’t appear in single-GPU sizing calculations.
The practical implication: when sizing hardware for a long-context MoE workload, budget memory for weights plus a realistic KV cache estimate based on your actual batch size and context length, not just the weight footprint alone. This is one of the most common reasons real deployments under-provision and hit out-of-memory errors weeks after going live.
MoE shines under high-throughput batched inference, where many tokens arrive at once and the router spreads load across experts evenly. Every expert in VRAM earns its keep, the GPU saturates, and per-token economics match the FLOP math.
It struggles under single-stream, low-batch, latency-sensitive serving. A handful of in-flight tokens activates a small subset of experts while the rest sit idle. You can also hit expert imbalance, where one or two “popular” experts get overloaded while others go untouched — producing inconsistent latency, underutilized GPUs, and surprising bills.
The practical takeaway: MoE is purpose-built for high-traffic API endpoints and merely tolerable for low-volume deployments. Same model, opposite economic outcomes.
MoE is not cheaper because it is sparse. MoE is cheaper when your traffic is dense enough to amortize sparse activation across expensive resident memory.

Suppose you’re choosing between Llama 4 Maverick (17B active / 400B total) and Llama 3.3 70B (dense) for production inference on DigitalOcean Compute. The published on-demand pricing varies - H100 at $3.39/GPU/hour, H200 at $3.44/GPU/hour, MI300X at $1.99/GPU/hour.
Memory footprint at FP8:
Hardware cost:
| Model | Configuration | Hourly cost |
|---|---|---|
| Llama 3.3 70B | 1× H100 | $3.39 |
| Llama 3.3 70B | 1× H200 | $3.44 |
| Llama 4 Maverick | 2× MI300X | $3.98 |
| Llama 4 Maverick | 3× H200 | $10.32 |
| Llama 4 Maverick | 4× H100 | $13.56 |
The MI300X configuration is the closest to parity — its higher VRAM-per-card is structurally well-matched to MoE workloads.
Per-token economics across utilization levels (illustrative ranges based on typical benchmarks — verify with your own workload):
| Utilization | Dense (1× H100) | MoE (2× MI300X) |
|---|---|---|
| 90% (saturated) | ~$0.87 / M tokens | ~$0.41 / M tokens |
| 40% (bursty) | ~$1.96 / M tokens | ~$0.94 / M tokens |
| 10% (single user) | ~$7.83 / M tokens | ~$3.76 / M tokens |
At saturation, MoE wins decisively — roughly half the per-token cost. At low utilization, both options look bad, but the dense option’s lower absolute hourly burn means less wasted money per idle hour. MoE saves money if your utilization is high enough to amortize the bigger hardware footprint. Otherwise, you’re paying for memory you’re not using.

The right strategy depends on which seat you’re in.
API consumers are the cleanest winners. You get MoE’s improved unit economics without handling memory or routing complexity. The DigitalOcean Inference Engine offers serverless inference across 70+ open and frontier models — including gpt-oss-120b, gpt-oss-20b, and other sparse models — with autoscaling from zero and per-token pricing. If you’re calling a model from production code, this is almost always the right starting point. You don’t need to think about expert imbalance; the platform handles it.
Cloud self-hosters face the most interesting math, and the worked example above is for you. You need to size hardware for total parameters but pay for tokens proportional to active parameters. DigitalOcean GPU Droplets let you test empirically without long contracts: spin up an MI300X for $1.99/hr if you suspect memory is your binding constraint, an H100 for $3.39/hr if compute is, run your actual workload for a day, and look at the numbers. The Inference-Optimized Image (PyTorch, CUDA, FlashAttention pre-configured) gets you from launch to live inference in minutes. 1-Click Models handle popular open-weight deployments without setup.
On-prem and enterprise workloads can stretch existing hardware further with MoE — if the workload matches. DigitalOcean Bare Metal GPUs are built for this profile: dedicated single-tenant servers with 8 GPUs per machine, RDMA interconnect up to 3.2 Tbps, no neighbor noise, and the network topology MoE expert parallelism actually needs. A bank running batched overnight document classification will love MoE on bare metal. The same bank running an interactive analyst assistant for 50 users may find a dense Llama 3.3 70B on a single H200 GPU Droplet serves them better at lower total cost.
Edge and local inference is still dense territory. Memory-constrained environments — laptops, phones, single consumer GPUs, embedded devices — get no benefit from holding 100 experts in RAM to use 2. Dense small models like Llama 3 8B, Phi, and Gemma remain the right tool, and the gap isn’t closing.
A few things don’t show up on benchmark sheets but will show up in your engineering estimates.
Quantization is trickier. Individual experts are smaller than the dense FFNs they replaced, making weights more sensitive to precision loss. INT4 quantization that works fine on Llama 3 70B can degrade an MoE of equivalent total size in subtle ways — worse behavior on specific topics rather than uniform quality drops. Evaluate quantized MoEs on your workload, not just MMLU.
Fine-tuning is harder. Open tooling grew up around dense transformers. LoRA and full fine-tuning recipes for MoE exist but are less well-trodden, and router behavior is particularly sensitive — naive runs can collapse expert specialization. If your roadmap depends on heavy customization, factor in extra engineering time.
Serving complexity adds up. Expert placement, all-to-all communication tuning, and load balancing are real engineering work. vLLM, SGLang, and TensorRT-LLM have matured rapidly and handle most major MoE architectures well — but support for new releases tends to lag by days or weeks, and edge cases (specific quantization formats, parallelism strategies, long contexts) still surface bugs.
Here’s what the rest of the internet won’t tell you: the MoE-ification of the open model ecosystem is real and important — and for most teams reading this, it shouldn’t change your deployment strategy this quarter.
The teams that benefit from frontier MoEs run steady, high-throughput, multi-tenant workloads at scale: API providers, document pipelines that never sleep, large-scale embedding generation, internal tools at companies with thousands of daily active users hitting one endpoint. If that’s you, the worked example shows real savings — 50%+ at the per-token level — and you should evaluate MoE seriously.
For everyone else, the math is less flattering. The single-engineer startup with bursty traffic. The mid-market SaaS embedding LLM features into an existing product. The R&D team building a proof-of-concept. The agency running inference for small clients. In each case, a well-served dense model on a single H100 or H200 will usually deliver lower total cost, simpler operations, more predictable latency, and an easier fine-tuning path than a frontier MoE on the multi-GPU rig it needs.
The asymmetric risk: you switch to MoE because the benchmarks look great and the per-token price is irresistible, then discover six months later that your workload — your batch sizes, your tail latency requirements, your fine-tuning needs — is on the wrong side of MoE’s sweet spot, and you’ve taken on infrastructure complexity that’s hard to walk back. Staying dense too long is much cheaper to correct.
| Workload pattern | Likely best choice |
|---|---|
| Low traffic, bursty app | Serverless inference |
| Medium traffic, predictable latency | Dense model on H100/H200 |
| High-throughput batched workload | MoE on multi-GPU / MI300X / bare metal |
| Heavy fine-tuning roadmap | Dense first |
| Long-context document workloads | Model weights + KV cache sizing test required |
Our recommendation: start with a dense model on right-sized infrastructure, measure your actual utilization and traffic patterns for at least a month, and only then revisit whether MoE makes economic sense for your specific workload. On DigitalOcean, that’s a single H100 or H200 GPU Droplet running Llama 3.3 70B (or a smaller dense model), with the DigitalOcean Inference Engine as a fallback for variable traffic. When your utilization data tells you to switch, the same platform supports the upgrade.
The active-to-total ratio race. Mixtral was ~25%; Llama 4 Maverick is ~4%; DeepSeek V4 Pro is ~3%. How far this can be pushed before quality breaks is an open empirical question — and the answer shapes what “frontier” looks like in 2027.
MoE-aware quantization. Generic dense techniques leave quality on the table. Expert-specific quantization and dynamic precision allocation could directly lower the memory cost that makes MoE painful for smaller deployers.
Routing innovations. Shared experts, fine-grained experts, and learned routing all change the cost structure. Router quality determines how well a model uses experts you’re already paying memory for.
Closed labs. ChatGPT has long been rumored sparse. Anthropic and Google don’t publish architectural details. If the closed frontier is also MoE, the architectural gap between open and closed shrinks, and deployment efficiency becomes the main competitive axis.
What does MoE stand for in AI?
MoE stands for Mixture of Experts: an architecture where a router sends each token to a small subset of specialized “expert” networks instead of running every parameter for every token.
How is an MoE model different from a dense model?
A dense model uses all of its parameters for every token. An MoE model keeps many more total parameters in memory, but activates only a fraction of them per token.
Are MoE models cheaper to run than dense models?
MoE models are cheaper when traffic is high enough to keep GPUs well utilized. At low utilization, they can cost more because you still pay to keep all expert weights loaded in memory.
Which numbers matter most for MoE inference cost?
The two key numbers are total parameters and active parameters. Total parameters determine memory footprint; active parameters determine per-token compute.
How much VRAM do MoE models need?
MoE models need enough VRAM for the full model weights, not just the active parameters. Long context windows also require extra memory for KV cache, which can materially change hardware sizing.
When should I use an MoE model?
Use MoE for high-throughput, batched workloads where the hardware stays busy. For bursty, low-volume, latency-sensitive, or heavily fine-tuned workloads, a dense model is often simpler and cheaper.
Should most teams switch from dense models to MoE?
Not automatically. Start with a right-sized dense model, measure real utilization, and only switch to MoE if your traffic pattern can amortize the larger memory footprint.
When should I use serverless inference instead of self-hosting?
Use serverless inference when traffic is variable or unpredictable. Self-hosting makes more sense when workload volume is steady enough to justify always-on GPU capacity.
What should stakeholders remember about MoE and inference bills?
MoE does not make inference automatically cheaper. It lowers compute per token, but the final bill depends on memory footprint, utilization, batching, context length, and serving complexity.
The MoE-ification of the open model ecosystem isn’t a niche architectural footnote — it’s the most consequential change to how open-weight models are built, served, and priced since the original Llama release. Nearly every major lab now ships MoE by default.
The cost picture isn’t uniformly better or worse than the dense era. It’s different. Per-token compute went down; memory footprint went up; batching matters more than ever; quantization and fine-tuning got harder; serving infrastructure got more complex. Whether your bill ends up smaller or larger depends entirely on whether your workload looks like the one MoE is optimized for.
The deeper change: comparing models by parameter count is effectively over. “70B” used to be a meaningful shorthand for cost, latency, and capability. In an MoE world, it’s almost meaningless. The numbers you need to reason about — active parameters, total parameters, memory footprint, throughput under load, traffic pattern — are more complicated than a single headline figure.
That’s a tax on everyone’s mental model, but it’s also the new baseline. Teams that internalize it first will end up with cheaper, faster, better-deployed inference. Teams that don’t will spend the next year wondering why their GPU bill keeps surprising them.
Get started on DigitalOcean. Start by benchmarking three numbers on your own workload: GPU utilization, tokens/sec at target latency, and KV cache memory at realistic context length. DigitalOcean lets you run that test across Serverless Inference, Dedicated Inference and GPU Droplets, without committing to a single serving strategy upfront.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
This textbox defaults to using Markdown to format your answer.
You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!
Reach out to our team for assistance with GPU Droplets, 1-click LLM models, AI Agents, and bare metal GPUs.
Get paid to write technical tutorials and select a tech-focused charity to receive a matching donation.
Full documentation for every DigitalOcean product.
The Wave has everything you need to know about building a business, from raising funding to marketing your product.
Stay up to date by signing up for DigitalOcean’s Infrastructure as a Newsletter.
New accounts only. By submitting your email you agree to our Privacy Policy
Scale up as you grow — whether you're running one virtual machine or ten thousand.
From GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.