By Balaji Varadarajan and Emilio Andere
At DigitalOcean, we’re committed to providing high-performance infrastructure for the next generation of AI, which is why we’ve been focused on hosting frontier Large Language Models (LLMs) on frontier GPUs—including AMD GPUs.
We see inference performance as an intricate systems-level challenge. For frontier open-weight models, achieving peak output speed is not just about the raw hardware. It also depends on a complex interaction between model architecture, runtime execution, memory systems, scheduling, and decoding strategy.
We believe there’s a significant “performance alpha” found in specialized inference engineering. Optimizing for both speed and cost-efficiency requires a much deeper approach than standard configuration sweeps. By taking a custom approach to the software stack, we can demonstrate that achieving performance parity with more expensive hardware is entirely possible.
While the current software ecosystem often presents non-obvious hurdles, deep engineering allows us to deliver stronger inference economics on high-performance AMD infrastructure relative to conventional flagship deployments.
To ground our “performance alpha” theory in reality, DigitalOcean worked with Wafer to optimize specific frontier models on AMD GPUs. By using Wafer’s Agent to identify inefficiencies and apply targeted fixes, we were able to move beyond marginal gains toward order-of-magnitude improvements that change how these models are used in production.*
Beyond the technical achievement, these results represent a fundamental shift in the economics of frontier inference. Our work demonstrates that fully optimized AMD infrastructure can achieve elite performance levels while remaining more cost-effective than traditional flagship hardware deployments.
The takeaway is straightforward: inference performance is increasingly a systems problem. Delivering both high performance and sustainable economics requires a deep, custom approach to the software stack that maximizes every cycle of the underlying silicon.
*Performance results are based on internal testing using the configurations described. Results in customer environments may vary depending on hardware, specific workload characteristics, implementation, and utilization.
In our research, we define “stock” frameworks as unmodified, out-of-the-box versions of inference engines or standard kernel libraries. While these tools are the fastest way to get a model running, they carry several architectural taxes that can hinder frontier model performance.
To understand how these gains were achieved, we must understand the systems-level concepts that govern frontier model performance. Inference engineering at its core is about mastering the interactions between hardware execution, memory hierarchy, the software dispatch layer, and knowing precisely where each lever sits.
MXFP4 is an open-standard 4-bit floating point format jointly developed by AMD, NVIDIA, Microsoft, and others under the OCP Microscaling specification[1]. Unlike per-tensor or per-channel quantization schemes, MXFP4 operates at the block level: a group of 32 values shares a single 8-bit scaling exponent, giving an effective storage cost of 4.25 bits per weight (4 + 8/32) rather than a clean 4.0.
This shared-exponent design is the key insight: it preserves the dynamic range needed for numerically sensitive operations like expert routing and attention projection, while still achieving the memory footprint of a 4-bit format.
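The block-level idea can be illustrated with a minimal sketch. It assumes a 32-element block, the eight E2M1 magnitudes of a 4-bit float, and a power-of-two shared scale chosen from the block's largest magnitude. The function names are hypothetical and the rounding is simplified (nearest value, saturating at the top of the range), so treat it as an illustration of the format rather than a reference implementation.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is a separate bit).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the largest E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32    # assumed block size for this sketch


def quantize_block(values):
    """Quantize one block; return (shared_exponent, dequantized_values)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0.0] * len(values)
    # One power-of-two scale for the whole block, derived from its peak value.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    dequantized = []
    for v in values:
        # Saturate at the largest representable magnitude, then snap to nearest.
        target = min(abs(v) / scale, FP4_E2M1_MAGNITUDES[-1])
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        dequantized.append(math.copysign(nearest * scale, v))
    return shared_exp, dequantized


def bits_per_weight(element_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Effective storage cost: element bits plus the amortized shared scale."""
    return element_bits + scale_bits / block_size


block = [6.0, -3.0, 0.5, 0.0] * 8          # 32 values, all exactly representable
exponent, restored = quantize_block(block)
print(exponent, restored[:4], bits_per_weight())
```

Because the scale is stored once per block, the overhead shrinks as the block grows, which is where the fractional bits-per-weight figure comes from.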
Standard Multi-Head Attention (MHA) caches full K and V tensors for every layer, every head, and every token in the sequence. At long context lengths with large batch sizes, this KV cache becomes the dominant consumer of HBM, often exceeding the model weights themselves. Multi-head Latent Attention (MLA), introduced in DeepSeek-V2[2] and carried forward through DeepSeek-V3, Kimi K2.5, and others, addresses this by changing what is stored.
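To make the scale concrete, the sketch below estimates KV cache size for standard MHA and compares it with a cache that stores only a small latent vector plus a small positional key component per token, which is the layout MLA uses. Every dimension here (head count, head size, latent size, layer count, context length, batch size) is a hypothetical value chosen for illustration, not the configuration of any model named in this post.

```python
def mha_values_per_token(num_heads, head_dim):
    """Full K and V for every head: two tensors of num_heads * head_dim each."""
    return 2 * num_heads * head_dim


def latent_values_per_token(latent_dim, rope_dim):
    """Compressed latent plus a separately cached positional key component."""
    return latent_dim + rope_dim


def cache_bytes(values_per_token, num_layers, seq_len, batch, bytes_per_value=2):
    """Total cache size across layers, tokens, and sequences (16-bit values)."""
    return values_per_token * num_layers * seq_len * batch * bytes_per_value


# Hypothetical configuration, for illustration only.
mha = mha_values_per_token(num_heads=16, head_dim=128)           # 4096
latent = latent_values_per_token(latent_dim=512, rope_dim=64)    # 576

GIB = 2 ** 30
mha_gib = cache_bytes(mha, num_layers=60, seq_len=32768, batch=8) / GIB
latent_gib = cache_bytes(latent, num_layers=60, seq_len=32768, batch=8) / GIB
print(f"MHA cache:    {mha_gib:.3f} GiB")      # 120.000 GiB
print(f"Latent cache: {latent_gib:.3f} GiB")   # 16.875 GiB
print(f"Reduction:    {mha / latent:.1f}x")    # 7.1x
```

The reduction factor depends only on the per-token ratio, so it moves with the head configuration rather than with context length or batch size.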
MLA compresses the keys and values for each token into a shared latent vector c_KV of dimension d_c << d_model. At inference time, the full K and V heads are reconstructed from this latent vector on the fly via learned up-projection matrices. The KV cache now stores only c_KV per token, a reduction of roughly 5 to 13x depending on the model’s head configuration, at the cost of additional GEMM operations during decode.
Rotary position embeddings (RoPE) are applied to a separate d_r-dimensional key component, which is cached alongside c_KV. This avoids the numerical issue of applying position-dependent transformations to vectors that will later be linearly projected, preserving attention correctness.
A naive decode step must (1) load c_KV from cache, (2) apply the up-projection GEMM to materialize K and V, and (3) run attention across all cached latent vectors. This cannot be naively fused into a standard FlashAttention kernel. Efficient MLA execution requires a fused kernel that absorbs the up-projection into the attention computation itself, merging what would otherwise be a separate GEMM + attention dispatch into a single pass. Without this fusion, the projection GEMMs are too small to be compute-efficient at batch-1, and latency suffers significantly[3].
A standard dense transformer activates every parameter for every token. An MoE layer replaces the FFN block with a collection of parallel expert FFNs and a learned router[4]. For each token, the router computes a softmax over all experts and selects the top-K highest-scoring ones. Only those K experts execute; the rest remain dormant[5]. The result is a model with a very large parameter count but a modest activated parameter count per token: GLM-5 has 774B total parameters but activates roughly 50–60B per token[6].
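The routing step can be sketched in a few lines. This is a toy version under stated assumptions: experts are plain scalar functions, the gate weights are renormalized over the selected experts (a common choice, but not confirmed for any model discussed here), and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    # Assumption: renormalize so the selected gate weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(token, router_logits, experts, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](token) for i, w in route_top_k(router_logits, k))


# Four toy experts; only two execute for this token.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
selected = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
print([i for i, _ in selected])   # [1, 3]
print(moe_forward(3.0, [0.1, 2.0, -1.0, 1.5], experts, k=2))
```

Only the experts returned by the router are ever called, which is why the activated parameter count per token stays far below the total parameter count.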
Every GPU kernel launch carries a fixed administrative cost. The driver must validate arguments, schedule the kernel onto a Streaming Multiprocessor (a Compute Unit on AMD hardware), and synchronize before the next kernel can begin. On ROCm/HIP, this overhead is approximately 2 to 5µs per launch. For a transformer layer executing at batch-1, the total compute time for a small operation like RMSNorm may itself be only 5 to 10µs, which means launch overhead is the same order of magnitude as the useful work.
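A back-of-envelope model shows why this matters. The sketch assumes launches are fully serialized with no overlap and that fusing a chain of small operations into one kernel leaves the compute time unchanged, which is pessimistic since fusion usually also removes memory round trips. The function names and the specific values (3µs launch, 5µs compute, three operations) are illustrative picks from within the ranges quoted above.

```python
def unfused_time_us(num_ops, launch_us, compute_us):
    """Each small op pays its own launch cost before doing its work."""
    return num_ops * (launch_us + compute_us)


def fused_time_us(num_ops, launch_us, compute_us):
    """One launch covers the whole chain; compute is assumed unchanged."""
    return launch_us + num_ops * compute_us


def launch_fraction(num_ops, launch_us, compute_us):
    """Share of unfused wall-clock time spent on launch overhead."""
    return (num_ops * launch_us) / unfused_time_us(num_ops, launch_us, compute_us)


# Three small ops, e.g. a reduce, a residual add, and a norm (illustrative).
print(unfused_time_us(3, launch_us=3.0, compute_us=5.0))   # 24.0
print(fused_time_us(3, launch_us=3.0, compute_us=5.0))     # 18.0
print(launch_fraction(3, launch_us=3.0, compute_us=5.0))   # 0.375
```

Under these assumptions more than a third of the unfused time is overhead, and the share grows as the individual operations get smaller.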
A fused all_reduce + residual_add + rms_norm kernel keeps the reduced tensor in registers or L1/L2 cache throughout[7]. The all-reduce result never touches HBM; it flows directly into the add and normalization logic within the same warp execution. AMD’s AITER library provides fused_add_rms_norm as a primitive that targets exactly this pattern; additional fusion opportunities include fused QKV projection + rotary embedding and fused gating + activation in MoE FFN layers.
Autoregressive decode is inherently serial: each token depends on all previous tokens, and the full model must execute once per generated token. The wall-clock cost per token is dominated by the memory bandwidth required to load all activated weights. At batch-1, this is a pure bandwidth-bound problem regardless of how fast the matrix cores are[8]. Speculative decoding breaks this serialization by parallelizing verification: a lightweight draft mechanism proposes several tokens, and the full model checks them in a single forward pass[9].
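Both claims can be put into rough numbers. The sketch below estimates a bandwidth-bound floor on per-token latency and the expected number of tokens produced per verification pass. It assumes a hypothetical aggregate memory bandwidth, ignores KV cache traffic and the cost of producing the draft, and treats each drafted token as accepted independently with the same probability (the usual geometric-series simplification). All function names and input values are illustrative.

```python
def decode_floor_ms(active_params, bits_per_weight, bandwidth_bytes_per_s):
    """Lower bound on per-token latency: time to stream the activated weights."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bytes_per_token / bandwidth_bytes_per_s * 1e3


def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens emitted per verification pass of the full model.

    Geometric series 1 + a + a^2 + ... + a^draft_len, assuming independent
    acceptance with probability accept_rate for each drafted token.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical inputs: 50B activated parameters at 4.25 bits, 5 TB/s bandwidth.
floor_ms = decode_floor_ms(50e9, 4.25, 5e12)
speedup = expected_tokens_per_pass(accept_rate=0.8, draft_len=4)
print(f"bandwidth floor: {floor_ms:.4f} ms/token")
print(f"tokens per verification pass: {speedup:.4f}")
print(f"effective floor with speculation: {floor_ms / speedup:.4f} ms/token")
```

The floor does not depend on compute throughput at all, which is why emitting more than one token per weight load is one of the few ways to move it.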
Tensor Parallelism distributes the weight matrices of each layer across multiple GPUs. For a linear projection Y = XW, the weight matrix W is sharded across N GPUs; each GPU computes a partial result, and a collective operation (an all-gather for a column split, an all-reduce for a row split) combines the outputs before the next layer[11]. This allows a model that would not fit on a single GPU to be served across a node, and reduces the per-GPU memory requirement proportionally.
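The sharding arithmetic can be checked on a single process. This toy sketch simulates N shards with plain Python lists: a column split gives each shard a slice of the output columns, which are concatenated, while a row split gives each shard a full-size partial sum, which is added elementwise (the step an all-reduce performs across real devices). The helper names are hypothetical and no actual communication takes place.

```python
def matmul(x, w):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def column_parallel(x, w, n):
    """Split W by columns; each shard yields a slice of Y's columns."""
    cols = len(w[0])
    assert cols % n == 0, "column count must divide evenly across shards"
    step = cols // n
    shards = [[row[i * step:(i + 1) * step] for row in w] for i in range(n)]
    partials = [matmul(x, shard) for shard in shards]
    # Gather: concatenate the column slices back into full rows.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def row_parallel(x, w, n):
    """Split W by rows (and X by columns); each shard yields a partial sum."""
    rows = len(w)
    assert rows % n == 0, "row count must divide evenly across shards"
    step = rows // n
    partials = []
    for i in range(n):
        x_shard = [row[i * step:(i + 1) * step] for row in x]
        w_shard = w[i * step:(i + 1) * step]
        partials.append(matmul(x_shard, w_shard))
    # Reduce: elementwise sum of the full-size partial results.
    return [[sum(p[r][c] for p in partials) for c in range(len(w[0]))]
            for r in range(len(x))]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1], [1, 3, 1, 0]]
print(matmul(x, w) == column_parallel(x, w, 2) == row_parallel(x, w, 2))  # True
```

In both layouts each shard holds only 1/N of W, which is the source of the proportional reduction in per-GPU memory.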
The performance gaps we’ve identified (the “Launch Tax”, prefill bias, and rigid software constraints) are not limitations of the underlying silicon. Rather, they are symptoms of a software ecosystem that has prioritized generality over peak efficiency.
By identifying these bottlenecks and mastering the “levers” of the modern inference stack, from MXFP4 quantization to custom kernel fusion, our team has shown that it is possible to achieve significant performance gains on high-performance AMD infrastructure. These optimizations don’t just result in faster tokens; they rewrite the economic reality of hosting frontier models at scale.
This is only the beginning of our deep dive into inference engineering. In the coming weeks, we will release three technical “surgeries,” each focusing on a different frontier model and the specific optimizations used to unlock its potential:
Stay tuned as we move from the high-level anatomy of these bottlenecks to the low-level code that helps solve them. Keep in mind that results will depend on your specific configuration, hardware, and usage patterns.


