AMD 101

Published on June 30, 2025

Introduction

When it comes to AI accelerators, NVIDIA’s hardware, most recently the Blackwell GPUs, enjoys a competitive advantage due to hardware innovation and the widespread adoption and optimization of CUDA, a parallel computing platform and programming model. CUDA’s dominance has led to the extensive optimization of popular resources and tools, such as PyTorch, Triton, and Hugging Face, for NVIDIA hardware. This ecosystem of optimized software and hardware makes NVIDIA a favourable choice for developers and researchers alike, creating a self-reinforcing cycle of popularity and performance. The compatibility and performance advantages of NVIDIA’s hardware and CUDA have established a formidable moat, making their products a preferred option in the market for AI and GPU-accelerated computing.

That said, there are workloads for which AMD GPUs are the more cost-effective choice. It is also worth noting that, for DigitalOcean GPU Droplets, an AMD Instinct MI300X is currently cheaper to run than an NVIDIA H100.

The goal of this article is to serve as a starting point for those getting into AMD GPUs for high-performance computing (HPC) and AI applications. To that end, we will discuss the CDNA architecture, to better understand the hardware, and the ROCm software stack, to better understand how AMD GPUs are programmed. While interesting in their own right, we will not be covering the RDNA architecture (used for gaming and optimized for frames per second), the XDNA architecture (used for on-device AI in personal computing), or the upcoming UDNA architecture (which unifies RDNA and CDNA).

CDNA

CDNA is AMD's compute-optimized GPU architecture, designed to maximize compute throughput (FLOPS) rather than frame rates. There have been several iterations, each powering a different AMD Instinct™ series, summarized in the table below.

| | CDNA | CDNA 2 | CDNA 3 | CDNA 4 |
|---|---|---|---|---|
| Process Technology | 7nm FinFET | 6nm FinFET | 5nm + 6nm FinFET | 3nm + 6nm FinFET |
| Transistors | 25.6 Billion | Up to 58 Billion | Up to 146 Billion | Up to 185 Billion |
| CUs / Matrix Cores | 120 / 440 | Up to 220 / 880 | Up to 304 / 1216 | 256 / 1024 |
| Memory Type | 32 GB HBM2 | Up to 128 GB HBM2E | Up to 256 GB HBM3 / HBM3E | 288 GB HBM3E |
| Memory Bandwidth (Peak) | 1.2 TB/s | Up to 3.2 TB/s | Up to 6 TB/s | 8 TB/s |
| AMD Infinity Cache™ | N/A | N/A | 256 MB | 256 MB |
| GPU Coherency | N/A | Cache | Cache and HBM | Cache and HBM |
| Data Type Support | INT4, INT8, BF16, FP16, FP32, FP64 | INT4, INT8, BF16, FP16, FP32, FP64 | INT8, FP8, BF16, FP16, TF32, FP32, FP64 (sparsity support) | INT4, FP4, FP6, INT8, FP8, BF16, FP16, TF32*, FP32, FP64 (sparsity support) |
| Products | AMD Instinct™ MI100 Series | AMD Instinct™ MI200 Series | AMD Instinct™ MI300 Series | AMD Instinct™ MI350 Series |

*TF32 is supported via software emulation.

(Table adapted from Source)

Process Technology: This refers to the manufacturing process used to fabricate semiconductor chips, with smaller nanometer nodes indicating more advanced technology, leading to higher transistor density, improved performance, and greater energy efficiency, often utilizing FinFET technology.

Transistor Count: A higher transistor count generally signifies a more intricate chip capable of executing a greater number of operations and supporting a broader array of functionalities, although its increase is now driven by packaging innovations like chiplets rather than solely monolithic scaling.

Compute Units (CUs) / Matrix Cores: CUs are foundational blocks for parallel workloads, while Matrix Cores are specialized hardware within CUs that accelerate Generalized Matrix Multiplication (GEMM) computations, crucial for AI applications.

Memory Type and Memory Bandwidth (Peak): High Bandwidth Memory (HBM) is an advanced memory technology that, along with increasing memory bandwidth, provides exceptional data throughput and efficiency, preventing data bottlenecks and improving performance in data-intensive workloads.

AMD Infinity Cache™: This is a large, high-speed, on-package cache designed to significantly reduce data access latency and improve overall data access efficiency within the GPU’s memory hierarchy by acting as a crucial buffer.

GPU Coherency: This fundamental mechanism ensures consistency of shared data across multiple processing agents within a system, preventing errors and optimizing performance by reducing data contention and latency.

Data Type Support: This refers to the various numerical formats (e.g., FP64, FP32, FP16, BF16, TF32, INT8/4, FP8/6/4) a GPU can process, each offering a distinct balance between precision, range, and computational efficiency, crucial for optimizing HPC and AI workloads; the sketch after this list shows how the choice of format translates into memory and bandwidth requirements.

Sparsity Support: This allows GPUs to exploit the presence of zero or near-zero values within data or model parameters to reduce computational complexity and memory requirements, making AI models more efficient and scalable.

Products: These are the specific AMD Instinct™ accelerator series (MI100, MI200, MI300, MI350) that embody each generation of the CDNA architecture, demonstrating the practical application of underlying architectural innovations for HPC and AI workloads.
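To make the data-type and bandwidth figures above more concrete, here is a rough back-of-the-envelope sketch in Python. It estimates the memory needed to hold a model's weights at different precisions and the minimum time required to stream them once from HBM; the 70B parameter count and the 6 TB/s bandwidth (roughly the CDNA 3 peak from the table) are illustrative assumptions, not measurements.

```python
# Rough, back-of-the-envelope estimate: weight memory at different precisions
# and the time to stream those weights once at a given HBM bandwidth.
# All numbers are illustrative assumptions, not benchmarks.

BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16/BF16": 2.0,
    "FP8": 1.0,
    "FP4": 0.5,
}

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in gigabytes."""
    return num_params * bytes_per_param / 1e9

def stream_time_ms(memory_gb: float, bandwidth_tb_s: float) -> float:
    """Lower bound on the time to read the weights once from HBM, in milliseconds."""
    return memory_gb / (bandwidth_tb_s * 1e3) * 1e3

if __name__ == "__main__":
    num_params = 70e9        # assumed model size (70B parameters)
    bandwidth_tb_s = 6.0     # assumed peak HBM bandwidth, roughly CDNA 3 class

    for dtype, nbytes in BYTES_PER_PARAM.items():
        mem = weight_memory_gb(num_params, nbytes)
        t = stream_time_ms(mem, bandwidth_tb_s)
        print(f"{dtype:>9}: {mem:6.1f} GB of weights, >= {t:5.2f} ms per full pass at {bandwidth_tb_s} TB/s")
```

Since each decoding step of an LLM has to read every weight at least once, this streaming time is a useful lower bound on per-token latency for memory-bound inference, which is why both lower-precision formats and higher HBM bandwidth matter so much.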

ROCm Software Stack

ROCm is an open-source software stack for programming AMD GPUs. ROCm includes the HIP (Heterogeneous-computing Interface for Portability) programming model, which allows developers to write code that can run on both AMD and NVIDIA GPUs with minimal changes.

Developers can program AMD GPUs using several approaches: HIP for CUDA-like programming, OpenCL for cross-platform development, or OpenMP for directive-based parallel programming.
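For many practitioners, the most common entry point is PyTorch rather than raw HIP: ROCm builds of PyTorch expose AMD GPUs through the familiar torch.cuda API (backed by HIP), so most CUDA-oriented Python code runs unchanged. The snippet below is a minimal sanity check and assumes a ROCm-enabled PyTorch install, for example from AMD's ROCm wheels or containers.

```python
import torch

# On ROCm builds of PyTorch, torch.cuda.* is backed by HIP, so the same
# device-query and tensor code used on NVIDIA GPUs works on AMD GPUs.
print("GPU available:", torch.cuda.is_available())
print("HIP version:  ", getattr(torch.version, "hip", None))  # None on CUDA-only builds

if torch.cuda.is_available():
    device = torch.device("cuda")  # maps to the AMD GPU under ROCm
    print("Device name:  ", torch.cuda.get_device_name(0))

    # A small matrix multiply to exercise the Matrix Cores path
    # (via ROCm's BLAS libraries); BF16 is supported on CDNA hardware.
    a = torch.randn(1024, 1024, device=device, dtype=torch.bfloat16)
    b = torch.randn(1024, 1024, device=device, dtype=torch.bfloat16)
    c = a @ b
    print("Result shape: ", tuple(c.shape), "dtype:", c.dtype)
```

Developers who want to go lower level can write HIP C++ kernels directly, and the hipify tools help translate existing CUDA sources to HIP.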

Inference with AMD

For inference tasks, AMD has collaborated with top serving frameworks like vLLM and SGLang to develop highly optimized containers. These containers are ready for large-scale generative AI inference deployments, including Day 0 support for the most widely used generative AI models. vLLM is recommended as a versatile, general-purpose solution, with AMD providing support through bi-weekly stable releases and weekly development updates. For agentic workloads, DeepSeek, and other specific applications, SGLang is the preferred choice, supported by weekly stable releases.
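As a minimal illustration of the vLLM path, the sketch below uses vLLM's offline Python API; on an MI300X GPU Droplet this would typically be run inside AMD's ROCm vLLM container. The model name is only an example, and options such as tensor parallelism are omitted for brevity.

```python
from vllm import LLM, SamplingParams

# Offline batch inference with vLLM. The same Python API is used on ROCm;
# AMD's prebuilt vLLM containers ship with the ROCm backend configured.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model; any supported HF model works

sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

prompts = [
    "Explain what an AMD Instinct MI300X is in one sentence.",
    "What does HBM stand for?",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt)
    print("->", output.outputs[0].text.strip())
```

For online serving, vLLM also ships an OpenAI-compatible HTTP server (started with `vllm serve`), which is the more typical mode for production deployments.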

Beyond the serving frameworks themselves, AMD also optimizes leading models such as the Llama family, Gemma 3, DeepSeek, and the Qwen family with Day 0 support. This ensures that the ecosystem can easily integrate the latest models in the rapidly evolving AI landscape.

Conclusion

While NVIDIA’s CUDA ecosystem dominates AI, AMD’s CDNA architecture and ROCm software stack are emerging as a strong alternative, especially where cost efficiency matters. AMD is actively collaborating with key inference frameworks like vLLM and SGLang and is optimizing leading generative AI models for Day 0 support. This commitment to compute optimization and open-source software makes AMD an increasingly attractive option for high-performance AI, diversifying the landscape beyond NVIDIA.
