By Diogo Vieira and Anish Singh Walia

Effectively sizing and configuring GPUs for vLLM inference starts with a clear understanding of the two fundamental phases of LLM processing, Prefill and Decode, and how each places different demands on your hardware.
This guide explains the anatomy of vLLM’s runtime behavior, clarifies core concepts like memory requirements, quantization, and tensor parallelism, and provides practical strategies for matching GPU selection to real-world workloads. By exploring how these factors interact, you’ll gain the knowledge needed to anticipate performance bottlenecks and make informed, cost-effective decisions when deploying large language models on GPU infrastructure.
Prefill vs. Decode phases determine hardware needs: The prefill phase (processing the input prompt in parallel) is compute-bound and determines Time-To-First-Token, while the decode phase (generating output tokens one at a time) is memory-bandwidth bound and determines token generation speed.
VRAM capacity sets absolute limits: Model weights and KV cache must fit within available GPU memory. A 70B model in FP16 requires 140GB for weights alone, making quantization essential for single-GPU deployments.
KV cache grows dynamically: Unlike static model weights, the KV cache expands based on context length and concurrency. A 70B model with 32k context and 10 concurrent users needs approximately 112GB for FP16 cache or 56GB for FP8 cache.
Quantization is the primary optimization lever: Reducing precision from FP16 to INT4 cuts memory usage by 75%, enabling large models to run on smaller GPUs. FP8 quantization offers the best balance of speed and quality on modern hardware.
Tensor Parallelism enables larger models: When models exceed single-GPU capacity, TP shards weights across multiple GPUs, pooling VRAM at the cost of communication overhead. Single-GPU execution is faster when feasible.
The prefill phase is the very first step of any request: vLLM takes the entire input prompt (user query + system prompt + any RAG context) and processes it all at once in a highly parallel fashion.
Once the prefill is complete, vLLM enters the decode phase: an autoregressive loop that generates the output one token at a time.

Learn More: For a deeper analysis of static vs. continuous batching trade-offs, refer to this article: 🔗 Hugging Face: LLM Performance - Prefill vs. Decode
Understanding which phase dominates your workload is essential for selecting the right hardware.
| Runtime Phase | Primary Action | Primary Hardware Constraint | Dominant Use Cases |
|---|---|---|---|
| Prefill | Processing long inputs in parallel. | Compute (TFLOPS) (Crucial for a fast TTFT) | • RAG (Retrieval-Augmented Generation) • Long Document Summarization • Massive Few-Shot Prompting |
| Decode | Generating outputs sequentially. | Memory Bandwidth (TB/s) (Crucial for fast token generation) | • Interactive Chat & Customer Service • Real-time Code Generation • Multi-turn Agentic Workflows |
During inference, vLLM relies heavily on a KV cache to avoid recomputing work it has already done.
The Trade-off: Dynamic Memory Growth: The price of this speed is memory. Every new token generated appends more entries to the cache. At runtime, KV cache usage grows dynamically with the context length of each request and the number of requests being served concurrently.
Scaling Impact: This behavior is why two workloads using the same model can have vastly different hardware requirements. A 70B model might fit on a GPU, but if the KV cache grows too large during a long conversation, the server will run out of VRAM and crash. Understanding memory management is essential for production deployments, as covered in our fine-tuning LLMs guide.

Once we understand how vLLM behaves at runtime, the next step is determining whether a model can run on a given GPU and what level of concurrency or context length it can support.
This section provides the mathematical formulas and decision trees needed to calculate static memory requirements, estimate KV cache growth, and systematically troubleshoot fit issues.
Before calculating model size, it is essential to understand the “container” we are trying to fit into. Different GPUs impose different hard limits on feasibility and performance.
Common Data Center GPU VRAM Capacities: These are the hard memory limits for the most common inference GPUs.
GPU Comparison for vLLM Inference and Training
| GPU Model | VRAM Capacity | Peak Dense TFLOPS (FP16 / FP8) | Primary Applications & Advantages |
|---|---|---|---|
| NVIDIA L40S | 48 GB | 362 / 733 | Cost-Effective Inference: Optimal for small-to-medium quantized models (7B–70B). |
| NVIDIA A100 | 40 GB / 80 GB | 312 / N/A | Previous Standard: 80GB version is excellent for tasks demanding high memory bandwidth. |
| NVIDIA H100 | 80 GB | 989 / 1,979 | Current High-End Standard: Features massive bandwidth, ideal for applications requiring long context lengths. |
| NVIDIA H200 | 141 GB | 989 / 1,979 | Significant Performance Boost: Allows for larger batch sizes or running 70B+ models with fewer required GPUs. |
| NVIDIA B300 | 288 GB | ~2,250 / 4,500 | Ultimate Density: Capable of fitting massive models (e.g., Llama 405B) with minimal GPU parallelism. |
| AMD MI300X | 192 GB | 1,307 / 2,614 | Massive Capacity: Perfectly suited for very large, unquantized models or processing huge batch sizes. |
| AMD MI325X | 256 GB | 1,307 / 2,614 | Capacity Optimized: Excellent choice for serving 70B+ models, especially those with very long context requirements. |
| AMD MI350X | 288 GB | 2,300 / 4,600 | High-Performance Flagship: Direct competitor to the B300, designed for massive-scale workloads. |
Even if a model fits in VRAM, the specific GPU architecture significantly impacts vLLM performance. Key metrics to consider are:
| Metric | Measured In | Impact on vLLM |
|---|---|---|
| VRAM Capacity | GB | Can it run? This sets the absolute maximum limit for the model size and context window. |
| Memory Bandwidth | TB/s | Decode Speed. Every generated token requires streaming the model weights and KV cache from memory, so high bandwidth delivers fast tokens/sec and keeps long conversations responsive. |
| Compute (TFLOPS) | TFLOPS | Prefill Speed. Crucial for RAG (Retrieval-Augmented Generation) and long-context summaries, where huge prompts are processed in parallel. High TFLOPS ensures a fast Time-To-First-Token. |
| Interconnect | GB/s | Parallelism Cost. Any interconnect adds latency. Even with NVLink (DigitalOcean’s standard), Tensor Parallelism (TP) introduces synchronization overhead that reduces performance compared to single-GPU execution. |
Every model must load its weights into GPU VRAM before vLLM can serve requests. The size of the weights depends entirely on the number of parameters and the precision chosen.
The estimated VRAM requirement (in GB) for a model can be calculated using the formula:
VRAM (GB) ≈ Parameters (Billions) × Bytes per Parameter
The table below illustrates the VRAM calculation for a Llama 3.1 70 billion parameter model at various quantization precisions:
| Precision | Bytes per Parameter | Example: Llama 3.1 70B VRAM (GB) |
|---|---|---|
| FP16 / BF16 | 2 bytes | 70 x 2 = 140GB |
| FP8 / INT8 | 1 byte | 70 x 1 = 70GB |
| INT4 | 0.5 bytes | 70 x 0.5 = 35GB |
Precision choice is the single biggest lever for feasibility. Quantizing a 70B model from FP16 to INT4 reduces its static footprint by 75%, moving it from “impossible on a single node” to “fits on a single A100”. This makes quantization essential for cost-effective deployments on DigitalOcean GPU Droplets.
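As a quick sanity check, the weight-sizing formula above can be captured in a few lines of Python. This is a minimal back-of-the-envelope sketch (the bytes-per-parameter values mirror the table above); real deployments should also budget a few extra GB for runtime overhead.

```python
# Back-of-the-envelope estimate of static weight memory.
# Bytes-per-parameter values mirror the precision table above.
BYTES_PER_PARAM = {
    "fp16": 2.0,   # also covers bf16
    "fp8": 1.0,    # also covers int8
    "int4": 0.5,   # AWQ / GPTQ style 4-bit formats
}

def weight_vram_gb(params_billions: float, precision: str) -> float:
    """Approximate VRAM (GB) needed just to hold the model weights."""
    return params_billions * BYTES_PER_PARAM[precision]

if __name__ == "__main__":
    for precision in ("fp16", "fp8", "int4"):
        gb = weight_vram_gb(70, precision)
        print(f"Llama 3.1 70B @ {precision}: ~{gb:.0f} GB of weights")
```

Running this reproduces the table: roughly 140 GB, 70 GB, and 35 GB for the 70B model.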
While weights determine if a model can start, the KV cache determines if it can scale. It is easy to underestimate the KV cache, leading to OOMs under load.
To accurately size a deployment, you need to estimate how much memory the cache will consume based on the expected context length and concurrency.
For most sizing conversations, the exact formula is too cumbersome to calculate on the fly. Instead, use a "Per-Token" memory multiplier. This method gets you close enough for initial sizing decisions.
Simplified KV Cache Formula:
Total KV Cache (MB) = Total Tokens x Multiplier
(Where Total Tokens = Context Length x Concurrency)
Standard Multipliers:
| Model Size | Standard Multiplier (FP16 Cache) | Quantized Multiplier (FP8 Cache) |
|---|---|---|
| Small Models (7B - 14B) | 0.15 MB / token | 0.075 MB / token |
| Large Models (70B - 80B) | 0.35 MB / token | 0.175 MB / token |
Example Calculation:
A customer wants to run Llama 3 70B with a context of 32k and 10 concurrent users.
Total tokens = 32,000 × 10 = 320,000. At 0.35 MB per token, the FP16 cache needs roughly 112 GB; an FP8 cache (0.175 MB per token) needs roughly 56 GB.
Verdict: on top of 140 GB (FP16) or 70 GB (FP8/INT8) of weights, this workload will not fit on a single 80 GB GPU. It requires a large-memory GPU (H200/MI300X class), tensor parallelism across multiple GPUs, or a reduction in context length and concurrency.
For detailed validation or corner cases, use the formal formula or an online calculator.
Total KV Cache (GB) = (2 × n_layers × n_kv_heads × head_dim × n_seq_len × n_batch × precision_bytes) / 1024^3
(For classic multi-head attention, n_kv_heads × head_dim equals d_model; for grouped-query-attention models such as Llama 3, using d_model instead would significantly overestimate the cache.)
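Both approaches can be expressed as a short Python helper. This is a minimal sketch: the per-token multipliers come from the table above, and the model shape used in the example (80 layers, 8 KV heads, head dimension 128, a Llama-3-70B-style configuration) is assumed for illustration.

```python
# Quick estimate: per-token multipliers from the table above.
MB_PER_TOKEN = {
    "small_fp16": 0.15, "small_fp8": 0.075,   # ~7B-14B models
    "large_fp16": 0.35, "large_fp8": 0.175,   # ~70B-80B models
}

def kv_cache_gb_quick(context_len: int, concurrency: int, multiplier_key: str) -> float:
    """Simplified sizing: total tokens x MB-per-token (decimal GB for easy mental math)."""
    total_tokens = context_len * concurrency
    return total_tokens * MB_PER_TOKEN[multiplier_key] / 1000

def kv_cache_gb_exact(n_layers: int, n_kv_heads: int, head_dim: int,
                      context_len: int, concurrency: int, precision_bytes: float) -> float:
    """Formal formula: 2 (K and V) x layers x KV width x tokens x bytes per value."""
    kv_width = n_kv_heads * head_dim  # equals d_model only for pure multi-head attention
    total_bytes = 2 * n_layers * kv_width * context_len * concurrency * precision_bytes
    return total_bytes / 1024**3

if __name__ == "__main__":
    # 70B-class model, 32k context, 10 concurrent users (the example above)
    print(f"quick, FP16 cache: ~{kv_cache_gb_quick(32_000, 10, 'large_fp16'):.0f} GB")
    print(f"quick, FP8 cache:  ~{kv_cache_gb_quick(32_000, 10, 'large_fp8'):.0f} GB")
    # Assumed Llama-3-70B-style shape: 80 layers, 8 KV heads, head_dim 128
    print(f"exact, FP16 cache: ~{kv_cache_gb_exact(80, 8, 128, 32_000, 10, 2):.0f} GB")
```

The quick method lands at roughly 112 GB (FP16) and 56 GB (FP8); the exact formula comes in slightly lower because the multipliers are intentionally generous.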
Tensor Parallelism (TP) is a technique that shards (splits) a model’s individual weight matrices across multiple GPUs. Effectively, it allows vLLM to treat multiple GPUs as a single, massive device with pooled VRAM.
Why use it? TP is primarily a requirement for feasibility, not a performance optimizer. You typically enable it when the model's weights are too large to fit on a single GPU, or when the KV cache required for your target context length and concurrency exceeds the VRAM left over after loading the weights.
The “Tax” of Parallelism (Performance Impact): While TP unlocks massive memory, it introduces communication overhead. After every layer of computation, all GPUs must synchronize their partial results.
Learn More: For a deeper look at how Tensor Parallelism shards models and impacts latency, refer to this conceptual guide: 🔗 Hugging Face: Tensor Parallelism Concepts
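For reference, tensor parallelism in vLLM is enabled with a single argument. The sketch below uses the offline Python API; the argument names reflect recent vLLM releases and the model ID is just an example, so verify both against your installed version.

```python
# Minimal sketch: sharding a 70B model across two GPUs with tensor parallelism.
# Assumes a node with two visible GPUs (ideally NVLink-connected) and a recent vLLM install.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example checkpoint; swap in your own
    tensor_parallel_size=2,        # shard weight matrices across 2 GPUs, pooling their VRAM
    dtype="float16",
    gpu_memory_utilization=0.90,   # fraction of each GPU's VRAM vLLM is allowed to claim
    max_model_len=32768,           # cap context length so the KV cache budget stays predictable
)

outputs = llm.generate(
    ["Summarize the trade-offs of tensor parallelism in one paragraph."],
    SamplingParams(temperature=0.2, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```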
Before moving to advanced configurations, let’s apply the math from the previous sections to real-world scenarios. This helps verify our understanding of “Fit” and introduces the practical constraints often missed in pure math.
It is a common mistake to calculate Weights + Cache = Total VRAM and assume 100% utilization is possible. It is not: vLLM only claims a fraction of each GPU’s memory (controlled by --gpu-memory-utilization, 0.9 by default), and activations plus CUDA overhead consume several additional gigabytes.
Scenario A: an 8B model in FP16 on a single L40S (48 GB)
- Weights: 8B params × 2 bytes = 16 GB
- Runtime overhead: ~4 GB
- Free for KV cache: 48 - 16 - 4 = 28 GB
- Token capacity: 28,000 MB ÷ 0.15 MB per token ≈ 186,000 tokens
Scenario B: a 70B model in FP16 on a single 80 GB GPU
- Weights: 70B params × 2 bytes = 140 GB, which already exceeds the 80 GB of VRAM. The model cannot even start without quantization or multiple GPUs.
Scenario C: the "Cache Trap": a 70B model in FP8 on a single 80 GB GPU (e.g., H100)
- Weights: 70B params × 1 byte = 70 GB
- Runtime overhead: ~4 GB
- Free for KV cache: 80 - 70 - 4 = 6 GB
- Token capacity: 6,000 MB ÷ 0.175 MB per token (FP8 cache) ≈ 34,000 tokens total, far too little for long contexts or meaningful concurrency.
Let’s fix the “Cache Trap” from Scenario C by adding a second GPU. This demonstrates how Tensor Parallelism (TP) pools memory resources.
Scenario D: the same 70B FP8 model with TP across 2× 80 GB GPUs (160 GB total)
- Weights: 70 GB, sharded across both GPUs
- Runtime overhead: ~8 GB (≈4 GB per GPU)
- Free for KV cache: 160 - 70 - 8 = 82 GB
- Token capacity: 82,000 MB ÷ 0.175 MB per token (FP8 cache) ≈ 468,000 tokens total
As demonstrated in the previous sizing scenarios, VRAM is the primary bottleneck for LLM inference. Quantization is the technique of reducing the precision of the numbers used to represent data, effectively trading a tiny amount of accuracy for massive gains in memory efficiency and speed.
It is critical to distinguish between the two types of quantization used in vLLM, as they address different constraints.
Weight quantization (offline): this involves compressing the massive, static weight matrices of the pre-trained model before it is loaded.
KV cache quantization (runtime): this involves compressing the intermediate Key and Value states stored in memory during sequence generation.
In vLLM, cache precision is controlled at runtime via the --kv-cache-dtype flag; on supported hardware, setting it to fp8 is highly recommended. It effectively doubles the available context capacity with negligible impact on model quality.
Not all quantization formats are created equal. The choice depends on the available hardware architecture and the desired balance between model size and accuracy.
| Precision / Format | Bytes per Param | Accuracy Impact | Best Hardware Support | Recommended Use Case |
|---|---|---|---|---|
| FP16 / BF16 (Base) | 2 | None (Reference) | All modern GPUs | The Gold Standard. Use whenever VRAM capacity permits. |
| FP8 (Floating Point 8) | 1 | Negligible | H100, H200, L40S, MI300X | Modern Default. The best balance of speed and quality on new hardware. Ideal for KV cache. |
| AWQ / GPTQ (INT4 variants) | ~0.5 | Low/Medium | A100, L40S, Consumer | The “Squeeze” Option. Essential for fitting huge models on older or smaller GPUs. Excellent decode speed. |
| Generic INT8 | 1 | Medium | Older GPUs (V100, T4) | Legacy. Generally superseded by FP8 on newer hardware or AWQ for extreme compression. |
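In vLLM, weight quantization is a property of the checkpoint you load, while KV cache quantization is a runtime setting. Below is a minimal sketch combining the two, assuming an AWQ-quantized 70B checkpoint (the model ID is a placeholder) and FP8-capable hardware; argument names follow recent vLLM releases.

```python
# Minimal sketch: combining INT4 (AWQ) weights with an FP8 KV cache in vLLM.
from vllm import LLM

llm = LLM(
    model="your-org/Llama-3.1-70B-Instruct-AWQ",  # placeholder: any pre-quantized AWQ checkpoint (~35 GB of weights)
    quantization="awq",           # tell vLLM the checkpoint stores AWQ INT4 weights
    kv_cache_dtype="fp8",         # halve per-token cache memory vs FP16 (needs FP8-capable GPUs)
    max_model_len=32768,
    gpu_memory_utilization=0.90,
)

print(llm.generate(["Hello!"])[0].outputs[0].text)
```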

Deciding when to apply quantization requires balancing practical constraints against workload sensitivity. While powerful, quantization involves fundamental trade-offs that must be considered during deployment planning.
Before determining the scenario, consider these two foundational constraints: hardware support (FP8 requires modern accelerators such as the H100, H200, L40S, or MI300X, while INT4 formats like AWQ/GPTQ run well on older or smaller GPUs) and workload sensitivity (how much precision loss the task can tolerate before quality degrades).
Based on the trade-offs above, quantization is appropriate in a wide range of real-world scenarios and is often the default choice in enterprise environments: when the model does not fit in the available VRAM, when you need higher concurrency or longer context windows from the same hardware, and when cost optimization is a priority, as is typical for chat and RAG workloads.
Quantization is not universally suitable. Some workloads are highly sensitive to precision loss and should remain in FP16/BF16 whenever possible, notably code generation, mathematical reasoning, and scientific tasks, where small numerical errors compound.
Up to this point, we have covered vLLM runtime behavior (Section 1), memory fundamentals (Section 2), and quantization strategies (Section 3).
This section connects these concepts into a repeatable decision framework. It moves from theory → practice, providing a structured workflow to evaluate feasibility, select hardware, and build a deployment plan.
To accurately size a vLLM deployment, you must extract specific technical details from the workload description. Abstract goals like “fast inference” are insufficient. Use these five questions to gather the necessary data: Which model (and how many parameters)? What precision or quality bar must be met? What is the maximum context length? How many concurrent users are expected? What are the latency targets for Time-To-First-Token and tokens per second?
Once requirements are clear, identify the smallest model and highest precision that meets the quality bar.
Verify the fit using the math from Section 2:
1. Weights: does Params × Bytes per Parameter fit in the available VRAM, leaving a few GB for overhead?
2. KV cache: does Context Length × Concurrency × Per-Token Multiplier fit in the VRAM left over after the weights?
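These two checks can be rolled into one small helper. This is a minimal sketch reusing the per-token multipliers and the roughly 4 GB-per-GPU overhead assumption from the sizing scenarios; treat its verdict as a first-pass estimate, not a substitute for a load test.

```python
# First-pass feasibility check: do the weights AND the KV cache fit in the chosen GPUs?
def fits(params_b: float, bytes_per_param: float, context_len: int, concurrency: int,
         mb_per_token: float, gpu_vram_gb: float, num_gpus: int = 1,
         overhead_gb_per_gpu: float = 4.0) -> None:
    weights_gb = params_b * bytes_per_param                      # static weight footprint
    cache_gb = context_len * concurrency * mb_per_token / 1000   # KV cache at full load
    budget_gb = (gpu_vram_gb - overhead_gb_per_gpu) * num_gpus   # usable VRAM after overhead
    verdict = "FITS" if weights_gb + cache_gb <= budget_gb else "DOES NOT FIT"
    print(f"{num_gpus}x{gpu_vram_gb:.0f} GB | weights {weights_gb:.0f} GB + "
          f"cache {cache_gb:.0f} GB vs budget {budget_gb:.0f} GB -> {verdict}")

# Scenario C: 70B FP8 weights, FP8 cache, 32k context, 10 users, one 80 GB GPU
fits(70, 1.0, 32_000, 10, 0.175, 80, num_gpus=1)
# Scenario D: the same workload with tensor parallelism across two 80 GB GPUs
fits(70, 1.0, 32_000, 10, 0.175, 80, num_gpus=2)
```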
With feasibility confirmed, select the specific GPU SKU. Use this “Cheat Sheet” for common scenarios.
| Workload Scenario | Recommended Configuration | Rationale |
|---|---|---|
| Standard Chat (8B–14B) | NVIDIA L40S (48GB) | Best Value. Plenty of compute and bandwidth for small models; 48GB easily fits the weights plus a huge cache. |
| Large Chat (70B, Cost-Sensitive) | L40S (INT4) or A100 (INT4) | The “Squeeze.” Quantization allows fitting 70B on a single card, avoiding the complexity of multi-GPU setups. |
| High-Performance Chat (70B) | NVIDIA H100 (FP8) AMD MI300X (FP16/FP8) | Modern Standard. H100 uses FP8 to fit and speed up inference. AMD Advantage: MI300X’s 192GB VRAM allows running 70B models with massive batch sizes on a single card. |
| Massive Context / RAG | NVIDIA H200 AMD MI300X / MI325X | Bandwidth & Capacity Kings. AMD: With 192GB (MI300X) or 256GB (MI325X), these are the best options for extreme context (128k+) without needing 4-8 GPUs. |
| Uncompromised Quality (70B FP16) | 2x H100 (Tensor Parallelism) 1x AMD MI300X | The “Single-Card” Hero. NVIDIA: Requires 2 cards to fit 140GB of weights. AMD: Fits the full 70B FP16 model on a single GPU, eliminating the latency penalty of Tensor Parallelism. |
| Ultra-Scale / Next-Gen (405B+) | NVIDIA B300 AMD MI350X | The Frontier. Designed for massive model density. MI350X (288GB) rivals NVIDIA’s Blackwell generation for fitting 400B+ MoE models efficiently. |
No paper sizing is perfect. Treat these recommendations as a starting point and validate them with a load test at your real context lengths and concurrency before committing to production.
GPU memory requirements depend on model size, precision, context length, and concurrency. As a rule of thumb, FP16 models need approximately 2GB per billion parameters for weights alone. A 70B model requires 140GB for FP16 weights, but only 35GB with INT4 quantization. Additionally, you must account for KV cache memory, which grows with context length and concurrent users. For a 70B model with 32k context and 10 users, expect approximately 112GB for FP16 cache or 56GB for FP8 cache.
Tensor parallelism (TP) shards model weight matrices across multiple GPUs within each layer, allowing multiple GPUs to work on the same computation simultaneously. This pools VRAM but requires synchronization after each layer, adding communication overhead. Pipeline parallelism (PP) distributes model layers across GPUs sequentially, with each GPU processing different layers. TP is typically used when a model is too large for a single GPU, while PP is more common in training scenarios. For inference, TP is the standard approach when models exceed single-GPU capacity.
Quantization is recommended when models don’t fit in available VRAM, when you need to support higher concurrency or longer context windows, or when cost optimization is a priority. FP8 quantization is ideal for modern hardware (H100, L40S, MI300X) and offers minimal accuracy loss. INT4 quantization is necessary for fitting large models on smaller GPUs but should be avoided for code generation, math, and scientific tasks where precision matters. For chat and RAG workloads, quantization is often the default choice.
Use the per-token multiplier method for quick estimation: multiply total tokens (context length × concurrency) by the model-specific multiplier. For small models (7B-14B), use 0.15 MB per token for FP16 cache or 0.075 MB for FP8 cache. For large models (70B-80B), use 0.35 MB per token for FP16 cache or 0.175 MB for FP8 cache. For exact calculations, use the formula: Total KV Cache (GB) = (2 × n_layers × d_model × n_seq_len × n_batch × precision_bytes) / 1024³, or use online tools like the LMCache KV Calculator.
vLLM can be deployed on DigitalOcean GPU Droplets, which offer NVIDIA GPUs that support vLLM’s requirements. When deploying, ensure your selected GPU has sufficient VRAM for your model size and workload. For cost-effective deployments, consider using quantized models (INT4 or FP8) to fit larger models on smaller GPU instances. DigitalOcean’s GPU Droplets provide NVLink connectivity, which is essential for efficient tensor parallelism when using multiple GPUs.
Building on this foundational understanding of how model size, precision, GPU architecture, the KV cache, and batching influence performance, the upcoming tutorials will apply these concepts to practical vLLM workloads.
For each use case, we will address three key questions to determine the optimal setup: does the model (plus its cache) fit, which GPUs and how many are needed, and which precision and parallelism settings deliver the required performance?
Properly sizing and configuring GPUs for vLLM requires understanding the fundamental trade-offs between model size, precision, context length, and concurrency. The prefill and decode phases have different hardware requirements, with prefill demanding high compute throughput and decode requiring high memory bandwidth. Quantization serves as the primary lever for fitting large models on available hardware, while tensor parallelism enables scaling beyond single-GPU limits.
The key to successful deployment is matching your workload characteristics to the right hardware configuration. Interactive chat applications prioritize fast token generation, which hinges on memory bandwidth, while RAG and long-context workloads need massive VRAM capacity and strong compute for fast prefill. By following the sizing framework outlined in this guide, you can systematically evaluate feasibility, select appropriate hardware, and optimize your vLLM deployment for production workloads.
Ready to deploy vLLM on GPU infrastructure? Explore these resources to get started:
Deploy on DigitalOcean GPU Droplets: Get hands-on experience with vLLM by deploying on DigitalOcean’s GPU Droplets. Learn how to set up your environment and configure vLLM for optimal performance in our GPU Droplets documentation.
Related Tutorials: Deepen your understanding of LLM deployment and optimization:
Try DigitalOcean Products:
For more technical guides and best practices, visit the DigitalOcean Community Tutorials or explore our resources on AI and machine learning.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
Senior Solutions Architect at DigitalOcean, helping builders design and scale cloud infrastructure with a focus on AI/ML and GPU-accelerated workloads. Passionate about continuous learning, performance optimization, and simplifying complex systems.
I help Businesses scale with AI x SEO x (authentic) Content that revives traffic and keeps leads flowing | 3,000,000+ Average monthly readers on Medium | Sr Technical Writer @ DigitalOcean | Ex-Cloud Consultant @ AMEX | Ex-Site Reliability Engineer(DevOps)@Nutanix