
Speculative Decoding on vLLM: A Configuration and Decision Framework

Published on July 2, 2026

AI/ML

Python

By Shaoni Mukherjee

AI Technical Writer

You add --speculative-model to your vLLM config, run a benchmark, and see roughly 2× token throughput. You ship it. Three weeks later, your 99th percentile latency (P99 - the slowest 1% of requests) has climbed, a subset of requests are hitting OOM errors, and your on-call engineer is staring at metrics that make no sense - the model is technically “faster,” but users are complaining the API feels slower.

Speculative decoding is a conditional optimization - it delivers real gains on the right workloads and quietly degrades everything else. Most teams find out which category they’re in the hard way.

This article is an operational framework for making the decision: which draft model to pick, what it costs your memory budget, how it interacts with the scheduler under real concurrency, and - most importantly - how to measure whether it’s actually helping in your environment.

A note on the numbers in this article: the performance figures we reference come from the vLLM team’s own published benchmarks (Llama-3-70B, 4× H100 SXM5, October 2024). This article’s core argument is that you should measure at your own production conditions. The vLLM data is the best published reference available; treat it as a calibration baseline, not a prediction for your deployment. The measurement section at the end tells you exactly what to run to get your own numbers.

TL;DR

Speculative decoding runs a small draft model to propose tokens, then verifies all of them in one target model pass - faster token generation without changing output quality.
It helps at low query rates on structured, low-temperature workloads: up to 2.8× speedup on summarization, 1.5× on chat (vLLM team, 4× H100).
It degrades at high query rates - the same vLLM benchmarks show 1.4–1.8× slowdowns when the GPU is saturated.
Pick a draft model at a 1:8–1:12 size ratio from the same model family. The draft model’s VRAM cost comes directly out of your KV cache budget.
Monitor spec_decode_draft_acceptance_rate in production. Below ~0.5, you’re adding latency, not removing it - turn it off.

Choosing the Right Draft Model for Speculative Decoding

The first decision - which draft model to use - is also the most important, and it’s rarely treated that way. The intuition is simple: use a smaller model from the same family, pick a reasonable --num-speculative-tokens, and let the math work out. Whether that actually works out depends almost entirely on one factor: how large the draft model is relative to the target model.

Why the ratio is the primary lever

Normally, generating k tokens requires running the large target model k times - one full forward pass per token. Speculative decoding short-circuits this: the small draft model generates k candidate tokens quickly, then the target model checks all k candidates in a single forward pass, because it can evaluate every position in parallel. One verification pass is much faster than k generation passes.

Standard generation — 5 tokens = 5 sequential target model passes:

  [70B] → t₁ → [70B] → t₂ → [70B] → t₃ → [70B] → t₄ → [70B] → t₅
 (pass 1)      (pass 2)      (pass 3)      (pass 4)      (pass 5)


Speculative decoding — 5 proposed tokens = 1 draft pass + 1 verify pass:

  [8B Draft]  → t₁  t₂  t₃  t₄  t₅     (one fast pass, all proposed)
                              |
                              ▼
  [70B Target] → ✓t₁ ✓t₂ ✓t₃ ✗t₄  —    (one parallel verify pass)
                 └─────────────┘
                 3 tokens accepted, 2 rejected
                 one target pass yields 4 tokens (3 accepted + the target's own t₄)

The saving holds only when enough draft tokens are accepted, so acceptance rate is the key variable. The size ratio matters because a larger draft model is better at predicting what the target model would have said, which means more accepted tokens per step. A 1B model simply doesn’t have enough capacity to reliably mimic a 70B model’s outputs, so more of its candidates get rejected.
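The propose-and-verify loop in the diagram above can be sketched in a few lines. This is a minimal illustration for greedy decoding (temperature=0) only, not vLLM's implementation: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are plain functions over integer tokens, and a real engine verifies all positions in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its greedy next token


def speculative_step(context: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one propose-and-verify step and return the tokens to append."""
    # 1. Draft phase: the small model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. Verify phase: compare each proposal with the target's own choice.
    #    (A real engine gets all of these from a single forward pass.)
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target's next token is a bonus.
    accepted.append(target(context + accepted))
    return accepted


# Toy models: the target counts upward; the draft agrees except on multiples of 4.
def toy_target(ctx: List[Token]) -> Token:
    return ctx[-1] + 1


def toy_draft(ctx: List[Token]) -> Token:
    nxt = ctx[-1] + 1
    return nxt if nxt % 4 else nxt + 100


print(speculative_step([0], toy_draft, toy_target, k=5))
```

With these toy models the draft gets the first three tokens right and misses the fourth, so the step appends three accepted tokens plus the target's correction, mirroring the diagram. At non-zero temperature the comparison becomes a probabilistic acceptance test rather than an exact match.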

A 1B draft model against a 70B target (~1:70 ratio) generates tokens fast and cheaply, but it guesses the wrong token too often, so the target model rejects most of what the draft proposed - and you’ve paid the cost of running two models without getting enough accepted tokens to make it worthwhile. A 13B draft against the same 70B target (~1:5 ratio) predicts well but costs 13B parameters worth of VRAM and compute, which shifts the crossover point where you’d have been better off just running the target model.

The mechanics of the algorithm point to a consistent sweet spot: a 1:8 to 1:12 size ratio using same-family, same-training-distribution models. The table below shows common pairings and the ratio math - treat these as a starting framework for deciding what to test, not as benchmarks to cite. The only acceptance rate that matters for your deployment is the one you measure on your own hardware, with your actual prompt distribution.

Draft Model	Target Model	Ratio	Why it matters
Llama-3.1-8B	Llama-3.1-70B	~1:9	Sweet spot - same family, good capacity match
Qwen2.5-7B	Qwen2.5-72B	~1:10	Same-family pairing, similar ratio
Llama-3.2-1B	Llama-3.1-70B	~1:70	Too small - draft diverges from target too often
Llama-3.1-8B	Llama-3.1-405B	~1:50	Same draft, but much weaker relative to the larger target

These pairings illustrate how the size ratio affects draft model quality. Acceptance rates vary significantly with hardware, quantization, and prompt distribution - measure on your own setup before drawing conclusions.
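The ratio math behind the table is simple enough to script when you are screening candidate pairings. The sketch below assumes nominal parameter counts read off the model names (real counts differ slightly), and `size_ratio` and `classify` are hypothetical helper names. It only screens on size; it says nothing about measured acceptance rates.

```python
# Nominal parameter counts in billions, read from the model names (approximate).
PAIRINGS = [
    ("Llama-3.1-8B", 8, "Llama-3.1-70B", 70),
    ("Qwen2.5-7B", 7, "Qwen2.5-72B", 72),
    ("Llama-3.2-1B", 1, "Llama-3.1-70B", 70),
    ("Llama-3.1-8B", 8, "Llama-3.1-405B", 405),
]


def size_ratio(draft_b: float, target_b: float) -> float:
    """How many times larger the target is than the draft."""
    return target_b / draft_b


def classify(ratio: float, low: float = 8.0, high: float = 12.0) -> str:
    """Place a ratio relative to the 1:8 to 1:12 band discussed in the article."""
    if ratio < low:
        return "draft likely too expensive"
    if ratio > high:
        return "draft likely too weak"
    return "within band"


for draft, draft_b, target, target_b in PAIRINGS:
    ratio = size_ratio(draft_b, target_b)
    print(f"{draft} -> {target}: ~1:{ratio:.0f} ({classify(ratio)})")
```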

The 1B/70B pairing looks cheap but rarely pays off - the target model rejects too many of the draft’s tokens, so you pay to run two models and get little more than one model’s output.

One practical note: the vLLM team’s own published benchmarks used a 0.5B draft model (turboderp/Qwama-0.5B-Instruct) against Llama-3-70B - a 1:140 ratio, well outside the 1:8–1:12 sweet spot. They still saw 1.5× speedup at low query rates. At high query rates, they saw 1.4× slowdown. This reinforces the core point: query rate matters more than ratio. Even a suboptimal draft model can look like a win on an isolated benchmark. The failure mode occurs under production load.

Temperature destroys your benchmark numbers

Temperature is a setting that controls how predictable or random a model’s output is. At temperature=0, the model always picks the single most likely next token - this is called greedy decoding, and the output is fully deterministic. As you increase temperature, the model starts choosing from a wider range of possible tokens, making the output more varied and creative. At temperature=1.0 or above, the output can feel much more open-ended and less predictable.

This matters for speculative decoding because the draft model’s job is to guess what the target model will say next. At temperature=0, the target model is highly predictable - it always picks the top token - so the draft model guesses correctly most of the time. But at higher temperatures, the target model picks more surprising tokens, and the draft model’s guesses start missing more often. When the draft model misses, those candidate tokens get rejected and the work is wasted.

The problem is that most benchmarks are run at temperature=0, which is where speculative decoding looks best. In production, the temperature you actually use depends on the task - code generation and structured output tend to use low temperatures, while chat and creative writing typically use higher ones. If your real workload runs at temperature=0.7 or above, your benchmark numbers will be significantly more optimistic than what you see in production.

The simplest way to see this directly is to run the same prompt at different temperatures and watch spec_decode_draft_acceptance_rate shift in real time:

import requests

VLLM_URL = "http://localhost:8000/v1/chat/completions"
METRICS_URL = "http://localhost:8000/metrics"
PROMPT = "Write a short story about a robot learning to paint."

def get_acceptance_rate():
    """Return the acceptance-rate gauge from /metrics, or None if it is absent."""
    text = requests.get(METRICS_URL, timeout=10).text
    for line in text.split("\n"):
        if "spec_decode_draft_acceptance_rate" in line and not line.startswith("#"):
            return float(line.split()[-1])
    return None

for temperature in [0.0, 0.4, 0.8, 1.0]:
    # Send 20 requests at this temperature
    for _ in range(20):
        response = requests.post(VLLM_URL, json={
            "model": "meta-llama/Llama-3.1-70B-Instruct",
            "messages": [{"role": "user", "content": PROMPT}],
            "temperature": temperature,
            "max_tokens": 200,
        }, timeout=300)
        response.raise_for_status()  # fail loudly rather than measure error responses
    # The gauge is server-wide, so it can still reflect earlier traffic.
    # Run this against an otherwise idle server for a clean reading.
    rate = get_acceptance_rate()
    shown = "n/a" if rate is None else f"{rate:.2f}"
    print(f"temperature={temperature}  acceptance_rate={shown}")

Expected output shape (your numbers will vary by model pair and prompt):

temperature=0.0  acceptance_rate=0.81
temperature=0.4  acceptance_rate=0.71
temperature=0.8  acceptance_rate=0.52
temperature=1.0  acceptance_rate=0.38

Find the row that matches your production temperature. If the acceptance rate is below 0.5, stop - speculative decoding is net-negative on this workload, and you should disable it. If it’s between 0.5 and 0.65, you’re near the break-even line; measure P99 latency under real load before deciding. Above 0.65, you’re likely seeing a genuine benefit - confirm with a baseline comparison and ship it.

Higher temperature → flatter probability distribution → draft model’s concentrated guesses miss more often → more rejections. Query rate has a similarly large effect. In the published vLLM results below, the pattern is more important than the specific multipliers:

Workload	Dataset	QPS	Measured result	Method
Summarization	CNN/DailyMail	Low (QPS=1)	2.8× speedup	N-gram speculative decoding
Chat / general	ShareGPT	Low (QPS=1)	1.5× speedup	Draft model (0.5B → 70B)
Chat / general	ShareGPT	High	1.4× slowdown	Draft model (0.5B → 70B)
Summarization	CNN/DailyMail	High	1.8× slowdown	N-gram speculative decoding

Source: vLLM Team, “How Speculative Decoding Boosts vLLM Performance by up to 2.8x”, October 2024. Benchmarked on Llama-3-70B with 4× H100.

The last two rows are not edge cases. At production query rates - where your GPU is compute-saturated - speculative decoding adds overhead instead of removing it. The extra compute required to propose and verify tokens compounds under load. The benefit evaporates, and the cost remains.

The intuition for the crossover point: an 8B draft model costs roughly 1/9th the compute of a 70B target model. If fewer than roughly half your draft tokens are accepted, the compute you spent on the draft model plus the wasted verification steps outweighs the tokens you gained. At high QPS, you’re already compute-saturated - there is no idle capacity to absorb that overhead.
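The crossover intuition can be made concrete with a toy cost model. This is a sketch under loud assumptions, not a prediction: each draft token is accepted independently with probability `alpha`, a draft pass costs `draft_cost` target passes (1/9 for the 8B/70B pairing), and `verify_cost` is a hypothetical knob for how expensive the verify pass is relative to a normal decode step (1.0 when the GPU has idle compute, higher as it saturates).

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify pass (k proposals plus one target token),
    assuming each proposal is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float,
                      verify_cost: float = 1.0) -> float:
    """Tokens per unit of compute, relative to plain decoding (1 token per pass)."""
    step_cost = verify_cost + k * draft_cost
    return expected_tokens_per_step(alpha, k) / step_cost


for verify_cost in (1.0, 1.5, 2.0):
    row = ", ".join(
        f"alpha={alpha:.1f}: {estimated_speedup(alpha, 5, 1 / 9, verify_cost):.2f}x"
        for alpha in (0.4, 0.5, 0.65, 0.8)
    )
    print(f"verify_cost={verify_cost}: {row}")
```

In this model the break-even acceptance rate is low while verification is nearly free, and it climbs as `verify_cost` rises. That is one way to read the article's point: the same acceptance rate that pays off on an idle GPU can be a net loss once the GPU is compute-saturated.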

The practical implication is straightforward: don’t assume speculative decoding is helping just because your benchmark looked good. If your users are running creative writing, open-ended chat, or anything at temperatures above 0.7, measure acceptance rates at those temperatures specifically. A benchmark run at temperature=0 tells you nothing about what happens at temperature=0.8 in production.

Memory Budget Reality

Running speculative decoding means running two models simultaneously. This is obvious in principle and surprisingly painful in practice once you work through the VRAM math.

Actual footprint numbers

The figures below use Llama-3.1 as a concrete example. Weight sizes are derived from parameter counts and precision (BF16 = 2 bytes/param, INT8 = 1 byte/param, INT4 = 0.5 bytes/param) - your actual usable VRAM per GPU will vary depending on your hardware configuration and host-level overhead.

Llama-3.1-70B target model:

BF16: ~140GB → too large for a single 80GB H100; requires 2× H100 (160GB total), leaving only ~20GB free before you add the draft model or KV cache
INT8: ~70GB → fits on a single H100 with ~10GB to spare, or on 2× H100 with ~90GB free for KV cache and the draft model
INT4 (AWQ/GPTQ): ~35GB → fits comfortably on a single H100 with ~45GB free, the most memory-efficient option, but at some cost to output quality

Llama-3.1-8B draft model:

BF16: ~16GB
INT8: ~8GB
INT4: ~4GB

On a 2× H100 SXM5 setup (160GB total), a common configuration for production 70B serving:

Target (70B INT8, ~70GB) + Draft (8B BF16, ~16GB) = ~86GB weights. That leaves roughly 74GB for KV cache, page tables, and activation memory across both GPUs.
Target (70B BF16, ~140GB) + Draft (8B BF16, ~16GB) = ~156GB weights. That leaves only ~4GB - effectively unusable for any meaningful batch size or context length.
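The arithmetic behind these figures is worth keeping as a small script so you can rerun it for your own model pair and GPU count. The sketch below uses the same bytes-per-parameter assumptions stated above and counts weights only; `weight_gb` and `remaining_gb` are hypothetical helper names.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB: parameters (billions) x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def remaining_gb(total_vram_gb: float, models) -> float:
    """VRAM left after loading every (params_billion, precision) pair in `models`."""
    return total_vram_gb - sum(weight_gb(p, prec) for p, prec in models)


# 2x H100 (160GB total): INT8 target + BF16 draft, then BF16 for both.
print(remaining_gb(160, [(70, "int8"), (8, "bf16")]))
print(remaining_gb(160, [(70, "bf16"), (8, "bf16")]))
# 2x H200 (282GB total): BF16 for both.
print(remaining_gb(282, [(70, "bf16"), (8, "bf16")]))
```

The remainder still has to cover KV cache, activations, and page tables, so treat it as an upper bound on KV cache space rather than the budget itself.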

The practical takeaway for 2× H100: if you want to run speculative decoding with a 70B target, you have to quantize the target to INT8 or lower. Running both models in full BF16 simply doesn’t leave enough room for the KV cache.

On DigitalOcean’s H200 GPU Droplets (2× H200 SXM5, 282GB total), this constraint goes away. Running 70B BF16 + 8B BF16 uses ~156GB of weights and leaves ~126GB for KV cache - compared to 74GB on the H100 INT8 configuration, that’s roughly 1.7× the headroom. More importantly, you no longer need to quantize the target model at all to make the memory budget work. If you’re serving long-context requests (32K+ tokens), that difference is significant.

How the draft model shrinks your KV cache

To understand this, you need to know what a KV cache is. Every time a model processes a token, it generates intermediate values called keys and values (K and V). These are stored in GPU memory, so the model doesn’t have to recompute them on every step - that’s the KV cache. The more tokens a request has (longer conversations, longer documents), the more KV cache it needs. The more concurrent requests you serve, the more KV cache you need in total.

vLLM allocates KV cache at startup using a simple formula: (total VRAM × gpu_memory_utilization) − model weights = KV cache budget, where gpu_memory_utilization defaults to 0.9. Whatever the model weights don’t use within that allowance goes to KV cache. This means adding a 16GB draft model costs 16GB of KV cache - it’s a direct, one-for-one trade. You now have 16GB less capacity to hold the context for in-flight requests.
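To make the one-for-one trade tangible, it helps to convert gigabytes into tokens of context. The sketch below assumes the published Llama-3.1-70B architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and a BF16 KV cache at 2 bytes per value. It ignores activation memory and tensor-parallel sharding, and the function names are hypothetical.

```python
def kv_budget_gb(total_vram_gb: float, gpu_memory_utilization: float,
                 weights_gb: float) -> float:
    """The article's formula: (total VRAM x utilization) - model weights."""
    return total_vram_gb * gpu_memory_utilization - weights_gb


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """One key and one value vector per KV head, per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def tokens_displaced(draft_weights_gb: float, bytes_per_token: int) -> int:
    """How many tokens of KV cache the draft model's weights would have held."""
    return int(draft_weights_gb * 1e9 // bytes_per_token)


# Assumed Llama-3.1-70B shape: 80 layers, 8 KV heads, head_dim 128.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Budget without draft: {kv_budget_gb(160, 0.9, 70):.0f} GB")
print(f"Budget with 16GB draft: {kv_budget_gb(160, 0.9, 70 + 16):.0f} GB")
print(f"Tokens displaced by a 16GB draft: {tokens_displaced(16, per_token)}")
```

Dividing the displaced tokens by your typical context length gives a rough count of concurrent requests you give up by loading the draft model.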

In practice, this means one or both of the following:

Shorter maximum context length - requests with long inputs or long conversation histories will hit memory limits sooner
Smaller maximum batch size - you can serve fewer concurrent requests before running out of KV cache space

The latency gains from faster token generation can easily be wiped out by the latency increase from being forced to process fewer requests in parallel. A useful rule of thumb: if your target model already uses more than 70% of your GPU memory, adding a draft model will reduce how many requests you can serve at once - test under real load before enabling it.

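# vLLM config for speculative decoding on 2× H100
# Note: by the math above, an unquantized BF16 70B checkpoint plus the BF16 draft is ~156GB of
# weights, more than the 0.92 × 160GB allowance. On 2× H100, point --model at a quantized checkpoint.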
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92 \
  --dtype bfloat16

The --speculative-draft-tensor-parallel-size 1 flag is worth noting explicitly: the draft model typically runs on a single GPU while the target model spans both. This keeps the draft model fast (small model, single-GPU forward pass) but means you’re effectively dedicating 16GB on one of your H100s to a model whose candidate tokens you may reject half the time or more at high temperatures.

Quantization and Speculative Decoding

Running speculative decoding with a quantized target model is one of the most common production configurations - and one of the least understood in terms of what it actually does to acceptance rates.

Why quantization changes acceptance rates

The speculative decoding algorithm works by comparing the draft model’s proposed tokens against what the target model would have generated. When the target model is quantized, its output distribution shifts slightly. INT8 quantization introduces small numerical rounding errors across the weight matrices; INT4 (GPTQ, AWQ) introduces larger ones that accumulate across layers. These errors are systematic, not random - the quantized model consistently outputs slightly different probability distributions than the full-precision version.

A same-family draft model tracks the full-precision target’s probability distribution, not the quantized variant’s, so its proposals diverge from the quantized model’s preferences roughly in proportion to how aggressive the quantization is. Tokens the draft model proposes with high confidence may then be rejected by the quantized target - not because the proposal was semantically wrong, but because the quantized model assigns that token less probability than the draft did, which lowers its chance of passing the acceptance test.

The magnitude of the acceptance rate penalty depends on your model architecture, quantization method, and prompt distribution - there is no universal number, and we haven’t measured it on DO hardware. What’s consistent across reported configurations is the direction and relative ordering: INT8 imposes a smaller penalty than INT4, and the penalty grows with quantization aggressiveness. The exact threshold that matters for your deployment is the one where acceptance rate drops below your break-even point - which is why the right approach is to measure spec_decode_draft_acceptance_rate before and after switching quantization levels on your actual workload, rather than relying on any general estimate. If the acceptance rate drops more than a few points on the same prompt distribution, that’s the signal to step back from INT4 to INT8 on the target.

The asymmetry between quantizing draft vs. target

The two models can be quantized independently, and the performance implications are not symmetric:

Quantizing the target model has the biggest impact because it is the model that decides whether each token proposed by the draft model is accepted. If quantization changes the target model’s predictions, fewer proposed tokens will be accepted, reducing the benefits of speculative decoding.

Quantizing the draft model is usually less risky. The draft model only suggests tokens, while the target model still verifies every suggestion. Even if quantization makes the draft model slightly less accurate, the verification process stays the same.

In practice, it’s generally safe to quantize the draft model more aggressively than the target model. For example, using an INT8 draft model with a BF16 target model usually has little effect on acceptance rates. However, using an INT4 target model, even with a BF16 draft model, can noticeably reduce acceptance rates, especially when generation uses sampling instead of deterministic decoding.

Target quantization	Draft quantization	VRAM (2× H100, 70B+8B)	Notes
BF16 (~140GB)	BF16 (~16GB)	~156GB total	Doesn’t fit; ~4GB left for KV cache
INT8 (~70GB)	BF16 (~16GB)	~86GB	Recommended baseline; ~74GB for KV cache
INT8 (~70GB)	INT8 (~8GB)	~78GB	Saves 8GB; minimal acceptance rate impact vs. BF16 draft
INT4 (~35GB)	BF16 (~16GB)	~51GB	Fits on single H100; acceptance rate penalty is real
INT4 (~35GB)	INT4 (~4GB)	~39GB	Maximum memory efficiency; expect noticeably lower acceptance rates

Weights only. KV cache, activations, and page tables consume additional VRAM. Actual weights may differ slightly by quantization implementation.

The flags

Draft model quantization is handled separately from the target model via --speculative-model-quantization. The main --quantization flag applies only to the target:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model-quantization bitsandbytes \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92

There’s a second flag worth knowing about: --spec-decoding-acceptance-method. The default is rejection_sampler, which enforces strict token acceptance based on the probability ratio between draft and target distributions. The alternative, typical_acceptance_sampler, is configurable - it trades a small reduction in output quality for a higher acceptance rate. If you’re running a quantized target model and seeing acceptance rates that would otherwise push you below the break-even threshold, switching to typical_acceptance_sampler can recover some of that loss without changing your model configuration:

--spec-decoding-acceptance-method typical_acceptance_sampler \
--typical-acceptance-sampler-posterior-threshold 0.09 \
--typical-acceptance-sampler-posterior-alpha 0.3

The defaults (threshold=0.09, alpha=0.3) are reasonable starting points. Lowering the threshold accepts more draft tokens; raising it enforces stricter quality. Test on your actual prompt distribution before adjusting.

One more useful flag for quantized deployments under variable load: --speculative-disable-by-batch-size. Set this to a batch size threshold, and the server will automatically disable speculative decoding for new requests when the queue exceeds that size. This gives you the low-QPS gains without manually toggling the configuration when traffic spikes.

What to watch after switching to a quantized configuration

After enabling quantization on either model, pull spec_decode_draft_acceptance_rate and compare it against your baseline (non-quantized) measurement on the same prompt distribution. A drop of more than 5 percentage points relative to baseline suggests the quantization penalty is significant enough to revisit your configuration choice - typically by moving the target model from INT4 to INT8, or by switching acceptance methods.
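The before/after comparison can be scripted so it runs as part of a rollout check. This sketch assumes you have already collected accepted and drafted token counts for each configuration on the same prompt distribution; the function names are hypothetical and the 5-point default comes from the paragraph above.

```python
def acceptance_rate(accepted_tokens: int, draft_tokens: int) -> float:
    """Fraction of proposed draft tokens that the target accepted."""
    if draft_tokens <= 0:
        raise ValueError("no draft tokens recorded")
    return accepted_tokens / draft_tokens


def quantization_penalty_points(baseline: float, quantized: float) -> float:
    """Drop in acceptance rate, in percentage points (positive means worse)."""
    return (baseline - quantized) * 100.0


def should_revisit(baseline: float, quantized: float,
                   max_drop_points: float = 5.0) -> bool:
    """True when the drop exceeds the tolerance and the config deserves a second look."""
    return quantization_penalty_points(baseline, quantized) > max_drop_points


# Illustrative counts only, not measurements.
baseline = acceptance_rate(36600, 48200)
quantized = acceptance_rate(33000, 48200)
print(f"baseline={baseline:.3f} quantized={quantized:.3f} "
      f"revisit={should_revisit(baseline, quantized)}")
```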

The right quantization configuration depends on your VRAM constraints, your context length requirements, and your acceptance rate tolerance. The general priority order: INT8 target over INT4 target, quantize the draft before quantizing the target further, and measure acceptance rate before and after any quantization change.

Continuous Batching: Where the Scheduler Gets Complicated

The performance advantage of continuous batching comes from requests sharing GPU compute in the same forward pass, with new requests slotting into available capacity dynamically. PagedAttention manages the KV cache to enable this without padding or memory waste. Speculative decoding introduces structural assumptions that create friction with both of these mechanisms.

What the scheduler actually assumes

Under standard continuous batching, every iteration of the forward pass generates exactly one new token per request. This uniformity is what makes scheduling clean: you know the output length of each step, you can predict memory allocation, and you can pack requests efficiently.

Speculative decoding breaks this assumption. Each iteration consists of:

1. The draft model generating k candidate tokens per request
2. The target model verifying all k+1 positions (k draft tokens + 1 correction) in a single forward pass
3. Accepting some prefix of those tokens based on the acceptance criterion

The number of tokens actually appended to each request’s sequence after step 3 is variable - somewhere between 1 and k+1. This creates irregular batch shapes that complicate the scheduler.
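A quick simulation shows how uneven a single step can be across a batch. This is a toy model, not vLLM's scheduler: it assumes each draft token is accepted independently with a per-request probability, and `tokens_appended` and `simulate_batch` are hypothetical names.

```python
import random
from collections import Counter
from typing import List


def tokens_appended(alpha: float, k: int, rng: random.Random) -> int:
    """Tokens one request gains in one step: the accepted prefix plus one target token."""
    accepted = 0
    while accepted < k and rng.random() < alpha:
        accepted += 1
    return accepted + 1  # always between 1 and k + 1


def simulate_batch(alphas: List[float], k: int, seed: int = 0) -> List[int]:
    """Tokens appended per request for one speculative step over a batch."""
    rng = random.Random(seed)
    return [tokens_appended(alpha, k, rng) for alpha in alphas]


# Half the batch behaves like structured extraction, half like creative chat.
batch = [0.8] * 8 + [0.35] * 8
step = simulate_batch(batch, k=5)
print("tokens appended per request:", step)
print("distribution:", dict(sorted(Counter(step).items())))
```

Every request paid for the same draft and verify work in that step, but the number of tokens each one gained differs, which is the irregularity the scheduler has to absorb.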

What this means for mixed workloads

Under a homogeneous workload - all requests at similar temperatures, similar lengths, similar acceptance rates - the irregularity is predictable enough that the scheduler handles it gracefully. Under a mixed workload, the picture is messier.

Imagine a batch where half the requests are running structured JSON extraction at temperature=0.1 (acceptance rate ~80%) and half are running open-ended creative generation at temperature=1.0 (acceptance rate ~35%). The high-acceptance requests are completing their speculative steps efficiently. The low-acceptance requests are running the draft model, paying the memory bandwidth cost of a second forward pass, and rejecting most of what it produces - effectively adding overhead per token rather than removing it.

The scheduler can’t split these cleanly. They share the same verification pass, which means the requests that don’t benefit from speculation are still paying its cost. In practice, workload homogeneity matters more than most teams realize. Speculative decoding is well-suited for dedicated deployments - a code completion endpoint, a structured extraction pipeline, a RAG system with constrained output formats. It is a poor fit for general-purpose chat APIs where temperature and task type vary across requests in the same batch.

The signal to watch for is a gap between P50 and P99 latency that widens after enabling speculative decoding. Under a mixed workload, P50 often improves (the high-acceptance requests pulling the median down) while P99 gets worse (the low-acceptance requests adding tail latency that compounds under load). If your P50 looks like a win but your P99 is a regression, the scheduler interaction is likely the cause. To confirm: run the same load test against a homogeneous low-temperature workload and compare P99 behavior. If it tightens significantly, you have a mixed-workload problem, not a speculative decoding problem.
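The P50/P99 check described above is easy to automate once you have per-request latencies from a baseline run and a speculative run. The sketch below uses a nearest-rank percentile and hypothetical function names; the latency lists are made-up values chosen to show the pattern, not measurements.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def tail_report(baseline_ms: List[float], spec_ms: List[float]) -> Dict[str, object]:
    """Compare P50 and P99 and flag the 'median win, tail regression' pattern."""
    report: Dict[str, object] = {
        "baseline_p50": percentile(baseline_ms, 50),
        "baseline_p99": percentile(baseline_ms, 99),
        "spec_p50": percentile(spec_ms, 50),
        "spec_p99": percentile(spec_ms, 99),
    }
    report["p50_win_p99_regression"] = (
        report["spec_p50"] < report["baseline_p50"]
        and report["spec_p99"] > report["baseline_p99"]
    )
    return report


# Made-up latencies: most requests get faster, a few get much slower.
baseline = [100.0] * 100
speculative = [80.0] * 95 + [200.0] * 5
print(tail_report(baseline, speculative))
```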

How to Measure This on Your Own Deployment

The published reference numbers give you a baseline; what actually matters for your deployment decision are the numbers you collect on your own hardware, with your own prompt distribution, at your own query rates. Here is exactly what to measure and how.

The most common operational mistake is measuring the wrong thing. A benchmark that shows 2× speedup on isolated single-request tests will not reliably predict behavior under concurrent production load.

What to actually measure

Acceptance rate, as a distribution. The spec_decode_draft_acceptance_rate metric is available at the /metrics endpoint as a server-wide value, so a single reading is an average across all traffic. Sample it over time and, where you can, segment traffic by workload or temperature so you see the distribution rather than one number. If P10 of your acceptance rate distribution is below 0.5, you have a substantial portion of your traffic where speculative decoding is net-negative.

TTFT vs. TPOT separately. Speculative decoding affects time-per-output-token (TPOT), not time-to-first-token (TTFT). TTFT may actually increase slightly - the draft model adds a prefill step before the first token is returned. If your SLO is primarily TTFT-bound (e.g., interactive chat where users care about responsiveness more than throughput), speculative decoding may not move the metric you care about.

P50 vs. P99 latency under load. This is where the scheduler interactions surface. A benchmark running 10 concurrent requests at load=0.3 will look very different from 100 concurrent requests at load=0.8. Run your load tests at the concurrency levels you actually see in production.
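Separating TTFT from TPOT only requires the request start time and the arrival time of each streamed token. The sketch below is one common way to define the two metrics, with a hypothetical function name; how you capture the timestamps (for example from a streaming client) is left to your load-testing tool.

```python
from typing import List, Optional, Tuple


def ttft_and_tpot(request_start: float,
                  token_times: List[float]) -> Tuple[float, Optional[float]]:
    """Return (TTFT, TPOT) in the same time unit as the inputs.

    TTFT is the delay until the first token arrives. TPOT is the mean gap between
    subsequent tokens, or None when only one token was produced.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Made-up timestamps in seconds.
ttft, tpot = ttft_and_tpot(0.0, [0.5, 1.0, 1.5, 2.0])
print(f"TTFT={ttft:.2f}s TPOT={tpot:.2f}s")
```

With speculative decoding, tokens tend to arrive in bursts of several at a time, so the mean gap is more informative than any single inter-token interval.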

# Pull speculative decoding metrics from the metrics endpoint
curl -s http://localhost:8000/metrics | grep spec_decode

A healthy deployment looks like this - acceptance rate consistently above 0.7, accepted tokens close to the number of draft tokens:

# HELP vllm:spec_decode_draft_acceptance_rate Speculative decoding draft acceptance rate
vllm:spec_decode_draft_acceptance_rate{...} 0.76
vllm:spec_decode_num_draft_tokens_total{...} 48200
vllm:spec_decode_num_accepted_tokens_total{...} 36600   # ~76% accepted

An unhealthy deployment - acceptance rate below 0.5, most draft tokens rejected, overhead not paying off:

vllm:spec_decode_draft_acceptance_rate{...} 0.38
vllm:spec_decode_num_draft_tokens_total{...} 51000
vllm:spec_decode_num_accepted_tokens_total{...} 19400   # ~38% accepted

At 38% acceptance with a 1:9 draft/target ratio, you are adding latency, not removing it. This is what a high-temperature or mismatched-family draft model looks like in production. If your metrics look like the second block, turn off speculative decoding until you’ve addressed the root cause.
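Because the token counters only ever increase, the acceptance rate for a specific window of traffic is the ratio of their deltas between two scrapes. The sketch below parses the metric names shown in the blocks above from raw /metrics text; the `model_name="demo"` label and the sample values are illustrative, and the simple parser assumes label values contain no spaces or `#` characters.

```python
from typing import Dict, Optional

DRAFT = "vllm:spec_decode_num_draft_tokens_total"
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"


def parse_counters(metrics_text: str) -> Dict[str, float]:
    """Sum samples per metric name from Prometheus text-format output."""
    values: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comment lines and trailing comments
        if not line:
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        values[name] = values.get(name, 0.0) + float(value)
    return values


def acceptance_between(before_text: str, after_text: str) -> Optional[float]:
    """Acceptance rate for the traffic served between two scrapes."""
    before, after = parse_counters(before_text), parse_counters(after_text)
    drafted = after.get(DRAFT, 0.0) - before.get(DRAFT, 0.0)
    accepted = after.get(ACCEPTED, 0.0) - before.get(ACCEPTED, 0.0)
    return accepted / drafted if drafted > 0 else None


SAMPLE_BEFORE = """\
# first scrape (illustrative values)
vllm:spec_decode_num_draft_tokens_total{model_name="demo"} 48200
vllm:spec_decode_num_accepted_tokens_total{model_name="demo"} 36600
"""
SAMPLE_AFTER = """\
# second scrape (illustrative values)
vllm:spec_decode_num_draft_tokens_total{model_name="demo"} 99200
vllm:spec_decode_num_accepted_tokens_total{model_name="demo"} 56000
"""

rate = acceptance_between(SAMPLE_BEFORE, SAMPLE_AFTER)
print(f"acceptance rate in window: {rate:.2f}")
```

Scraping before and after each load-test phase gives you a per-phase acceptance rate without restarting the server between runs.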

The monitoring setup you need before trusting the flag:

Acceptance rate histogram (P10, P50, P90) segmented by temperature bucket
TTFT and TPOT at P50, P95, P99 - compared against a baseline without speculative decoding
GPU memory utilization and KV cache usage (to catch the squeeze)
Throughput (tokens/second) under your actual production concurrency, not synthetic single-request benchmarks

Why aggregate benchmarks lie

Consider a deployment where 60% of requests are low-temperature structured queries with 82% acceptance rates, and 40% are high-temperature creative requests with 38% acceptance rates. The weighted average acceptance rate is ~64%, which looks acceptable. But the 40% of requests that are degrading performance are doing so in a way that adds tail latency to the whole batch. P50 looks like a win; P99 is a regression. Aggregate benchmarks will show the win and hide the regression.
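The arithmetic in this example is short enough to check directly. The sketch below uses the traffic shares and acceptance rates from the paragraph above and reports both the blended average and the share of traffic sitting below the 0.5 break-even line; the variable names are illustrative.

```python
# (share of traffic, acceptance rate) per workload segment, from the example above.
SEGMENTS = [
    (0.60, 0.82),  # low-temperature structured queries
    (0.40, 0.38),  # high-temperature creative requests
]
BREAK_EVEN = 0.5

blended = sum(share * rate for share, rate in SEGMENTS)
share_below = sum(share for share, rate in SEGMENTS if rate < BREAK_EVEN)

print(f"blended acceptance rate: {blended:.3f}")
print(f"share of traffic below break-even: {share_below:.0%}")
```

The blended figure sits comfortably above 0.5 even though four in ten requests are below it, which is why the per-segment view matters more than the average.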

Decision Framework

Speculative decoding delivers when these conditions hold:

Use it when:

Your workload is temperature-homogeneous and skews toward structured output or code - target temperature ≤ 0.5
You have VRAM headroom after target model weights (draft model should consume no more than ~15% of total available VRAM)
Your batches are homogeneous - you’re not mixing request types with dramatically different acceptance rates in the same batch
You’ve validated acceptance rates at production temperatures, not just greedy benchmarks
Your primary SLO is TPOT/throughput, not TTFT

Leave it off when:

Temperature varies widely across your request mix, or your median temperature is above 0.7
You’re VRAM-constrained and serving long-context requests - the KV cache squeeze will cost you more than the token throughput gains
Your workload is TTFT-bound rather than throughput-bound
You haven’t set up acceptance rate monitoring - you can’t tell whether it’s helping

Draft model selection checklist:

Requirement	Why it matters
Same model family	Shared training distribution = higher acceptance rates
1:8–1:12 size ratio	Large enough to predict accurately, small enough to be cheap
Same tokenizer	Mismatched tokenizers require expensive re-encoding between models
Quantize the draft before relaxing the family constraint	INT8 draft is fine; relaxing the family constraint hurts acceptance more than INT8 does

Speculative decoding is a genuine win for the right workloads. For everything else, it adds cost without benefit - the flag is not a universal accelerant. Treat it like any other performance optimization: measure first, at production conditions, then decide.
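The conditions in this section can be encoded as a small pre-flight check. This is one possible reading of the framework, not an official rule set: the `Deployment` fields and `recommend` function are hypothetical, and the thresholds (0.5 and 0.65 acceptance, 0.5 and 0.7 temperature, ~15% VRAM) are the ones quoted in this article.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Deployment:
    median_temperature: float
    mixed_temperatures: bool          # does temperature vary widely across requests?
    draft_vram_fraction: float        # draft weights / total available VRAM
    ttft_bound: bool                  # is the primary SLO time-to-first-token?
    acceptance_rate: Optional[float]  # measured at production temperature, None if unmonitored


def recommend(d: Deployment) -> Tuple[str, List[str]]:
    """Return a recommendation and the reasons behind it."""
    blockers: List[str] = []
    if d.acceptance_rate is None:
        blockers.append("no acceptance rate monitoring")
    elif d.acceptance_rate < 0.5:
        blockers.append("acceptance rate below 0.5")
    if d.mixed_temperatures or d.median_temperature > 0.7:
        blockers.append("temperature varies widely or median is above 0.7")
    if d.ttft_bound:
        blockers.append("SLO is TTFT-bound")
    if blockers:
        return "leave off", blockers

    cautions: List[str] = []
    if d.draft_vram_fraction > 0.15:
        cautions.append("draft model uses more than ~15% of VRAM")
    if d.acceptance_rate < 0.65:
        cautions.append("acceptance rate is near break-even")
    if d.median_temperature > 0.5:
        cautions.append("median temperature is above 0.5")
    if cautions:
        return "test under production load", cautions
    return "enable", []


print(recommend(Deployment(0.2, False, 0.10, False, 0.80)))
print(recommend(Deployment(0.9, True, 0.10, False, 0.40)))
```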

FAQ

Does speculative decoding change the model’s outputs?

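No, not with the default rejection_sampler acceptance method. Its acceptance criterion guarantees the final output distribution is identical to what the target model would have produced on its own, so speculative decoding is a pure latency optimization - it changes how fast tokens are generated, not what tokens are generated. You can enable it without touching your prompts, sampling parameters, or output validation. The exception is typical_acceptance_sampler, which deliberately trades a small reduction in output quality for a higher acceptance rate.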

Does it improve time-to-first-token (TTFT) or time-per-output-token (TPOT)?

It improves TPOT, not TTFT. The draft model adds a small prefill step before the first token is returned, so TTFT may actually increase slightly. If your SLO is primarily TTFT-bound - interactive chat where users notice the first response delay more than the generation speed - speculative decoding may not move the metric that matters to you. It’s most valuable when your bottleneck is throughput or output speed, not initial responsiveness.

What’s the difference between draft model and n-gram speculative decoding?

Draft model speculation uses a separate smaller model to propose tokens - it works across any prompt type but costs VRAM and requires a compatible model family. N-gram speculation reuses repeated phrases from the input prompt itself, which makes it nearly free on memory but only useful when the output closely echoes the input (summarization, RAG, document Q&A). For general chat or code generation, use a draft model. For summarization pipelines where the answer largely paraphrases the source, n-gram is often the better choice and requires no additional model at all.
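The n-gram idea is simple enough to sketch without any model at all. The function below illustrates the general lookup technique, not vLLM's implementation: it takes the last `n` tokens of the sequence so far, finds their most recent earlier occurrence, and proposes the tokens that followed. `ngram_propose` is a hypothetical name and the whitespace-split "tokens" stand in for real token IDs.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def ngram_propose(tokens: Sequence[T], n: int, k: int) -> List[T]:
    """Propose up to k tokens by matching the last n tokens against earlier text."""
    if n <= 0 or len(tokens) <= n:
        return []
    tail = list(tokens[-n:])
    # Scan right to left so the most recent earlier match wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == tail:
            return list(tokens[start + n:start + n + k])
    return []  # no match: fall back to normal decoding for this step


sequence = "the cat sat on the mat . the cat".split()
print(ngram_propose(sequence, n=2, k=3))
```

When the output echoes the input, as in summarization or document Q&A, these lookups hit often and cost no extra VRAM. When it does not, the function returns nothing and the step proceeds as plain decoding.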


About the author

Shaoni Mukherjee

With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. Currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
