AI Technical Writer

You add --speculative-model to your vLLM config, run a benchmark, and see roughly 2× token throughput. You ship it. Three weeks later, your 99th percentile latency (P99 - the slowest 1% of requests) has climbed, a subset of requests are hitting OOM errors, and your on-call engineer is staring at metrics that make no sense - the model is technically “faster,” but users are complaining the API feels slower.
Speculative decoding is a conditional optimization - it delivers real gains on the right workloads and quietly degrades everything else. Most teams find out which category they’re in the hard way.
This article is an operational framework for making the decision: which draft model to pick, what it costs your memory budget, how it interacts with the scheduler under real concurrency, and - most importantly - how to measure whether it’s actually helping in your environment.
A note on the numbers in this article: the performance figures we reference come from the vLLM team’s own published benchmarks (Llama-3-70B, 4× H100 SXM5, October 2024). This article’s core argument is that you should measure at your own production conditions. The vLLM data is the best published reference available; treat it as a calibration baseline, not a prediction for your deployment. The measurement section at the end tells you exactly what to run to get your own numbers.
Monitor spec_decode_draft_acceptance_rate in production. Below ~0.5, you’re adding latency, not removing it - turn it off.

The first decision - which draft model to use - is also the most important, and it’s rarely treated that way. The intuition is simple: use a smaller model from the same family, pick a reasonable --num-speculative-tokens, and let the math work out. Whether that actually works out depends almost entirely on one factor: how large the draft model is relative to the target model.
Normally, generating k tokens requires running the large target model k times - one full forward pass per token. Speculative decoding short-circuits this: the small draft model generates k candidate tokens quickly, then the target model checks all k candidates in a single forward pass, because it can evaluate every position in parallel. One verification pass is much faster than k generation passes.
```text
Standard generation: 5 tokens = 5 sequential target model passes

[70B] → t₁ → [70B] → t₂ → [70B] → t₃ → [70B] → t₄ → [70B] → t₅
(pass 1)     (pass 2)     (pass 3)     (pass 4)     (pass 5)

Speculative decoding: 5 proposed tokens = 5 cheap draft steps + 1 verify pass

[8B Draft]   →  t₁  t₂  t₃  t₄  t₅    (fast draft steps, all proposed)
                      |
                      ▼
[70B Target] → ✓t₁ ✓t₂ ✓t₃ ✗t₄  -     (one parallel verify pass)
               └─────────┘
               3 tokens accepted, 2 rejected
               target model ran once instead of five times
```
That holds only when enough draft tokens are accepted. Acceptance rate is the key variable. The size ratio matters because a larger draft model is better at predicting what the target model would have said. A 1B model simply doesn’t have enough capacity to reliably mimic a 70B model’s outputs, so more of its candidates get rejected. A larger draft model predicts more accurately, which means more accepted tokens per step.
A 1B draft model against a 70B target (~1:70 ratio) generates tokens fast and cheaply, but it guesses the wrong token too often, so the target model rejects most of what the draft proposed - and you’ve paid the cost of running two models without getting enough accepted tokens to make it worthwhile. A 13B draft against the same 70B target (~1:5 ratio) predicts well but costs 13B parameters worth of VRAM and compute, which shifts the crossover point where you’d have been better off just running the target model.
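To make the propose-then-verify step concrete, here is a minimal sketch of the greedy (temperature=0) case only. It is an illustration, not vLLM’s implementation: `verify_greedy` and the integer token IDs are hypothetical, and under sampling the acceptance test is probabilistic rather than an exact match.

```python
from typing import List, Sequence


def verify_greedy(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> List[int]:
    """Return the tokens appended after one speculative step (greedy case).

    draft_tokens:  the k tokens proposed by the draft model.
    target_tokens: the target model's own greedy choice at each of the k+1
                   positions scored in the single verification pass.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score k+1 positions for k draft tokens")

    accepted: List[int] = []
    for proposed, wanted in zip(draft_tokens, target_tokens):
        if proposed != wanted:
            # First mismatch: keep the target's token as the correction and
            # discard every draft token from this position onward.
            accepted.append(wanted)
            return accepted
        accepted.append(proposed)

    # All k draft tokens matched, so the extra scored position is a free token.
    accepted.append(target_tokens[-1])
    return accepted


# Hypothetical token IDs: the draft gets the first three positions right.
print(verify_greedy([11, 12, 13, 99, 98], [11, 12, 13, 14, 15, 16]))
# prints [11, 12, 13, 14]
```

Every step appends at least one token (the correction) and at most k+1, which is why a draft that misses early gives you almost nothing for the extra work.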
The mechanics of the algorithm point to a consistent sweet spot: a 1:8 to 1:12 size ratio using same-family, same-training-distribution models. The table below shows common pairings and the ratio math - treat these as a starting framework for deciding what to test, not as benchmarks to cite. The only acceptance rate that matters for your deployment is the one you measure on your own hardware, with your actual prompt distribution.
| Draft Model | Target Model | Ratio | Why it matters |
|---|---|---|---|
| Llama-3.1-8B | Llama-3.1-70B | ~1:9 | Sweet spot - same family, good capacity match |
| Qwen2.5-7B | Qwen2.5-72B | ~1:10 | Same-family pairing, similar ratio |
| Llama-3.2-1B | Llama-3.1-70B | ~1:70 | Too small - draft diverges from target too often |
| Llama-3.1-8B | Llama-3.1-405B | ~1:50 | Same draft, but much weaker relative to the larger target |
These pairings illustrate how the size ratio affects draft model quality. Acceptance rates vary significantly with hardware, quantization, and prompt distribution - measure on your own setup before drawing conclusions.
The 1B/70B pairing looks cheap but rarely pays off - the target model rejects too many of the draft’s tokens, so you pay for two models and get the throughput of one.
One practical note: the vLLM team’s own published benchmarks used a 0.5B draft model (turboderp/Qwama-0.5B-Instruct) against Llama-3-70B - a 1:140 ratio, well outside the 1:8–1:12 sweet spot. They still saw 1.5× speedup at low query rates. At high query rates, they saw 1.4× slowdown. This reinforces the core point: query rate matters more than ratio. Even a suboptimal draft model can look like a win on an isolated benchmark. The failure mode occurs under production load.
Temperature is a setting that controls how predictable or random a model’s output is. At temperature=0, the model always picks the single most likely next token - this is called greedy decoding, and the output is fully deterministic. As you increase temperature, the model starts choosing from a wider range of possible tokens, making the output more varied and creative. At temperature=1.0 or above, the output can feel much more open-ended and less predictable.
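A small sketch of how temperature reshapes the next-token distribution may help here. The logits are made up for illustration, and `softmax_with_temperature` is a hypothetical helper, not part of any serving API.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature=0 means greedy decoding: take the argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate tokens.
logits = [4.0, 2.0, 1.0, 0.0]
for t in (0.2, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t} top_token_prob={probs[0]:.2f}")
```

As the temperature rises, probability mass moves from the top token to the alternatives, so a sampled token is less likely to be the one a draft model would guess.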
This matters for speculative decoding because the draft model’s job is to guess what the target model will say next. At temperature=0, the target model is highly predictable - it always picks the top token - so the draft model guesses correctly most of the time. But at higher temperatures, the target model picks more surprising tokens, and the draft model’s guesses start missing more often. When the draft model misses, those candidate tokens get rejected and the work is wasted.
The problem is that most benchmarks are run at temperature=0, which is where speculative decoding looks best. In production, the temperature you actually use depends on the task - code generation and structured output tend to use low temperatures, while chat and creative writing typically use higher ones. If your real workload runs at temperature=0.7 or above, your benchmark numbers will be significantly more optimistic than what you see in production.
The simplest way to see this directly is to run the same prompt at different temperatures and watch spec_decode_draft_acceptance_rate shift in real time:
```python
import requests

VLLM_URL = "http://localhost:8000/v1/chat/completions"
METRICS_URL = "http://localhost:8000/metrics"
PROMPT = "Write a short story about a robot learning to paint."


def get_acceptance_rate():
    """Read the acceptance-rate gauge from /metrics; None if it is not exported."""
    text = requests.get(METRICS_URL, timeout=10).text
    for line in text.splitlines():
        # Skip "# HELP" / "# TYPE" comment lines; the value is the last field.
        if "spec_decode_draft_acceptance_rate" in line and not line.startswith("#"):
            return float(line.split()[-1])
    return None


for temperature in [0.0, 0.4, 0.8, 1.0]:
    # Send 20 requests at this temperature
    for _ in range(20):
        response = requests.post(VLLM_URL, json={
            "model": "meta-llama/Llama-3.1-70B-Instruct",
            "messages": [{"role": "user", "content": PROMPT}],
            "temperature": temperature,
            "max_tokens": 200,
        }, timeout=300)
        response.raise_for_status()  # fail loudly instead of measuring nothing
    rate = get_acceptance_rate()
    if rate is None:
        print(f"temperature={temperature} acceptance_rate=unavailable")
    else:
        print(f"temperature={temperature} acceptance_rate={rate:.2f}")
```
Expected output shape (your numbers will vary by model pair and prompt):
```text
temperature=0.0 acceptance_rate=0.81
temperature=0.4 acceptance_rate=0.71
temperature=0.8 acceptance_rate=0.52
temperature=1.0 acceptance_rate=0.38
```
Find the row that matches your production temperature. If the acceptance rate is below 0.5, stop - speculative decoding is net-negative on this workload, and you should disable it. If it’s between 0.5 and 0.65, you’re near the break-even line; measure P99 latency under real load before deciding. Above 0.65, you’re likely seeing a genuine benefit - confirm with a baseline comparison and ship it.
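The decision rule above is simple enough to encode directly, for example in a deployment check. This is a sketch: `classify_acceptance_rate` and its return labels are hypothetical names, and the 0.5 and 0.65 cut-offs are this article’s rules of thumb, not vLLM defaults.

```python
def classify_acceptance_rate(rate: float) -> str:
    """Map a measured acceptance rate to the article's three-way decision."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("acceptance rate must be between 0 and 1")
    if rate < 0.5:
        return "disable"          # net-negative: turn speculative decoding off
    if rate <= 0.65:
        return "measure_p99"      # near break-even: load-test before deciding
    return "likely_benefit"       # confirm against a baseline, then ship


# Illustrative rates, one per temperature setting.
for temperature, rate in [(0.0, 0.81), (0.4, 0.71), (0.8, 0.52), (1.0, 0.38)]:
    print(f"temperature={temperature} -> {classify_acceptance_rate(rate)}")
```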
Higher temperature → flatter probability distribution → draft model’s concentrated guesses miss more often → more rejections. The pattern is more important than the specific multipliers:
| Workload | Dataset | QPS | Measured result | Method |
|---|---|---|---|---|
| Summarization | CNN/DailyMail | Low (QPS=1) | 2.8× speedup | N-gram speculative decoding |
| Chat / general | ShareGPT | Low (QPS=1) | 1.5× speedup | Draft model (0.5B → 70B) |
| Chat / general | ShareGPT | High | 1.4× slowdown | Draft model (0.5B → 70B) |
| Summarization | CNN/DailyMail | High | 1.8× slowdown | N-gram speculative decoding |
Source: vLLM Team, “How Speculative Decoding Boosts vLLM Performance by up to 2.8x”, October 2024. Benchmarked on Llama-3-70B with 4× H100.
The last two rows are not edge cases. At production query rates - where your GPU is compute-saturated - speculative decoding adds overhead instead of removing it. The extra compute required to propose and verify tokens compounds under load. The benefit evaporates, and the cost remains.
The intuition for the crossover point: an 8B draft model costs roughly 1/9th the compute of a 70B target model. If fewer than roughly half your draft tokens are accepted, the compute you spent on the draft model plus the wasted verification steps outweighs the tokens you gained. At high QPS, you’re already compute-saturated - there is no idle capacity to absorb that overhead.
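The crossover can be estimated with a standard back-of-the-envelope model. The sketch below makes two labeled assumptions: each draft token is accepted independently with the same probability, and one verification pass costs the same as one ordinary target pass. The function names are hypothetical, and `draft_cost_ratio=1/9` is the rough 8B-to-70B figure from the paragraph above.

```python
def expected_tokens_per_step(acceptance_rate: float, k: int) -> float:
    """Expected tokens appended per verification pass: (1 - a^(k+1)) / (1 - a)."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance rate must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    if acceptance_rate == 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)


def idealized_speedup(acceptance_rate: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens gained per step divided by the step's cost in target-pass units."""
    step_cost = k * draft_cost_ratio + 1.0  # k draft steps + 1 verify pass
    return expected_tokens_per_step(acceptance_rate, k) / step_cost


for rate in (0.38, 0.50, 0.65, 0.80):
    print(f"acceptance={rate:.2f} idealized_speedup={idealized_speedup(rate, 5, 1 / 9):.2f}x")
```

Treat the result as an upper bound. The equal-cost assumption only holds while the GPU has idle compute. Once it is saturated, verifying k+1 positions costs real throughput, and the break-even acceptance rate moves higher than this model suggests.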
The practical implication is straightforward: don’t assume speculative decoding is helping just because your benchmark looked good. If your users are running creative writing, open-ended chat, or anything at temperatures above 0.7, measure acceptance rates at those temperatures specifically. A benchmark run at temperature=0 tells you nothing about what happens at temperature=0.8 in production.
Running speculative decoding means running two models simultaneously. This is obvious in principle and surprisingly painful in practice once you work through the VRAM math.
Using Llama-3.1 as a concrete example. Weight sizes here are derived from parameter counts and precision (BF16 = 2 bytes/param, INT8 = 1 byte/param, INT4 = 0.5 bytes/param) - your actual usable VRAM per GPU will vary depending on your hardware configuration and host-level overhead.
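On a 2× H100 SXM5 setup (160GB total), a common configuration for production 70B serving, a BF16 target (~140GB) plus a BF16 draft (~16GB) comes to ~156GB of weights, leaving only ~4GB for KV cache. With the target quantized to INT8 (~70GB), weights drop to ~86GB and ~74GB remains for KV cache.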
The practical takeaway for H100: if you want to run speculative decoding with a 70B target, you have to quantize it to INT8. Running both models in full BF16 simply doesn’t leave enough room for the KV cache.
On DigitalOcean’s H200 GPU Droplets (2× H200 SXM5, 282GB total), this constraint goes away. Running 70B BF16 + 8B BF16 uses ~156GB of weights and leaves ~126GB for KV cache - compared to 74GB on the H100 INT8 configuration, that’s roughly 1.7× more headroom. More importantly, you no longer need to quantize the target model at all to make the memory budget work. If you’re serving long-context requests (32K+ tokens), that difference is significant.
To understand this, you need to know what a KV cache is. Every time a model processes a token, it generates intermediate values called keys and values (K and V). These are stored in GPU memory, so the model doesn’t have to recompute them on every step - that’s the KV cache. The more tokens a request has (longer conversations, longer documents), the more KV cache it needs. The more concurrent requests you serve, the more KV cache you need in total.
vLLM allocates KV cache at startup using a simple formula: total VRAM × gpu_memory_utilization (default: 0.9) - model weights = KV cache budget. Whatever memory the model weights don’t use goes entirely to KV cache. This means adding a 16GB draft model costs exactly 16GB of KV cache - it’s a direct, one-for-one trade. You now have 16GB less capacity to hold the context for in-flight requests.
In practice, this means one or both of the following: fewer requests can be in flight at the same time, or each request has less room for context.
The latency gains from faster token generation can easily be wiped out by the latency increase from being forced to process fewer requests in parallel. A useful rule of thumb: if your target model already uses more than 70% of your GPU memory, adding a draft model will reduce how many requests you can serve at once - test under real load before enabling it.
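The budget formula is easy to check by hand. The sketch below applies it to the 2× H100 INT8 case and converts the draft model’s 16GB into tokens of context. The helper names are hypothetical, 1GB is treated as 10^9 bytes, and the Llama-3.1-70B shape (80 layers, 8 KV heads, head dimension 128, BF16 cache) is an assumption to verify against your checkpoint’s config.

```python
def kv_cache_budget_gb(total_vram_gb: float, gpu_memory_utilization: float,
                       weights_gb: float) -> float:
    """KV cache budget = total VRAM x utilization cap - model weights."""
    return total_vram_gb * gpu_memory_utilization - weights_gb


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies (a key and a value per layer)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


target_only = kv_cache_budget_gb(160, 0.9, weights_gb=70)       # INT8 target
with_draft = kv_cache_budget_gb(160, 0.9, weights_gb=70 + 16)   # plus BF16 draft

per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
tokens_lost = (target_only - with_draft) * 1e9 / per_token

print(f"budget without draft: {target_only:.0f} GB")
print(f"budget with draft:    {with_draft:.0f} GB")
print(f"context given up:     ~{tokens_lost:,.0f} tokens across all requests")
```

Note that this applies the 0.9 utilization cap, so the budget comes out lower than the ~74GB figure obtained by subtracting weights from the full 160GB. Activations and other runtime overhead reduce it further.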
```bash
# vLLM config for speculative decoding on 2× H100
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92 \
  --dtype bfloat16
```
The --speculative-draft-tensor-parallel-size 1 flag is worth noting explicitly: the draft model typically runs on a single GPU while the target model spans both. This keeps the draft model fast (small model, single-GPU forward pass) but means you’re effectively dedicating 16GB on one of your H100s to a model whose candidate tokens the target may reject a large share of at high temperatures.
Running speculative decoding with a quantized target model is one of the most common production configurations - and one of the least understood in terms of what it actually does to acceptance rates.
The speculative decoding algorithm works by comparing the draft model’s proposed tokens against what the target model would have generated. When the target model is quantized, its output distribution shifts slightly. INT8 quantization introduces small numerical rounding errors across the weight matrices; INT4 (GPTQ, AWQ) introduces larger ones that accumulate across layers. These errors are systematic, not random - the quantized model consistently outputs slightly different probability distributions than the full-precision version.
The draft model’s proposals track the full-precision target’s probability distribution, not the quantized variant’s, and they diverge from the quantized model’s preferences in proportion to how aggressive the quantization is. This means tokens the draft model proposes with high confidence may be rejected by the quantized target - not because the proposal was semantically wrong, but because the quantized model assigns that token slightly less probability than the draft did, which lowers its chance of being accepted.
The magnitude of the acceptance rate penalty depends on your model architecture, quantization method, and prompt distribution - there is no universal number, and we haven’t measured it on DigitalOcean hardware. What’s consistent across reported configurations is the direction and relative ordering: INT8 imposes a smaller penalty than INT4, and the penalty grows with quantization aggressiveness. The exact threshold that matters for your deployment is the one where acceptance rate drops below your break-even point - which is why the right approach is to measure spec_decode_draft_acceptance_rate before and after switching quantization levels on your actual workload, rather than relying on any general estimate. If the acceptance rate drops more than a few points on the same prompt distribution, that’s the signal to step back from INT4 to INT8 on the target.
The two models can be quantized independently, and the performance implications are not symmetric:
Quantizing the target model has the biggest impact because it is the model that decides whether each token proposed by the draft model is accepted. If quantization changes the target model’s predictions, fewer proposed tokens will be accepted, reducing the benefits of speculative decoding.
Quantizing the draft model is usually less risky. The draft model only suggests tokens, while the target model still verifies every suggestion. Even if quantization makes the draft model slightly less accurate, the verification process stays the same.
In practice, it’s generally safe to quantize the draft model more aggressively than the target model. For example, using an INT8 draft model with a BF16 target model usually has little effect on acceptance rates. However, using an INT4 target model, even with a BF16 draft model, can noticeably reduce acceptance rates, especially when generation uses sampling instead of deterministic decoding.
| Target quantization | Draft quantization | VRAM (2× H100, 70B+8B) | Notes |
|---|---|---|---|
| BF16 (~140GB) | BF16 (~16GB) | ~156GB total | Doesn’t fit; ~4GB left for KV cache |
| INT8 (~70GB) | BF16 (~16GB) | ~86GB | Recommended baseline; ~74GB for KV cache |
| INT8 (~70GB) | INT8 (~8GB) | ~78GB | Saves 8GB; minimal acceptance rate impact vs. BF16 draft |
| INT4 (~35GB) | BF16 (~16GB) | ~51GB | Fits on single H100; acceptance rate penalty is real |
| INT4 (~35GB) | INT4 (~4GB) | ~39GB | Maximum memory efficiency; expect noticeably lower acceptance rates |
Weights only. KV cache, activations, and page tables consume additional VRAM. Actual weights may differ slightly by quantization implementation.
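The VRAM column above follows directly from parameter count times bytes per parameter. The sketch below reproduces it. `weight_gb` is a hypothetical helper, parameter counts are rounded to 70B and 8B, and real checkpoints will differ slightly by quantization implementation.

```python
# Bytes per parameter for each precision used in the table.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB: billions of params x bytes per param."""
    return params_billion * BYTES_PER_PARAM[precision]


TOTAL_VRAM_GB = 160  # 2x H100, before any utilization cap or runtime overhead

configs = [("bf16", "bf16"), ("int8", "bf16"), ("int8", "int8"),
           ("int4", "bf16"), ("int4", "int4")]

for target_precision, draft_precision in configs:
    total = weight_gb(70, target_precision) + weight_gb(8, draft_precision)
    print(f"target={target_precision} draft={draft_precision} "
          f"weights={total:.0f} GB remaining={TOTAL_VRAM_GB - total:.0f} GB")
```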
Draft model quantization is handled separately from the target model via --speculative-model-quantization. The main --quantization flag applies only to the target:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model-quantization bitsandbytes \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92
```
There’s a second flag worth knowing about: --spec-decoding-acceptance-method. The default is rejection_sampler, which enforces strict token acceptance based on the probability ratio between draft and target distributions. The alternative, typical_acceptance_sampler, is configurable - it trades a small reduction in output quality for a higher acceptance rate. If you’re running a quantized target model and seeing acceptance rates that would otherwise push you below the break-even threshold, switching to typical_acceptance_sampler can recover some of that loss without changing your model configuration:
```bash
  --spec-decoding-acceptance-method typical_acceptance_sampler \
  --typical-acceptance-sampler-posterior-threshold 0.09 \
  --typical-acceptance-sampler-posterior-alpha 0.3
```
The defaults (threshold=0.09, alpha=0.3) are reasonable starting points. Lowering the threshold accepts more draft tokens; raising it enforces stricter quality. Test on your actual prompt distribution before adjusting.
One more useful flag for quantized deployments under variable load: --speculative-disable-by-batch-size. Set this to a batch size threshold, and the server will automatically disable speculative decoding for new requests when the queue exceeds that size. This gives you the low-QPS gains without manually toggling the configuration when traffic spikes.
After enabling quantization on either model, pull spec_decode_draft_acceptance_rate and compare it against your baseline (non-quantized) measurement on the same prompt distribution. A drop of more than 5 percentage points relative to baseline suggests the quantization penalty is significant enough to revisit your configuration choice - typically by moving the target model from INT4 to INT8, or by switching acceptance methods.
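A before-and-after comparison like this is worth automating so it runs on every configuration change. The sketch below encodes the 5-point rule of thumb. The function names are hypothetical, and the two rates are placeholders for values you measure on the same prompt distribution.

```python
def acceptance_drop_points(baseline_rate: float, quantized_rate: float) -> float:
    """Drop in acceptance rate, in percentage points (positive = got worse)."""
    return (baseline_rate - quantized_rate) * 100


def should_revisit_quantization(baseline_rate: float, quantized_rate: float,
                                max_drop_points: float = 5.0) -> bool:
    """True when the drop exceeds the tolerated number of percentage points."""
    return acceptance_drop_points(baseline_rate, quantized_rate) > max_drop_points


# Placeholder measurements: BF16 baseline vs. two quantized configurations.
print(should_revisit_quantization(0.76, 0.73))  # prints False (3-point drop)
print(should_revisit_quantization(0.76, 0.68))  # prints True (8-point drop)
```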
The right quantization configuration depends on your VRAM constraints, your context length requirements, and your acceptance rate tolerance. The general priority order: INT8 target over INT4 target, same-level quantization for draft and target over mixed, and measure acceptance rate before and after any quantization change.
The performance advantage of continuous batching comes from requests sharing GPU compute in the same forward pass, with new requests slotting into available capacity dynamically. PagedAttention manages the KV cache to enable this without padding or memory waste. Speculative decoding introduces structural assumptions that create friction with both of these mechanisms.
Under standard continuous batching, every iteration of the forward pass generates exactly one new token per request. This uniformity is what makes scheduling clean: you know the output length of each step, you can predict memory allocation, and you can pack requests efficiently.
Speculative decoding breaks this assumption. Each iteration consists of:
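1. The draft model generates k candidate tokens per request.
2. The target model verifies k+1 positions (k draft tokens + 1 correction) in a single forward pass.
3. Draft tokens are accepted up to the first rejection, and everything after it is discarded.

The number of tokens actually appended to each request’s sequence after step 3 is variable - somewhere between 1 and k+1. This creates irregular batch shapes that complicate the scheduler.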
Under a homogeneous workload - all requests at similar temperatures, similar lengths, similar acceptance rates - the irregularity is predictable enough that the scheduler handles it gracefully. Under a mixed workload, the picture is messier.
Imagine a batch where half the requests are running structured JSON extraction at temperature=0.1 (acceptance rate ~80%) and half are running open-ended creative generation at temperature=1.0 (acceptance rate ~35%). The high-acceptance requests are completing their speculative steps efficiently. The low-acceptance requests are running the draft model, paying the memory bandwidth cost of a second forward pass, and rejecting most of what it produces - effectively adding overhead per token rather than removing it.
The scheduler can’t split these cleanly. They share the same verification pass, which means the requests that don’t benefit from speculation are still paying its cost. In practice, workload homogeneity matters more than most teams realize. Speculative decoding is well-suited for dedicated deployments - a code completion endpoint, a structured extraction pipeline, a RAG system with constrained output formats. It is a poor fit for general-purpose chat APIs where temperature and task type vary across requests in the same batch.
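A quick simulation shows how uneven per-step progress becomes in a mixed batch. This is a toy model, not a scheduler: it assumes each draft token is accepted independently at a fixed rate, and `tokens_appended` is a hypothetical helper. The 0.80 and 0.35 rates mirror the two request types described above.

```python
import random
from statistics import mean


def tokens_appended(acceptance_rate: float, k: int, rng: random.Random) -> int:
    """Simulate one speculative step and return how many tokens were appended."""
    accepted = 0
    for _ in range(k):
        if rng.random() >= acceptance_rate:
            break  # first rejection ends the accepted prefix
        accepted += 1
    return accepted + 1  # plus the target's correction (or bonus) token


rng = random.Random(0)  # fixed seed so repeated runs match
k = 5
structured = [tokens_appended(0.80, k, rng) for _ in range(10_000)]
creative = [tokens_appended(0.35, k, rng) for _ in range(10_000)]

print(f"structured: mean tokens/step = {mean(structured):.2f}")
print(f"creative:   mean tokens/step = {mean(creative):.2f}")
```

Both request types pay for the same k draft steps and the same verification pass, but the low-acceptance requests advance far fewer tokens for it.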
The signal to watch for is a gap between P50 and P99 latency that widens after enabling speculative decoding. Under a mixed workload, P50 often improves (the high-acceptance requests pulling the median down) while P99 gets worse (the low-acceptance requests adding tail latency that compounds under load). If your P50 looks like a win but your P99 is a regression, the scheduler interaction is likely the cause. To confirm: run the same load test against a homogeneous low-temperature workload and compare P99 behavior. If it tightens significantly, you have a mixed-workload problem, not a speculative decoding problem.
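The P50/P99 comparison can be scripted against latency samples from two load-test runs. The sketch below uses only the standard library. `percentile` and `tail_regression` are hypothetical helpers, and the sample latencies are synthetic, chosen to show the pattern rather than taken from a real run.

```python
from statistics import quantiles
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile (1-99) of the samples."""
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points between percentiles.
    return quantiles(samples, n=100, method="inclusive")[pct - 1]


def tail_regression(baseline_ms: Sequence[float], spec_ms: Sequence[float]) -> Dict[str, bool]:
    """Compare median and tail latency with and without speculative decoding."""
    p50_base, p99_base = percentile(baseline_ms, 50), percentile(baseline_ms, 99)
    p50_spec, p99_spec = percentile(spec_ms, 50), percentile(spec_ms, 99)
    return {
        "p50_improved": p50_spec < p50_base,
        "p99_regressed": p99_spec > p99_base,
        "gap_widened": (p99_spec - p50_spec) > (p99_base - p50_base),
    }


# Synthetic latencies in milliseconds: the median improves, the tail worsens.
baseline = [100.0] * 90 + [200.0] * 10
speculative = [70.0] * 90 + [400.0] * 10
print(tail_regression(baseline, speculative))
```

If all three flags come back true on your own data, that is the mixed-workload signature described above.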
The published reference numbers give you a baseline; what actually matters for your deployment decision are the numbers you collect on your own hardware, with your own prompt distribution, at your own query rates. Here is exactly what to measure and how.
The most common operational mistake is measuring the wrong thing. A benchmark that shows 2× speedup on isolated single-request tests will not reliably predict behavior under concurrent production load.
Acceptance rate, per request. The spec_decode_draft_acceptance_rate metric is available at the /metrics endpoint. Track it as a histogram, not an average - you want to see the distribution. If P10 of your acceptance rate distribution is below 0.5, you have a substantial portion of your traffic where speculative decoding is net-negative.
TTFT vs. TPOT separately. Speculative decoding affects time-per-output-token (TPOT), not time-to-first-token (TTFT). TTFT may actually increase slightly - the draft model adds a prefill step before the first token is returned. If your SLO is primarily TTFT-bound (e.g., interactive chat where users care about responsiveness more than throughput), speculative decoding may not move the metric you care about.
P50 vs. P99 latency under load. This is where the scheduler interactions surface. A benchmark running 10 concurrent requests at load=0.3 will look very different from 100 concurrent requests at load=0.8. Run your load tests at the concurrency levels you actually see in production.
```bash
# Pull speculative decoding metrics from the metrics endpoint
curl -s http://localhost:8000/metrics | grep spec_decode
```
A healthy deployment looks like this - acceptance rate consistently above 0.7, accepted tokens close to the number of draft tokens:
```text
# HELP vllm:spec_decode_draft_acceptance_rate Speculative decoding draft acceptance rate
vllm:spec_decode_draft_acceptance_rate{...} 0.76
vllm:spec_decode_num_draft_tokens_total{...} 48200
vllm:spec_decode_num_accepted_tokens_total{...} 36600  # ~76% accepted
```
An unhealthy deployment - acceptance rate below 0.5, most draft tokens rejected, overhead not paying off:
```text
vllm:spec_decode_draft_acceptance_rate{...} 0.38
vllm:spec_decode_num_draft_tokens_total{...} 51000
vllm:spec_decode_num_accepted_tokens_total{...} 19400  # ~38% accepted
```
At 38% acceptance with a 1:9 draft/target ratio, you are adding latency, not removing it. This is what a high-temperature or mismatched-family draft model looks like in production. If your metrics look like the second block, turn off speculative decoding until you’ve addressed the root cause.
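You can also derive the acceptance rate from the two counters rather than reading the gauge, which lets you compute it over any window by differencing two scrapes. The sketch below parses Prometheus text output. The metric names are the ones shown in the sample output above, the `model_name` label is a made-up example, and the parser is deliberately minimal (it assumes no trailing timestamps).

```python
from typing import Dict, Optional


def parse_spec_decode_counters(metrics_text: str) -> Dict[str, float]:
    """Sum every vllm:spec_decode_*_total counter across its label sets."""
    totals: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name.startswith("vllm:spec_decode_") and name.endswith("_total"):
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def acceptance_from_counters(totals: Dict[str, float]) -> Optional[float]:
    """Accepted draft tokens divided by proposed draft tokens, if any."""
    drafted = totals.get("vllm:spec_decode_num_draft_tokens_total", 0.0)
    accepted = totals.get("vllm:spec_decode_num_accepted_tokens_total", 0.0)
    return accepted / drafted if drafted else None


# Sample scrape using the unhealthy numbers from above.
sample = """
# HELP vllm:spec_decode_num_draft_tokens_total Number of draft tokens
vllm:spec_decode_num_draft_tokens_total{model_name="demo"} 51000
vllm:spec_decode_num_accepted_tokens_total{model_name="demo"} 19400
"""
rate = acceptance_from_counters(parse_spec_decode_counters(sample))
print(f"acceptance_rate={rate:.2f}")  # prints acceptance_rate=0.38
```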
Consider a deployment where 60% of requests are low-temperature structured queries with 82% acceptance rates, and 40% are high-temperature creative requests with 38% acceptance rates. The weighted average acceptance rate is ~64%, which looks healthy. But the 40% of requests that are degrading performance are doing so in a way that adds tail latency to the whole batch. P50 looks like a win; P99 is a regression. Aggregate benchmarks will show the win and hide the regression.
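The arithmetic behind this example is worth making explicit, because the blended number is exactly what a dashboard average reports. The sketch below is illustrative: the segment names and helper functions are hypothetical, and it weights by request share, which assumes both segments generate a similar number of tokens per request.

```python
from typing import Dict, List

# The two traffic segments from the example above.
segments: List[Dict[str, float]] = [
    {"traffic_share": 0.60, "acceptance_rate": 0.82},  # structured, low temperature
    {"traffic_share": 0.40, "acceptance_rate": 0.38},  # creative, high temperature
]


def blended_acceptance(segments: List[Dict[str, float]]) -> float:
    """Traffic-weighted average acceptance rate across segments."""
    return sum(s["traffic_share"] * s["acceptance_rate"] for s in segments)


def net_negative_share(segments: List[Dict[str, float]], break_even: float = 0.5) -> float:
    """Fraction of traffic whose acceptance rate falls below break-even."""
    return sum(s["traffic_share"] for s in segments if s["acceptance_rate"] < break_even)


print(f"blended acceptance rate: {blended_acceptance(segments):.3f}")
print(f"traffic below break-even: {net_negative_share(segments):.0%}")
```

The blended rate lands around 0.64, comfortably above break-even, while 40% of the traffic sits well below it. That is why the per-segment view matters more than the average.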
Speculative decoding delivers when these conditions hold:
| Requirement | Why it matters |
|---|---|
| Same model family | Shared training distribution = higher acceptance rates |
| 1:8–1:12 size ratio | Large enough to predict accurately, small enough to be cheap |
| Same tokenizer | Mismatched tokenizers require expensive re-encoding between models |
| Quantize the draft before relaxing the family constraint | INT8 draft is fine; relaxing the family constraint hurts acceptance more than INT8 does |
Speculative decoding is a genuine win for the right workloads, but the flag is not a universal accelerant. Treat it like any other performance optimization: measure first, at production conditions, then decide.
Does speculative decoding change the model’s outputs?
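No, not with the default rejection_sampler. Its acceptance criterion guarantees the final output distribution is identical to what the target model would have produced on its own. In that mode it is a pure latency optimization - it changes how fast tokens are generated, not what tokens are generated - and you can enable it without touching your prompts, sampling parameters, or output validation. The exception is typical_acceptance_sampler, which deliberately relaxes that guarantee in exchange for a higher acceptance rate.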
Does it improve time-to-first-token (TTFT) or time-per-output-token (TPOT)?
It improves TPOT, not TTFT. The draft model adds a small prefill step before the first token is returned, so TTFT may actually increase slightly. If your SLO is primarily TTFT-bound - interactive chat where users notice the first response delay more than the generation speed - speculative decoding may not move the metric that matters to you. It’s most valuable when your bottleneck is throughput or output speed, not initial responsiveness.
What’s the difference between draft model and n-gram speculative decoding?
Draft model speculation uses a separate smaller model to propose tokens - it works across any prompt type but costs VRAM and requires a compatible model family. N-gram speculation reuses repeated phrases from the input prompt itself, which makes it nearly free on memory but only useful when the output closely echoes the input (summarization, RAG, document Q&A). For general chat or code generation, use a draft model. For summarization pipelines where the answer largely paraphrases the source, n-gram is often the better choice and requires no additional model at all.