By Adrien Payong and Shaoni Mukherjee

Prompt caching is a provider-native feature that stores and reuses the initial, unchanging part of a prompt (the prompt prefix) so that large language models don’t have to process it again on every request. More specifically, it caches the model’s internal state for that prefix, eliminating redundant computation. This reduces latency and input token costs without any loss in quality. In other words, prompt caching makes your LLM calls faster and cheaper whenever multiple requests share a long, identical prefix (such as system instructions, tool definitions, or context data).
This article will explain how it works behind the scenes, how to structure your prompts to maximize cache hits, and the differences in implementation between OpenAI, Anthropic Claude, and Google Gemini. We will also show how it compares to semantic caching, and how to measure its ROI while avoiding common pitfalls.
Prompt caching is an easy win for faster and cheaper LLMs because it reuses the unchanged prefix of a prompt across calls. When enabled, the LLM’s API checks whether your prompt begins with a prefix the model has recently seen. If it does, the provider skips computation for that prefix instead of recomputing those tokens. This can reduce latency by up to 80% and input token costs by up to 90% for large, repetitive prompts. The key point is that prompt caching reuses the model’s intermediate state (e.g., the key-value tensors in the transformer’s attention layers) for the prefix tokens, not the output text itself. The final output therefore remains unaffected: prompt caching gives the same output as if the prompt were processed normally, just without the redundant work on the prefix.
At a high level, prompt caching relies on exact prefix matching and provider-side caching of model state for that prefix. The mechanism can be summarized in a few steps:

1. When a request arrives, the provider checks whether the prompt begins with a prefix it has recently cached (and, typically, whether the prompt clears a minimum token threshold).
2. On a cache hit, the stored attention key-value state for that prefix is reused, and only the tokens after the prefix are processed.
3. On a cache miss, the full prompt is processed as usual, and the prefix may be written to the cache for subsequent requests.
4. Cached entries expire after a period of inactivity (typically minutes to about an hour, depending on the provider and retention settings).

Two practical details follow from this design:

- If you issue many requests in rapid succession with the same exact prefix, they will be routed to the same cache node. At very high rates (e.g., more than 15 requests per minute with the same exact prefix), some requests may overflow to other machines, and cache effectiveness drops.
- Matching is exact: a single character difference, a different order of JSON keys, or a setting toggled in the prefix causes a mismatch and results in a cache miss, as the snippet below illustrates.
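One of the most common ways to break exact matching is non-deterministic serialization. Here is a quick, provider-independent illustration in plain Python:

```python
import json

# Same data, different key order → different bytes → different prefix → cache miss.
a = json.dumps({"tool": "search", "version": 1})
b = json.dumps({"version": 1, "tool": "search"})
print(a == b)  # False

# Canonical serialization (sorted keys, fixed separators) keeps the prefix byte-stable.
a = json.dumps({"tool": "search", "version": 1}, sort_keys=True, separators=(",", ":"))
b = json.dumps({"version": 1, "tool": "search"}, sort_keys=True, separators=(",", ":"))
print(a == b)  # True
```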
Provider-native prompt caching can materially reduce time-to-first-token and input spend, but the “how” differs across OpenAI, Anthropic Claude, and Google Gemini. The practical goal remains the same: identify the largest stable prefix in your requests, maintain it byte-for-byte across calls, and instrument cache metrics to validate hit rates and ROI in production.

OpenAI prompt caching is automatic on supported models when your prompt crosses a minimum size (commonly ≥ 1,024 tokens). The core rule is simple: cache hits require an exact matching prefix, so place stable content first and push request-specific data (user question, retrieved snippets, volatile metadata) to the end.
Two knobs matter in production. First, prompt_cache_key can influence routing and help improve hit rates by steering similar-prefix traffic to the same cache “bucket”. Second, retention can be controlled with prompt_cache_retention (e.g., in-memory short retention) and, on supported models, extended retention such as “24h”. For observability, log usage.prompt_tokens_details.cached_tokens to measure how many prompt tokens were served from cache.
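Below is a minimal sketch, assuming the official `openai` Python SDK; the model name, file path, and cache key are illustrative placeholders, while `prompt_cache_key` and the cached-token usage field are the knobs described above.

```python
from openai import OpenAI

client = OpenAI()

# Long, unchanging instructions form the stable prefix (placeholder file).
STATIC_SYSTEM = open("support_playbook.txt").read()

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use any caching-enabled model
        messages=[
            {"role": "system", "content": STATIC_SYSTEM},  # stable prefix, always first
            {"role": "user", "content": question},         # volatile content, always last
        ],
        prompt_cache_key="support-playbook-v1",  # steer same-prefix traffic to one cache bucket
    )
    # How many prompt tokens were served from cache on this request.
    cached = response.usage.prompt_tokens_details.cached_tokens
    print(f"cached prompt tokens: {cached}")
    return response.choices[0].message.content
```

On the first call the cached-token count is typically 0 (a cache write); subsequent calls with an identical prefix should report most of the prefix as cached.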
Anthropic’s technique is more direct: you tag cacheable segments with cache_control on content blocks (typically “ephemeral”), which amounts to saying “cache up to here.” Claude builds the cache in a specified order (tools → system → messages) and matches the cached region exactly (including images and tool definitions).
Practically speaking, Claude’s design allows up to four cache breakpoints, and the cache-matching step uses a bounded lookback (documented at roughly 20 blocks), so multi-breakpoint designs can reduce miss rates for long prompts. The TTL is 5 minutes by default, with an optional 1-hour TTL at additional cost. The usage fields cache_creation_input_tokens (cache writes) and cache_read_input_tokens (cache reads) show how much of each prompt goes into warming the cache versus being served from it.
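A minimal sketch with the official `anthropic` Python SDK follows; the model name and document are illustrative, while the `cache_control` block and the two usage counters are as described above.

```python
import anthropic

client = anthropic.Anthropic()

LONG_CONTEXT = open("product_manual.txt").read()  # large static document (placeholder)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_CONTEXT,
            # Breakpoint: "cache everything up to and including this block."
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How do I reset the device?"}],
)

# First call: cache_creation_input_tokens > 0 (write). Warm calls: cache_read_input_tokens > 0.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```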
In the Gemini API, you can choose between implicit caching (automatic and model-dependent) and explicit caching (create a cache object, then reference it in later requests). Implicit caching is enabled by default for Gemini 2.5 models and has model-dependent minimum sizes (different thresholds for Flash and Pro, for example). Explicit caching lets you specify a TTL (generally defaulting to 1 hour) and reuse a large static context as a prefix across many questions.
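Here is a minimal explicit-caching sketch, assuming the `google-genai` Python SDK; the model name, TTL, and content are illustrative, and minimum cache sizes vary by model.

```python
from google import genai
from google.genai import types

client = genai.Client()

MANUAL = open("product_manual.txt").read()  # large static context (placeholder)

# Create an explicit cache with a TTL, then reference it across many requests.
cache = client.caches.create(
    model="gemini-2.5-flash",  # illustrative
    config=types.CreateCachedContentConfig(
        display_name="manual-cache",
        system_instruction="Answer questions using the attached manual.",
        contents=[MANUAL],
        ttl="3600s",  # 1 hour
    ),
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="How do I reset the device?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.usage_metadata.cached_content_token_count)  # tokens served from the cache
```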
Vertex AI similarly offers both implicit and explicit caching. Requests must meet a documented minimum size (e.g., 2,048 tokens), implicitly cached tokens may be eligible for substantial discounts, and explicit caching introduces storage costs.
The golden rule of prompt caching is “static-first, dynamic-last.” To maximize cache hits, structure your prompts so that the prefix (the start of the prompt) contains the static, reusable parts, and the suffix (the end of the prompt) contains all request-specific or user-provided content. Only a matching prefix can be cached; any part of the prompt that changes between requests must stay out of the cached section. In practice, that means concatenating the prompt in a consistent order: system instructions first, then tool definitions, then long reference documents or context data, then conversation history, and finally the current user question and any retrieved snippets.
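As a concrete sketch (plain Python, provider-agnostic; the helper and variable names are illustrative), deterministic static-first assembly might look like this:

```python
import json

SYSTEM_INSTRUCTIONS = "You are a support assistant for Acme routers."          # static
TOOL_SPECS = [{"name": "lookup_order", "parameters": {"order_id": "string"}}]  # static
KNOWLEDGE_BASE = open("kb.md").read()                                          # static (placeholder)

def build_messages(history: list[dict], user_question: str) -> list[dict]:
    # Canonical JSON (sorted keys, fixed separators) keeps the tool block byte-stable.
    tools_block = json.dumps(TOOL_SPECS, sort_keys=True, separators=(",", ":"))
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},       # 1. instructions
        {"role": "system", "content": f"Tools:\n{tools_block}"},  # 2. tool specs
        {"role": "system", "content": KNOWLEDGE_BASE},            # 3. long static context
        *history,                                                  # 4. prior turns
        {"role": "user", "content": user_question},                # 5. volatile input, last
    ]
```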
Note that prompt caching (also known as prefix caching) is distinct from semantic caching, which addresses a similar problem but at a different layer and in a different manner. Prompt caching, as explained here, is strictly about having the exact same prefix: it’s great for direct re-use of context. Semantic caching, by contrast, means caching at the level of user queries and responses based on meaning, even if the wording differs.
| Aspect | Prompt Caching (Exact Prefix) | Semantic Caching (Meaning-Based) |
|---|---|---|
| What is cached | The LLM’s internal state for a prompt prefix (e.g., model context for the first N tokens). The model still generates a fresh output each time (just starting from a saved state). | The output (response) for a given input query, stored for reuse. Often involves saving the full answer text for a query or prompt. |
| Hit criterion | Exact token match of the prompt prefix. Any difference in the prefix text (or parameters like system prompt, temperature, etc.) means no hit. | Semantic similarity of the new query to a past query. Uses embeddings to find if the new question is essentially the same as one seen before (even if phrased differently). |
| Use cases | Repeated static context across many requests (long instructions, documents, tool specs). Useful in multi-turn dialogues or agent tools where the environment stays the same but questions vary. | Repeated questions from users, FAQ-style queries, or scenarios where users often ask the same thing. Great for caching complete answers to common or expensive queries, even if phrasing varies. |
| Benefits | Saves input token processing costs and reduces latency for large prompts by avoiding recomputation of the static part. The LLM still tailors the answer to the new question each time. | Can avoid an LLM call entirely if a similar query was answered before – returning a stored response instantly. This can yield order-of-magnitude speedups (responses in milliseconds) and cost reduction by skipping calls. |
| Limitations | Only works for identical prefixes; cannot help if the prompt context changes or if the cost is in the output tokens. Doesn’t handle semantically similar but not identical prompts. | Requires maintaining an external cache (database or vector index) of Q&A pairs. There’s a risk of stale answers or incorrect matches if two queries are similar in embedding but actually different in intent, so it needs careful tuning (thresholds, validation). Also, semantic caching typically doesn’t help with the prompt’s context length (it’s about reusing outputs). |
It makes sense to use prompt caching and semantic caching in tandem, sometimes called a “double caching” strategy. For instance, in a customer support chatbot you might use prompt caching to serve a large static knowledge base (so the LLM doesn’t have to re-process the manual’s tokens every time it’s invoked), while using semantic caching of the final answers so that if user X asks a question and user Y later asks the same thing, you can serve the cached answer directly without calling the LLM at all.
Prompt caching and semantic caching operate at different layers (one inside the model’s inference process, the other outside at the application level), so they’re not mutually exclusive and can easily complement each other. Prompt caching makes each LLM call as efficient as possible, while semantic caching reduces the total number of LLM calls needed in the first place by eliminating repeats.
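A minimal sketch of this double-caching pattern follows, assuming the `openai` SDK, numpy, and an in-memory list as the semantic cache; the embedding model, similarity threshold, and prompt content are illustrative.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
STATIC_CONTEXT = "Answer using the Acme support manual:\n..."  # stable prefix (placeholder)
semantic_cache: list[tuple[np.ndarray, str]] = []              # (query embedding, stored answer)

def embed(text: str) -> np.ndarray:
    data = client.embeddings.create(model="text-embedding-3-small", input=text).data[0]
    return np.array(data.embedding)

def answer(question: str, threshold: float = 0.92) -> str:
    q = embed(question)
    # 1) Semantic cache: reuse a stored answer if a similar question was seen before.
    for vec, cached_answer in semantic_cache:
        cosine = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if cosine >= threshold:
            return cached_answer
    # 2) Otherwise call the LLM; the static-first prompt lets prompt caching do its work.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": STATIC_CONTEXT},
                  {"role": "user", "content": question}],
    )
    text = response.choices[0].message.content
    semantic_cache.append((q, text))
    return text
```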
How do you know if prompt caching is adding value to your application, and when is it even worth the effort? The core metrics to track are the cache hit rate (the share of prompt tokens served from cache, taken from fields such as cached_tokens or cache_read_input_tokens), time-to-first-token (TTFT), and input cost per request, compared before and after you restructure your prompts.

A high hit rate (e.g., 80%+ of prompt tokens served from cache) is a good sign that you’re successfully reusing context most of the time. A low hit rate means either that your prompts don’t repeat often enough, or that your cache is expiring or being invalidated too frequently.
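One way to quantify this from your logs is sketched below; the log schema, prices, and discount rate are assumptions you should replace with your provider’s published numbers.

```python
def cache_roi(requests, price_per_mtok: float, cached_discount: float):
    """requests: iterable of dicts with 'prompt_tokens' and 'cached_tokens' per call."""
    total = sum(r["prompt_tokens"] for r in requests)
    cached = sum(r["cached_tokens"] for r in requests)
    hit_rate = cached / total if total else 0.0
    # Cached input tokens are billed at a discount; use the rate your provider documents.
    savings = cached / 1_000_000 * price_per_mtok * cached_discount
    return hit_rate, savings

hit_rate, saved = cache_roi(
    [{"prompt_tokens": 12_000, "cached_tokens": 10_000},
     {"prompt_tokens": 12_000, "cached_tokens": 0}],
    price_per_mtok=2.50,   # hypothetical input price per million tokens
    cached_discount=0.50,  # hypothetical discount on cached tokens
)
print(f"hit rate: {hit_rate:.0%}, estimated input savings: ${saved:.4f}")
```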
There are three common causes of caching problems: (1) the prompt is not cacheable (too short, or caching isn’t enabled), (2) the cached prefix is not exactly identical between calls (even a one-character difference triggers a miss), or (3) the cache expires or is invalidated by changing modes (tools/web/citations/images). The following table gives a systematic checklist to identify the symptom, verify the cause, and apply the appropriate fix:
| Issue/symptom | Diagnostic checks (what to verify) | Fix / best practice (what to do) |
|---|---|---|
| No cache hits (cached tokens always 0) | 1) Prompt length clears provider threshold (e.g., OpenAI ~1024+, Vertex implicit ~2048+). 2) Prefix is byte-for-byte identical across calls. 3) Provider-specific caching is enabled on every request (e.g., Claude requires cache_control each time). | Increase prompt size past threshold (move stable docs/system text into the prefix). Make prompt construction deterministic. Log full prompts and diff them. Ensure required cache flags/markers are sent on every call. |
| Cache misses after “small” changes | Compare a known hit vs miss prompt: system text, tool schemas, JSON serialization, whitespace/newlines, timestamps/IDs, or non-fixed defaults. Check JSON key ordering and template rendering stability. | Use stable serialization (canonical JSON / fixed key order). Freeze defaults and remove dynamic fields from the cached prefix. Normalize whitespace and templating. Add a “prompt fingerprint” (hash) in logs to catch drift early (see the sketch below the table). |
| Cache invalidated when switching features/modes | Did you toggle web/tools/citations, change tool definitions, add images, or edit system instructions between calls? (These often create a logically new context.) | Treat each mode/toggle combination as a separate cache “profile.” Keep toggles consistent within a session. If you must switch modes, expect a miss and warm the cache for that mode. |
| Cache expires before reuse | Time between user turns vs TTL/retention window. Do you have access to longer retention (OpenAI extended cache, Claude longer TTL, explicit cache TTL on Google)? | Choose the longest economical retention for your interaction pattern (hours-later follow-ups → extended retention). If retention cannot be extended, consider periodic lightweight refresh calls only for high-value contexts (and quantify the cost). |
| Partial prompt updates don’t behave as expected | Where is the change occurring? If you inserted/edited inside the cached prefix, it’s a new prefix (miss). If you only append after the prefix, caching should still apply. If a provider supports multiple checkpoints, are they placed correctly? | Structure prompts into stable-first blocks (system + long docs) and append volatile parts (user turns) later. Use multiple cache breakpoints/blocks where supported so independent sections can remain cacheable. Avoid inserting content mid-prefix. |
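The “prompt fingerprint” mentioned in the table can be as simple as hashing a canonical serialization of the stable prefix; here is a minimal sketch (the message structure is illustrative):

```python
import hashlib
import json

def prompt_fingerprint(messages: list[dict]) -> str:
    # Canonical JSON so the same logical prefix always hashes to the same value.
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

prefix = [{"role": "system", "content": "You are a support assistant."}]
# Log this next to cached-token counts: a changed fingerprint explains a sudden miss streak.
print("prefix fingerprint:", prompt_fingerprint(prefix))
```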
Prompt caching is one of the highest-ROI performance levers for LLM applications because it removes redundant prefill computation without changing model quality. The takeaways are simple: standardize and version your prompt prefix, keep dynamic inputs at the end, and instrument cache-read signals (cached tokens, cache read/write counters) alongside TTFT and cost. With that in place, you can layer a semantic caching strategy on top to eliminate entire calls, a “double caching” pattern that improves both unit economics and user-perceived responsiveness at scale.