
Prompt Caching Is the End of the Repeated Input Tax
Stop paying for the same system instructions over and over by leveraging prefix-reuse mechanics to slash both latency and LLM costs.
If you’ve ever felt a physical pang of guilt watching your API bill climb because you’re sending the same 5,000-word documentation file with every single user query, I have some very good news for you. We are finally moving past the era where we pay "rent" on the same blocks of text every few seconds.
For a long time, LLMs were stateless in the most expensive way possible. Every time you hit an endpoint, the model had to re-process your system prompt, your few-shot examples, and your context documents from scratch. It was like hiring a world-class consultant who develops total amnesia the moment they finish a sentence. Prompt caching changes that by allowing providers to store the intermediate "thinking" (the KV cache) of your prompt's prefix, letting you skip the re-processing time and cost.
The "Groundhog Day" Tax
Standard LLM calls follow a linear path: you send tokens, the provider computes them, the model generates a response, and then the whole state is deleted. If you have a 10k token system prompt—maybe a complex set of brand guidelines or a codebase summary—you pay for those 10k tokens on every. single. request.
Prompt caching allows the model to say, "Hey, I recognize this first 10k tokens. I’ve already computed them. Let’s just jump straight to the new stuff."
This usually results in two massive wins:
1. Cost: Cached tokens are often discounted by 50–90%, depending on the provider.
2. Latency: Since the model doesn't have to re-read the prefix, time-to-first-token (TTFT) drops off a cliff.
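To put rough numbers on the cost side, here's a back-of-the-envelope calculator. The price and discount below are illustrative assumptions for the sketch, not any provider's actual rate card:

```python
# Illustrative savings estimate (assumed rates: $3.00 per million input
# tokens, cached tokens billed at a 90% discount -- NOT a real price sheet).
def monthly_input_cost(prefix_tokens, query_tokens, requests,
                       price_per_mtok=3.00, cache_discount=0.90,
                       cached=False):
    # The static prefix is billed at the discounted rate when cached
    prefix_rate = price_per_mtok * (1 - cache_discount) if cached else price_per_mtok
    prefix_cost = prefix_tokens * requests * prefix_rate / 1_000_000
    # The fresh part of the prompt always pays full price
    query_cost = query_tokens * requests * price_per_mtok / 1_000_000
    return prefix_cost + query_cost

# 10k-token static prefix, 200-token user queries, 100k requests/month
without = monthly_input_cost(10_000, 200, 100_000)
with_cache = monthly_input_cost(10_000, 200, 100_000, cached=True)
print(f"${without:,.2f} vs ${with_cache:,.2f}")  # → $3,060.00 vs $360.00
```

The query tokens are a rounding error next to the prefix, which is exactly why caching the prefix dominates the savings.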
How Anthropic Does It (Explicit Control)
Anthropic’s implementation is my favorite because it gives you manual control. You tell the API exactly where you want to "checkpoint" the cache. This is perfect for RAG (Retrieval-Augmented Generation) where you might have a massive knowledge base that rarely changes.
Here is what that looks like in Python using the anthropic library. Notice the cache_control block:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert on this massive 20,000-word technical documentation...",
            "cache_control": {"type": "ephemeral"}  # This marks the end of the cached block
        }
    ],
    messages=[
        {"role": "user", "content": "How do I configure the load balancer?"}
    ],
)

print(f"Usage stats: {response.usage}")

The Catch: Anthropic requires a minimum of 1,024 tokens to cache (at the time of writing). If your system prompt is just "You are a helpful assistant," caching it won't do anything but make your code more verbose.
OpenAI's Automatic Approach
OpenAI took a different route. They handle caching automatically. If you send a request that starts with the same prefix as a previous request, they’ll cache it for you behind the scenes.
There’s no special code to write, but there is a strict requirement: the prefix must be exactly the same, character for character, and the prompt must be at least 1,024 tokens long.
from openai import OpenAI

client = OpenAI()

# The first call will be a "cache miss" (full price)
# The second call, if the prompt starts the same way, is a "cache hit"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Imagine a 2000 token long knowledge base here..."},
        {"role": "user", "content": "Summarize the section on authentication."}
    ]
)

# Check 'prompt_tokens_details' in the usage object to see if caching worked
print(response.usage.prompt_tokens_details.cached_tokens)

The "magic" here is nice, but it's fragile. If you accidentally move a single space or change a character at the beginning of that knowledge base, the cache breaks.
The "Golden Rule" of Caching: Order Matters
This is the part that trips people up. Caching works from the beginning of the string. You cannot cache the "middle" of a prompt.
Think of it like a stack of bricks. You can only cache the bottom bricks. If you want to change a brick at the bottom, you have to take the whole stack off and start over.
Bad Prompt Structure (Breaks Cache):
1. Variable: Current Time/Date
2. Static: 5,000 token Knowledge Base
3. Variable: User Query
Good Prompt Structure (Cache Optimized):
1. Static: 5,000 token Knowledge Base (Marked for cache)
2. Variable: Current Time/Date
3. Variable: User Query
By keeping your dynamic data (like the current time or the user's name) *after* your massive static blocks, you ensure the prefix remains identical across calls.
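In code, that ordering discipline looks something like this. KNOWLEDGE_BASE and build_messages are hypothetical names for the sketch; the point is only where the dynamic data goes:

```python
from datetime import datetime, timezone

# Stand-in for a large static document (~5,000 tokens in the scenario above)
KNOWLEDGE_BASE = "...imagine the full documentation here..."

def build_messages(user_query: str) -> list[dict]:
    # Static content first: byte-identical on every call, so the
    # provider can reuse the cached prefix.
    # Dynamic content last: changing it never invalidates that prefix.
    dynamic = f"Current time (UTC): {datetime.now(timezone.utc).isoformat()}"
    return [
        {"role": "system", "content": KNOWLEDGE_BASE},
        {"role": "user", "content": f"{dynamic}\n\n{user_query}"},
    ]

msgs = build_messages("How do I rotate the API key?")
```

Had the timestamp gone into the system message instead, every call would start with different bytes and every call would be a cache miss.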
When should you actually care?
If you’re building a simple chatbot that just answers "Hello," prompt caching is overkill and won't even trigger. But if you’re doing any of the following, you’re basically burning money if you aren't using it:
* Many-shot prompting: Including 50+ examples of how the model should behave.
* Chat History: Caching the previous turns of a long conversation so the model doesn't have to re-read the whole transcript every time the user says "Okay."
* Codebase Analysis: Sending a set of library definitions or API docs with every request.
* PDF/Document Chat: Keeping a 100-page report in the context while the user asks multiple questions.
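The chat-history case is worth a sketch. With Anthropic's cache_control marker you can move the cache breakpoint to the newest turn on every request, so the provider only pays full price for the delta. The pattern follows Anthropic's documented incremental-caching approach, but with_cache_breakpoint is a hypothetical helper, not part of their SDK:

```python
def with_cache_breakpoint(history: list[dict]) -> list[dict]:
    """Convert simple {role, text} turns into Anthropic-style message
    dicts, marking only the newest turn with cache_control."""
    marked = []
    for i, msg in enumerate(history):
        content = [{"type": "text", "text": msg["text"]}]
        if i == len(history) - 1:
            # Move the breakpoint forward on every request: everything
            # up to and including this turn becomes the cached prefix.
            content[0]["cache_control"] = {"type": "ephemeral"}
        marked.append({"role": msg["role"], "content": content})
    return marked

history = [
    {"role": "user", "text": "What does the load balancer do?"},
    {"role": "assistant", "text": "It distributes traffic across..."},
    {"role": "user", "text": "Okay."},
]
messages = with_cache_breakpoint(history)
```

Because the prior turns are byte-identical to the last request, the old breakpoint's prefix still matches and only the new turn is computed from scratch.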
The Gotchas
Nothing in life is free, though prompt caching comes close. Here are the things that will bite you if you aren't careful:
1. Eviction: Caches aren't permanent. They usually expire after 5–10 minutes of inactivity (sometimes longer, depending on the provider). If no one uses your app for an hour, the first person back pays the "cold start" latency and price.
2. Cost of Writing: On some platforms, "writing" to the cache costs more than a standard input token (Anthropic charges roughly 25% extra for cache writes), but you make that back on the very first "read."
3. The 1k Minimum: If your static prefix is 800 tokens, most providers won't cache it. You need enough bulk to make it worth their while.
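To see which of these you're actually hitting, check the usage counters on every response. Anthropic's Messages API reports cache_creation_input_tokens and cache_read_input_tokens; the helper and sample numbers below are just an illustration of how to read them:

```python
def cache_status(usage: dict) -> str:
    """Classify a response's cache behavior from its usage counters."""
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "hit"       # the prefix was served from cache
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "write"     # cold start: paid the (pricier) write rate
    return "uncached"      # prefix too short, or no cache_control set

# Made-up counters for illustration
print(cache_status({"cache_read_input_tokens": 10240,
                    "cache_creation_input_tokens": 0,
                    "input_tokens": 42}))  # → hit
```

Logging this per request is the cheapest way to catch a silently broken cache (say, a stray timestamp at the top of your prompt) before the bill does.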
Prompt caching is one of those rare "free lunch" updates in tech. It makes things faster for the user and cheaper for the developer. If you haven't audited your prompt structure lately to see if you can shove your static content to the front, now is the time. Your API bill will thank you.


