loke.dev

The Cache That Thinks in Embeddings

Why matching exact strings is a waste of tokens, and how to cache LLM responses by intent instead.


I’ve spent an embarrassing amount of time watching my cloud bill tick up because three different users asked my chatbot the exact same thing in slightly different ways.

Traditional caching is a bit of a stickler. If User A asks, "What is the capital of France?" and User B asks, "What's France's capital?", a standard Redis cache—which looks for an exact string match—sees these as two completely different events. It sends both to the LLM. You pay for the tokens twice. You wait for the inference twice. It’s inefficient, and frankly, it feels a bit dated in the era of "vibes-based" computing.

If we want our apps to be fast and cheap, we need a cache that understands intent, not just characters. We need a cache that thinks in embeddings.

The Exact Match Trap

In a typical web app, caching is easy. You hash a URL or a database query, and if it matches, you’re golden. But LLMs deal with the messiness of human language.

Consider these three prompts:
1. "How do I reset my password?"
2. "I forgot my password, how to change it?"
3. "Password reset instructions please."

To a computer, these strings have almost zero overlap. To a human (and an LLM), they are identical. With a standard key-value store, only one of these three can ever hit; the other two miss and go straight to the LLM.
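The trap is easy to demonstrate with nothing but the standard library: key the three prompts the way an exact-match cache would, and you get three unrelated keys.

```python
import hashlib

prompts = [
    "How do I reset my password?",
    "I forgot my password, how to change it?",
    "Password reset instructions please.",
]

# An exact-match cache keys on the raw string (or a hash of it).
keys = [hashlib.sha256(p.encode()).hexdigest() for p in prompts]

# Three distinct keys -> three cache misses, three paid LLM calls.
print(len(set(keys)))  # 3
```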

Semantic Caching: The "Vibe" Check

Instead of mapping a string to a response, we map a vector to a response.

The flow looks like this:
1. Turn the user's prompt into an embedding (a list of numbers representing the meaning).
2. Search a vector database for the "nearest neighbor" to that embedding.
3. If the closest match is "close enough" (based on a similarity threshold), return the cached answer.
4. If not, hit the LLM and save the result.

Here is a bare-bones example using sentence-transformers for local embedding generation and numpy for the math. You don't always need a massive vector DB to start; sometimes a simple local array is enough to prove the point.

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a lightweight model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Our "Database"
cache = {
    "embeddings": [],
    "responses": [],
    "prompts": []
}

def get_similarity(v1, v2):
    # Standard cosine similarity
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def ask_with_cache(new_prompt, threshold=0.85):
    new_vec = model.encode(new_prompt)
    
    best_match_idx = -1
    highest_score = -1

    for i, cached_vec in enumerate(cache["embeddings"]):
        score = get_similarity(new_vec, cached_vec)
        if score > highest_score:
            highest_score = score
            best_match_idx = i

    if highest_score > threshold:
        print(f"--- Cache Hit! (Score: {highest_score:.4f}) ---")
        return cache["responses"][best_match_idx]

    print("--- Cache Miss. Calling LLM... ---")
    # Mocking an LLM call
    response = f"LLM generated response for: {new_prompt}"
    
    # Store it
    cache["embeddings"].append(new_vec)
    cache["responses"].append(response)
    cache["prompts"].append(new_prompt)
    
    return response

# Usage
print(ask_with_cache("How do I bake a chocolate cake?"))
print(ask_with_cache("Give me a recipe for chocolate cake.")) # Should hit!

Finding the Sweet Spot (The Threshold)

The most stressful part of setting this up isn't the code; it's the threshold.

If you set it to 0.99, you’re basically back to exact matching. If you set it to 0.70, you might ask "How do I kill a process?" and get back the cached answer for "How do I kill a mockingbird?" That’s a bad day for everyone involved.

I’ve found that 0.88 to 0.92 is usually the "goldilocks zone" for general conversation. However, if you are doing something high-stakes—like medical advice or code execution—you probably want to stay above 0.95.
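If you can hand-label even a handful of prompt pairs, you can pick that threshold empirically instead of guessing. A minimal sketch, with made-up similarity scores standing in for real embedding comparisons:

```python
# Hypothetical labelled pairs: (cosine similarity, should they share a cache entry?)
labelled = [
    (0.97, True),   # "reset my password" vs "password reset please"
    (0.91, True),
    (0.89, True),
    (0.83, False),  # related topic, different intent
    (0.74, False),  # "kill a process" vs "kill a mockingbird"
    (0.62, False),
]

def accuracy(threshold):
    # Fraction of pairs the threshold classifies correctly.
    correct = sum((score > threshold) == label for score, label in labelled)
    return correct / len(labelled)

# Sweep candidate thresholds and keep the best-scoring one.
candidates = [t / 100 for t in range(70, 100)]
best = max(candidates, key=accuracy)
print(best, accuracy(best))  # 0.83 1.0
```

On real data you would optimize for precision over recall: a false cache hit (wrong answer served instantly) is usually worse than a false miss (one extra LLM call).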

Scaling Up with Redis or Qdrant

If you're doing this in production, don't loop through a Python list like I did above. It'll get slow fast. Modern vector databases like Qdrant, Pinecone, or even Redis (with the Search module) handle this much better.

Here’s how you might structure a query with a custom Redis implementation (libraries like GPTCache wrap this same pattern for you):

from redis import Redis
from redis.commands.search.query import Query

# Note: This assumes you've set up a RediSearch index with HNSW
def query_vector_cache(query_vector, index_name="llm_cache"):
    r = Redis(host='localhost', port=6379)
    
    # Return the top 1 result within a distance range
    q = Query("*=>[KNN 1 @vector $vec as score]")\
        .sort_by("score")\
        .return_fields("response", "score")\
        .dialect(2)
    
    params = {"vec": query_vector.tobytes()}
    results = r.ft(index_name).search(q, params)
    
    return results.docs[0] if results.docs else None
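One detail the snippet above glosses over: `query_vector.tobytes()` assumes the vector is a NumPy array whose dtype matches the one declared when the index was created (float32 is the usual choice for HNSW). A quick sanity check of that byte-level round trip:

```python
import numpy as np

# RediSearch stores vector fields as raw bytes; the dtype here must
# match the TYPE declared in the index schema (usually FLOAT32).
vec = np.array([0.1, -0.25, 0.5], dtype=np.float32)

blob = vec.tobytes()                           # what you pass as $vec
restored = np.frombuffer(blob, dtype=np.float32)

print(np.array_equal(vec, restored))  # True
```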

The "Gotchas" You Can't Ignore

Semantic caching is great, but it has some weird edge cases that will bite you if you aren't careful.

1. Temporal Drift: If a user asks "What is the stock price of Apple?", a semantic cache might return a result from three days ago because the *intent* is identical. For time-sensitive queries, store a timestamp alongside each entry and enforce a TTL, or bypass the cache entirely.
2. Personalization: If User A asks "Show me my last orders," you absolutely cannot serve that from a global cache to User B. Always namespace your cache keys by user_id or session if the data is private.
3. The Embedding Cost: Generating embeddings isn't free. If you're using OpenAI’s text-embedding-3-small, it’s very cheap, but it’s still an API call. For maximum performance, run a small model like BGE-Micro locally on your application server.
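The first two gotchas both boil down to checking metadata before you trust a similarity score. A hypothetical lookup guard (the `similarity_to` callable and field names are illustrative, not from any particular library):

```python
import time

# Each cache entry carries the metadata the gotchas above require.
cache = []  # dicts: {"user_id", "embedding", "response", "stored_at"}

def lookup(user_id, similarity_to, threshold=0.9, ttl_seconds=300):
    """Return a cached response only if it belongs to this user,
    clears the similarity bar, and hasn't gone stale."""
    now = time.time()
    for entry in cache:
        if entry["user_id"] != user_id:              # personalization guard
            continue
        if now - entry["stored_at"] > ttl_seconds:   # temporal-drift guard
            continue
        if similarity_to(entry["embedding"]) > threshold:
            return entry["response"]
    return None
```

The guards run before the (more expensive) similarity check, so stale or foreign entries are filtered out cheaply.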

Why Bother?

It comes down to the user experience. An LLM might take 2-5 seconds to stream a response. A semantic cache hit takes about 20ms. When you combine that with the fact that you’re potentially cutting your API costs by 20-40% depending on your traffic patterns, it becomes a no-brainer.

Stop caching strings. Start caching meaning. Your wallet (and your users) will thank you.