
The Lossy Dimension
High-scale vector search shouldn't bankrupt your infrastructure; here is the low-level math behind Product Quantization (PQ) and how to squeeze billion-embedding datasets into a fraction of the RAM they would otherwise demand.
I was looking at a cloud bill last month and realized we were paying several thousand dollars just to keep a few million embedding vectors "warm" in RAM. It’s the kind of realization that makes you rethink the way we treat modern embedding models. We take these 1536-dimensional vectors from OpenAI or Cohere, toss them into a flat index, and then act surprised when the infrastructure costs start looking like a mortgage payment.
The reality is that vectors are heavy. If you’re running a small-scale RAG (Retrieval-Augmented Generation) app with 10,000 documents, you can afford to be sloppy. But once you hit the million-vector mark—and certainly the billion-vector mark—the physics of memory becomes your primary antagonist. You simply cannot store everything in its raw, high-precision state. You have to embrace the lossy dimension.
The Brutal Math of High-Dimensional Storage
Let's do some back-of-the-napkin math. Suppose you have a dataset of 100 million embeddings. Each embedding has 768 dimensions (a standard BERT-sized output), and you’re using float32 precision.
```python
# The cost of "Flat" storage
num_vectors = 100_000_000
dimensions = 768
bytes_per_float = 4

total_gb = (num_vectors * dimensions * bytes_per_float) / (1024**3)
print(f"Memory required: {total_gb:.2f} GB")
# Output: Memory required: 286.10 GB
```
Nearly 300 gigabytes of RAM just for the raw data. That doesn't account for index overhead, operating system buffers, or the fact that you probably want some level of redundancy. If you’re using a managed service, you’re looking at a high-tier instance that will eat your margin for breakfast.
We need a way to compress this data that doesn't completely destroy our ability to find "similar" things. This is where Product Quantization (PQ) enters the chat.
Breaking the Vector: How PQ Works
Product Quantization is essentially the art of "lossy compression for distance math." Instead of trying to compress the whole vector as one giant unit, we chop it into pieces.
Imagine a 1024-dimensional vector. If we try to run K-Means clustering on a 1024-D space to find "typical" vectors, the "curse of dimensionality" will ensure that every point is roughly the same distance from every other point. It's a nightmare.
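You can watch this concentration effect happen numerically. Here's a quick illustrative sketch (pure NumPy, small sample sizes, values vary by seed) measuring how tightly pairwise distances cluster as dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_spread(dim, n=100):
    """Std/mean of pairwise L2 distances between n random points."""
    points = rng.random((n, dim))
    # All pairwise distances via broadcasting (fine for small n)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    dists = dists[np.triu_indices(n, k=1)]  # keep each unique pair once
    return dists.std() / dists.mean()

for dim in (2, 32, 1024):
    print(f"D={dim:>4}: relative spread of distances = {relative_spread(dim):.3f}")
```

At D=2 the spread is large; by D=1024 every pair of points sits at nearly the same distance, which is exactly why clustering the full space directly falls apart.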
PQ solves this by slicing the vector into $m$ distinct sub-vectors (subspaces).
1. Slicing: We take our $D=1024$ vector and break it into $m=8$ sub-vectors, each with $128$ dimensions.
2. Clustering: For each of those 8 subspaces, we run K-Means on a sample of our data. We usually find 256 "centroids" (the most representative points) for each subspace.
3. Encoding: Since we have 256 centroids per subspace, we can represent any sub-vector by the index of its nearest centroid. An index from 0 to 255 fits in exactly 8 bits (1 byte).
After this process, our 1024-dimensional vector (which originally took $1024 \times 4 = 4096$ bytes) is now represented by 8 bytes. That is a 512x compression ratio.
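That arithmetic generalizes. A tiny helper (hypothetical, just restating the formula above) makes the trade-off explicit:

```python
def pq_compression(dims, m, bits_per_code=8, bytes_per_float=4):
    """Raw size, PQ code size, and compression ratio for one vector."""
    raw_bytes = dims * bytes_per_float       # e.g. 1024 * 4 = 4096
    code_bytes = m * bits_per_code // 8      # e.g. 8 codes * 8 bits = 8 bytes
    return raw_bytes, code_bytes, raw_bytes / code_bytes

print(pq_compression(dims=1024, m=8))   # (4096, 8, 512.0)
```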
Implementing a Naive Product Quantizer
To really grok this, we should look at how we'd build a simple version using NumPy. This isn't production-grade, but it reveals the plumbing.
```python
import numpy as np
from scipy.cluster.vq import kmeans2

class SimplePQ:
    def __init__(self, m, k=256):
        self.m = m  # number of subspaces
        self.k = k  # centroids per subspace
        self.codebooks = []

    def fit(self, data):
        # data shape: [N, D]
        N, D = data.shape
        assert D % self.m == 0, "Dimensions must be divisible by m"
        ds = D // self.m
        for i in range(self.m):
            # Slice the data into subspaces
            sub_data = data[:, i*ds : (i+1)*ds]
            # Cluster the subspace
            centroids, _ = kmeans2(sub_data, self.k, minit='points')
            self.codebooks.append(centroids)

    def encode(self, data):
        N, D = data.shape
        ds = D // self.m
        # Our codes will be uint8 because k=256
        codes = np.zeros((N, self.m), dtype=np.uint8)
        for i in range(self.m):
            sub_data = data[:, i*ds : (i+1)*ds]
            # Find nearest centroid for each sub-vector
            # (In reality, use a more efficient distance calc here)
            centroids = self.codebooks[i]
            distances = np.linalg.norm(sub_data[:, np.newaxis] - centroids, axis=2)
            codes[:, i] = np.argmin(distances, axis=1)
        return codes

# Dummy data: 1000 vectors, 128 dimensions
data = np.random.random((1000, 128)).astype('float32')
pq = SimplePQ(m=8)  # 8 subspaces of 16 dims each
pq.fit(data)
compressed_data = pq.encode(data)

print(f"Original shape: {data.shape}, Size: {data.nbytes} bytes")
print(f"Compressed shape: {compressed_data.shape}, Size: {compressed_data.nbytes} bytes")
```

The output usually looks something like this:

```
Original shape: (1000, 128), Size: 512000 bytes
Compressed shape: (1000, 8), Size: 8000 bytes
```
The Magic of Asymmetric Distance Computation (ADC)
Now, you might be thinking: *"Great, I've compressed my data, but now I have to decompress every single vector to calculate the distance during a search? That sounds slow."*
This is the clever part. You don't decompress.
When a query comes in, it's a raw, uncompressed vector. To find the nearest neighbors in the compressed index, we use Asymmetric Distance Computation (ADC).
1. Take the query vector $q$ and slice it into $m$ pieces.
2. For each piece, calculate the distance between that piece and all 256 centroids in that subspace's codebook.
3. Store these distances in a Look-up Table (LUT). Since we have $m$ subspaces and 256 centroids each, the LUT is small ($m \times 256$).
4. To get the approximate distance between the query $q$ and a compressed vector $C$ in our database, we just sum up the pre-calculated distances from the LUT.
```python
def query_pq(query_vec, pq_instance, codes):
    m = pq_instance.m
    ds = len(query_vec) // m

    # Precompute Lookup Table (LUT)
    # Shape: [m, k]
    lut = np.zeros((m, pq_instance.k))
    for i in range(m):
        sub_query = query_vec[i*ds : (i+1)*ds]
        centroids = pq_instance.codebooks[i]
        # Distance from sub-query to all centroids in this subspace
        lut[i] = np.linalg.norm(centroids - sub_query, axis=1)**2

    # Calculate distances to all items in 'codes' using the LUT
    # This is a very fast operation: just additions
    dists = np.zeros(len(codes))
    for i in range(m):
        dists += lut[i, codes[:, i]]
    return dists
```

The heavy lifting (the L2 distance or dot product) is only done $m \times 256$ times per query. Everything else is just a series of lookups and additions. This is why PQ is fast enough to handle millions of comparisons in milliseconds on a single CPU core.
When Does it Break? (The Trade-offs)
I've seen developers get excited about PQ and then immediately regret it because their search results turned into garbage. PQ is not a silver bullet. It is inherently lossy.
1. The Training Set Problem
The "Codebook" (the centroids) must represent the distribution of your data. If you train your PQ codebook on a dataset of Wikipedia articles but your production data is all Python code snippets, the centroids will be in the wrong place. Your "lossy" compression will lose the most important features.
2. High Variance in Subspaces
PQ assumes that the subspaces are somewhat independent. If your vector dimensions are highly correlated, slicing them arbitrarily into chunks might split a single feature across two subspaces, making the quantization much less effective. This is why many production systems run a Principal Component Analysis (PCA) or an OPQ (Optimized Product Quantization) rotation before slicing.
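To see why the rotation helps, here's a minimal NumPy sketch of the PCA (decorrelation) step on deliberately correlated data. Note that real OPQ learns the rotation jointly with the codebooks, which this does not do; the data and noise scale here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Deliberately correlated data: adjacent dimension pairs share a signal
base = rng.normal(size=(5000, 8))
data = np.repeat(base, 2, axis=1) + 0.1 * rng.normal(size=(5000, 16))

# PCA rotation = eigenvectors of the covariance matrix
centered = data - data.mean(axis=0)
cov = centered.T @ centered / len(centered)
eigvals, eigvecs = np.linalg.eigh(cov)
rotated = centered @ eigvecs  # dimensions are now decorrelated

# After rotation, off-diagonal covariance is numerically zero, so slicing
# into sub-vectors no longer splits one correlated feature across chunks
rot_cov = rotated.T @ rotated / len(rotated)
off_diag = np.abs(rot_cov - np.diag(np.diag(rot_cov))).max()
print(f"Max off-diagonal covariance after rotation: {off_diag:.2e}")
```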
3. The Recall vs. Compression Slider
- Increasing $m$ (more subspaces) improves accuracy (Recall) but increases memory and search time.
- Increasing $k$ (more centroids) improves accuracy but makes the LUT larger and the codebook training slower. Usually, $k=256$ is the sweet spot because 1 byte is very easy for CPUs to handle.
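The knobs are easy to explore numerically. This sketch assumes the standard accounting (each code costs $m \cdot \log_2 k / 8$ bytes per vector; the per-query LUT has $m \times k$ entries):

```python
import math

def pq_footprint(m, k, num_vectors):
    """Total code storage (bytes) and per-query LUT size for a PQ index."""
    bits_per_subvector = math.log2(k)            # 8 bits when k=256
    code_bytes = num_vectors * m * bits_per_subvector / 8
    lut_entries = m * k                          # distances computed per query
    return code_bytes, lut_entries

for m in (8, 16, 32):
    code_bytes, lut = pq_footprint(m, k=256, num_vectors=100_000_000)
    print(f"m={m:>2}: codes = {code_bytes / 1024**3:.2f} GiB, LUT = {lut} entries")
```

Even at $m=32$, the 100-million-vector dataset from earlier fits in single-digit gigabytes of codes, down from ~286 GB of raw floats.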
Production Implementation with Faiss
If you're doing this for real, don't use my NumPy snippet above. Use Faiss (Facebook AI Similarity Search). It’s written in C++ and has highly optimized SIMD instructions for PQ lookups.
Here is how you would set up an IVFPQ index. The "IVF" (Inverted File) part is another layer of optimization that clusters the dataset first so you only search a small fraction of the PQ-encoded vectors.
```python
import faiss
import numpy as np

d = 768       # dimension
nlist = 100   # how many cells/clusters for the IVF part
m = 32        # number of sub-quantizers (768 / 32 = 24 dims per sub-vector)
nbits = 8     # bits per sub-vector (standard)

# Define the index
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

# Generate some fake data
data = np.random.random((100_000, d)).astype('float32')

# Training is crucial for PQ!
index.train(data)
index.add(data)

# Search for the 5 nearest neighbors
query = data[:1]
distances, indices = index.search(query, 5)
print(f"Indices of 5 nearest neighbors: {indices}")
```

In this setup, each vector takes exactly 32 bytes of storage in the index (plus a small overhead for the IVF list). Compared to the original 3,072 bytes ($768 \times 4$), that's a 99% reduction in memory.
Squeezing the Last Bit
If you’re still hitting memory limits, you can look into Scalar Quantization (SQ). SQ doesn't chunk the vector; it just reduces the precision of each individual float. Moving from float32 to int8 (8-bit integers) gives you an immediate 4x reduction. It’s often used as a gentler first step before reaching for PQ, or paired with graph indexes like HNSW.
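The core of SQ fits in a few lines. Here's a minimal sketch of per-dimension affine quantization to uint8 in NumPy (production implementations, such as Faiss's scalar quantizers, handle ranges and outliers more carefully; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 768)).astype('float32')

# Per-dimension affine quantization: map [min, max] onto [0, 255]
mins = data.min(axis=0)
scales = (data.max(axis=0) - mins) / 255.0
codes = np.round((data - mins) / scales).astype(np.uint8)  # 1 byte per dim

# Dequantize and check how much precision we gave up
reconstructed = codes.astype('float32') * scales + mins
rel_err = np.abs(reconstructed - data).max() / np.abs(data).max()
print(f"{data.nbytes} -> {codes.nbytes} bytes (4x), "
      f"max relative error ~{rel_err:.3%}")
```

The reconstruction error stays well under a percent on well-behaved data, which is why SQ is such a cheap first win before anything fancier.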
But for the true high-scale stuff—the "billion-scale" vector search—PQ remains the gold standard. It’s the difference between needing a cluster of expensive r6g.16xlarge instances and running your search index on a single high-memory box.
The "Lossy Dimension" isn't something to fear. It's the only way to make the economics of modern AI work. You just have to be deliberate about where you shave off the bits.
The goal isn't perfect reconstruction of the vector; the goal is preserving the *neighborhood*. In a high-dimensional space, you don't need to know exactly where a point is—you just need to know who its neighbors are. PQ gives you exactly that, at a fraction of the cost.