
3 Mathematical Properties of 'HyperLogLog' That Will Help You Count Billions of Items with 1KB of RAM
When accurate counting becomes too expensive for your memory budget, probabilistic data structures offer a way to trade a tiny bit of precision for massive scale.
I was once tasked with building a real-time dashboard for a high-traffic news site. Every time a user clicked a link, we needed to update the "Unique Visitors" count for that specific article. Within three hours of launch, our Redis instance hit its memory limit and crashed because we were storing millions of UUIDs in standard Sets.
The bill for upgrading the RAM was going to be astronomical, but the fix turned out to be a piece of code that occupied less space than a low-resolution thumbnail.
This is the magic of HyperLogLog (HLL). It's a probabilistic data structure that estimates the number of unique elements (cardinality) in a set. While a traditional Set grows linearly with the number of items you add, HyperLogLog has a fixed memory footprint. You can count 10 items or 10 billion items, and it will still take only a few kilobytes: about 1.5 KB for a ~2% standard error, or the 12 KB that Redis uses to hit 0.81%.
To truly trust HLL in production, you have to understand the three mathematical properties that make this "impossible" compression work.
1. The Geometry of Randomness: Trailing Zeros
The fundamental observation behind HyperLogLog is surprisingly intuitive if you think about coin tosses.
Imagine I tell you that I flipped a coin repeatedly until it landed on "Heads." If I told you the longest streak of "Tails" I saw before hitting that first "Heads" was zero, you'd guess I only flipped it once or twice. But if I told you I saw a streak of 10 "Tails" in a row, you’d assume I had been flipping that coin for quite a while. Mathematically, the probability of seeing a sequence of $n$ tails is $(1/2)^n$.
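This intuition is easy to check empirically. Here is a short simulation (the helper function and seed below are my own illustration, not part of HLL) showing that the longest tails streak over $n$ flips hovers around $\log_2 n$:

```python
import random

def longest_tails_streak(num_flips):
    """Longest run of consecutive tails in a sequence of fair coin flips."""
    longest = current = 0
    for _ in range(num_flips):
        if random.random() < 0.5:  # tails
            current += 1
            longest = max(longest, current)
        else:                      # heads: the streak resets
            current = 0
    return longest

random.seed(42)
for n in [10, 100, 1000, 10000, 100000]:
    print(f"{n:>6} flips -> longest tails streak: {longest_tails_streak(n)}")
```

Run it a few times with different seeds: each tenfold increase in flips adds roughly $\log_2 10 \approx 3.3$ to the streak length.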
HyperLogLog applies this logic to the binary representation of hashed data. When we hash an input, we get a seemingly random string of bits. If we see a hash that ends in ...1, it’s like a single coin flip. If we see a hash that ends in ...1000, that’s like seeing three "Tails" (zeros) before a "Heads" (one).
The more unique items we hash, the higher the probability that we will eventually encounter a hash with a long run of zeros. (The original paper counts leading zeros; this article counts trailing zeros for the sake of intuition — the math works out the same either way.)
Here is a simple Python function to demonstrate how we find the "rank" (the position of the first 1-bit):
```python
import hashlib

def get_rank(item, bit_length=64):
    # Hash the item and keep only the low `bit_length` bits
    hash_value = int(hashlib.sha256(str(item).encode('utf8')).hexdigest(), 16)
    hash_value &= (1 << bit_length) - 1
    binary = bin(hash_value)[2:].zfill(bit_length)
    # Find the position of the first '1' from the right
    # (using trailing zeros for this example)
    first_one = binary[::-1].find('1')
    if first_one == -1:
        return bit_length + 1  # no 1-bit at all: the maximal rank
    return first_one + 1

# Let's test it
print(f"Rank of 'apple': {get_rank('apple')}")
print(f"Rank of 'banana': {get_rank('banana')}")
```

If the maximum rank we've seen is $R$, we can estimate that we have roughly $2^R$ unique items. However, relying on a single "maximum" is incredibly risky. One "lucky" hash with 20 zeros would make us think we have a million items when we might only have ten. This leads us to the second property.
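To see just how noisy the single-maximum estimator is, here is a small sketch (the function names are mine) that applies the naive $2^R$ estimate to several disjoint batches of the same size. The batches all contain exactly 1,000 distinct items, yet the estimates typically scatter across an order of magnitude:

```python
import hashlib

def get_rank(item, bit_length=64):
    # Position of the first 1-bit from the right (trailing zeros + 1)
    h = int(hashlib.sha256(str(item).encode('utf8')).hexdigest(), 16)
    h &= (1 << bit_length) - 1
    if h == 0:
        return bit_length + 1  # all-zero hash: maximal rank
    return (h & -h).bit_length()

def naive_estimate(items):
    # Single-maximum estimator: 2 ** (highest rank seen)
    return 2 ** max(get_rank(x) for x in items)

# Five disjoint batches of 1,000 items each: same true cardinality,
# wildly different estimates
for batch in range(5):
    items = [f"user_{batch}_{i}" for i in range(1000)]
    print(f"batch {batch}: naive estimate = {naive_estimate(items)}")
```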
2. Stochastic Averaging and the Harmonic Mean
To fix the "lucky hash" problem, HyperLogLog doesn't just keep one maximum. It divides the incoming data into many "buckets" (or registers).
When an item comes in, we use part of the hash to decide which bucket it belongs to, and the rest of the hash to calculate the rank. If the new rank is higher than what’s currently in the bucket, we update it.
But even with buckets, how do we combine them? If we take the standard arithmetic mean, a single outlier still exerts too much influence. If you have ten buckets with values [1, 1, 1, 2, 1, 1, 20, 1, 1, 1], the arithmetic mean is 3. But that 20 is a massive outlier.
HyperLogLog uses the Harmonic Mean. The harmonic mean is the reciprocal of the average of the reciprocals. It is famously good at discounting large outliers.
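For the ten bucket values above, the two means differ dramatically:

```python
values = [1, 1, 1, 2, 1, 1, 20, 1, 1, 1]

# Arithmetic mean: the outlier 20 drags it up
arithmetic = sum(values) / len(values)
# Harmonic mean: reciprocal of the mean of reciprocals
harmonic = len(values) / sum(1 / v for v in values)

print(f"arithmetic mean: {arithmetic}")    # 3.0
print(f"harmonic mean:   {harmonic:.2f}")  # 1.17
```

The harmonic mean stays close to the typical bucket value of 1, exactly the damping behavior HLL needs.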
The formula for the raw estimate looks like this:
$$E = \alpha_m m^2 \left( \sum_{j=1}^m 2^{-M[j]} \right)^{-1}$$
Where:
- $m$ is the number of registers.
- $M[j]$ is the value in each register.
- $\alpha_m$ is a constant to correct for systematic bias.
Here is how you would implement a simplified version of this averaging logic in Python:
```python
import hashlib

class SimplifiedHLL:
    def __init__(self, p=10):
        self.p = p                 # Precision: 2^p registers
        self.m = 2 ** p
        self.registers = [0] * self.m
        # Alpha constant for bias correction
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Hash the item (MD5 gives us 128 bits)
        h = int(hashlib.md5(str(item).encode('utf8')).hexdigest(), 16)
        # Use the low p bits for the register index
        idx = h & (self.m - 1)
        # Use the remaining bits to find the rank
        w = h >> self.p
        rank = self._get_rank(w)
        # Update the register if the new rank is higher
        self.registers[idx] = max(self.registers[idx], rank)

    def _get_rank(self, w):
        if w == 0:
            return 128 - self.p  # every remaining bit is zero
        # Trailing zeros + 1 == position of the lowest 1-bit
        return (w & -w).bit_length()

    def count(self):
        # Harmonic mean of 2^register across all registers
        reciprocal_sum = sum(2.0 ** -r for r in self.registers)
        return self.alpha * (self.m ** 2) / reciprocal_sum

# Usage
hll = SimplifiedHLL(p=10)  # 1024 registers
for i in range(10000):
    hll.add(f"user_{i}")
print(f"Estimated count: {int(hll.count())}")
```

By using the harmonic mean, HLL effectively ignores the "lucky" hashes that would otherwise break our estimate. It looks at the global distribution across all buckets to find the truth.
3. The "Small Range" Correction (Linear Counting)
One of the most frequent complaints about probabilistic algorithms is that they are "overkill" for small numbers. If you only have 50 unique visitors, using a complex logarithmic formula seems like using a chainsaw to cut a grape. Worse, HLL is actually quite inaccurate for very small sets.
When the set is small, many of the registers will remain at zero. This creates a specific kind of bias in the harmonic mean.
The original HyperLogLog paper addresses this by switching to Linear Counting when the estimate is below a certain threshold (usually $2.5 \times m$). Linear counting relies on the number of empty registers. If most of your buckets are empty, it's mathematically easier to estimate the size based on the ratio of empty to full buckets.
The correction formula is:
$$E_{\text{small}} = m \ln\left(\frac{m}{V}\right)$$
where $V$ is the number of registers still equal to zero.
This is why HLL is so versatile. It’s actually two algorithms in one:
1. Linear Counting for small sets (high precision).
2. Harmonic Mean of Ranks for large sets (massive scale).
In a production-ready implementation, like the one used in Redis, there's even a third stage for "middle-range" values and a "sparse" representation for very small sets that uses even less memory than the standard HLL structure.
Why 1KB?
Let’s talk about the math of the memory footprint.
The standard error of HyperLogLog is approximately $1.04 / \sqrt{m}$.
If we want an error rate of roughly 2%, we can use $m = 2048$ registers, which gives $1.04 / \sqrt{2048} \approx 2.3\%$.
How big is a register? We need to store the maximum rank. If we are hashing 64-bit values and spend $p$ bits on the bucket index, the rank can never exceed $64 - p + 1$, so it fits comfortably in 6 bits ($2^6 = 64$ possible values).
- 2048 registers * 6 bits = 12,288 bits.
- 12,288 bits / 8 = 1,536 bytes.
That is roughly 1.5 KB.
With that 1.5 KB, you can count the entire population of the Earth with a ~2% margin of error. If you need to halve the error, you must quadruple the registers (the error scales as $1/\sqrt{m}$): 8,192 registers bring you to about 1.15% for just 6 KB. Compare that to storing 8 billion 64-bit integers in a traditional set, which would require roughly 64 GB of RAM.
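This sizing arithmetic is worth scripting. Here is a small helper (the function name is my own) that picks the smallest power-of-two register count meeting a strict error target; note that it returns 4,096 for a hard 2% bound, whereas 2,048 registers land at roughly 2.3%:

```python
import math

def registers_for_error(target_error):
    """Smallest power-of-two m with standard error 1.04/sqrt(m) <= target."""
    return 2 ** math.ceil(math.log2((1.04 / target_error) ** 2))

for err in [0.02, 0.01, 0.005]:
    m = registers_for_error(err)
    kb = m * 6 / 8 / 1024  # 6 bits per register
    print(f"target {err:.1%}: m = {m:5d}, memory = {kb:.2f} KB")
```

The 1% row works out to 16,384 registers and 12 KB, which is exactly the configuration Redis ships (with its advertised 0.81% standard error).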
Using HyperLogLog in the Real World
You rarely have to implement the raw math yourself because Redis has arguably the best implementation of HLL available through the PFADD and PFCOUNT commands.
Here is how I solved that initial problem using Redis and Python:
```python
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def track_visit(article_id, user_id):
    # PFADD adds the user_id to the HLL structure for that article.
    # It returns 1 if the internal registers were modified.
    key = f"article:visits:{article_id}"
    r.pfadd(key, user_id)

def get_unique_count(article_id):
    key = f"article:visits:{article_id}"
    return r.pfcount(key)

# Simulating 100,000 visits from 10,000 unique users
for i in range(100000):
    user_id = f"user_{i % 10000}"
    track_visit("hll-post-123", user_id)

print(f"Unique visitors: {get_unique_count('hll-post-123')}")
```

The "PF" prefix in the Redis commands is a nod to Philippe Flajolet, the French mathematician who co-authored the original HyperLogLog paper in 2007.
The Gotchas: When NOT to use HyperLogLog
HyperLogLog is an engineering marvel, but it isn't a silver bullet. There are three specific scenarios where you should stick to traditional sets:
1. You need to know WHO is in the set. HLL is a "write-only" structure in terms of data retrieval. You can ask "How many?" but you can never ask "Is user_123 in this set?" or "Give me a list of all users."
2. You need 100% accuracy. If your use case involves financial auditing or billing where a 1% error results in legal trouble, probabilistic structures are not for you.
3. The data is small and stays small. If you are only ever counting 100 items, a standard Set or even a simple Array is faster and potentially uses less memory due to the overhead of the HLL registers.
Final Thoughts
The beauty of HyperLogLog lies in its defiance of the linear relationship between data size and memory. It reminds us that sometimes, we don't need the exact truth; we just need a very good estimate that doesn't crash our servers.
Next time you're looking at a growing memory bill for your analytics, stop trying to buy more RAM. Instead, look at the distribution of your hashes. The answer might just be hidden in a few trailing zeros and a harmonic mean.