
A 5-Point Checklist for the Near-Duplicate Detector: Scaling Document Similarity with SimHash

Learn how bitwise similarity can rescue your AI crawler or search engine from the O(n²) computational nightmare of identifying near-identical content at scale.


How do you tell the difference between two 5,000-word blog posts that are identical except for a single tracking pixel or a "Recommended for You" sidebar?

If you're building a web crawler or a search engine, this isn't just a philosophical question—it’s a resource-management nightmare. Traditional hashing (like MD5 or SHA-256) is designed to be "brittle"; change one comma, and the whole hash flips. That's great for security, but terrible for identifying near-duplicates. If you try to compare every new document against your database using a standard string comparison, you’re looking at an $O(n^2)$ complexity trap that will melt your servers faster than a crypto miner in July.

This is where SimHash (a Locality Sensitive Hashing algorithm) saves the day. It turns huge documents into small fingerprints where "similar" inputs produce "similar" hashes.

Before you start deduplicating your data, here is the 5-point checklist I use to ensure the detector actually works at scale.

1. Choose Your "Features" Wisely (Don't just hash words)

If you just split a document by spaces and hash the words, you lose all context. "The dog bit the man" and "The man bit the dog" look identical to a bag-of-words model.

For near-duplicate detection, k-shingling is usually the better play. Instead of single words, you use overlapping sequences of characters or words.

def get_shingles(text, k=3):
    """
    Creates word-level shingles. 
    'The quick brown fox' -> ['The quick brown', 'quick brown fox']
    """
    words = text.lower().split()
    return [" ".join(words[i:i+k]) for i in range(len(words) - k + 1)]

text = "Scaling document similarity with SimHash is surprisingly fun."
print(get_shingles(text, k=2))
# Output: ['scaling document', 'document similarity', 'similarity with', ...]

The Gotcha: Too small a k and everything looks the same. Too large, and you miss genuine duplicates with slight variations. I’ve found k=3 for words or k=8 for characters is usually the "Goldilocks" zone for web content.
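For the character-level flavor, a minimal variant might look like this (the function name is mine, not a standard API):

```python
def get_char_shingles(text, k=8):
    """
    Creates overlapping character k-grams.
    Good for catching duplicates that differ by small word-level edits.
    """
    text = text.lower()
    return [text[i:i+k] for i in range(len(text) - k + 1)]

print(get_char_shingles("simhash", k=4))
# Output: ['simh', 'imha', 'mhas', 'hash']
```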

2. Weight Your Features (Not all shingles are created equal)

In a typical technical doc, the shingle "the following example" appears everywhere. The shingle "bitwise-xor-redundancy" is rare. If you treat them equally, the common noise will drown out the signal.

When building your SimHash, you can apply weights to your features. You can use simple frequency counts or, if you want to be fancy, a pre-calculated TF-IDF score.

# Simple weighting example
features = {
    "simhash algorithm": 5, # High weight: rare/important
    "the is a": 1           # Low weight: common/noise
}
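If you don't have TF-IDF scores handy, raw frequency counts are a reasonable first pass. A minimal sketch, building on the shingles from step 1 (the helper name is my own):

```python
from collections import Counter

def weight_by_frequency(shingles):
    """Weight each shingle by how often it appears in the document."""
    return dict(Counter(shingles))

shingles = ["the quick", "quick brown", "the quick"]
print(weight_by_frequency(shingles))
# Output: {'the quick': 2, 'quick brown': 1}
```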

3. The Bitwise Summation (The "Magic" Step)

Here is where the SimHash actually happens. You aren't just XORing bits. You are creating a vector of integers (usually 64) initialized to zero. For every feature:
1. Hash the feature to a 64-bit value.
2. Iterate through the bits: if the $i$-th bit is 1, add the feature's weight to the $i$-th position of your vector. If it's 0, subtract the weight.

Finally, you collapse that vector back into a single 64-bit fingerprint: if the vector position is $>0$, the bit is 1. Otherwise, it's 0.

import hashlib

def simhash(features):
    v = [0] * 64
    for feature, weight in features.items():
        # MD5 produces 128 bits; mask down to the low 64 for our fingerprint
        h = int(hashlib.md5(feature.encode('utf-8')).hexdigest(), 16) & ((1 << 64) - 1)
        for i in range(64):
            bit = (h >> i) & 1
            if bit:
                v[i] += weight
            else:
                v[i] -= weight

    fingerprint = 0
    for i in range(64):
        if v[i] > 0:
            fingerprint |= (1 << i)
    return fingerprint

# Let's say we have two slightly different feature sets
doc1 = {"hello": 1, "world": 1, "simhash": 5}
doc2 = {"hello": 1, "world": 1, "simhash": 4} # slight weight change

print(f"Doc 1 Fingerprint: {bin(simhash(doc1))}")
print(f"Doc 2 Fingerprint: {bin(simhash(doc2))}")

4. Define Your Hamming Distance Threshold

Now you have two 64-bit integers. To see how similar they are, you calculate the Hamming Distance—the number of bit positions where the two bits are different.

In Python, this is incredibly fast thanks to the bin(f1 ^ f2).count('1') trick (on Python 3.10+, (f1 ^ f2).bit_count() is faster still).

def hamming_distance(f1, f2):
    x = f1 ^ f2
    return bin(x).count('1')

f1 = simhash({"apple": 1, "banana": 2})
f2 = simhash({"apple": 1, "banana": 3})

distance = hamming_distance(f1, f2)
if distance < 3:
    print(f"Near-duplicates! Distance: {distance}")
else:
    print(f"Different enough. Distance: {distance}")

The Rule of Thumb: For a 64-bit SimHash, a Hamming distance of 3 or less usually indicates near-identical content. If the distance is over 10, they are completely different animals.

5. Use Chunked Lookup Tables for Near-$O(1)$ Lookups

If you have 10 million documents, you can't calculate the Hamming distance of a new doc against all 10 million existing ones. That's an $O(n)$ scan per insert—$O(n^2)$ over the whole crawl—exactly the trap we were trying to avoid.

The trick to scaling SimHash is using Multiple Tables.
If you’re looking for a Hamming distance of $k=3$, you can split your 64-bit hash into 4 chunks (16 bits each). If two hashes differ by at most 3 bits, at least one of those 16-bit chunks must be identical (by the Pigeonhole Principle).

1. Store your fingerprints in 4 different hash maps (one for each 16-bit chunk).
2. When a new document comes in, check its 4 chunks against the 4 maps.
3. You’ll only have a handful of candidates to check for the full Hamming distance.

This turns a global search into a few direct dictionary lookups. It’s the difference between a crawler that finishes in minutes vs. one that never finishes at all.

Wrapping Up

SimHash is one of those rare "elegant" solutions in computer science. It’s not perfect—it can be sensitive to very short documents—but for massive-scale text processing, it’s a workhorse.

If you're building this today, start with a 64-bit hash and word-level 3-shingles. Don't over-engineer the weights until you see how the Hamming distances shake out on your actual data. And for heaven's sake, don't use a loop to compare every hash to every other hash in your database. Your CPU (and your AWS bill) will thank you.