loke.dev

How to Sample Billion-Row Tables Without the Latency of a Full Postgres Scan

Harness the power of the TABLESAMPLE clause to run near-instant statistical queries on massive datasets without triggering a sequential scan.

· 7 min read

Run SELECT count(*) FROM transactions; on a table that’s breached the billion-row mark and you might as well go grab a coffee. By the time you get back, Postgres might still be grinding through a sequential scan, churning your I/O and locking up buffer cache space that could be better used for active queries.

When we deal with "Big Data" inside a relational database, we often don't need the exactitude of a full scan. If you're calculating an average order value, a churn rate, or a distribution of user ages across a 500GB table, a representative sample is usually enough. The problem is that most developers try to sample using ORDER BY random() LIMIT 1000, which is a performance suicide note for large tables.

Postgres has a native solution that’s been around since version 9.5: the TABLESAMPLE clause. It allows you to grab a subset of data by touching only a fraction of the physical blocks on disk.

Why ORDER BY random() Fails at Scale

Before looking at the right way, let’s look at the wrong way. I’ve seen this in production more times than I care to admit:

SELECT * FROM large_logs 
ORDER BY random() 
LIMIT 1000;

On a billion-row table, this query is a disaster. To execute this, Postgres must:
1. Generate a random value for every single one of the billion rows.
2. Sort the entire billion-row dataset based on those random values.
3. Discard everything except the top 1000 rows.

You aren't just reading the table; you're performing a massive sort operation that will likely spill to disk (work_mem won't save you here). Even if you have the fastest NVMe drives, you're looking at minutes of latency.
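You can see the problem directly in the plan. On a hypothetical large_logs table, EXPLAIN shows the sort sitting on top of a full sequential scan (the plan sketch below is illustrative; exact costs and row counts depend on your table):

EXPLAIN SELECT * FROM large_logs ORDER BY random() LIMIT 1000;

-- Illustrative plan shape:
--  Limit
--    ->  Sort
--          Sort Key: (random())
--          ->  Seq Scan on large_logs

Every row must pass through that Sort node before the Limit can discard it.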

The TABLESAMPLE Mechanics

The TABLESAMPLE clause sits directly after the table name in your FROM clause. It tells the executor: "Don't scan the whole thing. Just give me a percentage based on this sampling method."

Postgres ships with two built-in methods: BERNOULLI and SYSTEM.

Bernoulli Sampling

The BERNOULLI method scans the entire table but picks individual rows with a fixed probability. The argument is a percentage: BERNOULLI (0.1) flips a 0.1% weighted coin for every row.

SELECT avg(sale_price) 
FROM sales TABLESAMPLE BERNOULLI (0.1);

The Reality: While this is statistically "cleaner" because every row has an equal chance of being picked, it still performs a full sequential scan of the table to perform those coin flips. It’s better than a sort, but it doesn't solve the I/O bottleneck for billion-row tables. If you want to avoid the latency of a full scan, BERNOULLI is usually not the answer.
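EXPLAIN makes this visible. The node is a Sample Scan rather than a Seq Scan, but its estimated cost still reflects visiting every page (plan sketch is illustrative):

EXPLAIN SELECT avg(sale_price) FROM sales TABLESAMPLE BERNOULLI (0.1);

-- Illustrative plan shape:
--  Aggregate
--    ->  Sample Scan on sales
--          Sampling: bernoulli ('0.1'::real)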

System Sampling

This is where the real speed lives. SYSTEM sampling doesn't look at rows; it looks at pages (data blocks). Postgres stores table data in 8KB blocks by default. SYSTEM picks a random percentage of those blocks and returns every row inside them.

SELECT avg(sale_price) 
FROM sales TABLESAMPLE SYSTEM (0.01);

On a billion-row table, a 0.01% SYSTEM sample might only need to read a few hundred blocks from disk. This can turn a 10-minute query into a 100-millisecond query.
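A handy side effect: you can extrapolate cheap approximate aggregates from the sample. Since 0.01% is 1/10,000 of the table, scaling a sampled count back up gives a rough total. Block-level sampling adds variance, so treat it strictly as an estimate:

-- Approximate row count: scale the 0.01% sample back up by 10,000.
SELECT count(*) * 10000 AS estimated_rows
FROM sales TABLESAMPLE SYSTEM (0.01);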

The Catch: SYSTEM sampling introduces "clustering bias." If your data is naturally ordered on disk—for example, if created_at is roughly correlated with how rows were inserted—rows in the same block will be similar. If you happen to pick a block containing only high-value transactions from Black Friday, your average will be skewed.
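One way to gauge how much clustering risk you're taking is the planner's own statistics. The correlation column in the pg_stats view (populated by ANALYZE) measures how closely a column's values track the physical row order. Assuming a created_at column on the sales table from above:

-- Values near 1.0 or -1.0 mean the column follows on-disk order,
-- which is exactly when SYSTEM sampling is most likely to skew.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'sales' AND attname = 'created_at';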

Extending the Toolkit with tsm_system_rows

One frustration with the standard TABLESAMPLE is that it works on percentages. If you want exactly 500 rows, you have to guess the percentage. If you guess 0.001% and the table grows, you get too many rows; if the table is small, you might get zero.

Postgres includes a "contrib" module that solves this. You’ll need to enable it first:

CREATE EXTENSION tsm_system_rows;

Now you can sample a specific number of rows without worrying about the percentage math:

SELECT customer_id, feedback_score
FROM survey_responses
TABLESAMPLE SYSTEM_ROWS(1000);

Under the hood, SYSTEM_ROWS picks random blocks until it has collected enough rows to satisfy your limit. It’s nearly instantaneous, even on massive datasets, because it stops reading the moment it hits your target count.

Repeatability and the Seed

Data science and reporting often require reproducibility. If you run a sample today and another tomorrow, you want to know if the difference in results is due to new data or just the luck of the draw.

The REPEATABLE clause allows you to pass a seed. Using the same seed on the same dataset will return the same "random" sample every time, as long as the table's contents haven't changed in between; inserts, updates, or a VACUUM can shift rows between blocks and alter the sample.

SELECT user_email 
FROM users TABLESAMPLE SYSTEM (0.5) REPEATABLE (42);

This is incredibly useful for debugging or for pagination-like behavior in sampling where you don't want the "set" of sampled users to jump around while you're analyzing them.
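Different seeds also give you cheap independent samples. Comparing two of them is a quick sanity check on how much sampling noise to expect at a given percentage:

-- If these two counts differ wildly, the percentage is too small
-- (or the data too clustered) for a stable estimate.
SELECT
    (SELECT count(*) FROM users TABLESAMPLE SYSTEM (0.5) REPEATABLE (1)) AS seed_1,
    (SELECT count(*) FROM users TABLESAMPLE SYSTEM (0.5) REPEATABLE (2)) AS seed_2;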

Handling the "Empty Result" Gotcha

A common point of confusion with TABLESAMPLE occurs on smaller tables or with very low percentages. Because SYSTEM sampling picks blocks, if you have a table that only occupies 10 blocks and you ask for a 1% sample, there is a very high probability that Postgres will pick 0 blocks, returning an empty result set even though the table has data.

If your table is small, stick to BERNOULLI or a standard LIMIT. TABLESAMPLE is a tool built for the "Too Big to Scan" tier.
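Before reaching for TABLESAMPLE, it's worth checking how many blocks the table actually occupies:

-- Number of blocks in the table (block_size is 8192 by default).
-- If this number is small, a low-percentage SYSTEM sample will
-- frequently select zero blocks.
SELECT pg_relation_size('large_logs') / current_setting('block_size')::int AS blocks;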

Combining Samples with Joins

You can sample tables that are part of a join, but you have to be careful about *where* the sample is applied.

SELECT u.email, o.total
FROM users u TABLESAMPLE SYSTEM (1)
JOIN orders o ON u.id = o.user_id
WHERE o.status = 'completed';

In this query, Postgres samples 1% of the users table first, and *then* joins those specific users to the orders table. This is much faster than joining a billion users to five billion orders and then trying to sample the result. However, keep in mind that if you have a "sparse" join—where only 1 in 100 users has an order—sampling 1% of users might leave you with almost no results after the join.
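If the sparse join bites you, flip the sampling to the bigger side. Sampling orders directly guarantees every sampled row can contribute to the result (same hypothetical schema as above, with a country column on users added for illustration):

SELECT u.country, avg(o.total) AS avg_order
FROM orders o TABLESAMPLE SYSTEM (0.1)
JOIN users u ON u.id = o.user_id
WHERE o.status = 'completed'
GROUP BY u.country;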

Performance Comparison: A Real-World Scenario

I recently worked on a telemetry database where one table tracked sensor heartbeats. It was sitting at roughly 2.1 billion rows.

The Goal: Calculate the average signal strength.

The Slow Way:

-- Result: 4.2 minutes
SELECT AVG(signal_strength) FROM sensor_data;

The "Pseudo-Random" Slow Way:

-- Result: 11.5 minutes (The sort killed it)
SELECT AVG(signal_strength) FROM (
    SELECT signal_strength FROM sensor_data 
    ORDER BY random() 
    LIMIT 100000
) s;

The Fast Way:

-- Result: 0.08 seconds
SELECT AVG(signal_strength) 
FROM sensor_data TABLESAMPLE SYSTEM (0.01);

The difference is three orders of magnitude. The SYSTEM sample was within 0.2% of the actual average calculated by the full scan. For a dashboard or a health check, that margin of error is a small price to pay for a 3,000x speedup.

When to Avoid TABLESAMPLE

As much as I love this clause, it isn't a silver bullet. There are three specific scenarios where it will fail you:

1. When you need 100% accuracy: Financial auditing, billing, and regulatory reporting don't care about "statistical significance." They care about every cent.
2. Highly Skewed Data: If 90% of your "Active" users are in the first 10% of your data blocks (due to how the table was loaded), SYSTEM sampling will give you a wildly inaccurate picture of your user base.
3. Small Tables: As mentioned, the overhead of block-level sampling isn't worth it for a table that fits in a few MBs of memory.

Optimizing for the Query Planner

One final tip: the query planner isn't always great at estimating how many rows a TABLESAMPLE will return, especially when combined with complex WHERE clauses. If you find that the planner is choosing a bad join strategy (like a nested loop when it should hash) because it thinks the sample will return 1 row when it actually returns 10,000, you might need to use a CTE to "materialize" the sample first.

WITH sampled_users AS MATERIALIZED (
    SELECT id FROM users TABLESAMPLE SYSTEM (5)
)
SELECT * 
FROM sampled_users
JOIN massive_logs ON sampled_users.id = massive_logs.user_id;

By materializing the sample in a CTE, you give the optimizer a hard boundary to plan around. Note the MATERIALIZED keyword: since Postgres 12, plain CTEs referenced once are inlined into the outer query by default, so without it the CTE provides no boundary at all.

Final Thoughts

Sampling is a mindset shift. We are conditioned to think of databases as "truth machines" that must visit every row. But as your data grows from millions to billions, that "truth" becomes increasingly expensive to extract.

If you're building internal tools, data science sandboxes, or monitoring dashboards, TABLESAMPLE SYSTEM is the most underutilized tool in the Postgres arsenal. It’s the difference between a dashboard that feels snappy and one that developers stop using because it takes too long to load.