loke.dev
Header image for A Stable Sort for the Global Table

A Stable Sort for the Global Table

Your database indexes are only as reliable as the locale definitions provided by your operating system—until a minor library update silently breaks your B-trees.

· 5 min read

Imagine waking up to a Slack notification telling you that your database is "lying." Your application queries for a record that you *know* exists, but Postgres insists it isn't there—all because you performed a routine OS security patch the night before.

It’s a nightmare scenario that stems from a fundamental misunderstanding: we think our database owns its sort order. It doesn't. By default, Postgres delegates the logic of "Does 'A' come before 'a'?" to the operating system's C library (glibc). When that library updates—which it does—the rules can shift under your feet, silently corrupting every B-tree index you own.

The Invisible Dependency

When you initialize a Postgres cluster with en_US.UTF-8, you’re essentially signing a contract with the Linux kernel. You’re saying, "Hey, whenever I index a string, use whatever logic the current version of glibc thinks is right."

The problem is that glibc developers occasionally change their minds about linguistic rules. In glibc 2.28, for example, the way characters were weighted changed significantly.

If you have a B-tree index built on version 2.27, and you upgrade to 2.28, the index remains physically structured according to the *old* rules. But when you run a SELECT, Postgres uses the *new* rules to navigate that index. It’s like trying to find a word in a dictionary where someone swapped the pages around while you were sleeping. You'll hit a node, look left, but the data you need is now technically to the right according to the new logic. Result? Data not found.

How to check if you're standing on a fault line

If you are running Postgres on Linux, you can see which provider you're currently using. Most legacy setups look like this:

SELECT datname, datcollate, datctype 
FROM pg_database 
WHERE datname = current_database();

If datcollate shows something like en_US.UTF-8, you are using the libc provider. This is the "danger zone."

To see if your OS version has changed since you created the index, check the pg_collation catalog (Postgres 15+):

SELECT collname, collversion 
FROM pg_collation 
WHERE collname = 'default';

If the collversion recorded in the database doesn't match the version currently provided by the OS, Postgres will often throw a warning in the logs: collation version mismatch. Do not ignore this warning.

The B-Tree Divorce

Let's look at what actually happens to the data. Imagine a simple table of names:

CREATE TABLE users (
    name TEXT PRIMARY KEY
);

INSERT INTO users (name) VALUES ('charlie'), ('CAESAR'), ('beta');

In an older version of a locale, CAESAR might sort before beta. In a newer version, it might sort after. If the index was built when CAESAR < beta, the B-tree put CAESAR on the left branch. After the update, the DB thinks CAESAR > beta. When searching for CAESAR, it goes to the right branch, finds nothing, and returns an empty result set.

To fix this manually (the "old way"), you have to reindex everything:

-- This is the "oops, my indexes are broken" command
REINDEX (VERBOSE) TABLE users;

But REINDEX is a band-aid. We want a permanent fix.

Enter ICU: The Stable Alternative

The industry has been moving toward ICU (International Components for Unicode). Unlike libc, ICU is much more consistent across different operating systems and allows Postgres to track versioning more strictly.

Starting in Postgres 15, you can define ICU as your default collation provider. Here is how you'd create a table that is much more resistant to OS-level surprises:

-- Creating a stable, ICU-based collation
CREATE COLLATION nd_icu (
    provider = 'icu',
    locale = 'en-US',
    deterministic = false
);

CREATE TABLE products (
    sku_name TEXT COLLATE "nd_icu" PRIMARY KEY
);

By specifying provider = 'icu', you're telling Postgres to stop listening to the OS's glibc and instead use the ICU library. While ICU versions can also change, Postgres 15+ tracks these versions. If the ICU version changes, Postgres knows exactly which indexes are stale.

Postgres 17 and the "Built-in" Savior

The real "Holy Grail" arrived with Postgres 17. It introduced the builtin provider for the C and UTF-8 locales.

Why is this a big deal? Because the builtin provider is immutable. The sort logic is compiled directly into Postgres itself. It will *never* change, regardless of whether you're running on Debian, RedHat, or macOS, and regardless of what version of glibc you have.

If you are starting a new project on Postgres 17, this is how you should define your database to avoid the "global table" sorting headache forever:

-- In Postgres 17, use the immutable builtin provider
CREATE DATABASE my_stable_db
LOCALE_PROVIDER = 'builtin'
LOCALE = 'en_US.UTF-8';

With this setting, your B-trees are finally safe. You can move your data folder from an old Ubuntu server to a brand-new Alpine container, and the sort order will remain identical.

Practical Takeaways

If you’re managing a production database today, here’s my advice:

1. Monitor for Mismatches: If you're on Postgres 15 or 16, query pg_collation for version mismatches. If you see them, plan a REINDEX CONCURRENTLY during off-peak hours.
2. Verify with `amcheck`: If you suspect corruption, use the amcheck extension to verify your B-trees without a full rebuild:
`sql
CREATE EXTENSION amcheck;
SELECT bt_index_check(c.oid, true)
FROM pg_index i
JOIN pg_class c ON i.indexrelid = c.oid
WHERE c.relname = 'users_pkey';
`
3. Migrate to ICU: If you’re on a modern Postgres version, start moving columns that require specific linguistic sorting to ICU collations.
4. Target Postgres 17: The builtin provider is the final boss of database stability. If you're tired of OS updates breaking your indexes, that's your exit ramp.

Sorting seems trivial until your unique constraint fails or your primary key lookups return NULL. Treat your locales as a dependency, and pin them just like you'd pin a library version in your code.