
5 Low-Level VFS Hooks That Unlock Distributed SQLite

Peel back the 'single file' abstraction to see how the SQLite Virtual File System layer enables modern edge-database architectures like Turso and LiteFS.


I used to bang my head against the wall trying to figure out how SQLite could possibly work for "edge" workloads. The pitch for things like Turso or LiteFS sounded like architectural sorcery: how do you take a database famously defined as a "single file on disk" and make it run across a hundred data centers without losing data or killing performance? I assumed there was some massive fork of the SQLite core involved.

Then I found the sqlite3_vfs object.

It turns out, SQLite doesn't actually know how to talk to your hard drive. It doesn't know what a file is, or what a "lock" looks like in Linux versus Windows. Instead, it defines a narrow, pluggable interface called the Virtual File System (VFS). If you can implement a handful of C functions that satisfy this interface, you can make SQLite believe that an S3 bucket, a remote HTTP server, or a distributed Raft cluster is just a local file.

If you want to build the next generation of distributed data tools, these are the five low-level VFS hooks you need to master.

The Architecture of an Interceptor

Before jumping into the hooks, we need to look at the registration. A VFS is essentially a struct of function pointers. When you call sqlite3_open(), SQLite looks up the VFS you've registered and uses those pointers for every single I/O operation.

typedef struct sqlite3_vfs sqlite3_vfs;
struct sqlite3_vfs {
  int iVersion;            /* Structure version number (currently 3) */
  int szOsFile;            /* Size of subclassed sqlite3_file structure */
  int mxPathname;          /* Maximum file pathname length */
  sqlite3_vfs *pNext;      /* Next registered VFS */
  const char *zName;       /* Name of this virtual file system */
  void *pAppData;          /* Pointer to application-specific data */
  int (*xOpen)(sqlite3_vfs*, const char *zName, sqlite3_file*, int flags, int *pOutFlags);
  int (*xDelete)(sqlite3_vfs*, const char *zName, int syncDir);
  int (*xAccess)(sqlite3_vfs*, const char *zName, int flags, int *pResOut);
  // ... and so on
};

In a distributed setup, your VFS acts as a "shim." It usually wraps the default OS VFS (unix on Linux/macOS, win32 on Windows), intercepts the calls it cares about, and passes the rest through.

---

1. xOpen: The Gateway to Metadata

The xOpen hook is the first point of contact. When you call sqlite3_open_v2("my_db.db", ...) with your custom VFS, this function is triggered.

In a distributed system, xOpen is where you decide *what* the file actually is. For a tool like LiteFS, xOpen isn't just opening a file descriptor; it’s identifying the database within a cluster.

One powerful trick here is using URI parameters. SQLite allows you to pass query strings in the filename, which your xOpen implementation can parse to configure behavior on a per-connection basis.

static int myvfs_Open(
  sqlite3_vfs *pVfs,
  const char *zName,
  sqlite3_file *pFile,
  int flags,
  int *pOutFlags
){
  // Is this the main database file (as opposed to a journal or WAL)?
  if (flags & SQLITE_OPEN_MAIN_DB) {
    printf("Opening database: %s\n", zName);
    // Here, a distributed VFS might check a local cache 
    // or contact a control plane to see who owns the write lease.
  }

  // Delegate to the real OS 'open'
  sqlite3_vfs *pParent = (sqlite3_vfs*)pVfs->pAppData;
  return pParent->xOpen(pParent, zName, pFile, flags, pOutFlags);
}

The Distributed Unlock: By intercepting xOpen, you can implement "Lazy Loading." You can open a database that doesn't exist on disk yet, signaling a background process to stream the header and initial pages from a remote source.

2. xRead: On-Demand Data Fetching

SQLite reads data in "pages" (usually 4KB). When the B-Tree engine needs to look up a row, it asks the VFS for a specific chunk of the file via xRead.

If you're building a distributed SQLite like Turso (based on libSQL), xRead is where the magic of "database streaming" happens. Instead of downloading a 50GB database to an edge device, you can intercept xRead, check if the requested offset is in a local cache, and if not, fetch that specific 4KB page over the network.

static int myvfs_Read(
  sqlite3_file *pFile, 
  void *zBuf, 
  int iAmt, 
  sqlite3_int64 iOfst
){
  // iOfst is the byte offset, iAmt is how much to read.
  // In a distributed VFS, you might do:
  int rc;
  if (!is_page_locally_cached(iOfst)) {
    rc = fetch_page_from_remote_storage(iOfst, zBuf, iAmt);
  } else {
    rc = read_from_local_disk(zBuf, iAmt, iOfst);
  }
  return rc;  // Propagate failures; never mask a failed fetch as SQLITE_OK
}

The Gotcha: Latency. If every xRead triggers a round-trip to a remote server, your database will be painfully slow. Distributed SQLite implementations solve this by pre-fetching predicted pages or keeping a hot local cache of the upper levels of the B-Tree (the internal nodes).

3. xWrite: The Replication Engine

This is the big one. If you want to replicate data across a cluster, you have to catch the data as it's being written.

In modern SQLite usage, we almost always use WAL (Write-Ahead Log) mode. In this mode, SQLite doesn't write directly to the .db file during a transaction. Instead, it writes to a .db-wal file. This is a sequence of frames, each containing a new version of a page.

A distributed VFS intercepts xWrite calls to the WAL file. When a transaction commits, the VFS can ship those new WAL frames to a replica or an S3 bucket.

static int myvfs_Write(
  sqlite3_file *pFile,
  const void *zBuf,
  int iAmt,
  sqlite3_int64 iOfst
){
  // 1. Write the data to the local disk so the local instance stays updated
  int rc = osWrite(pFile, zBuf, iAmt, iOfst);
  
  // 2. If this is the WAL file, ship the buffer to our replication stream
  if (is_wal_file(pFile) && rc == SQLITE_OK) {
    enqueue_for_replication(zBuf, iAmt, iOfst);
  }
  
  return rc;
}

By hooking xWrite, you transform SQLite from a storage engine into a log producer. Every write becomes an event that can be played back on a thousand other machines.

4. xLock and xUnlock: Distributed Concurrency

SQLite's biggest hurdle in a distributed environment is the "Single Writer" rule. SQLite relies on file-system level locks (flock, fcntl) to ensure only one process writes at a time. In a distributed system, a local file lock is useless because the "other" process is on a different server.

If you want to allow writes on multiple nodes, your VFS must implement Distributed Locking.

When SQLite calls xLock(SQLITE_LOCK_RESERVED), your VFS shouldn't just talk to the local kernel. It needs to talk to a consensus service like etcd, ZooKeeper, or a custom central lease manager.

static int myvfs_Lock(sqlite3_file *pFile, int eLock){
  if (eLock == SQLITE_LOCK_RESERVED) {
    // Attempt to acquire a global lease
    if (!acquire_network_lease("my_database_lock")) {
        return SQLITE_BUSY; // Tell SQLite someone else is writing
    }
  }
  return osLock(pFile, eLock);
}

Why this is hard: Network partitions. If a node acquires a lock and then the network dies, the entire cluster might be stuck. Distributed SQLite implementations often use "Lease Heartbeats"—if the writer doesn't check in every few seconds, the lock is forcibly released.

5. xSync: The Durability Barrier

xSync is the VFS equivalent of fsync(). It’s SQLite saying, "I have handed you data; do not return until you are 100% sure it is safely on permanent storage."

In a distributed VFS, xSync is where performance goes to die—or where you get clever. If you are doing synchronous replication, xSync shouldn't return until the data has been acknowledged by a majority of the nodes in your cluster (Quorum).

static int myvfs_Sync(sqlite3_file *pFile, int flags){
  // Ensure the local disk is flushed first; propagate any failure
  int rc = osSync(pFile, flags);
  if (rc != SQLITE_OK) return rc;

  // For a distributed DB, ensure remote replicas have received the frames
  if (must_be_durable_across_cluster(pFile)) {
    wait_for_quorum_ack();
  }

  return SQLITE_OK;
}

The design of xSync defines your consistency model. If you return immediately after the local write (Asynchronous Replication), you get incredible performance but might lose the last few transactions if the primary node crashes. If you wait for the cluster, you get "strong consistency" at the cost of network latency.

---

Why does this matter?

The beauty of the VFS is that SQLite has no idea any of this is happening.

You don't have to change your SQL queries. You don't have to change how you handle transactions. You use the standard sqlite3 library, but by swapping the VFS, your "local" database gains the powers of a massive distributed system.

Real-World Examples to Study

If you're inspired to dig deeper, check out these projects:

1. [LiteFS](https://github.com/superfly/litefs): A FUSE-based file system that implements these concepts at the file-system layer rather than through the SQLite VFS API (though the principles are identical). It intercepts file operations to replicate WAL frames.
2. [libSQL / Turso](https://github.com/tursodatabase/libsql): They've extended the VFS to support "virtual WALs," allowing them to ship writes over HTTP to a primary instance.
3. [SQLite-S3vfs](https://github.com/uktrade/sqlite-s3vfs): A Python implementation that lets you run SQLite directly against S3. It heavily uses xRead and xWrite to map S3 blocks to SQLite pages.

Performance vs. Complexity

Implementing a custom VFS isn't free. You're moving the complexity from the database engine to the I/O layer. Here are three things I've learned from tinkering with this:

* Page Size alignment is everything. If your network protocol or your storage backend uses a different block size than SQLite (default 4096), you'll end up with massive "write amplification." Match them up.
* The WAL is your friend. Trying to replicate the main .db file is a nightmare of random-access writes. Replicating the WAL is much easier because it's mostly append-only.
* Error codes are non-negotiable. SQLite is incredibly robust, but it relies on VFS error codes to decide when to retry a transaction or when to report a database as "corrupt." If your network call fails and you return SQLITE_OK, you're going to have a bad time.

SQLite isn't just a file format anymore. It's a protocol for structured data access, and the VFS is the API that lets us redefine where that data actually lives. Once you realize the "disk" is a lie, the possibilities for distributed architecture become endless.