
S3 Is the Only Filesystem That Matters
Why modern high-scale databases are abandoning the local block store in favor of object storage, and how the 'S3-native' architecture solves the distributed state problem.
Why are we building the world’s most sophisticated databases on top of an API that doesn’t even have a rename() function?
For decades, the path to performance was simple: get the data as close to the CPU as possible. We obsessed over NVMe speeds, IOPS, and local RAID arrays. If you were building a database, you targeted the POSIX filesystem. You expected fsync() to actually mean something, and you designed your entire architecture around the idea that moving data across a network was a cardinal sin.
But a funny thing happened on the way to the cloud. Our "disks" became networked block stores (like AWS EBS), which gave us the worst of both worlds: the latency of the network with the rigid scaling constraints of a physical drive. If your database node died, your data was trapped on that volume until you could re-attach it elsewhere.
Then came the realization that has redefined the last five years of data engineering: S3 isn't just a place to dump logs or host static images. It is the only filesystem that matters for modern, high-scale distributed systems.
The POSIX Lie and the Cloud Reality
Standard filesystems like Ext4 or XFS were designed for a world where the disk was a local, reliable piece of hardware. They provide a rich API: you can append to files, rename directories, and lock ranges of bytes.
S3 provides none of that. It is an object store. You can’t append to an object; you have to overwrite the whole thing. You can’t rename a "folder" (which doesn't exist); you have to copy the objects and delete the old ones.
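Since there is no rename, engines have to emulate one as copy-then-delete. Here is a minimal sketch of that pattern; the ObjectStore class below is a hypothetical in-memory stand-in for the S3 API (with boto3 you would call copy_object() and delete_object() instead):

```python
# Sketch: emulating rename() on a store that only supports put, copy,
# delete, and list. ObjectStore is a hypothetical in-memory stand-in
# for the S3 API, not a real client.

class ObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def copy(self, src, dst):
        # S3 CopyObject: a server-side byte copy, not a metadata move
        self._objects[dst] = self._objects[src]

    def delete(self, key):
        del self._objects[key]

    def list(self, prefix):
        return [k for k in self._objects if k.startswith(prefix)]

def rename_prefix(store, old_prefix, new_prefix):
    """'Rename' a folder: copy every object under it, then delete the originals."""
    for key in store.list(old_prefix):
        store.copy(key, new_prefix + key[len(old_prefix):])
        store.delete(key)
```

Note that this "rename" is neither atomic nor free: it is O(bytes copied), and a reader can observe the intermediate state, which is exactly why S3-native designs avoid renames altogether.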
So why is everyone moving there? Because S3 offers three things no block store can:
1. Effectively Infinite Throughput: You don't scale S3 by buying a bigger disk; you scale by hitting it with more concurrent requests spread across more key prefixes.
2. Decoupled Storage: Your data lives independently of your compute. If your database engine crashes, you don't need to "recover" the disk. You just start a new container and point it at the bucket.
3. The 11 Nines: S3 is designed for 99.999999999% durability, with objects stored redundantly across multiple Availability Zones. To lose data, AWS would have to lose multiple data centers simultaneously.
The Architecture Shift: Compute-Storage Separation
In a traditional database (think standard Postgres or MySQL), storage and compute are married. If you need more storage, you often end up paying for more compute. If you need more compute, you’re stuck with the disk attached to that instance.
Modern "S3-native" systems—like Snowflake, ClickHouse (with S3 disks), or Neon—flip this. They use S3 as the "Source of Truth" and treat local NVMe drives merely as a volatile cache.
Consider how a modern engine queries a massive Parquet file. It doesn't download the whole 10GB file. It uses HTTP Range Requests to pluck exactly the bytes it needs.
Example: The Power of Range Requests
If you’re using Python and duckdb, you can query files directly on S3 without ever "downloading" them in the traditional sense. DuckDB's engine understands the Parquet footer and only fetches the metadata it needs.
```python
import duckdb

# Configure S3 access
con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")

# Query a 100GB dataset as if it's a local table.
# DuckDB will only perform range requests for the
# specific columns/rows needed.
query = """
SELECT
    user_id,
    COUNT(*)
FROM read_parquet('s3://my-big-data-bucket/events/*.parquet')
WHERE event_type = 'purchase'
GROUP BY 1
LIMIT 10
"""
res = con.execute(query).fetchall()
print(res)
```

Behind the scenes, the database engine sends an HTTP header like Range: bytes=500-1000. Reading a Parquet footer becomes one small request instead of a multi-gigabyte download, which is what makes interactive queries over petabytes of data possible.
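The mechanism is plain HTTP, so it can be demonstrated without S3 at all. The sketch below stands up a tiny local server that honors the same Range header S3 does, serves a 10 KB "object", and fetches only bytes 500-1000 of it:

```python
# Sketch: what a ranged read looks like on the wire. The local server
# here is a stand-in for S3; both answer a Range header with a
# 206 Partial Content response carrying only the requested bytes.
import http.server
import threading
import urllib.request

BLOB = bytes(range(256)) * 40  # a 10,240-byte "object"

class RangeHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        start, end = self.headers["Range"].removeprefix("bytes=").split("-")
        chunk = BLOB[int(start):int(end) + 1]  # Range bounds are inclusive
        self.send_response(206)  # Partial Content
        self.send_header("Content-Range", f"bytes {start}-{end}/{len(BLOB)}")
        self.send_header("Content-Length", str(len(chunk)))
        self.end_headers()
        self.wfile.write(chunk)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_address[1]}/blob",
    headers={"Range": "bytes=500-1000"},
)
with urllib.request.urlopen(req) as resp:
    status, body = resp.status, resp.read()
server.shutdown()

print(status, len(body))  # 206 Partial Content, 501 bytes on the wire
```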
Solving the Immutability Problem
The biggest hurdle to using S3 as a filesystem is that objects are immutable. You can't just UPDATE a row in the middle of a file.
The industry solved this by adopting the LSM-Tree (Log-Structured Merge-tree) pattern for almost everything. Instead of modifying files, we write new ones. Periodically, a background process "compacts" these files, merging them and deleting the obsolete versions.
This sounds inefficient until you realize that S3’s throughput is so high that background compaction is essentially "free" compared to the cost of managing complex distributed locking on a traditional block store.
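The core of the pattern fits in a few lines. In this minimal sketch each flushed "segment" is a plain Python dict standing in for an immutable S3 object (an SSTable or Parquet file in a real engine); reads check newest-first, and compaction merges everything into one fresh segment:

```python
# Sketch: the LSM pattern over immutable objects. Writes go to a
# memtable; flush() "uploads" it as a new immutable segment; get()
# searches segments newest-first; compact() merges segments so that
# obsolete versions can be deleted.

class LsmStore:
    def __init__(self):
        self.memtable = {}
        self.segments = []  # oldest first; each segment is never modified

    def put(self, key, value):
        self.memtable[key] = value

    def flush(self):
        # Write a brand-new sorted segment instead of modifying any file
        self.segments.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for seg in reversed(self.segments):  # newest segment wins
            if key in seg:
                return seg[key]
        return None

    def compact(self):
        # Background merge: later (newer) segments override earlier ones
        merged = {}
        for seg in self.segments:
            merged.update(seg)
        self.segments = [merged]
```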
Handling Large Writes with Multipart Uploads
When your database needs to flush a large memory buffer to S3, you can't just open a stream and hope the network stays up for 5 minutes. You use Multipart Uploads. This allows you to upload chunks in parallel and "commit" the file only when every piece has arrived.
Here is a simplified look at how you might implement a robust S3 writer in Go, ensuring that your "filesystem" write is atomic:
```go
package main

import (
	"context"
	"os"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func uploadToS3(bucket, key string, file *os.File) error {
	ctx := context.TODO()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := s3.NewFromConfig(cfg)

	// The Uploader handles concurrent part uploads automatically
	uploader := manager.NewUploader(client, func(u *manager.Uploader) {
		u.PartSize = 64 * 1024 * 1024 // 64MB chunks
		u.Concurrency = 10            // Upload 10 parts at once
	})

	_, err = uploader.Upload(ctx, &s3.PutObjectInput{
		Bucket: &bucket,
		Key:    &key,
		Body:   file,
	})
	return err
}
```

This pattern transforms a sequential write into a parallel network operation, often saturating a 10Gbps or 25Gbps network interface—something a single EBS volume struggles to do without significant (and expensive) IOPS provisioning.
The "Stateful" Problem is Gone
If you’ve ever managed a Kubernetes cluster, you know the pain of PersistentVolumeClaims. You have a pod, it has data, the pod dies, and now the scheduler has to find a node in the same Availability Zone (AZ) because EBS volumes are AZ-locked.
When S3 is your filesystem, your nodes become stateless.
I recently worked on a project migrating a large-scale search index from local Lucene files to an S3-backed implementation. The complexity of our deployment pipeline dropped by 70%. We stopped worrying about "draining nodes" or "volume attachment limits." We just killed the old pods and started new ones. The new pods would "warm" their local NVMe cache from S3 on demand.
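The warm-on-demand behavior is just a read-through cache with S3 as the source of truth. A minimal sketch, where fetch_from_s3 is a hypothetical stand-in for a real GET call; a fresh node starts with an empty cache directory and fills it lazily as queries arrive:

```python
# Sketch: a read-through cache where local disk is disposable and S3
# is authoritative. fetch_from_s3 is a hypothetical callable standing
# in for a real S3 GET; losing cache_dir loses nothing but warmth.
import os
import tempfile

def make_reader(cache_dir, fetch_from_s3):
    def read(key):
        local = os.path.join(cache_dir, key.replace("/", "_"))
        if not os.path.exists(local):          # cache miss: warm from S3
            with open(local, "wb") as f:
                f.write(fetch_from_s3(key))
        with open(local, "rb") as f:           # cache hit: local NVMe speed
            return f.read()
    return read
```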
The Latency Gotcha
I would be lying if I said S3 was a perfect replacement for a local disk in every scenario. The "Time to First Byte" for S3 is typically between 30ms and 100ms. If your application does thousands of tiny random reads, S3 will feel like it's stuck in the mud.
The trick is aggressive prefetching and batching.
In a standard filesystem, you might read 4KB pages. In an S3-native system, you read 8MB or 16MB chunks. You treat the high latency as a fixed cost and maximize the payload to amortize that cost.
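The arithmetic makes the point. Assuming, hypothetically, 50ms of per-request latency and 100 MB/s of streaming bandwidth per request, effective throughput for a 4KB read versus a 16MB read differs by roughly three orders of magnitude:

```python
# Sketch: amortizing fixed per-request latency with larger reads.
# The 50ms TTFB and 100 MB/s per-request bandwidth are illustrative
# assumptions, not measured S3 figures.
TTFB_S = 0.050     # fixed cost paid on every request
BANDWIDTH = 100e6  # bytes/sec once the stream is flowing

def effective_throughput(read_size):
    return read_size / (TTFB_S + read_size / BANDWIDTH)

for size in (4 * 1024, 8 * 1024**2, 16 * 1024**2):
    mbps = effective_throughput(size) / 1e6
    print(f"{size / 1024:>8.0f} KB reads -> {mbps:6.1f} MB/s effective")
```

Under these assumptions the 4KB read is latency-bound at well under 1 MB/s, while the 16MB read recovers most of the available bandwidth.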
If you're building a system today, you should probably be using a library like fsspec (in Python) or object_store (in Rust). These libraries provide an abstraction layer that makes S3 look like a filesystem while staying optimized for the high-latency, high-bandwidth reality of object storage.
```rust
// Using the 'object_store' crate in Rust to treat S3 like a filesystem
use object_store::{aws::AmazonS3Builder, path::Path, ObjectStore};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let s3 = AmazonS3Builder::from_env()
        .with_bucket_name("my-data-lake")
        .build()?;

    let path = Path::from("data/inventory.parquet");

    // get() returns a streaming result; the crate handles the
    // underlying ranged requests and async buffering for you.
    let result = s3.get(&path).await?;
    let bytes = result.bytes().await?;

    println!("Read {} bytes from S3", bytes.len());
    Ok(())
}
```

The "Infinite" WAL: Why Kafka Is Next
We're even seeing this shift in the streaming world. Traditionally, Kafka stores data on local disks. If a broker fails, you have to replicate terabytes of data across the network to a new broker to get back to a healthy state.
Newer players like WarpStream or Confluent (with tiered storage) are moving the actual log segments to S3. The broker becomes a thin proxy. This effectively gives you "infinite" retention without having to manage a literal mountain of hard drives.
S3-Native is a Competitive Advantage
If you are building a data-intensive application in 2024, and you are still thinking in terms of "attaching disks to instances," you are building a legacy system.
By embracing S3 as your primary filesystem, you get:
* Zero-copy cloning: Just copy the metadata pointing to S3 objects.
* Instant Scalability: Go from 1 to 1000 nodes without re-sharding data.
* Cost Efficiency: S3 Standard storage is ~$0.023/GB-month. High-performance SSD block storage can be 4-5x that when you factor in provisioned IOPS.
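Zero-copy cloning in particular falls out almost for free once a "table" is just a manifest of immutable object keys. A minimal sketch (the Table class and key names are illustrative, not any particular engine's format):

```python
# Sketch: zero-copy cloning when a table is a manifest of immutable
# S3 object keys. clone() copies the tiny manifest, never the data;
# both tables share the same objects until one appends a new segment.

class Table:
    def __init__(self, name, manifest):
        self.name = name
        self.manifest = manifest  # list of S3 keys, e.g. "data/part-0.parquet"

    def clone(self, new_name):
        # Metadata-only copy: no bytes move in S3
        return Table(new_name, list(self.manifest))

    def append_segment(self, key):
        # Copy-on-write: the clone's manifest diverges, the shared
        # immutable segments are untouched
        self.manifest = self.manifest + [key]
```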
S3 is the most successful distributed system in history. It has survived outages that leveled entire regions. It has handled the growth of the internet from a few petabytes to exabytes. It is time we stopped treating it as a "backup target" and started treating it as the foundation of our architecture.
Stop worrying about your block device. Write to the bucket. It's the only filesystem that actually matters.


