
Parquet Is the New JSON
The era of the row-based client-side database is over; it's time to trade your bloated API responses for vectorized, columnar transfers in the browser.
Open your browser’s DevTools on a data-heavy dashboard and look at the Network tab. You’ll likely see a massive data.json fetch—maybe 15MB, maybe 50MB—followed by a significant CPU spike where the main thread hangs for 800ms while JSON.parse() struggles to breathe. We’ve accepted this as the cost of doing business on the web, but we’re essentially trying to transport a skyscraper's worth of steel by turning it into dust and blowing it through a straw.
JSON is a row-based, text-heavy format that forces the browser to scan every single character, escape sequence, and curly brace before it can do anything useful. If you have 100,000 rows and you only need to calculate the average of one column, JSON.parse() still forces you to realize every single string and nested object into memory. It’s inefficient, it’s slow, and in the era of WebAssembly (Wasm), it’s becoming obsolete for analytical workloads.
The JSON Tax on Memory and CPU
When you fetch a JSON blob, the browser does three expensive things:
1. Decompression: Usually un-Gzipping the text.
2. Parsing: Turning string characters into a JavaScript Object tree. This is a blocking operation.
3. Materialization: Allocating memory for every key-value pair. Because JSON is row-based, you end up with a massive array of objects, each carrying the overhead of key names ("id", "timestamp", "value") repeated thousands of times.
Parquet changes this by being columnar. If you only need the "price" column, a Parquet-aware client only reads the bytes associated with that column. Everything else stays on the server or remains unparsed.
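You can see the key-repetition overhead with a back-of-the-envelope experiment in plain JS — serializing the same 1,000 rows row-wise versus column-wise (a toy illustration, not a real Parquet encoding):

```javascript
// Row-based (the JSON way): key names repeat once per row
const rows = Array.from({ length: 1000 }, (_, i) => ({
  id: i,
  timestamp: 1700000000 + i,
  value: i * 0.5,
}));
const rowJson = JSON.stringify(rows);

// Columnar equivalent: each key name appears exactly once
const colJson = JSON.stringify({
  id: rows.map(r => r.id),
  timestamp: rows.map(r => r.timestamp),
  value: rows.map(r => r.value),
});

console.log(rowJson.length > colJson.length); // true
```

And this toy version doesn't even account for Parquet's real wins: binary encoding, dictionary compression, and the ability to skip columns entirely.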
Enter DuckDB-Wasm and the Vectorized Client
To use Parquet in the browser, we need more than just a parser; we need an execution engine. DuckDB-Wasm has emerged as the definitive tool for this. It brings the full power of an analytical SQL database into the browser as a Wasm binary.
Instead of writing complex filter and reduce logic in JavaScript—which is notoriously slow for large arrays—you hand the data processing to a vectorized engine written in C++ that runs at near-native speed.
Here is how you initialize a DuckDB instance in the browser and prepare it to handle Parquet files:
import * as duckdb from '@duckdb/duckdb-wasm';
const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
// Select a bundle based on browser capability (e.g., threading support)
const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);
const worker = new Worker(bundle.mainWorker);
const logger = new duckdb.ConsoleLogger();
const db = new duckdb.AsyncDuckDB(logger, worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
const conn = await db.connect();

With this setup, the browser is no longer just a rendering engine; it's a data warehouse.
The Magic of Remote Parquet Queries
The real "JSON killer" feature isn't just that Parquet is smaller (though it usually is, thanks to Snappy or Zstd compression). The killer feature is HTTP Range Requests.
If you host a 500MB Parquet file on S3 or a static file server, DuckDB-Wasm doesn't have to download the whole thing. It can send Range headers to fetch only the metadata (the "footer" of the Parquet file) to understand where specific columns and row groups live. Then, it only fetches the specific bytes needed for your query.
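To make that concrete, here's a rough sketch of the first request such an engine makes — not DuckDB's actual implementation, just the shape of it. A Parquet file ends with its footer metadata followed by a 4-byte footer length and the "PAR1" magic bytes, so the engine starts by asking for the tail of the file:

```javascript
// Sketch: requesting just the tail of a remote Parquet file.
// The last 8 bytes are a 4-byte footer length + the "PAR1" magic,
// so an engine's first Range request targets the end of the file.
const fileSize = 500 * 1024 * 1024; // assume a 500MB file (from Content-Length)
const tailBytes = 64 * 1024;        // grab the last 64KB; the footer usually fits
const rangeHeader = `bytes=${fileSize - tailBytes}-${fileSize - 1}`;

// fetch(url, { headers: { Range: rangeHeader } }) would return only that slice;
// the footer tells the engine where every column chunk and row group lives,
// so subsequent Range requests fetch only the bytes the query touches.
```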
Look at this query:
// Querying a remote Parquet file without downloading the whole thing
await conn.query(`
  CREATE VIEW sales_data AS
  SELECT * FROM read_parquet('https://my-bucket.s3.amazonaws.com/large_dataset.parquet');
`);

const result = await conn.query(`
  SELECT
    region,
    SUM(revenue) AS total_rev
  FROM sales_data
  WHERE date > '2023-01-01'
  GROUP BY region
`);

console.table(result.toArray());

In a traditional JSON flow, that large_dataset would have crashed the tab. With Parquet and DuckDB-Wasm, the browser might only download a few hundred kilobytes of data to answer the query, even if the source file is gigabytes in size.
Why Columnar Layout Matters for the Browser
Standard JS collections are an "Array of Objects" — what systems programmers call Array of Structs (AoS). Columnar formats invert this into a Struct of Arrays (SoA).
// The JSON way (Array of Objects)
[
  { "id": 1, "type": "click", "val": 10.5 },
  { "id": 2, "type": "view", "val": 2.1 }
]

To calculate the sum of val, the CPU has to jump across memory addresses, grabbing an entire object, extracting one property, then jumping to the next object. This wreaks havoc on the CPU cache.
Parquet stores data like this:
- id: [1, 2]
- type: ["click", "view"]
- val: [10.5, 2.1]
When the Wasm engine calculates the sum of val, it loads a contiguous block of memory into the CPU cache. This is called vectorized execution. Modern CPUs love contiguous memory; they can use SIMD (Single Instruction, Multiple Data) to process multiple values in a single clock cycle. This is why DuckDB can process millions of rows in milliseconds while your React state update is still struggling to iterate through a medium-sized array.
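You can mimic the two layouts in plain JS — a sketch of the access patterns, not of DuckDB's actual internals. The columnar version walks a single contiguous `Float64Array`, which is exactly the shape SIMD-friendly engines exploit:

```javascript
// Array of Objects: each value lives inside a separate heap object
const rows = [
  { id: 1, type: "click", val: 10.5 },
  { id: 2, type: "view", val: 2.1 },
];
let aosSum = 0;
for (const row of rows) aosSum += row.val; // pointer-chase per row

// Struct of Arrays: the "val" column is one contiguous buffer
const val = new Float64Array([10.5, 2.1]);
let soaSum = 0;
for (let i = 0; i < val.length; i++) soaSum += val[i]; // linear scan

// Same answer either way — but the columnar loop touches one
// contiguous block of memory, which is what lets a Wasm engine
// apply SIMD instructions across many values at once.
```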
Practical Example: Building a "Zero-Backend" Analytics Dashboard
Let's say you're building a dashboard for log analysis. Instead of building a complex Express/Node.js API with multiple endpoints for filtering and grouping, you can simply dump your logs into Parquet files daily.
Your frontend code becomes a clean SQL-based interface:
async function getMetricSummary(columnName, filterValue) {
  // NB: interpolating values straight into SQL is fine for a demo,
  // but use prepared statements (conn.prepare) for untrusted input.
  const query = `
    SELECT
      ${columnName},
      COUNT(*) AS count,
      AVG(response_time) AS latency
    FROM 'logs_2023_oct.parquet'
    WHERE status_code = ${filterValue}
    GROUP BY 1
    ORDER BY count DESC
    LIMIT 10
  `;
  // DuckDB returns Apache Arrow tables
  const arrowTable = await conn.query(query);
  // Convert to standard JS objects for UI components
  return arrowTable.toArray().map(row => row.toJSON());
}

By using Apache Arrow as the transport format between the Wasm engine and the JavaScript main thread, you avoid the serialization overhead entirely. Arrow is a zero-copy memory format. When DuckDB finishes a query, the resulting memory can often be shared or mapped directly into the JS environment without a "conversion" step.
The Trade-offs: When JSON Still Wins
I’m being opinionated here, but I’m not delusional. Parquet isn't a silver bullet for every use case.
1. Write Complexity: Generating Parquet files is harder than JSON.stringify(). You need a library like pyarrow or fastparquet in Python, or a dedicated writer in Node.js/Rust. It’s a build-step or a backend-process requirement.
2. Schema Rigidity: Parquet requires a schema. You can’t just throw random fields into a row like you can with JSON. For highly polymorphic data, JSON (or a document store) is still king.
3. Initial Load: DuckDB-Wasm is a multi-megabyte binary. If your app only needs to display a list of 10 items, the overhead of downloading the Wasm engine is a massive regression in performance. Parquet is for *data-intensive* applications, not simple CRUD forms.
Optimization Strategy: Predicate Pushdown and Sorting
To get the most out of Parquet, the way you *save* the file matters. Parquet files contain metadata about the minimum and maximum values in each "row group" (a chunk of rows).
If you sort your Parquet file by timestamp before uploading it to your server, DuckDB can perform predicate pushdown. If you run a query for WHERE timestamp > '2023-10-01', the engine looks at the metadata, sees that a specific row group only contains data from June, and skips downloading those bytes entirely.
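You can model that skip decision in a few lines of JS — a toy zone map with made-up byte offsets, standing in for the real min/max statistics Parquet stores per row group:

```javascript
// Toy zone map: per-row-group min/max for a sorted timestamp column.
// Offsets and sizes are invented for illustration.
const rowGroups = [
  { offset: 0,        bytes: 4_000_000, min: "2023-06-01", max: "2023-06-30" },
  { offset: 4_000_000, bytes: 4_000_000, min: "2023-10-01", max: "2023-10-31" },
];

const predicate = "2023-10-01"; // WHERE timestamp > '2023-10-01'

// A row group can only match if its max exceeds the predicate;
// ISO dates compare correctly as plain strings.
const needed = rowGroups.filter(g => g.max > predicate);

console.log(needed.length); // 1 — the June group is never downloaded
```

Because the file is sorted, the min/max ranges don't overlap, so whole row groups fall cleanly on one side of the predicate. On an unsorted file, nearly every group's range would straddle the cutoff and nothing could be skipped.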
Here is a quick look at how you might prepare such a file in Python (the current standard for Parquet creation) before serving it to your JS client:
import pandas as pd

# Load your messy data
df = pd.read_csv("raw_logs.csv")

# Sort by the column you query most often to enable efficient skipping
df = df.sort_values("timestamp")

# Save as Parquet with snappy compression (pyarrow's default)
df.to_parquet(
    "optimized_logs.parquet",
    engine="pyarrow",
    index=False,
    row_group_size=100000,  # larger groups allow better compression and skipping
)

Bridging the Gap: The Arrow Link
One thing developers often miss is the relationship between Parquet and Apache Arrow. Think of Parquet as the long-term storage format (optimized for disk/bandwidth) and Arrow as the in-memory format (optimized for the CPU).
DuckDB-Wasm reads Parquet and produces Arrow. This is why the performance is so startling. You aren't converting data into JavaScript objects until the very last second—the "visualization" layer. If you use a library like Arquero or Perspective.js for your UI, they can consume the Arrow buffers directly.
import * as arrow from 'apache-arrow';

// If you have an Arrow table from DuckDB
const table = await conn.query(`SELECT * FROM data`);

// Accessing data is nearly instant because it's just a view on a TypedArray
const firstRowValue = table.getChild('price').get(0);

The Architecture Shift
We are moving toward a "Fat Client" architecture that actually makes sense. In the past, "Fat Client" meant loading 500 React components. Now, it means moving the query engine to the edge.
By using Parquet, you reduce your backend to a simple static file server (or an S3 bucket). You don't need a complex API layer to handle filtering, pagination, and sorting. You send the "database" (the Parquet file) to the user, and the browser handles the rest. This drastically reduces server costs and provides a snappier, more interactive experience for the user.
The browser is no longer a terminal; it’s a node in your distributed data system. If you're still pushing massive JSON arrays over the wire, you're building for the web of 2015. It's time to embrace the column.
