loke.dev

16 Bits Is All You Need: Scaling On-Device AI with JavaScript’s New Float16Array

JavaScript finally has native half-precision floats, halving the memory footprint of local LLMs and WebGPU compute shaders.

· 4 min read


Your web browser has been wasting half of its memory on AI models for years, and until very recently there was almost nothing you could do about it. When we run Large Language Models (LLMs) or Stable Diffusion locally in the browser, we aren't just fighting for CPU cycles; we are fighting for every single megabyte of RAM.

For a long time, JavaScript developers were stuck between a rock and a hard place: use Float32Array and watch the browser tab crash on a 16GB machine, or use Uint16Array and write a mountain of "bit-shifting" boilerplate to fake half-precision math.
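That bit-shifting boilerplate looked roughly like this. This is a hand-rolled sketch of packing a number into IEEE 754 half-precision bits inside a Uint16Array (it truncates the mantissa rather than rounding to nearest, and skips NaN handling for brevity), not a production converter:

```javascript
// The old workaround: manually pack a Number into FP16 bits in a Uint16Array.
function toHalfBits(value) {
  const f32 = new Float32Array(1);
  const u32 = new Uint32Array(f32.buffer);
  f32[0] = value;
  const bits = u32[0];
  const sign = (bits >>> 16) & 0x8000;
  const exp = ((bits >>> 23) & 0xff) - 127 + 15; // rebias exponent: 8 -> 5 bits
  const mantissa = (bits >>> 13) & 0x3ff;        // truncate mantissa: 23 -> 10 bits
  if (exp <= 0) return sign;                     // flush tiny values to zero
  if (exp >= 31) return sign | 0x7c00;           // overflow to Infinity
  return sign | (exp << 10) | mantissa;
}

function fromHalfBits(h) {
  const sign = (h & 0x8000) ? -1 : 1;
  const exp = (h >>> 10) & 0x1f;
  const mantissa = h & 0x3ff;
  if (exp === 0) return sign * mantissa * 2 ** -24;        // subnormal
  if (exp === 31) return mantissa ? NaN : sign * Infinity; // Inf / NaN
  return sign * (1 + mantissa / 1024) * 2 ** (exp - 15);
}

const packed = new Uint16Array([toHalfBits(3.14)]);
console.log(fromHalfBits(packed[0])); // 3.138671875 (truncated, not rounded)
```

Every read and write paid this conversion cost in JavaScript, which is exactly the overhead the native type eliminates.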

With the arrival of the native Float16Array, that trade-off is dead. We finally have a way to handle massive tensors and GPU buffers without the 50% "memory tax."

Why 16 bits is the "Goldilocks" zone

In the world of AI, precision is expensive. A standard Float32 (Single Precision) uses 4 bytes. A Float64 (Double Precision) uses 8. If you're loading a model with 7 billion parameters—even a small one by today's standards—doing the math in Float32 requires about 28GB of VRAM/RAM just for the weights.

Most LLMs are trained or fine-tuned in FP16 (Half Precision) or BF16 (Bfloat16). They don't actually need the extreme decimal precision of a 32-bit float. By using Float16Array, we can store those same weights in 2 bytes per element.

Suddenly, that 28GB model fits into 14GB. That’s the difference between "This app only runs on a Mac Studio" and "This app runs on a MacBook Air."
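As a sanity check on that arithmetic (a hypothetical 7-billion-parameter model, decimal gigabytes):

```javascript
// Weight storage cost at each precision for a 7-billion-parameter model
const PARAMS = 7e9;
const gbFor = (bytesPerParam) => (PARAMS * bytesPerParam) / 1e9;

console.log(gbFor(8)); // 56 GB in Float64 (nobody ships weights like this)
console.log(gbFor(4)); // 28 GB in Float32
console.log(gbFor(2)); // 14 GB in Float16
```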

Meet the Float16Array

The syntax is exactly what you’d expect if you’ve ever used a TypedArray. It behaves like a Float32Array, but under the hood, it follows the IEEE 754 half-precision binary format (1 sign bit, 5 bits for the exponent, and 10 bits for the fraction).

// Creating a Float16Array is now native!
const weights = new Float16Array([0.1, 0.2, 0.5, -1.5, 65504]);

console.log(weights.length); // 5
console.log(weights.byteLength); // 10 (2 bytes per element)

// You can also use it with DataView
const buffer = new ArrayBuffer(16);
const view = new DataView(buffer);
view.setFloat16(0, 3.14, true); // little-endian
console.log(view.getFloat16(0, true)); // 3.140625 (the precision loss is real!)
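The same proposal also adds Math.f16round, a half-precision analogue of Math.fround, for rounding a single value without allocating an array. The snippet is guarded so it is a no-op on runtimes that haven't shipped the proposal yet:

```javascript
// Math.f16round returns the nearest FP16-representable value as a Number.
if (typeof Math.f16round === "function") {
  console.log(Math.f16round(3.14));  // 3.140625 (round-to-nearest-even)
  console.log(Math.f16round(65504)); // 65504, the largest finite FP16 value
  console.log(Math.f16round(65520)); // Infinity: overflows the 5-bit exponent
}
```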

The Catch: Precision Loss is Real

You might have noticed that 3.14 became 3.140625 in the example above. That's not a bug; it's the nature of 16-bit floats: the value gets rounded to the nearest number a 10-bit mantissa can represent.

While Float32 can represent numbers up to roughly $3.4 \times 10^{38}$ with 7 decimal digits of precision, Float16 caps out at 65,504 and only gives you about 3.3 decimal digits.

When should you use it?
- Storing Weights: Perfect. Neural networks are surprisingly resilient to this slight loss of precision.
- WGSL/GPU buffers: Essential for performance; halving the buffer size halves the upload bandwidth.

When should you avoid it?
- Accumulating Sums: If you are adding thousands of small numbers together (like in a dot product), the rounding errors in Float16 will snowball quickly. Do your heavy accumulation in a plain Number (which is a 64-bit float in JavaScript), then cast back to Float16 at the end.

// A dangerous pattern: accumulating in FP16.
// Reading an element gives you a plain Number, but every store back
// into the array rounds the running total down to 16 bits.
const acc = new Float16Array(1);
for (let i = 0; i < 1000; i++) {
    acc[0] += 0.0001; // a fresh rounding error on every iteration
}
console.log(acc[0]); // drifts noticeably away from the true sum, 0.1

// The better way: accumulate in full precision, round once at the end
let sum = 0;
for (let i = 0; i < 1000; i++) {
    sum += 0.0001;
}
const finalResult = new Float16Array([sum])[0];

Supercharging WebGPU

The real magic happens when you pair Float16Array with WebGPU. Previously, if you wanted to upload FP16 data to a GPU shader, you had to wrap it in a Uint16Array and do some truly disgusting bit-manipulation inside your WGSL shader to "unpack" it.

Now, with the shader-f16 extension in WebGPU, the pipeline is seamless.

// Check if the browser supports 16-bit floats in shaders
const adapter = await navigator.gpu?.requestAdapter();

if (adapter?.features.has("shader-f16")) {
    const device = await adapter.requestDevice({
        requiredFeatures: ["shader-f16"],
    });

    // e.g. weights decoded straight from a model file on disk
    const myModelWeights = new Float16Array(1024 * 1024);

    // Now you can pass Float16Array data directly to your GPU buffers
    const weightBuffer = device.createBuffer({
        size: myModelWeights.byteLength, // keep this a multiple of 4
        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
    });

    device.queue.writeBuffer(weightBuffer, 0, myModelWeights);
}
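On the WGSL side, shaders must opt in with an `enable f16;` directive before the `f16` type exists at all. Here is a minimal compute-shader sketch; the binding layout and the trivial doubling kernel are illustrative assumptions, not a real model's pipeline:

```javascript
// WGSL source using the f16 type. Creating a shader module from this
// only works on a device created with the "shader-f16" feature.
const shaderCode = /* wgsl */ `
  enable f16;

  @group(0) @binding(0) var<storage, read> weights : array<f16>;
  @group(0) @binding(1) var<storage, read_write> output : array<f16>;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id : vec3<u32>) {
    // f16 arithmetic runs natively on hardware that supports it
    output[id.x] = weights[id.x] * 2.0h; // the 'h' suffix marks an f16 literal
  }
`;
```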

By keeping the data in 16-bit form from disk to GPU, you reduce pressure on the PCIe bus (or the unified memory bus on Apple Silicon). You're moving half the bytes, which can approach a 2x speedup for operations that are bound by data transfer rather than compute.

Can you use it today?

The Float16Array is part of the ES2025 specification. The browser support is actually quite good now:
- Chrome/Edge: Version 135+
- Firefox: Version 129+
- Safari: Version 18+

If you need to support older browsers, there is a float16 polyfill on npm that mimics the API, but every element access runs through JavaScript conversion code instead of a native load. It's mostly there to keep your code from throwing a ReferenceError.
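In practice you'll want to feature-detect before committing to either path. A minimal sketch (the Uint16Array fallback here just reserves raw FP16 bit patterns; you'd still need conversion code or the polyfill to read them as numbers):

```javascript
// Prefer the native type; fall back to raw 16-bit storage otherwise.
const hasNativeF16 = typeof Float16Array === "function";

function makeF16(length) {
  if (hasNativeF16) return new Float16Array(length);
  // Fallback: same 2-bytes-per-element footprint, but the values are
  // opaque bit patterns until converted in JavaScript.
  return new Uint16Array(length);
}

console.log(hasNativeF16 ? "native FP16" : "polyfill territory");
```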

The "So What?"

If you're building a simple Todo app, you will never care about Float16Array. But if you're part of the new wave of "Local AI" developers trying to squeeze a Llama-3 or a Whisper model into a browser tab, this is a landmark update.

We are moving away from a world where the browser was just a thin client for a heavy Python backend. Native half-precision floats are one of the last few bricks needed to build a professional-grade AI inference engine entirely in JavaScript.

Go forth and save some RAM. Your users' fans will thank you.