loke.dev

The Day I Finally Parallelized My Pixel Loops: How WASM SIMD Scaled My Image Filters to 60fps

Stop treating your pixel arrays as a sequence of numbers and start treating them as parallel vectors to unlock the true potential of your users' hardware.

· 8 min read

There is a specific kind of frustration that comes from watching a browser tab choke on a simple grayscale filter. I was staring at a 4K frame buffer, my laptop fan was beginning its familiar ascent into jet-engine territory, and the "60fps" dream was looking more like a slideshow.

Even with WebAssembly (WASM), my pixel loops weren't cutting it. I had moved my logic from JavaScript to Rust, expecting a miracle, but I was still iterating through millions of integers one by one. I was treating the CPU like a single-file line at a grocery store when I had a sixteen-lane highway sitting right there in the silicon.

That highway is SIMD (Single Instruction, Multiple Data). It’s the difference between telling a worker to "pick up this one box" 100 times and telling them to "pick up these 16 boxes at once." Once I finally sat down to refactor my WASM kernels to use 128-bit vectors, the performance didn't just improve; it transcended the overhead of the browser environment entirely.

The Scalar Wall

Most of us are taught to write "scalar" code. We write a loop, we grab a pixel, we manipulate the Red, Green, and Blue channels, and we move to the next index. In Rust, a standard brightness filter might look like this:

pub fn apply_brightness(data: &mut [u8], adjustment: i16) {
    for pixel in data.iter_mut() {
        let val = *pixel as i16 + adjustment;
        *pixel = val.clamp(0, 255) as u8;
    }
}

This looks efficient. It’s Rust. It’s compiled to WASM. But under the hood, the CPU is performing a load, an add, a clamp, and a store for every single byte in that array. A 1080p image is about 2 million pixels, which means over 6 million iterations once you touch each R, G, and B byte (ignoring alpha for a moment). At 60 frames per second, your CPU is working overtime just to keep up with the loop overhead and branch prediction.

Enter the Vector: What is SIMD?

SIMD allows the CPU to process multiple pieces of data with a single instruction. In the context of WebAssembly, this means using the v128 type. A v128 is a 128-bit wide register.

Think about what fits into 128 bits:
- 16 u8 values (8-bit integers)
- 8 u16 values (16-bit integers)
- 4 f32 values (32-bit floats)

When we use SIMD, we don't say "add 10 to this byte." We say "load these 16 bytes into this 128-bit register and add 10 to all of them simultaneously."

Enabling the Power

Before you can even use these instructions, you have to tell your compiler and your environment that you’re playing with fire. In Rust, this means building for the wasm32-unknown-unknown target with the simd128 feature enabled.

In your .cargo/config.toml:

[target.wasm32-unknown-unknown]
rustflags = ["-C", "target-feature=+simd128"]

And in your code, you'll be reaching for std::arch::wasm32.

Refactoring for Parallelism

Let’s look at how we actually implement that brightness filter using SIMD. The logic changes from "do this to one" to "do this to a chunk."

use std::arch::wasm32::*;

pub fn apply_brightness_simd(data: &mut [u8], adjustment: i16) {
    // Saturating u8 math can't represent a negative adjustment directly,
    // so we splat the magnitude and choose between add and subtract.
    let amount = adjustment.unsigned_abs().min(255) as u8;
    let adj_vec = u8x16_splat(amount);

    let mut chunks = data.chunks_exact_mut(16);
    for chunk in &mut chunks {
        unsafe {
            // 1. Load 16 bytes from the chunk into a 128-bit register
            let pixels = v128_load(chunk.as_ptr() as *const v128);

            // 2. Adjust all 16 lanes at once.
            // Saturating (sat) arithmetic clamps at 0 and 255 instead of wrapping.
            let result = if adjustment >= 0 {
                u8x16_add_sat(pixels, adj_vec)
            } else {
                u8x16_sub_sat(pixels, adj_vec)
            };

            // 3. Store the result back into memory
            v128_store(chunk.as_mut_ptr() as *mut v128, result);
        }
    }

    // Don't forget the leftover pixels if the length isn't a multiple of 16!
    for pixel in chunks.into_remainder() {
        let val = *pixel as i16 + adjustment;
        *pixel = val.clamp(0, 255) as u8;
    }
}

Why This is Significantly Faster

In the scalar version, the CPU executes the loop logic (incrementing the pointer, checking the bounds) for every single byte. In the SIMD version, it runs that bookkeeping once per 16 bytes instead of once per byte.

But it's more than just loop unrolling. SIMD instructions are implemented at the hardware level. The silicon is literally wired to perform 16 additions in parallel within the same clock cycle it would normally take to do one.

The Complexity Trap: Floating Point Math

Brightness is easy because it’s just addition. What about a Sepia filter? Sepia requires multiplying each color channel by a specific coefficient.

R' = (R * 0.393) + (G * 0.769) + (B * 0.189)
G' = (R * 0.349) + (G * 0.686) + (B * 0.168)
B' = (R * 0.272) + (G * 0.534) + (B * 0.131)

Doing this in SIMD is harder because pixels are u8 (integers), but the coefficients are floats. Converting back and forth between u8x16 and f32x4 inside a loop is expensive.

I found that the real trick to 60fps wasn't just using SIMD; it was staying in the register as long as possible. If you convert your u8 pixels to f32 vectors, you can only fit 4 pixels in a register instead of 16. Suddenly your "16x speedup" becomes a "4x speedup," and you spend all your time on "type promotion" (turning small ints into big floats).

The solution? Fixed-point arithmetic. Instead of multiplying by 0.393, multiply by 101 (roughly 0.393 * 256) and then shift the result right by 8 bits to divide the scaling back out. This lets you stay in the integer domain, using u16x8 or u32x4 vectors, which is significantly faster in WASM SIMD than hopping back and forth to floating point.
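To make the trick concrete, here is a scalar sketch of the red channel in fixed-point (host-runnable Rust, no intrinsics; the function name and structure are illustrative, and the vectorized version applies the same math across integer lanes):

```rust
// Sepia red channel in fixed-point: coefficients scaled by 256 and rounded
// (0.393*256 ≈ 101, 0.769*256 ≈ 197, 0.189*256 ≈ 48).
fn sepia_r(r: u8, g: u8, b: u8) -> u8 {
    // Widen to u32 so the multiply-accumulate can't overflow.
    let sum = r as u32 * 101 + g as u32 * 197 + b as u32 * 48;
    // >> 8 undoes the *256 scaling; min(255) clamps back into a byte.
    (sum >> 8).min(255) as u8
}
```

The widened integer math maps straight onto u32x4 (or u16x8, with care about overflow) lanes without ever leaving the integer domain.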

A Practical Example: The Inversion Filter

Inversion is the simplest SIMD win. It’s just a bitwise NOT or a subtraction from 255.

use std::arch::wasm32::*;

#[target_feature(enable = "simd128")]
pub unsafe fn invert_simd(data: &mut [u8]) {
    let all_ones = u8x16_splat(255);

    let mut chunks = data.chunks_exact_mut(16);
    for chunk in &mut chunks {
        let ptr = chunk.as_mut_ptr() as *mut v128;
        let pixels = v128_load(ptr);

        // XOR with 0xFF flips every bit, which is exactly 255 - x for a byte
        let inverted = v128_xor(pixels, all_ones);

        v128_store(ptr, inverted);
    }

    // Scalar tail for the bytes that don't fill a full 16-byte chunk
    for pixel in chunks.into_remainder() {
        *pixel = !*pixel;
    }
}

If you profile this against a standard JS for loop, the difference is staggering. On a modern M1 or M2 Mac, the JS loop might take 5-8ms for a 4K frame. The WASM SIMD version often finishes in under 0.5ms. This gives you massive "headroom" to do more complex effects, like blurs or edge detection, without dropping frames.

The "Gotchas" of WASM SIMD

It’s not all free performance. There are several hurdles I ran into that can ruin your day:

1. Alignment: WASM memory is just a big linear blob. However, CPUs love it when you load data from addresses that are multiples of the vector size (16 bytes). If your image width isn't a multiple of 16, your "rows" might start at odd offsets. While WASM's v128_load handles unaligned memory, it can be slightly slower on some architectures.
2. The "Tail": As seen in the code examples, you always have to handle the remainder. If your pixel array is 100 bytes long, you can process 6 chunks of 16 (96 bytes), but you have 4 bytes left over. You must process those with a standard scalar loop.
3. Browser Support: SIMD is well-supported in Chrome, Edge, and Firefox now. Safari was the last holdout, but it’s there in recent versions. Still, you should always feature-detect.
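The chunk-plus-tail pattern from gotcha #2 can be sketched without any intrinsics (host-runnable Rust; on wasm32 the chunk body would be the v128 load/xor/store, and the function name here is just for illustration):

```rust
// Process 16-byte chunks, then fall back to a scalar loop for the leftovers.
pub fn invert_chunked(data: &mut [u8]) {
    let mut chunks = data.chunks_exact_mut(16);
    for chunk in &mut chunks {
        // On wasm32 this body is one v128_load, one v128_xor, one v128_store;
        // the byte loop here just keeps the sketch portable.
        for byte in chunk.iter_mut() {
            *byte = !*byte;
        }
    }
    // A 100-byte buffer hits the loop above six times (96 bytes)
    // and leaves the last 4 bytes to this scalar tail.
    for byte in chunks.into_remainder() {
        *byte = !*byte;
    }
}
```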

How to Feature-Detect

You don't want to serve a WASM binary with SIMD instructions to a browser that doesn't understand them. The browser will simply refuse to compile the module.

The standard approach is to use a tiny "probe" module or check the WebAssembly.validate API:

const simdSupported = () => {
  // A tiny WASM module whose only function executes one SIMD instruction
  // (i8x16.splat), so validate() returns true only if SIMD is understood
  const bytes = new Uint8Array([0,97,115,109,1,0,0,0,1,5,1,96,0,0,3,2,1,0,10,10,1,8,0,65,0,253,15,26,11]);
  return WebAssembly.validate(bytes); // synchronous, returns a boolean
};

// Use it to choose which .wasm file to fetch
const wasmUrl = simdSupported() ? 'filters_simd.wasm' : 'filters_scalar.wasm';

The Architecture of Real-Time Video

When I finally got this working for a real-time video stream, the architecture looked like this:

1. RequestAnimationFrame: Trigger the render cycle.
2. Canvas Capture: Grab the frame from a <video> element with drawImage onto a hidden canvas, then pull the pixels out with getImageData.
3. Shared Memory: Pass the Uint8ClampedArray (the pixels) into the WASM linear memory. Pro tip: avoid an extra copy by creating your view over WebAssembly.Memory and writing directly into the buffer WASM is looking at.
4. SIMD Kernel: Run the Rust/WASM function.
5. PutImageData: Push the modified bytes back to the visible canvas.

Even with the overhead of moving data between the DOM and WASM, SIMD makes the actual "work" part of the frame so fast that the bottlenecks shift entirely to the browser’s canvas API.

When NOT to use SIMD

Is it always better? No.

If you are writing a filter that requires looking at many neighboring pixels (like a large-radius Gaussian blur), the complexity of loading multiple overlapping vectors can sometimes outweigh the benefits. SIMD shines when the operation is "Point-wise"—where the result of one pixel doesn't depend on the result of another.

Also, if your image is tiny (like a 32x32 icon), the overhead of setting up the SIMD lanes and handling the remainder might actually be slower than a simple scalar loop.

Final Thoughts

Moving to SIMD changed how I think about data. I stopped seeing a pixel as a struct { r, g, b, a } and started seeing pixels as lanes packed into a 128-bit word.

For the longest time, we've treated WebAssembly as a way to "run C++ on the web." But that’s selling it short. With SIMD, we aren't just running native code; we are running *optimized* native code that leverages the hardware in a way JavaScript never can.

If you’re building image editors, video tools, or even heavy data-visualization libraries in the browser, treat your pixel arrays as vectors. Stop iterating and start streaming. Your users' CPUs (and their cooling fans) will thank you.