Will 'Insertable Streams' Finally Bring True End-to-End Encryption to Your WebRTC Stack?

I remember debugging a "secure" healthcare app a few years back and realizing that while the connection was encrypted, our media server sat right in the middle, staring at every frame of video in plain text. It felt like putting a high-tech deadbolt on your front door but handing the spare key to a random delivery guy.

WebRTC has always been secure "on the wire" thanks to DTLS and SRTP. If you’re doing a 1:1 call, you’re golden. But the moment you scale to a group call using a Selective Forwarding Unit (SFU), that middleman needs to decrypt your packets to know where to send them. This is where the dream of true, "the-server-can't-see-me" End-to-End Encryption (E2EE) usually goes to die.

Enter Insertable Streams.

The Middleman Problem

To understand why we need Insertable Streams, you have to look at how a typical SFU works. It receives your media, decrypts the SRTP layer, looks at the metadata to decide who gets what, and then re-encrypts it for the receivers.

Most of us simply "trust" the server. But in high-stakes fields like law and medicine, or for anyone who is simply privacy-conscious, that trust is a liability. We want a way to encrypt the media *before* it hits the WebRTC stack and decrypt it only *after* it leaves the stack on the other end.

How Insertable Streams Flip the Script

The Insertable Streams API (often referred to as the "Encoded Transform" API) gives us a hook into the media pipeline. It lets us sit between the encoder and the packetizer on the sender side, and between the depacketizer and the decoder on the receiver side.

You essentially get a ReadableStream and a WritableStream of encoded frames. You can pluck a frame out, scramble its bits, and shove it back into the pipe.

Step 1: Hooking the Sender

First, we need to tell our RTCPeerConnection that we want to mess with the bits. We do this on the RTCRtpSender that comes back when we add the track.

const pc = new RTCPeerConnection(config);
const track = stream.getTracks()[0];

// addTrack() returns the RTCRtpSender; its `transform` property is where the magic lives
const sender = pc.addTrack(track, stream);

// We'll use a Web Worker because doing crypto on the main thread is a crime
const worker = new Worker('crypto-worker.js');

sender.transform = new RTCRtpScriptTransform(worker, { 
  key: "super-secret-key", 
  direction: "encrypt" 
});
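The key above is just a string. A real cipher such as AES-GCM needs a CryptoKey, so something has to turn the shared secret into one. Here is a minimal sketch using PBKDF2 from the Web Crypto API. The function name, the salt text, and the iteration count are assumptions for illustration, and this says nothing about how the secret gets shared in the first place.

```javascript
// In a browser or worker, crypto.subtle is a global.
// The fallback only exists so this sketch can also run under Node.
const subtle = globalThis.crypto?.subtle ?? require('node:crypto').webcrypto.subtle;

// Hypothetical helper: stretch a shared passphrase into an AES-GCM key.
async function deriveFrameKey(passphrase, saltText) {
  const encoder = new TextEncoder();
  const material = await subtle.importKey(
    'raw',
    encoder.encode(passphrase),
    'PBKDF2',
    false, // the passphrase material itself is not extractable
    ['deriveKey'],
  );
  return subtle.deriveKey(
    { name: 'PBKDF2', salt: encoder.encode(saltText), iterations: 100000, hash: 'SHA-256' },
    material,
    { name: 'AES-GCM', length: 128 },
    false, // the derived key never leaves Web Crypto
    ['encrypt', 'decrypt'],
  );
}
```

Both ends have to use the same passphrase and salt to land on the same key. One option is to run this inside the worker on the string it receives in its options, so only the worker ever holds the derived key.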

The Worker: Where the Dirty Work Happens

You don't want to block the UI thread while doing AES-GCM on 60 frames per second of 4K video. That’s a one-way ticket to Jitter-ville. We handle the transformation inside a Web Worker.

The worker listens for the rtctransform event. We get a readable stream of frames, and we pipe them through a TransformStream.

// crypto-worker.js
onrtctransform = (event) => {
  const { readable, writable, options } = event.transformer;
  
  const transformStream = new TransformStream({
    transform(encodedFrame, controller) {
      // The 'data' is an ArrayBuffer holding the encoded media (VP8, H.264, etc.)
      const buffer = encodedFrame.data;

      // Simple XOR for demo purposes (PLEASE don't use XOR for real E2EE!)
      // XOR is its own inverse, so this same transform also "decrypts" on the receiver.
      const encryptedData = new Uint8Array(buffer.byteLength);
      const originalData = new Uint8Array(buffer);
      
      for (let i = 0; i < buffer.byteLength; i++) {
        encryptedData[i] = originalData[i] ^ 0xAA; 
      }

      encodedFrame.data = encryptedData.buffer;
      controller.enqueue(encodedFrame);
    },
  });

  readable.pipeThrough(transformStream).pipeTo(writable);
};
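The handler above reads options but never looks at the direction it was given. Here is a sketch of one way to factor the transform out so it can be exercised without a peer connection. createFrameTransform and runFrames are hypothetical helper names, and the frames are plain objects with a data ArrayBuffer standing in for real encoded frames.

```javascript
// Hypothetical helper: builds the TransformStream for one direction.
// XOR is its own inverse, so both directions share the same byte loop here;
// a real cipher would branch on `direction`.
function createFrameTransform(direction, keyByte = 0xaa) {
  if (direction !== 'encrypt' && direction !== 'decrypt') {
    throw new Error(`Unknown direction: ${direction}`);
  }
  return new TransformStream({
    transform(encodedFrame, controller) {
      const input = new Uint8Array(encodedFrame.data);
      const output = new Uint8Array(input.length);
      for (let i = 0; i < input.length; i++) {
        output[i] = input[i] ^ keyByte;
      }
      encodedFrame.data = output.buffer;
      controller.enqueue(encodedFrame);
    },
  });
}

// Hypothetical test harness: pushes frame-like objects through a transform
// and collects whatever comes out the other side.
async function runFrames(frames, transform) {
  const source = new ReadableStream({
    start(controller) {
      frames.forEach((frame) => controller.enqueue(frame));
      controller.close();
    },
  });
  const results = [];
  const sink = new WritableStream({
    write(frame) {
      results.push(frame);
    },
  });
  await source.pipeThrough(transform).pipeTo(sink);
  return results;
}

// In the worker, the wiring would then shrink to something like:
//   onrtctransform = ({ transformer: { readable, writable, options } }) =>
//     readable.pipeThrough(createFrameTransform(options.direction)).pipeTo(writable);
```

Running a frame through an 'encrypt' transform and then a 'decrypt' transform should hand back the original bytes, which makes a handy sanity check before you swap the XOR for a real cipher.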

The "Don't Break the SFU" Rule

Here is where most people trip up. If you encrypt the *entire* frame, your SFU is going to have a bad time.

SFUs need to see the frame metadata—is this a keyframe (I-frame) or a delta frame? If you scramble the header bytes that tell the SFU it's a keyframe, the SFU won't know it needs to send that frame to a new participant who just joined. The result? A black screen and a lot of frustrated users.

In a real-world scenario, you'd use a purpose-built scheme like SFrame rather than rolling your own. The simple workaround shown here is to skip the first few bytes of the payload (the codec-specific headers) and only encrypt the actual media data.
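How many bytes count as "the first few" depends on the codec and the frame type. As a sketch, here is what the getCodecHeaderLength helper used in the next snippet could look like, assuming VP8 video and Opus audio. The byte counts follow the values commonly used in public Insertable Streams demos; other codecs (H.264, VP9, AV1) need their own parsing, which this does not attempt.

```javascript
// Bytes left in the clear per frame type, assuming VP8 video and Opus audio.
// VP8 key frames: 3-byte frame tag + 7 bytes of start code and dimensions.
// VP8 delta frames: just the 3-byte frame tag.
// Audio frames carry no `type`; 1 byte covers the Opus TOC byte.
const CLEAR_BYTES = { key: 10, delta: 3, audio: 1 };

function getCodecHeaderLength(frame) {
  // Video frames report 'key', 'delta' or 'empty'; audio frames report nothing.
  const kind = frame.type ?? 'audio';
  const wanted = CLEAR_BYTES[kind] ?? 0;
  // Never claim more header bytes than the frame actually has.
  return Math.min(wanted, frame.data.byteLength);
}
```

Clamping to the frame size means a tiny or empty frame never yields a header longer than the data itself.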

function encryptPayload(frame, key) {
  // Placeholder: returns how many leading bytes must stay readable for the SFU
  const headerLength = getCodecHeaderLength(frame);
  const data = new Uint8Array(frame.data);

  // Encrypt only the portion AFTER the header
  const payloadToEncrypt = data.subarray(headerLength);
  // Placeholder for a real cipher; assumed here to return a Uint8Array synchronously
  const encrypted = actualCryptoMagic(payloadToEncrypt, key);

  // Reassemble: clear header first, then the encrypted payload
  const finalFrame = new Uint8Array(headerLength + encrypted.length);
  finalFrame.set(data.subarray(0, headerLength), 0);
  finalFrame.set(encrypted, headerLength);

  frame.data = finalFrame.buffer;
}
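As one possible stand-in for actualCryptoMagic, here is a minimal sketch using AES-GCM from the Web Crypto API. Everything about the layout is an assumption made for illustration: the IV is built from a per-sender frame counter and appended to the frame so the receiver can recover it, and the clear header is passed as additional authenticated data so tampering with it gets caught. If several senders share one key, the IV would also need a per-sender component (the SSRC, say), because reusing an IV under the same key breaks AES-GCM. This is not SFrame.

```javascript
// In a worker, crypto.subtle is a global.
// The fallback only exists so this sketch can also run under Node.
const subtle = globalThis.crypto?.subtle ?? require('node:crypto').webcrypto.subtle;

const IV_LENGTH = 12; // 96-bit IV, the usual size for AES-GCM

// Build an IV from a frame counter (big-endian, in the last 4 bytes).
function makeIv(counter) {
  const iv = new Uint8Array(IV_LENGTH);
  new DataView(iv.buffer).setUint32(IV_LENGTH - 4, counter);
  return iv;
}

// Result layout: [clear header | ciphertext + 16-byte tag | 12-byte IV]
async function encryptFrameData(key, data, headerLength, counter) {
  const bytes = new Uint8Array(data);
  const header = bytes.slice(0, headerLength);
  const iv = makeIv(counter);
  const ciphertext = new Uint8Array(await subtle.encrypt(
    { name: 'AES-GCM', iv, additionalData: header }, // header: authenticated, not encrypted
    key,
    bytes.slice(headerLength),
  ));
  const out = new Uint8Array(headerLength + ciphertext.byteLength + IV_LENGTH);
  out.set(header, 0);
  out.set(ciphertext, headerLength);
  out.set(iv, headerLength + ciphertext.byteLength);
  return out.buffer;
}

// Reverses encryptFrameData; rejects if the header or ciphertext was altered.
async function decryptFrameData(key, data, headerLength) {
  const bytes = new Uint8Array(data);
  const header = bytes.slice(0, headerLength);
  const iv = bytes.slice(bytes.byteLength - IV_LENGTH);
  const plaintext = new Uint8Array(await subtle.decrypt(
    { name: 'AES-GCM', iv, additionalData: header },
    key,
    bytes.slice(headerLength, bytes.byteLength - IV_LENGTH),
  ));
  const out = new Uint8Array(headerLength + plaintext.byteLength);
  out.set(header, 0);
  out.set(plaintext, headerLength);
  return out.buffer;
}
```

Web Crypto is promise-based, so unlike the placeholder these functions are async. A TransformStream's transform() is allowed to return a promise, so the worker can await the result, assign it to the frame's data, and then enqueue the frame. Each encrypted frame grows by 28 bytes (16 for the tag, 12 for the IV).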

The Receiver Side: Reversing the Damage

On the receiving end, it’s the exact same dance, just in reverse. You hook the RTCRtpReceiver and pass the bits back through a worker to decrypt them before they reach the decoder and, eventually, the <video> element.

pc.ontrack = (event) => {
  const receiver = event.receiver;
  receiver.transform = new RTCRtpScriptTransform(worker, { 
    key: "super-secret-key", 
    direction: "decrypt" 
  });
};

Is this "True" E2EE?

Technically, yes, as long as the keys never touch the server. Because the encryption happens at the application layer before the data is handed off to the browser's networking stack, the SFU only ever sees "garbage" data (the encrypted payload). It can route the packets based on the unencrypted RTP headers, but it can't peek at the pixels.

The Reality Check

Before you rush out to refactor your entire stack, keep a few things in mind:

1. Key Management: Insertable Streams give you the *pipe*, but they don't give you the *key*. You still need a secure way (like Double Ratchet or Signal Protocol) to exchange keys between participants without the server intercepting them.
2. Performance: While Web Workers help, there is still per-frame overhead in copying buffers. If you hand frames or streams between threads yourself, transfer them (Transferable Objects) instead of cloning them to keep things snappy.
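3. Browser Support: Chrome and Edge are leading the charge here. Safari is... getting there (it's in Technology Preview). Firefox has its own thoughts on the matter. Support has been shifting and also differs by API shape (Chromium has long shipped a non-standard createEncodedStreams() variant), so check current compatibility data and feature-detect. It's not a universal "it just works" solution yet.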
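Given the uneven support in point 3, it is safer to feature-detect than to sniff the browser. Here is a small sketch; the function name and the returned labels are made up for illustration, and it only checks whether the APIs are present, not whether they behave.

```javascript
// Reports which encoded-frame API a global scope exposes.
// `scope` defaults to the real global; passing a fake makes this testable.
function detectEncodedTransformSupport(scope = globalThis) {
  // Standards-track API: sender.transform = new RTCRtpScriptTransform(worker, options)
  if (typeof scope.RTCRtpScriptTransform === 'function') {
    return 'script-transform';
  }
  // Older, non-standard Chromium API: sender.createEncodedStreams()
  const senderProto = scope.RTCRtpSender && scope.RTCRtpSender.prototype;
  if (senderProto && 'createEncodedStreams' in senderProto) {
    return 'encoded-streams';
  }
  return 'none';
}
```

On the 'encoded-streams' path the setup differs from the code in this post: the peer connection is created with encodedInsertableStreams set to true, and the readable/writable pair comes from createEncodedStreams() on the sender or receiver instead of from an rtctransform event. On 'none', the honest options are to fall back to an unencrypted call or to refuse to join.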

Insertable Streams are a massive leap forward. They turn WebRTC from a "secure-ish" protocol into a toolkit for building genuinely private communication. It’s a bit of a low-level grind to get the frame parsing right, but if you actually care about keeping eyes off your users' data, it’s the only way to fly.