
3 Performance Bottlenecks You Can Finally Delete Using WebAssembly Tail Calls
Functional recursion and indirect dispatch no longer require the 'trampoline' overhead that once crippled non-JS language runtimes in the browser.
I spent a good portion of 2019 trying to port a custom Scheme interpreter to the web. On paper, WebAssembly was the perfect target. In practice, I hit a wall that felt like 1995: the call stack. Every time my interpreted code entered a recursive loop, the browser would scream "Maximum call stack size exceeded" and die. To fix it, I had to wrap everything in a "trampoline"—a giant while loop that manually managed function calls. It worked, but it was slow, ugly, and felt like I was fighting the platform instead of using it.
For years, WebAssembly was missing a critical feature that functional programmers and compiler writers take for granted: Tail Call Optimization (TCO).
Without TCO, every function call pushes a new frame onto the stack. If you have a function that calls itself (or another function) at the very end of its execution, you’re wasting memory. The Wasm Tail Call proposal, which is now a finalized standard supported in Chrome, Firefox, and recently Safari, introduces the return_call and return_call_indirect instructions. These allow you to jump to another function while *reusing* the current stack frame.
If you’re building high-performance runtimes, complex state machines, or anything involving functional paradigms, here are the three specific performance bottlenecks you can finally delete.
---
1. The "Trampoline" Tax in Language Runtimes
When you’re compiling a language like Haskell, Elixir, or Scheme to Wasm, you often deal with functions that call each other indefinitely. Since Wasm (historically) didn't support tail calls, compilers had to implement "trampolines."
In a trampoline setup, a function doesn't call the next function directly. Instead, it returns the *address* of the next function to a central loop.
The Old Way (The Slow Way)
Here is what a trampoline looks like in conceptual C-style code that might be compiled to Wasm:
// A slow trampoline loop: each function returns the address of the
// next function to call (or NULL to stop) instead of calling it.
typedef void *(*thunk_fn)(void);

void *current_func = (void *)start_node;
while (current_func != NULL) {
    current_func = ((thunk_fn)current_func)();
}

Every single function call now involves:
1. Returning from a function.
2. Popping a stack frame.
3. Jumping back to the top of a while loop.
4. An indirect branch to the next function.
This destroys the CPU’s branch predictor. You’re essentially paying a "tax" on every single logical jump in your code.
The New Way (The return_call Way)
With tail calls, the "trampoline" disappears. You just call the next function. In WebAssembly Text (WAT), it looks like this:
(func $even (param $n i32) (result i32)
  local.get $n
  i32.eqz
  if (result i32)
    i32.const 1          ;; n reached zero: it is even
  else
    local.get $n
    i32.const 1
    i32.sub
    return_call $odd     ;; Tail call! No new stack frame.
  end
)

(func $odd (param $n i32) (result i32)
  local.get $n
  i32.eqz
  if (result i32)
    i32.const 0          ;; n reached zero: it is not odd
  else
    local.get $n
    i32.const 1
    i32.sub
    return_call $even    ;; Tail call! Reuses the frame.
  end
)

The return_call instruction tells the Wasm engine: "I'm done with the current frame. Clean it up, then call the next function using the current space." This reduces the overhead to a simple jump instruction, which is exactly what CPUs are built to do fast.
---
2. Indirect Dispatch Latency in State Machines
If you're building a regex engine, a parser, or a virtual machine (an interpreter within Wasm), you’re likely using a state machine.
Usually, this is implemented as a giant switch statement inside a loop. This is known as "opcode dispatch." The problem is that as the switch statement grows, the engine struggles to optimize it.
The Switch Bottleneck
while (running) {
    switch (opcodes[pc++]) {
    case OP_ADD:
        stack.push(stack.pop() + stack.pop());
        break;
    case OP_JUMP:
        pc = opcodes[pc];
        break;
    // ... 200 more cases
    }
}

Every iteration of this loop hits the switch logic. Even if the engine uses a jump table, you're still forced to go back to the "hub" before going to the "spoke."
Using Tail Calls for Direct Threaded Code
With return_call_indirect, you can implement what's known as Direct Threaded Code. Each opcode "handler" is its own Wasm function. At the end of the ADD function, it looks up the next function in a table and return_call_indirects straight to it.
;; A table of functions (our handlers)
(table $dispatch_table 256 funcref)

(func $op_add (param $pc i32)
  ;; ... perform addition logic ...
  ;; Advance the program counter past this opcode
  local.get $pc
  i32.const 1
  i32.add
  local.set $pc
  ;; Push the new pc as the next handler's argument
  local.get $pc
  ;; Load the next opcode from memory to use as the table index
  local.get $pc
  i32.load8_u
  ;; Tail call the next handler directly
  return_call_indirect (type $op_sig) (table $dispatch_table)
)

By doing this, you've deleted the central loop entirely. You’ve transformed a centralized bottleneck into a distributed chain of jumps. This allows the hardware's indirect branch predictor to actually learn the patterns of *your* interpreted code (e.g., it learns that OP_CMP is almost always followed by OP_JZ).
---
3. Deep Recursion and Shadow Stacks
A common "gotcha" in Wasm performance is the mismatch between the Wasm stack and the "linear memory" stack.
Wasm has its own execution stack that is managed by the browser engine (you can't touch it). However, many languages (like C++ or Rust) also maintain a "shadow stack" in linear memory to store pointers to local variables.
When you have deep recursion, you have to grow *both*.
The Memory Pressure Problem
In a standard recursive function (without tail calls), every call pushes:
1. A frame onto the Wasm internal stack.
2. A frame onto your own linear memory shadow stack.
The shadow stack growth is particularly painful because it messes with data locality. You're constantly jumping to new memory addresses, which can lead to cache misses. Furthermore, if you’re in a memory-constrained environment (like a background worker or a low-end mobile device), you might actually run out of linear memory simply because you’re storing 10,000 sets of function arguments that you don't actually need anymore.
The "Zero-Growth" Solution
When you use return_call, the engine knows the current arguments are dead. It can overwrite them in place.
If you're writing a recursive tree-traversal or a deep search algorithm, TCO ensures that your memory usage remains O(1) instead of O(N). This isn't just about avoiding a crash; it's about staying in the L1/L2 cache.
Here is a practical example: a tail-recursive "Sum" function.
(func $sum (param $ptr i32) (param $acc i64) (result i64)
  ;; Base case: if pointer is 0, return accumulator
  local.get $ptr
  i32.eqz
  if (result i64)
    local.get $acc
  else
    ;; Recursive case (node layout: i64 value at offset 0, next pointer at offset 8):
    ;; $ptr = load_next($ptr)
    ;; $acc = $acc + load_value($ptr)
    local.get $ptr
    i32.load offset=8    ;; get next pointer
    local.get $acc
    local.get $ptr
    i64.load             ;; get current value
    i64.add
    return_call $sum     ;; Jump back with new values
  end
)

In this code, no matter how long the linked list is, the Wasm stack and the shadow stack never grow. You are effectively running a loop, but with the semantic clarity of recursion.
---
Why didn't we have this from day one?
It sounds like a no-brainer, right? If it's so much faster, why did it take nearly 6 years for Tail Calls to move from a proposal to a standard?
The answer lies in the complexity of the WebAssembly security model and the diversity of execution environments.
1. The Shadow Stack: Engines like V8 (Chrome) and SpiderMonkey (Firefox) had to ensure that tail calls didn't break their existing stack-trace and debugging infrastructure. When you overwrite a frame, you "lose" history. How do you show a clean stack trace in the DevTools if the frames have been reused?
2. Architecture Differences: Implementing a guaranteed tail call on x86 is different than on ARM64. The Wasm specification requires that return_call *must* be a tail call (it's not just a hint to the compiler, it's a requirement). Ensuring this worked consistently across all platforms took time.
3. Security: Stack management is a frequent source of security vulnerabilities. The engineers had to be 100% sure that return_call couldn't be used to bypass the Control Flow Integrity (CFI) protections built into Wasm.
How to use it today
If you're writing WAT by hand, you can use return_call immediately. But most of us use higher-level languages.
* C++ (Emscripten): You can pass the -mtail-call flag so LLVM emits the new opcodes. You’ll also need to ensure your functions are actually written in a tail-recursive way and that you're using a recent version of LLVM.
* Rust: Rust doesn't have a native become keyword yet (the proposed syntax for tail calls), but the Wasm backend is gaining support for these opcodes.
* AssemblyScript: Support is currently being refined. You can often see discussions in the GitHub issues regarding the implementation of the return statement in tail positions.
Checking for Support in JS
Before you serve a Wasm binary that uses tail calls, you should feature-detect it. Since it's a new opcode, older browsers will fail to even *instantiate* the module.
const hasTailCalls = WebAssembly.validate(new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, // Magic
  0x01, 0x00, 0x00, 0x00, // Version
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00, // Type section
  0x03, 0x02, 0x01, 0x00, // Function section
  0x0a, 0x06, 0x01, 0x04, 0x00, 0x12, 0x00, 0x0b // Code section (contains return_call)
]));

if (hasTailCalls) {
  console.log("We are in the future! Tail calls are go.");
} else {
  console.log("Fallback to the slow trampoline way.");
}

The Verdict
WebAssembly Tail Calls aren't just a niche feature for functional programming nerds. They represent a fundamental shift in how we can structure Wasm applications. By deleting the need for trampolines, we remove a layer of abstraction that has been slowing down ported runtimes since Wasm's inception.
If you’ve been avoiding porting a complex language runtime or a high-performance interpreter to the web because of "stack anxiety" or the performance hit of indirect dispatch, it's time to take another look. The bottlenecks are officially open for deletion.