loke.dev

The String Is a Memory Illusion

A deep dive into V8’s internal representation of ConsStrings and SlicedStrings, and why your 10-character substring might be keeping a 100MB file alive in the heap.


Have you ever wondered why your Node.js process is gasping for breath, hitting Out-Of-Memory (OOM) limits, even though your heap snapshot claims you’re only holding onto a few hundred small strings?

It’s a common frustration. You write a parser, you extract some IDs from a massive 200MB log file, you discard the original file content, and yet the resident set size (RSS) stays stubbornly high. In the high-level world of JavaScript, we treat strings like immutable, atomic primitives. We think of them as simple arrays of characters living somewhere in the heap. But V8, the engine powering Chrome and Node.js, sees things differently. To V8, a string is rarely just a string; it is a complex, hierarchical data structure designed to balance memory usage against execution speed.

The reality is that "The String" is often a memory illusion. What looks like a 10-character ID in your debugger might actually be a pointer holding a 100MB buffer hostage in the garbage collector’s eyes.

The Mental Model vs. The Machine

We are taught that strings are primitives. If I do this:

let a = "Hello";
let b = "World";
let c = a + " " + b;

Our mental model has V8 allocating a new block of memory and copying "Hello", then a space, then "World" into it. If that were literally true, heavy string manipulation would be incredibly expensive: every concatenation would be an $O(n)$ copy operation. If you were building a large JSON string or an HTML template by hand, your performance would crater.

V8 avoids this by being lazy. It uses a variety of internal "shapes" or representations for strings. Depending on how a string was created, it might be a SeqString, a ConsString, a SlicedString, or a ThinString.
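If you want to see which shape a given string has, V8's native syntax can dump it. A sketch: `%DebugPrint` is a V8 debugging intrinsic, only available behind a flag; in release builds of Node it prints an abbreviated dump, and the output format changes between versions.

```shell
# %DebugPrint is a V8 intrinsic, only usable with --allow-natives-syntax.
# The dump includes the string's internal shape (SeqString, ConsString, ...).
node --allow-natives-syntax -e '
  const a = "Hello";
  const b = a + ", World! This side is long enough to matter.";
  %DebugPrint(b);
'
```

Treat this as a debugging aid, not an API: the natives syntax is not stable and should never ship in production code.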

The Simple Case: SeqString

A SeqString (Sequential String) is the closest thing to our mental model. It is a contiguous block of memory containing the actual characters.

// This is likely stored as a SeqOneByteString
const simple = "Just some text";

V8 has two main flavors here: SeqOneByteString (for ASCII/Latin-1) and SeqTwoByteString (for UTF-16). If V8 can fit your string into one byte per character, it will, saving 50% of the memory. This is the "flat" representation. Every other string type eventually points back to one of these.
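A rough way to see the one-byte/two-byte split from plain JavaScript is to watch `process.memoryUsage().heapUsed` around an allocation. This is a sketch, not a precise measurement: `heapUsed` is approximate and GC activity can disturb the deltas.

```javascript
// Rough sketch: the same character count costs about twice as much once any
// character falls outside Latin-1, because V8 switches to two bytes per char.
// heapUsed is an approximation and can be disturbed by GC activity.
function measure(make) {
  const before = process.memoryUsage().heapUsed;
  const s = make();
  const after = process.memoryUsage().heapUsed;
  return { approxBytes: after - before, length: s.length };
}

const oneByte = measure(() => 'a'.repeat(1 << 20));      // Latin-1: ~1 byte/char
const twoByte = measure(() => '\u20AC'.repeat(1 << 20)); // euro sign: ~2 bytes/char

console.log(oneByte, twoByte);
```

On a quiet heap, the second allocation should cost roughly twice the first for the same `length`.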

The Efficiency Trap: ConsStrings

When you concatenate two strings using the + operator, V8 doesn't copy them. Instead, it creates a ConsString. Think of a ConsString as a node in a binary tree. It has two pointers: left and right.

const str1 = "Large string content...".repeat(1000);
const str2 = "More content...".repeat(1000);
const bigString = str1 + str2;

In this scenario, bigString doesn't contain the combined characters. It's a small object that says: "I am the result of str1 and str2."

This makes concatenation $O(1)$. It’s incredibly fast. But there is a hidden cost. If you concatenate in a loop, you create a deeply nested tree.
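You can observe the $O(1)$ behavior directly: concatenating two multi-megabyte strings completes in microseconds, because only a small two-pointer node is allocated. A sketch (absolute timings will vary by machine):

```javascript
// Concatenating two 10MB strings allocates only a small ConsString node
// holding two pointers; no characters are copied at this point.
const left = 'x'.repeat(10 * 1024 * 1024);
const right = 'y'.repeat(10 * 1024 * 1024);

const start = process.hrtime.bigint();
const combined = left + right; // O(1): just a new two-pointer node
const elapsedNs = process.hrtime.bigint() - start;

console.log(`Concat of 20MB took ${elapsedNs} ns`);
console.log(combined.length); // 20971520
```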

let html = "";
for (let i = 0; i < 10000; i++) {
  html += "<div>" + i + "</div>";
}

Now, html is a ConsString tree roughly 10,000 nodes deep. If you try to access an individual character, V8 has to walk that tree to find it. To mitigate this, V8 "flattens" the tree on demand: operations like indexing, regex matching, and comparison first convert the ConsString into a single contiguous SeqString. V8 also avoids building pathologically deep trees in the first place by flattening eagerly in some cases.

But the real danger to your heap isn't the ConsString. It's its cousin: the SlicedString.

The Memory Leak: SlicedStrings

This is where the "illusion" becomes dangerous. Let’s say you are reading a massive 50MB CSV file into memory. You only need the header names.

const fs = require('fs');

function getHeaders() {
  // Assume fileContent is a 50MB string
  const fileContent = fs.readFileSync('massive_data.csv', 'utf-8');
  
  // We only want the first line
  const firstLine = fileContent.split('\n')[0];
  
  return firstLine;
}

const headers = getHeaders();
// At this point, you expect 50MB to be eligible for Garbage Collection.

In many versions of V8, firstLine is not a fresh copy of those few header bytes. It is a SlicedString.

A SlicedString doesn't store its own characters. It stores:
1. A pointer to the parent string.
2. An offset into that parent.
3. A length.

In our example, headers is a tiny object that says: "I am a substring of fileContent, starting at index 0, with a length of 80."

Because headers is still reachable in your code, the SlicedString object is reachable. And because the SlicedString has a pointer to the 50MB fileContent string, the entire 50MB string is held in memory. Even though you only care about 80 bytes, you are paying for the full 50MB.

This isn't a bug; it's a performance optimization. Copying 50MB of data is slow. Creating a SlicedString is nearly instantaneous. V8 gambles that you’ll eventually let go of the slice, or that the memory savings of not copying outweigh the risk of retention.
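The retention mechanic is easy to model in plain JavaScript. This is a toy model, not V8's actual code, but a real SlicedString stores exactly these three fields inside the engine:

```javascript
// A toy model (not V8's code) of why a slice pins its parent in memory.
class ToySlicedString {
  constructor(parent, offset, length) {
    this.parent = parent; // strong pointer: keeps the whole parent alive
    this.offset = offset;
    this.length = length;
  }
  toString() {
    return this.parent.slice(this.offset, this.offset + this.length);
  }
}

const parent = 'h'.repeat(50 * 1024 * 1024); // stand-in for the 50MB file
const headers = new ToySlicedString(parent, 0, 80);
// As long as `headers` is reachable, so is the full 50MB `parent`.
```

The garbage collector sees exactly this shape: a tiny object whose `parent` field is a strong reference to something enormous.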

Visualizing the Ghost

If you were to look at a heap snapshot in Chrome DevTools after running the code above, you might see something confusing. You’ll see your headers string, and its "retained size" will be massive. The "shallow size" (the memory the object itself takes) will be tiny (maybe 32 bytes), but the "retained size" (the memory that would be freed if this object were deleted) will include the entire 50MB parent.

The ThinString: A Modern Twist

V8 developers are aware of this. In recent years, they introduced ThinString to help with "string interning."

When V8 "internalizes" a string (stores a canonical copy in a lookup table so identical strings share memory), it can't always rewrite every existing reference to point at the canonical copy. Instead, it turns the original string into a ThinString: a wrapper holding a single pointer to the internalized copy. Flattening performs a similar in-place trick: when a ConsString is flattened, V8 overwrites it so that its left child points to the new flat SeqString and its right child to the empty string.

The memory graph, in other words, is full of indirection. The takeaway remains the same: the string you hold in your hand is often just a wrapper for something much larger.

Practical Evidence: Identifying the Leak

Let’s write a script that demonstrates this leak. We can use the --expose-gc flag in Node.js to see the impact clearly.

// run with: node --expose-gc leak-demo.js

function getUsage() {
  const used = process.memoryUsage().heapUsed / 1024 / 1024;
  return `${Math.round(used * 100) / 100} MB`;
}

let tinyStrings = [];

function leakMemory() {
  // Create a large 10MB string
  let largeString = "x".repeat(10 * 1024 * 1024);
  
  // Take a small slice of it. Note: V8 only creates a SlicedString when the
  // slice exceeds a small minimum length (around 13 characters in current
  // V8); shorter substrings are simply copied, which would hide the leak.
  let slice = largeString.substring(0, 64);
  
  // Store the slice globally
  tinyStrings.push(slice);
  
  // largeString goes out of scope here...
  // but slice still points to it internally!
}

console.log("Starting:", getUsage());

for (let i = 0; i < 10; i++) {
  leakMemory();
  global.gc(); // Force garbage collection
  console.log(`After leak ${i + 1}:`, getUsage());
}

If you run this, you will see the heap usage climb by roughly 10MB every iteration, even though we are only keeping a tiny slice in the array and manually calling the GC. Each slice is a SlicedString dragging the 10MB carcass of its parent behind it.

How to Break the Spell

How do you force V8 to give up the parent string? You need to turn the SlicedString into a SeqString. You need to force a copy.

1. The "Slice-and-Dice" (Old Hack)

In older versions of V8, there were various "hacks" to force a copy, like + '' or String.fromCharCode. However, V8's optimizer is smart and often realizes these are no-ops, preserving the SlicedString.

2. The String.prototype.repeat(1) Trick

Surprisingly, some developers use .repeat(1) or similar methods to force a copy. But even this is inconsistent across engine versions.

3. The Template Literal "Hack"

Sometimes, creating a new string via interpolation can force a new allocation, but it depends on the length and the engine's current state.

let slice = largeString.substring(0, 5);
let forcedCopy = `${slice}`; // Not guaranteed to work!

4. The Nuclear Option: JSON.parse(JSON.stringify())

This is incredibly slow and should never be used in a hot path, but because JSON serialization requires visiting every character, the resulting object will always contain fresh SeqStrings.

let safeSlice = JSON.parse(JSON.stringify(largeString.substring(0, 5)));

5. The Real Solution: Careful Parsing

If you are worried about this, the best approach is to avoid holding onto slices of large strings in long-lived objects. If you must keep a value, consider if you can convert it to a different type (like a Number) as soon as possible.
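If you do need to detach a string slice in Node, a commonly used workaround is to round-trip it through a Buffer. Re-encoding the characters produces a freshly allocated string with no internal pointer to the parent. This is an implementation detail of Node/V8, not a spec guarantee, and it costs an $O(n)$ copy of the slice (which is the point). A sketch:

```javascript
// Round-tripping through a Buffer re-encodes the characters, so the result
// is a fresh string. Implementation detail, not a spec guarantee.
function forceCopy(slice) {
  return Buffer.from(slice, 'utf8').toString('utf8');
}

const big = 'x'.repeat(1024 * 1024);
const detached = forceCopy(big.substring(0, 64));
console.log(detached.length); // 64
```

Only do this for slices you intend to keep long-term; copying every substring defeats the optimization entirely.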

In modern Node.js, Buffer.prototype.slice() has been deprecated in favor of Buffer.subarray(). Both share memory with the parent; slice() was deprecated because Uint8Array.prototype.slice() copies, while Buffer's slice() did not, and subarray() at least makes the sharing explicit in the name. With strings, we don't even have a copy() method, which is a significant omission in the JavaScript standard library for performance-critical applications.

When ConsStrings Bite Back

It’s not just about memory leaks. ConsString trees can also destroy performance in ways that aren't obvious.

Imagine you are building a large string in a loop.

let result = "";
for (let i = 0; i < 100000; i++) {
  result += "line " + i + "\n";
}

Every + creates a new ConsString object. These objects are small, but they live in the "Young Generation" (New Space) of the V8 heap. Because we are creating 100,000 of them, we trigger frequent "Minor GCs" (Scavenges).

If you then pass this result to a function that uses a Regular Expression:

const match = result.match(/line 999/);

V8 has to "flatten" the string before the Regex engine can work on it. Flattening involves allocating a single SeqString large enough to hold the whole thing and then doing a tree traversal to copy all the characters. If your string is 10MB, you just did a 10MB allocation and a massive copy operation right when you thought you were just doing a simple search.

The Fix: Use an array and join('').

let parts = [];
for (let i = 0; i < 100000; i++) {
  parts.push("line " + i + "\n");
}
let result = parts.join('');

Array.join is optimized to pre-calculate the total length required, allocate a single SeqString, and copy the parts into it in one pass. It skips the "tree of pointers" phase entirely.
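The two approaches above can be compared side by side. Both builders produce identical output; the join version skips the 100,000 intermediate ConsString allocations. Absolute timings are machine-dependent, so this is illustrative only:

```javascript
// Same output, different allocation patterns: += builds a deep ConsString
// tree node by node; join allocates one SeqString and fills it in one pass.
function buildWithConcat(n) {
  let result = '';
  for (let i = 0; i < n; i++) result += 'line ' + i + '\n';
  return result;
}

function buildWithJoin(n) {
  const parts = [];
  for (let i = 0; i < n; i++) parts.push('line ' + i + '\n');
  return parts.join('');
}

console.time('concat');
const a = buildWithConcat(100000);
console.timeEnd('concat');

console.time('join');
const b = buildWithJoin(100000);
console.timeEnd('join');
```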

The External String Gotcha

In Node.js, you also encounter ExternalString. This happens frequently when using Buffers or data coming from C++ addons. An ExternalString points to memory outside the V8 heap (in the "system" memory).

The trick here is that the V8 Garbage Collector only knows about the small JavaScript wrapper object. If you have a 1GB string living in system memory and a tiny JavaScript object pointing to it, V8 might not feel "pressured" to run a GC because it thinks it’s only using a few bytes. This can lead to your system running out of RAM while Node.js thinks everything is fine.
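You can see this split in Node's own accounting. process.memoryUsage() reports off-heap allocations (including all Buffers) in the external and arrayBuffers fields, separately from heapUsed. A sketch:

```javascript
// Buffers live outside the V8 heap. heapUsed barely moves when we allocate
// 32MB, but the external/arrayBuffers counters jump; this is why heap-only
// metrics can miss real memory pressure.
const before = process.memoryUsage();
const big = Buffer.alloc(32 * 1024 * 1024); // 32MB outside the V8 heap
const after = process.memoryUsage();

console.log('heapUsed delta (MB):',
  ((after.heapUsed - before.heapUsed) / 1024 / 1024).toFixed(2));
console.log('arrayBuffers delta (MB):',
  ((after.arrayBuffers - before.arrayBuffers) / 1024 / 1024).toFixed(2));
console.log(big.length);
```

When monitoring a Node service, watch RSS and external alongside heapUsed; the heap number alone tells an incomplete story.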

Summary: Living with the Illusion

Understanding that strings are trees and pointers helps you write better code. Here is the mental checklist I use when working with large amounts of text in JavaScript:

1. Concatenation loops are evil: Use Array.join() or Buffer.concat() instead of += inside loops.
2. Slices are sticky: If you extract a small piece of a massive string and intend to keep it in a long-lived cache or global variable, be aware that you might be keeping the whole parent alive.
3. Regex flattens: If you perform regex operations on a string built via concatenation, you’re triggering a hidden flattening cost.
4. DevTools lie (slightly): When looking at heap snapshots, always look at the Retained Size, not just the Shallow Size of strings.
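The "collect parts, combine once" pattern from the checklist applies to binary data too. A sketch using Buffer.concat: passing the total length up front lets Node allocate the final buffer in a single step instead of summing chunk sizes itself.

```javascript
// The binary-data analogue of Array.join: collect chunks, concatenate once.
const chunks = [];
let total = 0;
for (let i = 0; i < 1000; i++) {
  const chunk = Buffer.from(`record ${i}\n`);
  chunks.push(chunk);
  total += chunk.length;
}
const combined = Buffer.concat(chunks, total);
console.log(combined.length === total); // true
```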

V8's string implementation is a masterpiece of engineering. It makes 99% of our code run faster without us thinking about it. But for that 1%—the high-performance parsers, the data processors, the long-running servers—the illusion of the simple string is one we eventually have to see through.

Memory management in JavaScript is automatic, but it isn't magic. Sometimes, the best way to save memory is to stop believing in the primitives and start looking at the pointers.