
How to Tokenize Multilingual Text Without a Heavy NLP Library
Stop shipping massive JavaScript bundles for basic text processing; the native Intl.Segmenter API provides locale-aware word and sentence splitting across languages.
Have you ever tried to split a Japanese sentence into words using .split(' ') and realized—with a sinking feeling—that Japanese doesn't actually use spaces?
It’s a classic "English-centric" developer trap. We assume text is just a bunch of characters separated by whitespace, but the moment your app hits a global audience, that logic falls apart. Usually, the "fix" is to reach for a massive NLP library like Natural or Compromise. Suddenly, your bundle size balloons by 500KB just so you can accurately count words in a search bar.
There is a better way. It’s built into your browser, it's locale-aware, and it's called Intl.Segmenter.
The "Space" Fallacy
If you're working strictly in English, string.split(' ') is... fine. Mostly. But what happens when a user enters "Hello!!! How are you?" Your split logic gives you ["Hello!!!", "How", "are", "you?"]. Those punctuation marks are hitching a ride on your tokens.
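That punctuation problem is easy to see in two lines (this just reproduces the example above):

```javascript
// Naive whitespace splitting drags punctuation along with the words.
const naive = "Hello!!! How are you?".split(' ');
console.log(naive); // ["Hello!!!", "How", "are", "you?"]
```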
Now, try that with a language like Thai or Chinese.
const sentence = "我喜欢编写代码"; // "I like writing code"
console.log(sentence.split(' '));
// Output: ["我喜欢编写代码"] -> Totally useless.

The browser already knows how to handle this; it has to, otherwise it wouldn't know where to wrap lines of text on a webpage. Intl.Segmenter lets you tap into that native power.
Meet Intl.Segmenter
The API is straightforward. You define a locale and a "granularity" (whether you want to split by word, sentence, or grapheme).
Here is the basic setup:
const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
const input = "我喜欢编写代码";
const segments = segmenter.segment(input);
for (const { segment } of segments) {
console.log(segment);
}
// Output: "我", "喜欢", "编写", "代码"

Magic, right? No 10MB dictionary files required. The browser taps the ICU segmentation data it already ships with.
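Each item the iterator yields is a small object carrying more than just the text. A quick sketch of its shape:

```javascript
// Every segment object carries the text, its position, and word-likeness.
const seg = new Intl.Segmenter('en', { granularity: 'word' });
// The Segments object is iterable, so we can pull the first item directly.
const [first] = seg.segment("hello world");
console.log(first.segment);    // "hello"
console.log(first.index);      // 0 (offset in the input string)
console.log(first.isWordLike); // true (present only for word granularity)
```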
Filtering Out the Junk
One of my favorite features is the isWordLike property. When you segment text into words, the API technically includes the spaces and punctuation as segments. In most cases, you don't want those.
Instead of writing a complex regex to clean up the results, you can just filter the iterator:
const text = "JavaScript is fun, isn't it? 🚀";
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });
const segments = segmenter.segment(text);
const wordsOnly = Array.from(segments)
.filter(s => s.isWordLike)
.map(s => s.segment);
console.log(wordsOnly);
// ["JavaScript", "is", "fun", "isn't", "it"]

Notice how it handled "isn't"? A simple split would have struggled with the apostrophe or left it dangling. Here, it’s treated as a single semantic unit. (The 🚀 gets filtered out too; emoji aren't letters or digits, so they aren't word-like.)
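The same pattern works at the sentence level. A minimal sketch (the sample text is ours; boundaries come from ICU's default rules, which don't special-case abbreviations like "Dr."):

```javascript
// Sentence granularity splits on sentence boundaries, keeping whitespace.
const sentenceSeg = new Intl.Segmenter('en', { granularity: 'sentence' });
const blurb = "It works. It's fast! Is it native?";
const sentences = [...sentenceSeg.segment(blurb)].map(s => s.segment.trim());
console.log(sentences); // ["It works.", "It's fast!", "Is it native?"]
```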
The Emoji Problem (Graphemes)
If you’ve ever tried to split('') a string containing complex emojis (like 👨‍👩‍👧‍👦), you know the nightmare of surrogate pairs. You end up with a bunch of garbled boxes because that one emoji is actually seven code points: four person emoji glued together by invisible "Zero Width Joiner" characters, and each person emoji is itself a surrogate pair that split('') slices in half.
If you need to count how many "visual" characters a user typed, use the grapheme granularity:
const emojiSoup = "Hi! 👋🏾👨‍👩‍👧‍👦";
// The old, broken way
console.log(emojiSoup.split('').length); // 19 UTF-16 code units (Yikes)
// The Intl way
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const count = [...segmenter.segment(emojiSoup)].length;
console.log(count); // 6 (H, i, !, the space, 👋🏾, 👨‍👩‍👧‍👦)

Real-World Performance Tip
Don't instantiate the Intl.Segmenter inside a loop. Like its cousins Intl.DateTimeFormat and Intl.NumberFormat, the initialization is the expensive part.
If you're processing a large array of strings, create the segmenter once and reuse it:
const words = ["Some long text...", "Another string...", "Wait, more?"];
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });
// Do this
const processed = words.map(text => [...segmenter.segment(text)]);
// DON'T create a new segmenter inside the map!

Can I use this today?
For a long time, Firefox was the holdout, which made Intl.Segmenter a "nice to have but I can't use it" feature. However, Firefox 125 finally shipped support in April 2024.
We are now in the glorious era where all major evergreen browsers (Chrome, Edge, Safari, and Firefox) and Node.js (v16+) support this natively.
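If you still need to guard against older runtimes, a plain feature check does the job. A sketch, with `hasSegmenter` and `roughWords` as names we made up and a deliberately crude whitespace fallback:

```javascript
// Feature-detect before constructing; fall back to a naive split otherwise.
function hasSegmenter() {
  return typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function';
}

function roughWords(text) {
  if (hasSegmenter()) {
    const seg = new Intl.Segmenter('en', { granularity: 'word' });
    return [...seg.segment(text)].filter(s => s.isWordLike).map(s => s.segment);
  }
  return text.split(/\s+/).filter(Boolean); // crude fallback
}

console.log(roughWords("Hello, world")); // ["Hello", "world"]
```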
If you're building a character counter, a search highlighter, or a basic tag extractor, stop adding dependencies to your package.json. The browser is already an NLP expert; you just have to ask it nicely.
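For instance, the search-bar word counter from the intro comes down to a few lines. `countWords` is a name we made up, not a library API; note the segmenter is created once, per the performance tip:

```javascript
// Create the segmenter once and reuse it: initialization is the costly part.
const wordSegmenter = new Intl.Segmenter('en', { granularity: 'word' });

function countWords(text) {
  let n = 0;
  for (const s of wordSegmenter.segment(text)) {
    if (s.isWordLike) n++;
  }
  return n;
}

console.log(countWords("Hello, world! It's nice.")); // 4
```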


