loke.dev

How to Run On-Device AI Prompts Without the 2GB Model Download

Stop forcing your users to download massive WASM binaries and learn how to tap into the browser's native, built-in language model for zero-latency AI features.

· 4 min read

Stop treating your users' data caps like an all-you-can-eat buffet. Shipping a 2GB WASM binary and a massive model weights file just to summarize a feedback form is, frankly, digital malpractice.

We’ve been told for the last year that if we want "on-device AI," we have to force the client to download a miniature version of Llama or Mistral. It makes the first-page load feel like downloading an AAA game from 2012. But there is a better way that's currently hiding inside your browser, and it requires exactly zero megabytes of extra download for your users.

The Browser is Finally Getting a Brain

Google has started baking Gemini Nano—a condensed version of their LLM—directly into the Chrome binary. This means the model is already there, sitting on the user's hard drive, waiting to be called via a JavaScript API.

No transformers.js, no giant .bin files, and no $0.02-per-1k-tokens bill from OpenAI. It’s called the Prompt API, and while it’s still in the "Experimental" phase (available in Chrome Canary or Dev channels), it’s the future of how we’ll build lightweight AI features.

Step 1: Checking if the House is Home

You can't just start shouting prompts at the window object and hope for the best. First, we need to see if the browser actually supports the language model API and if the model is downloaded and ready to go.

async function checkAISupport() {
  // Optional chaining guards browsers that don't expose the experimental API at all
  const capabilities = await window.ai?.languageModel?.capabilities();

  if (!capabilities || capabilities.available === 'no') {
    console.warn("AI Model not available. Back to regular old boring code.");
    return false;
  }

  return true;
}

The available field is actually quite informative. It can return 'readily', meaning the model is already on disk, or 'after-download', which means the browser needs to fetch the model components (a one-time background task) before you can use it.
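If you get 'after-download', you can still call create() and watch the fetch happen. Here's a sketch that handles both states; the `monitor` option and its `downloadprogress` event come from the Prompt API explainer and may change, and the helper takes the languageModel namespace as a parameter (you'd pass window.ai.languageModel) purely so the logic is easy to test outside a browser.

```javascript
// Sketch: create a session whether the model is 'readily' available or
// still 'after-download', reporting download progress along the way.
// The monitor/downloadprogress shape is an assumption based on the
// Prompt API explainer — treat it as experimental, not a stable contract.
async function createSessionWhenReady(languageModel, onProgress) {
  const capabilities = await languageModel.capabilities();
  if (!capabilities || capabilities.available === 'no') return null;

  // For 'after-download', the browser fetches the model components in the
  // background on first create(); the monitor lets us show progress.
  return languageModel.create({
    monitor(m) {
      m.addEventListener('downloadprogress', (e) => {
        // e.loaded / e.total, per the explainer
        onProgress?.(e.loaded, e.total);
      });
    },
  });
}
```

Usage would look like `await createSessionWhenReady(window.ai.languageModel, (loaded, total) => updateProgressBar(loaded / total))`.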

Step 2: Creating a Session

Unlike a stateless REST API call, the Prompt API works with sessions. This is great because it maintains context. If you’re building a chat interface or a multi-step tool, you don't have to manually concatenate a giant string of previous messages and pass them back every time.

async function runSimplePrompt(userPrompt) {
  if (!(await checkAISupport())) return;

  // Create the session
  const session = await window.ai.languageModel.create({
    systemPrompt: "You are a helpful assistant that explains things like I'm five."
  });

  try {
    const result = await session.prompt(userPrompt);
    console.log("AI says:", result);
  } catch (err) {
    console.error("The AI had a brain freeze:", err);
  } finally {
    // Clean up to save memory
    session.destroy();
  }
}
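To see that context retention in action, here's a sketch of a two-turn exchange. The session is passed in as a parameter (you'd hand it the object from window.ai.languageModel.create()); note that the follow-up never restates the first question — the session carries the history for you.

```javascript
// Sketch: a multi-turn exchange on one session. Because the session is
// stateful, the follow-up prompt can refer back to the first answer
// without any manual history concatenation on our side.
async function askWithFollowUp(session, firstQuestion, followUp) {
  const answers = [];
  answers.push(await session.prompt(firstQuestion));
  // The session already "remembers" the first exchange at this point.
  answers.push(await session.prompt(followUp));
  return answers;
}
```

For example, `askWithFollowUp(session, "What is WebAssembly?", "Is it faster than JavaScript?")` works because the session resolves "it" from the earlier turn.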

Streaming: Because Nobody Likes Loading Spinners

Waiting for a full LLM response to generate is agonizing. Even on-device, it can take a few seconds. Luckily, the Prompt API supports streaming out of the box using promptStreaming.

This is where things get fun. You can update the UI character-by-character, making the app feel alive.

async function streamSummary(textToSummarize) {
  const session = await window.ai.languageModel.create();

  try {
    const stream = session.promptStreaming(
      `Summarize this text in three bullet points: ${textToSummarize}`
    );

    let fullResponse = "";
    for await (const chunk of stream) {
      // The chunk is the *entire* response so far, not just the new bits
      fullResponse = chunk;
      document.getElementById('summary-box').textContent = fullResponse;
    }
  } finally {
    // Release the session once streaming finishes (or fails)
    session.destroy();
  }
}

A quick gotcha: Unlike some streaming APIs that send you just the "delta" (the new words), the current iteration of the Prompt API often sends the full accumulated string in each chunk. Keep an eye on the docs as this evolves.
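If you want UI code that survives that behavior changing, a small defensive helper can normalize both shapes. This is my own illustrative function, not part of the API — it returns only the new characters to append, whether the stream yields cumulative text or plain deltas:

```javascript
// Defensive helper: works whether a streaming chunk is cumulative (the
// current Prompt API behavior described above) or already a delta.
function extractDelta(previousText, chunk) {
  // Cumulative chunk: it starts with everything we've already rendered,
  // so only the tail is new.
  if (chunk.startsWith(previousText)) {
    return chunk.slice(previousText.length);
  }
  // Otherwise assume the chunk is already just the new text.
  return chunk;
}
```

In the streaming loop you'd then do `fullResponse += extractDelta(fullResponse, chunk)` and append the delta to the DOM instead of rewriting the whole box.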

Why Should You Care? (The "Why")

I’ve built apps where the "AI features" were hidden behind a 500MB download. The analytics showed that 40% of users just closed the tab before the model finished loading.

By using the built-in model:
1. Privacy is Default: The data never leaves the machine. If you're building a medical app or a private journal, this is a massive selling point.
2. Zero Latency: No round-trip to a server in us-east-1.
3. Cost: It's free. Your AWS bill will thank you.

The Reality Check

Is it perfect? No.

It’s currently behind flags. To play with this today, you need to go to chrome://flags, enable "Prompt API for Gemini Nano," and set "Enables optimization guide on device" to "Enabled BypassPerfRequirement." It’s a developer preview, not something you ship to 10 million production users tomorrow.

Also, Gemini Nano is small. It’s not going to write a complex 5,000-word legal brief or solve advanced calculus. It’s best for:
* Summarization
* Grammar correction
* Sentiment analysis
* Classifying user input
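For small models, the trick with tasks like classification is constraining the output so it's trivially parseable. Here's an illustrative sketch (the label set, prompt wording, and 'other' fallback are all my own choices, not anything the API prescribes):

```javascript
// Sketch: constrain a small on-device model with a tightly-scoped prompt
// so its answer can be parsed reliably. Labels and fallback are illustrative.
const LABELS = ['bug', 'feature-request', 'question', 'other'];

function buildClassifyPrompt(text) {
  return `Classify the following feedback as exactly one of: ${LABELS.join(', ')}.\n` +
    `Reply with only the label.\n\nFeedback: ${text}`;
}

// Small models occasionally add whitespace or stray casing, so normalize
// and fall back to 'other' for anything off-list.
function parseLabel(response) {
  const cleaned = response.trim().toLowerCase();
  return LABELS.includes(cleaned) ? cleaned : 'other';
}
```

You'd wire it up as `parseLabel(await session.prompt(buildClassifyPrompt(userInput)))` — the narrow prompt plays to Gemini Nano's strengths instead of asking it for open-ended prose.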

How to Try It Now

If you have Chrome Canary, you can actually check your own console right now. Type await window.ai.languageModel.capabilities() and see what happens.

We are moving toward a world where the "operating system" or the "browser" provides intelligence as a utility, much like it provides a File API or a Geolocation API. Stop bundling the whole brain with your website and start using the one that’s already there.