How to Capture Client-Side Network Failures Without an External Monitoring SDK

Your server logs are lying to you. They tell a story of every request that reached your infrastructure, but they are blissfully unaware of the thousands of users who tried to connect to your site and failed before a single packet ever hit your load balancer.

When a user hits a DNS resolution error, a TLS handshake failure, or a "Connection Refused" error, your application-level monitoring is effectively blind. If the browser can't establish a connection, it can't download your expensive monitoring SDK, which means that SDK can't report the failure. You're left staring at a "clean" dashboard while a subset of your users sees a "Site can't be reached" page. This is the observability black hole of the "last mile."

To see into this void, we have to move away from JavaScript-based tracking and tap into the browser's native capabilities. By using the Reporting API and Network Error Logging (NEL), we can instruct the browser itself to become our monitoring agent, reporting failures directly to an endpoint of our choosing—even if our main site is completely down.

The Flaw in Traditional Client-Side Monitoring

Most of us rely on a script tag at the top of our HTML to catch errors. We’ve been told that if we put our monitoring SDK in the <head>, we’re safe. But think about the sequence of events required for that script to work:

1. DNS Lookup: The browser needs to resolve your domain to an IP.
2. TCP Connection: A three-way handshake must occur.
3. TLS Negotiation: Certificates must be verified and encrypted tunnels established.
4. The Request: The browser asks for index.html.
5. The Response: The server sends the HTML.
6. Parsing: The browser begins parsing the HTML and sees the <script> tag for your SDK.
7. SDK Fetch: The browser makes *another* network request to get the monitoring code.

If any of the first five steps fails, the HTML never arrives, so the browser never learns that your monitoring SDK exists, let alone fetches and runs it. You have zero visibility into ISP routing issues, expired TLS certificates on your CDN, or regional DNS outages.

Enter Network Error Logging (NEL)

Network Error Logging is a W3C specification that lets a site opt in to having the browser report on the network errors it hits while talking to that site. Unlike your JavaScript, NEL lives in the browser's networking stack.

Once a user visits your site successfully *once*, the browser receives a policy in the HTTP response headers. It caches this policy. From that point forward, if the browser fails to connect to your domain, it remembers that it has a job to do: "If I fail to connect to example.com, I should queue a report and POST it to analytics.example.com whenever I have a chance."

This is "out-of-band" reporting. The failure happens, the browser tucks the error away, and when the user eventually regains connectivity or visits a different site that *can* connect, the browser fires off a JSON payload to your reporting endpoint.
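To make the caching behavior concrete, here is a deliberately simplified model of what the browser does with a policy. It is a sketch, not the browser's actual implementation: the NelPolicyCache class and its method names are invented for illustration, and real browsers apply further rules (secure origins only, sampling, and restrictions on subdomain reports).

```javascript
// Simplified model of a browser-side NEL policy cache.
// NelPolicyCache, store() and lookup() are hypothetical names for illustration.
class NelPolicyCache {
  constructor() {
    // host -> { reportTo, includeSubdomains, expiresAt }
    this.policies = new Map();
  }

  // Called when a successful response from `host` carries a NEL header.
  store(host, policy, nowMs) {
    if (policy.max_age === 0) {
      this.policies.delete(host); // a max_age of 0 removes the policy
      return;
    }
    this.policies.set(host, {
      reportTo: policy.report_to,
      includeSubdomains: policy.include_subdomains === true,
      expiresAt: nowMs + policy.max_age * 1000,
    });
  }

  // Called when a request to `host` fails: is a live policy covering it?
  lookup(host, nowMs) {
    const labels = host.split('.');
    // Walk from the full host up through its parent domains.
    for (let i = 0; i < labels.length - 1; i++) {
      const candidate = labels.slice(i).join('.');
      const policy = this.policies.get(candidate);
      if (!policy || policy.expiresAt <= nowMs) continue;
      const exactMatch = i === 0;
      if (exactMatch || policy.includeSubdomains) return policy;
    }
    return null;
  }
}

const cache = new NelPolicyCache();
const day = 24 * 60 * 60 * 1000;
cache.store(
  'example.com',
  { report_to: 'main-endpoint', max_age: 2592000, include_subdomains: true },
  0
);

// One week after the last successful visit, the policy is still live.
console.log(cache.lookup('api.example.com', 7 * day));
// After 31 days the 30-day policy has expired, so nothing is reported.
console.log(cache.lookup('api.example.com', 31 * day));
```

The point of the model is the timeline: the policy outlives the visit that delivered it, which is what lets a later failure be reported at all.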

Step 1: Defining Your Endpoints

NEL is configured through the Reporting API, but with a catch: the newer Reporting-Endpoints header does not apply to it. At the time of writing, Chromium still resolves NEL's report_to against the older Report-To header, so that is the one to send. You need to tell the browser where to send the data. This should be a robust, lightweight endpoint, ideally a simple serverless function or a dedicated ingestion service that isn't dependent on your primary application stack.

Report-To: {"group": "main-endpoint", "max_age": 2592000, "endpoints": [{"url": "https://telemetry.yourdomain.com/v1/reports"}]},
           {"group": "critical-errors", "max_age": 2592000, "endpoints": [{"url": "https://telemetry.yourdomain.com/v1/critical"}]}

In this example, we define two named groups. You can route different types of browser reports (like Content Security Policy violations or deprecation warnings) to different URLs, but for networking, one solid endpoint is usually enough.

Step 2: Activating the NEL Policy

Now we tell the browser specifically to track network failures using the NEL header. This header contains a JSON object that defines the behavior of the monitoring.

NEL: {"report_to": "main-endpoint", "max_age": 2592000, "include_subdomains": true, "failure_fraction": 1.0}

Let's break down these fields:
* `report_to`: This matches the group name we defined in the Report-To header.
* `max_age`: How long (in seconds) the browser should remember this policy. 2592000 is 30 days. This is crucial: it's what allows the browser to report failures even if the user doesn't come back to your site for a week.
* `include_subdomains`: If your API lives on api.example.com and your main site is example.com, you want to know if subdomains are failing too. Be aware that the spec limits a parent-domain policy to DNS-phase failures on subdomains; to capture TCP, TLS, and HTTP failures for api.example.com, serve the NEL header from that host as well.
* `failure_fraction`: You might not want a report for every single 404. Setting this to 1.0 means "report 100% of failures." On high-traffic sites, you might drop this to 0.01 (1%) to save on ingestion costs.
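If you generate these headers from configuration, it helps to validate the fields described above before they ship. The following is a minimal sketch: buildNelHeaders and its option names are hypothetical, and it assumes the Report-To header format that Chromium uses for NEL at the time of writing.

```javascript
// Hypothetical helper: builds matching Report-To and NEL header values
// from one config object, so the group name cannot drift between the two.
function buildNelHeaders({
  group,
  url,
  maxAge,
  includeSubdomains = false,
  failureFraction = 1.0,
}) {
  if (!Number.isInteger(maxAge) || maxAge < 0) {
    throw new RangeError('maxAge must be a non-negative integer (seconds)');
  }
  if (typeof failureFraction !== 'number' || failureFraction < 0 || failureFraction > 1) {
    throw new RangeError('failureFraction must be between 0.0 and 1.0');
  }
  return {
    'Report-To': JSON.stringify({
      group,
      max_age: maxAge,
      endpoints: [{ url }],
    }),
    NEL: JSON.stringify({
      report_to: group,
      max_age: maxAge,
      include_subdomains: includeSubdomains,
      failure_fraction: failureFraction,
    }),
  };
}

const headers = buildNelHeaders({
  group: 'main-endpoint',
  url: 'https://telemetry.yourdomain.com/v1/reports',
  maxAge: 2592000,
  includeSubdomains: true,
});
console.log(headers);
```

Deriving both values from a single group name avoids the most common misconfiguration: a NEL policy whose report_to points at a group that was never defined.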

Putting it Together: The Server Config

If you're using Nginx, you would add these headers to your server block. I prefer doing this at the edge (CDN or Load Balancer) because that’s the first point of contact.

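# Nginx Configuration
# NEL resolves "report_to" against the group named in the Report-To header.
# "always" attaches the headers to error responses (4xx/5xx) as well.
add_header Report-To '{"group":"default","max_age":2592000,"endpoints":[{"url":"https://collector.example.com/reports"}]}' always;
add_header NEL '{"report_to":"default","max_age":2592000,"include_subdomains":true}' always;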

If you are running a Node.js/Express app, you can use a simple middleware:

app.use((req, res, next) => {
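  // NEL resolves report_to against the group named in the Report-To header
  res.setHeader(
    'Report-To',
    JSON.stringify({
      group: 'default',
      max_age: 2592000,
      endpoints: [{ url: 'https://collector.example.com/reports' }]
    })
  );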
  res.setHeader(
    'NEL',
    JSON.stringify({
      report_to: 'default',
      max_age: 2592000,
      include_subdomains: true,
      failure_fraction: 1.0
    })
  );
  next();
});

What Does a Failure Report Look Like?

When a failure occurs, the browser waits for a period of idle time and then sends a POST request to your collector. The Content-Type will be application/reports+json.

The payload is an array of objects. Here is an illustrative report for a DNS failure (values are examples; server_ip is empty because the name never resolved):

[
  {
    "age": 42,
    "type": "network-error",
    "url": "https://example.com/api/data",
    "user_agent": "Mozilla/5.0...",
    "body": {
      "sampling_fraction": 1.0,
      "type": "dns.name_not_resolved",
      "server_ip": "",
      "status_code": 0,
      "protocol": "http/1.1",
      "method": "GET",
      "elapsed_time": 143,
      "phase": "dns"
    }
  }
]

Notice the body.type. Instead of a generic "Network Error," the browser gives us specific, actionable strings like:
* tcp.refused
* tls.cert.invalid
* tls.version_or_cipher_mismatch
* http.response.invalid
* abandoned (User closed the tab before the request finished)

This is data you simply cannot get from inside a standard JavaScript try/catch block.
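Before storing reports, it is usually worth flattening them into one record per failure. The sketch below is one way to do that; normalizeReports and the output field names are my own choices, not part of any spec. It relies on the Reporting API's definition of age as the number of milliseconds between the report being generated and being sent, which lets you estimate when the failure actually happened rather than when the report arrived.

```javascript
// Hypothetical helper: turns a raw report batch into flat records.
// receivedAtMs is the collector's clock at the moment the POST arrived.
function normalizeReports(payload, receivedAtMs) {
  if (!Array.isArray(payload)) return [];
  return payload
    .filter((r) => r && r.type === 'network-error' && r.body)
    .map((r) => ({
      // Subtract the queueing delay to approximate the failure time.
      occurredAt: new Date(receivedAtMs - (r.age || 0)).toISOString(),
      url: r.url,
      errorType: r.body.type,
      phase: r.body.phase,
      statusCode: r.body.status_code,
      elapsedMs: r.body.elapsed_time,
    }));
}

const samplePayload = [
  {
    age: 42,
    type: 'network-error',
    url: 'https://example.com/api/data',
    user_agent: 'Mozilla/5.0...',
    body: {
      sampling_fraction: 1.0,
      type: 'dns.name_not_resolved',
      status_code: 0,
      method: 'GET',
      elapsed_time: 143,
      phase: 'dns',
    },
  },
  // Other report types can share the endpoint; they are skipped here.
  { age: 5, type: 'deprecation', url: 'https://example.com/', body: {} },
];

const records = normalizeReports(samplePayload, Date.UTC(2024, 0, 1));
console.log(records);
```

Because reports can sit in the browser's queue for minutes or longer, dashboards built on arrival time alone will smear an outage across a wider window than it really occupied.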

Building a Simple Ingestion Worker

You don't need a massive infrastructure to start collecting these. A simple Cloudflare Worker or an AWS Lambda function can pipe these into a database or even a Slack channel (though I wouldn't recommend Slack for high-volume failures).

Here’s a basic example of a collector designed to receive these reports and log them.

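// A simple collector script (Node.js / Express example)
const express = require('express');
const app = express();

// The browser sends application/reports+json, which is just JSON
app.use(express.json({ type: 'application/reports+json' }));

// Placeholder: replace with a write to your time-series database
// e.g., InfluxDB, Prometheus, or BigQuery
function saveToDatabase(record) {
  console.log(JSON.stringify(record));
}

// Cross-origin report uploads are preceded by a CORS preflight.
// Assumption: a wildcard origin is acceptable; restrict it to your site if not.
app.options('/reports', (req, res) => {
  res.set({
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'POST',
    'Access-Control-Allow-Headers': 'Content-Type'
  });
  res.status(204).end();
});

app.post('/reports', (req, res) => {
  // Ignore anything that isn't the expected array of reports
  const reports = Array.isArray(req.body) ? req.body : [];
  const receivedAt = Date.now();

  reports.forEach(report => {
    if (report.type === 'network-error' && report.body) {
      const { type, phase } = report.body;
      const url = report.url;

      console.log(`[NEL ERROR] Type: ${type}, Phase: ${phase}, URL: ${url}`);

      saveToDatabase({
        // "age" is how many milliseconds the report waited before being sent,
        // so subtract it to approximate when the failure happened
        timestamp: new Date(receivedAt - (report.age || 0)),
        errorType: type,
        phase: phase,
        url: url,
        userAgent: report.user_agent
      });
    }
  });

  // Always respond with a 204 No Content to the browser
  res.status(204).end();
});

app.listen(3000, () => console.log('Telemetry collector running on port 3000'));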

Why Not Just Use a 3rd Party SDK?

I am not saying you should delete Sentry or LogRocket. Those tools are fantastic for application-level debugging (e.g., "Why did this React component crash?"). But for *networking*, native browser reporting has three massive advantages:

1. Zero Bundle Size: You aren't forcing your users to download 40KB of JavaScript to tell you that your site is unreachable.
2. Privacy by Design: The browser controls what is sent. You aren't injecting a third-party script that might be tracking user behavior or scraping PII. You're getting pure, technical telemetry.
3. Unbeatable Reliability: If your CDN goes down, your SDK goes down with it. NEL persists in the browser cache and waits for the lights to come back on before reporting.

The "Success" Reporting Trap

NEL also allows you to track successful requests by setting a success_fraction.

{"report_to": "default", "max_age": 2592000, "success_fraction": 0.01}

Be careful with this. Even a 1% success reporting rate on a high-traffic site can result in millions of POST requests to your collector. I generally suggest keeping success_fraction at 0.0 unless you are actively debugging a latency issue and need a baseline of "elapsed_time" for successful connections.
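A quick back-of-the-envelope calculation shows why. The sketch below uses made-up traffic figures (100 million requests per day and a 0.1% failure rate are assumptions chosen for illustration), and estimateDailyReports is a hypothetical helper. It counts individual reports; browsers may batch several reports into one POST, so the request count can be lower.

```javascript
// Hypothetical estimator: how many reports do the sampling fractions produce?
function estimateDailyReports({
  requestsPerDay,
  failureRate,
  failureFraction,
  successFraction,
}) {
  const failures = requestsPerDay * failureRate;
  const successes = requestsPerDay - failures;
  return {
    failureReports: Math.round(failures * failureFraction),
    successReports: Math.round(successes * successFraction),
  };
}

const estimate = estimateDailyReports({
  requestsPerDay: 100_000_000, // assumed traffic, for illustration only
  failureRate: 0.001,          // assumed: 0.1% of requests fail
  failureFraction: 1.0,
  successFraction: 0.01,
});
console.log(estimate);
```

Under these assumptions, sampling just 1% of successes produces roughly ten times as many reports as sampling 100% of failures, which is the trap in a nutshell.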

Real-World Gotchas

While this API is powerful, it isn't magic. There are a few things that tripped me up when I first deployed this across a fleet of microservices.

1. Browser Support

Safari and Firefox have historically been slow to adopt the full NEL spec. As of now, it's primarily a Chromium-based feature (Chrome, Edge, Brave, Opera). However, because it's implemented via headers, it’s a "progressive enhancement." It doesn't break anything for Safari users; you just don't get data from them. Given Chromium's market share, you're still covering a large portion of your users, but expect iOS-heavy audiences to be underrepresented in the data.

2. The Loop Problem

Don't host your reporting endpoint on the same infrastructure you're monitoring. If your entire domain example.com goes down due to a DNS issue, and your collector is at collector.example.com, the browser will try to report the DNS failure to a domain it can't resolve. It will retry for a while and then discard the report.
Pro-tip: Use a completely separate registrable domain (ideally with a different DNS provider and CDN) or a specialized third-party ingestion service for your telemetry to ensure the paths don't cross.

3. The Caching Headache

Because the NEL policy is cached by the browser for the duration of max_age, you can't "turn it off" instantly. If you accidentally set a policy that points to a broken endpoint, the browser will keep trying to send reports there until the policy expires. Always start with a short max_age (like 3600 seconds) when testing.

Implementing a "Kill Switch"

If you need to clear a NEL policy from your users' browsers, you send the header again with a max_age of 0:

NEL: {"max_age": 0}

This tells the browser to immediately purge the policy for that domain.
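One way to wire this up is a configuration toggle that swaps the normal policy for the purge policy. The sketch below is an assumption-laden example: NEL_KILL_SWITCH is a made-up environment variable name, nelHeaderValue is a hypothetical helper, and the normal policy reuses the "default" group and the short testing max_age mentioned earlier.

```javascript
// Hypothetical toggle: returns the NEL header value to send.
// NEL_KILL_SWITCH is an assumed environment variable, not a standard one.
function nelHeaderValue(env = process.env) {
  if (env.NEL_KILL_SWITCH === '1') {
    // A max_age of 0 asks the browser to drop its cached policy.
    return JSON.stringify({ max_age: 0 });
  }
  return JSON.stringify({
    report_to: 'default',
    max_age: 3600,
    include_subdomains: true,
  });
}

console.log(nelHeaderValue({ NEL_KILL_SWITCH: '1' }));
console.log(nelHeaderValue({}));
```

Keep in mind that the purge only reaches browsers that get a response from your origin. A user who never returns keeps the old policy until its max_age runs out, which is another reason to start with short lifetimes.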

Correlating Data

The real magic happens when you combine NEL data with your server-side logs.

Imagine your server logs show a 20% drop in traffic from Germany. Usually, you'd be scrambling. Is it a bug? A bad deployment? But then you check your NEL dashboard and see a massive spike in dns.name_not_resolved reports arriving from German IP ranges (the report body carries no location, so this comes from the client address your collector sees on the report upload).

Suddenly, you aren't debugging your code; you're calling your DNS provider or checking if a specific ISP is having a bad day. You've moved from "Something is wrong" to "I know exactly where the bottleneck is" without your users saying a word.
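A minimal version of that dashboard check might look like the following. It is a sketch under several assumptions: each stored record already carries a country field (derived however you like from the client IP of the report upload), countBy and findSpikes are hypothetical helpers, and the thresholds are arbitrary starting points rather than recommendations.

```javascript
// Count records by an arbitrary key, e.g. "country|errorType".
function countBy(records, keyFn) {
  const counts = new Map();
  for (const record of records) {
    const key = keyFn(record);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Flag keys whose current count is well above the baseline window.
// factor and minCount are arbitrary illustrative thresholds.
function findSpikes(currentCounts, baselineCounts, { factor = 5, minCount = 3 } = {}) {
  const spikes = [];
  for (const [key, count] of currentCounts) {
    const baseline = baselineCounts.get(key) || 0;
    if (count >= minCount && count > baseline * factor) {
      spikes.push({ key, count, baseline });
    }
  }
  return spikes;
}

const byCountryAndType = (r) => `${r.country}|${r.errorType}`;

// Illustrative data: a quiet baseline window and a noisy current window.
const baselineWindow = [{ country: 'DE', errorType: 'dns.name_not_resolved' }];
const currentWindow = [
  ...Array.from({ length: 6 }, () => ({
    country: 'DE',
    errorType: 'dns.name_not_resolved',
  })),
  { country: 'US', errorType: 'abandoned' },
];

const spikes = findSpikes(
  countBy(currentWindow, byCountryAndType),
  countBy(baselineWindow, byCountryAndType)
);
console.log(spikes);
```

In production you would compare equal-length time windows and account for sampling_fraction, but even this crude comparison separates a regional DNS problem from a bad deployment.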

Making the Data Actionable

Collecting the JSON is only the first half. To make this valuable, you should categorize the type field into three main buckets:

1. Infrastructure Issues: dns.name_not_resolved, tcp.timed_out, tls.cert.invalid. These require immediate attention from your DevOps/SRE team.
2. App/Configuration Issues: http.error (the report's status_code field tells you whether it was a 4xx or a 5xx). These might indicate a bad deployment or a broken asset link.
3. User Environment Issues: abandoned, tcp.address_unreachable. These often happen when a user enters an elevator or a tunnel. These are "noise" but useful for calculating a baseline of expected failure.
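The three buckets above can be expressed as a small lookup. This is a sketch: classify and the bucket labels are hypothetical, and treating any other dns., tcp., or tls. type as an infrastructure issue is my own fallback assumption rather than something the categories above dictate.

```javascript
// Error types named explicitly in the three buckets.
const INFRASTRUCTURE = new Set([
  'dns.name_not_resolved',
  'tcp.timed_out',
  'tls.cert.invalid',
]);
const USER_ENVIRONMENT = new Set(['abandoned', 'tcp.address_unreachable']);

// Hypothetical classifier: maps a NEL error type to a triage bucket.
function classify(errorType) {
  if (typeof errorType !== 'string') return 'unclassified';
  if (errorType.startsWith('http.error')) return 'app-config';
  if (USER_ENVIRONMENT.has(errorType)) return 'user-environment';
  if (INFRASTRUCTURE.has(errorType)) return 'infrastructure';
  // Assumption: other connection-level failures default to infrastructure.
  if (/^(dns|tcp|tls)\./.test(errorType)) return 'infrastructure';
  return 'unclassified';
}

for (const type of ['dns.name_not_resolved', 'http.error', 'abandoned', 'unknown']) {
  console.log(type, '->', classify(type));
}
```

Routing on the bucket rather than the raw type keeps alert rules short: page on infrastructure, open a ticket for app-config, and only chart user-environment.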

Final Thoughts

We’ve spent years perfecting our ability to monitor what happens *inside* our applications, yet we’ve largely ignored the plumbing that gets users there. The Reporting API and NEL fill that gap.

It takes about ten minutes to configure these headers and set up a basic ingestion point. For that minimal investment, you gain an "eye in the sky" that monitors your DNS, your SSL certificates, and your global routing—all without adding a single byte to your JavaScript bundle.

Stop guessing why users are bouncing and let their browsers tell you the truth. It's time to shine a light into the last mile.