
My Tests are Gaslighting Me: How I Finally Tamed the Flakiness in Our CI Pipeline
Nothing kills a developer's productivity faster than a 'random' CI failure. I spent a week hunting down the ghosts in our test suite so you don't have to.
I clicked it. The button. The "Re-run failed jobs" button in GitHub Actions.
And it passed.
No code changes. No configuration tweaks. I just asked the CI runner to try again, and this time, it decided that my code was perfectly fine. Instead of feeling relieved, I felt a cold shiver of dread. Because deep down, I knew what this meant: The flakiness had arrived.
Flaky tests are the silent killers of developer velocity. They start as a minor annoyance—an occasional "red" build that turns "green" on a retry. But left unchecked, they become a cultural rot. Developers stop trusting the CI. They start ignoring failures. They begin uttering the most dangerous phrase in software engineering: _"Oh, that test just fails sometimes, just ignore it."_
Last week, I decided I’d had enough. I went on a scorched-earth mission to hunt down every non-deterministic "ghost" in our suite. Here is the post-mortem of what I found and how I fixed it.
The Psychological Toll of the "Retry" Button
Before we get into the code, let’s talk about the vibe. When a test fails randomly, it's basically gaslighting you. You look at the logic, it’s sound. You run it locally, it’s 100% green. But in the cold, sterile environment of a Linux runner in US-East-1, it decides to explode.
We had reached a point where 30% of our PRs required at least one retry. That’s not a testing suite; that’s a slot machine.
The First Culprit: The "Wait-and-Hope" Pattern
The most common source of flakiness I found was what I call "Wait-and-Hope." This usually happens in integration or E2E tests (we use Playwright, but this applies to Cypress and Selenium too).
You’re waiting for an API call to finish or a transition to complete, so you write something like this:
```js
// DON'T DO THIS. SERIOUSLY.
await page.click('#submit-button')
await page.waitForTimeout(2000) // The "Hope" part
expect(await page.textContent('.success-msg')).toBe('Done!')
```
This is a crime. On your MacBook Pro, 2000ms is plenty. On a heavily loaded CI runner sharing CPU cycles with ten other containers? 2000ms might be just a fraction too short.
The Fix: Never wait for time. Wait for state.
```js
// DO THIS INSTEAD
await page.click('#submit-button')
const successMsg = page.locator('.success-msg')
await expect(successMsg).toBeVisible()
await expect(successMsg).toHaveText('Done!')
```
By using auto-retrying assertions, the test becomes resilient to the environment's speed. It waits exactly as long as it needs to and no longer.
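These web-first assertions keep retrying until they pass or a timeout expires (5 seconds by default). If your CI runners are genuinely slow, give the assertions more headroom once, in the config, instead of padding individual tests. A minimal sketch, assuming a standard Playwright setup:
```ts
// playwright.config.ts — a sketch: give auto-retrying expect() assertions
// more time on CI runners instead of hard-coding waits in each test.
import { defineConfig } from '@playwright/test'

export default defineConfig({
  expect: {
    // How long expect(locator).toBeVisible() etc. keep retrying, in ms.
    timeout: process.env.CI ? 15_000 : 5_000,
  },
})
```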
The Second Culprit: Shared State Leaks
This one was trickier. I noticed that test_user_profile_update would only fail if test_user_logout ran immediately before it on the same worker.
Classic side effects.
We were using a global mock for our authentication service, and one test was changing a property of that mock without resetting it. In Vitest/Jest, if you don’t clean up after yourself, you’re essentially leaving landmines for the next test.
```js
// The offender
import { authService } from './auth'

test('failed login', async () => {
  authService.currentUser = null // We changed a global/shared object!
  // ... assertion
})
```
```js
// The fix
afterEach(() => {
  vi.restoreAllMocks()
  // Or explicitly reset your singleton states
  authService.reset()
})
```
I ended up enforcing a rule: No shared singletons in tests. If a service needs to be mocked, we inject it or use a fresh instance per test. It's more boilerplate, but I sleep better now.
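For the "fresh instance per test" version, here's a rough sketch of the shape, not our actual code — createAuthService and its API are hypothetical stand-ins for however you construct the service:
```ts
// A sketch of the "fresh instance per test" rule with Vitest: every test
// builds its own service object, so nothing one test does can leak into the next.
// `createAuthService`, `AuthService`, and `login` are hypothetical stand-ins.
import { beforeEach, expect, test } from 'vitest'
import { createAuthService, type AuthService } from './auth'

let authService: AuthService

beforeEach(() => {
  // Brand-new instance per test; it dies with the test that created it.
  authService = createAuthService()
})

test('failed login leaves currentUser empty', async () => {
  await expect(authService.login('ghost@example.com', 'wrong-password')).rejects.toThrow()
  expect(authService.currentUser).toBeNull()
})
```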
The Third Culprit: The "Timezone Trap"
One specific test failed only when the CI ran after 7:00 PM EST. Why? Because 7:00 PM EST is 12:00 AM UTC.
The test was checking a date formatting utility. Locally, the dev’s machine was in America/New_York. The CI server was in UTC. When the day flipped in UTC but not in EST, the "Today" label logic would diverge.
<Callout> Always force your test environment to a specific timezone. Don't let the server's location dictate your business logic's correctness. </Callout>
In our package.json, I updated the test script:
"test": "TZ=UTC vitest"Consistency is the enemy of flakiness.
The "Chaos Monkey" of Database Seeding
We use a real Postgres database for our integration tests (via Testcontainers). I found that we were using Math.random() to generate IDs for some of our seeded data.
_Sometimes_, two tests running in parallel would generate the same "random" ID, causing a unique constraint violation. It happened once every 50 runs.
The Solution: Deterministic seeding. Use a counter or a seeded faker.
```js
// Instead of this:
const userId = Math.floor(Math.random() * 10000)

// Use this:
let userIdCounter = 0
const getUserId = () => ++userIdCounter
```
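If you want realistic-looking data rather than a bare counter, a seeded faker gives you the same determinism. One caveat: with parallel workers, identical seeds mean identical "random" values, so mix the worker ID into the seed and the IDs themselves. A sketch assuming @faker-js/faker and Vitest's VITEST_POOL_ID:
```ts
// A sketch: deterministic, realistic seed data. Seeding the faker makes runs
// reproducible; namespacing by worker keeps parallel workers from generating
// identical values and tripping unique constraints.
import { faker } from '@faker-js/faker'

const workerId = Number(process.env.VITEST_POOL_ID ?? 1)
faker.seed(1000 + workerId) // same data every run, different data per worker

export const seedUser = () => ({
  id: `${workerId}-${faker.string.uuid()}`, // worker prefix: no cross-worker clashes
  name: faker.person.fullName(),
  email: faker.internet.email(),
})
```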
How We Monitor It Now
You can't fix what you can't measure. We started using a "Flaky Test Tracer." Most modern CI tools (like Buildkite or Datadog CI) have this built-in, but you can also DIY it by logging retries.
If a test fails and then passes on the same commit, it’s automatically flagged. If its "Flake Rate" exceeds 2%, the build fails even if the test eventually passes. This sounds harsh, but it forces us to deal with the debt immediately rather than letting it accumulate.
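If your CI provider doesn't track this for you, a custom reporter is enough for a rough version. Here's a sketch of a Playwright reporter — the file name and the 2% budget are illustrative; a test whose outcome is 'flaky' failed at least once and then passed on retry:
```ts
// flaky-reporter.ts — a DIY "flaky test tracer" sketch for Playwright.
// Requires retries to be enabled; 'flaky' means failed, then passed on retry.
import type { FullResult, TestCase } from '@playwright/test/reporter'

class FlakyReporter {
  private flaky: string[] = []
  private total = 0

  onTestEnd(test: TestCase) {
    this.total += 1
    if (test.outcome() === 'flaky') this.flaky.push(test.titlePath().join(' > '))
  }

  onEnd(_result: FullResult) {
    const rate = this.total ? this.flaky.length / this.total : 0
    this.flaky.forEach((t) => console.log(`FLAKY: ${t}`))
    console.log(`Flake rate: ${(rate * 100).toFixed(1)}% of ${this.total} tests`)
    // Fail the run if the flake budget (2% here) is blown, even though every
    // test eventually passed. Recent Playwright versions let onEnd override the
    // run status; on older ones, write the list to a file and fail in a later CI step.
    if (rate > 0.02) return { status: 'failed' as const }
  }
}

export default FlakyReporter
```
You wire it in via the reporter option in your Playwright config, alongside whatever reporter you already use.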
The Result
After a week of hunting, our CI success rate (on the first attempt) went from 68% to 96%.
The vibe in the Slack channel changed instantly. People stopped saying "just retry it" and started saying "hey, why did this fail?" Trust was restored.
Look, writing tests is easy. Writing _good_ tests is hard. But writing _reliable_ tests? That’s the real engineering work. Don't let your tests gaslight you. Hunt those ghosts down and delete them.
Anyway, I'm off to delete some more waitForTimeout calls. Wish me luck.

