How to review AI-generated code without losing your sanity
How to review AI-generated code effectively: learn to detect hallucinations, prevent security regressions, and build a sustainable human-in-the-loop workflow.
I wasted four hours last Tuesday tracking down a production regression that should not have existed. The feature, a data export utility, was generated by an AI. It passed unit tests. It compiled without warnings. It looked perfect. But a boundary condition hid an off-by-one error the model had hallucinated, completely ignoring the business logic. The code was syntactically clean but semantically toxic.
This is the hidden bug syndrome. It is the tax we pay for AI speed. Data shows that AI-assisted pull requests carry roughly 1.7x more issues than human ones, and your repository's cognitive complexity can inflate by nearly 40% if you let it. You are no longer reviewing code. You are auditing a junior developer who has read every manual but understands none of the requirements.
The silent threat of the hallucinated API
The most dangerous thing an LLM does is invent a library method that sounds correct. You see a call like user.getPermissions().filterByRole('admin'), and your brain fills in the gaps. You assume the API exists because the syntax is idiomatic. Tools like Cursor or Copilot encourage this "happy path" bias.
I have seen developers burn half a day debugging a library only to realize the method they were calling was never part of the SDK. To catch these before they break your CI, enforce a strict source-of-truth policy in your AI-assisted pull request audit.
If you are using TypeScript, ignore the IDE autocomplete during generation. It will happily import types that do not exist. My rule is simple. If the AI adds a new utility or library call, verify it against the node_modules or the local definition file immediately. If you cannot jump to the definition using F12, mark it as a high-priority failure.
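The jump-to-definition rule can also be automated. Here is a minimal sketch of a pre-merge hallucination check: it resolves a module from node_modules and confirms the named export actually exists before anyone trusts an AI-suggested call. The helper name is our own invention, not a real tool.

```typescript
import { createRequire } from "node:module";

// Resolve relative to the current working directory so the check runs
// against the repo's own node_modules
const localRequire = createRequire(process.cwd() + "/");

function exportExists(moduleName: string, exportName: string): boolean {
  try {
    // Throws if the package is not installed at all
    const mod = localRequire(moduleName);
    // A hallucinated method shows up here as `undefined`
    return typeof mod[exportName] === "function";
  } catch {
    return false;
  }
}
```

Wired into a pre-commit hook over the diff's new imports, this catches the user.getPermissions().filterByRole('admin') class of failure before CI ever runs.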
Scaling secure AI code
Security is where the confidence trap turns fatal. LLMs are trained on massive swathes of public code, which includes plenty of insecure patterns. They do not just generate logic. They replicate vulnerabilities.
Treat every AI-generated block as if it came from an untrusted source. Your guardrails must be automated. We use a combination of strict SAST tools and custom linting rules that flag any AI-generated function longer than 30 lines.
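A length cap like the one above can be enforced with ESLint's built-in rules. This is a sketch of a flat config; the 30-line ceiling and the complexity threshold are our own policy numbers, not ESLint defaults, so tune them to your codebase.

```typescript
// eslint.config.ts -- sketch of the complexity guardrail, using two
// rules that ship with core ESLint
export default [
  {
    files: ["src/**/*.ts"],
    rules: {
      // Hard cap: any function longer than 30 lines fails CI
      "max-lines-per-function": ["error", { max: 30, skipBlankLines: true, skipComments: true }],
      // Cyclomatic-complexity ceiling as a proxy for "too much to verify"
      complexity: ["error", { max: 8 }],
    },
  },
];
```

Scoping the rule to AI-heavy directories, or relaxing it for legacy paths, is a per-team decision; the point is that the cap is mechanical, not a reviewer's judgment call.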
If the code is too complex to verify in two minutes, it is too complex to merge. Large AI blocks suffer from context collapse. The model optimizes for local scope while ignoring the wider architecture of your application.
The review workflow for the AI era
Standard pull request thresholds fail here. The traditional async review model breaks when the volume of code increases, as noted by Bryan Finster. You need a dedicated checklist for how to review AI-generated code.
Here is the rubric I force my team to follow:
1. The Intent Audit: Does this code solve the ticket, or does it solve the problem the AI thought the ticket was asking for?
2. The API Sanity Check: Did I trace the definition of every external method call?
3. The Complexity Cap: Is this function doing more than two things? If yes, it gets a "Refactor" tag.
4. Security Scrutiny: Are we passing user-input strings directly to sinks? Are we using deprecated encryption standards?
5. The Dependency Check: Did the AI add a library to package.json that we already have an internal tool for?
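Item 5 of the rubric is easy to script. The sketch below flags any dependency that duplicates an internal tool; the package names and the internal equivalents are hypothetical, so swap in your own allowlist.

```typescript
// Hypothetical map of redundant packages to our preferred replacements
const internalEquivalents: Record<string, string> = {
  lodash: "covered by our internal @acme/utils package", // hypothetical internal package
  moment: "covered by the native Date and Intl APIs",
};

// Returns a human-readable warning for each redundant dependency in
// the parsed package.json
function auditDependencies(pkg: { dependencies?: Record<string, string> }): string[] {
  return Object.keys(pkg.dependencies ?? {})
    .filter((dep) => dep in internalEquivalents)
    .map((dep) => `${dep}: ${internalEquivalents[dep]}`);
}
```

Run it against the pull request's package.json in CI and post the warnings as a review comment; the AI will happily re-add lodash forever unless something pushes back.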
Concrete failure pattern: The "Polite" Injection
Look at this common pattern where an LLM attempts to sanitize an input:
```typescript
// AI-generated, "secure" helper
function sanitizeInput(input: string): string {
  return input.replace(/<script>/g, '').replace(/eval\(/g, '');
}
```

It looks reasonable. It follows naming conventions. It is clean. But this is security theater. It fails on obfuscated input, case variation, and nested tags. It is a hallucinated fix that provides a false sense of safety. A human might write this, but an AI will write it with such absolute confidence that you might merge it without checking the OWASP cheat sheet.
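For contrast, here is what the OWASP cheat sheet actually recommends for HTML contexts: output encoding rather than a blacklist. Escaping the five HTML-significant characters is immune to case tricks and nested tags because nothing is deleted, only neutralized. This is still a sketch; for rendering rich HTML, reach for a vetted sanitizer like DOMPurify instead of rolling your own.

```typescript
function escapeHtml(input: string): string {
  const map: Record<string, string> = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#x27;",
  };
  // Every dangerous character is encoded, not stripped, so the payload
  // survives only as inert text -- there is no filter to slip through
  return input.replace(/[&<>"']/g, (ch) => map[ch]);
}
```

Notice the difference in failure mode: the blacklist version fails open on anything it did not anticipate, while the encoder fails closed.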
Practical workflow templates
We moved away from "reviewing by reading" to "reviewing by testing." If an AI writes a feature, the pull request must include a test file that targets the edge cases the model glossed over.
1. Require "Proof of Life" tests: If the AI generates a core service, it must include a unit test that mocks the edge cases, such as nulls, empty strings, or unauthorized roles.
2. Constraint-based Linting: Use eslint to forbid non-standard or deprecated libraries. This prevents the model from hallucinating legacy patterns.
3. Synchronous Pairing: If a pull request involves more than 100 lines of AI-generated code, the review happens over a 15-minute call. You pair on the logic. Do not leave comments in GitHub.
Human-in-the-loop
Reviewing AI-generated code requires significantly more mental overhead than reading human work. You are not just checking for style. You are looking for the absence of context. The AI does not know about your legacy debt or why that one module in utils/ is deprecated.
Stop treating the AI as an oracle. Treat it as a slightly chaotic intern who is fast at typing but lacks any sense of project history. When you review, do not look for what is there. Look for what is missing. Look for the edge cases the model ignored because it was too busy being helpful.
Efficiency is not about how fast you merge. It is about how much technical debt you prevent from sneaking in under the guise of speed. Next time you see a 500-line pull request, do not feel guilty about asking for a rewrite or deleting the whole thing. If you cannot explain why the code is written the way it is, you should not be responsible for maintaining it. Trust your skepticism more than the green checkmark in your IDE.
Resources
- Code Review in the Age of AI (Bonfy Blog)
- Review AI-generated code (GitHub Docs)
- AI Broke Your Code Review. Here's How to Fix It (Substack)
- Secure AI-Assisted Development: 15 Guardrails (Bishop Fox)
- Reviewing AI-Generated Code: Different PRs Need Different Patterns (Tenki Blog)
- OWASP AI Security and Privacy Guide
- Cognitive Complexity in AI-Assisted Code (IEEE)
- The Impact of AI on Software Engineering Productivity
- Managing Technical Debt in AI-Generated Codebases