
How to Review AI-Generated Code Without Shipping Bugs

Learn how to review AI-generated code effectively. Stop silent logic failures and hallucinated APIs with this practical workflow for modern development teams.


Most engineering teams treat AI coding assistants like a junior developer who never sleeps but also never checks the documentation. The code gets shipped, it looks clean, and it passes the linter. Everyone feels like a winner. Then, three weeks later, you're back in the office at 9 PM tracing a production outage caused by a logic flaw that shouldn't have existed.

The industry narrative is dangerously optimistic. We talk about velocity while ignoring studies that find close to half of AI-generated code carries significant security flaws. The "context gap", where the AI writes code that is locally correct but globally catastrophic, is the silent killer of project health. Knowing how to review AI output is now a core responsibility of the senior engineer.

The Illusion of Correctness

The most insidious trap is the well-formatted lie. AI tools are optimized to produce syntactically perfect, idiomatic code. Because the output looks like the work of a seasoned engineer, our brains trigger a trust heuristic. We skim the PR, see standard library calls, and approve.

But the AI lacks intent. It doesn't know your AuthService expects a specific scope from an OAuth provider. It doesn't care if your database migration logic is idempotent. It suggests a solution based on probability rather than your business invariants. The code behaves exactly as the AI thinks it should, which is usually wrong.

Catching Hallucinated APIs

The hallucinated API is the classic it-works-on-my-machine nightmare. LLMs are trained on massive datasets that include deprecated docs and experimental junk, so they frequently synthesize method names that sound correct but simply don't exist.

I saw a PR recently where the AI suggested using client.query_stream_async() for a database driver. It looked natural, so the engineer didn't verify it. The module imported cleanly, but the code crashed at runtime with an AttributeError the moment execution hit a high-concurrency path.

The Detection Checklist:

*   The Library Boundary: If the code interacts with a third-party SDK or internal module, treat every call as suspicious.
*   The Truth Test: If you see an unfamiliar method, force the author to comment a link to the official documentation in the PR. If they can't find the source, the AI invented it.
*   Trace the Import: AI often makes up imports to match its fake methods. Check whether the module actually exports the function by searching the library source, not just the code provided, as in the sketch below.
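
You can automate the truth test in a few lines of Python. This is a minimal sketch: it resolves a dotted attribute path against the installed package, so a method the AI invented fails the check before it ever reaches review. The module and attribute names in the usage comment are placeholders for whatever dependency your PR touches.

import importlib

def attribute_exists(module_name: str, dotted_path: str) -> bool:
    # Walk the dotted path (e.g. 'Client.query_stream_async') starting
    # from the installed module; any missing hop means the API is fake.
    obj = importlib.import_module(module_name)
    for part in dotted_path.split('.'):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True

# Usage (placeholder names): if the driver were 'somedriver', this would
# return False for the hallucinated method from the PR above.
# attribute_exists('somedriver', 'Client.query_stream_async')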

Architectural Guardrails

Relying on developers to just be careful is a strategy that fails under pressure. You need a team-level workflow that treats AI output as hostile code.

In our workflow, we treat AI-generated blocks with the same suspicion as external user input.

*   Authorization Audit: AI tools frequently suggest public endpoints without middleware. If the AI adds an endpoint, the reviewer must verify it against the actual auth logic.
*   Least Privilege: If an AI generates a query like SELECT *, we reject it immediately. The developer must refactor for specific columns.
*   Hardcoded Secrets: AI loves to suggest dummy values like API_KEY_123. We use pre-commit hooks to reject any commit containing patterns that resemble secrets; a sketch of such a hook follows this list.
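
Here is a minimal sketch of that kind of hook in Python. The regex is deliberately narrow and purely illustrative; a dedicated scanner such as detect-secrets or gitleaks covers far more patterns in production. Installed as a pre-commit script, git refuses the commit whenever it exits non-zero.

#!/usr/bin/env python3
# Reject commits whose staged files contain secret-like patterns.
import re
import subprocess
import sys

# Illustrative pattern only: a quoted value assigned to a name
# containing key/secret/token/password.
SUSPECT = re.compile(
    r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
)

def staged_files() -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

def main() -> int:
    bad = False
    for path in staged_files():
        try:
            lines = open(path, encoding="utf-8", errors="ignore").read().splitlines()
        except OSError:
            continue
        for lineno, line in enumerate(lines, start=1):
            if SUSPECT.search(line):
                print(f"{path}:{lineno}: possible hardcoded secret")
                bad = True
    return 1 if bad else 0

if __name__ == "__main__":
    sys.exit(main())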

The Testing Trap

The "Same Source Testing Trap" occurs when you ask your AI to write the code and the tests simultaneously. The AI is essentially grading its own homework. It repeats the same faulty assumptions in both the implementation and the test suite. The test passes, but the code breaks as soon as a real user touches it.

# The AI generates this
def process_user_data(user_dict):
    # It assumes 'email' always exists, but doesn't check
    return user_dict['email'].lower()

# And the AI generates this test
def test_process_user_data():
    data = {'email': 'TEST@EXAMPLE.COM'}
    assert process_user_data(data) == 'test@example.com'

The Fix:
Force a separation of concerns. If the AI writes the implementation, the developer must manually write the tests. Or have a second, independently prompted model generate the tests from the original requirements, never from the implementation. A human-written test for the example above might look like the sketch below.
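
This is a minimal sketch, assuming pytest and the process_user_data function above. The expected behavior for a missing email, returning None here, is an assumption you would pull from the ticket, not from the AI's code; the point is that the test encodes a requirement the implementation never saw.

# Written by the human reviewer, from the ticket, not from the diff
def test_process_user_data_missing_email():
    # Real signups can arrive without an email at all. The AI's
    # implementation raises KeyError here, failing this test and
    # surfacing the hidden assumption before merge.
    assert process_user_data({'name': 'sam'}) is None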

Reviewing AI Output

Stop reading for style. Stop reading for syntax. Use this rubric instead.

1.  Requirement Check: Does this code solve the actual ticket, or just a generic version the AI hallucinated?
2.  Dependency Audit: Does this code use an API that isn't defined in your local dependencies?
3.  Boundary Analysis: What happens if the input is null or malicious? AI rarely accounts for these cases unless specifically pushed; a hardened version of the earlier example follows this list.
4.  Maintenance Debt: Is this code so dense that you would be terrified to touch it in six months?
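
To make point 3 concrete, here is the boundary check applied to the earlier example. This is a sketch, not a prescription: whether to return None, raise a domain error, or normalize differently is a product decision that belongs in the ticket.

from typing import Optional

def process_user_data(user_dict: dict) -> Optional[str]:
    # Tolerate a missing key, a None value, and a non-string payload
    # instead of assuming 'email' always exists.
    email = user_dict.get('email')
    if not isinstance(email, str) or '@' not in email:
        return None  # assumption: an absent/invalid email is non-fatal
    return email.strip().lower()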

Rewrite or Debug?

The honest answer: if the AI-generated code is more than about 30 lines and involves complex state, it's usually faster to delete it and write it yourself. Debugging AI logic is like untangling a net where you can't tell which knots are intentional and which are accidents of the training data.

Use your tools to scaffold boilerplate or write repetitive unit tests. Never use them to architect core business services. When you treat the AI as a generator of prototypes rather than production code, the quality of your work stays high and your incident rate stays low. Don't let the productivity metrics fool you. The time you save by blindly merging today is the time you'll spend on an incident response call tomorrow. Keep the human in the loop.

Resources

*   kluster.ai
*   codeant.ai
*   axify.io
