
Evals Are Pre-Commit Hooks
Stop pushing broken prompts to production and start treating your LLM evaluation suite as a mandatory gate in your local development loop.
You change a single word in your system prompt—swapping "concise" for "professional"—and suddenly your JSON output parser starts screaming because the model decided to wrap every response in a conversational "Certainly! Here is your data:" preamble. We've all been there. You spend forty minutes playing whack-a-mole with edge cases, only to realize that fixing one prompt broke three others you weren't even looking at.
Prompt engineering is still largely "vibes-based" development. We tweak, we refresh the UI, we say "looks good," and we ship. But if you wouldn't dream of pushing a Python update without running your test suite, why are you pushing LLM instructions that lack a safety net?
The "Vibes" Trap
In traditional software, we have pre-commit hooks. They catch your trailing whitespace, your missing semicolons, and your broken unit tests before the code ever touches a remote branch. In LLM development, our "code" is natural language instructions, and the model interpreting them is notoriously non-deterministic.
If you treat your evaluation suite as a "sometime" task—something you run manually once a week—you’ve already lost. Evals shouldn't be a post-mortem; they need to be the gatekeeper.
Making Evals Local and Fast
An eval doesn't have to be a massive, expensive 1,000-row benchmark. For a pre-commit hook, you want a "Smoke Test" suite—the 5 to 10 most brittle edge cases that usually break your app.
Here is a dead-simple way to structure a local eval using pytest and a few basic assertions. We aren't doing anything fancy here; we're just checking whether our "concise" prompt actually stays concise.
import json

import pytest

from my_app.llm_chain import generate_response

# A small set of inputs that have historically caused "hallucinations" or formatting issues
TEST_CASES = [
    ("What is the status of order #123?", "Order #123 is currently processing."),
    ("Tell me a joke about robots.", "Why did the robot go on vacation? To recharge its batteries."),
    ("Export this user data: {id: 1}", '{"id": 1}'),
]

@pytest.mark.parametrize("user_input, expected_subset", TEST_CASES)
def test_prompt_consistency(user_input, expected_subset):
    response = generate_response(user_input)

    # Check for the "Vibe": Is the response short?
    assert len(response.split()) < 50, "Response is way too wordy!"

    # Check for the "Content": Does it contain what we need?
    assert expected_subset.lower() in response.lower(), f"Response missed critical info: {expected_subset}"

    # Check for the "Format": Is it clean JSON when requested?
    if "{" in expected_subset:
        assert response.startswith("{") and response.endswith("}"), "Output is not bare JSON"
        json.loads(response)  # raises if the payload is not actually valid JSON
Wiring it into the Git Loop
Running pytest manually is better than nothing, but we’re humans; we’re lazy and we forget things. We need the machine to enforce the rules. If you’re using pre-commit, you can add a local hook to your .pre-commit-config.yaml.
The trick here is to only run these tests when your prompt files or LLM logic change. You don't want to spend $2.00 in API credits every time you fix a typo in your README.
- repo: local
  hooks:
    - id: llm-evals
      name: Run LLM Smoke Tests
      entry: pytest tests/llm_smoke_tests.py
      language: system  # run pytest from your project environment, not an isolated hook venv
      files: ^prompts/|^src/llm_logic/  # Only run if prompts or LLM logic change
      pass_filenames: false
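Run pre-commit install once to register the hook. From then on, any commit that touches prompts/ or src/llm_logic/ triggers the smoke tests automatically, and you can still invoke them on demand with pre-commit run llm-evals --all-files.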
The "LLM-as-a-Judge" Gotcha
Sometimes simple string matching isn't enough. You might need a "Judge" LLM to grade your "Worker" LLM. This is where things get meta—and a bit slower.
I usually recommend avoiding LLM-as-a-judge in a strictly *local* pre-commit hook because it adds 5–10 seconds of latency. However, if your prompt is highly subjective (e.g., "Make sure the tone is empathetic"), a quick call to gpt-4o-mini for a Pass/Fail verdict is worth the wait.
Pro-tip: Use a cheaper, faster model for your judge than you use for your production output. If mini thinks your GPT-4o output is garbage, it probably is.
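Here is a minimal sketch of what that judge step can look like in the same smoke-test file. It assumes the official openai Python client, reuses the generate_response() helper from the earlier example, and the judge prompt and test input are placeholders you'd replace with your own.

import os

import pytest
from openai import OpenAI

from my_app.llm_chain import generate_response

JUDGE_PROMPT = """You are grading a customer-support reply.
Answer with exactly one word: PASS if the tone is empathetic and professional, FAIL otherwise.

Reply to grade:
{reply}"""

@pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="no API key in this environment")
def test_tone_is_empathetic():
    reply = generate_response("My package arrived broken and I'm really upset.")

    # The cheap, fast judge grades the expensive worker's output.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reply=reply)}],
        temperature=0,
    )

    assert "PASS" in verdict.choices[0].message.content.upper(), f"Judge rejected the reply: {reply!r}"

Pinning temperature to 0 and forcing a one-word verdict keeps the judge as close to deterministic as an LLM check gets, and the skipif guard means teammates without an API key can still commit.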
Why This Matters
When you treat evals as pre-commit hooks, the psychological shift is massive. You stop being afraid of the "System Prompt" file. You can experiment with different phrasing or new model versions (looking at you, o1-preview) with the confidence that you aren't silently breaking the user experience.
If the hook fails, the commit fails. You fix the prompt, you verify the output, and *then* you push. No more "Fixing prompt again" commit messages. No more "Oops, the LLM started apologizing" bug reports.
Stop shipping vibes. Start shipping verified outputs.

