Back to Log
Best Practices
Shift Left with Evals: Replacing Unit Tests with Probabilistic Checks
Sep 28, 2025
10 min read
Sarah Chen, Head of AI Architecture
When you write code with Agents, traditional Unit Tests are insufficient. An agent might write code that passes `assert(result == 5)` but does so by hardcoding the return value, or by introducing a subtle security vulnerability.
The Three Levels of AI Evaluation
- L1: Deterministic Checks**
- L2: Functional Evals**
- L3: Model-Based Evals (LLM-as-a-Judge)**
Implementing LLM-as-a-Judge
We recommend building a small Python script that runs on every PR generated by an agent. It should check for qualities that are hard to assert programmatically.
python
PROMPT = """
Review the following code diff.
Rate it on a scale of 1-5 for:
1. Maintainability (Clean abstractions)
2. Security (Input sanitization)
3. Performance (No N+1 queries)
If any score is below 4, fail the build.
"""
def eval_pr(diff):
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": PROMPT + diff}]
)
return parse_score(response)By shifting these evals left—running them before a human ever reviews the PR—you reduce the noise ratio and ensure that humans only spend time reviewing high-quality, pre-vetted code.