Back to Log
Best Practices

Shift Left with Evals: Replacing Unit Tests with Probabilistic Checks

Sep 28, 2025
10 min read
Sarah Chen, Head of AI Architecture
Shift Left with Evals: Replacing Unit Tests with Probabilistic Checks

When you write code with Agents, traditional Unit Tests are insufficient. An agent might write code that passes `assert(result == 5)` but does so by hardcoding the return value, or by introducing a subtle security vulnerability.

The Three Levels of AI Evaluation

  • L1: Deterministic Checks**
  • L2: Functional Evals**
  • L3: Model-Based Evals (LLM-as-a-Judge)**

Implementing LLM-as-a-Judge

We recommend building a small Python script that runs on every PR generated by an agent. It should check for qualities that are hard to assert programmatically.

python
PROMPT = """
Review the following code diff.
Rate it on a scale of 1-5 for:
1. Maintainability (Clean abstractions)
2. Security (Input sanitization)
3. Performance (No N+1 queries)

If any score is below 4, fail the build.
"""

def eval_pr(diff):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT + diff}]
    )
    return parse_score(response)

By shifting these evals left—running them before a human ever reviews the PR—you reduce the noise ratio and ensure that humans only spend time reviewing high-quality, pre-vetted code.