Navtej Singh

The Problem With Testing AI

Traditional software testing is built on a simple premise: given input X, the output should be Y. You write an assertion, it passes or fails, and you move on. This premise breaks completely with LLM-powered features.

When the output is natural language, there is no single correct answer. A summarization feature might produce ten different valid summaries for the same input. A classification model might express the same label in different words. You cannot assertEquals on prose. And yet, shipping AI features without any quality measurement is how teams end up with products that work in demos and fail in production.

This is the problem that eval-driven development solves. When I was building Clover and iterating on Trovex's search quality, the shift from ad hoc testing to systematic evals was the single biggest improvement in our development velocity. Not because we wrote fewer bugs, but because we found them before users did.

Whiteboard with diagrams and notes for planning evaluation strategies — Planning an eval strategy before writing a single prompt — defining success criteria, golden datasets, and scoring rubrics upfront — Photo on Unsplash

What Eval-Driven Development Is

Eval-driven development (EDD) is a methodology where you write evaluations before you write prompts, in the same way test-driven development prescribes writing tests before code. The eval suite defines what success looks like. The prompt is the implementation that tries to satisfy it.

The workflow is concrete:

Define the task specification. What exactly should this AI feature do? What inputs will it receive? What does a good output look like? What does a bad output look like? Write this down before touching a prompt.

Build a golden dataset. Collect 50-200 representative input-output pairs that cover the expected distribution of real-world usage. Include edge cases, adversarial inputs, and ambiguous queries. This dataset is the ground truth your system is measured against.

Implement automated scoring. Define metrics and scoring functions that can evaluate model outputs programmatically. Some dimensions are easy to automate (format compliance, length constraints, JSON validity). Others require LLM-as-judge or human review.

Write the prompt. Now — and only now — write the prompt. Run it against the eval suite. Iterate until metrics meet your threshold. Every prompt change is validated against the full eval suite before it ships.

The key insight is that the eval suite is the specification, not the prompt. The prompt is an implementation detail that can change freely as long as the evals pass. This is the same principle that makes TDD effective for traditional software — the tests define the contract, the code is free to vary.

Designing Eval Suites

The quality of your eval suite determines the quality of your AI feature. A weak eval suite gives you false confidence. A strong one catches regressions before they reach production.

Golden Datasets

A golden dataset is a curated set of inputs paired with expected outputs or output characteristics. The hard part is not collecting the data — it is ensuring the dataset is representative of production traffic. Sample real user queries if you have them. If you are building a new feature, invest time in generating realistic test cases that cover the full distribution: common queries, rare queries, malformed queries, and adversarial inputs.

Automated Scoring with LLM-as-Judge

For dimensions that resist simple programmatic checks — factual accuracy, helpfulness, coherence — use a separate LLM as an evaluator. The judge model receives the input, the expected output (or criteria), and the actual output, then scores on a defined rubric.

python

def llm_judge(query, expected, actual, criteria):
    """Use an LLM to score output quality on specific criteria."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": (
                "You are an evaluation judge. Score the ACTUAL output "
                "against the EXPECTED output on the given criteria. "
                "Return a JSON object with 'score' (1-5) and 'reasoning'."
            )
        }, {
            "role": "user",
            "content": f"Query: {query}\n"
                       f"Expected: {expected}\n"
                       f"Actual: {actual}\n"
                       f"Criteria: {criteria}"
        }],
        response_format={"type": "json_object"},
        max_tokens=200
    )
    return json.loads(response.choices[0].message.content)

The reliability of LLM-as-judge scales with the specificity of the rubric. Vague criteria like is this response good? produce inconsistent scores. Specific criteria like does the response contain the three key facts from the reference answer? produce scores that correlate well with human judgment.

Human-in-the-Loop for Edge Cases

Automated scoring handles the bulk of evaluation, but certain failure modes — subtle hallucinations, tone mismatches, culturally inappropriate responses — still require human review. In practice, reserve human evaluation for the 10-15% of cases where automated metrics are ambiguous or where the stakes of a false positive are high.

Metrics That Matter

Not all metrics are equally useful. The metrics you track should map directly to user-facing quality and business viability.

Metric	What It Measures	How to Compute	Target Range
Task completion rate	Does the output actually solve the user's problem?	LLM-as-judge or human review on golden dataset	90%+ for production readiness
Factual accuracy	Are claims in the output verifiably true?	Cross-reference against source documents	95%+ for knowledge-grounded tasks
Format compliance	Does the output match the required schema?	Programmatic validation (JSON parse, regex)	99%+ (non-negotiable for structured outputs)
Latency (p95)	How long does the end-to-end request take?	Time from request to complete response	Depends on UX context, typically under 3s
Cost per request	How much does each inference cost?	Total tokens * price per token	Depends on business model and margins

Key takeaway

Track task completion rate as your primary metric. All other metrics are secondary — a fast, cheap response that does not solve the user's problem has negative value.

CI/CD Integration

Evals are only useful if they run automatically. The goal is to make it impossible to deploy a prompt regression.

The integration pattern mirrors traditional CI/CD. On every pull request that modifies a prompt, system message, or model configuration, the CI pipeline runs the full eval suite against the golden dataset. If any metric drops below the defined threshold, the build fails and the deploy is blocked.

python

# eval_runner.py — runs in CI on every prompt change

import json
import sys

def run_eval_suite(prompt_version, golden_dataset_path, thresholds):
    dataset = json.load(open(golden_dataset_path))
    results = {
        "task_completion": [],
        "format_compliance": [],
        "factual_accuracy": [],
        "latency_ms": [],
        "cost_usd": []
    }

    for case in dataset:
        output, latency, cost = run_inference(prompt_version, case["input"])
        results["task_completion"].append(
            llm_judge(case["input"], case["expected"], output, "task_completion")
        )
        results["format_compliance"].append(validate_format(output, case["schema"]))
        results["latency_ms"].append(latency)
        results["cost_usd"].append(cost)

    # Compute aggregates and check thresholds

    scores = {
        "task_completion": avg(results["task_completion"]),
        "format_compliance": avg(results["format_compliance"]),
        "latency_p95": percentile(results["latency_ms"], 95),
        "avg_cost": avg(results["cost_usd"])
    }

    passed = all(scores[k] >= thresholds[k] for k in thresholds)

    print(json.dumps(scores, indent=2))
    sys.exit(0 if passed else 1)

This gives you a clear signal on every change: either the new prompt is at least as good as the previous one, or it is not. No ambiguity. No guessing. No deploying and hoping.

Laptop showing code for an evaluation pipeline implementation — Implementing the eval pipeline — automated scoring, CI integration, and threshold checks that block regressions before they ship — Photo on Unsplash

The Eval Pyramid

Like the traditional testing pyramid, evals should be structured in layers with different coverage, cost, and speed characteristics.

Unit evals (base layer). Fast, cheap, and numerous. These test individual prompt behaviors in isolation: does the model follow the format instructions? Does it respect length constraints? Does it refuse out-of-scope queries? Run these on every commit. Hundreds of test cases, seconds to execute using cached or mocked responses where appropriate.

Integration evals (middle layer). Test the full pipeline end-to-end: retrieval, prompt assembly, model inference, and post-processing. These catch failures that only appear when components interact — like a retrieval step returning irrelevant context that causes the model to hallucinate. Run these on every PR. Dozens of test cases, minutes to execute against live models.

Human evals (top layer). Periodic, expensive, and essential. A human evaluator reviews a sample of production outputs against a rubric. These catch the failures that automated metrics miss: subtle tone issues, culturally inappropriate responses, technically correct but unhelpful answers. Run these weekly or on major prompt changes. Small sample size, hours to execute.

Why this matters

The pyramid structure ensures you catch most regressions cheaply and quickly at the base layer, while reserving expensive human evaluation for the nuanced failures that only humans can detect. Inverting this pyramid — relying primarily on human review — does not scale.

Tools and Frameworks

You do not need to build everything from scratch. Several frameworks have matured to the point where they are genuinely useful in production.

Promptfoo is an open-source eval framework that supports custom assertions, multiple model providers, and CI integration out of the box. It handles the mechanical parts — running prompts against test cases, computing metrics, generating reports — so you can focus on designing the eval criteria. For most teams, this is the right starting point.

Braintrust provides a managed platform for logging, evaluating, and comparing prompt variants. The key advantage is the tracking layer: you can see exactly how each prompt version performed over time and correlate metric changes with specific prompt edits. This is valuable once your eval suite is mature enough that you are optimizing rather than building.

For cases where existing tools do not fit — unusual scoring criteria, proprietary data constraints, complex multi-model pipelines — a custom eval harness of a few hundred lines of Python is often simpler than adapting a general-purpose framework. The eval runner script above is a minimal example. In practice, you add structured logging, parallel execution, and a comparison mode that shows diffs between prompt versions.

When EDD Is Overkill vs. Essential

EDD is not always the right approach. The cost of building and maintaining an eval suite is real, and some use cases do not justify it.

EDD is essential when: the AI feature faces real users, errors have consequences (financial, medical, legal), the prompt is iterated on frequently, or multiple team members edit prompts. In these cases, shipping without evals is shipping without seatbelts. You will eventually regret it.

EDD is overkill when: you are prototyping, the feature is internal-only with a forgiving audience, the prompt is simple and stable, or you are exploring whether the feature is even viable. In these cases, a few manual spot checks are sufficient. Do not build a testing infrastructure for a feature that might not survive the week.

The inflection point is usually the moment the feature moves from prototype to production. That is when you stop testing by feel and start testing by measurement.

Key takeaway

Eval-driven development is TDD for the AI era. Write evals first, treat the prompt as an implementation detail, automate everything, and never deploy a regression. The teams that adopt this methodology ship faster and break less — not because they write better prompts, but because they know exactly when a prompt is good enough.

Conclusion

The shift from ad hoc prompting to eval-driven development is the difference between building AI features that happen to work and building AI features that are engineered to work. The tooling exists. The methodology is proven. The only barrier is the discipline to write the eval before the prompt.

Start small. Pick one AI feature in your product. Build a golden dataset of 50 cases. Implement three metrics: task completion, format compliance, and latency. Wire it into CI. Then never ship a prompt change without running it. Within a week, you will wonder how you ever shipped AI features without this.

What Is Eval-Driven Development? How to Ship AI Features Without Guessing

The Problem With Testing AI

What Eval-Driven Development Is

Designing Eval Suites

Golden Datasets

Automated Scoring with LLM-as-Judge

Human-in-the-Loop for Edge Cases

Metrics That Matter

CI/CD Integration

The Eval Pyramid

Tools and Frameworks

When EDD Is Overkill vs. Essential

Conclusion

More posts

Stay in the loop

What Is Eval-Driven Development? How to Ship AI Features Without Guessing

The Problem With Testing AI

What Eval-Driven Development Is

Designing Eval Suites

Golden Datasets

Automated Scoring with LLM-as-Judge

Human-in-the-Loop for Edge Cases

Metrics That Matter

CI/CD Integration

The Eval Pyramid

Tools and Frameworks

When EDD Is Overkill vs. Essential

Conclusion

More posts

Getting Started with Medical Image Classification Using Deep Learning

Linear Algebra Is All You Need: The Math Behind Every AI Model

Building Medical AI: A Complete Guide To Pathology Slide Analysis

Stay in the loop