The Problem With Testing AI
Traditional software testing is built on a simple premise: given input X, the output should be Y. You write an assertion, it passes or fails, and you move on. This premise breaks completely with LLM-powered features.
When the output is natural language, there is no single correct answer. A summarization feature might produce ten different valid summaries for the same input. A classification model might express the same label in different words. You cannot assertEquals on prose. And yet, shipping AI features without any quality measurement is how teams end up with products that work in demos and fail in production.
This is the problem that eval-driven development solves. When I was building Clover and iterating on Trovex's search quality, the shift from ad hoc testing to systematic evals was the single biggest improvement in our development velocity. Not because we wrote fewer bugs, but because we found them before users did.

What Eval-Driven Development Is
Eval-driven development (EDD) is a methodology where you write evaluations before you write prompts, in the same way test-driven development prescribes writing tests before code. The eval suite defines what success looks like. The prompt is the implementation that tries to satisfy it.
The workflow is concrete:
The key insight is that the eval suite is the specification, not the prompt. The prompt is an implementation detail that can change freely as long as the evals pass. This is the same principle that makes TDD effective for traditional software — the tests define the contract, the code is free to vary.
Designing Eval Suites
The quality of your eval suite determines the quality of your AI feature. A weak eval suite gives you false confidence. A strong one catches regressions before they reach production.
Golden Datasets
A golden dataset is a curated set of inputs paired with expected outputs or output characteristics. The hard part is not collecting the data — it is ensuring the dataset is representative of production traffic. Sample real user queries if you have them. If you are building a new feature, invest time in generating realistic test cases that cover the full distribution: common queries, rare queries, malformed queries, and adversarial inputs.
Automated Scoring with LLM-as-Judge
For dimensions that resist simple programmatic checks — factual accuracy, helpfulness, coherence — use a separate LLM as an evaluator. The judge model receives the input, the expected output (or criteria), and the actual output, then scores on a defined rubric.
def llm_judge(query, expected, actual, criteria):
"""Use an LLM to score output quality on specific criteria."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "system",
"content": (
"You are an evaluation judge. Score the ACTUAL output "
"against the EXPECTED output on the given criteria. "
"Return a JSON object with 'score' (1-5) and 'reasoning'."
)
}, {
"role": "user",
"content": f"Query: {query}\n"
f"Expected: {expected}\n"
f"Actual: {actual}\n"
f"Criteria: {criteria}"
}],
response_format={"type": "json_object"},
max_tokens=200
)
return json.loads(response.choices[0].message.content)The reliability of LLM-as-judge scales with the specificity of the rubric. Vague criteria like is this response good? produce inconsistent scores. Specific criteria like does the response contain the three key facts from the reference answer? produce scores that correlate well with human judgment.
Human-in-the-Loop for Edge Cases
Automated scoring handles the bulk of evaluation, but certain failure modes — subtle hallucinations, tone mismatches, culturally inappropriate responses — still require human review. In practice, reserve human evaluation for the 10-15% of cases where automated metrics are ambiguous or where the stakes of a false positive are high.
Metrics That Matter
Not all metrics are equally useful. The metrics you track should map directly to user-facing quality and business viability.
| Metric | What It Measures | How to Compute | Target Range |
|---|---|---|---|
| Task completion rate | Does the output actually solve the user's problem? | LLM-as-judge or human review on golden dataset | 90%+ for production readiness |
| Factual accuracy | Are claims in the output verifiably true? | Cross-reference against source documents | 95%+ for knowledge-grounded tasks |
| Format compliance | Does the output match the required schema? | Programmatic validation (JSON parse, regex) | 99%+ (non-negotiable for structured outputs) |
| Latency (p95) | How long does the end-to-end request take? | Time from request to complete response | Depends on UX context, typically under 3s |
| Cost per request | How much does each inference cost? | Total tokens * price per token | Depends on business model and margins |
Track task completion rate as your primary metric. All other metrics are secondary — a fast, cheap response that does not solve the user's problem has negative value.
CI/CD Integration
Evals are only useful if they run automatically. The goal is to make it impossible to deploy a prompt regression.
The integration pattern mirrors traditional CI/CD. On every pull request that modifies a prompt, system message, or model configuration, the CI pipeline runs the full eval suite against the golden dataset. If any metric drops below the defined threshold, the build fails and the deploy is blocked.
# eval_runner.py — runs in CI on every prompt change
import json
import sys
def run_eval_suite(prompt_version, golden_dataset_path, thresholds):
dataset = json.load(open(golden_dataset_path))
results = {
"task_completion": [],
"format_compliance": [],
"factual_accuracy": [],
"latency_ms": [],
"cost_usd": []
}
for case in dataset:
output, latency, cost = run_inference(prompt_version, case["input"])
results["task_completion"].append(
llm_judge(case["input"], case["expected"], output, "task_completion")
)
results["format_compliance"].append(validate_format(output, case["schema"]))
results["latency_ms"].append(latency)
results["cost_usd"].append(cost)
# Compute aggregates and check thresholds
scores = {
"task_completion": avg(results["task_completion"]),
"format_compliance": avg(results["format_compliance"]),
"latency_p95": percentile(results["latency_ms"], 95),
"avg_cost": avg(results["cost_usd"])
}
passed = all(scores[k] >= thresholds[k] for k in thresholds)
print(json.dumps(scores, indent=2))
sys.exit(0 if passed else 1)This gives you a clear signal on every change: either the new prompt is at least as good as the previous one, or it is not. No ambiguity. No guessing. No deploying and hoping.

The Eval Pyramid
Like the traditional testing pyramid, evals should be structured in layers with different coverage, cost, and speed characteristics.
The pyramid structure ensures you catch most regressions cheaply and quickly at the base layer, while reserving expensive human evaluation for the nuanced failures that only humans can detect. Inverting this pyramid — relying primarily on human review — does not scale.
Tools and Frameworks
You do not need to build everything from scratch. Several frameworks have matured to the point where they are genuinely useful in production.
Promptfoo is an open-source eval framework that supports custom assertions, multiple model providers, and CI integration out of the box. It handles the mechanical parts — running prompts against test cases, computing metrics, generating reports — so you can focus on designing the eval criteria. For most teams, this is the right starting point.
Braintrust provides a managed platform for logging, evaluating, and comparing prompt variants. The key advantage is the tracking layer: you can see exactly how each prompt version performed over time and correlate metric changes with specific prompt edits. This is valuable once your eval suite is mature enough that you are optimizing rather than building.
For cases where existing tools do not fit — unusual scoring criteria, proprietary data constraints, complex multi-model pipelines — a custom eval harness of a few hundred lines of Python is often simpler than adapting a general-purpose framework. The eval runner script above is a minimal example. In practice, you add structured logging, parallel execution, and a comparison mode that shows diffs between prompt versions.
When EDD Is Overkill vs. Essential
EDD is not always the right approach. The cost of building and maintaining an eval suite is real, and some use cases do not justify it.
EDD is essential when: the AI feature faces real users, errors have consequences (financial, medical, legal), the prompt is iterated on frequently, or multiple team members edit prompts. In these cases, shipping without evals is shipping without seatbelts. You will eventually regret it.
EDD is overkill when: you are prototyping, the feature is internal-only with a forgiving audience, the prompt is simple and stable, or you are exploring whether the feature is even viable. In these cases, a few manual spot checks are sufficient. Do not build a testing infrastructure for a feature that might not survive the week.
The inflection point is usually the moment the feature moves from prototype to production. That is when you stop testing by feel and start testing by measurement.
Eval-driven development is TDD for the AI era. Write evals first, treat the prompt as an implementation detail, automate everything, and never deploy a regression. The teams that adopt this methodology ship faster and break less — not because they write better prompts, but because they know exactly when a prompt is good enough.
Conclusion
The shift from ad hoc prompting to eval-driven development is the difference between building AI features that happen to work and building AI features that are engineered to work. The tooling exists. The methodology is proven. The only barrier is the discipline to write the eval before the prompt.
Start small. Pick one AI feature in your product. Build a golden dataset of 50 cases. Implement three metrics: task completion, format compliance, and latency. Wire it into CI. Then never ship a prompt change without running it. Within a week, you will wonder how you ever shipped AI features without this.