
8 May 2025 · Binary Hat engineering

Eval-first is the only honest way to ship AI

The SLA argument generalises. Whether you're shipping safety vision, an agentic feature, a VFX pipeline, or a research model, the single highest-leverage choice is defining "good" before you train.

A while back we argued that the only honest way to sell AI safety is per-module accuracy SLAs. The same argument, in stronger form, applies to every kind of AI work we do — safety vision, agentic systems, generative VFX, film production, custom ML. The single highest-leverage choice across all of them is the same: define what “good” looks like before you build the system.

This is the eval-first posture. It is unfashionable, slow at the start, and the most reliable predictor we have of whether a project will survive production.

Why most AI projects die

Failed AI projects fail in patterns. We’ve watched enough of them to be sure of the shape:

  • The “looks great in the demo” project. The team demos a prototype, the stakeholder is delighted, and the team is funded to “productionise it.” Six months later nobody can say whether the production version is better or worse than the demo, because nobody ever wrote down what “better” meant.

  • The “we’ll measure once it’s live” project. Measurement is deferred until production. In production, the team realises the system was always answering an easier question than the one users were asking. There’s no test set that captures the real question, so there’s no way to fix it without rebuilding from scratch.

  • The “the benchmark moved” project. The team optimises against a public benchmark. The public benchmark turns out to be saturated, gamed, or measuring something orthogonal to user value. The model wins the benchmark and fails the customer.

  • The “no one owns the eval” project. Engineering builds the model. Product owns the customer outcome. Neither owns the evaluation. When the eval gets in the way of a release, it disappears.

Every failure mode above has the same root cause: the evaluation was not the first artefact built.

What eval-first looks like in practice

Eval-first means the test set, the scoring methodology, and the success threshold are written down before any model training, prompt engineering, or pipeline integration starts. Concretely:

  1. Name the user outcome. Not the model output — the user outcome. “Operator acknowledges a P1 alert within 30 seconds.” “Customer resolves their billing question without escalating to a human.” “Supervisor accepts the generated shot on the first review pass.” “Researcher reproduces the result on held-out data.”

  2. Build the test set. Real data, labelled by people who understand the domain, with explicit subgroup breakouts (children vs adults; new customers vs returning; outdoor vs indoor; one language vs another). The test set is an asset. It is more valuable than the model.

  3. Define the metric and the threshold. “Above X% on the indoor-adult slice and above Y% on the outdoor-night slice, with a false-positive rate below Z.” Numbers. Real numbers. Not “high accuracy.” Not “industry-leading.”

  4. Specify the cost SLO too. For agentic and generative work, cost-per-task is as much an SLO as accuracy. A 99% task-success rate at $5 per task is usually a worse product than 95% at $0.05, unless a single failure costs more than the saving. Engineering treats cost as a metric, not as an externality.

  5. Make the eval runnable. A single command, on a CI machine, producing a reproducible report. If running the eval is hard, the eval will stop being run. If the eval stops being run, the model degrades silently.

  6. Re-run on every change. Every new model version, every prompt change, every retrieval-index update, every adapter — they all go through the eval before they ship. Regressions are caught at code review, not at the customer.
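Steps 2, 3 and 5 can be sketched as a small gate that CI runs on every change. This is a minimal illustration under stated assumptions, not our production harness: the slice names echo the example in step 3, while the threshold values, the `SliceThreshold` type and the `run_gate` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceThreshold:
    """Minimum accuracy and maximum false-positive rate for one test-set slice."""
    min_accuracy: float
    max_false_positive_rate: float


# Hypothetical numbers; the real ones come from the written-down success criteria.
THRESHOLDS = {
    "indoor-adult": SliceThreshold(min_accuracy=0.95, max_false_positive_rate=0.02),
    "outdoor-night": SliceThreshold(min_accuracy=0.90, max_false_positive_rate=0.05),
}


def score_slice(examples):
    """Return (accuracy, false_positive_rate) for a list of (label, prediction) booleans."""
    total = len(examples)
    correct = sum(1 for label, pred in examples if label == pred)
    negatives = sum(1 for label, _ in examples if not label)
    false_positives = sum(1 for label, pred in examples if pred and not label)
    accuracy = correct / total if total else 0.0
    fpr = false_positives / negatives if negatives else 0.0
    return accuracy, fpr


def run_gate(results_by_slice, thresholds=THRESHOLDS):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for name, threshold in thresholds.items():
        examples = results_by_slice.get(name, [])
        if not examples:
            # A slice with no data is a failure, not a silent pass.
            failures.append(f"{name}: no examples in test set")
            continue
        accuracy, fpr = score_slice(examples)
        if accuracy < threshold.min_accuracy:
            failures.append(f"{name}: accuracy {accuracy:.3f} < {threshold.min_accuracy}")
        if fpr > threshold.max_false_positive_rate:
            failures.append(f"{name}: FPR {fpr:.3f} > {threshold.max_false_positive_rate}")
    return failures
```

In CI, a wrapper script would print the failures and exit non-zero when the list is not empty, so a regression blocks the merge instead of reaching the customer.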

This is unromantic engineering. It is also the difference between AI that survives a year in production and AI that doesn’t.
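The cost comparison in step 4 is easier to see as cost per successful task. The sketch below assumes a failed attempt is simply discarded or retried and carries no downstream cost of its own; where a failure is expensive (a missed safety alert, say), that cost needs its own term. The function names and the SLO parameters are hypothetical.

```python
def cost_per_successful_task(cost_per_task: float, success_rate: float) -> float:
    """Expected spend to obtain one successful task, assuming failures are discarded."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_task / success_rate


def within_slo(cost_per_task: float, success_rate: float,
               min_success_rate: float, max_cost_per_success: float) -> bool:
    """Check both halves of the SLO: the quality floor and the cost ceiling."""
    return (
        success_rate >= min_success_rate
        and cost_per_successful_task(cost_per_task, success_rate) <= max_cost_per_success
    )


expensive = cost_per_successful_task(5.00, 0.99)  # about 5.05 dollars per success
cheap = cost_per_successful_task(0.05, 0.95)      # about 0.053 dollars per success
```

On these figures the cheaper system costs roughly a hundredth as much per success, which is the gap a cost SLO is meant to make visible.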

What it looks like per practice

The metric changes per practice. The discipline doesn’t.

  • Safety vision. Per-module accuracy, false-positive rate per camera per night, mean-time-to-acknowledge per alert class. Measured monthly, per site.

  • Agentic systems. Task-success rate per workflow, tool-call success rate, average human escalations per session, cost-per-task. Measured per release, with regression suites in CI.

  • VFX pipeline. Supervisor accept-rate on first review, rework rate, deterministic-reproduction rate (same prompt, same seed → same output), pipeline throughput in shots-per-day. Measured per show.

  • Film production. Approvals-per-milestone, language-parity quality scores from native-speaker QA, consent-record completeness, audit-trail integrity. Measured per project gate.

  • Custom ML. Held-out test set performance, distribution-shift performance, fairness metrics where applicable, runtime and memory budgets per deployment target. Measured per training run.

Different metrics. Same posture.
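As one illustration, the agentic-systems metrics listed above can be computed from logged sessions in a few lines. This is a sketch under assumptions: the `Session` record and its field names are hypothetical, and a real log schema will differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """One logged agent session. Field names are hypothetical."""
    workflow: str
    succeeded: bool
    tool_calls: int
    failed_tool_calls: int
    escalations: int
    cost_usd: float


def summarise(sessions):
    """Aggregate per-workflow metrics from an iterable of Session records."""
    grouped = defaultdict(list)
    for s in sessions:
        grouped[s.workflow].append(s)

    report = {}
    for workflow, group in grouped.items():
        n = len(group)
        calls = sum(s.tool_calls for s in group)
        failed = sum(s.failed_tool_calls for s in group)
        report[workflow] = {
            "sessions": n,
            "task_success_rate": sum(s.succeeded for s in group) / n,
            # None when no tool calls were made, so the rate is never faked.
            "tool_call_success_rate": (calls - failed) / calls if calls else None,
            "avg_escalations_per_session": sum(s.escalations for s in group) / n,
            "cost_per_task_usd": sum(s.cost_usd for s in group) / n,
        }
    return report
```

Comparing the report for a release candidate against the report for the current release, workflow by workflow, is what turns these numbers into a regression suite.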

The two questions to ask any AI vendor

When you’re evaluating an AI vendor (including us), two questions almost always separate honest engineering from a sales process:

  1. “Show me the eval you ran on your last engagement.” Not a benchmark from a paper. The actual eval, with the actual test set, from a real client. If they can’t show you a redacted version, they probably don’t run evals at all.

  2. “What metric are you committing to in our contract, and what’s the methodology?” A vendor who can answer this in a single sentence has thought about it. A vendor who needs to “circle back” is selling you something they can’t operate.

If those questions land softly, you have a partner. If they don’t, you have a sales process. The cost difference shows up in year two.
