12 April 2025 · Binary Hat engineering
Why a per-module accuracy SLA is the only honest way to sell AI safety
A single "99% accurate" headline is a marketing artefact. Real safety deployments have to be measured per module, per site, monthly — and the contract has to mean it.
When a safety-vision vendor tells you their system is “99% accurate,” ask them which event class on which camera, under what lighting, at what time of day, with which definition of a true positive — and watch the conversation die.
This is not a hostile question. It is the only question that matters when you’re about to wire AI alerts into a real operational response: a security team’s pagers, a school’s parent-communication system, or a hospital’s restricted-zone access. The headline number doesn’t survive contact with any of those settings, because no single number describes all of them.
The number is not one number
Accuracy in computer vision decomposes along several axes that the headline rolls up and hides.
By module. Face recognition and weapon detection are completely different problems with different state-of-the-art ceilings. A vendor who quotes a single accuracy figure across both is, at best, averaging numbers that should never have been averaged.
By target subgroup. Adult face recognition under controlled indoor light is a much easier problem than under-12 face recognition where the subject’s features change every six months. Numbers that don’t break out subgroups will quietly hide weak performance on the very populations the SLA was supposed to protect.
By condition. Outdoor night ANPR with monsoon spray on the lens is not the same problem as daylight at a clean toll plaza. We owe the buyer a band — what we will commit to in each condition, and what we will refuse to commit to.
By definition of true positive. Did the system fire on the event class? Within the right time window? With the right confidence? Vendors who don’t publish their event definition can quietly redefine it after a missed quarter.
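To make the roll-up problem concrete, here is a minimal sketch in Python. Every count in it is invented for illustration, the module and condition names are hypothetical, and "accuracy" is simplified to correct decisions over total decisions. It pools four slices into one headline figure and then reports the same data slice by slice.

```python
# Hypothetical evaluation counts, invented purely for illustration:
# (module, condition, correct_decisions, total_decisions)
RESULTS = [
    ("face_recognition", "indoor_day", 9900, 10000),
    ("face_recognition", "outdoor_night", 140, 200),
    ("weapon_detection", "indoor_day", 45, 50),
    ("weapon_detection", "outdoor_night", 12, 20),
]


def headline_accuracy(rows):
    """Pool every slice into one figure, as a single headline number does."""
    correct = sum(row[2] for row in rows)
    total = sum(row[3] for row in rows)
    return correct / total


def accuracy_by_slice(rows):
    """Report each (module, condition) slice separately."""
    return {
        (module, condition): correct / total
        for module, condition, correct, total in rows
    }


headline = headline_accuracy(RESULTS)
per_slice = accuracy_by_slice(RESULTS)

print(f"headline: {headline:.1%}")
for (module, condition), accuracy in per_slice.items():
    print(f"{module:18s} {condition:14s} {accuracy:.1%}")
```

With these invented counts the pooled figure comes out at roughly 98%, while the weakest slice sits at 60%. The easy, high-volume slice dominates the pool, which is exactly how a headline hides the conditions that matter most.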
What an honest SLA looks like
Here is the structure we use, simplified:
- A per-module table: module name → conditions → committed accuracy band → measurement window → exclusions.
- A monthly report: per site, per module, the measured number for the previous calendar month, plus a written note on any miss.
- A service-credit mechanism that triggers automatically when a module misses its committed band in two consecutive months. No special pleading required from the buyer.
- A jointly agreed exclusion list: events the system was not designed to catch, written in the same document so neither side can claim them later.
The numbers themselves matter less than the structure. A 90% SLA that holds is worth more than a 99% SLA the vendor can talk their way out of.
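The structure above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production contract logic: the `Commitment` fields, the example module, the exclusion text, and the monthly figures are all hypothetical, and the band is reduced to a single lower bound. The sketch also assumes that a third consecutive miss triggers a further credit, which a real contract would have to state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commitment:
    """One row of a per-module SLA table (field names are hypothetical)."""
    module: str
    condition: str
    floor: float                       # lower edge of the committed band
    window: str = "calendar month"     # measurement window
    exclusions: tuple = ()             # jointly agreed exclusions


def months_triggering_credit(commitment, monthly):
    """Return the month labels in which a service credit triggers.

    `monthly` is a chronological list of (month_label, measured_accuracy).
    A credit triggers in any month that is at least the second consecutive
    month measured below the committed floor.
    """
    triggered = []
    streak = 0
    for label, measured in monthly:
        streak = streak + 1 if measured < commitment.floor else 0
        if streak >= 2:
            triggered.append(label)
    return triggered


# Hypothetical commitment and measurements, invented for illustration.
commitment = Commitment(
    module="weapon_detection",
    condition="indoor_day",
    floor=0.90,
    exclusions=("objects fully occluded from the camera",),
)
history = [("2025-01", 0.93), ("2025-02", 0.88), ("2025-03", 0.89), ("2025-04", 0.92)]

credits = months_triggering_credit(commitment, history)
print(credits)
```

The point of writing it this way is that the trigger is a pure function of the published monthly numbers. Nobody has to argue for the credit, and nobody can argue it away.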
Why vendors resist this
Three reasons, in declining order of cynicism:
- The headline number wins more deals. A 90% commitment loses to a 99% claim in a procurement that scores accuracy as a single field on a spreadsheet.
- Honest measurement is operationally expensive. You need ground truth, a labelling pipeline, and an evaluation cadence — all on per-site data, not just internal benchmarks.
- Honest measurement creates accountability. The miss months are visible. The conversation has to happen.
We think the last reason is the actual reason. The first two are the rationalisations. If you find a vendor whose contract gives them a way to never have to publish a miss month, you have found a vendor whose product cannot survive its own measurement.
What to ask in your next vendor conversation
- “Show me a redacted monthly SLA report from a current client.”
- “What’s the published service-credit mechanism, and how is it calculated?”
- “What does your evaluation pipeline look like? Who labels the ground truth?”
- “Which modules do you decline to commit numbers on, and why?”
If the vendor answers those questions readily, you have a vendor who has thought about this. If not, you have a sales process. The difference is the next five years of your operational reality.