Evaluation before model
The evaluation framework is defined before the model is trained. If we cannot characterise what constitutes a successful outcome at the start, we cannot recognise it at the finish.
Practice
Bespoke ML engineering and applied research — for problems where an off-the-shelf model does not exist, the evaluation criteria do not exist, or the work must begin from the data.
How it works
Some problems don't yet have a vendor solution. They need a focused team to examine the data, design the evaluation, build a model that meets it, and integrate it into the system that will use it. Concretely: PyTorch / JAX training, Weights & Biases experiment tracking, DVC data versioning, ONNX / TensorRT / GGUF deployment artefacts, and an eval suite that runs in CI. This practice covers domain-specific model development, adaptation and distillation for edge deployment, evaluation-harness design, and pre-competitive R&D collaborations.
Medical imaging, industrial inspection, materials and energy, scientific data — model design where the off-the-shelf foundation models aren't the right primitive.
Compressing larger models into edge-deployable artefacts (TensorRT, ONNX, GGUF) with measured accuracy retention. Often the bridge from a cloud PoC to a shippable product.
When the standard benchmark misses what your buyer cares about, we build the eval that doesn't. Test sets, scoring methodology, statistical confidence, regression suites.
Joint work with university groups, research labs, and innovation programmes. Co-authored papers, open-source artefacts where appropriate, clear IP terms from the start.
The evaluation framework is defined before the model is trained. If we cannot characterise what constitutes a successful outcome at the start, we cannot recognise it at the finish.
Most ML projects fail at the data stage, not the model stage. We over-invest in data understanding, labelling protocols, and curation before any training runs.
We report performance on held-out and distribution-shifted test sets. We do not cherry-pick the slice that makes the headline number look better.
When clients agree, we contribute back — open weights, open datasets where lawful, open methods. We believe the field gets better when good work gets shared.
If we can see at scoping time that the result can't be operated by the client, we say so up front. We don't sell a deliverable that has no path to production.
Tuning to a known test set to win a procurement scorecard is dishonest engineering. We measure on data the buyer brings, not the data we trained on.
Research that lives in a private repo, never operationalised and never shared, is wasted effort. We agree the destination up front — paper, product, or open release.
Tell us what you've already tried, what you've ruled out, and what success looks like. We come back within one working day.