Practice

Custom ML & Research

Bespoke ML engineering and applied research — for problems where an off-the-shelf model does not exist, the evaluation criteria do not exist, or the work must begin from the data.

How it works

Evaluation is the first artefact built, not the last. The model is finished when it passes an eval defined before training began.

What this practice is.

Some problems don't yet have a vendor solution. They need a focused team to examine the data, design the evaluation, build a model that meets it, and integrate it into the system that will use it. Concretely: PyTorch / JAX training, Weights & Biases experiment tracking, DVC data versioning, ONNX / TensorRT / GGUF deployment artefacts, and an eval suite that runs in CI. This practice covers domain-specific model development, adaptation and distillation for edge deployment, evaluation-harness design, and pre-competitive R&D collaborations.

What we build.

Domain-specific model development

Medical imaging, industrial inspection, materials and energy, scientific data: model design for domains where off-the-shelf foundation models aren't the right primitive.

Model distillation and edge optimisation

Compressing larger models into edge-deployable artefacts (TensorRT, ONNX, GGUF) with measured accuracy retention. Often the bridge from a cloud PoC to a shippable product.
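As an illustration of what "measured accuracy retention" can mean in practice, the sketch below compares a teacher and a distilled student on the same labelled test set. It is a minimal example under assumptions, not a description of any specific toolchain: the function names `accuracy` and `retention_report` are hypothetical, and predictions are assumed to be plain class labels already exported from each model.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def retention_report(teacher_preds, student_preds, labels):
    """Compare a compressed student against its teacher on one test set.

    retention = student accuracy / teacher accuracy
    agreement = how often the student predicts what the teacher predicts
    """
    teacher_acc = accuracy(teacher_preds, labels)
    student_acc = accuracy(student_preds, labels)
    # Agreement treats the teacher's outputs as the reference.
    agreement = accuracy(student_preds, teacher_preds)
    retention = student_acc / teacher_acc if teacher_acc > 0 else float("nan")
    return {
        "teacher_accuracy": teacher_acc,
        "student_accuracy": student_acc,
        "retention": retention,
        "agreement": agreement,
    }


# Illustrative toy data only.
labels = [0, 1, 1, 0]
teacher = [0, 1, 1, 0]
student = [0, 1, 0, 0]
report = retention_report(teacher, student, labels)
```

Reporting agreement alongside retention helps separate "the student is slightly worse everywhere" from "the student fails on a specific group of inputs the teacher handled".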

Evaluation harness design

When the standard benchmark misses what your buyer cares about, we build an eval that captures it: test sets, scoring methodology, statistical confidence, and regression suites.
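One common way to attach statistical confidence to an accuracy figure is a percentile bootstrap over the test examples. The sketch below is a minimal, stdlib-only example under assumptions: `bootstrap_accuracy_ci` and `is_regression` are hypothetical names, the resample count and 95% level are illustrative defaults, and the regression rule is deliberately conservative rather than a recommended standard.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy.

    `correct` is a list of 0/1 flags, one per scored test example.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not correct:
        raise ValueError("need at least one scored example")
    rng = random.Random(seed)  # fixed seed so CI runs are reproducible
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)  # resample with replacement
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lo, hi


def is_regression(baseline_correct, candidate_correct, **kwargs):
    """Flag a regression only when the candidate's upper bound
    falls below the baseline's point estimate."""
    base_acc, _, _ = bootstrap_accuracy_ci(baseline_correct, **kwargs)
    _, _, cand_hi = bootstrap_accuracy_ci(candidate_correct, **kwargs)
    return cand_hi < base_acc
```

When both models are scored on the same examples, a paired comparison (resampling per-example differences) is usually tighter; the unpaired version here is kept for brevity.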

Pre-competitive R&D collaborations

Joint work with university groups, research labs, and innovation programmes. Co-authored papers, open-source artefacts where appropriate, clear IP terms from the start.

How we engineer in this practice.

Evaluation before model

The evaluation framework is defined before the model is trained. If we cannot characterise what constitutes a successful outcome at the start, we cannot recognise it at the finish.
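A minimal sketch of what "defined before the model is trained" can look like in code: the acceptance criteria are frozen as data, and a finished model is checked against them (for example from a CI job). Everything here is an assumption for illustration; `Criterion`, `EvalSpec`, the metric names, and the thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One acceptance rule, fixed before training starts."""
    metric: str
    threshold: float
    higher_is_better: bool = True

    def passed(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


@dataclass(frozen=True)
class EvalSpec:
    """A named, immutable set of acceptance criteria."""
    name: str
    criteria: tuple

    def check(self, metrics: dict) -> dict:
        """Pass/fail per criterion; a missing metric counts as a failure."""
        results = {}
        for c in self.criteria:
            value = metrics.get(c.metric)
            results[c.metric] = value is not None and c.passed(value)
        return results

    def accepted(self, metrics: dict) -> bool:
        return all(self.check(metrics).values())


# Illustrative names and numbers only.
SPEC = EvalSpec(
    name="defect-detector-v1",
    criteria=(
        Criterion("recall_held_out", 0.95),
        Criterion("recall_shifted", 0.90),
        Criterion("latency_ms_p95", 40.0, higher_is_better=False),
    ),
)
```

Because the dataclasses are frozen and the spec lives in version control, any change to a threshold after training starts shows up as an explicit diff rather than a silent adjustment.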

Data is the asset

In our experience, ML projects fail at the data stage more often than at the model stage. We invest heavily in data understanding, labelling protocols, and curation before any training run.

Honest about generalisation

We report performance on held-out and distribution-shifted test sets. We do not cherry-pick the slice that makes the headline number look better.
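One way to keep a headline number honest is to report every test slice together with the worst one. The sketch below is a minimal example under assumptions: `accuracy_by_slice` and `summarise` are hypothetical names, and the slice labels ("held_out", "new_site") are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of (slice_name, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, count]
    for slice_name, is_correct in records:
        totals[slice_name][0] += int(bool(is_correct))
        totals[slice_name][1] += 1
    return {name: correct / count for name, (correct, count) in totals.items()}


def summarise(records):
    """Report all slices plus the weakest one, so no slice is hidden."""
    per_slice = accuracy_by_slice(records)
    if not per_slice:
        raise ValueError("no records to summarise")
    worst = min(per_slice, key=per_slice.get)
    return {
        "per_slice": per_slice,
        "worst_slice": worst,
        "worst_accuracy": per_slice[worst],
    }


# Illustrative toy data: an in-distribution slice and a shifted one.
records = (
    [("held_out", True)] * 9 + [("held_out", False)]
    + [("new_site", True)] * 6 + [("new_site", False)] * 4
)
summary = summarise(records)
```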

Open by default in research collaborations

When clients agree, we contribute back — open weights, open datasets where lawful, open methods. We believe the field gets better when good work gets shared.

Stack in this practice.

PyTorch, JAX where appropriate; HuggingFace ecosystem
Weights & Biases for experiment tracking
ONNX, TensorRT, GGUF for deployment artefacts
Ray for distributed training; SLURM where the client already has it
DVC and Git LFS for data versioning

See the firm-wide stack →

What we won't build in this practice.

PoCs we know we cannot productionise

If we can see at scoping time that the result can't be operated by the client, we say so up front. We don't sell a deliverable that has no path to production.

Benchmark gaming

Tuning to a known test set to win a procurement scorecard is dishonest engineering. We measure on data the buyer brings, not the data we trained on.
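A simple guard against accidentally scoring on training data is to fingerprint examples and look for overlap between the training set and the buyer's test set. The sketch below is a minimal example under assumptions: `fingerprint` and `find_overlap` are hypothetical names, examples are assumed to be text, and only exact matches after whitespace and case normalisation are caught. Near-duplicates and non-text data would need a different technique.

```python
import hashlib


def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_overlap(train_examples, test_examples):
    """Return the test examples whose fingerprint also appears in training data."""
    train_hashes = {fingerprint(t) for t in train_examples}
    return [t for t in test_examples if fingerprint(t) in train_hashes]


# Illustrative toy data only.
train = ["Crack on weld seam", "no defect"]
test = ["crack  on weld SEAM", "porosity near edge"]
leaked = find_overlap(train, test)
```

An empty result does not prove the test set is clean; it only rules out verbatim overlap.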

Research without a publication path or use path

Research that lives in a private repo, never operationalised and never shared, is wasted effort. We agree the destination up front — paper, product, or open release.

Where to go next.

Approach: honest measurement

Insights: SLA argument generalised