Case Study · Data Generation & Annotation Suite/02

The model is only as good as its data.

DataForge is a suite of annotation platforms purpose-built for training and evaluating AI translation models. Binary and preference annotation, summary evaluation, and ecommerce-domain pipelines, wired together with HITL feedback loops — each surface tailored to a specific evaluation methodology, all sharing the same infrastructure.

/metric.01
4
Annotation platforms in the suite, shared infrastructure
/metric.02
HITL
Continuous feedback loops, not one-shot labeling
/metric.03
Multi-method
Binary, preference, scoring, and summary evaluation
/metric.04
Domain
Ecommerce-specialised data pipelines

› the challenge / 01

An AI translation model is a function of the data it was trained on. The data is the product.

Most teams building AI translation models talk about the model. The model is the wrong thing to talk about. The model is the output. The data is the product — and the annotation platform is the factory.

Generic web-scraped corpora get you a generic translation model. They will not get you a model that handles ecommerce product titles correctly, or one that preserves the structure of a long-form summary, or one that knows when a binary “is this acceptable?” judgement is the right unit of evaluation. For that, you need labelled data produced under the methodology that matches the evaluation question — and you need it produced at volume, by annotators who can move quickly because the tool fits the task.

The methodologies are not interchangeable. Binary judgement (is this translation acceptable?) needs a different UI from preference comparison (translation A vs translation B). Summary evaluation (does this preserve meaning?) needs a different flow again. Ecommerce-domain validation (is this product title correct for the locale?) needs its own pipeline because the data shape is different. Forcing all of them into a single off-the-shelf labelling tool produces inconsistent data — which produces inconsistent models.

“The model is the output. The data is the product. The annotation platform is the factory.”
— the design constraint

The challenge was to build a suite rather than a tool — a set of annotation platforms that share infrastructure but each fit a specific evaluation methodology — and to wire HITL feedback loops through all of them so the data improves over time, instead of being labelled once and forgotten.

› the problem / 02

Two ways teams handle annotation. Both produce mediocre data.

Teams training AI translation models tend to default to one of two patterns when they need labelled data — and both compromise the dataset before the model ever sees it.

Failure mode 01

Generic labelling tool, forced methodology.

Off-the-shelf labellers handle eighty percent of annotation work fine. The moment the methodology gets specific — binary preference between two translation candidates, summary evaluation of long-form output, ecommerce product validation — the team is hacking around the tool's assumptions. Annotators get confused. Inter-annotator agreement drops. The data comes back inconsistent and has to be cleaned before it can train anything.

consistency per task → observed: drift in agreement
Failure mode 02

One-shot labelling without feedback loops.

Most annotation projects treat labelling as a one-time step. Annotators tag the data, the model gets trained, the dataset is filed away. When the model drifts in production or hits an edge case nobody anticipated, there is no path to feed corrections back into the dataset. The training data is fossilised. The model degrades silently against a moving distribution.

consistency per task → observed: static dataset
/summary

Neither produces what a team training serious AI translation models actually needs: a methodology-specific annotation surface for each evaluation type, plus a HITL feedback loop that keeps the dataset alive as the model meets reality.

› the solution / 03

A suite of annotation platforms — methodology-specific surfaces, shared infrastructure.

DataForge is not a single tool. It is a set of annotation platforms — each tailored to a specific evaluation methodology — built on a shared backend that handles task queuing, data versioning, audit trail, and HITL feedback into model training.

dataforge.suite / live
4 platforms · 1 backend
LAYER 01 · PLATFORMS · methodology-specific surfaces
P1 · Binary annotation · accept/reject
P2 · Preference comparison · A vs B
P3 · Summary evaluation · long-form
P4 · Ecommerce pipeline · domain data
HAND-OFF
LAYER 02 · INFRASTRUCTURE · HITL feedback + shared backend
I1 · Celery · async pipeline orchestration
I2 · PostgreSQL · structured + versioned
I3 · HITL · prod drift back to training
I4 · REST API · model integration e2e
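
A minimal sketch of what the shared backend's task-and-label schema could look like on the stack the suite is built with (Django on PostgreSQL). The model names, fields, and methodology keys below are illustrative assumptions, not DataForge's actual schema.

```python
# Illustrative Django models for a shared annotation backend (assumed names).
from django.db import models


class AnnotationTask(models.Model):
    """One unit of work, queued for any of the four platforms."""
    METHODOLOGIES = [
        ("binary", "Binary accept/reject"),
        ("preference", "A vs B preference"),
        ("summary", "Summary evaluation"),
        ("ecommerce", "Ecommerce validation"),
    ]
    methodology = models.CharField(max_length=20, choices=METHODOLOGIES)
    payload = models.JSONField()          # source text, candidates, product data
    source = models.CharField(max_length=20, default="seed")  # "seed" or "production"
    created_at = models.DateTimeField(auto_now_add=True)


class Label(models.Model):
    """A versioned annotation; older versions are kept for the audit trail."""
    task = models.ForeignKey(AnnotationTask, on_delete=models.CASCADE,
                             related_name="labels")
    annotator = models.CharField(max_length=64)
    value = models.JSONField()            # shape depends on the methodology
    version = models.PositiveIntegerField(default=1)
    created_at = models.DateTimeField(auto_now_add=True)
```

One queue table and one versioned label table: every platform writes into the same two structures, which is what keeps the audit trail and the HITL export uniform across methodologies.
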
01

Binary & preference annotation

Two of the most useful evaluation methodologies for translation work — accept/reject judgements and head-to-head A/B preference. Each gets a UI designed around the decision being made, with built-in disagreement-resolution flows.
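
One way a disagreement-resolution flow can work, sketched under the assumption that each binary task collects several accept/reject votes and drops into adjudication when agreement is low; the function name and threshold are illustrative.

```python
# Illustrative majority-vote resolution for binary accept/reject tasks.
from collections import Counter


def resolve_binary_task(votes: list[bool], min_agreement: float = 0.75):
    """Return the majority label, or None to route the task to adjudication."""
    if not votes:
        return None
    label, top = Counter(votes).most_common(1)[0]
    if top / len(votes) >= min_agreement:
        return label
    return None  # agreement too low: a senior annotator resolves the task


# resolve_binary_task([True, True, False])        -> None  (0.67 < 0.75)
# resolve_binary_task([True, True, True, False])  -> True  (0.75 >= 0.75)
```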

02

Summary evaluation platform

Long-form translation output cannot be evaluated by spot-checking strings. The summary platform presents the full source and target, asks structured questions about meaning preservation and fluency, and produces scores annotators agree on.
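
A minimal sketch of what those structured questions might reduce to when a record is submitted, assuming a simple 1-to-5 rubric; the field names are illustrative, not the platform's actual form.

```python
# Illustrative structured record for one summary evaluation.
from dataclasses import dataclass, asdict


@dataclass
class SummaryEvaluation:
    task_id: int
    meaning_preservation: int   # 1 = meaning lost, 5 = fully preserved
    fluency: int                # 1 = unreadable, 5 = native quality
    structure_preserved: bool   # does the target keep the source's structure?
    notes: str = ""

    def validate(self) -> None:
        for name in ("meaning_preservation", "fluency"):
            score = getattr(self, name)
            if not 1 <= score <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {score}")


record = SummaryEvaluation(task_id=42, meaning_preservation=4, fluency=5,
                           structure_preserved=True)
record.validate()
payload = asdict(record)  # ready to submit to the shared backend
```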

03

Ecommerce-domain pipelines

Ecommerce data has its own shape — product titles, descriptions, attribute values, locale-specific units and currencies. The ecommerce pipeline is built around that shape rather than treating it as generic text, which is what produces ecommerce-grade training data.
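
A sketch of that data shape, assuming one structured record per product rather than a free-text string; the fields and the locale checks are illustrative.

```python
# Illustrative ecommerce record and pre-annotation locale checks.
from dataclasses import dataclass, field

EXPECTED_CURRENCY = {"de-DE": "EUR", "en-GB": "GBP", "ja-JP": "JPY"}


@dataclass
class ProductTranslationTask:
    locale: str
    title_source: str
    title_target: str
    attributes: dict[str, str] = field(default_factory=dict)  # e.g. {"colour": "rot"}
    price: str = ""                                           # e.g. "19,99 EUR"


def locale_checks(task: ProductTranslationTask) -> list[str]:
    """Cheap structural checks run before the task reaches an annotator."""
    issues = []
    expected = EXPECTED_CURRENCY.get(task.locale)
    if expected and task.price and expected not in task.price:
        issues.append(f"price '{task.price}' does not use {expected}")
    if len(task.title_target) > 200:
        issues.append("target title exceeds typical marketplace length limits")
    return issues
```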

04

HITL feedback loops, end to end

When a model misclassifies in production, the error becomes an annotation task. The corrected label flows back into the dataset, gets versioned, and is available for the next training run. Labelling stops being a one-time cost and becomes a continuous improvement loop.
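
A minimal sketch of that loop using Celery (part of the stack listed below) and the illustrative AnnotationTask model from the backend sketch above; the task name, app path, and fields are assumptions.

```python
# Illustrative Celery task: a production error becomes an annotation task.
from celery import shared_task


@shared_task
def enqueue_production_error(source_text: str, model_output: str, locale: str) -> None:
    # Imported lazily so the task module loads before Django apps are ready.
    from annotations.models import AnnotationTask  # hypothetical app and model

    AnnotationTask.objects.create(
        methodology="binary",
        source="production",
        payload={
            "source_text": source_text,
            "model_output": model_output,
            "locale": locale,
        },
    )
    # Once an annotator corrects the label, a new versioned Label row is
    # written and picked up by the next training-data export.
```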

› the result / 04

Annotation became infrastructure, not a one-off project.

DataForge is operating as the data factory behind multiple AI translation model lines. Methodology-specific platforms produce consistent labelled data. HITL loops keep the datasets current. The model team's bottleneck moved off “we need data” and onto “what should we train next.”

/result.01
4
Annotation platforms running on shared infrastructure
/result.02
Continuous
HITL feedback flow from production back into training data
/result.03
0
Methodologies forced into the wrong UI
01

Methodology-fit annotation produced consistent labelled data.

Each platform is built around the evaluation question annotators are answering — binary, preference, summary, or domain validation. Inter-annotator agreement improved because the UI stopped fighting the methodology, and the data downstream of it became trainable without a cleaning pass.
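
Inter-annotator agreement is usually tracked with a chance-corrected statistic; a minimal Cohen's kappa for two annotators on binary labels is sketched below as a measurement aid, not as part of DataForge's stated feature set.

```python
# Cohen's kappa for two annotators labelling the same binary tasks.
def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)   # agreement expected by chance
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
```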

02

HITL feedback loops kept training data current with production drift.

Production errors became annotation tasks automatically. Corrected labels flowed back into the dataset, got versioned, and were available for the next training cycle. The dataset stopped being a static artifact and became a live record of how the model meets reality.

03

Multiple evaluation methodologies, one shared backend.

Adding a new evaluation type means building a new frontend on top of the same task queue, the same data versioning, the same audit trail. The infrastructure cost amortises across every platform — and a new methodology ships in days instead of months.
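
A sketch of what that can look like from the new frontend's side: create tasks and submit labels against the shared REST API, and queuing, versioning, and the audit trail come along for free. The endpoint paths, methodology name, and payload fields below are illustrative, not the suite's documented API.

```python
# Illustrative client calls for a hypothetical new methodology.
import requests

API = "https://dataforge.example.com/api"  # hypothetical deployment URL

# 1. Queue work for the new evaluation type on the existing task queue.
task = requests.post(f"{API}/tasks/", json={
    "methodology": "span-error-tagging",   # the new methodology's key
    "payload": {"source": "Boligrafo azul, pack de 10",
                "target": "Blue ballpoint pen, pack of 10"},
}).json()

# 2. The new UI submits labels through the same versioned endpoint the other
#    platforms already use.
requests.post(f"{API}/tasks/{task['id']}/labels/", json={
    "annotator": "a.smith",
    "value": {"spans": [{"start": 0, "end": 4, "error": "terminology"}]},
})
```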

04

Ecommerce pipelines unlocked ecommerce-grade model quality.

Ecommerce-specific data — product titles, attributes, locale-specific patterns — gets annotated through a pipeline that knows the data shape. The resulting models handle ecommerce content meaningfully better than the same architectures trained on generic web corpora, because the training signal is domain-specific by construction.

05

Annotation became a capability, not a cost centre.

The team that owns data generation moved from “we need to build another labeller” to “which methodology fits this evaluation question.” Most of the answers are off-the-shelf within the suite. The ones that are not become new platforms on the same backend — and ship fast because the hard parts are already solved.

› built with

React · Django · PostgreSQL · Python · Celery
› get started

Need annotation infrastructure that fits the methodology?

Talk to us about how DataForge can plug into your model training workflow. We will walk you through the suite, run a sample annotation flow on your data shape, and tailor a deployment to the evaluation methodologies your team actually uses.

30-min walkthrough · No commitment
Sample annotation · Bring your own data shape
Pilot-ready · Methodology-fit from day one