Case Study · Localisation Quality Assurance / 01

From a translated file to a defensible MQM score.

LQA Automation is a dual-AI-agent and four-stage human review platform that turns a translated file into an auditable, MQM-scored quality report — with a frozen scoring rubric, per-error reviewer decisions, and checkpoint-resumable processing for jobs at scale.

/metric.01
2×
AI agents in disagreement — the unit of evaluation
/metric.02
4
Optional human review stages, all skippable per job
/metric.03
6
MQM error dimensions across four severity levels
/metric.04
100%
Reproducible scores via frozen rubric snapshots

› the challenge / 01

Selling translation quality means selling a number a customer will pay against.

Language Quality Assurance is the step every serious localisation buyer needs and almost no one wants to pay for. It is how an agency proves a translation is good enough to ship — and the deliverable is a single MQM score the customer signs off against.

Producing that score at scale is brutal. Senior linguists work through files row by row, tagging errors by severity and category, in spreadsheets that do not talk to each other. The work is slow, expensive, inconsistent across reviewers, and leaves no real audit trail. Every file restarts the loop.

The market has been pushed toward AI for QA, but adoption keeps stalling on a hard problem: buyers do not trust a score they cannot audit. A translation agency cannot put its name on a number generated by a single LLM that may have hallucinated half the errors it flagged.

“The deliverable is a number a customer pays against. It has to be defensible — or it isn’t worth producing.”
— the design constraint

The challenge was to build a QA platform an agency could stake its reputation — and its invoice — on. Fast enough to be commercially viable. Auditable enough to be defensible. And configurable enough that the same workflow could ship a marketing brochure and a regulated medical document without changing tools.

› the problem / 02

Two ways the industry tried to solve this. Both fail.

Every existing approach to translation QA collapses into one of two shapes — and both miss what an agency actually has to deliver.

Failure mode 01

Manual MQM scorecards in Excel.

A senior reviewer spends 20–40 minutes per 1,000 words categorising errors by severity and dimension. The output is a spreadsheet. Inconsistent across reviewers, no real audit trail beyond cell history, and the next file starts from zero. It does not scale, and the cost per scored file makes it commercially unviable above a certain volume.

consistency per reviewer →
observed: 20–40 min / 1k words
Failure mode 02

Single-LLM quality estimation.

One AI pass finds errors — but it also invents them. There is no second opinion, no override path, no traceability. The score is a black box. Buyers will not pay for a number they cannot defend, and the platform offers no way for a human to step in mid-process and correct the model when it drifts on style or terminology.

consistency per reviewer →
observed: 0 traceability
/summary

Neither produces what an agency actually sells: a scored, signed-off, defensible quality report with traceability from the raw file all the way to the final pass/fail decision.

› the solution / 03

Two AI passes for breadth. A four-stage human gate for accountability.

LQA Automation treats quality assurance as a two-pass AI funnel followed by an optional four-stage human review chain. Two agents disagreeing is the unit of evaluation. Reviewers see both opinions and act on individual errors, not whole segments.

lqa.flow / live
2 ai · 4 human · 1 score

Layer 01 · AI: dual-agent detection. Agent A1 finds errors → Agent A2 validates → hand-off to human review.
Layer 02 · Human: 4-stage review gate. R1 Reviewer 1 (accept · edit · reject) → TR Translator (accept · reject) → R2 Reviewer 2 (reconcile) → AR Arbitrator (final authority). Unassigned stages auto-skip.
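As a rough illustration of the first layer, here is a minimal Python sketch of the detect-then-validate hand-off. The `detect` and `validate` calls and the `Finding` fields are placeholders for this write-up, not the platform's actual agent interfaces.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    segment_id: int
    dimension: str           # one of the six MQM error dimensions
    severity: str            # one of the four severity levels
    note: str
    validated: bool = False  # Agent 2's verdict on Agent 1's claim

def dual_agent_pass(segments, agent_1, agent_2):
    """Agent 1 proposes errors; Agent 2 independently rules on each one.
    Disagreement is preserved so reviewers see both opinions per error."""
    findings = []
    for seg in segments:
        for f in agent_1.detect(seg):               # breadth: candidate errors
            f.validated = agent_2.validate(seg, f)  # second opinion, per error
            findings.append(f)
    return findings
```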
01

Per-error granularity

Reviewers act on individual errors, not whole segments. The segment-level decision is derived, not chosen — which forces engagement with each finding and keeps the resulting MQM score defensible.
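A derived segment status might look like the sketch below. The decision values and the pass/fail rule are assumptions for illustration; the case study only states that the segment-level outcome is computed from per-error decisions rather than chosen directly.

```python
def segment_status(error_decisions: list[str]) -> str:
    """Derive a segment-level outcome from per-error reviewer decisions.
    Illustrative rule: any accepted (confirmed, unresolved) error fails the
    segment; edited or rejected errors count as resolved."""
    return "fail" if any(d == "accept" for d in error_decisions) else "pass"
```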

02

Frozen scoring snapshot

At job creation, the active quality framework is copied verbatim into a snapshot. The score is calculated against that snapshot — never the live framework. A delivered score stays reproducible even if the rubric changes a year later.
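A minimal sketch of the pattern, assuming a JSON-style framework object and a hypothetical `db` persistence layer; the severity weights and the scoring arithmetic are illustrative stand-ins, not the platform's actual MQM formula.

```python
import json

def create_job(file_id, active_framework, db):
    """Copy the live quality framework verbatim into the job record.
    Later edits to the live framework cannot change how this job is scored."""
    job = {
        "file_id": file_id,
        "framework_snapshot": json.loads(json.dumps(active_framework)),  # deep copy
        "status": "created",
    }
    db.insert("jobs", job)  # hypothetical persistence call
    return job

def score_job(job, confirmed_errors, word_count):
    """Illustrative MQM-style score: severity-weighted penalties, normalised
    per 1,000 words, always computed against the frozen snapshot."""
    weights = job["framework_snapshot"]["severity_weights"]
    penalty = sum(weights[e["severity"]] for e in confirmed_errors)
    return max(0.0, 100.0 - penalty * 1000.0 / max(word_count, 1))
```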

03

Checkpoint-resume processing

A 6,000-segment job that gets stopped at 3,200 does not re-parse the file on resume. Only unfinished segments reprocess. Stop, resume, restart, and cancel each carry distinct semantics — first-class controls, not afterthoughts.
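A checkpoint-resume loop in this spirit could look like the sketch below; the table, column names, and `process_segment` call are assumptions, not the platform's schema.

```python
def resume_job(job_id, db, process_segment):
    """Resume a stopped job: only segments not yet finished are reprocessed,
    and the source file is never re-parsed."""
    pending = db.query(
        "SELECT id FROM segments WHERE job_id = ? AND status <> 'done'",
        (job_id,),
    )
    for (seg_id,) in pending:
        process_segment(seg_id)  # run the AI passes for this segment only
        db.execute("UPDATE segments SET status = 'done' WHERE id = ?", (seg_id,))
```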

04

Live-formula scorecard

The Excel scorecard ships with embedded COUNTIF formulas, not pre-calculated values. The customer can re-filter by severity or category and the totals stay live — making the scoring logic itself part of the deliverable.
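A scorecard of this kind can be produced with openpyxl, as in the sketch below; the sheet layout, column order, and severity names are illustrative, not the shipped template.

```python
from openpyxl import Workbook

def write_scorecard(errors, path="scorecard.xlsx"):
    """Error log plus a summary whose totals are live COUNTIF formulas,
    so the customer's own filtering in Excel stays consistent with the score."""
    wb = Workbook()
    log = wb.active
    log.title = "Errors"
    log.append(["Segment", "Dimension", "Severity", "Decision"])
    for e in errors:
        log.append([e["segment"], e["dimension"], e["severity"], e["decision"]])

    summary = wb.create_sheet("Summary")
    summary.append(["Severity", "Count"])
    for row, severity in enumerate(["neutral", "minor", "major", "critical"], start=2):
        summary.cell(row=row, column=1, value=severity)
        # A live formula, not a pre-calculated value:
        summary.cell(row=row, column=2, value=f'=COUNTIF(Errors!C:C,"{severity}")')
    wb.save(path)
```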

› the result / 04

A working platform that produces audit-grade scores at scale.

The platform is in production. PMs run jobs end to end. Linguists review through assignment links. Customers receive defensible scorecards.

/result.01
10K+
Segments per job, processed without re-parsing on resume
/result.02
~6 hr
Reviewer time reclaimed per job vs. manual MQM scoring
/result.03
0
Score recalculations forced by post-job framework edits
01

Audit-grade error detection.

Two AI agents disagreeing surfaces hallucinations a single-pass system would carry through to the score. Reviewers see Agent 1's claim, Agent 2's verdict, and act on whichever is closer to the truth — with full traceability from the raw file to the final approved target.

02

Reviewer time reclaimed without sacrificing rigour.

The AI passes do the categorisation work that used to consume a senior linguist's day. Reviewers spend their time on judgement calls — not on tagging — and the per-error UI keeps every decision granular enough to defend.

03

Reproducible, defensible customer scores.

The frozen-snapshot pattern means a score delivered today can be re-derived next year against the exact rubric in force at job creation. Customers get an audit trail that holds up to scrutiny — and the agency can edit its quality framework without invalidating past deliverables.

04

Configurable QA depth per job.

Marketing copy can run AI-only with one human reviewer. A regulated medical document can run all four human stages plus arbitration. Same platform, same workflow, no separate product line — stages auto-skip when no reviewer is assigned.
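The auto-skip behaviour can be read as a simple filter over per-job assignments, as in this sketch; the stage keys are illustrative.

```python
REVIEW_STAGES = ["reviewer_1", "translator", "reviewer_2", "arbitrator"]

def active_stages(assignments: dict[str, str | None]) -> list[str]:
    """Only stages with a reviewer assigned run; the rest auto-skip.
    The same pipeline serves an AI-only marketing job and a fully
    staffed regulated-document job."""
    return [stage for stage in REVIEW_STAGES if assignments.get(stage)]

# e.g. a light-touch marketing job:
# active_stages({"reviewer_1": "anna"}) -> ["reviewer_1"]
```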

05

Operable at scale.

A 10,000-segment job that dies at segment 7,400 resumes at segment 7,400, not segment 0. That is the difference between “this is a tool” and “this is a toy.” Stop, resume, restart, and switch-model controls are part of the core surface.

› get started

Ready to make your translation quality defensible?

Talk to us about how LQA Automation can fit into your localisation workflow. We’ll walk you through the platform, show a scorecard from a real job, and tailor a setup to your content, teams, and quality standards.

30-min walkthrough · No commitment
Custom scorecard demo · Real MQM output
Pilot-ready · Bring your own file
› more case studies

Image localization platform · AI data pipelines

Coming soon — request the deck to see them now.

See all our work