Case study · /04

Audio data collection is a coordination problem.

Polyphony is a six-portal platform that takes recorded speech from a live multi-speaker room through media processing, two AI agents, two QC passes, annotation, and QA — without the voice data ever leaving owned infrastructure.

Six role-specific portals on one shared pipeline

Nine pipeline stages from recording to delivery

Self-hosted: WebRTC mesh + Llama 3.3 70B on owned GPU

GDPR: voice biometric data treated as special category

The Challenge · 01

Speech datasets are produced by a pipeline of specialists. The pipeline is the product.

Most teams building speech AI talk about the model. The model is the wrong thing to talk about. The model is the output. The dataset is the product — and the dataset is produced by a multi-stage pipeline of specialists who each do one thing well: project managers scope the topics, speakers record in multi-speaker rooms, media reviewers clean the audio, QC reviewers gate quality, annotators tag the timeline, QA reviewers cross-check the annotations.

That pipeline is where data quality is won or lost. A handoff that loses provenance breaks the dataset. A reviewer who cannot see what an annotator saw produces inconsistent labels. A speaker payout that does not reconcile to session-level approvals erodes the speaker pool. None of those failures show up in the model architecture — they show up in the data, weeks or months later, when the model trained on it underperforms.

And voice biometric data carries obligations on top. Under GDPR it is special-category personal data — the same tier as health and genetic information. Sending raw speaker audio to a third-party API for transcription or quality checks is not a tooling choice; it is a compliance decision with consequences.

“The model is the output. The dataset is the product. The pipeline that produces the dataset is the platform.”

— the design constraint

The challenge was to build a pipeline rather than a recorder — a coordinated system of role-specific portals that hand sessions off through a single state machine, with AI agents augmenting the human review stages, and with every byte of voice data living on owned infrastructure.

The Problem · 02

Two ways teams collect speech data. Both break under audit.

Teams producing speech datasets tend to default to one of two patterns — and both break the dataset before the model ever sees it.

Failure 01

Tool-stitched workflow.

Zoom for recording. Google Drive for handoff. A spreadsheet for QC tracking. A generic labelling tool for annotation. Email for speaker payouts. Each step works in isolation; together they shed provenance — a flagged annotation cannot be traced back to the original session segment without a manual hunt, and a regulator asking who reviewed what at which stage gets an apologetic shrug.

Observed: lost provenance

Failure 02

Black-box recording SaaS.

A hosted recording vendor holds the raw audio. A third-party API does the transcription. Another vendor does the quality checks. The dataset moves freely between four companies, none of whom are accountable for the speaker contract or the GDPR notice. The moment a speaker requests deletion, the request fans out into emails to vendors who may or may not honour it.

Observed: sovereignty lost

Neither produces what a team building production speech AI actually needs: a single pipeline with role-specific surfaces, AI augmentation at the points where it pays off, and voice data that never leaves owned infrastructure.

The Solution · 03

Six role-specific portals on one shared pipeline, with AI agents inline.

Polyphony is not a recording tool. It is a six-portal platform that takes a session from the live recording room through a nine-stage state machine to a delivered dataset — with two AI agents augmenting the human review stages and the entire stack running on owned infrastructure.

polyphony.pipeline / live · 9 stages · 6 portals · 2 agents

Pipeline

S1 · Live recording: mesh P2P, 1–6 speakers per room
S2 · Media processing: ffmpeg normalise, segment, format
S3 · QC1 + QC2: two-pass human review, agent-augmented
S4 · Annotation + QA: timeline editor, cross-validated

AI Layer

A1 · Agent #5: QC pre-check, pass/warn/fail per check
A2 · Agent #6: annotation cross-validation, 0–100 score
A3 · vLLM · Llama 3.3 70B: self-hosted, JSON-schema-constrained
A4 · Eval harness: 20-sample fixtures, concordance gated

01 · Six role-specific portals, one pipeline

Project managers, speakers, media reviewers, QC1, QC2, annotators, and annotation-QA each get a portal built around the decision they are making — not a generic dashboard squeezed into a sidebar role-switcher. Sessions move between portals through a single state machine, so a handoff is a logged transition, not a Slack thread.

02 · PM-authored pipelines via a node-graph builder

Project managers compose data-collection workflows visually — drag speakers, recording rooms, processing steps, QC, annotation, and conditionals onto a canvas; wire ports together; toggle the AI agents that run at each node. Adding a new collection methodology stops being an engineering ticket and becomes a PM action.

03 · Two AI agents inline in the pipeline

Agent #5 pre-checks every session entering QC for clipping, silence, noise, speaker count, and language detection. Agent #6 cross-validates annotator output against the audio for turn-boundary, transcript fidelity, and overlap issues. Both run on a self-hosted Llama 3.3 70B with JSON-schema-constrained outputs, and both are eval-gated before they ship, to 85% and 80% concordance respectively.
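A minimal sketch of what an eval gate of this kind could look like. It assumes that "concordance" means simple per-sample agreement between an agent's label and a human reference label on a fixed fixture set; the source does not define the metric, and the function names here are hypothetical.

```python
def concordance(agent_labels, human_labels):
    """Fraction of fixture samples where the agent agrees with the human label.

    Assumption: plain exact-match agreement; the real harness may use
    a different metric.
    """
    if len(agent_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == h for a, h in zip(agent_labels, human_labels))
    return matches / len(human_labels)


def passes_gate(agent_labels, human_labels, threshold):
    """True when agreement on the fixture set meets or exceeds the threshold."""
    return concordance(agent_labels, human_labels) >= threshold


# Hypothetical 20-sample fixture: the agent disagrees on the last 3 samples.
human = ["pass"] * 20
agent = ["pass"] * 17 + ["warn"] * 3
print(concordance(agent, human), passes_gate(agent, human, 0.85))
```

With a 20-sample fixture, each disagreement costs five percentage points, so an 85% threshold tolerates at most three disagreements.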

04 · Self-hosted everything except payment rails

Voice biometric data is GDPR special-category — speaker trust depends on it not leaving your infrastructure. The platform self-hosts the WebRTC stack, the LLM (vLLM + Llama 3.3 70B on owned GPU), the object storage (MinIO), the database, and the observability stack. Only Stripe Connect remains external, because payment rails fundamentally cannot be self-hosted.

The Result · 04

Data collection became infrastructure, not a vendor stack.

Polyphony runs the pipeline behind multilingual speech-AI training programmes. Sessions flow through the state machine without manual coordination. AI agents handle the categorisation work. Speakers get paid on schedule. And the voice data never leaves the team's own GPU and storage.

result.01 · Six role-specific portals shipped on a single shared pipeline

result.02 · Two AI agents inline in QC and annotation review, eval-gated

result.03 · Zero third-party APIs touching raw voice data

Pipeline handoffs became first-class state transitions.

Recorded → processing → QC1 → noise → QC2 → annotation → annotation-QA → completed. Every transition writes an immutable audit row with actor, reason, and timestamp. A session that gets stuck surfaces to admin automatically — there is no quiet limbo where work sits and ages out of memory.
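The transition rule described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the platform's implementation: the state names follow the sequence quoted above, while the strictly linear ordering, the class and field names, and the idle-time rule for "stuck" sessions are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# State names follow the sequence quoted in the text. Assumption: each state
# may only advance to the next one (no reject or rework loops modelled).
STATES = ["recorded", "processing", "qc1", "noise",
          "qc2", "annotation", "annotation_qa", "completed"]
NEXT_STATE = dict(zip(STATES, STATES[1:]))


@dataclass(frozen=True)
class AuditRow:
    """One immutable record per transition: who, why, and when."""
    session_id: str
    from_state: str
    to_state: str
    actor: str
    reason: str
    at: datetime


class Session:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.state = STATES[0]
        self.audit: list[AuditRow] = []  # append-only by convention
        self.last_changed = datetime.now(timezone.utc)

    def advance(self, to_state: str, actor: str, reason: str, now=None) -> AuditRow:
        """Move to the next state and record an audit row, or raise."""
        if NEXT_STATE.get(self.state) != to_state:
            raise ValueError(f"illegal transition {self.state} -> {to_state}")
        now = now or datetime.now(timezone.utc)
        row = AuditRow(self.session_id, self.state, to_state, actor, reason, now)
        self.audit.append(row)
        self.state = to_state
        self.last_changed = now
        return row


def stuck_sessions(sessions, now: datetime, max_idle: timedelta):
    """Sessions that are not completed and have not moved for longer than max_idle."""
    return [s for s in sessions
            if s.state != "completed" and now - s.last_changed > max_idle]
```

Keeping the audit write inside `advance` means there is no code path that changes state without leaving a row behind, which is the property the audit trail depends on.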

AI agents took the tagging work, humans kept the judgement.

Agent #5 produces a pass/warn/fail report against every QC checklist item before the human reviewer opens the workspace. Agent #6 flags turn-boundary disagreements and overlap issues for the QA reviewer to confirm or reject. The reviewers' day stops being categorisation and starts being decisions — at a fraction of the time per session.
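To make the pre-check report concrete, here is a hedged sketch of how a pass/warn/fail report could be validated and rolled up before a reviewer sees it. The check names mirror the five checks listed for Agent #5; the JSON field names, the flat report shape, and the "worst status wins" rollup are assumptions for illustration only.

```python
import json

# Check names mirror the five pre-checks named in the text; the keys
# themselves are hypothetical.
CHECKS = ("clipping", "silence", "noise", "speaker_count", "language")
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def parse_report(raw: str) -> dict:
    """Parse the agent's JSON output and reject anything outside the expected shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("report must be a JSON object")
    missing = [c for c in CHECKS if c not in data]
    if missing:
        raise ValueError(f"missing checks: {missing}")
    for check in CHECKS:
        if data[check] not in SEVERITY:
            raise ValueError(f"bad status for {check}: {data[check]!r}")
    return {c: data[c] for c in CHECKS}


def overall(report: dict) -> str:
    """Roll the per-check statuses up to the most severe one (assumed rule)."""
    return max(report.values(), key=SEVERITY.__getitem__)


raw = json.dumps({"clipping": "pass", "silence": "warn", "noise": "pass",
                  "speaker_count": "pass", "language": "pass"})
print(overall(parse_report(raw)))
```

Validating the output again on the application side is a defensive choice: even with schema-constrained decoding, the consumer should not assume every report is well-formed.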

Voice biometric data never leaves owned infrastructure.

Raw audio, processed segments, transcripts, agent prompts, and agent responses all live on self-hosted MinIO buckets and a self-hosted GPU. There is no third-party API call that takes speaker audio off-premises. GDPR special-category obligations are met by architecture, not by addendum.

The Workflow Builder turned pipeline composition into a PM task.

Adding a new data-collection methodology — a different number of speakers, a different QC ordering, a different agent configuration — used to be an engineering change. With the node-graph builder, it is a PM dragging nodes onto a canvas. Engineering ships node types; PMs assemble pipelines.
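One way to picture what sits behind such a builder: the canvas is a graph of typed nodes and directed edges that has to be validated before it can run. The sketch below uses Python's standard `graphlib` to derive an execution order. The node type names, the data shapes, and the assumption that a workflow is acyclic are illustrative and not taken from the platform.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical node types, loosely following the items the text says a PM can drag in.
NODE_TYPES = {"speakers", "recording_room", "processing",
              "qc", "annotation", "conditional"}


def execution_order(nodes: dict[str, str], edges: list[tuple[str, str]]) -> list[str]:
    """Validate a workflow graph and return its nodes in a runnable order.

    nodes maps node id -> node type; edges are (source id, target id) pairs.
    Assumption: workflows are acyclic, so any cycle is treated as an error.
    """
    for node_id, kind in nodes.items():
        if kind not in NODE_TYPES:
            raise ValueError(f"unknown node type for {node_id}: {kind}")
    predecessors = {node_id: set() for node_id in nodes}
    for src, dst in edges:
        if src not in nodes or dst not in nodes:
            raise ValueError(f"edge references unknown node: {src} -> {dst}")
        predecessors[dst].add(src)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError("workflow graph contains a cycle") from exc


nodes = {"room": "recording_room", "media": "processing",
         "qc1": "qc", "annot": "annotation"}
edges = [("room", "media"), ("media", "qc1"), ("qc1", "annot")]
print(execution_order(nodes, edges))
```

Under this split, engineering owns the set of node types and the validation rules, while a PM only supplies `nodes` and `edges`.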

Speakers got paid on schedule, through one external dependency.

Approved sessions aggregate into weekly Stripe Connect transfers. Speakers onboard through Stripe Express — KYC, tax docs, and bank routing all handled outside our scope. The platform stores no card or bank data, and every payout is reconciled against the underlying sessions in the audit log.
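The reconciliation step can be sketched without touching any payment API: group approved sessions by speaker and week, then compare the totals with what was actually transferred. The record fields (`speaker_id`, `approved_on`, `amount`), the use of ISO weeks, and the shape of the transfer ledger are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal


def weekly_totals(approved_sessions):
    """Sum approved session amounts per (speaker, ISO year, ISO week)."""
    totals = defaultdict(Decimal)  # Decimal() starts each bucket at 0
    for s in approved_sessions:
        iso = s["approved_on"].isocalendar()
        totals[(s["speaker_id"], iso[0], iso[1])] += s["amount"]
    return dict(totals)


def reconcile(transfers, approved_sessions):
    """Return the (speaker, year, week) keys where transfer and session totals differ."""
    expected = weekly_totals(approved_sessions)
    keys = set(expected) | set(transfers)
    return sorted(k for k in keys
                  if expected.get(k, Decimal(0)) != transfers.get(k, Decimal(0)))


sessions = [
    {"speaker_id": "spk1", "approved_on": date(2025, 1, 6), "amount": Decimal("12.50")},
    {"speaker_id": "spk1", "approved_on": date(2025, 1, 8), "amount": Decimal("7.50")},
    {"speaker_id": "spk1", "approved_on": date(2025, 1, 13), "amount": Decimal("10.00")},
]
transfers = {("spk1", 2025, 2): Decimal("20.00"), ("spk1", 2025, 3): Decimal("10.00")}
print(reconcile(transfers, sessions))
```

`Decimal` is used instead of floats so that currency sums compare exactly. An empty result means every transfer is backed by approved sessions and every approved session has been paid.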

Built with: Next.js · Django · Celery · PostgreSQL · MinIO · WebRTC · vLLM · Llama 3.3

Get Started

Need a speech-data pipeline that respects data sovereignty?

Talk to us about how Polyphony can plug into your model training programme. We will walk you through the six portals, run a sample session through the live pipeline, and tailor a deployment to the languages, methodologies, and compliance constraints your team actually operates under.

30-min walkthrough · No commitment

Live pipeline · Bring a sample session

Self-hosted from day one
