Audio data collection is a coordination problem.
Polyphony is a six-portal platform that takes recorded speech from a live multi-speaker room through media processing, two AI agents, two QC passes, annotation, and QA — without the voice data ever leaving owned infrastructure.
Speech datasets are produced by a pipeline of specialists. The pipeline is the product.
Most teams building speech AI talk about the model. The model is the wrong thing to talk about. The model is the output. The dataset is the product — and the dataset is produced by a multi-stage pipeline of specialists who each do one thing well: project managers scope the topics, speakers record in multi-speaker rooms, media reviewers clean the audio, QC reviewers gate quality, annotators tag the timeline, QA reviewers cross-check the annotations.
That pipeline is where data quality is won or lost. A handoff that loses provenance breaks the dataset. A reviewer who cannot see what an annotator saw produces inconsistent labels. A speaker payout that does not reconcile to session-level approvals erodes the speaker pool. None of those failures show up in the model architecture — they show up in the data, weeks or months later, when the model trained on it underperforms.
And voice biometric data carries obligations on top. Under GDPR, biometric data processed to uniquely identify a person is special-category personal data, the same tier as health and genetic data. Sending raw speaker audio to a third-party API for transcription or quality checks is not a tooling choice; it is a compliance decision with consequences.
“The model is the output. The dataset is the product. The pipeline that produces the dataset is the platform.”
The challenge was to build a pipeline rather than a recorder — a coordinated system of role-specific portals that hand sessions off through a single state machine, with AI agents augmenting the human review stages, and with every byte of voice data living on owned infrastructure.
Two ways teams collect speech data. Both break under audit.
Teams producing speech datasets tend to default to one of two patterns — and both break the dataset before the model ever sees it.
Tool-stitched workflow.
Zoom for recording. Google Drive for handoff. A spreadsheet for QC tracking. A generic labelling tool for annotation. Email for speaker payouts. Each step works in isolation; together they shed provenance — a flagged annotation cannot be traced back to the original session segment without a manual hunt, and a regulator asking who reviewed what at which stage gets an apologetic shrug.
Black-box recording SaaS.
A hosted recording vendor holds the raw audio. A third-party API does the transcription. Another vendor does the quality checks. The dataset moves freely between four companies, none of whom are accountable for the speaker contract or the GDPR notice. The moment a speaker requests deletion, the request fans out into emails to vendors who may or may not honour it.
Neither produces what a team building production speech AI actually needs: a single pipeline with role-specific surfaces, AI augmentation at the points where it pays off, and voice data that never leaves owned infrastructure.
Six role-specific portals on one shared pipeline, with AI agents inline.
Polyphony is not a recording tool. It is a six-portal platform that takes a session from the live recording room through a nine-stage state machine to a delivered dataset — with two AI agents augmenting the human review stages and the entire stack running on owned infrastructure.
- S1 · Live recording: mesh P2P · 1–6 speakers per room
- S2 · Media processing: ffmpeg normalise · segment · format
- S3 · QC1 + QC2: two-pass human review · agent-augmented
- S4 · Annotation + QA: timeline editor · cross-validated
- A1 · Agent #5: QC pre-check · pass/warn/fail per check
- A2 · Agent #6: annotation cross-validation · 0–100 score
- A3 · vLLM · Llama 3.3 70B: self-hosted · JSON-schema-constrained
- A4 · Eval harness: 20-sample fixtures · concordance gated
01 · Six role-specific portals, one pipeline
Project managers, speakers, media reviewers, QC reviewers (QC1 and QC2), annotators, and annotation-QA reviewers each get a portal built around the decision they are making, not a generic dashboard squeezed into a sidebar role-switcher. Sessions move between portals through a single state machine, so a handoff is a logged transition, not a Slack thread.
02 · PM-authored pipelines via a node-graph builder
Project managers compose data-collection workflows visually — drag speakers, recording rooms, processing steps, QC, annotation, and conditionals onto a canvas; wire ports together; toggle the AI agents that run at each node. Adding a new collection methodology stops being an engineering ticket and becomes a PM action.
03 · Two AI agents inline in the pipeline
Agent #5 pre-checks every session entering QC for clipping, silence, noise, speaker count, and language detection. Agent #6 cross-validates annotator output against the audio for turn-boundary, transcript fidelity, and overlap issues. Both run on a self-hosted Llama 3.3 70B with JSON-schema-constrained outputs, and both must clear an eval gate before they ship: 85% concordance for Agent #5 and 80% for Agent #6.
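To make the eval gate concrete, here is a minimal Python sketch of a concordance check. It assumes concordance means the share of fixture samples where the agent's verdict matches a human reference label, and it assumes the 85% and 80% thresholds map to Agent #5 and Agent #6 in that order. The `Fixture` type, the `agent_5` / `agent_6` keys, and the sample data are hypothetical, not Polyphony's actual harness.

```python
from dataclasses import dataclass

# Assumed mapping of the quoted thresholds to the two agents.
THRESHOLDS = {"agent_5": 0.85, "agent_6": 0.80}


@dataclass(frozen=True)
class Fixture:
    sample_id: str
    human_verdict: str  # reference label from a human reviewer
    agent_verdict: str  # label the agent produced for the same sample


def concordance(fixtures: list[Fixture]) -> float:
    """Fraction of fixtures where the agent matches the human reference."""
    if not fixtures:
        raise ValueError("no fixtures to evaluate")
    matches = sum(f.human_verdict == f.agent_verdict for f in fixtures)
    return matches / len(fixtures)


def passes_gate(agent: str, fixtures: list[Fixture]) -> bool:
    """True when the agent's concordance meets its assumed threshold."""
    return concordance(fixtures) >= THRESHOLDS[agent]


# 20 hypothetical fixtures: the agent agrees with the human on 18 of them.
fixtures = [Fixture(f"s{i:02d}", "pass", "pass") for i in range(18)]
fixtures += [Fixture(f"s{i:02d}", "fail", "warn") for i in range(18, 20)]

if __name__ == "__main__":
    print(f"concordance: {concordance(fixtures):.2f}")
    print("agent_5 ships:", passes_gate("agent_5", fixtures))
```

A gate like this would typically run in CI whenever a prompt or model version changes, so an agent that drifts below its threshold never reaches reviewers.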
04 · Self-hosted everything except payment rails
Voice biometric data is GDPR special-category — speaker trust depends on it not leaving your infrastructure. The platform self-hosts the WebRTC stack, the LLM (vLLM + Llama 3.3 70B on owned GPU), the object storage (MinIO), the database, and the observability stack. Only Stripe Connect remains external, because payment rails fundamentally cannot be self-hosted.
Data collection became infrastructure, not a vendor stack.
Polyphony runs the pipeline behind multilingual speech-AI training programmes. Sessions flow through the state machine without manual coordination. AI agents handle the categorisation work. Speakers get paid on schedule. And the voice data never leaves the team's own GPU and storage.
Pipeline handoffs became first-class state transitions.
Recorded → processing → QC1 → noise → QC2 → annotation → annotation-QA → completed. Every transition writes an immutable audit row with actor, reason, and timestamp. A session that gets stuck surfaces to admin automatically — there is no quiet limbo where work sits and ages out of memory.
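As an illustration of handoffs as state transitions, the Python sketch below models the stage list above as a strictly linear state machine that writes one audit row per transition. The class and field names are hypothetical, and a real pipeline would presumably also allow rejections and rework loops, which are left out here. The rows are immutable objects; keeping the log itself append-only is a convention in this sketch, not something the code enforces.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Linear order taken from the stage list in the text.
STAGES = ["recorded", "processing", "qc1", "noise", "qc2",
          "annotation", "annotation_qa", "completed"]
NEXT_STAGE = dict(zip(STAGES, STAGES[1:]))


@dataclass(frozen=True)  # frozen: a written audit row cannot be mutated
class AuditRow:
    session_id: str
    from_stage: str
    to_stage: str
    actor: str
    reason: str
    at: datetime


@dataclass
class Session:
    session_id: str
    stage: str = "recorded"
    audit: list[AuditRow] = field(default_factory=list)

    def advance(self, actor: str, reason: str) -> AuditRow:
        """Move to the next stage and record who did it, why, and when."""
        if self.stage not in NEXT_STAGE:
            raise ValueError(f"{self.session_id} is already {self.stage}")
        row = AuditRow(self.session_id, self.stage, NEXT_STAGE[self.stage],
                       actor, reason, datetime.now(timezone.utc))
        self.audit.append(row)
        self.stage = row.to_stage
        return row


if __name__ == "__main__":
    session = Session("sess-001")
    session.advance("media-reviewer-1", "upload complete")
    session.advance("qc1-reviewer-4", "levels normalised")
    for row in session.audit:
        print(row.from_stage, "->", row.to_stage, "by", row.actor)
```

Because the only way to change `stage` is through `advance`, every state change has a matching audit row, which is what makes a "who reviewed what at which stage" question answerable.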
AI agents took the tagging work, humans kept the judgement.
Agent #5 produces a pass/warn/fail report against every QC checklist item before the human reviewer opens the workspace. Agent #6 flags turn-boundary disagreements and overlap issues for the QA reviewer to confirm or reject. The reviewers' day stops being categorisation and starts being decisions — at a fraction of the time per session.
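One way to picture the report a reviewer sees: the Python sketch below parses an agent response and rejects anything that is not exactly one pass/warn/fail verdict per check. The check names follow the five pre-checks named earlier, but the JSON shape and the "worst verdict wins" summary are assumptions for illustration, not Polyphony's actual schema.

```python
import json

# Check names are assumed from the five pre-checks described in the text.
CHECKS = {"clipping", "silence", "noise", "speaker_count", "language"}
VERDICTS = {"pass", "warn", "fail"}


def parse_report(raw: str) -> dict[str, str]:
    """Parse an agent response and reject anything outside the expected shape."""
    report = json.loads(raw)
    if not isinstance(report, dict) or set(report) != CHECKS:
        raise ValueError("report must contain exactly the expected checks")
    bad = {k: v for k, v in report.items()
           if not isinstance(v, str) or v not in VERDICTS}
    if bad:
        raise ValueError(f"unknown verdicts: {bad}")
    return report


def overall(report: dict[str, str]) -> str:
    """Assumed summary rule: any fail gives fail, else any warn gives warn."""
    for verdict in ("fail", "warn"):
        if verdict in report.values():
            return verdict
    return "pass"


if __name__ == "__main__":
    raw = json.dumps({"clipping": "pass", "silence": "warn", "noise": "pass",
                      "speaker_count": "pass", "language": "pass"})
    report = parse_report(raw)
    print(overall(report), report)
```

Validating the response on the way in complements schema-constrained decoding: a malformed report fails loudly instead of reaching the reviewer's workspace.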
Voice biometric data never leaves owned infrastructure.
Raw audio, processed segments, transcripts, agent prompts, and agent responses all live on self-hosted MinIO buckets and a self-hosted GPU. There is no third-party API call that takes speaker audio off-premises. GDPR special-category obligations are met by architecture, not by addendum.
The Workflow Builder turned pipeline composition into a PM task.
Adding a new data-collection methodology — a different number of speakers, a different QC ordering, a different agent configuration — used to be an engineering change. With the node-graph builder, it is a PM dragging nodes onto a canvas. Engineering ships node types; PMs assemble pipelines.
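The split between engineering-owned node types and PM-assembled pipelines can be sketched as a small graph validator in Python. The node type names and the `execution_order` helper are hypothetical; the sketch only assumes that a pipeline is a directed graph whose nodes must use known types and must not form a cycle. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Node types engineering is assumed to have shipped (hypothetical names).
NODE_TYPES = {"recording_room", "media_processing", "qc",
              "annotation", "annotation_qa"}


def execution_order(nodes: dict[str, str],
                    edges: list[tuple[str, str]]) -> list[str]:
    """Validate a PM-authored graph and return one valid run order.

    nodes maps node id -> node type; edges are (upstream, downstream) pairs.
    """
    unknown = {n: t for n, t in nodes.items() if t not in NODE_TYPES}
    if unknown:
        raise ValueError(f"unknown node types: {unknown}")
    sorter = TopologicalSorter({n: set() for n in nodes})
    for upstream, downstream in edges:
        if upstream not in nodes or downstream not in nodes:
            raise ValueError(f"edge references a missing node: "
                             f"{upstream} -> {downstream}")
        sorter.add(downstream, upstream)  # downstream depends on upstream
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        raise ValueError(f"pipeline contains a cycle: {exc.args[1]}") from exc


if __name__ == "__main__":
    nodes = {"room": "recording_room", "media": "media_processing",
             "qc1": "qc", "qc2": "qc", "annot": "annotation",
             "qa": "annotation_qa"}
    edges = [("room", "media"), ("media", "qc1"), ("qc1", "qc2"),
             ("qc2", "annot"), ("annot", "qa")]
    print(execution_order(nodes, edges))
```

Under this split, a new methodology is a new `nodes` and `edges` pair saved from the canvas, while a new capability is a new entry in `NODE_TYPES`.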
Speakers got paid on schedule, through one external dependency.
Approved sessions aggregate into weekly Stripe Connect transfers. Speakers onboard through Stripe Express — KYC, tax docs, and bank routing all handled outside our scope. The platform stores no card or bank data, and every payout is reconciled against the underlying sessions in the audit log.
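The reconciliation step can be illustrated with a short Python sketch that totals approved sessions per speaker per ISO week and compares the result with the amounts actually transferred. The record shape, the per-session amounts, and the `transfers` mapping are hypothetical; in practice the transferred amounts would come from the payment provider's records, and no Stripe API call is shown here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

PayoutKey = tuple[str, int, int]  # (speaker id, ISO year, ISO week)


@dataclass(frozen=True)
class ApprovedSession:
    session_id: str
    speaker_id: str
    approved_on: date
    amount_cents: int  # integer cents avoid float rounding in money sums


def weekly_payouts(sessions: list[ApprovedSession]) -> dict[PayoutKey, int]:
    """Total owed per speaker per ISO week, from approved sessions only."""
    totals: dict[PayoutKey, int] = defaultdict(int)
    for s in sessions:
        year, week, _ = s.approved_on.isocalendar()
        totals[(s.speaker_id, year, week)] += s.amount_cents
    return dict(totals)


def unreconciled(payouts: dict[PayoutKey, int],
                 transfers: dict[PayoutKey, int]) -> dict[PayoutKey, tuple[int, int]]:
    """Keys where the transferred amount differs from the approved total."""
    keys = payouts.keys() | transfers.keys()
    return {k: (payouts.get(k, 0), transfers.get(k, 0))
            for k in keys if payouts.get(k, 0) != transfers.get(k, 0)}


if __name__ == "__main__":
    sessions = [
        ApprovedSession("sess-001", "spk-1", date(2025, 3, 3), 2500),
        ApprovedSession("sess-002", "spk-1", date(2025, 3, 5), 2500),
        ApprovedSession("sess-003", "spk-2", date(2025, 3, 4), 3000),
    ]
    owed = weekly_payouts(sessions)
    sent = {("spk-1", 2025, 10): 5000}  # spk-2's transfer is missing
    print(unreconciled(owed, sent))
```

An empty result from `unreconciled` means every transfer traces back to approved sessions and every approved session has been paid.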
Need a speech-data pipeline that respects data sovereignty?
Talk to us about how Polyphony can plug into your model training programme. We will walk you through the six portals, run a sample session through the live pipeline, and tailor a deployment to the languages, methodologies, and compliance constraints your team actually operates under.