Voice M2 Sub-300ms Pipeline — Executive Summary

Memo: VOICE-M2-TECH-MEMO-v1-2026-04-21

Date: 2026-04-21 | Author: SCOUT / TITAN

---

The Problem

Silent Infinity's M1 voice pipeline runs at ~750ms p50 first-audio-byte. The 2026 industry standard for conversational voice AI is sub-300ms. The resulting gap of roughly 450ms is not marginal: it is the difference between a product that feels like a phone call and one that feels like a chatbot.

The Fix

Replace the six-hop M1 chain (Transcribe Streaming + Claude Sonnet + Polly + SSE) with a three-hop M2 chain built on Amazon Nova Sonic, the unified speech-to-speech model on Amazon Bedrock. Nova Sonic collapses STT, LLM, and TTS into a single bidirectional stream, eliminating three service boundaries along with their queuing and serialization overhead. At the same time, the transport switches from SSE (unidirectional, so no barge-in) to WebSocket (full-duplex, so the client can interrupt playback mid-reply).
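The barge-in behaviour that a full-duplex transport enables can be sketched without any network code. The example below is a minimal illustration only: `play_reply`, `duplex_turn`, and the in-memory `asyncio.Queue` standing in for inbound user audio are all hypothetical names, not part of the Nova Sonic or Bedrock APIs. It shows the one property SSE cannot provide, which is that inbound audio can cancel outbound playback within the same session.

```python
import asyncio


async def play_reply(chunks, sent, pace_s=0.01):
    """Send assistant audio chunk by chunk; pace_s stands in for playback pacing."""
    for chunk in chunks:
        sent.append(chunk)
        await asyncio.sleep(pace_s)


async def duplex_turn(reply_chunks, user_audio, sent):
    """Play a reply while listening; cancel playback if the user starts speaking.

    user_audio is a hypothetical asyncio.Queue fed by the inbound half of the
    connection. Returns "barge-in" or "completed".
    """
    playback = asyncio.create_task(play_reply(reply_chunks, sent))
    listener = asyncio.create_task(user_audio.get())
    done, _ = await asyncio.wait(
        {playback, listener}, return_when=asyncio.FIRST_COMPLETED
    )
    barged_in = listener in done and not playback.done()
    # Stop whichever side is still running, then wait for it to unwind.
    for task in (playback, listener):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, listener, return_exceptions=True)
    return "barge-in" if barged_in else "completed"
```

In a real deployment the queue would be fed by the WebSocket receive loop and `sent.append` would be a socket send; the cancellation logic is the part this sketch is meant to illustrate.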

M2 Candidate Verdict

Nova Sonic wins. Sesame CSM-1B is a high-quality TTS model, not a speech-to-speech model: it still requires Transcribe and an LLM upstream of it, which preserves the multi-hop latency problem. Its real-time factor (RTF) of 0.28x on an A100-class GPU makes sub-300ms first-audio-byte impractical unless streaming optimizations, which have not yet been investigated, close the gap. CSM-1B belongs in M3 as a fine-tunable voice persona layer, not as the M2 backbone.
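The RTF argument can be made concrete with a short calculation. This sketch assumes the usual definition of real-time factor (synthesis time divided by audio duration) and uses a hypothetical 3-second reply and 0.5-second streaming chunk; neither duration comes from the memo.

```python
def synthesis_time_ms(audio_seconds: float, rtf: float = 0.28) -> float:
    """Time to synthesize `audio_seconds` of speech at the given real-time factor.

    Assumes RTF = synthesis time / audio duration, so lower is faster.
    """
    if audio_seconds < 0 or rtf < 0:
        raise ValueError("audio_seconds and rtf must be non-negative")
    return audio_seconds * rtf * 1000.0


# Hypothetical durations, chosen only for illustration.
whole_reply_ms = synthesis_time_ms(3.0)   # wait for the full utterance
first_chunk_ms = synthesis_time_ms(0.5)   # wait for the first streamed chunk only
```

Under these assumptions, synthesizing the whole reply before sending any audio costs about 840ms in TTS alone, before STT and LLM time are added. Streaming in 0.5-second chunks would cut the TTS share to about 140ms, which is why the verdict hinges on streaming work that has not been done.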

Migration in 60 Days

Week 1: WebSocket echo endpoint validates infra, zero M1 impact
Weeks 2–3: Nova Sonic adapter wired, internal load test, measure real p50/p95
Weeks 4–5: 5% canary to production users; auto-rollback alarms live
Week 6+: 100% Nova Sonic, M1 retired
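The p50/p95 measurement and the auto-rollback alarm in the plan above could be checked with something as simple as the following. It is a sketch under assumptions: the 300ms p50 budget comes from the target stated in this memo, while the 600ms p95 budget and the function names are hypothetical placeholders that would need to be set from the internal load-test results.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def should_roll_back(latencies_ms, p50_budget_ms=300.0, p95_budget_ms=600.0):
    """Return True if the canary breaches either latency budget.

    p95_budget_ms is a hypothetical threshold, not a figure from the memo.
    """
    p50 = percentile(latencies_ms, 50)
    p95 = percentile(latencies_ms, 95)
    return p50 > p50_budget_ms or p95 > p95_budget_ms
```

Checking p95 alongside p50 matters for the canary because a healthy median can hide a slow tail that a small share of production users would still experience.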

Top 3 Recommendations

1. Commit to Nova Sonic now. The unified architecture is the only path to sub-300ms without a GPU fleet. The $2,252/month cost delta at 50K turns/month (about $0.045 per turn) is justified by the product quality improvement, and Nova Sonic is 14x cheaper than the OpenAI Realtime API at equivalent volume.

2. Build speculative end-of-turn detection in Phase 2. Triggering the Nova Sonic stream before the user has fully stopped speaking, in the spirit of speculative decoding (Leviathan et al., 2023) applied to VAD, recovers 30–60ms of perceived latency. That is enough to turn a borderline result into a clean sub-300ms p50.

3. Preserve CSM-1B for M3 voice persona work. Its Apache 2.0 license and fine-tuning capability make it the right tool for custom voice characters and multilingual expansion — not for latency-critical unified inference.
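The speculative end-of-turn idea in recommendation 2 can be sketched as a small state machine over per-frame VAD decisions. Everything here is an illustrative assumption rather than the memo's design: the function name, the frame-count thresholds, and the event labels are hypothetical. The point is the speculate-then-verify shape, where the downstream stream starts early and is cancelled if the user resumes speaking.

```python
def end_of_turn_events(frames, speculate_after=3, confirm_after=6):
    """Turn per-frame VAD flags (True = speech) into end-of-turn events.

    Emits ("speculate", i) once silence has lasted `speculate_after` frames,
    ("cancel", i) if speech resumes before confirmation, and ("commit", i)
    once silence has lasted `confirm_after` frames. Thresholds are hypothetical.
    """
    if not 0 < speculate_after < confirm_after:
        raise ValueError("require 0 < speculate_after < confirm_after")
    events = []
    silent_run = 0
    speculating = False
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            if speculating:
                # The user kept talking: discard the speculative response.
                events.append(("cancel", i))
                speculating = False
            silent_run = 0
            heard_speech = True
            continue
        if not heard_speech:
            continue  # ignore leading silence before the turn begins
        silent_run += 1
        if silent_run == speculate_after:
            events.append(("speculate", i))
            speculating = True
        if silent_run == confirm_after:
            events.append(("commit", i))
            break
    return events
```

The head start gained is `(confirm_after - speculate_after)` frames. With a hypothetical 20ms frame, the defaults above would start the response 60ms earlier than waiting for confirmation, at the cost of occasionally cancelled work when the pause was only a mid-sentence hesitation.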

---

Full memo: F:/TITAN/plans/advisors/VOICE-M2-TECH-MEMO-v1-2026-04-21.md