Author: SCOUT (TITAN Research Agent)
Date: 2026-04-21
Classification: Internal Engineering — Silent Infinity
Status: Pre-Implementation Review
---
2. Current State: M1 Architecture
8. Cost Model
9. Quality Measurement Framework
12. M3 Forward Look
13. Recommendations
14. References
---
Silent Infinity's current voice pipeline (M1) targets a p50 latency of 750ms end-to-end. This places it in the second tier of 2026 conversational voice products, where the industry standard for real-time conversational AI has converged below 300ms first-audio-byte. This memo specifies the complete upgrade path from M1 to M2, evaluating two architectural candidates — AWS Nova Sonic (unified speech-to-speech on Bedrock) and Sesame CSM-1B (open-weights, self-hosted) — against a 300ms p50 target.
Primary recommendation: AWS Nova Sonic is the correct M2 backbone. Its unified STT+LLM+TTS architecture eliminates three of the five latency-contributing hops present in M1, is available today on Amazon Bedrock in us-east-1, and requires no GPU fleet management. Sesame CSM-1B remains a viable Phase 3 fallback if Nova Sonic does not meet the sub-300ms target reliably at scale, but its real-time factor (RTF ~0.28x on an A100-class GPU, meaning ~2.8s to generate 10s of audio) presents a structural challenge without significant streaming optimization investment.
---
The M1 pipeline deployed at silentinfinity.com/voice consists of:
getUserMedia() mic capture, chunked audio via fetch POST
1. Hop count: 6 discrete service boundaries between mic and first audio byte at speaker.
2. Non-streaming handoff: The Lambda must accumulate enough transcript from Transcribe before it can invoke Bedrock. This introduces buffering latency not visible in individual service SLAs.
3. SSE vs. WebSocket: SSE is unidirectional; the client cannot send mid-turn audio updates without opening a new POST. This prevents barge-in and forces end-of-utterance detection at the client, which adds 100–200ms buffer before the POST is dispatched [Ably, 2024].
4. p50 target: 750ms. This is approximately 2.5x the 2026 industry standard.
---
| Hop | Component | p50 (ms) | Notes |
|-----|-----------|----------|-------|
| 1 | Mic capture + VAD buffer | 80 | Browser-side; end-of-utterance detection |
| 2 | Network: browser → API GW | 30 | US East, typical RTT ~25–40ms |
| 3 | API Gateway + Lambda invoke | 25 | Warm Lambda; provisioned concurrency |
| 4 | Transcribe Streaming first token | 180 | Partial transcript confidence threshold |
| 5 | Bedrock Claude Sonnet first token | 250 | Time to first output token (TTFT) |
| 6 | Polly Neural/Generative first chunk | 120 | Synthesis start on first sentence |
| 7 | Network: Lambda → browser | 30 | SSE stream first byte |
| 8 | Browser audio buffer + playback | 35 | Web Audio API scheduling |
| TOTAL | | ~750ms | p50 estimate |
The elimination of discrete STT and TTS hops is the core architectural insight. A unified speech-to-speech model processes audio input and streams audio output within a single model session, collapsing hops 4, 5, and 6 into one.
| Hop | Component | p50 Target (ms) | Change from M1 |
|-----|-----------|-----------------|----------------|
| 1 | Mic capture + VAD buffer | 50 | Reduced via WebSocket barge-in |
| 2 | Network: browser → API GW WS | 20 | Persistent connection, no handshake overhead |
| 3 | API GW WS → Lambda / ECS | 10 | WS frame dispatch, sub-ms per frame |
| 4 | Nova Sonic unified model TTFA | 150 | Time-to-first-audio-byte (vendor claim) |
| 5 | Network: response → browser | 20 | WS frame, persistent connection |
| 6 | Browser audio buffer + playback | 30 | PCM/opus decode, minimal buffering |
| TOTAL | | ~280ms | -470ms vs M1 |
Key insight: The dominant term in M2 is the model TTFA (hop 4). Nova Sonic's advertised sub-200ms real-time audio processing latency is consistent with a 150ms p50 TTFA target [AWS Nova Sonic announcement, 2025]. If this claim holds under production load, the 300ms p50 is achievable. Speculative latency masking — triggering the model call before the user fully completes their utterance, using a draft prediction of end-of-turn — can shave a further 30–60ms of perceived latency by overlapping model prefill with the last frames of user speech [Leviathan et al., ICML 2023; PredGen framework, 2025].
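The two hop budgets can be re-added mechanically to check the totals and to see which hop dominates each pipeline. The sketch below is only a restatement of the tables: the dictionary keys are illustrative names that do not appear in the memo, and the 300 ms default is the M2 p50 target.

```python
# Hop budgets copied from the M1 and M2 tables (p50, milliseconds).
# Key names are illustrative labels, not identifiers from the codebase.
M1_HOPS_MS = {
    "mic_capture_vad": 80,
    "network_browser_to_apigw": 30,
    "apigw_lambda_invoke": 25,
    "transcribe_first_token": 180,
    "bedrock_claude_ttft": 250,
    "polly_first_chunk": 120,
    "network_lambda_to_browser": 30,
    "browser_audio_buffer": 35,
}
M2_HOPS_MS = {
    "mic_capture_vad": 50,
    "network_browser_to_apigw_ws": 20,
    "apigw_ws_to_compute": 10,
    "nova_sonic_ttfa": 150,
    "network_response_to_browser": 20,
    "browser_audio_buffer": 30,
}


def budget_total(hops):
    """Sum of all per-hop p50 estimates, in milliseconds."""
    return sum(hops.values())


def dominant_hop(hops):
    """Name of the hop with the largest p50 contribution."""
    return max(hops, key=hops.get)


def headroom_ms(hops, target_ms=300):
    """Milliseconds left under the target (negative means over budget)."""
    return target_ms - budget_total(hops)


if __name__ == "__main__":
    for name, hops in (("M1", M1_HOPS_MS), ("M2", M2_HOPS_MS)):
        print(name, budget_total(hops), dominant_hop(hops), headroom_ms(hops))
```

Summing p50 values per hop is itself an approximation, since percentiles do not add exactly; the totals are a planning budget rather than a prediction.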
The switch from SSE to WebSocket is not cosmetic. Empirical latency comparisons show WebSocket's persistent full-duplex connection eliminates the TCP/TLS handshake on every turn (saving ~50–80ms in cold-turn overhead) and enables continuous audio streaming rather than request/response batching [Ably, 2024; RxDB WebSocket analysis, 2024]. More importantly, WebSocket lets the client signal an interruption mid-response on the same connection (barge-in handling), which SSE cannot do because its stream carries data only from server to client. For conversational voice, barge-in is not a feature; it is a correctness requirement. A user who interrupts an AI response expects the audio to stop. With SSE, the only mechanism is closing and reopening the HTTP connection.
---
Released: April 2025 (Nova Sonic); December 2025 (Nova 2 Sonic with 1M token context, polyglot voices)
Architecture: Unified speech-to-speech model. Accepts continuous audio input and produces audio output within a single bidirectional streaming session. Internally handles ASR, language understanding, response generation, and speech synthesis as a unified forward pass rather than a pipeline. The API is exposed via InvokeModelWithBidirectionalStream over HTTP/2 on Amazon Bedrock Runtime — functionally equivalent to a WebSocket in behavior, though the underlying transport is HTTP/2 multiplexed streams [AWS Nova docs, 2025].
Protocol: Bidirectional streaming with JSON events. Three outbound event streams: (1) ASR transcription tokens, (2) tool-use events, (3) audio output chunks. Client sends audio input events plus system prompt events. The session lifecycle is: establish stream → send configuration → stream audio → receive interleaved audio response.
Latency: AWS claims "real-time, low-latency multi-turn conversations" with "sub-200ms round-trip audio processing." No official p50/p95 SLA is published. The Caylent integration blog reports that "Nova Sonic delivered responses faster than competing models" in their testing [Caylent, 2025]. The DEV Community AWS Builders post corroborates sub-300ms end-to-end for their support voice assistant use case.
Context window: 300K tokens (Nova Sonic v1); 1M tokens (Nova 2 Sonic).
Languages: English (US, UK) at launch. Spanish added June 2025. French, Italian, German added July 2025.
Tool use: Native function calling within the streaming session — enables RAG, MCP server integration, and real-time knowledge grounding without breaking the audio stream.
Availability: us-east-1 at launch. Check current Bedrock model availability for additional regions before Phase 3 production rollout.
Pricing (as of early 2026):
Strengths for M2:
Weaknesses:
Released: March 2025
Architecture: Llama backbone (~1B parameters) + Mimi RVQ audio decoder. Input: text + optional audio context. Output: RVQ audio codes decoded to 24kHz PCM. Note: CSM-1B is a TTS model, not a speech-to-speech model. It generates speech from text; it does not accept audio input for understanding. A conversational voice pipeline using CSM-1B therefore still requires a separate STT layer (e.g., Whisper, Deepgram, or Amazon Transcribe) and a separate LLM layer. This is a three-component pipeline, not a unified one [Sesame CSM-1B HuggingFace model card, 2025].
Inference speed: RTF of approximately 0.28x on a high-end GPU (NVIDIA A100/4090 class), per community benchmarks. This means 10 seconds of audio requires ~2.8 seconds of generation time. This is materially incompatible with a sub-300ms first-audio-byte target unless (a) audio is generated in very short chunks with aggressive streaming, or (b) speculative decoding is applied. The csm-streaming community fork demonstrates streaming generation, but the published first-chunk latency data does not yet show that it meets this target. A GitHub issue thread on realtime voice agents [SesameAILabs/csm issue #78, 2025] explicitly notes "slow inference time for real-time voice agent applications."
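The RTF arithmetic can be made explicit with a small sketch. It assumes, as a simplification not stated in the source, that generation time scales linearly with audio duration at a constant RTF; real first-chunk latency also includes text prefill, decoder start-up, and queueing, which are lumped here into a hypothetical `overhead_ms` parameter.

```python
def generation_time_s(audio_seconds: float, rtf: float) -> float:
    """Wall-clock seconds needed to synthesize `audio_seconds` of audio."""
    return audio_seconds * rtf


def first_chunk_latency_ms(chunk_ms: float, rtf: float, overhead_ms: float = 0.0) -> float:
    """Estimated time until a first chunk holding `chunk_ms` of audio is ready."""
    return chunk_ms * rtf + overhead_ms


def max_first_chunk_ms(budget_ms: float, rtf: float, overhead_ms: float = 0.0) -> float:
    """Largest first chunk (ms of audio) that fits inside `budget_ms`."""
    return max(0.0, (budget_ms - overhead_ms) / rtf)


RTF = 0.28  # community-benchmark figure quoted in the text

if __name__ == "__main__":
    print(round(generation_time_s(10.0, RTF), 2))        # full 10 s utterance
    print(round(first_chunk_latency_ms(500.0, RTF), 1))  # 500 ms first chunk
    print(round(max_first_chunk_ms(200.0, RTF), 1))      # chunk size for a 200 ms budget
```

Under this linear model the first chunk has to stay well under one second of audio to fit a 200ms budget, which is why option (a) depends on short-chunk streaming rather than full-sequence generation.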
Hardware: Requires CUDA GPU. Minimum practical deployment: ml.g5.xlarge (NVIDIA A10G, 24 GB VRAM) at ~$1.41/hour on SageMaker. Note: AWS Fargate does not support GPU instances — GPU workloads require EC2 or SageMaker Inference Endpoints.
License: Apache 2.0 — fully open for commercial use.
Strengths for M2:
Weaknesses:
Verdict on CSM-1B for M2: CSM-1B is architecturally misaligned with the M2 objective. It is an excellent TTS component for a custom voice persona project, but it is not a speech-to-speech unified model. Using it in M2 preserves the multi-hop pipeline latency problem that Nova Sonic eliminates. It belongs in an M2.5 custom voice persona initiative where the goal is voice character consistency, not latency.
---
The following table provides control data points for calibrating the M2 target. All figures are industry-reported as of Q1 2026.
| System | Architecture | First-Audio-Byte Latency | Pricing | Notes |
|--------|-------------|--------------------------|---------|-------|
| OpenAI Realtime API | Unified GPT-4o-based S2S | ~300–500ms | $0.06/min audio in, $0.24/min audio out | Most expensive; strong quality; WebSocket-native [OpenAI docs, 2025] |
| Google Gemini 3.1 Flash Live | Unified multimodal S2S | ~300–500ms steady-state; 960ms TTFT | $0.018/min audio out | Cheapest audio output; TTFT higher than Nova Sonic claim [TokenMix, 2026] |
| Deepgram Voice Agent (Flux) | STT (Nova-3) + LLM + TTS (Aura-2) | Sub-300ms end-of-turn | $4.50/hour bundled | Not a unified model; achieves sub-300ms through optimized pipeline; $4.50/hr ≈ $0.075/min [Deepgram, 2026] |
| ElevenLabs Conversational | Proprietary S2S | ~300–500ms | Variable per character | Strong voice quality; closed-source |
| Nova Sonic (target) | Unified AWS S2S | Sub-200ms claimed TTFA | ~$0.017/min | AWS-native; best pricing at target quality |
Key observation: Nova Sonic's claimed ~$0.017/min represents approximately a 4.4x cost advantage over Deepgram's bundled rate ($0.075/min) and a 14x cost advantage over OpenAI Realtime API audio output ($0.24/min). The pricing advantage is structural because Nova Sonic charges per token of audio rather than per wall-clock minute.
---
BROWSER AWS CLOUD
┌────────────────┐
│ getUserMedia()│
│ mic capture │
│ VAD buffer │
│ ~80ms │
└───────┬────────┘
│ HTTPS POST (chunked)
│ +30ms network
▼
┌───────────────────┐
│ API Gateway │ HTTP API
│ +5ms │
└───────┬───────────┘
│
▼
┌───────────────────┐
│ Lambda │ Python 3.12 ARM64
│ +20ms invoke │
└───┬───────────┬───┘
│ │
│ WebSocket │ invoke
▼ ▼
┌──────────┐ ┌──────────────────────────┐
│Transcribe│ │ │
│Streaming │ │ (waiting for transcript)│
│+180ms │ │ │
└────┬─────┘ └──────────────────────────┘
│ partial transcript
▼
┌───────────────────┐
│ Bedrock │ Claude Sonnet 4.6
│ Claude Sonnet │
│ +250ms TTFT │
└────────┬──────────┘
│ text tokens
▼
┌───────────────────┐
│ Amazon Polly │ Neural/Generative
│ +120ms │
└────────┬──────────┘
│ audio stream
▼
┌───────────────────┐
│ Lambda SSE emit │
└────────┬──────────┘
│ SSE stream +30ms network
▼
┌────────────────┐
│ Browser │
│ Web Audio API │
│ +35ms buffer │
└────────────────┘
TOTAL HOPS: 6 service boundaries
p50 LATENCY: ~750ms first audio byte
TRANSPORT: SSE (unidirectional, no barge-in)
BROWSER AWS CLOUD
┌────────────────┐
│ getUserMedia()│
│ mic capture │
│ VAD (soft) │
│ ~50ms │
│ barge-in sig. │
└───────┬────────┘
│ WebSocket (persistent, full-duplex)
│ +20ms network
▼
┌───────────────────┐
│ API Gateway │ WebSocket API
│ WS route │
│ +5ms │
└───────┬───────────┘
│ WS frame
▼
┌─────────────────────────────────┐
│ Lambda (WS handler) │
│ or ECS Task (Phase 3+) │
│ +5ms │
│ │
│ InvokeModelWithBidirectional │
│ Stream (HTTP/2 to Bedrock) │
└─────────────┬───────────────────┘
│
▼
┌─────────────────────────────────┐
│ Amazon Nova Sonic │
│ (Bedrock, us-east-1) │
│ ┌─────────────────────────┐ │
│ │ UNIFIED MODEL │ │
│ │ STT + LLM + TTS │ │
│ │ (single forward pass) │ │
│ └─────────────────────────┘ │
│ +150ms TTFA (p50 target) │
└──────────────┬──────────────────┘
│ audio chunk events (streaming)
▼
┌─────────────────────────────────┐
│ Lambda / ECS │
│ WS forward │
│ ~0ms (passthrough) │
└──────────────┬──────────────────┘
│ WS frame +20ms network
▼
┌────────────────┐
│ Browser │
│ Web Audio API │
│ +30ms buffer │
└────────────────┘
TOTAL HOPS: 3 service boundaries (vs 6 in M1)
p50 LATENCY TARGET: ~280ms first audio byte
TRANSPORT: WebSocket (full-duplex, barge-in native)
ELIMINATED: Transcribe Streaming, Polly, SSE
BROWSER AWS CLOUD
┌──────────────┐
│ WS mic │
│ +50ms │
└──────┬───────┘
│ WebSocket
▼
┌──────────────────────────────────────────┐
│ API Gateway WS + Lambda/ECS │
└───┬──────────────────┬───────────────────┘
│ │
▼ ▼
┌──────────┐ ┌───────────────────────┐
│ Whisper │ │ │
│ or Nova │ │ (waiting for STT) │
│ STT │ │ │
│ +120ms │ └───────────────────────┘
└────┬─────┘
│ transcript
▼
┌──────────────────────────────────────────┐
│ Bedrock Claude Sonnet 4.6 │
│ +200ms TTFT │
└─────────────────┬────────────────────────┘
│ text
▼
┌──────────────────────────────────────────┐
│ SageMaker Endpoint │
│ Sesame CSM-1B (ml.g5.xlarge) │
│ +200ms first chunk (streaming) │
└──────────────────┬───────────────────────┘
│ WS audio
▼
┌──────────────┐
│ Browser │
│ +30ms │
└──────────────┘
TOTAL HOPS: 5 (vs 3 for Nova Sonic)
p50 ESTIMATE: ~620ms (fails 300ms target)
NOTE: CSM-1B path does NOT achieve M2 latency target
---
Objective: Establish WebSocket endpoint; validate infra without touching M1.
Tasks:
1. Create API Gateway WebSocket API (wss://voice.silentinfinity.com/ws). Configure three routes: $connect, $disconnect, $default.
2. Deploy voice_ws_echo Lambda handler (Python 3.12 ARM64) on the $default route. Handler receives JSON frames, echoes them back. No ML inference.
3. Wire browser client: add WebSocket transport alongside existing fetch POST. Use variant registry to route voice_stt = "ws-echo" to the new path. Default remains transcribe-polly-baseline.
4. Validate: WebSocket connect/disconnect round-trip latency < 50ms from US East clients. Confirm API Gateway WS connection logs in CloudWatch.
5. Update variants.py: add ws-echo variant flagged as internal_only = True, rollout_pct = 0.
Exit criteria: WebSocket echo round-trip p50 < 50ms. Zero impact on existing M1 traffic. CloudWatch WS connection metrics flowing.
Risks: API Gateway WebSocket has a 10-minute idle connection timeout; implement heartbeat ping/pong (30s interval) in client.
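A minimal sketch of the echo handler and heartbeat reply follows. It assumes the standard API Gateway WebSocket proxy event shape (`requestContext.connectionId` and `body`); the `send` callable and the `{"type": "ping"}` frame are hypothetical. In AWS, `send` could wrap the API Gateway Management API `post_to_connection` call, which is deliberately left out so the block runs without AWS credentials.

```python
import json


def make_echo_handler(send):
    """Build a Lambda-style handler for the $default route.

    `send(connection_id, data_bytes)` is injected by the caller (hypothetical
    helper). The handler does no ML inference: it echoes JSON frames and
    answers application-level heartbeat pings.
    """

    def handler(event, context=None):
        connection_id = event["requestContext"]["connectionId"]
        body = event.get("body") or ""
        try:
            frame = json.loads(body)
        except json.JSONDecodeError:
            return {"statusCode": 400, "body": "invalid JSON frame"}

        # Heartbeat: the frame type "ping" is an assumed client convention.
        if isinstance(frame, dict) and frame.get("type") == "ping":
            reply = {"type": "pong"}
        else:
            reply = frame  # plain echo

        send(connection_id, json.dumps(reply).encode("utf-8"))
        return {"statusCode": 200}

    return handler


if __name__ == "__main__":
    outbox = []
    handler = make_echo_handler(lambda cid, data: outbox.append((cid, data)))
    handler({"requestContext": {"connectionId": "demo"}, "body": '{"type": "ping"}'})
    print(outbox)
```

Injecting `send` keeps the handler testable offline; the client would issue the ping frame on the 30-second interval described in the risk note.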
Objective: Replace Transcribe + Polly with Nova Sonic. Measure real latency vs M1.
Tasks:
1. Implement NovaSonicAdapter class in Lambda: opens InvokeModelWithBidirectionalStream to Bedrock Runtime, forwards browser audio frames as audioInput events, receives audioOutput events and forwards as WS frames to browser.
2. Map system prompt (Silent Infinity voice persona, tone instructions) to Nova Sonic systemPrompt event at session start.
3. Implement barge-in: on receiving bargeIn event from browser (user speaks during AI response), send audioInputEnd to Nova Sonic and signal Lambda to stop forwarding audioOutput events.
4. Instrument custom CloudWatch metrics: VoiceM2_TTFA_ms (time from last user audio frame to first Nova Sonic audio output frame), VoiceM2_SessionError, VoiceM2_ModelLatency_p50, VoiceM2_ModelLatency_p95.
5. Deploy sonic-unified variant in variants.py at rollout_pct = 0, internal_only = True.
6. Run internal load test: 100 concurrent sessions, 50 voice turns each. Record p50/p95 TTFA.
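The barge-in behavior in task 3 can be sketched as a small gate that decides whether a model audio chunk should still be forwarded to the browser. This is an illustration under assumptions: the `response_id` argument is a hypothetical per-response identifier, and nothing here reflects the actual Nova Sonic event schema.

```python
class BargeInGate:
    """Tracks which model response was interrupted so its audio is dropped."""

    def __init__(self):
        self._active_response_id = None     # response currently being played
        self._cancelled_response_id = None  # response the user talked over

    def on_model_audio(self, response_id):
        """Return True if an audio chunk for `response_id` should be forwarded."""
        if response_id == self._cancelled_response_id:
            return False  # late chunk from an interrupted response: drop it
        self._active_response_id = response_id
        return True

    def on_barge_in(self):
        """Browser signalled that the user started speaking over the response."""
        if self._active_response_id is not None:
            self._cancelled_response_id = self._active_response_id
            self._active_response_id = None


if __name__ == "__main__":
    gate = BargeInGate()
    print(gate.on_model_audio("r1"))  # forwarded
    gate.on_barge_in()
    print(gate.on_model_audio("r1"))  # dropped
    print(gate.on_model_audio("r2"))  # next response is forwarded again
```

Keeping only the most recently cancelled response is a simplification; a production handler would also need to stop audio that is already buffered in the browser.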
Measurement protocol:
audioOutput event minus timestamp of Lambda receiving last user audio frame before end-of-turn signal.
performance.now() timestamps.
Exit criteria: Nova Sonic p50 TTFA ≤ 200ms in load test. p95 TTFA ≤ 500ms. Error rate < 1%. Internal team (5 people) subjective quality rating ≥ 4/5 on mirror-tone accuracy.
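The quantitative part of the Phase 2 exit criteria can be checked with a short script over the load-test samples. The sketch below uses a nearest-rank percentile, which is one of several common definitions and may differ slightly from the percentile CloudWatch reports; function names are illustrative, and the subjective quality rating is outside its scope.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of `samples` for an integer `pct` in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ttfa_ms(last_user_frame_ts, first_audio_output_ts):
    """TTFA in ms from two timestamps expressed in seconds."""
    return (first_audio_output_ts - last_user_frame_ts) * 1000.0


def phase2_gate(ttfa_samples_ms, error_count, session_count):
    """Evaluate the numeric exit criteria: p50 <= 200, p95 <= 500, errors < 1%."""
    p50 = percentile(ttfa_samples_ms, 50)
    p95 = percentile(ttfa_samples_ms, 95)
    error_rate = error_count / session_count
    return {
        "p50_ms": p50,
        "p95_ms": p95,
        "error_rate": error_rate,
        "passed": p50 <= 200 and p95 <= 500 and error_rate < 0.01,
    }


if __name__ == "__main__":
    samples = [float(v) for v in range(101, 201)]  # synthetic data, not a measurement
    print(phase2_gate(samples, error_count=0, session_count=100))
```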
Objective: Canary to production users. Decide Nova Sonic vs Sesame CSM-1B.
Decision gate at end of Phase 2:
Path A — Nova Sonic meets target (p50 TTFA ≤ 200ms reliably):
1. Set sonic-unified rollout to 5% in variant registry.
2. Monitor CloudWatch for 48 hours: VoiceM2_TTFA_ms p50 < 300ms, VoiceM2_SessionError < 2%, VoiceM2_ModelLatency_p95 < 800ms.
3. If stable: increase to 25%, monitor 48h, increase to 50%, monitor 48h.
4. Proceed to Phase 4.
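The Path A ramp can be expressed as a small gate function. The thresholds are the ones listed in step 2; the function names and the choice to drop to 0% on any unhealthy window are assumptions made for illustration, and the 100% step is left to Phase 4.

```python
RAMP_STEPS = [0, 5, 25, 50]  # Phase 3 percentages for sonic-unified


def canary_healthy(ttfa_p50_ms, session_error_rate, model_latency_p95_ms):
    """Apply the 48-hour monitoring thresholds from Path A, step 2."""
    return (
        ttfa_p50_ms < 300
        and session_error_rate < 0.02
        and model_latency_p95_ms < 800
    )


def next_rollout_pct(current_pct, healthy):
    """Return the rollout percentage for the next monitoring window."""
    if current_pct not in RAMP_STEPS:
        raise ValueError(f"unexpected rollout percentage: {current_pct}")
    if not healthy:
        return 0  # assumed policy: revert fully rather than step down
    idx = RAMP_STEPS.index(current_pct)
    return RAMP_STEPS[min(idx + 1, len(RAMP_STEPS) - 1)]


if __name__ == "__main__":
    pct = 5
    for window in range(3):
        healthy = canary_healthy(260.0, 0.004, 610.0)  # synthetic readings
        pct = next_rollout_pct(pct, healthy)
        print(window, pct)
```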
Path B — Nova Sonic fails target:
1. Activate Sesame CSM-1B fallback investigation.
2. Deploy CSM-1B on SageMaker ml.g5.xlarge endpoint (2 instances, auto-scaling 1–4).
3. Implement streaming generation: first audio chunk within 200ms of text input requires aggressive RVQ chunk streaming (50-token chunks vs default full-sequence generation).
4. Measure CSM-1B first-chunk latency. If p50 > 400ms, escalate to leadership — M2 sub-300ms target is not achievable with current open-source options and timeline. Consider OpenAI Realtime API as managed fallback.
5. Note: Path B adds 3–4 weeks to timeline.
Rollback in Phase 3: CodeDeploy canary with automatic rollback. CloudWatch alarm VoiceM2_P95_Breach (p95 TTFA > 800ms sustained 5 minutes) triggers CodeDeploy:RollbackDeployment. Variant registry flips sonic-unified to rollout_pct = 0 within 60 seconds of alarm state.
Objective: 100% traffic on M2. Decommission Transcribe Streaming, Polly, and SSE endpoint.
Tasks:
1. Ramp sonic-unified to 100% via variant registry.
2. Keep M1 Lambda warm for 72 hours post-ramp (provisioned concurrency 1) as emergency rollback target.
3. After 72h without rollback trigger: set M1 Lambda to 0 provisioned concurrency.
4. After 2 weeks: deprecate transcribe-polly-baseline variant, remove Transcribe IAM permissions, delete Polly voice config.
5. Update SSE endpoint to return 410 Gone with upgrade instructions for any clients still polling the old path.
6. Final cost audit: compare M2 Nova Sonic monthly bill against M1 Transcribe + Polly + Bedrock baseline.
---
| Service | Volume | Unit Price | Monthly Cost |
|---------|--------|-----------|--------------|
| Amazon Transcribe Streaming | 12,500 min | $0.024/min | $300 |
| Amazon Bedrock Claude Sonnet 4.6 | ~500K input + 250K output tokens | ~$3/M in, $15/M out | ~$5.25 |
| Amazon Polly Generative | 25,000 min (audio chars equiv.) | ~$0.03/min equiv. | ~$750 |
| API Gateway HTTP API | 50K requests + data | ~$1.00/M req | ~$0.05 |
| Lambda (ARM64, 1GB, 60s avg) | 50K invocations | ~$0.0000133/GB-s | ~$40 |
| M1 TOTAL | | | ~$1,095/month |
Note: Polly Generative pricing varies by characters; estimate based on ~150 words/30s at $0.00016/char.
| Component | Calculation | Monthly Cost |
|-----------|-------------|--------------|
| Nova Sonic speech input | 12,500 min × ~170 tokens/s × $0.0034/1K tokens | ~$362 |
| Nova Sonic speech output | 25,000 min × ~170 tokens/s × $0.0136/1K tokens | ~$2,890 |
| Nova Sonic text (system prompt, tools) | ~50K turns × 500 text tokens × $0.003/1K | ~$75 |
| API Gateway WebSocket | 50K connections + message fees | ~$15 |
| Lambda WS handler (lightweight) | 50K sessions × 60s × 128MB | ~$5 |
| Nova Sonic M2 TOTAL | | ~$3,347/month |
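Note: The output token rate dominates. At $0.0136/1K tokens, 25,000 minutes of audio output at ~170 tokens/second = 255M tokens = ~$3,468, which is higher than the ~$2,890 shown in the table (that figure corresponds to roughly 142 tokens/second), so the token rate should be confirmed against actual billing in Phase 2. Either figure is higher than M1 Polly. However, Transcribe Streaming ($300) and Polly ($750) are replaced by a single billing dimension, and the quality + latency improvement justifies the delta at this volume.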
At 50K turns/month: Nova Sonic M2 costs approximately 3x M1. At 200K turns/month, the relative overhead decreases as fixed Lambda/API GW costs become negligible.
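The speech-token line items can be recomputed directly from the stated volumes and rates. At the stated ~170 tokens/second the results come out near $434 (input) and $3,468 (output), above the table's ~$362 and ~$2,890, which correspond to roughly 142 tokens/second. The sketch below reproduces both directions of that arithmetic; function names are illustrative.

```python
def speech_token_cost(minutes, tokens_per_second, usd_per_1k_tokens):
    """Return (token_count, cost_usd) for a monthly audio volume."""
    tokens = minutes * 60 * tokens_per_second
    return tokens, tokens / 1000 * usd_per_1k_tokens


def implied_tokens_per_second(cost_usd, minutes, usd_per_1k_tokens):
    """Token rate that would produce `cost_usd` for the given volume and price."""
    tokens = cost_usd / usd_per_1k_tokens * 1000
    return tokens / (minutes * 60)


if __name__ == "__main__":
    in_tokens, in_cost = speech_token_cost(12_500, 170, 0.0034)
    out_tokens, out_cost = speech_token_cost(25_000, 170, 0.0136)
    print(in_tokens, round(in_cost, 2))
    print(out_tokens, round(out_cost, 2))
    # Rates implied by the table's rounded monthly figures.
    print(round(implied_tokens_per_second(362, 12_500, 0.0034), 1))
    print(round(implied_tokens_per_second(2_890, 25_000, 0.0136), 1))
```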
| Component | Spec | Monthly Cost |
|-----------|------|--------------|
| SageMaker ml.g5.xlarge (24/7) | $1.41/hr × 720hr | $1,015 |
| SageMaker ml.g5.xlarge (2nd instance, HA) | $1.41/hr × 720hr | $1,015 |
| Whisper/Transcribe STT (still needed) | 12,500 min × $0.024/min | $300 |
| Bedrock Claude Sonnet 4.6 (still needed) | ~$5.25 as above | $5 |
| ECS Task (orchestration) | 2 tasks × 0.5 vCPU | ~$30 |
| Data transfer, storage | | ~$20 |
| CSM-1B TOTAL | | ~$2,385/month |
Note: CSM-1B is $962/month cheaper than Nova Sonic at 50K turns, but this gap reverses at lower volume because SageMaker instances run 24/7 regardless of traffic, while Nova Sonic bills only on usage. Break-even is approximately 35K turns/month. Below that, Nova Sonic is cheaper because CSM-1B idle cost dominates; above it, CSM-1B is cheaper per turn, provided the two-instance fleet can still absorb the load.
GPU idle cost is the structural risk for CSM-1B. At zero-traffic hours, the ml.g5.xlarge instances still bill $1.41/hr. Mitigation requires aggressive auto-scaling to zero, but cold-start for a 1B parameter model on GPU is 15–45 seconds, which is unacceptable for voice. SageMaker Serverless Inference is not an option here because it does not support GPU instances.
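The break-even point between usage-billed Nova Sonic and the mostly fixed-cost CSM-1B fleet follows from the two cost tables. The sketch assumes that Nova Sonic cost scales linearly with turns and that the two-instance CSM-1B fleet is sufficient at every volume considered; both are simplifications, and the constant names are illustrative.

```python
REFERENCE_TURNS = 50_000

NOVA_MONTHLY_AT_REFERENCE = 3_347.0      # Nova Sonic M2 total from the table
CSM_FIXED_MONTHLY = 1_015 + 1_015 + 30 + 20  # two GPU instances, ECS, transfer
CSM_VARIABLE_AT_REFERENCE = 300 + 5      # STT and Claude, which scale with usage


def nova_cost(turns):
    """Usage-only cost: scales linearly with monthly turns."""
    return NOVA_MONTHLY_AT_REFERENCE / REFERENCE_TURNS * turns


def csm_cost(turns):
    """Fixed GPU fleet plus the usage-based STT and LLM components."""
    return CSM_FIXED_MONTHLY + CSM_VARIABLE_AT_REFERENCE / REFERENCE_TURNS * turns


def break_even_turns():
    """Monthly turns at which both options cost the same."""
    nova_per_turn = NOVA_MONTHLY_AT_REFERENCE / REFERENCE_TURNS
    csm_variable_per_turn = CSM_VARIABLE_AT_REFERENCE / REFERENCE_TURNS
    return CSM_FIXED_MONTHLY / (nova_per_turn - csm_variable_per_turn)


if __name__ == "__main__":
    print(round(break_even_turns()))
    for turns in (20_000, 50_000):
        print(turns, round(nova_cost(turns)), round(csm_cost(turns)))
```

Below the break-even volume the usage-billed option is cheaper; above it the fixed fleet wins per turn, until the fleet itself has to grow.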
---
Mirror-tone accuracy measures how faithfully the voice pipeline reflects the intended emotional register, conversational style, and persona of the Silent Infinity product. It is distinct from WER (word error rate) and MOS (mean opinion score). We define it along three axes:
1. Prosodic alignment: Does the AI's spoken response match the emotional valence of the user's input? (e.g., a user speaking with urgency should receive a response with matched urgency, not flat neutral TTS)
2. Persona consistency: Does the voice character remain consistent across turns within a session and across sessions?
3. Turn-taking naturalness: Do pauses, hesitations, and response timing feel human-calibrated?
Latency metrics (continuous, CloudWatch):
TTFA_p50, TTFA_p95: per variant, per hour
SessionError_rate: 5-minute rolling
BargeIn_success_rate: fraction of barge-in events that result in clean interruption
Transcription accuracy (sampled, 1% of turns):
Audio quality (sampled, 1% of turns):
UTMOS score (neural MOS predictor): target mean > 4.0
Persona consistency (batch, weekly):
- Naturalness (1–5)
- Mirror-tone accuracy (1–5, with rubric: does the AI sound like it's listening?)
- Overall preference (A/B/neither)
---
The following additions to variants.py support the M2 migration. Phase percentages represent the rollout trajectory.
# variants.py: M2 additions
VOICE_VARIANTS = {
    # BASELINE: current M1 production path
    "transcribe-polly-baseline": {
        "voice_stt": "transcribe_streaming",
        "voice_llm": "bedrock_claude_sonnet_4_6",
        "voice_tts": "polly_generative",
        "transport": "sse",
        "rollout_pct": 100,  # Phase 1: 100% | Phase 4: 0%
        "internal_only": False,
        "min_sdk_version": None,
        "description": "M1 production baseline. SSE transport. Polly Neural/Generative TTS.",
        "deprecated": False,  # Set True in Phase 4 week 2
    },
    # PRIMARY M2 CANDIDATE: Nova Sonic unified speech-to-speech
    "sonic-unified": {
        "voice_stt": "nova_sonic_unified",  # STT handled internally by Nova Sonic
        "voice_llm": "nova_sonic_unified",  # LLM handled internally by Nova Sonic
        "voice_tts": "nova_sonic_unified",  # TTS handled internally by Nova Sonic
        "transport": "websocket",
        "bedrock_model_id": "amazon.nova-sonic-v1:0",
        "rollout_pct": 0,  # Phase 1: 0% | Phase 2: 0% internal | Phase 3: 5% | Phase 4: 100%
        "internal_only": True,  # Set False at Phase 3 canary
        "barge_in_enabled": True,
        "system_prompt_key": "voice_persona_v2",
        "description": "M2 primary. Nova Sonic unified S2S. WebSocket transport. Sub-300ms target.",
        "cloudwatch_alarm_group": "VoiceM2Alarms",
    },
    # FALLBACK M2 CANDIDATE: Sesame CSM-1B (activated only if Nova Sonic fails Phase 2 gate)
    "sesame-csm1b": {
        "voice_stt": "transcribe_streaming",  # Still required; CSM-1B is TTS-only
        "voice_llm": "bedrock_claude_sonnet_4_6",
        "voice_tts": "sesame_csm1b_sagemaker",
        "transport": "websocket",
        "sagemaker_endpoint": "csm-1b-inference-v1",
        "rollout_pct": 0,  # Stays 0 unless Nova Sonic fails the Phase 2 gate
        "internal_only": True,
        "barge_in_enabled": False,  # Requires custom implementation
        "description": "M2 fallback. CSM-1B on SageMaker. Higher latency; better voice persona control.",
        "cloudwatch_alarm_group": "VoiceM2Alarms",
        "notes": "Activate only if sonic-unified fails p50 < 300ms gate at end of Phase 2.",
    },
}

# Rollout schedule: each entry maps variant name to traffic percentage (sums to 100)
ROLLOUT_PHASES = {
    "Phase1_Week1": {"transcribe-polly-baseline": 100, "sonic-unified": 0, "sesame-csm1b": 0},
    "Phase2_Week2": {"transcribe-polly-baseline": 100, "sonic-unified": 0, "sesame-csm1b": 0},  # internal shadow
    "Phase3_Week4": {"transcribe-polly-baseline": 95, "sonic-unified": 5, "sesame-csm1b": 0},
    "Phase3_Week5": {"transcribe-polly-baseline": 75, "sonic-unified": 25, "sesame-csm1b": 0},
    "Phase4_Week6": {"transcribe-polly-baseline": 50, "sonic-unified": 50, "sesame-csm1b": 0},
    "Phase4_Week7": {"transcribe-polly-baseline": 0, "sonic-unified": 100, "sesame-csm1b": 0},
}
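One way the registry could turn a rollout split into a per-session decision is deterministic hash bucketing, sketched below. This is an assumption about how the variant registry might work, not a description of the existing implementation; `bucket_for` and `pick_variant` are hypothetical names, and the two splits are copied from the Phase 3 rows of the schedule.

```python
import hashlib

PHASE3_WEEK4 = {"transcribe-polly-baseline": 95, "sonic-unified": 5, "sesame-csm1b": 0}
PHASE3_WEEK5 = {"transcribe-polly-baseline": 75, "sonic-unified": 25, "sesame-csm1b": 0}


def bucket_for(session_id: str) -> int:
    """Map a session id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_variant(session_id: str, split: dict) -> str:
    """Choose a variant by walking cumulative percentages in dict order."""
    if sum(split.values()) != 100:
        raise ValueError("rollout percentages must sum to 100")
    bucket = bucket_for(session_id)
    upper = 0
    for name, pct in split.items():
        upper += pct
        if bucket < upper:
            return name
    raise AssertionError("unreachable when percentages sum to 100")


if __name__ == "__main__":
    for sid in ("session-a", "session-b", "session-c"):
        print(sid, bucket_for(sid), pick_variant(sid, PHASE3_WEEK4))
```

Because the baseline is listed first and shrinks from the top of the bucket range, a session placed on sonic-unified at 5% stays on it when the split moves to 25%, which keeps canary cohorts stable across ramp steps.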
---
The following CloudWatch alarms are mandatory before any Phase 3 canary traffic.
Alarm 1: M2 P95 Latency Breach
MetricName: VoiceM2_TTFA_ms
Namespace: SilentInfinity/Voice
Statistic: p95
Period: 300 seconds (5 min)
Threshold: > 800ms
EvaluationPeriods: 2
TreatMissingData: notBreaching
AlarmActions:
- SNS: voice-m2-oncall
- Lambda: revert-to-m1-variant (sets sonic-unified rollout_pct = 0)
- CodeDeploy: RollbackDeployment (if active canary deployment)
Alarm 2: M2 Error Rate Breach
MetricName: VoiceM2_SessionError_rate
Namespace: SilentInfinity/Voice
Statistic: Average
Period: 300 seconds
Threshold: > 0.02 (2%)
EvaluationPeriods: 2
TreatMissingData: notBreaching
AlarmActions:
- SNS: voice-m2-oncall
- Lambda: revert-to-m1-variant
Alarm 3: Bedrock Service Error Spike
Namespace: AWS/Bedrock
MetricName: InvocationServerErrors
Dimensions: ModelId=amazon.nova-sonic-v1:0
Statistic: Sum
Period: 60 seconds
Threshold: > 5 errors in 60 seconds
EvaluationPeriods: 1
AlarmActions:
- Lambda: revert-to-m1-variant (immediate)
Alarm 4: M2 TTFA Regression (soft warning)
MetricName: VoiceM2_TTFA_ms
Statistic: p50
Threshold: > 400ms
Period: 600 seconds (10 min)
EvaluationPeriods: 3
AlarmActions:
- SNS: voice-m2-engineering (notification only, no auto-revert)
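Alarm 1 can be expressed as keyword arguments for CloudWatch's `PutMetricAlarm` operation (for example via boto3's `put_metric_alarm`). The sketch only builds the argument dictionary and does not call AWS. It assumes the alarm is computed from the raw `VoiceM2_TTFA_ms` metric using the `p95` extended statistic, and that the SNS topic and Lambda targets are supplied as ARNs by the caller; the alarm name is taken from the Phase 3 rollback description.

```python
def ttfa_p95_alarm_kwargs(action_arns):
    """Build PutMetricAlarm arguments for the M2 p95 latency breach alarm."""
    return {
        "AlarmName": "VoiceM2_P95_Breach",
        "Namespace": "SilentInfinity/Voice",
        "MetricName": "VoiceM2_TTFA_ms",  # assumed raw per-turn latency metric
        "ExtendedStatistic": "p95",       # percentile statistics use this field
        "Period": 300,                    # seconds
        "EvaluationPeriods": 2,
        "Threshold": 800.0,               # milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": list(action_arns),
    }


if __name__ == "__main__":
    # Placeholder ARN for illustration only; real ARNs come from the deployment.
    kwargs = ttfa_p95_alarm_kwargs(["arn:aws:sns:us-east-1:000000000000:voice-m2-oncall"])
    for key, value in kwargs.items():
        print(f"{key}: {value}")
```

With a 300-second period and two evaluation periods, the alarm fires after roughly ten minutes of sustained breach, which is worth keeping in mind when reasoning about rollback time.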
In any emergency (alarm not firing but quality is degraded):
# Immediate: flip variant registry via SSM Parameter Store
aws ssm put-parameter \
--name "/silentinfinity/voice/variant_overrides" \
--value '{"sonic-unified": 0, "transcribe-polly-baseline": 100}' \
--overwrite
# Lambda reads this parameter on each invocation via a 30-second cache
# RTO: < 60 seconds
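The 30-second cache mentioned above can be sketched as a small TTL wrapper around whatever reads the parameter. The `fetch` callable is an assumption: in Lambda it could wrap an SSM `get_parameter` call and return the parameter's string value, but it is injected here so the block runs without AWS access. The class name is hypothetical.

```python
import json
import time


class CachedOverrides:
    """Caches the parsed variant-override JSON for a fixed time-to-live."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch        # returns the raw JSON string of the parameter
        self._ttl = ttl_seconds
        self._clock = clock        # injectable for tests
        self._value = None
        self._expires_at = None

    def get(self):
        """Return the overrides dict, refreshing it once the TTL has elapsed."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = json.loads(self._fetch())
            self._expires_at = now + self._ttl
        return self._value


if __name__ == "__main__":
    raw = '{"sonic-unified": 0, "transcribe-polly-baseline": 100}'
    overrides = CachedOverrides(lambda: raw)
    print(overrides.get())
```

Creating the cache at module scope lets warm Lambda containers reuse it across invocations; the worst-case propagation delay is then the TTL plus the time for the parameter write to become readable.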
M1 infrastructure (Transcribe IAM role, Polly config, SSE Lambda handler) is not deleted until:
---
M3 is not in scope for this memo but should inform decisions made in M2.
The primary M3 driver is multilingual expansion. Nova Sonic supports 5 languages as of early 2026 (English US/UK, Spanish, French, Italian, German). Silent Infinity's roadmap likely requires broader coverage. Three paths:
1. Nova 2 Sonic + language expansion: AWS has committed to expanding Nova Sonic language support. Monitor AWS release notes.
2. MMS + custom pipeline: Pratap et al. (2023) demonstrated that Massively Multilingual Speech (MMS) can support 1,000+ languages using public domain text with wav2vec 2.0 [Pratap et al., arXiv:2305.13516, 2023]. MMS is released under CC-BY-NC 4.0 (note: not commercially permissive — verify before use). An M3 architecture could use MMS for non-English STT piped into a multilingual LLM and a separate TTS system, at the cost of re-introducing pipeline hops and latency regression.
3. CSM-1B multilingual fine-tune: Sesame has announced plans to expand CSM-1B to 20+ languages. This is the most compelling M3 use case for CSM-1B — not as the latency-critical unified model, but as the fine-tunable voice persona layer on top of a Nova Sonic ASR stream.
M3 recommendation: Keep Nova Sonic as the primary unified backbone; layer a fine-tuned CSM-1B voice persona on top for non-English markets as they open, using CSM-1B's audio context input to maintain Silent Infinity's voice character across languages.
---
The evidence is unambiguous. Nova Sonic eliminates three service hops, provides native WebSocket-equivalent bidirectional streaming, includes tool use and barge-in as first-class features, and is fully managed on existing AWS infrastructure. At 50K turns/month the monthly cost increases from ~$1,095 to ~$3,347, but this delta purchases a 2.5x latency improvement and a qualitatively different product experience. The cost per turn is $0.067 — comparable to running a mid-tier dedicated voice API and substantially below OpenAI Realtime API ($0.30+ per turn at equivalent output).
CSM-1B is architecturally mismatched for M2. It is a TTS-only model requiring STT and LLM components above it, which preserves the multi-hop latency structure M2 is designed to eliminate. Its RTF of 0.28x on a high-end GPU means first-audio-byte latency cannot reach 300ms without streaming optimization work that is not yet production-tested. The GPU idle cost on SageMaker turns a variable cost into a fixed one, which is a problem at low traffic volumes. CSM-1B belongs in M3 as a fine-tunable voice persona layer for custom characters and multilingual expansion, where its Apache 2.0 license and fine-tuning capability create genuine value.
Speculative end-of-turn detection — triggering the Nova Sonic stream before the user fully stops speaking, using acoustic features to predict completion [Leviathan et al., ICML 2023; LTS-VoiceAgent framework, 2025] — is the lowest-cost, highest-leverage latency optimization available after the architecture swap. Implementing it in Phase 2 (not Phase 4) ensures you measure its effect in the canary window. A 30–60ms reduction in perceived TTFA turns a borderline 320ms p50 into a clean sub-300ms result.
---
1. AWS Nova Sonic announcement (April 2025): Introducing Amazon Nova Sonic
2. AWS Nova 2 Sonic announcement (December 2025): Introducing Amazon Nova 2 Sonic
3. Nova Sonic technical documentation: Amazon Nova Sonic Speech-to-Speech
4. Nova Sonic Bedrock model card: Nova Sonic — Amazon Bedrock
5. Amazon Nova pricing: Amazon Nova Pricing
6. Sesame CSM-1B model card: sesame/csm-1b — Hugging Face
7. SesameAILabs/csm GitHub: CSM Repository
8. CSM realtime issue thread: Issue #78 — Realtime voice agents
9. Leviathan et al. (ICML 2023): "Fast Inference from Transformers via Speculative Decoding." arXiv:2211.17192. https://arxiv.org/abs/2211.17192
10. Pratap et al. (2023): "Scaling Speech Technology to 1,000+ Languages." arXiv:2305.13516. https://arxiv.org/abs/2305.13516
11. OpenAI Realtime API documentation: https://platform.openai.com/docs/guides/realtime
12. Google Gemini Live API documentation: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/live-api
13. Deepgram Voice Agent latency: Low Latency Voice AI — Deepgram
14. Voice AI latency comparison 2026: Realtime vs Gemini Live vs ElevenLabs — TokenMix
15. WebSocket vs SSE latency: WebSockets vs SSE — Ably
16. WebSocket vs SSE protocol comparison: RxDB WebSocket/SSE analysis
17. AWS API Gateway WebSocket setup (2026): OneUptime — WebSocket with API Gateway
18. CloudWatch canary rollback: Canary Deployment for AWS Lambda — Lumigo
19. SageMaker inference pricing: SageMaker Pricing — AWS
20. Caylent Nova Sonic integration blog: Introducing Amazon Nova Sonic — Caylent
21. LTS-VoiceAgent streaming framework (2025): arXiv:2601.19952. https://arxiv.org/html/2601.19952
22. Nova Sonic WebRTC sample: aws-samples/sample-nova-sonic-speech2speech-webrtc
---
Memo prepared by SCOUT (TITAN Research Agent) on 2026-04-21. All pricing figures are current as of Q1 2026 and subject to AWS rate changes. Latency claims for Nova Sonic reflect vendor-reported figures; independent p50/p95 benchmarks should be established in Phase 2 before committing to production rollout.