Silent Infinity Voice M2: Sub-300ms Pipeline

Silent Infinity's current voice pipeline (M1) targets a p50 latency of 750ms end-to-end. This places it in the second tier of 2026 conversational voice products: the industry standard for real-time conversational AI has converged below 300ms first-audio-byte. This memo specifies the complete upgrade path from M1 to M2, evaluating two architectural candidates, AWS Nova Sonic (unified speech-to-speech on Bedrock) and Sesame CSM-1B (open-weights, self-hosted), against a 300ms p50 target.

Primary recommendation: AWS Nova Sonic is the correct M2 backbone. Its unified STT+LLM+TTS architecture eliminates three of the five latency-contributing hops present in M1, is available today on Amazon Bedrock in us-east-1, and requires no GPU fleet management. Sesame CSM-1B remains a viable Phase 3 fallback if Nova Sonic does not meet the sub-300ms target reliably at scale, but its real-time factor (RTF ~0.28x on an A100-class GPU, meaning ~2.8s to generate 10s of audio) presents a structural challenge without significant streaming optimization investment.

---

2. Current State: M1 Architecture

2.1 Stack Description

The M1 pipeline deployed at silentinfinity.com/voice consists of:

Frontend: Browser getUserMedia() mic capture, chunked audio via fetch POST
Transport: HTTPS POST with chunked transfer encoding, SSE response stream
Ingress: Amazon API Gateway (HTTP API)
Compute: AWS Lambda, Python 3.12 ARM64, provisioned concurrency recommended
STT: Amazon Transcribe Streaming (bidirectional WebSocket from Lambda)
LLM: Amazon Bedrock — Claude Sonnet 4.6 (anthropic.claude-sonnet-4-6)
TTS: Amazon Polly — Neural or Generative engine
Response transport: SSE stream from Lambda back through API Gateway to browser

2.2 M1 Known Constraints

1. Hop count: 6 discrete service boundaries between mic and first audio byte at speaker.

2. Non-streaming handoff: The Lambda must accumulate enough transcript from Transcribe before it can invoke Bedrock. This introduces buffering latency not visible in individual service SLAs.

3. SSE vs. WebSocket: SSE is unidirectional; the client cannot send mid-turn audio updates without opening a new POST. This prevents barge-in and forces end-of-utterance detection at the client, which adds 100–200ms buffer before the POST is dispatched [Ably, 2024].

4. p50 target: 750ms. This is approximately 2.5x the 2026 industry standard.

---

3. Latency Budget Analysis

3.1 M1 Per-Hop Budget (current stack, estimated p50)

| Hop | Component | p50 (ms) | Notes |
|-----|-----------|----------|-------|
| 1 | Mic capture + VAD buffer | 80 | Browser-side; end-of-utterance detection |
| 2 | Network: browser → API GW | 30 | US East, typical RTT ~25–40ms |
| 3 | API Gateway + Lambda invoke | 25 | Warm Lambda; provisioned concurrency |
| 4 | Transcribe Streaming first token | 180 | Partial transcript confidence threshold |
| 5 | Bedrock Claude Sonnet first token | 250 | Time to first output token (TTFT) |
| 6 | Polly Neural/Generative first chunk | 120 | Synthesis start on first sentence |
| 7 | Network: Lambda → browser | 30 | SSE stream first byte |
| 8 | Browser audio buffer + playback | 35 | Web Audio API scheduling |
| | Total | 750 | Sum of hops 1–8 |

3.2 M2 Target Per-Hop Budget (p50 ≤ 300ms)

The elimination of discrete STT and TTS hops is the core architectural insight. A unified speech-to-speech model processes audio input and streams audio output within a single model session, collapsing hops 4, 5, and 6 into one.

| Hop | Component | Target p50 (ms) | Notes |
|-----|-----------|-----------------|-------|
| 1 | Mic capture + VAD buffer | 50 | Reduced via WebSocket barge-in |
| 2 | Network: browser → API GW WS | 20 | Persistent connection, no handshake overhead |
| 3 | API GW WS → Lambda / ECS | 10 | WS frame dispatch, sub-ms per frame |
| 4 | Nova Sonic unified model TTFA | 150 | Time-to-first-audio-byte (vendor claim) |
| 5 | Network: response → browser | 20 | WS frame, persistent connection |
| 6 | Browser audio buffer + playback | 30 | PCM/opus decode, minimal buffering |
| | Total | 280 | Sum of hops 1–6 |

Key insight: The dominant term in M2 is the model TTFA (hop 4). Nova Sonic's advertised sub-200ms real-time audio processing latency is consistent with a 150ms p50 TTFA target [AWS Nova Sonic announcement, 2025]. If this claim holds under production load, the 300ms p50 is achievable. Speculative latency masking — triggering the model call before the user fully completes their utterance, using a draft prediction of end-of-turn — can shave a further 30–60ms of perceived latency by overlapping model prefill with the last frames of user speech [Leviathan et al., ICML 2023; PredGen framework, 2025].
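The two per-hop budgets can be checked with a few lines of arithmetic. The sketch below copies the p50 figures from the M1 and M2 tables and derives how much TTFA the model hop can consume before the 300ms target is missed. The dictionary keys and function names are illustrative, and summing per-hop p50 values is a planning approximation rather than a true end-to-end p50.

```python
# Per-hop p50 estimates in milliseconds, copied from the M1 and M2 budget tables.
M1_BUDGET_MS = {
    "mic_capture_vad": 80,
    "network_uplink": 30,
    "apigw_lambda_invoke": 25,
    "transcribe_first_token": 180,
    "bedrock_ttft": 250,
    "polly_first_chunk": 120,
    "network_downlink": 30,
    "browser_playback": 35,
}

M2_BUDGET_MS = {
    "mic_capture_vad": 50,
    "network_uplink_ws": 20,
    "apigw_ws_dispatch": 10,
    "nova_sonic_ttfa": 150,
    "network_downlink_ws": 20,
    "browser_playback": 30,
}


def total_ms(budget):
    """Sum of all per-hop estimates in a budget."""
    return sum(budget.values())


def headroom_ms(budget, target_ms):
    """Milliseconds left over after all hops are accounted for."""
    return target_ms - total_ms(budget)


def max_model_ttfa_ms(budget, model_key, target_ms):
    """Largest model TTFA that still fits the target, holding other hops fixed."""
    other_hops = total_ms(budget) - budget[model_key]
    return target_ms - other_hops
```

Under these figures the non-model hops in M2 consume 130ms, so `max_model_ttfa_ms(M2_BUDGET_MS, "nova_sonic_ttfa", 300)` is the number to compare against the measured Nova Sonic p50 in Phase 2.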

3.3 The WebSocket Advantage

The switch from SSE to WebSocket is not cosmetic. Empirical latency comparisons show WebSocket's persistent full-duplex connection eliminates the TCP/TLS handshake on every turn (saving ~50–80ms in cold-turn overhead) and enables continuous audio streaming rather than request/response batching [Ably, 2024; RxDB WebSocket analysis, 2024]. More importantly, WebSocket lets the client signal an interruption mid-response on the same connection (barge-in handling), which SSE cannot do because it carries server-to-client traffic only. For conversational voice, barge-in is not a feature; it is a correctness requirement. A user who interrupts an AI response expects the audio to stop. With SSE, the only mechanism is closing and reopening the HTTP connection.
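To make the barge-in requirement concrete, here is a minimal sketch of the server-side state that decides whether a model audio chunk should still be forwarded to the browser. `PlaybackGate` and its methods are hypothetical names, not part of any AWS SDK; the sketch assumes each model response is tagged with a turn counter so that late chunks from an interrupted turn can be discarded.

```python
from dataclasses import dataclass, field


@dataclass
class PlaybackGate:
    """Tracks whether audio for the current turn may still reach the browser."""

    turn_id: int = 0
    interrupted: bool = False
    forwarded: list = field(default_factory=list)

    def start_turn(self) -> int:
        """Begin a new AI response; clears any earlier interruption."""
        self.turn_id += 1
        self.interrupted = False
        return self.turn_id

    def barge_in(self) -> None:
        """Called when the client reports that the user started speaking."""
        self.interrupted = True

    def on_audio_chunk(self, turn_id: int, chunk: bytes) -> bool:
        """Forward the chunk only if it belongs to the live, uninterrupted turn."""
        if self.interrupted or turn_id != self.turn_id:
            return False  # drop: user interrupted, or chunk is from a stale turn
        self.forwarded.append(chunk)
        return True
```

The client still has to flush its own playback buffer when it detects speech; the gate only stops new audio from being sent after the interruption.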

---

4. M2 Candidate Evaluation

4.1 AWS Nova Sonic

Released: April 2025 (Nova Sonic); December 2025 (Nova 2 Sonic with 1M token context, polyglot voices)

Architecture: Unified speech-to-speech model. Accepts continuous audio input and produces audio output within a single bidirectional streaming session. Internally handles ASR, language understanding, response generation, and speech synthesis as a unified forward pass rather than a pipeline. The API is exposed via InvokeModelWithBidirectionalStream over HTTP/2 on Amazon Bedrock Runtime — functionally equivalent to a WebSocket in behavior, though the underlying transport is HTTP/2 multiplexed streams [AWS Nova docs, 2025].

Protocol: Bidirectional streaming with JSON events. Three outbound event streams: (1) ASR transcription tokens, (2) tool-use events, (3) audio output chunks. Client sends audio input events plus system prompt events. The session lifecycle is: establish stream → send configuration → stream audio → receive interleaved audio response.

Latency: AWS claims "real-time, low-latency multi-turn conversations" with "sub-200ms round-trip audio processing." No official p50/p95 SLA is published. The Caylent integration blog reports that "Nova Sonic delivered responses faster than competing models" in their testing [Caylent, 2025]. The DEV Community AWS Builders post corroborates sub-300ms end-to-end for their support voice assistant use case.

Context window: 300K tokens (Nova Sonic v1); 1M tokens (Nova 2 Sonic).

Languages: English (US, UK) at launch. Spanish added June 2025. French, Italian, German added July 2025.

Tool use: Native function calling within the streaming session — enables RAG, MCP server integration, and real-time knowledge grounding without breaking the audio stream.

Availability: us-east-1 at launch. Check current Bedrock model availability for additional regions before Phase 3 production rollout.

Pricing (as of early 2026):

Speech input: $0.0034 per 1,000 tokens
Speech output: $0.0136 per 1,000 tokens
Combined effective rate: ~$0.017/minute for typical conversational audio
Text tokens (tool calls, system prompts, conversation history): standard Bedrock text pricing applies

Strengths for M2:

Zero infrastructure to manage; fully managed on Bedrock
Native AWS IAM integration — no new auth layer
Eliminates Transcribe Streaming and Polly dependencies entirely
Tool use and barge-in handled natively within the stream
Nova 2 Sonic upgrade is API-compatible — free latency/quality improvement

Weaknesses:

No published p95 SLA — production tail latency risk
Limited to us-east-1 at launch (regional availability may be a latency concern for non-US users)
English + 4 EU languages only (M3 multilingual expansion requires MMS or alternative)
Vendor lock-in to AWS; no self-hosted option

4.2 Sesame CSM-1B

Released: March 2025

Architecture: Llama backbone (~1B parameters) + Mimi RVQ audio decoder. Input: text + optional audio context. Output: RVQ audio codes decoded to 24kHz PCM. Note: CSM-1B is a TTS model, not a speech-to-speech model. It generates speech from text; it does not accept audio input for understanding. A conversational voice pipeline using CSM-1B therefore still requires a separate STT layer (e.g., Whisper, Deepgram, or Amazon Transcribe) and a separate LLM layer. This is a three-component pipeline, not a unified one [Sesame CSM-1B HuggingFace model card, 2025].

Inference speed: RTF of approximately 0.28x on a high-end GPU (NVIDIA A100/4090 class), per community benchmarks. This means 10 seconds of audio requires ~2.8 seconds of generation time. When audio is generated as a full sequence, this is materially incompatible with a sub-300ms first-audio-byte target unless (a) audio is generated in very short chunks with aggressive streaming, or (b) speculative decoding is applied. The csm-streaming community fork demonstrates streaming generation, but no published first-chunk latency figures yet meet the 300ms threshold. A GitHub issue thread on realtime voice agents [SesameAILabs/csm issue #78, 2025] explicitly notes "slow inference time for real-time voice agent applications."

Hardware: Requires CUDA GPU. Minimum practical deployment: ml.g5.xlarge (NVIDIA A10G, 24 GB VRAM) at ~$1.41/hour on SageMaker. Note: AWS Fargate does not support GPU instances — GPU workloads require EC2 or SageMaker Inference Endpoints.

License: Apache 2.0 — fully open for commercial use.

Strengths for M2:

Open-weights: full control, no per-token cost, portable across clouds
Apache 2.0 allows fine-tuning on Silent Infinity's specific voice persona
Natural prosody and contextual speech quality praised by community

Weaknesses:

Not a speech-to-speech model; requires STT + LLM + CSM stack = 3 hops
RTF 0.28x on enterprise GPU makes sub-300ms structurally difficult
GPU fleet management overhead (SageMaker endpoint, scaling, idle cost)
English-only
No built-in tool use, barge-in, or turn management

Verdict on CSM-1B for M2: CSM-1B is architecturally misaligned with the M2 objective. It is an excellent TTS component for a custom voice persona project, but it is not a speech-to-speech unified model. Using it in M2 preserves the multi-hop pipeline latency problem that Nova Sonic eliminates. It belongs in an M2.5 custom voice persona initiative where the goal is voice character consistency, not latency.

---

5. Competitor Benchmarks

The following table provides control data points for calibrating the M2 target. All figures are industry-reported as of Q1 2026.

| Vendor | Architecture | Latency | Pricing | Notes |
|--------|-------------|--------------------------|---------|-------|
| Deepgram Voice Agent (Flux) | STT (Nova-3) + LLM + TTS (Aura-2) | Sub-300ms end-of-turn | $4.50/hour bundled | Not a unified model; achieves sub-300ms through optimized pipeline; $4.50/hr ≈ $0.075/min [Deepgram, 2026] |

Key observation: Nova Sonic's claimed ~$0.017/min represents approximately a 4.4x cost advantage over Deepgram's bundled rate ($0.075/min) and a 14x cost advantage over OpenAI Realtime API at equivalent audio output volume. The pricing advantage is structural because Nova Sonic charges per token of audio rather than per wall-clock minute.

---

6. Architecture Diagrams

6.1 M1 Architecture (Current State)


BROWSER                          AWS CLOUD
┌────────────────┐
│  getUserMedia()│
│  mic capture   │
│  VAD buffer    │
│  ~80ms         │
└───────┬────────┘
        │ HTTPS POST (chunked)
        │ +30ms network
        ▼
┌───────────────────┐
│  API Gateway      │  HTTP API
│  +5ms             │
└───────┬───────────┘
        │
        ▼
┌───────────────────┐
│  Lambda           │  Python 3.12 ARM64
│  +20ms invoke     │
└───┬───────────┬───┘
    │           │
    │ WebSocket │ invoke
    ▼           ▼
┌──────────┐  ┌──────────────────────────┐
│Transcribe│  │                          │
│Streaming │  │  (waiting for transcript)│
│+180ms    │  │                          │
└────┬─────┘  └──────────────────────────┘
     │ partial transcript
     ▼
┌───────────────────┐
│  Bedrock          │  Claude Sonnet 4.6
│  Claude Sonnet    │
│  +250ms TTFT      │
└────────┬──────────┘
         │ text tokens
         ▼
┌───────────────────┐
│  Amazon Polly     │  Neural/Generative
│  +120ms           │
└────────┬──────────┘
         │ audio stream
         ▼
┌───────────────────┐
│  Lambda SSE emit  │
└────────┬──────────┘
         │ SSE stream +30ms network
         ▼
┌────────────────┐
│  Browser       │
│  Web Audio API │
│  +35ms buffer  │
└────────────────┘

TOTAL HOPS: 6 service boundaries
p50 LATENCY: ~750ms first audio byte
TRANSPORT: SSE (unidirectional, no barge-in)

6.2 M2 Architecture (Nova Sonic Target)


BROWSER                          AWS CLOUD
┌────────────────┐
│  getUserMedia()│
│  mic capture   │
│  VAD (soft)    │
│  ~50ms         │
│  barge-in sig. │
└───────┬────────┘
        │ WebSocket (persistent, full-duplex)
        │ +20ms network
        ▼
┌───────────────────┐
│  API Gateway      │  WebSocket API
│  WS route         │
│  +5ms             │
└───────┬───────────┘
        │ WS frame
        ▼
┌─────────────────────────────────┐
│  Lambda (WS handler)            │
│  or ECS Task (Phase 3+)         │
│  +5ms                           │
│                                 │
│  InvokeModelWithBidirectional   │
│  Stream (HTTP/2 to Bedrock)     │
└─────────────┬───────────────────┘
              │
              ▼
┌─────────────────────────────────┐
│  Amazon Nova Sonic              │
│  (Bedrock, us-east-1)           │
│  ┌─────────────────────────┐    │
│  │ UNIFIED MODEL           │    │
│  │ STT + LLM + TTS         │    │
│  │ (single forward pass)   │    │
│  └─────────────────────────┘    │
│  +150ms TTFA (p50 target)       │
└──────────────┬──────────────────┘
               │ audio chunk events (streaming)
               ▼
┌─────────────────────────────────┐
│  Lambda / ECS                   │
│  WS forward                     │
│  ~0ms (passthrough)             │
└──────────────┬──────────────────┘
               │ WS frame +20ms network
               ▼
┌────────────────┐
│  Browser       │
│  Web Audio API │
│  +30ms buffer  │
└────────────────┘

TOTAL HOPS: 3 service boundaries (vs 6 in M1)
p50 LATENCY TARGET: ~280ms first audio byte
TRANSPORT: WebSocket (full-duplex, barge-in native)
ELIMINATED: Transcribe Streaming, Polly, SSE

6.3 M2 Fallback Architecture (Sesame CSM-1B)


BROWSER                    AWS CLOUD
┌──────────────┐
│  WS mic      │
│  +50ms       │
└──────┬───────┘
       │ WebSocket
       ▼
┌──────────────────────────────────────────┐
│  API Gateway WS + Lambda/ECS             │
└───┬──────────────────┬───────────────────┘
    │                  │
    ▼                  ▼
┌──────────┐      ┌───────────────────────┐
│ Whisper  │      │                       │
│ or Nova  │      │  (waiting for STT)    │
│ STT      │      │                       │
│ +120ms   │      └───────────────────────┘
└────┬─────┘
     │ transcript
     ▼
┌──────────────────────────────────────────┐
│  Bedrock Claude Sonnet 4.6               │
│  +200ms TTFT                             │
└─────────────────┬────────────────────────┘
                  │ text
                  ▼
┌──────────────────────────────────────────┐
│  SageMaker Endpoint                      │
│  Sesame CSM-1B (ml.g5.xlarge)            │
│  +200ms first chunk (streaming)          │
└──────────────────┬───────────────────────┘
                   │ WS audio
                   ▼
┌──────────────┐
│  Browser     │
│  +30ms       │
└──────────────┘

TOTAL HOPS: 5 (vs 3 for Nova Sonic)
p50 ESTIMATE: ~620ms (fails 300ms target)
NOTE: CSM-1B path does NOT achieve M2 latency target

---

7. Migration Phases

Phase 1 — WebSocket Infrastructure (Week 1)

Objective: Establish WebSocket endpoint; validate infra without touching M1.

Tasks:

1. Create API Gateway WebSocket API (wss://voice.silentinfinity.com/ws). Configure three routes: $connect, $disconnect, $default.

2. Deploy voice_ws_echo Lambda handler (Python 3.12 ARM64) on the $default route. Handler receives JSON frames, echoes them back. No ML inference.

3. Wire browser client: add WebSocket transport alongside existing fetch POST. Use variant registry to route voice_stt = "ws-echo" to the new path. Default remains transcribe-polly-baseline.

4. Validate: WebSocket connect/disconnect round-trip latency < 50ms from US East clients. Confirm API Gateway WS connection logs in CloudWatch.

5. Update variants.py: add ws-echo variant flagged as internal_only = True, rollout_pct = 0.

Exit criteria: WebSocket echo round-trip p50 < 50ms. Zero impact on existing M1 traffic. CloudWatch WS connection metrics flowing.

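Risks: API Gateway WebSocket has a 10-minute idle connection timeout and a 2-hour maximum connection duration. Implement an application-level heartbeat message (30s interval) in the client, since browser JavaScript cannot send protocol-level ping frames, and reconnect transparently when a connection is closed.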
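A minimal sketch of what the Phase 1 echo handler could look like follows. It assumes the standard API Gateway WebSocket proxy event shape (`requestContext.routeKey`, `connectionId`, `domainName`, `stage`) and replies through the API Gateway Management API's `post_to_connection` call. `build_handler` and `boto3_poster` are hypothetical helper names, and the endpoint URL format shown applies to the default execute-api domain; a custom domain such as voice.silentinfinity.com would need a different endpoint.

```python
import json


def build_handler(post_to_connection):
    """Return a Lambda handler that echoes JSON frames on the $default route.

    post_to_connection is injected so the handler can be exercised without AWS.
    """

    def handler(event, context):
        ctx = event["requestContext"]
        if ctx["routeKey"] in ("$connect", "$disconnect"):
            return {"statusCode": 200}  # nothing to echo on lifecycle routes
        frame = json.loads(event.get("body") or "{}")
        payload = json.dumps({"echo": frame}).encode("utf-8")
        post_to_connection(ConnectionId=ctx["connectionId"], Data=payload)
        return {"statusCode": 200}

    return handler


def boto3_poster(event):
    """Build the real sender inside Lambda (boto3 ships with the runtime)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    ctx = event["requestContext"]
    client = boto3.client(
        "apigatewaymanagementapi",
        endpoint_url=f"https://{ctx['domainName']}/{ctx['stage']}",
    )
    return client.post_to_connection
```

Injecting the sender keeps the echo logic testable locally, which suits a phase whose goal is to validate infrastructure without touching M1.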

Phase 2 — Nova Sonic Adapter Swap (Weeks 2–3)

Objective: Replace Transcribe + Polly with Nova Sonic. Measure real latency vs M1.

Tasks:

1. Implement NovaSonicAdapter class in Lambda: opens InvokeModelWithBidirectionalStream to Bedrock Runtime, forwards browser audio frames as audioInput events, receives audioOutput events and forwards as WS frames to browser.

2. Map system prompt (Silent Infinity voice persona, tone instructions) to Nova Sonic systemPrompt event at session start.

3. Implement barge-in: on receiving bargeIn event from browser (user speaks during AI response), send audioInputEnd to Nova Sonic and signal Lambda to stop forwarding audioOutput events.

4. Instrument custom CloudWatch metrics: VoiceM2_TTFA_ms (time from last user audio frame to first Nova Sonic audio output frame), VoiceM2_SessionError, VoiceM2_ModelLatency_p50, VoiceM2_ModelLatency_p95.

5. Deploy sonic-unified variant in variants.py at rollout_pct = 0, internal_only = True.

6. Run internal load test: 100 concurrent sessions, 50 voice turns each. Record p50/p95 TTFA.

Measurement protocol:

Define TTFA as: timestamp of Lambda receiving first Nova Sonic audioOutput event minus timestamp of Lambda receiving last user audio frame before end-of-turn signal.
Capture separately: (a) Nova Sonic model latency (p50/p95), (b) API GW WS frame delivery latency, (c) end-to-end browser-to-first-audio latency via client-side performance.now() timestamps.
Compare against M1 baseline measured identically over same 1-week window.

Exit criteria: Nova Sonic p50 TTFA ≤ 200ms in load test. p95 TTFA ≤ 500ms. Error rate < 1%. Internal team (5 people) subjective quality rating ≥ 4/5 on mirror-tone accuracy.
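The measurement protocol and the numeric exit criteria can be expressed as a small evaluation script. This is a sketch under stated assumptions: timestamps are seconds from a monotonic clock captured in the Lambda, percentiles use the nearest-rank method (CloudWatch may interpolate differently), and the function names are hypothetical. The subjective quality rating is not covered here.

```python
import math


def ttfa_ms(last_user_frame_ts, first_audio_out_ts):
    """TTFA as defined above: first audioOutput minus last user audio frame."""
    return (first_audio_out_ts - last_user_frame_ts) * 1000.0


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def phase2_gate(ttfa_samples_ms, error_count, session_count):
    """Check the numeric Phase 2 exit criteria against load-test results."""
    p50 = percentile(ttfa_samples_ms, 50)
    p95 = percentile(ttfa_samples_ms, 95)
    error_rate = error_count / session_count
    passed = p50 <= 200 and p95 <= 500 and error_rate < 0.01
    return {"p50_ms": p50, "p95_ms": p95, "error_rate": error_rate, "passed": passed}
```

Running the same script over the M1 baseline samples gives a like-for-like comparison for the one-week window.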

Phase 3 — Branch Rollout Decision (Weeks 4–5)

Objective: Canary to production users. Decide Nova Sonic vs Sesame CSM-1B.

Decision gate at end of Phase 2:

Path A — Nova Sonic meets target (p50 TTFA ≤ 200ms reliably):

1. Set sonic-unified rollout to 5% in variant registry.

2. Monitor CloudWatch for 48 hours: VoiceM2_TTFA_ms p50 < 300ms, VoiceM2_SessionError < 2%, VoiceM2_ModelLatency_p95 < 800ms.

3. If stable: increase to 25%, monitor 48h, increase to 50%, monitor 48h.

4. Proceed to Phase 4.

Path B — Nova Sonic fails target:

1. Activate Sesame CSM-1B fallback investigation.

2. Deploy CSM-1B on SageMaker ml.g5.xlarge endpoint (2 instances, auto-scaling 1–4).

3. Implement streaming generation: first audio chunk within 200ms of text input requires aggressive RVQ chunk streaming (50-token chunks vs default full-sequence generation).

4. Measure CSM-1B first-chunk latency. If p50 > 400ms, escalate to leadership — M2 sub-300ms target is not achievable with current open-source options and timeline. Consider OpenAI Realtime API as managed fallback.

5. Note: Path B adds 3–4 weeks to timeline.

Rollback in Phase 3: CodeDeploy canary with automatic rollback. CloudWatch alarm VoiceM2_P95_Breach (p95 TTFA > 800ms for two consecutive 5-minute periods) is attached to the CodeDeploy deployment group, so an alarm state stops the deployment and rolls it back automatically. Variant registry flips sonic-unified to rollout_pct = 0 within 60 seconds of alarm state.

Phase 4 — Full Production, Retire M1 (Week 6+)

Objective: 100% traffic on M2. Decommission Transcribe Streaming, Polly, and SSE endpoint.

Tasks:

1. Ramp sonic-unified to 100% via variant registry.

2. Keep M1 Lambda warm for 72 hours post-ramp (provisioned concurrency 1) as emergency rollback target.

3. After 72h without rollback trigger: set M1 Lambda to 0 provisioned concurrency.

4. After 2 weeks: deprecate transcribe-polly-baseline variant, remove Transcribe IAM permissions, delete Polly voice config.

5. Update SSE endpoint to return 410 Gone with upgrade instructions for any clients still polling the old path.

6. Final cost audit: compare M2 Nova Sonic monthly bill against M1 Transcribe + Polly + Bedrock baseline.

---

8. Cost Model

8.1 Assumptions

50,000 voice turns/month
Average turn duration: 45 seconds (user speaks ~15s, AI responds ~30s)
Audio input: 15s × 50,000 = 750,000 seconds = 12,500 minutes/month
Audio output: 30s × 50,000 = 1,500,000 seconds = 25,000 minutes/month

8.2 M1 Baseline Cost

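| Service | Volume | Unit Cost | Monthly Cost |
|---------|--------|-----------|--------------|
| Amazon Transcribe Streaming | 12,500 min | $0.024/min | $300 |
| Amazon Polly | 50K responses (~150 words each) | per character (see note) | ~$750 |
| Bedrock Claude Sonnet 4.6 | 50K turns | per token | ~$5 |
| Lambda (ARM64, 1GB, 60s avg) | 50K invocations | ~$0.0000133/GB-s | ~$40 |
| M1 TOTAL | | | ~$1,095/month |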

Note: Polly Generative pricing varies by characters; estimate based on ~150 words/30s at $0.00016/char.

8.3 Nova Sonic M2 Cost

| Component | Calculation | Monthly Cost |
|-----------|-------------|--------------|
| Nova Sonic speech input | 12,500 min × 60 s × ~170 tokens/s × $0.0034/1K tokens | ~$434 |
| Nova Sonic speech output | 25,000 min × 60 s × ~170 tokens/s × $0.0136/1K tokens | ~$3,468 |
| Nova Sonic text (system prompt, tools) | ~50K turns × 500 text tokens × $0.003/1K | ~$75 |
| API Gateway WebSocket | 50K connections + message fees | ~$15 |
| Lambda WS handler (lightweight) | 50K sessions × 60s × 128MB | ~$5 |
| Nova Sonic M2 TOTAL | | ~$3,997/month |

Note: The output token rate dominates. At $0.0136/1K tokens, 25,000 minutes of audio output at ~170 tokens/second = 255M tokens = ~$3,468. This is higher than M1 Polly. However, Transcribe Streaming ($300) and Polly ($750) are replaced by a single billing dimension, and the quality + latency improvement justifies the delta at this volume. The ~170 tokens/second rate is an estimate, and it implies a much higher per-minute cost than the ~$0.017/minute effective rate quoted in Section 4.1; measure billed tokens per second of audio during the Phase 2 load test before finalizing the budget.

At 50K turns/month: Nova Sonic M2 costs approximately 3.6x M1. Because every line item in both stacks is usage-billed, this ratio stays roughly constant as volume grows.

8.4 Sesame CSM-1B Self-Hosted Cost (for comparison)

| Component | Spec | Monthly Cost |
|-----------|------|--------------|
| SageMaker ml.g5.xlarge (24/7) | $1.41/hr × 720hr | $1,015 |
| SageMaker ml.g5.xlarge (2nd instance, HA) | $1.41/hr × 720hr | $1,015 |
| Whisper/Transcribe STT (still needed) | 12,500 min × $0.024/min | $300 |
| Bedrock Claude Sonnet 4.6 (still needed) | ~$5.25 as above | $5 |
| ECS Task (orchestration) | 2 tasks × 0.5 vCPU | ~$30 |
| Data transfer, storage | | ~$20 |
| CSM-1B TOTAL | | ~$2,385/month |

Note: CSM-1B is ~$1,612/month cheaper than Nova Sonic at 50K turns, but the gap reverses at lower volume because SageMaker instances run 24/7 regardless of traffic, while Nova Sonic bills only on usage. Treating the two SageMaker instances, ECS, and data transfer (~$2,080) as fixed and STT plus LLM as per-turn costs, break-even is approximately 28K turns/month. Below that, Nova Sonic is cheaper because CSM-1B idle cost dominates; above it, CSM-1B is cheaper per turn, provided two instances can carry the load.

GPU idle cost is the structural risk for CSM-1B. At zero-traffic hours, the ml.g5.xlarge instances still bill $1.41/hr. Mitigation would require scaling the endpoint to zero, but SageMaker Serverless Inference does not support GPU instances, and cold-starting a GPU-backed endpoint for a 1B parameter model takes 15–45 seconds, which is unacceptable for voice.
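The cost comparison can be reproduced from the Section 8.1 volumes, the per-token prices in Section 4.1, and the ~170 tokens/second audio rate. The sketch below is an assumption-laden model rather than a billing calculator: it treats the API Gateway and Lambda line items as scaling linearly with turns, and it splits the CSM-1B stack into fixed costs (two SageMaker instances, ECS, data transfer) and per-turn costs (STT and LLM). All constant and function names are illustrative.

```python
BASE_TURNS = 50_000                 # monthly volume assumed in Section 8.1
INPUT_SECONDS_PER_TURN = 15
OUTPUT_SECONDS_PER_TURN = 30
AUDIO_TOKENS_PER_SECOND = 170       # estimated rate; validate against billed tokens
SPEECH_IN_PER_1K = 0.0034
SPEECH_OUT_PER_1K = 0.0136
TEXT_TOKENS_PER_TURN = 500
TEXT_PER_1K = 0.003
NOVA_OVERHEAD_AT_BASE = 15 + 5      # API Gateway WS + Lambda WS handler
CSM_FIXED = 2 * 1015 + 30 + 20      # two ml.g5.xlarge, ECS, data transfer
CSM_VARIABLE_AT_BASE = 300 + 5      # STT + Bedrock LLM at the base volume


def nova_sonic_monthly(turns):
    """Usage-billed monthly cost of the Nova Sonic stack at a given volume."""
    speech_in = turns * INPUT_SECONDS_PER_TURN * AUDIO_TOKENS_PER_SECOND / 1000 * SPEECH_IN_PER_1K
    speech_out = turns * OUTPUT_SECONDS_PER_TURN * AUDIO_TOKENS_PER_SECOND / 1000 * SPEECH_OUT_PER_1K
    text = turns * TEXT_TOKENS_PER_TURN / 1000 * TEXT_PER_1K
    overhead = NOVA_OVERHEAD_AT_BASE * turns / BASE_TURNS
    return speech_in + speech_out + text + overhead


def csm_monthly(turns):
    """Fixed GPU fleet cost plus per-turn STT and LLM cost."""
    return CSM_FIXED + CSM_VARIABLE_AT_BASE * turns / BASE_TURNS


def break_even_turns():
    """Monthly volume at which both stacks cost the same."""
    nova_per_turn = nova_sonic_monthly(BASE_TURNS) / BASE_TURNS
    csm_per_turn = CSM_VARIABLE_AT_BASE / BASE_TURNS
    return CSM_FIXED / (nova_per_turn - csm_per_turn)
```

Changing `AUDIO_TOKENS_PER_SECOND` to the rate observed in the Phase 2 load test is the quickest way to see how sensitive the comparison is to that single estimate.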

---

9. Quality Measurement Framework

9.1 Mirror-Tone Accuracy Definition

Mirror-tone accuracy measures how faithfully the voice pipeline reflects the intended emotional register, conversational style, and persona of the Silent Infinity product. It is distinct from WER (word error rate) and MOS (mean opinion score). We define it along three axes:

1. Prosodic alignment: Does the AI's spoken response match the emotional valence of the user's input? (e.g., a user speaking with urgency should receive a response with matched urgency, not flat neutral TTS)

2. Persona consistency: Does the voice character remain consistent across turns within a session and across sessions?

3. Turn-taking naturalness: Do pauses, hesitations, and response timing feel human-calibrated?

9.2 Automated Evaluation

Latency metrics (continuous, CloudWatch):

TTFA_p50, TTFA_p95 — per variant, per hour
SessionError_rate — 5-minute rolling
BargeIn_success_rate — fraction of barge-in events that result in clean interruption

Transcription accuracy (sampled, 1% of turns):

WER via Whisper large-v3 reference transcription vs Nova Sonic ASR output
Target: WER < 8% on in-domain Silent Infinity vocabulary
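For the sampled transcription check, word error rate can be computed with a word-level edit distance. This is a minimal sketch: it lowercases and splits on whitespace only, whereas a production evaluation would normally also normalize punctuation and numerals before scoring. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Here the Whisper large-v3 transcript plays the role of `reference` and the Nova Sonic ASR output the role of `hypothesis`; a sampled turn passes when the result is below 0.08.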

Audio quality (sampled, 1% of turns):

UTMOS score (neural MOS predictor) — target mean > 4.0
Spectral distance (MCD) between Nova Sonic output and reference Polly Generative output
Background noise level in silence frames (PESQ-adjacent)

Persona consistency (batch, weekly):

Run 100 scripted test utterances through each variant
Embed responses with a sentence transformer; compute cosine similarity within-variant vs cross-variant
High within-variant similarity = consistent persona
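The within-variant versus cross-variant comparison reduces to averaging cosine similarities over pairs of embeddings. The sketch below assumes the embeddings have already been produced by some sentence-embedding model (not shown) and are plain lists of floats; the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_within(vectors):
    """Average similarity over all unordered pairs inside one variant."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def mean_cross(vectors_a, vectors_b):
    """Average similarity over all pairs drawn from two different variants."""
    if not vectors_a or not vectors_b:
        raise ValueError("both variants need at least one embedding")
    total = sum(cosine(a, b) for a in vectors_a for b in vectors_b)
    return total / (len(vectors_a) * len(vectors_b))
```

A variant with a consistent persona should show `mean_within` clearly above `mean_cross`; the size of the gap that counts as meaningful is something to calibrate on the first few weekly batches.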

9.3 Human Sampling

Panel: 5 internal reviewers + 10 external beta users
Cadence: Weekly during Phase 2–3, monthly in Phase 4
Protocol: Blinded A/B: reviewer hears pairs (M1 response, M2 response) to the same prompt, rates:

- Naturalness (1–5)

- Mirror-tone accuracy (1–5, with rubric: does the AI sound like it's listening?)

- Overall preference (A/B/neither)

Target: M2 naturalness ≥ M1 naturalness, mirror-tone ≥ 4/5, overall preference M2 > M1

---

10. Variant Registry Plan

The following additions to variants.py support the M2 migration. Phase percentages represent the rollout trajectory.


# variants.py — M2 additions

VOICE_VARIANTS = {
    # BASELINE — current M1 production path
    "transcribe-polly-baseline": {
        "voice_stt": "transcribe_streaming",
        "voice_llm": "bedrock_claude_sonnet_4_6",
        "voice_tts": "polly_generative",
        "transport": "sse",
        "rollout_pct": 100,       # Phase 1: 100% | Phase 4: 0%
        "internal_only": False,
        "min_sdk_version": None,
        "description": "M1 production baseline. SSE transport. Polly Neural/Generative TTS.",
        "deprecated": False,      # Set True in Phase 4 week 2
    },

    # PRIMARY M2 CANDIDATE — Nova Sonic unified speech-to-speech
    "sonic-unified": {
        "voice_stt": "nova_sonic_unified",   # STT handled internally by Nova Sonic
        "voice_llm": "nova_sonic_unified",   # LLM handled internally by Nova Sonic
        "voice_tts": "nova_sonic_unified",   # TTS handled internally by Nova Sonic
        "transport": "websocket",
        "bedrock_model_id": "amazon.nova-sonic-v1:0",
        "rollout_pct": 0,         # Phase 1: 0% | Phase 2: 0% internal | Phase 3: 5% | Phase 4: 100%
        "internal_only": True,    # Set False at Phase 3 canary
        "barge_in_enabled": True,
        "system_prompt_key": "voice_persona_v2",
        "description": "M2 primary. Nova Sonic unified S2S. WebSocket transport. Sub-300ms target.",
        "cloudwatch_alarm_group": "VoiceM2Alarms",
    },

    # FALLBACK M2 CANDIDATE: Sesame CSM-1B (activated only if Nova Sonic fails Phase 2 gate)
    "sesame-csm1b": {
        "voice_stt": "transcribe_streaming",   # Still required; CSM-1B is TTS-only
        "voice_llm": "bedrock_claude_sonnet_4_6",
        "voice_tts": "sesame_csm1b_sagemaker",
        "transport": "websocket",
        "sagemaker_endpoint": "csm-1b-inference-v1",
        "rollout_pct": 0,         # Stays 0 unless Nova Sonic fails Phase 2 gate
        "internal_only": True,
        "barge_in_enabled": False,  # Requires custom implementation
        "description": "M2 fallback. CSM-1B on SageMaker. Higher latency; better voice persona control.",
        "cloudwatch_alarm_group": "VoiceM2Alarms",
        "notes": "Activate only if sonic-unified fails the Phase 2 exit criteria (p50 TTFA ≤ 200ms).",
    },
}

# Rollout schedule
ROLLOUT_PHASES = {
    "Phase1_Week1":  {"transcribe-polly-baseline": 100, "sonic-unified": 0, "sesame-csm1b": 0},
    "Phase2_Week2":  {"transcribe-polly-baseline": 100, "sonic-unified": 0, "sesame-csm1b": 0},  # internal shadow
    "Phase3_Week4":  {"transcribe-polly-baseline": 95,  "sonic-unified": 5,  "sesame-csm1b": 0},
    "Phase3_Week5":  {"transcribe-polly-baseline": 75,  "sonic-unified": 25, "sesame-csm1b": 0},
    "Phase4_Week6":  {"transcribe-polly-baseline": 50,  "sonic-unified": 50, "sesame-csm1b": 0},
    "Phase4_Week7":  {"transcribe-polly-baseline": 0,   "sonic-unified": 100,"sesame-csm1b": 0},
}
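One way a registry like this could be consumed is with deterministic bucketing, so that a given user stays on the same variant for the duration of a rollout step. The sketch below is an assumption about how routing might work, not a description of the existing variants.py logic; `bucket` and `pick_variant` are hypothetical names, and the rollout mapping mirrors one row of the rollout schedule.

```python
import hashlib

# One row of the rollout schedule, repeated here so the sketch is self-contained.
PHASE3_WEEK4 = {"transcribe-polly-baseline": 95, "sonic-unified": 5, "sesame-csm1b": 0}


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_variant(user_id, rollout, is_internal=False, internal_variant=None):
    """Choose a variant by walking cumulative rollout percentages."""
    if is_internal and internal_variant:
        return internal_variant  # internal testers can be pinned to a variant
    user_bucket = bucket(user_id)
    upper = 0
    for name, pct in rollout.items():
        upper += pct
        if user_bucket < upper:
            return name
    raise ValueError("rollout percentages must sum to 100")
```

Hashing the user id rather than drawing a random number per request keeps A/B comparisons clean, because a user never flips between M1 and M2 mid-session.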

---

11. Rollback Strategy

11.1 Automated Rollback Triggers

The following CloudWatch alarms are mandatory before any Phase 3 canary traffic.

Alarm 1: M2 P95 Latency Breach


MetricName: VoiceM2_TTFA_ms
Namespace:  SilentInfinity/Voice
ExtendedStatistic: p95
Period:     300 seconds (5 min)
Threshold:  > 800ms
EvaluationPeriods: 2
TreatMissingData: notBreaching
AlarmActions:
  - SNS: voice-m2-oncall
  - Lambda: revert-to-m1-variant  (sets sonic-unified rollout_pct = 0)
  - CodeDeploy: automatic rollback via the deployment group's alarm configuration (if active canary deployment)

Alarm 2: M2 Error Rate Breach


MetricName: VoiceM2_SessionError_rate
Namespace:  SilentInfinity/Voice
Statistic:  Average
Period:     300 seconds
Threshold:  > 0.02  (2%)
EvaluationPeriods: 2
TreatMissingData: notBreaching
AlarmActions:
  - SNS: voice-m2-oncall
  - Lambda: revert-to-m1-variant

Alarm 3: Bedrock Service Error Spike


Namespace:  AWS/Bedrock
MetricName: InvocationServerErrors
Dimensions: ModelId=amazon.nova-sonic-v1:0
Statistic:  Sum
Period:     60 seconds
Threshold:  > 5 errors in 60 seconds
EvaluationPeriods: 1
AlarmActions:
  - Lambda: revert-to-m1-variant (immediate)

Alarm 4: M2 TTFA Regression (soft warning)


MetricName: VoiceM2_TTFA_ms
Namespace:  SilentInfinity/Voice
ExtendedStatistic: p50
Threshold:  > 400ms
Period:     600 seconds (10 min)
EvaluationPeriods: 3
AlarmActions:
  - SNS: voice-m2-engineering (notification only, no auto-revert)
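The P95 breach alarm could be created programmatically along the following lines. This sketch assumes the raw per-turn metric VoiceM2_TTFA_ms from Phase 2 is what gets published, so that CloudWatch computes the percentile itself through `ExtendedStatistic`. The keyword names match the CloudWatch `put_metric_alarm` API; the ARNs passed in are placeholders and the helper names are hypothetical.

```python
def p95_breach_alarm(sns_topic_arn, revert_lambda_arn):
    """Keyword arguments for CloudWatch put_metric_alarm (latency breach alarm)."""
    return {
        "AlarmName": "VoiceM2_P95_Breach",
        "Namespace": "SilentInfinity/Voice",
        "MetricName": "VoiceM2_TTFA_ms",
        "ExtendedStatistic": "p95",   # percentiles use ExtendedStatistic, not Statistic
        "Period": 300,
        "EvaluationPeriods": 2,
        "Threshold": 800.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn, revert_lambda_arn],
    }


def seconds_to_alarm(alarm_kwargs):
    """Minimum sustained breach time before the alarm changes state."""
    return alarm_kwargs["Period"] * alarm_kwargs["EvaluationPeriods"]


def create_alarm(cloudwatch_client, alarm_kwargs):
    """Apply the definition; pass boto3.client('cloudwatch') in production."""
    cloudwatch_client.put_metric_alarm(**alarm_kwargs)
```

With a 300-second period and two evaluation periods, a breach has to persist for 10 minutes before the revert action fires, which is worth weighing against the sub-60-second rollback objective.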

11.2 Manual Rollback Procedure

In any emergency (alarm not firing but quality is degraded):


# Immediate: flip variant registry via SSM Parameter Store
aws ssm put-parameter \
  --name "/silentinfinity/voice/variant_overrides" \
  --value '{"sonic-unified": 0, "transcribe-polly-baseline": 100}' \
  --overwrite

# Lambda reads this parameter on each invocation via a 30-second cache
# RTO: < 60 seconds
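The 30-second cache mentioned in the comments above could look like the following sketch. `CachedOverrides` and `ssm_fetcher` are hypothetical names; the fetch function and clock are injected so the caching behavior can be verified without AWS access, and the real fetcher uses the SSM `get_parameter` call.

```python
import json
import time


class CachedOverrides:
    """Caches the variant override JSON for ttl_seconds between fetches."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch          # callable returning the raw JSON string
        self._ttl = ttl_seconds
        self._clock = clock
        self._loaded = False
        self._loaded_at = 0.0
        self._value = {}

    def get(self):
        """Return the overrides, refetching only when the cached copy is stale."""
        now = self._clock()
        if not self._loaded or now - self._loaded_at >= self._ttl:
            self._value = json.loads(self._fetch())
            self._loaded_at = now
            self._loaded = True
        return self._value


def ssm_fetcher(name="/silentinfinity/voice/variant_overrides"):
    """Build the real fetcher inside Lambda (boto3 ships with the runtime)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    ssm = boto3.client("ssm")
    return lambda: ssm.get_parameter(Name=name)["Parameter"]["Value"]
```

Creating the `CachedOverrides` instance at module scope lets warm Lambda invocations share the cache, so a manual flip propagates within one TTL plus the time for in-flight invocations to finish.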

11.3 M1 Deprecation Safety Gate

M1 infrastructure (Transcribe IAM role, Polly config, SSE Lambda handler) is not deleted until:

Nova Sonic M2 has run at 100% traffic for minimum 14 days
Zero rollback events in final 7 days
p95 TTFA stable under 600ms for 7-day trailing window
CFO sign-off on the increased monthly cost delta

---

12. M3 Forward Look

M3 is not in scope for this memo but should inform decisions made in M2.

The primary M3 driver is multilingual expansion. Nova Sonic supports 5 languages as of early 2026 (English US/UK, Spanish, French, Italian, German). Silent Infinity's roadmap likely requires broader coverage. Three paths:

1. Nova 2 Sonic + language expansion: AWS has committed to expanding Nova Sonic language support. Monitor AWS release notes.

2. MMS + custom pipeline: Pratap et al. (2023) demonstrated that Massively Multilingual Speech (MMS) can support 1,000+ languages using public domain text with wav2vec 2.0 [Pratap et al., arXiv:2305.13516, 2023]. MMS is released under CC-BY-NC 4.0 (note: not commercially permissive — verify before use). An M3 architecture could use MMS for non-English STT piped into a multilingual LLM and a separate TTS system, at the cost of re-introducing pipeline hops and latency regression.

3. CSM-1B multilingual fine-tune: Sesame has announced plans to expand CSM-1B to 20+ languages. This is the most compelling M3 use case for CSM-1B — not as the latency-critical unified model, but as the fine-tunable voice persona layer on top of a Nova Sonic ASR stream.

M3 recommendation: Keep Nova Sonic as the primary unified backbone; layer a fine-tuned CSM-1B voice persona on top for non-English markets as they open, using CSM-1B's audio context input to maintain Silent Infinity's voice character across languages.

---

13. Recommendations

Recommendation 1: Nova Sonic is the M2 backbone — commit to it

The evidence is unambiguous. Nova Sonic eliminates three service hops, provides native bidirectional streaming equivalent to a WebSocket, includes tool use and barge-in as first-class features, and is fully managed on existing AWS infrastructure. At 50K turns/month the monthly cost rises from ~$1,095 to ~$3,347 (roughly 3x), but this delta buys a targeted 2.5x latency improvement (750ms p50 down to the 300ms target) and a qualitatively different product experience. The M2 cost per turn is ~$0.067 ($3,347 / 50,000 turns), comparable to a mid-tier dedicated voice API and substantially below the OpenAI Realtime API ($0.30+ per turn at equivalent output).
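
The figures in this recommendation can be reproduced with a few lines of arithmetic. The sketch uses only numbers already stated in this memo (50K turns/month, ~$1,095 and ~$3,347 monthly cost, 750ms M1 p50, 300ms M2 target); the variable names are illustrative.

```python
TURNS_PER_MONTH = 50_000
M1_MONTHLY_USD = 1_095
M2_MONTHLY_USD = 3_347
M1_P50_MS = 750
M2_TARGET_P50_MS = 300

m1_cost_per_turn = M1_MONTHLY_USD / TURNS_PER_MONTH
m2_cost_per_turn = M2_MONTHLY_USD / TURNS_PER_MONTH
monthly_delta_usd = M2_MONTHLY_USD - M1_MONTHLY_USD
cost_multiple = M2_MONTHLY_USD / M1_MONTHLY_USD
latency_ratio = M1_P50_MS / M2_TARGET_P50_MS

print(f"M1 cost per turn: ${m1_cost_per_turn:.3f}")
print(f"M2 cost per turn: ${m2_cost_per_turn:.3f}")
print(f"Monthly delta:    ${monthly_delta_usd:,}")
print(f"Cost multiple:    {cost_multiple:.2f}x")
print(f"Latency ratio:    {latency_ratio:.1f}x")
```

The trade being proposed is therefore roughly a 3x increase in spend for a 2.5x reduction in p50 latency, provided the 300ms target is actually met.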

Recommendation 2: Reject Sesame CSM-1B as M2 primary; revisit for M3 voice persona

CSM-1B is architecturally mismatched for M2. It is a TTS-only model that requires separate STT and LLM components upstream of it, which preserves the multi-hop latency structure M2 is designed to eliminate. Its RTF of ~0.28x on an A100-class GPU means first-audio-byte latency cannot reach 300ms without streaming optimization work that is not yet production-tested. GPU idle cost on SageMaker also turns a variable cost into a fixed one, which is hard to justify at low traffic volumes. CSM-1B belongs in M3 as a fine-tunable voice persona layer for custom characters and multilingual expansion, where its Apache 2.0 license and fine-tuning capability create genuine value.
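
To make the RTF constraint concrete, the sketch below converts the ~0.28x figure into generation time and into the largest first audio chunk that would fit a given budget. It assumes generation time scales linearly with audio length and ignores fixed startup overhead, both of which are simplifications; the function names are illustrative.

```python
RTF = 0.28  # seconds of compute per second of generated audio (A100-class GPU)
TARGET_MS = 300


def generation_time_ms(audio_seconds: float, rtf: float = RTF) -> float:
    """Time to synthesize `audio_seconds` of audio, under linear scaling."""
    return audio_seconds * rtf * 1000


def max_first_chunk_seconds(budget_ms: float, rtf: float = RTF) -> float:
    """Longest first chunk that can be synthesized within `budget_ms`."""
    return budget_ms / 1000 / rtf


# Without streaming, a 10-second reply is only playable after ~2.8 s.
print(f"10 s utterance: {generation_time_ms(10):.0f} ms")

# With streaming, even if TTS had the entire 300 ms budget to itself,
# the first chunk could hold at most about one second of audio.
print(f"Max first chunk: {max_first_chunk_seconds(TARGET_MS):.2f} s")
```

In practice the STT and LLM stages consume part of the same 300ms budget, so the usable first chunk is shorter still. That is why chunked streaming is a precondition for CSM-1B here, not an optional optimization.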

Recommendation 3: Instrument speculatively from day one

Speculative end-of-turn detection — triggering the Nova Sonic stream before the user fully stops speaking, using acoustic features to predict completion [Leviathan et al., ICML 2023; LTS-VoiceAgent framework, 2025] — is the lowest-cost, highest-leverage latency optimization available after the architecture swap. Implementing it in Phase 2 (not Phase 4) ensures you measure its effect in the canary window. A 30–60ms reduction in perceived TTFA turns a borderline 320ms p50 into a clean sub-300ms result.
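
One way to picture the speculative trigger is a small state machine over voice-activity frames. The sketch below is a stand-in, not the technique from the cited work: it uses trailing silence as the only signal instead of richer acoustic features, and the class name, event names, frame size, and thresholds are all hypothetical placeholders to be tuned in the canary window.

```python
from dataclasses import dataclass
from typing import Optional

FRAME_MS = 20  # assumed duration of one voice-activity frame


@dataclass
class EndOfTurnDetector:
    speculate_after_ms: int = 200  # start the model stream early
    commit_after_ms: int = 500     # treat the turn as definitely over
    _silence_ms: int = 0
    _speculating: bool = False
    _heard_speech: bool = False

    def feed(self, is_speech: bool) -> Optional[str]:
        """Consume one frame; return 'speculate', 'cancel', 'commit' or None."""
        if is_speech:
            self._heard_speech = True
            self._silence_ms = 0
            if self._speculating:
                # The user kept talking: discard the speculative response.
                self._speculating = False
                return "cancel"
            return None
        if not self._heard_speech:
            return None  # leading silence before any speech is ignored
        self._silence_ms += FRAME_MS
        if self._silence_ms >= self.commit_after_ms:
            # Turn is over: reset for the next turn.
            self._silence_ms = 0
            self._speculating = False
            self._heard_speech = False
            return "commit"
        if not self._speculating and self._silence_ms >= self.speculate_after_ms:
            self._speculating = True
            return "speculate"
        return None


detector = EndOfTurnDetector()
frames = [True] * 5 + [False] * 10 + [True] * 2 + [False] * 25
events = [e for e in (detector.feed(f) for f in frames) if e is not None]
print(events)
```

Logging every `speculate`, `cancel`, and `commit` event with a timestamp from Phase 2 onward is what makes the latency gain, and the cost of wasted speculative requests, measurable.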

---

14. References

1. AWS Nova Sonic announcement (April 2025): Introducing Amazon Nova Sonic

2. AWS Nova 2 Sonic announcement (December 2025): Introducing Amazon Nova 2 Sonic

3. Nova Sonic technical documentation: Amazon Nova Sonic Speech-to-Speech

4. Nova Sonic Bedrock model card: Nova Sonic — Amazon Bedrock

5. Amazon Nova pricing: Amazon Nova Pricing

6. Sesame CSM-1B model card: sesame/csm-1b — Hugging Face

7. SesameAILabs/csm GitHub: CSM Repository

8. CSM realtime issue thread: Issue #78 — Realtime voice agents

9. Leviathan et al. (ICML 2023): "Fast Inference from Transformers via Speculative Decoding." arXiv:2211.17192. https://arxiv.org/abs/2211.17192

10. Pratap et al. (2023): "Scaling Speech Technology to 1,000+ Languages." arXiv:2305.13516. https://arxiv.org/abs/2305.13516

11. OpenAI Realtime API documentation: https://platform.openai.com/docs/guides/realtime

12. Google Gemini Live API documentation: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/live-api

13. Deepgram Voice Agent latency: Low Latency Voice AI — Deepgram

14. Voice AI latency comparison 2026: Realtime vs Gemini Live vs ElevenLabs — TokenMix

15. WebSocket vs SSE latency: WebSockets vs SSE — Ably

16. WebSocket vs SSE protocol comparison: RxDB WebSocket/SSE analysis

17. AWS API Gateway WebSocket setup (2026): OneUptime — WebSocket with API Gateway

18. CloudWatch canary rollback: Canary Deployment for AWS Lambda — Lumigo

19. SageMaker inference pricing: SageMaker Pricing — AWS

20. Caylent Nova Sonic integration blog: Introducing Amazon Nova Sonic — Caylent

21. LTS-VoiceAgent streaming framework (2025): arXiv:2601.19952. https://arxiv.org/html/2601.19952

22. Nova Sonic WebRTC sample: aws-samples/sample-nova-sonic-speech2speech-webrtc

---

Memo prepared by SCOUT (TITAN Research Agent) on 2026-04-21. All pricing figures are current as of Q1 2026 and subject to AWS rate changes. Latency claims for Nova Sonic reflect vendor-reported figures; independent p50/p95 benchmarks should be established in Phase 2 before committing to production rollout.

Technical Migration Memo — v1.0