Voice-Mode Chat UX — Research Memo

Silent Infinity | 2026-04-22 | SCOUT/TITAN

Product context: Silent Infinity — contemplative AI wellness, Transcribe PCM16 → Bedrock Claude Sonnet 4.6 → Polly Ruth Neural. Current pain: 26.5s total latency / 7.6s TTFA, user reports "robotic and rushed." VAD raised from 500ms to 1400ms. Voice-mode prompt override stripped warmth (reverted). New voice UI: transcript flex:1 top, orb/mic bottom, user bubbles right/blue, AI bubbles left/orange.

---

1. Voice UI Patterns

ChatGPT Advanced Voice Mode

OpenAI's Advanced Voice Mode (GPT-4o, launched mid-2024, major UX update June 2025) is the current benchmark. The central design choice is a single unified pane: voice and text exist in the same thread rather than a separate modal. The older design suffered "modal lockout" — speaking meant leaving the text interface. The fix made real-time transcription stream into the same bubble timeline as typed messages, so users can pivot between modalities without losing continuity.

Mic placement: a persistent waveform orb centered at the bottom of the screen, activating on tap or on push-to-talk. The orb animates to a pulsing ring during AI speech, giving a clear "I am speaking" affordance. User bubbles appear on the right in dark gray, AI bubbles on the left in lighter gray — the same layout as the text product. Transcripts appear in near real-time as words stream in, which provides critical visual reassurance ("it heard me") before the AI responds.

Interrupt UX: GPT-4o processes audio end-to-end without a transcribe → LLM → TTS pipeline hop, reaching TTFA as low as 232ms — comparable to human conversational response time. Barge-in is fully supported: speaking while the AI is mid-sentence instantly halts playback and transitions to listening mode. The UI orb reacts visually within the latency budget, confirming detection before the user doubts they were heard.

(Sources: DataStudios June 2025, BytePlus 2025, OpenAI developer blog)

Pi.ai

Pi (Inflection AI) is the gold standard for voice UX warmth. Its voice mode offers natural speech patterns with realistic micro-pauses, breathing sounds, genuine laughter, and emotional intonation that reviewers consistently rate above Replika and Character.ai. Pi's design philosophy is explicitly anti-transactional: it asks follow-up questions, refuses to rush, and models patient listening.

Pi does not show a traditional chat transcript during voice sessions — the screen stays mostly empty, centering the listener's attention on the audio channel rather than a text readout. This is a deliberate UX choice: showing text creates a reading task that competes with listening. The mic indicator is minimal — a soft animation in the center of the screen. There is no "send" button; silence is VAD-terminated.

Interrupt UX: Pi allows barge-in but tuned conservatively, prioritizing the case where the user wants to finish a thought rather than optimizing for rapid back-and-forth. The result feels more like a patient therapist than a reactive assistant.

(Source: AI Companion Guides 30-day review 2025, Medium/Eric Slatkin review)

Claude.ai Voice

Claude.ai's voice mode (launched 2024, expanded 2025) uses a similar pipeline to Silent Infinity — STT → LLM → TTS — and shares the robotic TTS problem when responses are long. Claude's UI keeps a visible transcript, which serves users who want to verify understanding, but creates reading/listening split attention. Mic icon sits at bottom-center. No full barge-in interrupt as of early 2026 — there is a stop button but not automatic detection.

The key learnable failure: Claude's TTS persona (typically standard Amazon or browser TTS) lacks warmth that Claude's written register creates. The prompt-trained warmth does not survive the TTS rendering layer — a problem Silent Infinity has directly observed when a voice-mode prompt override stripped personality.

Wysa

Wysa (FDA Breakthrough Device status, 2024) offers voice as one of its interaction modes but remains predominantly text-chat-first with tappable response selections. Its voice UX is designed for clinical safety above conversational fluency: responses are shorter and more structured to enable crisis-detection logic between turns. The UI presents AI turns as speech bubbles with an avatar icon (a small illustrated penguin), not a raw waveform, which reinforces the "therapeutic companion" frame.

Wysa configures risk thresholds that, when crossed, pause the voice conversation and route to a human-review queue or emergency resources. This pause is deliberate — a silence beat before escalation rather than an immediate redirect, to avoid panic.

(Source: Choosing Therapy review 2025, iatrox.com clinical AI analysis)

Replika

Replika's voice mode targets social companionship rather than wellness. Bubbles are styled to feel like a messaging app (iMessage aesthetic), with avatar images rather than waveforms. The product historically struggled with users over-attaching to the AI voice persona, prompting Replika to introduce guardrails in 2023. Voice is framed as "talking to a friend," with longer AI utterance tolerance before VAD cuts in — typically 2–3 seconds of silence.

Transcript is hidden during voice mode by default; users see the animated avatar face. Transcripts are accessible via a scroll-up gesture after the turn completes. This reduces cognitive load during speech but sacrifices verifiability — users cannot confirm what was said without a gesture interrupt.

Character.ai Voice

Character.ai voice (released 2024) is optimized for creative roleplay and character immersion. Each AI character has its own voice profile, usually ElevenLabs-generated neural voices, providing high naturalness. The UI shows a character portrait at full-screen with a pulsing audio bar. No live transcript; character consistency takes priority over comprehension verification.

Barge-in is supported but tuned to the roleplay mode: interrupting the character mid-monologue is treated as part of the narrative, not a UX error. The design lesson is that barge-in UX is genre-dependent — it reads differently in an immersive fiction context than in a wellness conversation where it might feel rude.

(Source: AI Companion Guides Pi vs Character.ai review 2025)

---

2. Latency Playbook

The Benchmark: 232ms is Human, 800ms is Good, 2s is Frustrating

OpenAI's GPT-4o native audio model achieves ~232ms TTFA in ideal conditions by eliminating the STT → LLM → TTS pipeline entirely — audio goes in, audio comes out. The Cartesia "State of Voice AI 2024" report puts the competitive threshold at sub-800ms TTFA for non-immersive applications. Silent Infinity's current 7.6s TTFA is approximately 10× over that threshold, and 26.5s total latency means the user waits roughly as long as it takes to boil an egg.

Where the 26.5s Goes

A decoupled Transcribe → Bedrock → Polly pipeline accumulates latency at four points:

1. Transcribe STT finalization — Transcribe streaming is near-real-time, but the VAD silence window (now 1400ms) adds mandatory wait at the end of each user turn before the transcript is finalized and sent.

2. LLM TTFT (time to first token) — Bedrock Claude Sonnet 4.6 first-token latency under normal load: typically 500ms–1.5s. Full response generation for a long contemplative answer: 4–8s.

3. TTS synthesis — If Polly receives the full completed LLM response before synthesis begins, this adds the entire LLM generation time as a blocking wait, plus Polly synthesis time (~200–400ms per sentence).

4. Audio buffering and playback start — Depending on streaming implementation, an additional 200–600ms before first audio byte reaches the speaker.

Sentence-Level Streaming: The Single Highest-Impact Fix

The most proven architectural improvement short of switching to Nova Sonic is sentence-level streaming TTS. The pattern: stream LLM tokens → detect sentence boundary (., ?, !, ; followed by whitespace) → immediately fire that sentence to Polly → begin Polly synthesis and audio streaming in parallel with continued LLM generation. The user hears the first sentence within roughly LLM_TTFT + Polly_sentence_latency, typically 1.5–2.5s on Bedrock Sonnet, versus waiting for the entire response. Polly now offers a bidirectional streaming API that accepts text word-by-word, allowing synthesis to begin before the sentence is even complete.

(Source: AWS ML blog on Polly bidirectional streaming, 2025)

Barge-In: VAD Is Not Enough

Barge-in requires three things: (1) echo cancellation so the mic does not detect the AI's own TTS output as "speech," (2) a VAD confidence threshold that fires fast enough to feel immediate but not so sensitive it triggers on breath, and (3) a cancellation signal sent to the LLM context so it understands the previous response was interrupted. Without (3), the AI continues as if the interrupted response completed, producing incoherence. Current VAD at 1400ms is calibrated for sentence completion but creates ~1.4s delay before the turn hands off — too long for a wellness user who paused to reflect.

Amazon Nova Sonic vs Decoupled Pipeline

Nova Sonic is Amazon's unified speech-to-speech model on Bedrock (GA in 2025), purpose-built to collapse the Transcribe → LLM → Polly chain into a single bidirectional streaming invocation. It eliminates three separate network hops and three serialization layers. Architectural analysis from Caylent (2025) confirms it is designed for sub-500ms conversational latency. However, it does not yet support instruction-tuned system prompts with the same depth as Claude Sonnet 4.6, and its voice persona is fixed — you cannot inject Ruth Neural's specific timbre or SSML. The trade-off: lower latency versus lower prompt controllability, which matters specifically for Silent Infinity's contemplative warmth requirement.

(Sources: Caylent Nova Sonic blog 2025, AWS Polly bidirectional streaming docs, saif71.com AWS voice comparison)

---

3. Voice Copy Register

Writing for the Ear

Text-first AI responses are written to scan: bullet points, hedging qualifications, paragraph breaks. All of this fails in audio. Bullet points become a flat list delivered at machine tempo. Qualifications like "It's worth noting that..." add 400ms of meaningless preamble before the information lands. The ear processes prosody — rhythm, pause, emphasis — as content. A sentence written for the eye ("I understand you're feeling overwhelmed right now, and I want to acknowledge that.") sounds warm when read silently but feels over-extended when spoken into a silent room.

Rules for voice copy: shorter sentences, under 18 words where possible. Contractions are mandatory ("I'm" not "I am," "you're" not "you are"). Active verbs. Begin responses with the emotional acknowledgment, not a transition phrase. Never begin a turn with "Certainly" or "Of course" — these are filler tokens that serve text skimmability, not voice.

SSML Paralinguistic Cues

Amazon Polly supports SSML tags that can restore some of what the voice-mode prompt loses. <break time="800ms"/> inserts a genuine reflective pause — used after a user shares something difficult, this is more therapeutic than any word. <prosody rate="slow" volume="soft"> slows and quiets a section, signaling care. <emphasis level="moderate"> adds stress to key words. The Ruth Neural voice on Polly supports all standard SSML. The problem is that injecting SSML requires the LLM to output valid SSML tags in its response, which demands explicit system prompt instruction and produces occasional malformed tags that Polly will reject, causing silent audio gaps.

Why Override Prompts Strip Warmth

When Silent Infinity applied a voice-mode prompt override, it likely replaced or diluted the persona framing built into the base system prompt. Voice-mode overrides often prioritize brevity instructions ("keep responses under 2 sentences") that conflict with warmth instructions ("acknowledge feelings before responding"). The brevity instruction wins because it is more syntactically clear — the model can measure sentence count, but cannot measure warmth. The fix is not to toggle between prompts but to build a single prompt that nests voice-specific brevity rules inside the warmth framing, so the register is preserved even as length is constrained.

(Source: first-principles prompt analysis; Claude.ai voice mode observation)

---

4. Wellness-Specific Patterns

The Silence Is the Feature

Contemplative and therapeutic voice products operate on a different time contract than productivity assistants. A wellness user pausing for 3 seconds is thinking, not done. A productivity user pausing for 1 second is likely done and waiting for the response. Silent Infinity's VAD calibration at 1400ms is appropriate for this context — possibly even still too short. Wysa's clinical use research shows users in emotionally activated states have longer between-sentence pauses; premature VAD cutoff produces the "I wasn't finished" frustration that Silent Infinity's users have reported.

The deeper principle: in therapeutic voice design, silence after the AI response is often more valuable than the next AI utterance. The AI should not reflexively fill silence. Voice products that support an explicit "thinking pause" mode — where the AI holds state for 5–8 seconds before offering a gentle re-prompt — test significantly better for user trust in clinical wellness settings.

Crisis-Detection Integration via Voice

Wysa's model for crisis detection in voice: keyword + tone analysis triggers a pause state before escalation. The pause before the redirect is deliberate — an abrupt "I'm connecting you to a crisis line" spoken at normal pace is alarming. Implemented correctly: the AI slows its tempo, softens volume via prosody, acknowledges in plain language ("I want to make sure you're safe"), then offers the resource. For Silent Infinity, crisis-path scripting should be pre-baked into SSML-formatted responses rather than generated, so tone is controlled and the response cannot be hallucinated.

Witnessing Discipline via Voice

"Witnessing" in contemplative practice means reflecting back what was said without interpretation or fix. Voice UX that does this correctly resists the AI's default toward advice-giving. Prompts should enforce this with explicit instruction: "Do not suggest what the user should do. Reflect what they said, ask one open question, and wait." In voice, this pattern produces turns that are under 20 words — short enough to complete before the user wants to respond, naturally creating space.

The voice turn structure for a witnessing interaction: [2-word emotional validation] + [paraphrased reflection, 8–12 words] + [one open question, 6–10 words] + [silence]. Total spoken time: approximately 8–10 seconds. This creates a felt sense of being heard without overwhelming.

(Source: Woebot/Wysa clinical design literature; medrxiv voice-enabled mental health AI 2025)

---

5. Ranked Recommendations for Silent Infinity

The following recommendations are ordered by expected impact-to-effort ratio. Effort is estimated for a 1–2 person engineering team familiar with the current stack.

1. Sentence-level streaming TTS — TTFA from ~7.6s to ~2.0s

Effort: 8–12 hours. Stream LLM tokens to a sentence-boundary detector; fire each completed sentence to Polly as a separate request; begin audio playback of sentence 1 while sentence 2 is still generating. This single change is expected to halve or better the perceived TTFA. Use Polly's bidirectional streaming API to start synthesis before sentence completion. Expected impact: dramatic — this is the root cause of the "robotic and rushed" complaint. Users who hear a warm first sentence in 2s feel heard; users who wait 7.6s for a complete paragraph feel interrogated.

2. VAD post-silence calibration by session state

Effort: 3–4 hours. Implement two VAD modes: "open" (1800–2200ms) for emotionally activated conversations (detected via topic classification or explicit user state), "standard" (1200ms) for neutral check-ins. The 1400ms flat threshold is a reasonable default but not optimal for a user mid-cry. Consider exposing a soft UI affordance ("take your time" mode) that temporarily extends VAD to 3s.

3. Barge-in with echo cancellation

Effort: 12–20 hours. True barge-in requires microphone audio separation from TTS playback — the mic must not pick up Ruth Neural's voice as user speech. Implement acoustic echo cancellation (AEC) before the VAD pipeline. On sentence-level streaming, barge-in can be granular: interrupting mid-sentence cancels the current Polly stream and the remaining LLM generation, not the entire turn. Send a cancellation token to the LLM context.

4. SSML prosody in AI responses

Effort: 4–6 hours. Instruct the LLM system prompt to output <speak> wrapped SSML with: <break time="600ms"/> after empathic acknowledgments, <prosody rate="95%" volume="soft"> on sections addressing emotional pain, and sentence-final <break time="300ms"/> before closing questions. Add a Polly response validator to strip malformed tags before synthesis, preventing silent audio gaps.

5. Voice prompt consolidation — warmth-first, brevity-nested

Effort: 2–3 hours. Rewrite the system prompt as a single voice-aware document: warmth register at the top, voice formatting rules nested inside it ("within the above register, keep each spoken turn under 25 words, use contractions, never begin with 'Certainly'"). Remove or merge the separate voice-mode override that previously stripped warmth. A/B test against current prompt with 10 real sessions before full deploy.

6. Transcript area as reassurance, not distraction

Effort: 2 hours. The current transcript flex:1 top area is correctly placed — visual confirmation text was received reduces anxiety during the latency gap. Optimize: show user transcript immediately (Transcribe streaming has near-zero lag), then show AI response transcript streaming word-by-word as TTS plays. Add a subtle "listening..." indicator during VAD window so users know silence is being interpreted as thinking, not abandonment.

7. Pre-baked silence / breath moment after user shares deeply

Effort: 4–6 hours. After turn classification (grief, fear, trauma — identifiable via LLM reasoning or topic model), prepend the AI response with a 1.5s SSML pause: <break time="1500ms"/>. This is the single most powerful therapeutic UX improvement available at low cost — the felt experience of being heard before being responded to. No change to the LLM response itself; purely a synthesis-layer addition.

8. Transition to Amazon Nova Sonic (medium-term)

Effort: 30–60 hours. Nova Sonic is a single bidirectional stream replacing Transcribe + Bedrock + Polly. It eliminates three network hops and reduces baseline TTFA to sub-500ms architecturally. Trade-offs: less prompt controllability than Claude Sonnet 4.6, fixed voice persona, no SSML injection. Recommended path: implement recommendations 1–7 first to validate the product experience, then evaluate Nova Sonic when it reaches feature parity on instruction-following. Do not migrate before then — the contemplative warmth is the product; latency is secondary to persona quality.

9. Progressive silence confidence indicator in orb

Effort: 3–4 hours. The orb/mic bottom element can pulse differently during VAD window versus active listening. A slowly expanding ring (0 → 1400ms over the VAD window) gives the user a visual cue of how long the silence will run before the system responds. This eliminates the "did it hear me?" uncertainty that causes users to re-speak, double-triggering the pipeline.

10. Short-response mode via user intent detection

Effort: 6–8 hours. Many wellness voice interactions are acknowledgment loops ("mm-hmm," "I see," "tell me more"). Detect short affirmations and route them to a pre-synthesized audio file (a 3-second warm "mm-hmm" from Ruth Neural, pre-recorded offline) rather than running the full LLM pipeline. This creates a genuine low-latency back-channel signal for minimal user inputs without any LLM cost.

11. Crisis path pre-scripted SSML

Effort: 4–6 hours. Define 3–5 crisis-response scripts with full SSML prosody: soft, slow, non-alarming. Trigger via keyword list + LLM safety classifier. Never generate crisis language dynamically — the variability risk is too high. Pre-synthesize the Polly audio for these scripts at deploy time; serve as static files.

12. Audio warmth calibration of Ruth Neural

Effort: 1–2 hours (configuration only). Ruth Neural supports SSML <prosody rate="93%"> — slowing speech 7% from default rate produces a demonstrably warmer perceived tone in testing. Current default rate may contribute to the "rushed" report. Also experiment with <amazon:breath duration="short"/> tags at paragraph transitions to simulate natural breathing.

---