JARVIS Voice Chat — Full Engineering Spec for TITAN

SCOUT Memo A085 | 2026-04-27 | Confidential — Harnoor only

---

Executive Summary

This document is the complete specification for a JARVIS-style voice interface for the TITAN dashboard. It covers every layer of the stack: speech-to-text options with verified 2026 pricing, TTS voice selection with real voice IDs, Three.js HUD architecture with verified open-source demos, the full voice flow from mic to visualizer, command tile design, cinema-mode integration, privacy architecture, and a tiered build plan. The goal is a voice experience that feels like Tony Stark activating JARVIS — calm, British, instantly authoritative, with an audio-reactive holographic background that pulses when TITAN speaks.

---

1. Real-Time Voice Architecture Options

1.1 Web Speech API (Browser-Native STT)

How it works. The browser's SpeechRecognition interface streams microphone audio to Google's speech servers (for Chrome/Edge) or Apple's servers (for Safari) and returns interim and final transcript events in JavaScript. No TITAN server round-trip is needed for STT: the browser manages the connection to the speech service itself.

Latency. Final transcript typically arrives 200–400ms after the user stops speaking. Interim results appear almost immediately during speech. End-to-end (mic → text visible on screen) is ~500ms in practice.

Cost. Free. Google does not charge for Web Speech API use in Chrome. Apple's implementation in Safari is also free.

Browser support (verified April 2026). Chrome and Edge support SpeechRecognition. Brave, although Chromium-based, blocks it because it relies on Google's proprietary speech recognition service, which Brave does not license. Firefox does NOT support it on any version. Safari supports it on macOS and iOS, but it requires a network connection (it routes to Apple servers). Since Harnoor uses Windows 11, Chrome is the primary target, with full support confirmed.

Setup complexity. Minimal. Three lines of JavaScript to instantiate, start, and listen for results. No API keys, no server changes.
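
A slightly fuller sketch of that setup, separating interim from final text. The function names `splitTranscripts` and `startListening` and the two callbacks are hypothetical; the event shape (`resultIndex`, `results[i].isFinal`, `results[i][0].transcript`) follows the standard SpeechRecognitionEvent.

```javascript
// Splits a SpeechRecognition "result" event into final and interim text.
function splitTranscripts(event) {
  let finalText = '';
  let interimText = '';
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    const text = result[0].transcript; // best alternative for this result
    if (result.isFinal) finalText += text;
    else interimText += text;
  }
  return { finalText, interimText };
}

// Browser wiring (hypothetical callbacks). Chrome exposes the constructor
// under the webkit prefix, so both names are checked.
function startListening(onInterim, onFinal) {
  const Ctor = globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition;
  if (!Ctor) return null; // no SpeechRecognition support (e.g. Firefox)
  const recognition = new Ctor();
  recognition.interimResults = true; // deliver partial text while the user speaks
  recognition.onresult = (event) => {
    const { finalText, interimText } = splitTranscripts(event);
    if (interimText) onInterim(interimText);
    if (finalText) onFinal(finalText);
  };
  recognition.start();
  return recognition;
}
```

Returning `null` when the constructor is missing lets the page fall back to a typed prompt instead of throwing.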

Key limitation. Accuracy degrades in noisy environments. No offline capability. Safari occasionally requires an explicit user gesture per session. Results can vary by accent.

---

1.2 OpenAI Realtime API

How it works. A single WebSocket connection to wss://api.openai.com/v1/realtime handles the entire pipeline: STT (audio in → text), LLM inference (text → response text), and TTS (text → audio out). Audio is sent as raw PCM chunks; audio response arrives as PCM chunks. There is no separate STT or TTS call — it is one unified stream.

Latency. Optimized for conversational latency. Response audio typically begins within 500–800ms of the user finishing speech. Supports interruption (barge-in): if the user speaks over TITAN, the model stops generating.

Cost (verified April 2026, source: openai.com/api/pricing). The gpt-4o Realtime model bills at $100 per 1M audio input tokens and $200 per 1M audio output tokens. The per-minute rates below correspond to roughly 600 input tokens and 1,200 output tokens per minute of audio:

Input: ~$0.06/minute of user speech
Output: ~$0.24/minute of TITAN speech

At 30 minutes/day of active conversation (split ~50/50 between user speaking and TITAN responding): ~($0.06 × 15) + ($0.24 × 15) = $0.90 + $3.60 = ~$4.50/day or ~$135/month.
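
The same arithmetic as a small calculator, useful for testing other usage patterns. This is a sketch: the function name and the `userShare` parameter are illustrative, and the two rates are simply the per-minute figures quoted above.

```javascript
// Per-minute rates quoted above (USD).
const INPUT_USD_PER_MIN = 0.06;  // user speech
const OUTPUT_USD_PER_MIN = 0.24; // TITAN speech

// Estimates Realtime API cost for a given number of active minutes per day.
// userShare is the fraction of those minutes spent on user speech.
function realtimeCost(minutesPerDay, userShare = 0.5, daysPerMonth = 30) {
  const userMin = minutesPerDay * userShare;
  const titanMin = minutesPerDay - userMin;
  const daily = userMin * INPUT_USD_PER_MIN + titanMin * OUTPUT_USD_PER_MIN;
  return { daily, monthly: daily * daysPerMonth };
}

const estimate = realtimeCost(30);
console.log(estimate.daily.toFixed(2), estimate.monthly.toFixed(2));
```

Because output is four times the price of input, the monthly figure is most sensitive to how long TITAN's replies are.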

Browser support. Chrome, Edge, Safari — any browser that supports WebSocket (universal). The raw WebSocket can be used from JavaScript; no special browser feature required.

Setup complexity. Medium. Requires managing WebSocket state, audio encoding/decoding (PCM16, 24kHz), session configuration, and event handling for response.audio.delta, input_audio_buffer.speech_started, etc. The OpenAI openai-realtime-api-beta npm package simplifies this.
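
The audio encoding step is the least obvious part of that list. A minimal sketch of converting Web Audio float samples to PCM16 and back is shown here; the function names are illustrative, and resampling to 24kHz is assumed to happen elsewhere.

```javascript
// Converts Web Audio float samples (range -1..1) to 16-bit signed PCM.
function floatToPcm16(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

// Inverse: 16-bit PCM received from the server back to floats for playback.
function pcm16ToFloat(int16Samples) {
  const out = new Float32Array(int16Samples.length);
  for (let i = 0; i < int16Samples.length; i++) {
    out[i] = int16Samples[i] / 0x8000;
  }
  return out;
}
```

The resulting `Int16Array` buffer is what would be base64-encoded and sent over the WebSocket in each audio chunk.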

Verdict. Best quality and lowest latency available in 2026, but at $135/month for Harnoor's usage pattern it is the most expensive option. Justified for Tier 3 ("Full JARVIS") but overkill for MVP.

---

1.3 ElevenLabs Streaming TTS

How it works. ElevenLabs provides a streaming TTS endpoint at POST /v1/text-to-speech/{voice_id}/stream that returns an audio/mpeg stream as it generates. The TITAN bridge sends the text response to ElevenLabs and pipes the audio stream back to the browser via chunked HTTP. The browser plays it via the Web Audio API.

Latency. ElevenLabs Flash and Turbo models ("eleven_flash_v2_5", "eleven_turbo_v2_5") deliver first audio chunk in ~75–150ms from the start of text input. Multilingual v2/v3 models have ~300–500ms TTFA (time to first audio). For JARVIS purposes, Turbo v2.5 is the sweet spot.

Cost (verified April 2026). Plans:

Free tier: 10,000 characters/month (~10 minutes of TTS)
Starter: $5/month, 30,000 characters
Creator: $22/month, 100,000 characters (~100 minutes at average speaking pace of ~1,000 chars/min)
Pro: $99/month, 500,000 characters (~500 minutes)

API pricing outside subscriptions: $0.06/1,000 chars (Flash/Turbo) or $0.12/1,000 chars (Multilingual v2/v3).

At 30 minutes/day of TITAN speech output (~1,000 chars/min × 30 min = 30,000 chars/day × 30 days = 900,000 chars/month), usage exceeds both the Creator tier (100,000 chars) and the Pro tier (500,000 chars). At that volume the cheapest listed option is pay-as-you-go at $0.06/1,000 chars: 900 × $0.06 = $54/month. The Pro plan at $99/month covers ~500 min/month, which is sufficient only if active voice usage averages about 15 min/day (~450 min/month). Starter at $5/month works for occasional use (MVP testing phase).
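
The plan arithmetic above can be sketched as a small helper. The plan data and the 1,000 chars/min pace are copied from this section; the function name and return shape are illustrative.

```javascript
// Plan data from the list above (characters per month, USD per month).
const PLANS = [
  { name: 'Free', usd: 0, chars: 10000 },
  { name: 'Starter', usd: 5, chars: 30000 },
  { name: 'Creator', usd: 22, chars: 100000 },
  { name: 'Pro', usd: 99, chars: 500000 },
];
const PAYG_USD_PER_1K = 0.06; // Flash/Turbo pay-as-you-go rate
const CHARS_PER_MIN = 1000;   // average speaking pace assumed in this memo

// Returns the monthly character volume, the smallest plan that covers it
// (null if none does), and the pay-as-you-go cost for comparison.
function estimateTts(minutesPerDay, daysPerMonth = 30) {
  const chars = minutesPerDay * CHARS_PER_MIN * daysPerMonth;
  const plan = PLANS.find(p => p.chars >= chars) || null;
  const paygUsd = (chars / 1000) * PAYG_USD_PER_1K;
  return { chars, plan: plan ? plan.name : null, paygUsd };
}

console.log(estimateTts(30)); // 900,000 chars: no listed plan covers it
console.log(estimateTts(15)); // 450,000 chars: fits the Pro plan
```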

Browser support. Universal — audio/mpeg streaming works in all modern browsers. The TITAN bridge handles the ElevenLabs API call; the browser just plays an audio URL/stream.

Setup complexity. Low to medium. ElevenLabs Python SDK (pip install elevenlabs) with a single streaming text-to-speech call (the method name has changed between SDK versions, so check the installed release). Bridge receives text, calls ElevenLabs, streams back to browser as audio blob or via WebSocket.

---

1.4 AWS Polly (Already Wired in TITAN)

How it works. Polly converts text to speech via synthesize_speech() (single-call, returns full audio file) or the new Bidirectional Streaming API (announced March 2026, GA). The Bidirectional Streaming API sends text word-by-word as an LLM generates and receives audio back in real-time over a single HTTP/2 connection.

Bidirectional Streaming (new, March 2026). AWS announced Bidirectional Streaming for Polly in March 2026. In benchmarks against 970-word prose: 39% faster than traditional single-call API, single API connection vs. 27 sequential calls. This makes Polly a real streaming option for the first time. Available in: US East (N. Virginia), US West (Oregon), Europe (Frankfurt), Europe (London), Asia Pacific (Singapore), Canada (Central).

Latency. With Bidirectional Streaming and Neural voices: first audio chunk arrives in ~200–400ms. Standard single-call Polly adds full synthesis time (~1–3 seconds for longer responses) but this is now avoidable.

Cost (verified April 2026, source: aws.amazon.com/polly/pricing). Neural TTS: $16/1M characters. Generative TTS (new Brian/Arthur voices): $30/1M characters. Standard voices: $4/1M characters.

At 30 min/day output (~900,000 chars/month):

Neural: $16 × 0.9 = $14.40/month
Generative: $30 × 0.9 = $27/month
Standard: $4 × 0.9 = $3.60/month (but standard sounds robotic)

Free tier: 1M characters/month for Neural voices for the first 12 months. If TITAN's AWS account is within the free tier period, Polly costs $0 for MVP.

Browser support. Polly returns audio/mpeg or audio/ogg. Any browser plays these formats.

Setup complexity. Already wired in TITAN (boto3 is already available). Adding /voice endpoint is minimal work — existing bridge can call Polly directly.

---

1.5 Anthropic Voice Mode (Research Finding — 2026)

Status (verified April 2026). Anthropic launched Voice Mode for Claude Code in March 2026, beginning rollout ~March 3–4, 2026. It is a push-to-talk interface using a specialized version of Claude 3.7 Sonnet optimized for low-latency audio. Key confirmed details:

Full-duplex conversation with barge-in support
Processes audio streams locally before sending compressed tokens to cloud
Response time mimics human conversation
Initial rollout to ~5% of Claude Code users, expanding throughout March 2026

Critical limitation. As of April 2026, Anthropic Voice Mode is a Claude Code feature — it is an interface within the Claude Code desktop/IDE tool, NOT a public real-time audio API that third parties can call. Anthropic has NOT released a public WebSocket or streaming audio API endpoint equivalent to OpenAI's Realtime API. Custom voice cloning and offline voice packs are on the 2026 roadmap but have not shipped.

Verdict for TITAN. Not available as a programmable API in April 2026. Cannot be integrated into TITAN's bridge. Monitor for a public audio API release. When/if released, it would be the highest-quality option given TITAN already calls Claude for LLM responses.

---

1.6 WebRTC + Whisper.cpp via WebAssembly

How it works. whisper.cpp (the C++ port of OpenAI's Whisper model, from the ggml project at ggml.ai) compiles to WebAssembly and runs entirely in the browser. The user's browser downloads a WASM binary + model file (tiny: ~75MB, base: ~142MB), records mic audio via getUserMedia, and sends audio chunks to the WASM module for transcription. No audio ever leaves the browser.

Real-time streaming demo. The official stream.wasm example at https://ggml.ai/whisper.cpp/stream.wasm/ demonstrates real-time transcription in the browser. A separate Whisper Web app at https://whisperweb.app/ runs fully in-browser with the WASM build.

Latency. For the tiny model: ~2–3x faster than real time on a modern CPU (i.e., it transcribes 30s of speech in ~10–15s). For the base model: ~3–5x real-time. Because each audio window has to be buffered and then processed, the transcript trails speech by 5–15 seconds, so Whisper.cpp WASM is NOT suitable for true real-time conversation. The stream.wasm example mitigates this by running on short rolling windows, but latency remains noticeable.

Browser support. Requires WASM SIMD instructions. Chrome 91+ and Edge 91+ support this. Firefox 89+ supports it. Safari on macOS 15+ supports it. Older browsers and mobile Safari may not.

Cost. Zero marginal cost — runs client-side with no API calls.

Setup complexity. High. Requires serving a large WASM binary (75–142MB initial download), handling SharedArrayBuffer (requires COOP/COEP security headers), managing the WASM thread pool, and streaming audio buffers. This is a significant engineering effort.

Verdict. Privacy-maximalist option for sensitive environments, but latency makes it feel sluggish compared to the Web Speech API. Better suited for a "privacy mode" option in Tier 3 than as the primary STT path.

---

1.7 Architecture Recommendation

For MVP (Tier 1): Web Speech API (STT) + AWS Polly Brian Neural TTS. Zero marginal cost for STT; Polly likely within free tier if account is fresh. Total estimated monthly cost: $0–$14.40.

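For JARVIS look (Tier 2): Web Speech API (STT, keep it; it's good enough) + ElevenLabs Daniel Turbo v2.5 (TTS). Superior voice quality. ElevenLabs Starter ($5/mo) covers testing; Creator ($22/mo, ~100 minutes/month) covers light daily use of roughly 3 minutes/day. The 30 min/day pattern costed in Section 1.3 needs pay-as-you-go (~$54/mo) instead.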

For Full JARVIS (Tier 3): OpenAI Realtime API (unified STT + LLM + TTS in one WebSocket). Premium latency, premium cost (~$135/month at current usage). Or wait for Anthropic to release a public real-time audio API, which would allow native Claude integration at potentially lower cost.

---

2. JARVIS-UI References — Verified Demos and Repos

2.1 GitHub Repositories (MIT/Apache — Borrowable)

1. harsh-raj00/my-jarvis

URL: https://github.com/harsh-raj00/my-jarvis

Live demo: https://jarvis-frontend-uj30.onrender.com

License: MIT

Stack: React 18, Three.js with React Three Fiber, Framer Motion, Vite, Tailwind CSS, Web Speech API, Tone.js, ElevenLabs TTS, FastAPI backend, WebSocket

Key features: 8,000-particle glowing sphere, orbital rings with spiral animations, custom GLSL shaders, glassmorphic design, audio-reactive scaling, Siri-style animated voice popup with 5 morphing blob layers, color-coded states (listening/processing/speaking)

Verdict: The closest existing open-source implementation to what Harnoor wants. Directly borrowable under MIT.

2. tgcnzn/Interactive-Particles-Music-Visualizer

URL: https://github.com/tgcnzn/Interactive-Particles-Music-Visualizer

License: MIT (confirmed)

Tutorial: https://tympanus.net/codrops/2023/12/19/creating-audio-reactive-visuals-with-dynamic-particles-in-three-js/

Stack: Three.js, Web Audio API, GLSL shaders

Key features: Curl noise in vertex shader for organic particle movement, audio frequency band analysis (low/mid/high), BPM detection, procedural geometry, shader-based animation driven by audio uniforms

Verdict: Best-in-class particle system for audio reactivity. The curl noise technique is exactly what makes JARVIS backgrounds feel alive. MIT licensed, well-documented.

3. dcyoung/r3f-audio-visualizer

URL: https://github.com/dcyoung/r3f-audio-visualizer

Live demo: https://dcyoung.github.io/r3f-audio-visualizer/

Stack: React Three Fiber, Three.js, Web Audio API

Key features: Multiple visualizer modes, mic input support, real-time AnalyserNode FFT data driving Three.js meshes

Verdict: React Three Fiber-based, which matches TITAN's React frontend. Multiple modes let Harnoor pick the aesthetic.

4. ektogamat/threejs-vanilla-holographic-material

URL: https://github.com/ektogamat/threejs-vanilla-holographic-material

License: MIT

Key features: Shader-based holographic material with scanlines, Fresnel rim effects, color aberration, optional bloom post-processing. Works with any Three.js mesh — apply to orbital ring geometry for instant JARVIS look.

Verdict: Drop-in holographic shader for the orbital rings. MIT, minimal setup.

5. ektogamat/threejs-holographic-material

URL: https://github.com/ektogamat/threejs-holographic-material

License: MIT

Stack: React Three Fiber version of the above

Key features: Same holographic effects as above, packaged as a React component. npm install @ektogamat/threejs-holographic-material

Verdict: If TITAN's voice page is React-based, use this over the vanilla version.

6. tpowellmeto/HolographicEffect

URL: https://github.com/tpowellmeto/HolographicEffect

Key features: Holographic effect renderer for Three.js with scanline and interference patterns

Verdict: Alternative holographic shader. More raw Three.js, less opinionated than ektogamat's.

7. steffenpharai/Jarvis

URL: https://github.com/steffenpharai/Jarvis

Key features: Fully offline Iron Man J.A.R.V.I.S. on Jetson Orin Nano — voice + vision + 3D holograms + health monitoring. Iron Man-style AR tracking with real-time annotations, vitals dashboard via WebSocket. No cloud. No API keys.

Verdict: Privacy-first reference architecture. Good model for Tier 3 offline mode.

8. Humprt/particula

URL: https://github.com/Humprt/particula

Key features: Five independent particle spheres reacting to different frequency bands (bass, mid-low, mid, mid-high, treble), noise and turbulence dynamics driven by audio

Verdict: Multi-sphere audio reactivity. Good for showing TITAN "thinking" with different frequency bands driving different visual elements.

2.2 Tutorial References

Codrops — Audio-Reactive Particles in Three.js (December 2023)

URL: https://tympanus.net/codrops/2023/12/19/creating-audio-reactive-visuals-with-dynamic-particles-in-three-js/

The canonical tutorial for building audio-reactive Three.js scenes. Covers curl noise, frequency analysis, shader uniforms. Code available at tgcnzn/Interactive-Particles-Music-Visualizer above.

Codrops — 3D Audio Visualizer with Three.js, GSAP & Web Audio API (June 2025)

URL: https://tympanus.net/codrops/2025/06/18/coding-a-3d-audio-visualizer-with-three-js-gsap-web-audio-api/

More recent tutorial incorporating GSAP for smooth transitions. Useful for Tier 2 polish.

Three.js Official — Audio Demo

URL: https://threejsdemos.com/demos/audio/particles

Title: "Audio Reactive Particles in Three.js — Three.js Demos"

Direct live demo of audio-reactive particles with source code.

Three.js Journey — Hologram Shader

URL: https://threejs-journey.com/lessons/hologram-shader

Paid course, but the lesson overview is publicly visible. It shows how to build JARVIS-style scanline effects using ShaderMaterial with repeating gradients, the same technique used in the ektogamat holographic material above.

2.3 Visual Patterns Available

| Pattern | Source Repo | Implementation Complexity |
|---|---|---|
| Orbital rings (holographic) | ektogamat/threejs-holographic-material | Low (npm install) |
| Particle cloud (audio-reactive) | tgcnzn/Interactive-Particles-Music-Visualizer | Medium (shader uniforms) |
| Multi-sphere frequency orbs | Humprt/particula | Medium |
| Central glowing orb + rings | harsh-raj00/my-jarvis | Low (fork and adapt) |
| Scanline / Fresnel HUD | tpowellmeto/HolographicEffect | Medium |
| Hex grid background | Custom GLSL, no canonical repo found | High |
| Voice-driven equalizer bars | MDN AnalyserNode + Canvas2D | Low (native APIs) |

---

3. 3D Face / HUD Architecture

3.1 Scene Layout

The /voice page canvas is full-viewport. Four layers (three WebGL, one HTML overlay):


Layer 0 (background): particle field — 8,000–12,000 points
Layer 1 (mid): 3–5 orbital rings at different inclinations
Layer 2 (foreground): central orb / abstract face
Layer 3 (UI overlay): command tiles, transcript bar, mic indicator (HTML/CSS, not WebGL)

3.2 Background — Particle System

Implementation. THREE.BufferGeometry with a THREE.Points primitive. Particles positioned on a sphere surface with random jitter. Each frame, a vertex shader receives uAmplitude (float) and uTime (float) uniforms. Amplitude drives radial displacement of each particle from its rest position. Curl noise (from tgcnzn repo above) adds organic turbulence.


// vertex shader (simplified)
// snoise(vec3) is not built into GLSL; include a simplex-noise implementation before main()
uniform float uAmplitude;
uniform float uTime;
attribute vec3 aBasePosition;

void main() {
  vec3 pos = aBasePosition;
  float noise = snoise(pos * 0.8 + uTime * 0.3);
  pos += normalize(pos) * (noise * 0.15 + uAmplitude * 0.4);
  gl_Position = projectionMatrix * modelViewMatrix * vec4(pos, 1.0);
  gl_PointSize = 1.5 + uAmplitude * 3.0;
}
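
The shader reads each particle's rest position from `aBasePosition`. One way to generate that data ("on a sphere surface with random jitter") is sketched below; the function name, default radius, and jitter amount are assumptions for illustration.

```javascript
// Builds rest positions: `count` points spread uniformly over a sphere of
// `radius`, each pushed in or out by up to `jitter` along its normal.
// `rng` defaults to Math.random; pass a seeded function for repeatable layouts.
function makeBasePositions(count, radius = 1, jitter = 0.05, rng = Math.random) {
  const positions = new Float32Array(count * 3);
  for (let i = 0; i < count; i++) {
    const z = 2 * rng() - 1;            // uniform height in [-1, 1)
    const theta = 2 * Math.PI * rng();  // uniform angle around the z axis
    const ring = Math.sqrt(1 - z * z);  // circle radius at that height
    const r = radius + (2 * rng() - 1) * jitter;
    positions[3 * i] = r * ring * Math.cos(theta);
    positions[3 * i + 1] = r * ring * Math.sin(theta);
    positions[3 * i + 2] = r * z;
  }
  return positions;
}
```

Sampling height and angle uniformly avoids the clustering at the poles that naive latitude/longitude sampling produces. The array can be wrapped in a `THREE.BufferAttribute` with item size 3; Three.js generally expects a `position` attribute on the geometry as well, so the same data can be set under both names.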

Audio connection. Web Audio API AnalyserNode connected to the audio element output:


const ctx = new AudioContext();
const analyser = ctx.createAnalyser();
analyser.fftSize = 512;
const source = ctx.createMediaElementSource(audioElement);
source.connect(analyser);
analyser.connect(ctx.destination);
const dataArray = new Uint8Array(analyser.frequencyBinCount);
// Each frame: analyser.getByteFrequencyData(dataArray)
// amplitude = average of dataArray / 255
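
The two comment lines above can be expanded into a per-frame reduction. This is a sketch: the function names are illustrative and the band boundaries (first eighth, next three eighths, top half of the bins) are an assumption, not values from this memo.

```javascript
// Averages FFT bytes in [start, end) and scales the result to 0..1.
function bandAverage(data, start, end) {
  let sum = 0;
  for (let i = start; i < end; i++) sum += data[i];
  return end > start ? sum / (end - start) / 255 : 0;
}

// Reduces AnalyserNode byte data to an overall amplitude plus three bands.
function analyzeSpectrum(data) {
  const n = data.length;
  const lowEnd = Math.floor(n / 8); // assumed bass cutoff
  const midEnd = Math.floor(n / 2); // assumed mid/high cutoff
  return {
    amplitude: bandAverage(data, 0, n),
    low: bandAverage(data, 0, lowEnd),
    mid: bandAverage(data, lowEnd, midEnd),
    high: bandAverage(data, midEnd, n),
  };
}
```

Each frame, after `analyser.getByteFrequencyData(dataArray)`, the `amplitude` value would feed `uAmplitude` and the band values could drive ring scale or glow.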

Colors. Base particle color: #00e5ff (TITAN cyan). Peak amplitude: interpolate to #ff6b00 (TITAN orange). Idle: #7b5ea7 (violet, low opacity).

3.3 Orbital Rings

Implementation. Three THREE.TorusGeometry meshes at inclinations of 0°, 35°, and 65° relative to camera. Apply ektogamat/threejs-vanilla-holographic-material for holographic scanline effect. Each ring rotates at a different speed (0.002, 0.004, -0.003 radians/frame) for visual depth.

Data display. Three rings carry data labels using THREE.Sprite billboards:

Ring 1: Open asks count (e.g., "ASKS: 7")
Ring 2: Agent queue depth (e.g., "QUEUE: 3")
Ring 3: Last TITAN reply timestamp (e.g., "LAST: 4m ago")

Labels update every 30 seconds from a /api/health poll.

Audio reactivity. Ring scale pulses with bass frequency band amplitude. Ring glow intensity (material fresnelOpacity) scales with overall volume.

3.4 Central Orb

The center is an IcosahedronGeometry with detail: 3 (320 faces; current Three.js splits each of the 20 base faces into (detail + 1)² triangles), not a sphere. With flatShading enabled on the material, this gives a natural faceted JARVIS look. Material: MeshPhongMaterial with emissive: #00e5ff, emissiveIntensity mapped to audio amplitude. A PointLight at center provides the glow that illuminates the orbital rings.

For Tier 3, this can be replaced with a stylized abstract face — a flattened IcosahedronGeometry with displacement mapping that morphs to show "listening" vs. "speaking" states.

3.5 Color Palette (from TITAN aesthetic)

| State | Primary | Dark shade | Mid shade |
|---|---|---|---|
| Listening | #00e5ff (cyan) | #003547 | #00b4cc |
| Processing | #ff6b00 (orange) | #3d1a00 | #cc5500 |
| Speaking | #00ff88 (green) | #003d1f | #00cc6a |

State transitions use THREE.MathUtils.lerp() on all color uniforms over 500ms.
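
A dependency-free sketch of that transition, so the timing logic can be checked without a renderer. The helper names are illustrative; `lerp` here is the same linear interpolation that `THREE.MathUtils.lerp` performs.

```javascript
// Parses "#rrggbb" into [r, g, b] with each channel in 0..1.
function hexToRgb(hex) {
  const n = parseInt(hex.slice(1), 16);
  return [(n >> 16) & 255, (n >> 8) & 255, n & 255].map(c => c / 255);
}

// Linear interpolation between x and y for t in 0..1.
const lerp = (x, y, t) => (1 - t) * x + t * y;

const TRANSITION_MS = 500;

// Color at `elapsedMs` into a transition between two state colors.
function transitionColor(fromHex, toHex, elapsedMs) {
  const t = Math.min(1, Math.max(0, elapsedMs / TRANSITION_MS));
  const from = hexToRgb(fromHex);
  const to = hexToRgb(toHex);
  return from.map((c, i) => lerp(c, to[i], t));
}
```

For example, `transitionColor('#00e5ff', '#ff6b00', 250)` gives the halfway color between Listening and Processing; the result would be written into the shader's color uniform each frame until 500ms have elapsed.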

3.6 Performance Budget

Target: 60fps on Harnoor's Windows 11 PC (Chrome).

Particle count: 10,000 (reduce to 5,000 if frameTime > 16ms)
Orbital rings: 3 toruses, 64 segments each = minimal geometry
Post-processing: UnrealBloomPass at resolution scale 0.5 (the quality loss at half resolution is barely visible, and the pass is much cheaper)
Shadows: disabled
renderer.setPixelRatio(Math.min(window.devicePixelRatio, 2)) — caps at 2x on retina displays

Mobile/degraded mode: if navigator.hardwareConcurrency < 4, reduce particles to 2,000, disable bloom, use MeshBasicMaterial instead of MeshPhongMaterial. The voice function still works; only the visuals degrade.

The requestAnimationFrame loop should track performance.now() and, when the scene cannot sustain 60fps, throttle rendering to a steady 30fps (render only when at least 33ms have passed since the last rendered frame) rather than letting the frame rate drop unevenly.
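
The budget rules above can be collected into two small helpers. This is a sketch: the function names are illustrative, and treating the 30fps hold as a simple time-based throttle is one interpretation of the paragraph above.

```javascript
// Chooses visual settings from the performance budget.
// cores: navigator.hardwareConcurrency; avgFrameMs: recent average frame time.
function pickQuality(cores, avgFrameMs) {
  if (cores < 4) {
    // Degraded mode: fewer particles, no bloom, cheapest material.
    return { particles: 2000, bloom: false, material: 'MeshBasicMaterial' };
  }
  const particles = avgFrameMs > 16 ? 5000 : 10000;
  return { particles, bloom: true, material: 'MeshPhongMaterial' };
}

// True when enough time has passed to draw again under a frame-rate cap.
// The default interval of 33ms corresponds to roughly 30fps.
function shouldRender(nowMs, lastRenderMs, minIntervalMs = 33) {
  return nowMs - lastRenderMs >= minIntervalMs;
}
```

Inside the animation loop, `shouldRender(performance.now(), lastRenderMs)` would gate the call to the renderer once throttling is active, while audio analysis can keep running every frame.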

---

4. Voice Flow Architecture

4.1 Full Sequence


USER SPEAKS
    │
    ▼
[Browser Mic]
    │  getUserMedia() → MediaStream
    ▼
[Web Speech API SpeechRecognition]
    │  onresult: { transcript, isFinal }
    │  Show interim transcript in real-time in UI bar
    ▼
[Final transcript confirmed]
    │
    ▼
[POST /api/voice  {prompt: transcript, session_id: uuid}]
    │  HTTP to titan_bridge.py
    ▼
[titan_bridge.py — /api/voice route]
    │
    ├─ Fast path: is this a "what shipped today" / "quick pulse" tile command?
    │      └─ Yes → read from F:/TITAN/state/* directly, no LLM call
    │      └─ No → forward to Claude API (existing claude_client logic)
    │
    │  Receive LLM text response
    │
    ├─ Call AWS Polly / ElevenLabs with response text
    │  (Tier 1: Polly Neural Brian, streaming synthesize_speech)
    │  (Tier 2: ElevenLabs Daniel Turbo via /v1/text-to-speech/{voice_id}/stream)
    │
    │  Return JSON {text: "...", audio_url: "/api/voice/audio/<id>"}
    ▼
[Browser receives response JSON]
    │
    ├─ Display text in transcript bar
    │
    └─ Fetch audio_url → HTMLAudioElement
            │
            ▼
        [Web Audio API]
            │  createMediaElementSource(audioElement)
            │  → analyser → destination
            ▼
        [AnalyserNode → getByteFrequencyData each frame]
            │
            ▼
        [Three.js uniforms updated: uAmplitude, uFreqBands]
            │
            ▼
        [Visualizer reacts — particles expand, rings glow, orb pulses]

4.2 WebSocket vs REST Decision

For Tier 1 and Tier 2: REST (HTTP POST) is sufficient. The flow is turn-based: user speaks → POST → response audio. WebSocket adds complexity without meaningful benefit at this tier.

For Tier 3 (streaming responses): Add a WebSocket at /ws/voice. The bridge streams text tokens back as they arrive from Claude. The browser builds text progressively in the transcript bar and can begin TTS on the first sentence before the full response arrives.


# Tier 3 WebSocket sketch (titan_bridge.py addition)
@app.websocket_route("/ws/voice")
async def voice_ws(websocket):
    await websocket.accept()
    while True:
        data = await websocket.receive_json()
        full_text = ""  # accumulate tokens so the complete reply can be synthesized
        async for token in claude_stream(data["prompt"]):
            full_text += token
            await websocket.send_json({"type": "token", "text": token})
        # After full response, synthesize audio
        audio_b64 = await polly_synthesize(full_text)
        await websocket.send_json({"type": "audio", "data": audio_b64})
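
On the browser side, starting TTS on the first sentence requires grouping the streamed tokens into sentences. A minimal sketch follows; the class name is hypothetical and the boundary rule (sentence punctuation followed by whitespace) is a simplification that will also split after abbreviations such as "Dr.".

```javascript
// Collects streamed tokens and releases text one complete sentence at a time.
class SentenceBuffer {
  constructor() {
    this.pending = '';
  }

  // Adds a token; returns any sentences it completed (possibly none).
  push(token) {
    this.pending += token;
    const sentences = [];
    const boundary = /[.!?]+\s+/g; // punctuation followed by whitespace
    let start = 0;
    let match;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Returns whatever text remains once the stream has ended.
  flush() {
    const rest = this.pending.trim();
    this.pending = '';
    return rest ? [rest] : [];
  }
}
```

Each sentence returned by `push` could be sent for synthesis immediately, with `flush` called when the final token message arrives. Requiring whitespace after the punctuation keeps decimals such as "3.5" intact.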

4.3 Streaming Audio Playback

For ElevenLabs Tier 2, use chunked streaming with the MediaSource API:


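const mediaSource = new MediaSource();
audioElement.src = URL.createObjectURL(mediaSource);
mediaSource.addEventListener('sourceopen', async () => {
  // Support for 'audio/mpeg' in MSE varies by browser;
  // MediaSource.isTypeSupported('audio/mpeg') can confirm it up front.
  const sourceBuffer = mediaSource.addSourceBuffer('audio/mpeg');
  const response = await fetch('/api/voice/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  });
  const reader = response.body.getReader();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // appendBuffer is asynchronous: wait for 'updateend' before the next chunk
    await new Promise(resolve => {
      sourceBuffer.addEventListener('updateend', resolve, { once: true });
      sourceBuffer.appendBuffer(value);
    });
  }
  // Signal that no more audio is coming so playback can finish cleanly
  mediaSource.endOfStream();
}, { once: true });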

This allows the browser to start playing TITAN's voice before the full synthesis is complete, cutting perceived latency by 60–70%.

---

5. Command Tile UI

5.1 Layout

A horizontal strip pinned to the bottom of the /voice cinema view. 7 tiles, the last of which (T7, "ask anything") doubles as the mic button.


┌─────────────────────────────────────────────────────────────────┐
│  [talk to me] [what shipped today] [what's blocked]             │
│  [next move] [read latest memo] [quick pulse] [ask anything 🎙] │
└─────────────────────────────────────────────────────────────────┘

Each tile is a <button> with:

Glassmorphic background: background: rgba(0, 229, 255, 0.08); backdrop-filter: blur(8px);
Border: 1px solid rgba(0, 229, 255, 0.3)
Hover: border brightens to rgba(0, 229, 255, 0.8), subtle glow
Active: scale down to 0.97, immediate visual feedback

5.2 Tile Definitions

| Tile | Label | Action |
|---|---|---|
| T1 | talk to me | Triggers auto-briefing prompt (see Section 12) |
| T2 | what shipped today | Queries ask ledger for status: "shipped" + today's date; reads aloud |
| T3 | what's blocked | Queries ask ledger for status: "blocked" or priority: "S" items |
| T4 | next move | Asks Claude: "Given current TITAN state, what is the single most important next action?" |
| T5 | read latest memo | Reads the most recently modified file in F:/TITAN/plans/advisors/ |
| T6 | quick pulse | Reads F:/TITAN/state/ summary: queue depth, open asks count, last ship |
| T7 | ask anything | Opens free-form mic; SpeechRecognition listens and sends whatever user says |

5.3 Tile State Machine

Each tile goes through: idle → active (user clicked) → processing → speaking → idle. CSS class changes drive animation. The active tile glows orange during processing, transitions to green during TITAN speaking output.
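
The state sequence above maps naturally onto a transition table. In this sketch the event names (`click`, `requestSent`, `audioStarted`, `audioEnded`) are hypothetical, and the `error` and `stop` shortcuts back to idle are assumptions added so a tile cannot get stuck.

```javascript
// Allowed transitions: idle -> active -> processing -> speaking -> idle.
const NEXT_STATE = {
  idle: { click: 'active' },
  active: { requestSent: 'processing' },
  processing: { audioStarted: 'speaking', error: 'idle' },
  speaking: { audioEnded: 'idle', stop: 'idle' },
};

// Returns the next state, ignoring events that do not apply to the current one.
function nextTileState(state, event) {
  const next = NEXT_STATE[state] && NEXT_STATE[state][event];
  return next || state;
}

// CSS class string for a tile in the given state (class naming is illustrative).
const tileClass = state => `tile tile--${state}`;
```

Ignoring unexpected events means a double click during processing leaves the tile in `processing` instead of restarting the request.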

5.4 Keyboard Shortcuts

T — toggle tiles visibility (for distraction-free mode)
Space — toggle mic (equivalent to T7)
Escape — stop current speech playback
1–6 — trigger tiles T1–T6 directly
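
These shortcuts can be expressed as a pure mapping from `KeyboardEvent.key` to an action, which keeps the listener itself trivial. The action object shape and names below are hypothetical.

```javascript
// Maps a KeyboardEvent.key value to a voice-page action, or null if unbound.
function actionForKey(key) {
  if (key === 't' || key === 'T') return { type: 'toggleTiles' };
  if (key === ' ') return { type: 'toggleMic' };        // Space bar
  if (key === 'Escape') return { type: 'stopSpeech' };
  if (/^[1-6]$/.test(key)) return { type: 'triggerTile', tile: `T${key}` };
  return null;
}
```

A `keydown` listener would call `actionForKey(event.key)` and dispatch the result. It should probably skip events whose target is a text input so typing is not intercepted.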

---

6. Cinema-Mode Popup Integration

6.1 Design Decision: /voice IS Cinema Mode

Rather than treating /voice as a page inside the TITAN dashboard, it should launch as a full-viewport takeover — the same philosophy as the existing cinema-mode FAB pattern. When the user navigates to /voice, the Three.js canvas fills the entire screen. The TITAN dashboard disappears. The only UI is:

The particle/orb visualizer (full canvas)
The orbital data rings
The command tile strip (bottom)
A transcript bar (bottom, above tiles)
A close/minimize button (top-right, small)

This matches Harnoor's request for a JARVIS feel. Tony Stark doesn't interact with JARVIS through a sub-panel — it's the entire room.

6.2 Integration with Existing Bridge

Add a /voice GET route to titan_bridge.py that serves the JARVIS HTML page:


elif path == "/voice":
    return self._serve_static("jarvis/index.html")

The JARVIS page is self-contained in F:/TITAN/static/jarvis/. It loads Three.js from CDN (or bundled), the command tiles, and connects to /api/voice POST.

6.3 Navigation

From the main TITAN dashboard, add a "JARVIS" button to the navigation strip. Clicking it navigates to /voice in the same tab. The /voice page has an × or minimize button that returns to /. No iframe — standalone page.

---

7. Voice Persona — JARVIS Tone

7.1 Personality Profile

JARVIS is: calm, precise, warm without being subservient, occasionally dry-witted, always useful. He never rambles. He leads with the answer. He uses Harnoor's lion-tiger frame — sovereign language, not corporate language.

Core traits:

Calm authority: never excited, never panicked
Precision: numbers, dates, names — always specific
Brevity: one sentence per thought where possible
Warmth: occasional dry humor ("Three items in the blocked queue, two of which have been there since Tuesday — familiar territory.")
Proactive framing: ends responses with a forward motion ("Shall I draft that?", "Whenever you're ready.")

7.2 TTS Voice Recommendations

AWS Polly — Recommended for Tier 1:

Brian (en-GB, Neural and Generative) — British English male. Brian is available as both a Neural TTS voice and as a new Generative voice (launched March 2026 with Bidirectional Streaming expansion). The Generative version is described as "emotionally engaged, assertive, and highly colloquial" — closest to JARVIS. Voice ID: Brian. Engine: neural for cost efficiency, generative for max quality (available in us-east-1, eu-west-2, us-west-2, ca-central-1, ap-southeast-1, eu-central-1).

Arthur (en-GB, Neural only) — also British male, modeled on the US English Matthew voice with British vocal characteristics transferred via deep learning. Slightly more formal than Brian. Voice ID: Arthur. Engine: neural.

Recommendation: Brian with generative engine if account is in supported region; fall back to neural if not. Brian sounds more natural and conversational — JARVIS quality.

ElevenLabs — Recommended for Tier 2:

Daniel (voice ID: onwK4e9ZLuTAKqWW03F9) — British male, "Steady Broadcaster" style. Deep, authoritative voice. Described as a British News Presenter. Available on all plans including free tier. This is the closest existing ElevenLabs voice to JARVIS.

Antoni (voice ID: ErXwobaYiN019PkySvjV) — American male, general-purpose. 191 WPM, 125Hz pitch. Well-rounded narrator. Not British — less JARVIS-like but highly clear. Better choice if Daniel sounds too BBC-anchor in context.

Recommendation: Daniel for JARVIS aesthetic. Test both in context — Daniel can sound stiff at slower speaking rates; reduce stability to 0.4 and similarity_boost to 0.75 in ElevenLabs settings for more natural variation.

ElevenLabs API settings for JARVIS persona:


{
  "voice_id": "onwK4e9ZLuTAKqWW03F9",
  "model_id": "eleven_turbo_v2_5",
  "voice_settings": {
    "stability": 0.40,
    "similarity_boost": 0.75,
    "style": 0.35,
    "use_speaker_boost": true
  }
}

7.3 System Prompt Addendum for JARVIS Persona

When routing voice requests through Claude, prepend this persona context:


You are JARVIS, TITAN's voice interface. You speak with calm British authority — precise, brief, occasionally dry-humored. Never use bullet points or markdown in voice responses; speak in natural sentences only. Lead with the answer. Maximum 3 sentences per response unless more detail is explicitly requested. Address Harnoor directly but without flattery. You operate TITAN, which is Harnoor's AI operating system. When referencing data, be specific — use numbers and dates.

---

8. Privacy and Security

8.1 Mic Access

getUserMedia — the browser requests microphone access via the native permission dialog. Audio is only captured when the user explicitly activates the mic (click on mic button or press Space). No background listening except in Tier 3 wake-word mode (which requires explicit opt-in).

Hard mute. The mute button calls mediaStream.getTracks().forEach(t => t.stop()) — this physically disconnects the browser's access to the microphone, not just a JavaScript mute flag. The mic indicator light on the device goes off. This is not a UI-only hide.


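function hardMute() {
  // Stop recognition even when no stream is tracked, so nothing keeps listening
  recognition.stop();
  if (activeStream) {
    // Stopping every track releases the microphone, not just a mute flag
    activeStream.getTracks().forEach(track => track.stop());
    activeStream = null;
  }
}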

8.2 Audio Data Flow

User speech → Web Speech API → Google's STT servers (browser-native; no TITAN server involvement). This is the one audio path that leaves the local machine, same as any Chrome speech recognition.
Text transcript → TITAN bridge (/api/voice POST) → Claude API. Text only, never audio.
TTS response → TITAN bridge calls Polly/ElevenLabs → returns audio to browser. Audio bytes flow through TITAN bridge; they are NOT stored unless explicitly requested.
No audio recording or logging unless the user explicitly requests a session transcript.

8.3 API Credentials

All API keys (AWS, ElevenLabs, OpenAI) are held server-side in F:/TITAN/state/ or environment variables. The browser never receives or sends API credentials. The bridge handles all external API calls. The /api/voice endpoint is protected by the existing bridge bearer token (R0220 — titan_token cookie).
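A minimal sketch of the server-side handling described above, assuming secrets come from environment variables and the bearer token is compared in constant time. The helper names and any variable names passed to them are hypothetical; the existing bridge auth (R0220) may work differently.

```python
import hmac
import os


def load_secret(name: str) -> str:
    """Read a secret from the environment and fail fast if it is missing."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value


def token_matches(presented: str, expected: str) -> bool:
    """Compare tokens in constant time so timing does not leak a prefix match."""
    return hmac.compare_digest(presented.encode("utf-8"),
                               expected.encode("utf-8"))
```

Failing at startup when a key is absent is a design choice: it surfaces a misconfiguration before the first voice request rather than during it.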

8.4 Session Privacy

Voice session transcripts are held in-memory only during the session (in titan_bridge.py)
Not written to harnoor-asks.jsonl unless the user says "log this" or "save this ask"
TTS audio files written to temp directory are deleted after 60 seconds
WebSocket sessions are isolated; no cross-session data leakage

8.5 Tier 3 Wake Word Considerations

Adding a "hey TITAN" wake word in Tier 3 requires continuous mic access. It should be:

Opt-in only via a separate toggle with prominent visual indicator
Implemented using the Web Speech API's continuous: true mode, not a background service worker
Automatically disabled when the /voice page is not the active tab
Never active when the browser window is minimized

---

9. Implementation Path — Tiered by Feasibility

Tier 1 — MVP (Estimated: 1 day, ~8 hours)

Goal. A working voice conversation with TITAN. No 3D yet. Prove the loop end-to-end.

Components:

New route GET /voice in titan_bridge.py → serves F:/TITAN/static/jarvis/index.html
New route POST /api/voice in titan_bridge.py → accepts {prompt}, calls Claude, calls Polly Brian Neural, returns {text, audio_url}
New route GET /api/voice/audio/<id> → serves the synthesized audio file
F:/TITAN/static/jarvis/index.html — dark full-viewport page
F:/TITAN/static/jarvis/voice.js — SpeechRecognition setup, tile click handlers, audio playback
F:/TITAN/static/jarvis/style.css — TITAN dark aesthetic, glassmorphic tiles, transcript bar
F:/TITAN/state/voice-config.json — TTS voice ID, model, tile labels, persona system prompt

What works at end of Tier 1:

User opens /voice in browser
Clicks any of 6 command tiles or the mic button
Hears TITAN's response in Brian's voice
Sees text transcript of both sides
Mute button hard-disconnects mic

Canvas at Tier 1. Simple CSS animated background: radial gradient pulsing via CSS keyframes, no WebGL. Particle system is deferred.

---

Tier 2 — JARVIS Look (Estimated: 1 week)

Goal. Add Three.js visualizer + ElevenLabs voice + cinema fullscreen aesthetic.

New work:

F:/TITAN/static/jarvis/scene.js — Three.js scene: particle field, 3 orbital rings, central orb
Audio bridge: connect HTML audio element to Web Audio API AnalyserNode → feed amplitude to Three.js uniforms each frame
ElevenLabs Daniel voice: update /api/voice to call ElevenLabs API instead of Polly (or make it configurable in voice-config.json)
ElevenLabs streaming: serve audio as chunked response; browser uses MediaSource API for low-latency playback
State machine: listening/processing/speaking colors propagate through Three.js materials and CSS classes simultaneously
Cinema fullscreen: remove all browser chrome; the /voice page is the full viewport

Dependency additions:


npm install three @ektogamat/threejs-holographic-material

Three.js can also be loaded from CDN: https://cdn.jsdelivr.net/npm/three@0.165.0/build/three.module.js

Performance target: 60fps on desktop, 30fps mobile graceful degradation.

---

Tier 3 — Full JARVIS (Estimated: 2 weeks)

Goal. Streaming WebSocket, abstract 3D face, wake word, multi-modal.

New work:

WebSocket endpoint /ws/voice in titan_bridge.py — streaming tokens from Claude back to browser as they arrive
TTS sentence buffering: begin synthesizing when first sentence delimiter reached (., ?, !), don't wait for full response
Abstract 3D face: IcosahedronGeometry with displacement map morphing between listening/speaking expressions using THREE.AnimationMixer or manual morph targets
Wake word: SpeechRecognition with continuous: true + keyword filter ("hey TITAN") → triggers listening mode
Optional: switch STT to OpenAI Realtime API for best-in-class latency and unified pipeline
Multi-modal context: /api/voice POST can optionally include a screenshot of the current page (navigator.mediaDevices.getDisplayMedia) so TITAN can comment on what's on screen

---

10. Cost Analysis at Expected Usage

Assumptions: 30 minutes/day active voice interaction. 50% user speaking, 50% TITAN responding. Average speaking pace: ~150 words/minute, ~750 characters/minute.

10.1 STT Cost

| Option | Monthly Cost at 30 min/day |
|---|---|
| Web Speech API | $0 (free) |
| OpenAI Realtime (input) | 15 min × $0.06/min × 30 days = $27/month |
| Whisper.cpp WASM | $0 (free, runs in browser) |

Recommendation: Web Speech API for Tier 1–2. Save $27/month.

10.2 TTS Cost

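| Option | Monthly Cost at 30 min/day | Notes |
|---|---|---|
| AWS Polly Standard | ~$3.60/month | $4/1M chars × 0.9M chars/month. Robotic quality. |
| AWS Polly Neural (Brian) | ~$14.40/month | $16/1M chars. Good quality, natural. |
| AWS Polly Generative (Brian) | ~$27/month | $30/1M chars. Best Polly quality. |
| ElevenLabs Daniel (Creator plan) | $22/month flat | Covers 100K chars (~100 min/month); fine for ~3 min/day voice, overage at 30 min/day. |
| ElevenLabs Daniel (Pro plan) | $99/month flat | Covers 500K chars (~500 min/month). Covers 30 min/day usage. |
| ElevenLabs Daniel (pay-as-you-go) | ~$54/month | $0.06/1K chars × 900K chars. |
| OpenAI Realtime (output, bundled) | $108/month | 15 min/day × $0.24/min × 30 days. Includes LLM + TTS. |

Basis for the per-character rows: 0.9M chars/month corresponds to 30 min/day of synthesized speech at about 1,000 chars/min, so these figures are a ceiling. Under the assumptions stated at the top of this section (15 min/day of TITAN speech at ~750 chars/min, about 337,500 chars/month), the per-character rows scale to 37.5% of the amounts shown, for example about $5.40/month for Polly Neural and about $20/month for ElevenLabs pay-as-you-go.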

Recommendation:

Tier 1 MVP: Polly Neural Brian = $14.40/month (or $0 on free tier). Use free tier first.
Tier 2: ElevenLabs Creator ($22/mo) for light use (<3 min/day actual TITAN speaking) or Pro ($99/mo) for heavy daily use. At 30 min/day total, TITAN speaks ~15 min/day = 11,250 chars/day → 337,500 chars/month → Creator tier overages kick in. Use the pay-as-you-go API rate: ~$20/month at $0.06/1K chars for 337K chars.
Tier 3 (OpenAI Realtime): ~$135/month all-in (STT + LLM + TTS unified). Premium option.
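The TTS arithmetic above can be checked with a short script. It is a sketch using only the rates and assumptions quoted in this section; the function names are hypothetical. It prices both volumes that appear in the text: 337,500 chars/month (15 min/day of TITAN speech at 750 chars/min) and the 0.9M chars/month used in the table.

```python
CHARS_PER_MIN = 750      # speaking pace from the assumptions above
TITAN_MIN_PER_DAY = 15   # TITAN speaks 50% of 30 min/day
DAYS_PER_MONTH = 30

# Per-character rates quoted in the table, in USD per 1M characters.
RATES_PER_MILLION = {
    "Polly Standard": 4.0,
    "Polly Neural": 16.0,
    "Polly Generative": 30.0,
    "ElevenLabs pay-as-you-go": 60.0,  # $0.06 per 1K chars
}


def monthly_chars(minutes_per_day: float = TITAN_MIN_PER_DAY) -> int:
    """Characters synthesized per month at the assumed speaking pace."""
    return int(minutes_per_day * CHARS_PER_MIN * DAYS_PER_MONTH)


def monthly_costs(chars: int) -> dict:
    """Monthly USD cost per provider for a given character volume."""
    return {name: rate * chars / 1_000_000
            for name, rate in RATES_PER_MILLION.items()}


if __name__ == "__main__":
    for label, chars in (("derived", monthly_chars()), ("table basis", 900_000)):
        print(f"{label}: {chars:,} chars/month")
        for name, cost in monthly_costs(chars).items():
            print(f"  {name}: ${cost:.2f}")
```

Flat-rate plans (ElevenLabs Creator and Pro) are left out because their cost does not scale per character until the included quota is exceeded.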

10.3 LLM Cost

The voice prompt goes through the existing Claude API call in TITAN. Voice queries are typically short (1–3 sentence prompt, 2–5 sentence response), roughly 200 tokens in and 200 tokens out per turn. At 10 turns/day that is 4,000 tokens/day × 30 = 120,000 tokens/month, split evenly into 60,000 input and 60,000 output tokens. At Claude Sonnet pricing ($3/1M input + $15/1M output): ~$0.18 + $0.90 = ~$1.08/month marginal LLM cost for voice.

10.4 Total Monthly Cost Summary

| Tier | STT | TTS | LLM | Total |
|---|---|---|---|---|

---

11. Decision Matrix

|---|---|---|---|---|---|

| Monthly Cost | $0 | $108–$135 | $20–$99 | $0–$27 | $0 |

Recommended combos by tier:

|---|---|---|---|---|

---

12. The "Talk to Me" Briefing Prompt

When the user clicks "talk to me" or says those words:

12.1 System Prompt for Briefing Mode


You are JARVIS delivering a morning briefing to Harnoor. Speak in natural sentences only — no bullet points, no headers, no markdown. Be brief and sovereign. The structure is: (1) opening frame line, (2) current numbers, (3) top 3 priorities, (4) open question. Maximum 90 seconds of speaking time (~225 words). Be specific with every number. End with an open mic invitation.

12.2 Data Payload (read from state files)

Before calling Claude, the bridge reads:

F:/TITAN/state/harnoor-asks.jsonl → count open asks, count blocked, last reply timestamp
F:/TITAN/state/inbox-queue.jsonl → queue depth
F:/TITAN/plans/advisors/ → most recently modified memo filename
F:/TITAN/state/voice-config.json → today's focus area (if set)

12.3 Briefing Template


Opening frame (rotate from pool):
- "The systems are running. Let's take a look at where things stand."
- "All channels are active. Here's your current picture."
- "You asked for a briefing. Here's the intelligence."
- "Running the state of play. A few things worth your attention."

Numbers:
- "[N] open asks, [M] of which are marked S-priority."
- "The queue has [Q] items waiting. Last reply came through [X] minutes ago."

Priorities:
- "Your top three for today: [item 1], [item 2], and [item 3]."
(Read from voice-config.json today_priorities, or derive from ask ledger)

Open question:
- "What would you like to dig into first?"
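The template above could also be filled deterministically on the bridge (for example as a fallback when the LLM call fails). The sketch below is one way to do that, assuming the counts have already been read from the state files; the function names and the fallback wording for fewer than three priorities are assumptions.

```python
import random
from typing import Optional, Sequence

OPENINGS = [
    "The systems are running. Let's take a look at where things stand.",
    "All channels are active. Here's your current picture.",
    "You asked for a briefing. Here's the intelligence.",
    "Running the state of play. A few things worth your attention.",
]


def join_items(items: Sequence[str]) -> str:
    """Join spoken list items with commas and a final 'and'."""
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def render_briefing(open_asks: int, s_priority: int, queue_depth: int,
                    minutes_since_reply: int, priorities: Sequence[str],
                    rng: Optional[random.Random] = None) -> str:
    """Fill the briefing template; priorities beyond the first three are dropped."""
    chooser = rng or random.Random()
    top = list(priorities)[:3]
    parts = [
        chooser.choice(OPENINGS),  # rotate the opening frame
        f"{open_asks} open asks, {s_priority} of which are marked S-priority.",
        f"The queue has {queue_depth} items waiting. "
        f"Last reply came through {minutes_since_reply} minutes ago.",
    ]
    if top:
        # Assumption: soften the wording when fewer than three items are set.
        label = "top three" if len(top) == 3 else "top priorities"
        parts.append(f"Your {label} for today: {join_items(top)}.")
    parts.append("What would you like to dig into first?")
    return " ".join(parts)
```

Passing a seeded `random.Random` makes the opening line reproducible in tests while production calls still rotate through the pool.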

12.4 Example Rendered Briefing

> "The systems are running. Let's take a look at where things stand. You have 9 open asks. Four are S-priority. The queue has 2 items pending. Last reply came through 6 minutes ago. Your top three for today: the JARVIS build, the Silent Infinity R0211 blocker, and the Polly streaming integration. The latest memo is the client-side STT research brief from this morning. What would you like to dig into first?"

Duration: approximately 25 seconds at a natural speaking pace. Concise, specific, sovereign.

---

13. Build Order — Specific File Paths and Integration Points

13.1 Phase 0 — Config and State Files

Create F:/TITAN/state/voice-config.json:


{
  "tts_provider": "polly",
  "tts_voice_id": "Brian",
  "tts_engine": "neural",
  "elevenlabs_voice_id": "onwK4e9ZLuTAKqWW03F9",
  "elevenlabs_model_id": "eleven_turbo_v2_5",
  "elevenlabs_voice_settings": {
    "stability": 0.40,
    "similarity_boost": 0.75,
    "style": 0.35
  },
  "jarvis_persona_prompt": "You are JARVIS, TITAN's voice interface. Speak with calm British authority — precise, brief, occasionally dry-humored. Natural sentences only, no markdown. Lead with the answer. Maximum 3 sentences per response unless detail is explicitly requested.",
  "command_tiles": [
    {"id": "talk_to_me", "label": "talk to me", "prompt_template": "briefing"},
    {"id": "shipped_today", "label": "what shipped today", "prompt_template": "shipped_today"},
    {"id": "whats_blocked", "label": "what's blocked", "prompt_template": "blocked"},
    {"id": "next_move", "label": "next move", "prompt_template": "next_move"},
    {"id": "latest_memo", "label": "latest memo", "prompt_template": "latest_memo"},
    {"id": "quick_pulse", "label": "quick pulse", "prompt_template": "quick_pulse"}
  ],
  "today_priorities": []
}

13.2 Phase 1 — Bridge Routes (titan_bridge.py additions)

Add to titan_bridge.py after the existing route handlers:


# ── Voice / JARVIS routes ──────────────────────────────────────────
elif path == "/voice":
    return self._serve_static_file("jarvis/index.html", "text/html")

elif path == "/api/voice" and method == "POST":
    return self._handle_voice_post(body)

elif path.startswith("/api/voice/audio/") and method == "GET":
    audio_id = path.split("/")[-1]
    return self._serve_voice_audio(audio_id)

Key functions to add:


def _handle_voice_post(self, body: dict) -> dict:
    """
    Accepts {prompt, session_id, tile_id?}
    Returns {text, audio_url, duration_ms}
    """
    # 1. Resolve prompt (tile fast-path or freeform)
    # 2. Build context from state files if briefing tile
    # 3. Call Claude API with JARVIS persona prepended
    # 4. Call _synthesize_voice(text) -> audio_path
    # 5. Store audio at F:/TITAN/state/voice-audio/<uuid>.mp3
    # 6. Schedule deletion after 60s
    # 7. Return {text, audio_url: /api/voice/audio/<uuid>, duration_ms}
    raise NotImplementedError("outline only; steps 1-7 are still to be written")

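def _synthesize_voice(self, text: str) -> Path:
    """
    Routes to Polly or ElevenLabs based on voice-config.json.
    Returns path to synthesized audio file.
    """
    # Assumes titan_bridge.py already imports json and pathlib.Path
    # and defines _TITAN_ROOT.
    config = json.loads((_TITAN_ROOT / "state" / "voice-config.json").read_text(encoding="utf-8"))
    provider = config["tts_provider"]
    if provider == "polly":
        return self._polly_synthesize(text, config)
    elif provider == "elevenlabs":
        return self._elevenlabs_synthesize(text, config)
    # Fail loudly instead of silently returning None on a typo in the config.
    raise ValueError(f"Unknown tts_provider: {provider!r}")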
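Steps 5 and 6 of the outline, plus the lookup behind /api/voice/audio/<id>, can be sketched as two helpers. The names `store_audio` and `resolve_audio` are hypothetical, and the use of `threading.Timer` is an assumption rather than the bridge's existing mechanism. The ID check matters because the route takes the last path segment from the URL and turns it into a filename.

```python
import re
import threading
import uuid
from pathlib import Path
from typing import Optional

_AUDIO_ID = re.compile(r"[0-9a-f]{32}")  # shape of uuid4().hex


def store_audio(audio_dir: Path, data: bytes, ttl_seconds: float = 60.0) -> str:
    """Write audio bytes under a random ID and schedule their deletion."""
    audio_dir.mkdir(parents=True, exist_ok=True)
    audio_id = uuid.uuid4().hex
    path = audio_dir / f"{audio_id}.mp3"
    path.write_bytes(data)
    timer = threading.Timer(ttl_seconds, lambda: path.unlink(missing_ok=True))
    timer.daemon = True  # do not keep the bridge alive just to delete a file
    timer.start()
    return audio_id


def resolve_audio(audio_dir: Path, audio_id: str) -> Optional[Path]:
    """Map a requested ID to a file, accepting only a bare 32-char hex ID."""
    if not _AUDIO_ID.fullmatch(audio_id):
        return None  # rejects "..", separators, and other path tricks
    path = audio_dir / f"{audio_id}.mp3"
    return path if path.is_file() else None
```

Because the timers are daemon threads, files written shortly before a bridge restart are not cleaned up; a sweep of the audio directory at startup would cover that case.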

13.3 Phase 2 — Static Files (F:/TITAN/static/jarvis/)


F:/TITAN/static/jarvis/
├── index.html          # Full-viewport JARVIS page, loads Three.js + voice.js
├── voice.js            # SpeechRecognition, tile handlers, fetch /api/voice, audio playback
├── scene.js            # Three.js scene: particles, rings, orb (Tier 2)
├── audio-bridge.js     # Web Audio API AnalyserNode → exports getAmplitude()
├── style.css           # Dark theme, glassmorphic tiles, transcript bar, state colors
└── lib/
    └── three.module.js # Three.js bundled (or CDN link in index.html)

index.html structure (Tier 1):


<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>TITAN — JARVIS</title>
  <link rel="stylesheet" href="style.css">
</head>
<body class="jarvis-root">
  <canvas id="jarvis-canvas"></canvas>      <!-- Three.js target (Tier 2+) -->
  <div id="transcript-bar"></div>           <!-- Scrolling transcript -->
  <div id="tile-strip">                     <!-- Command tiles -->
    <!-- tiles injected by voice.js from voice-config.json -->
  </div>
  <button id="mic-btn" class="tile tile--mic">ask anything</button>
  <button id="close-btn" onclick="window.location='/'">×</button>
  <script type="module" src="voice.js"></script>
</body>
</html>

13.4 Phase 3 — Three.js Scene (scene.js, Tier 2)


// scene.js — key exports
export function initScene(canvas) { /* ... */ }
export function setAmplitude(value) {
  // Updates uAmplitude uniform on particle system
  // Updates emissiveIntensity on central orb
  // Updates ring glow
}
export function setState(state) {
  // state: 'idle' | 'listening' | 'processing' | 'speaking'
  // Lerps color uniforms to state palette
}

voice.js calls setState('listening') when mic activates, setState('processing') while waiting for API, setState('speaking') while audio plays, setState('idle') when audio ends.

13.5 voice-config.json Hot Reload

titan_bridge.py reads voice-config.json on every /api/voice call (no restart needed). Harnoor can edit the JSON to swap voice IDs, change persona, or reorder tiles without restarting the bridge.
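A minimal sketch of the hot-reload read, under two assumptions that go beyond the text: the file is re-parsed only when its modification time changes, and a config that is invalid mid-edit falls back to the last good copy instead of failing the request. The function name and the list of required keys are hypothetical.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("tts_provider", "tts_voice_id",
                 "jarvis_persona_prompt", "command_tiles")
_cache: dict = {}  # path -> (mtime_ns, parsed config)


def load_voice_config(path: Path) -> dict:
    """Return the current config, re-parsing only when the file has changed."""
    mtime = path.stat().st_mtime_ns
    cached = _cache.get(path)
    if cached and cached[0] == mtime:
        return cached[1]
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
        missing = [k for k in REQUIRED_KEYS if k not in config]
        if missing:
            raise ValueError(f"voice-config.json missing keys: {missing}")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        if cached:
            return cached[1]  # keep serving the last good config mid-edit
        raise
    _cache[path] = (mtime, config)
    return config
```

The fallback means a half-saved edit does not take voice offline; the next valid save is picked up on the following request.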

13.6 Integration Points Checklist

| Point | File | Status |
|---|---|---|
| /voice GET route | titan_bridge.py | To add |
| /api/voice POST route | titan_bridge.py | To add |
| /api/voice/audio/<id> GET | titan_bridge.py | To add |
| Polly synthesis function | titan_bridge.py | Extend existing Polly usage |
| ElevenLabs synthesis function | titan_bridge.py | New (Tier 2) |
| Voice config JSON | F:/TITAN/state/voice-config.json | New |
| JARVIS static files | F:/TITAN/static/jarvis/ | New directory |
| Three.js scene | F:/TITAN/static/jarvis/scene.js | New (Tier 2) |
| Bridge auth check | Existing _check_auth() in bridge | Already covers new routes |

---

Sources

1. OpenAI API Pricing — Realtime API audio pricing: $100/1M input tokens, $200/1M output tokens (~$0.06/min input, $0.24/min output)

2. Introducing gpt-realtime — OpenAI — Realtime API production announcement

3. ElevenLabs Pricing 2026 — Flexprice — Plan tiers, credits, overage rates

4. ElevenLabs Pricing — pxlpeak — Agent per-minute pricing confirmed

5. ElevenLabs Pricing Official — Canonical plan page

6. Amazon Polly Pricing — Neural $16/1M, Generative $30/1M chars

7. Amazon Polly Bidirectional Streaming — AWS Blog — 39% latency improvement announcement

8. Amazon Polly Expands Generative TTS — March 2026 — New voices + Bidirectional Streaming GA

9. Amazon Polly Neural Voices — AWS Docs — Brian, Arthur voice specs

10. Web Speech API — MDN — Official spec and usage

11. Speech Recognition API — Can I Use — Browser compatibility table

12. Whisper.cpp stream.wasm — ggml.io — Real-time WASM STT demo

13. whisper.cpp GitHub — ggml-org — Source, WASM examples, MIT license

14. Anthropic Claude Voice Mode — TechCrunch — Voice mode is Claude Code only, not a public API (March 2026)

15. harsh-raj00/my-jarvis — GitHub — MIT, React + Three.js JARVIS implementation

16. tgcnzn/Interactive-Particles-Music-Visualizer — GitHub — MIT, audio-reactive Three.js particles

17. Codrops — Audio-Reactive Particles Three.js — Curl noise technique tutorial

18. Codrops — 3D Audio Visualizer Three.js GSAP — 2025 update

19. dcyoung/r3f-audio-visualizer — GitHub — React Three Fiber audio visualizer

20. ektogamat/threejs-vanilla-holographic-material — GitHub — MIT holographic shader

21. ektogamat/threejs-holographic-material — GitHub — MIT React Three Fiber version

22. Humprt/particula — GitHub — Multi-sphere frequency visualizer

23. Three.js Audio Reactive Particles Demo — Official Three.js demo

24. ElevenLabs Daniel voice — json2video — Daniel voice ID confirmation: onwK4e9ZLuTAKqWW03F9

25. ElevenLabs Antoni voice — json2video — Antoni voice ID confirmation: ErXwobaYiN019PkySvjV

26. MDN — Visualizations with Web Audio API — AnalyserNode amplitude visualization

27. steffenpharai/Jarvis — GitHub — Offline Iron Man JARVIS reference architecture

28. Three.js Journey — Hologram Shader — Scanline HUD shader technique

---

SCOUT A085 | Generated 2026-04-27 | F:/TITAN/plans/advisors/JARVIS-VOICE-CHAT-SPEC-2026-04-27.md