SCOUT Memo A085 | 2026-04-27 | Confidential — Harnoor only
---
This document is the complete specification for a JARVIS-style voice interface for the TITAN dashboard. It covers every layer of the stack: speech-to-text options with verified 2026 pricing, TTS voice selection with real voice IDs, Three.js HUD architecture with verified open-source demos, the full voice flow from mic to visualizer, command tile design, cinema-mode integration, privacy architecture, and a tiered build plan. The goal is a voice experience that feels like Tony Stark activating JARVIS — calm, British, instantly authoritative, with an audio-reactive holographic background that pulses when TITAN speaks.
---
How it works. The browser's SpeechRecognition interface streams microphone audio to Google's speech servers (Chrome/Edge) or Apple's servers (Safari) and returns interim and final transcript events in JavaScript. STT needs no round-trip to the TITAN bridge: the browser handles the vendor call natively.
Latency. Final transcript typically arrives 200–400ms after the user stops speaking. Interim results appear almost immediately during speech. End-to-end (mic → text visible on screen) is ~500ms in practice.
Cost. Free. Google does not charge for Web Speech API use in Chrome. Apple's implementation in Safari is also free.
Browser support (verified April 2026). Chrome and most Chromium-based browsers, including Edge, support SpeechRecognition. Brave is the exception: it blocks the API because it relies on Google's proprietary speech recognition service, which Brave does not license. Firefox does NOT support it in any version. Safari supports it on macOS and iOS, but it requires a network connection because it routes audio to Apple servers. Since Harnoor uses Windows 11, Chrome is the primary target, with full support confirmed.
Setup complexity. Minimal. Three lines of JavaScript to instantiate, start, and listen for results. No API keys, no server changes.
Key limitation. Accuracy degrades in noisy environments. No offline capability. Safari occasionally requires an explicit user gesture per session. Results can vary by accent.
---
How it works. A single WebSocket connection to wss://api.openai.com/v1/realtime handles the entire pipeline: STT (audio in → text), LLM inference (text → response text), and TTS (text → audio out). Audio is sent as raw PCM chunks; audio response arrives as PCM chunks. There is no separate STT or TTS call — it is one unified stream.
Latency. Optimized for conversational latency. Response audio typically begins within 500–800ms of the user finishing speech. Supports interruption (barge-in): if the user speaks over TITAN, the model stops generating.
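Cost (verified April 2026, source: openai.com/api/pricing). The gpt-4o Realtime model bills at $100 per 1M audio input tokens and $200 per 1M audio output tokens. At roughly 600 audio input tokens and 1,200 audio output tokens per minute of speech (the token rates implied by the per-minute prices used throughout this memo), this works out to about $0.06 per minute of input audio and $0.24 per minute of output audio.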
At 30 minutes/day of active conversation (split ~50/50 between user speaking and TITAN responding): ~($0.06 × 15) + ($0.24 × 15) = $0.90 + $3.60 = ~$4.50/day or ~$135/month.
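The daily and monthly figures above can be reproduced with a short calculation. This is a minimal sketch that takes the memo's per-minute rates ($0.06 input, $0.24 output) and the 50/50 speaking split as given; the function and parameter names are illustrative only.

```python
def realtime_monthly_cost(minutes_per_day: float,
                          user_share: float = 0.5,
                          input_usd_per_min: float = 0.06,
                          output_usd_per_min: float = 0.24,
                          days: int = 30) -> float:
    """Estimate monthly Realtime API audio cost in USD from per-minute rates."""
    user_minutes = minutes_per_day * user_share        # audio in (user speaking)
    titan_minutes = minutes_per_day - user_minutes     # audio out (TITAN speaking)
    daily = user_minutes * input_usd_per_min + titan_minutes * output_usd_per_min
    return round(daily * days, 2)


monthly = realtime_monthly_cost(30)  # 30 minutes/day, split 50/50
```

Changing `user_share` shows how sensitive the total is to who does the talking, since output audio is priced at four times input audio.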
Browser support. Chrome, Edge, Safari — any browser that supports WebSocket (universal). The raw WebSocket can be used from JavaScript; no special browser feature required.
Setup complexity. Medium. Requires managing WebSocket state, audio encoding/decoding (PCM16, 24kHz), session configuration, and event handling for response.audio.delta, input_audio_buffer.speech_started, etc. The OpenAI openai-realtime-api-beta npm package simplifies this.
Verdict. Best quality and lowest latency available in 2026, but at $135/month for Harnoor's usage pattern it is the most expensive option. Justified for Tier 3 ("Full JARVIS") but overkill for MVP.
---
How it works. ElevenLabs provides a streaming TTS endpoint at POST /v1/text-to-speech/{voice_id}/stream that returns an audio/mpeg stream as it generates. The TITAN bridge sends the text response to ElevenLabs and pipes the audio stream back to the browser via chunked HTTP. The browser plays it via the Web Audio API.
Latency. ElevenLabs Flash and Turbo models ("eleven_flash_v2_5", "eleven_turbo_v2_5") deliver first audio chunk in ~75–150ms from the start of text input. Multilingual v2/v3 models have ~300–500ms TTFA (time to first audio). For JARVIS purposes, Turbo v2.5 is the sweet spot.
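Cost (verified April 2026). Plans referenced in this memo:
- Starter: $5/month
- Creator: $22/month, 100K characters
- Pro: $99/month, 500K characters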
API pricing outside subscriptions: $0.06/1,000 chars (Flash/Turbo) or $0.12/1,000 chars (Multilingual v2/v3).
At 30 minutes/day of TITAN speech output (~1,000 chars/min × 30 min = 30,000 chars/day, or 900,000 chars/month), usage exceeds the Creator tier and the Pro quota as well. The Pro plan at $99/month covers ~500 min/month, which is sufficient only if active voice output averages about 15 min/day. Starter at $5/month works for occasional use (MVP testing phase). At the full 900,000 chars/month, pay-as-you-go at $0.06/1,000 chars costs 900 × $0.06 = $54/month.
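The character math can be checked with a few lines. This sketch assumes the memo's own numbers (about 1,000 characters per minute of speech, the Creator and Pro quotas quoted in the cost tables, and $0.06 per 1,000 characters pay-as-you-go); the names are illustrative.

```python
# (USD per month, included characters) as quoted elsewhere in this memo.
PLANS = {"Creator": (22, 100_000), "Pro": (99, 500_000)}


def monthly_chars(minutes_per_day: float, chars_per_min: int = 1000, days: int = 30) -> int:
    """Characters of TTS output per month for a given daily speaking time."""
    return int(minutes_per_day * chars_per_min * days)


def payg_cost(chars: int, usd_per_1k: float = 0.06) -> float:
    """Pay-as-you-go cost in USD for a number of characters."""
    return round(chars / 1000 * usd_per_1k, 2)


def plans_that_fit(chars: int) -> list:
    """Names of plans whose included quota covers the given usage."""
    return [name for name, (_, quota) in PLANS.items() if chars <= quota]


heavy = monthly_chars(30)   # 30 min/day of TITAN speech
light = monthly_chars(15)   # 15 min/day of TITAN speech
```

With these inputs, the 30 min/day case fits neither quota, while the 15 min/day case fits Pro only.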
Browser support. Universal — audio/mpeg streaming works in all modern browsers. The TITAN bridge handles the ElevenLabs API call; the browser just plays an audio URL/stream.
Setup complexity. Low to medium. ElevenLabs Python SDK (pip install elevenlabs) with a single generate() call and streaming enabled. Bridge receives text, calls ElevenLabs, streams back to browser as audio blob or via WebSocket.
---
How it works. Polly converts text to speech via synthesize_speech() (single-call, returns full audio file) or the new Bidirectional Streaming API (announced March 2026, GA). The Bidirectional Streaming API sends text word-by-word as an LLM generates and receives audio back in real-time over a single HTTP/2 connection.
Bidirectional Streaming (new, March 2026). AWS announced Bidirectional Streaming for Polly in March 2026. In benchmarks against 970-word prose: 39% faster than traditional single-call API, single API connection vs. 27 sequential calls. This makes Polly a real streaming option for the first time. Available in: US East (N. Virginia), US West (Oregon), Europe (Frankfurt), Europe (London), Asia Pacific (Singapore), Canada (Central).
Latency. With Bidirectional Streaming and Neural voices: first audio chunk arrives in ~200–400ms. Standard single-call Polly adds full synthesis time (~1–3 seconds for longer responses) but this is now avoidable.
Cost (verified April 2026, source: aws.amazon.com/polly/pricing). Neural TTS: $16/1M characters. Generative TTS (new Brian/Arthur voices): $30/1M characters. Standard voices: $4/1M characters.
At 30 min/day output (~900,000 chars/month): Standard ~$3.60/month, Neural ~$14.40/month, Generative ~$27/month.
Free tier: 1M characters/month for Neural voices for the first 12 months. If TITAN's AWS account is within the free tier period, Polly costs $0 for MVP.
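A small sketch of the Polly arithmetic, including the free-tier allowance, may help when comparing engines. The per-million rates are the ones quoted in this section; the function name and the `free_tier_chars` parameter are illustrative assumptions.

```python
# USD per 1M characters, as quoted in this memo.
POLLY_RATES = {"standard": 4.0, "neural": 16.0, "generative": 30.0}


def polly_monthly_cost(chars: int, engine: str, free_tier_chars: int = 0) -> float:
    """Monthly Polly cost in USD after subtracting any free-tier allowance."""
    billable = max(0, chars - free_tier_chars)
    return round(billable / 1_000_000 * POLLY_RATES[engine], 2)


usage = 900_000  # 30 min/day at ~1,000 chars/min
neural_paid = polly_monthly_cost(usage, "neural")
neural_free = polly_monthly_cost(usage, "neural", free_tier_chars=1_000_000)
```

Because 900,000 characters sits under the 1M free-tier allowance, the Neural engine costs nothing while the account is still inside its first 12 months.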
Browser support. Polly returns audio/mpeg or audio/ogg. Any browser plays these formats.
Setup complexity. Already wired in TITAN (boto3 is already available). Adding /voice endpoint is minimal work — existing bridge can call Polly directly.
---
Status (verified April 2026). Anthropic launched Voice Mode for Claude Code in March 2026, beginning rollout ~March 3–4, 2026. It is a push-to-talk interface using a specialized version of Claude 3.7 Sonnet optimized for low-latency audio. Key confirmed details:
Critical limitation. As of April 2026, Anthropic Voice Mode is a Claude Code feature — it is an interface within the Claude Code desktop/IDE tool, NOT a public real-time audio API that third parties can call. Anthropic has NOT released a public WebSocket or streaming audio API endpoint equivalent to OpenAI's Realtime API. Custom voice cloning and offline voice packs are on the 2026 roadmap but have not shipped.
Verdict for TITAN. Not available as a programmable API in April 2026. Cannot be integrated into TITAN's bridge. Monitor for a public audio API release. When/if released, it would be the highest-quality option given TITAN already calls Claude for LLM responses.
---
How it works. whisper.cpp (the C++ port of OpenAI's Whisper model from the ggml project, ggml.ai) compiles to WebAssembly and runs entirely in the browser. The user's browser downloads a WASM binary + model file (tiny: ~75MB, base: ~142MB), records mic audio via getUserMedia, and sends audio chunks to the WASM module for transcription. No audio ever leaves the browser.
Real-time streaming demo. The official stream.wasm example at https://ggml.ai/whisper.cpp/stream.wasm/ demonstrates real-time transcription in the browser. A separate Whisper Web app at https://whisperweb.app/ runs fully in-browser with the WASM build.
Latency. For the tiny model: ~2–3x real-time on a modern CPU (i.e., transcribes 30s of speech in ~10–15s). For the base model: ~3–5x real-time. This means Whisper.cpp WASM is NOT suitable for true real-time conversation — there is a 5–15 second lag behind speech. The stream.wasm example mitigates this by running on short rolling windows but latency remains noticeable.
Browser support. Requires WASM SIMD instructions. Chrome 91+ and Edge 91+ support this. Firefox 89+ supports it. Safari on macOS 15+ supports it. Older browsers and mobile Safari may not.
Cost. Zero marginal cost — runs client-side with no API calls.
Setup complexity. High. Requires serving a large WASM binary (75–142MB initial download), handling SharedArrayBuffer (requires COOP/COEP security headers), managing the WASM thread pool, and streaming audio buffers. This is a significant engineering effort.
Verdict. Privacy-maximalist option for sensitive environments, but latency makes it feel sluggish compared to Web Speech API or ElevenLabs. Better suited for a "privacy mode" option in Tier 3 than as the primary STT path.
---
For MVP (Tier 1): Web Speech API (STT) + AWS Polly Brian Neural TTS. Zero marginal cost for STT; Polly likely within free tier if account is fresh. Total estimated monthly cost: $0–$14.40.
For JARVIS look (Tier 2): Web Speech API (STT, keep it — it's good enough) + ElevenLabs Daniel Turbo v2.5 (TTS). Superior voice quality. ElevenLabs Starter ($5/mo) covers testing; upgrade to Creator ($22/mo) for daily use.
For Full JARVIS (Tier 3): OpenAI Realtime API (unified STT + LLM + TTS in one WebSocket). Premium latency, premium cost (~$135/month at current usage). Or wait for Anthropic to release a public real-time audio API, which would allow native Claude integration at potentially lower cost.
---
1. harsh-raj00/my-jarvis
URL: https://github.com/harsh-raj00/my-jarvis
Live demo: https://jarvis-frontend-uj30.onrender.com
License: MIT
Stack: React 18, Three.js with React Three Fiber, Framer Motion, Vite, Tailwind CSS, Web Speech API, Tone.js, ElevenLabs TTS, FastAPI backend, WebSocket
Key features: 8,000-particle glowing sphere, orbital rings with spiral animations, custom GLSL shaders, glassmorphic design, audio-reactive scaling, Siri-style animated voice popup with 5 morphing blob layers, color-coded states (listening/processing/speaking)
Verdict: The closest existing open-source implementation to what Harnoor wants. Directly borrowable under MIT.
2. tgcnzn/Interactive-Particles-Music-Visualizer
URL: https://github.com/tgcnzn/Interactive-Particles-Music-Visualizer
License: MIT (confirmed)
Tutorial: https://tympanus.net/codrops/2023/12/19/creating-audio-reactive-visuals-with-dynamic-particles-in-three-js/
Stack: Three.js, Web Audio API, GLSL shaders
Key features: Curl noise in vertex shader for organic particle movement, audio frequency band analysis (low/mid/high), BPM detection, procedural geometry, shader-based animation driven by audio uniforms
Verdict: Best-in-class particle system for audio reactivity. The curl noise technique is exactly what makes JARVIS backgrounds feel alive. MIT licensed, well-documented.
3. dcyoung/r3f-audio-visualizer
URL: https://github.com/dcyoung/r3f-audio-visualizer
Live demo: https://dcyoung.github.io/r3f-audio-visualizer/
Stack: React Three Fiber, Three.js, Web Audio API
Key features: Multiple visualizer modes, mic input support, real-time AnalyserNode FFT data driving Three.js meshes
Verdict: React Three Fiber-based, which matches TITAN's React frontend. Multiple modes let Harnoor pick the aesthetic.
4. ektogamat/threejs-vanilla-holographic-material
URL: https://github.com/ektogamat/threejs-vanilla-holographic-material
License: MIT
Key features: Shader-based holographic material with scanlines, Fresnel rim effects, color aberration, optional bloom post-processing. Works with any Three.js mesh — apply to orbital ring geometry for instant JARVIS look.
Verdict: Drop-in holographic shader for the orbital rings. MIT, minimal setup.
5. ektogamat/threejs-holographic-material
URL: https://github.com/ektogamat/threejs-holographic-material
License: MIT
Stack: React Three Fiber version of the above
Key features: Same holographic effects as above, packaged as a React component. npm install @ektogamat/threejs-holographic-material
Verdict: If TITAN's voice page is React-based, use this over the vanilla version.
6. tpowellmeto/HolographicEffect
URL: https://github.com/tpowellmeto/HolographicEffect
Key features: Holographic effect renderer for Three.js with scanline and interference patterns
Verdict: Alternative holographic shader. More raw Three.js, less opinionated than ektogamat's.
7. steffenpharai/Jarvis
URL: https://github.com/steffenpharai/Jarvis
Key features: Fully offline Iron Man J.A.R.V.I.S. on Jetson Orin Nano — voice + vision + 3D holograms + health monitoring. Iron Man-style AR tracking with real-time annotations, vitals dashboard via WebSocket. No cloud. No API keys.
Verdict: Privacy-first reference architecture. Good model for Tier 3 offline mode.
8. Humprt/particula
URL: https://github.com/Humprt/particula
Key features: Five independent particle spheres reacting to different frequency bands (bass, mid-low, mid, mid-high, treble), noise and turbulence dynamics driven by audio
Verdict: Multi-sphere audio reactivity. Good for showing TITAN "thinking" with different frequency bands driving different visual elements.
Codrops — Audio-Reactive Particles in Three.js (December 2023)
URL: https://tympanus.net/codrops/2023/12/19/creating-audio-reactive-visuals-with-dynamic-particles-in-three-js/
The canonical tutorial for building audio-reactive Three.js scenes. Covers curl noise, frequency analysis, shader uniforms. Code available at tgcnzn/Interactive-Particles-Music-Visualizer above.
Codrops — 3D Audio Visualizer with Three.js, GSAP & Web Audio API (June 2025)
URL: https://tympanus.net/codrops/2025/06/18/coding-a-3d-audio-visualizer-with-three-js-gsap-web-audio-api/
More recent tutorial incorporating GSAP for smooth transitions. Useful for Tier 2 polish.
Three.js Official — Audio Demo
URL: https://threejsdemos.com/demos/audio/particles
Title: "Audio Reactive Particles in Three.js — Three.js Demos"
Direct live demo of audio-reactive particles with source code.
Three.js Journey — Hologram Shader
URL: https://threejs-journey.com/lessons/hologram-shader
Paid course but the lesson overview is publicly visible. Shows how to build JARVIS-style scanline effects using ShaderMaterial with repeating gradients. The exact technique used in the ektogamat holographic material above.
| Pattern | Source Repo | Implementation Complexity |
|---|---|---|
| Orbital rings (holographic) | ektogamat/threejs-holographic-material | Low — npm install |
| Particle cloud (audio-reactive) | tgcnzn/Interactive-Particles-Music-Visualizer | Medium — shader uniforms |
| Multi-sphere frequency orbs | Humprt/particula | Medium |
| Central glowing orb + rings | harsh-raj00/my-jarvis | Low — fork and adapt |
| Scanline / Fresnel HUD | tpowellmeto/HolographicEffect | Medium |
| Hex grid background | Custom GLSL, no canonical repo found | High |
| Voice-driven equalizer bars | MDN AnalyserNode + Canvas2D | Low — native APIs |
---
The /voice page canvas is full-viewport. Four layers, three rendered in WebGL plus an HTML overlay:
Layer 0 (background): particle field — 8,000–12,000 points
Layer 1 (mid): 3–5 orbital rings at different inclinations
Layer 2 (foreground): central orb / abstract face
Layer 3 (UI overlay): command tiles, transcript bar, mic indicator (HTML/CSS, not WebGL)
Implementation. THREE.BufferGeometry with a THREE.Points primitive. Particles positioned on a sphere surface with random jitter. Each frame, a vertex shader receives uAmplitude (float) and uTime (float) uniforms. Amplitude drives radial displacement of each particle from its rest position. Curl noise (from tgcnzn repo above) adds organic turbulence.
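// vertex shader (simplified)
// snoise() is a 3D simplex-noise function that must be defined or included
// above main(); it is not a GLSL built-in.
uniform float uAmplitude;
uniform float uTime;
attribute vec3 aBasePosition;
void main() {
  vec3 pos = aBasePosition;
  float noise = snoise(pos * 0.8 + uTime * 0.3);
  // Push each particle outward along its radius by noise plus audio amplitude
  pos += normalize(pos) * (noise * 0.15 + uAmplitude * 0.4);
  gl_Position = projectionMatrix * modelViewMatrix * vec4(pos, 1.0);
  gl_PointSize = 1.5 + uAmplitude * 3.0;
}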
Audio connection. Web Audio API AnalyserNode connected to the audio element output:
const ctx = new AudioContext();
const analyser = ctx.createAnalyser();
analyser.fftSize = 512;
const source = ctx.createMediaElementSource(audioElement);
source.connect(analyser);
analyser.connect(ctx.destination);
const dataArray = new Uint8Array(analyser.frequencyBinCount);
// Each frame: analyser.getByteFrequencyData(dataArray)
// amplitude = average of dataArray / 255
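The per-frame reduction described in the comments above can be pinned down with a small sketch. It is written in Python only to keep the arithmetic easy to test; in the browser the same few lines would run in JavaScript on the Uint8Array filled by getByteFrequencyData. The band split points (10% and 50% of the bins) and the function names are assumptions, not values from the TITAN codebase.

```python
def amplitude(bins) -> float:
    """Average of byte-valued FFT bins, normalized to the range 0.0 to 1.0."""
    return sum(bins) / (len(bins) * 255) if bins else 0.0


def band_levels(bins, edges=(0.1, 0.5)):
    """Split bins into (bass, mid, treble) slices and return each slice's amplitude.

    edges are fractions of the bin count; the defaults are illustrative.
    """
    n = len(bins)
    low_end = max(1, int(n * edges[0]))
    mid_end = max(low_end + 1, int(n * edges[1]))
    slices = (bins[:low_end], bins[low_end:mid_end], bins[mid_end:])
    return tuple(amplitude(s) for s in slices)


# fftSize = 512 gives frequencyBinCount = 256 bins per frame.
frame = [255] * 25 + [0] * 231          # energy only in the lowest bins
u_amplitude = amplitude(frame)          # would feed the uAmplitude uniform
bass, mid, treble = band_levels(frame)  # would feed per-band uniforms
```

Keeping the overall level and the band levels separate matches the later design, where ring scale follows the bass band while glow follows overall volume.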
Colors. Base particle color: #00e5ff (TITAN cyan). Peak amplitude: interpolate to #ff6b00 (TITAN orange). Idle: #7b5ea7 (violet, low opacity).
Implementation. Three THREE.TorusGeometry meshes at inclinations of 0°, 35°, and 65° relative to camera. Apply ektogamat/threejs-vanilla-holographic-material for holographic scanline effect. Each ring rotates at a different speed (0.002, 0.004, -0.003 radians/frame) for visual depth.
Data display. Three rings carry data labels using THREE.Sprite billboards:
Labels update every 30 seconds from a /api/health poll.
Audio reactivity. Ring scale pulses with bass frequency band amplitude. Ring glow intensity (material fresnelOpacity) scales with overall volume.
The center is an IcosahedronGeometry with detail: 4 (500 faces in current Three.js releases, where the face count is 20 × (detail + 1)²), not a sphere. This gives a natural faceted JARVIS look. Material: MeshPhongMaterial with emissive: #00e5ff, emissiveIntensity mapped to audio amplitude. A PointLight at center provides the glow that illuminates the orbital rings.
For Tier 3, this can be replaced with a stylized abstract face — a flattened IcosahedronGeometry with displacement mapping that morphs to show "listening" vs. "speaking" states.
| State | Primary | Secondary | Emissive |
|---|---|---|---|
| Idle | #7b5ea7 (violet) | #1a1a2e (deep navy) | #3d2b6b |
| Listening | #00e5ff (cyan) | #003547 | #00b4cc |
| Processing | #ff6b00 (orange) | #3d1a00 | #cc5500 |
| Speaking | #00ff88 (green) | #003d1f | #00cc6a |
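State transitions interpolate all color uniforms over 500ms using THREE.Color.lerp(). THREE.MathUtils.lerp() operates on single numbers, so it would have to be applied per channel.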
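The 500ms color transition amounts to a clamped linear interpolation per RGB channel. The sketch below shows that arithmetic in Python purely for illustration (the real transition would run in the browser on the shader's color uniforms); the helper names are hypothetical.

```python
def hex_to_rgb(value: str):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    value = value.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def lerp_color(start: str, end: str, elapsed_ms: float, duration_ms: float = 500) -> str:
    """Interpolate between two hex colors; progress is clamped to [0, 1]."""
    t = min(max(elapsed_ms / duration_ms, 0.0), 1.0)
    a, b = hex_to_rgb(start), hex_to_rgb(end)
    return "#" + "".join(f"{round(x + (y - x) * t):02x}" for x, y in zip(a, b))


# Idle (violet) to Listening (cyan), using the Primary column of the state table.
start_color = lerp_color("#7b5ea7", "#00e5ff", 0)
end_color = lerp_color("#7b5ea7", "#00e5ff", 500)
```

Clamping means a late frame (elapsed time past 500ms) simply lands on the target color instead of overshooting.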
Target: 60fps on Harnoor's Windows 11 PC (Chrome).
- UnrealBloomPass at resolution scale 0.5 (the quality loss from half-res bloom is barely visible, and it is cheap)
- renderer.setPixelRatio(Math.min(window.devicePixelRatio, 2)): caps at 2x on retina displays
- Mobile/degraded mode: if navigator.hardwareConcurrency < 4, reduce particles to 2,000, disable bloom, use MeshBasicMaterial instead of MeshPhongMaterial. The voice function still works; only the visuals degrade.
requestAnimationFrame loop should check performance.now() and skip a render frame if it would push delta > 33ms (i.e., gracefully hold at 30fps rather than dropping further).
---
USER SPEAKS
│
▼
[Browser Mic]
│ getUserMedia() → MediaStream
▼
[Web Speech API SpeechRecognition]
│ onresult: { transcript, isFinal }
│ Show interim transcript in real-time in UI bar
▼
[Final transcript confirmed]
│
▼
[POST /api/voice {prompt: transcript, session_id: uuid}]
│ HTTP to titan_bridge.py
▼
[titan_bridge.py — /api/voice route]
│
├─ Fast path: is this a "what shipped today" / "quick pulse" tile command?
│ └─ Yes → read from F:/TITAN/state/* directly, no LLM call
│ └─ No → forward to Claude API (existing claude_client logic)
│
│ Receive LLM text response
│
├─ Call AWS Polly / ElevenLabs with response text
│ (Tier 1: Polly Neural Brian, streaming synthesize_speech)
│ (Tier 2: ElevenLabs Daniel Turbo via /v1/text-to-speech/{voice_id}/stream)
│
│ Return JSON {text: "...", audio_url: "/api/voice/audio/<id>"}
▼
[Browser receives response JSON]
│
├─ Display text in transcript bar
│
└─ Fetch audio_url → HTMLAudioElement
│
▼
[Web Audio API]
│ createMediaElementSource(audioElement)
│ → analyser → destination
▼
[AnalyserNode → getByteFrequencyData each frame]
│
▼
[Three.js uniforms updated: uAmplitude, uFreqBands]
│
▼
[Visualizer reacts — particles expand, rings glow, orb pulses]
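The fast-path branch in the flow above can be sketched as a small routing function on the bridge side. This is a minimal illustration, not the actual titan_bridge.py logic: the handler keys ("shipped", "pulse") and the function name are hypothetical, and only the two tile phrases named in the diagram are treated as fast-path commands.

```python
# Tile phrases that can be answered from F:/TITAN/state/ without an LLM call.
# The handler keys on the right are hypothetical placeholders.
FAST_PATH = {
    "what shipped today": "shipped",
    "quick pulse": "pulse",
}


def route(prompt: str):
    """Return ("state", handler_key) for fast-path commands, else ("llm", prompt)."""
    # Normalize case, collapse whitespace, and drop trailing punctuation,
    # since speech recognition output varies in all three.
    key = " ".join(prompt.lower().split()).rstrip(".?!")
    if key in FAST_PATH:
        return ("state", FAST_PATH[key])
    return ("llm", prompt)


decision = route("What shipped today?")
```

Exact matching keeps the fast path predictable: anything that is not a known tile phrase falls through to Claude unchanged.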
For Tier 1 and Tier 2: REST (HTTP POST) is sufficient. The flow is turn-based: user speaks → POST → response audio. WebSocket adds complexity without meaningful benefit at this tier.
For Tier 3 (streaming responses): Add a WebSocket at /ws/voice. The bridge streams text tokens back as they arrive from Claude. The browser builds text progressively in the transcript bar and can begin TTS on the first sentence before the full response arrives.
# Tier 3 WebSocket sketch (titan_bridge.py addition)
# claude_stream() and polly_synthesize() are bridge helpers assumed to exist.
@app.websocket_route("/ws/voice")
async def voice_ws(websocket):
    await websocket.accept()
    while True:
        data = await websocket.receive_json()
        parts = []  # accumulate streamed tokens so the full text can be synthesized
        async for token in claude_stream(data["prompt"]):
            parts.append(token)
            await websocket.send_json({"type": "token", "text": token})
        # After the full response, synthesize audio
        full_text = "".join(parts)
        audio_b64 = await polly_synthesize(full_text)
        await websocket.send_json({"type": "audio", "data": audio_b64})
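Starting TTS on the first sentence requires grouping streamed tokens into sentences as they arrive. The generator below is one possible sketch, assuming that ., ? or ! followed by whitespace marks a sentence boundary; the function name is hypothetical, and abbreviations such as "Dr. Smith" would be split too early by this simple rule.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ? or ! when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; keep the remainder.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()


stream = ["Two asks ", "shipped. ", "One is ", "blocked", "! Next", " move?"]
chunks = list(sentences_from_tokens(stream))
```

Each yielded sentence could be handed to the TTS call immediately, so audio for the first sentence is in flight while later tokens are still arriving.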
For ElevenLabs Tier 2, use chunked streaming with the MediaSource API:
const mediaSource = new MediaSource();
audioElement.src = URL.createObjectURL(mediaSource);
mediaSource.addEventListener('sourceopen', async () => {
  const sourceBuffer = mediaSource.addSourceBuffer('audio/mpeg');
  const response = await fetch('/api/voice/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }), // text: the response string to synthesize
  });
  const reader = response.body.getReader();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // Wait for each append to finish; appendBuffer throws if called while updating
    await new Promise(resolve => {
      sourceBuffer.addEventListener('updateend', resolve, { once: true });
      sourceBuffer.appendBuffer(value);
    });
  }
  mediaSource.endOfStream(); // signal that no more audio is coming
});
This allows the browser to start playing TITAN's voice before the full synthesis is complete, cutting perceived latency by 60–70%.
---
A horizontal strip pinned to the bottom of the /voice cinema view. 7 tiles + 1 mic button.
┌─────────────────────────────────────────────────────────────────┐
│ [talk to me] [what shipped today] [what's blocked] │
│ [next move] [read latest memo] [quick pulse] [ask anything 🎙] │
└─────────────────────────────────────────────────────────────────┘
Each tile is a <button> with:
- background: rgba(0, 229, 255, 0.08); backdrop-filter: blur(8px);
- 1px solid rgba(0, 229, 255, 0.3)
- rgba(0, 229, 255, 0.8), subtle glow

| Tile | Label | Action |
|---|---|---|
| T1 | talk to me | Triggers auto-briefing prompt (see Section 12) |
| T2 | what shipped today | Queries ask ledger for status: "shipped" + today's date; reads aloud |
| T3 | what's blocked | Queries ask ledger for status: "blocked" or priority: "S" items |
| T4 | next move | Asks Claude: "Given current TITAN state, what is the single most important next action?" |
| T5 | read latest memo | Reads the most recently modified file in F:/TITAN/plans/advisors/ |
| T6 | quick pulse | Reads F:/TITAN/state/ summary: queue depth, open asks count, last ship |
| T7 | ask anything | Opens free-form mic; SpeechRecognition listens and sends whatever user says |
Each tile goes through: idle → active (user clicked) → processing → speaking → idle. CSS class changes drive animation. The active tile glows orange during processing, transitions to green during TITAN speaking output.
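The tile lifecycle is a four-state cycle, which can be made explicit as a transition table. The sketch below models it in Python for illustration only (the real tiles would be driven from voice.js); the "tile--<state>" class naming is a hypothetical convention, not one taken from the TITAN stylesheet.

```python
from enum import Enum


class TileState(Enum):
    IDLE = "idle"
    ACTIVE = "active"
    PROCESSING = "processing"
    SPEAKING = "speaking"


# The single allowed forward transition for each state, in the memo's order.
NEXT_STATE = {
    TileState.IDLE: TileState.ACTIVE,
    TileState.ACTIVE: TileState.PROCESSING,
    TileState.PROCESSING: TileState.SPEAKING,
    TileState.SPEAKING: TileState.IDLE,
}


def advance(state: TileState) -> TileState:
    """Move a tile to the next state in its lifecycle."""
    return NEXT_STATE[state]


def css_class(state: TileState) -> str:
    """CSS class that drives the tile animation (hypothetical naming)."""
    return f"tile--{state.value}"


state = TileState.IDLE
for _ in range(4):          # one full cycle returns the tile to idle
    state = advance(state)
```

A table with exactly one successor per state makes illegal jumps, such as idle straight to speaking, impossible to express.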
- T: toggle tiles visibility (for distraction-free mode)
- Space: toggle mic (equivalent to T7)
- Escape: stop current speech playback
- 1–6: trigger tiles T1–T6 directly
---
Rather than treating /voice as a page inside the TITAN dashboard, it should launch as a full-viewport takeover — the same philosophy as the existing cinema-mode FAB pattern. When the user navigates to /voice, the Three.js canvas fills the entire screen. The TITAN dashboard disappears. The only UI is:
This matches Harnoor's request for a JARVIS feel. Tony Stark doesn't interact with JARVIS through a sub-panel — it's the entire room.
Add a /voice GET route to titan_bridge.py that serves the JARVIS HTML page:
elif path == "/voice":
    return self._serve_static("jarvis/index.html")
The JARVIS page is self-contained in F:/TITAN/static/jarvis/. It loads Three.js from CDN (or bundled), the command tiles, and connects to /api/voice POST.
From the main TITAN dashboard, add a "JARVIS" button to the navigation strip. Clicking it navigates to /voice in the same tab. The /voice page has an × or minimize button that returns to /. No iframe — standalone page.
---
JARVIS is: calm, precise, warm without being subservient, occasionally dry-witted, always useful. He never rambles. He leads with the answer. He uses Harnoor's lion-tiger frame — sovereign language, not corporate language.
Core traits:
AWS Polly — Recommended for Tier 1:
Brian (en-GB, Neural and Generative) — British English male. Brian is available as both a Neural TTS voice and as a new Generative voice (launched March 2026 with Bidirectional Streaming expansion). The Generative version is described as "emotionally engaged, assertive, and highly colloquial" — closest to JARVIS. Voice ID: Brian. Engine: neural for cost efficiency, generative for max quality (available in us-east-1, eu-west-2, us-west-2, ca-central-1, ap-southeast-1, eu-central-1).
Arthur (en-GB, Neural only) — also British male, modeled on the US English Matthew voice with British vocal characteristics transferred via deep learning. Slightly more formal than Brian. Voice ID: Arthur. Engine: neural.
Recommendation: Brian with generative engine if account is in supported region; fall back to neural if not. Brian sounds more natural and conversational — JARVIS quality.
ElevenLabs — Recommended for Tier 2:
Daniel (voice ID: onwK4e9ZLuTAKqWW03F9) — British male, "Steady Broadcaster" style. Deep, authoritative voice. Described as a British News Presenter. Available on all plans including free tier. This is the closest existing ElevenLabs voice to JARVIS.
Antoni (voice ID: ErXwobaYiN019PkySvjV) — American male, general-purpose. 191 WPM, 125Hz pitch. Well-rounded narrator. Not British — less JARVIS-like but highly clear. Better choice if Daniel sounds too BBC-anchor in context.
Recommendation: Daniel for JARVIS aesthetic. Test both in context — Daniel can sound stiff at slower speaking rates; reduce stability to 0.4 and similarity_boost to 0.75 in ElevenLabs settings for more natural variation.
ElevenLabs API settings for JARVIS persona:
{
  "voice_id": "onwK4e9ZLuTAKqWW03F9",
  "model_id": "eleven_turbo_v2_5",
  "voice_settings": {
    "stability": 0.40,
    "similarity_boost": 0.75,
    "style": 0.35,
    "use_speaker_boost": true
  }
}
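On the bridge side, these settings map onto a request to the streaming endpoint, with the voice ID in the URL path and the rest in the JSON body. The sketch below only builds the request object and does not send it; the function name is hypothetical, and it assumes the public api.elevenlabs.io base URL with the xi-api-key authentication header.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io"


def build_tts_request(text: str, api_key: str,
                      voice_id: str = "onwK4e9ZLuTAKqWW03F9",
                      model_id: str = "eleven_turbo_v2_5") -> urllib.request.Request:
    """Build (but do not send) a streaming TTS request with the JARVIS settings."""
    body = {
        "text": text,
        "model_id": model_id,
        "voice_settings": {
            "stability": 0.40,
            "similarity_boost": 0.75,
            "style": 0.35,
            "use_speaker_boost": True,
        },
    }
    return urllib.request.Request(
        f"{API_BASE}/v1/text-to-speech/{voice_id}/stream",
        data=json.dumps(body).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_tts_request("Good evening.", api_key="test-key")
```

Passing the result to `urllib.request.urlopen` would return the audio stream, which the bridge could then relay to the browser in chunks. The API key stays on the server throughout.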
When routing voice requests through Claude, prepend this persona context:
You are JARVIS, TITAN's voice interface. You speak with calm British authority — precise, brief, occasionally dry-humored. Never use bullet points or markdown in voice responses; speak in natural sentences only. Lead with the answer. Maximum 3 sentences per response unless more detail is explicitly requested. Address Harnoor directly but without flattery. You operate TITAN, which is Harnoor's AI operating system. When referencing data, be specific — use numbers and dates.
---
getUserMedia — the browser requests microphone access via the native permission dialog. Audio is only captured when the user explicitly activates the mic (click on mic button or press Space). No background listening except in Tier 3 wake-word mode (which requires explicit opt-in).
Hard mute. The mute button calls mediaStream.getTracks().forEach(t => t.stop()), which ends the capture tracks and releases the browser's hold on the microphone rather than just setting a JavaScript mute flag. The mic indicator light on the device goes off. This is not a UI-only hide.
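function hardMute() {
  // Stop recognition first and unconditionally: SpeechRecognition holds its own
  // mic capture, separate from any getUserMedia stream.
  recognition.abort();
  if (activeStream) {
    // Ending every track releases the device, so the mic indicator turns off
    activeStream.getTracks().forEach(track => track.stop());
    activeStream = null;
  }
}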
- /api/voice POST) → Claude API. Text only, never audio.

All API keys (AWS, ElevenLabs, OpenAI) are held server-side in F:/TITAN/state/ or environment variables. The browser never receives or sends API credentials. The bridge handles all external API calls. The /api/voice endpoint is protected by the existing bridge bearer token (R0220, titan_token cookie).
- titan_bridge.py)
- harnoor-asks.jsonl unless the user says "log this" or "save this ask"

If "hey TITAN" wake word is added in Tier 3, this requires continuous mic access. This should be:
- continuous: true mode, not a background service worker
- /voice page is not the active tab
---
Goal. A working voice conversation with TITAN. No 3D yet. Prove the loop end-to-end.
Components:
- GET /voice in titan_bridge.py → serves F:/TITAN/static/jarvis/index.html
- POST /api/voice in titan_bridge.py → accepts {prompt}, calls Claude, calls Polly Brian Neural, returns {text, audio_url}
- GET /api/voice/audio/<id> → serves the synthesized audio file
- F:/TITAN/static/jarvis/index.html: dark full-viewport page
- F:/TITAN/static/jarvis/voice.js: SpeechRecognition setup, tile click handlers, audio playback
- F:/TITAN/static/jarvis/style.css: TITAN dark aesthetic, glassmorphic tiles, transcript bar
- F:/TITAN/state/voice-config.json: TTS voice ID, model, tile labels, persona system prompt

What works at end of Tier 1:
- /voice in browser

Canvas at Tier 1. Simple CSS animated background: radial gradient pulsing via CSS keyframes, no WebGL. Particle system is deferred.
---
Goal. Add Three.js visualizer + ElevenLabs voice + cinema fullscreen aesthetic.
New work:
- F:/TITAN/static/jarvis/scene.js: Three.js scene with particle field, 3 orbital rings, central orb
- /api/voice to call ElevenLabs API instead of Polly (or make it configurable in voice-config.json)
- /voice page is the full viewport

Dependency additions:
npm install three @ektogamat/threejs-holographic-material
Three.js can also be loaded from CDN: https://cdn.jsdelivr.net/npm/three@0.165.0/build/three.module.js
Performance target: 60fps on desktop, 30fps mobile graceful degradation.
---
Goal. Streaming WebSocket, abstract 3D face, wake word, multi-modal.
New work:
- /ws/voice in titan_bridge.py: streaming tokens from Claude back to browser as they arrive
- ., ?, !), don't wait for full response
- THREE.AnimationMixer or manual morph targets
- SpeechRecognition with continuous: true + keyword filter ("hey TITAN") → triggers listening mode
- /api/voice POST can optionally include a screenshot of the current page (navigator.mediaDevices.getDisplayMedia) so TITAN can comment on what's on screen
---
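Assumptions: 30 minutes/day active voice interaction. 50% user speaking, 50% TITAN responding. Average speaking pace: ~150 words/minute, ~750 characters/minute. The TTS table below uses the more conservative 900,000 chars/month figure (30 min/day of TITAN output at ~1,000 chars/min) from the ElevenLabs section, so its totals are upper bounds relative to these assumptions.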
| Option | Monthly Cost at 30 min/day |
|---|---|
| Web Speech API | $0 (free) |
| OpenAI Realtime (input) | 15 min × $0.06/min × 30 days = $27/month |
| Whisper.cpp WASM | $0 (free, runs in browser) |
Recommendation: Web Speech API for Tier 1–2. Save $27/month.
| Option | Monthly Cost at 30 min/day | Notes |
|---|---|---|
| AWS Polly Standard | ~$3.60/month | $4/1M chars × 0.9M chars/month. Robotic quality. |
| AWS Polly Neural (Brian) | ~$14.40/month | $16/1M chars. Good quality, natural. |
| AWS Polly Generative (Brian) | ~$27/month | $30/1M chars. Best Polly quality. |
| ElevenLabs Daniel (Creator plan) | $22/month flat | Covers 100K chars (~100 min/month) — fine for ~3 min/day voice; overage at 30 min/day |
| ElevenLabs Daniel (Pro plan) | $99/month flat | Covers 500K chars (~500 min/month), i.e. ~15 min/day of output; 30 min/day (900K chars) exceeds it. |
| ElevenLabs Daniel (pay-as-you-go) | ~$54/month | $0.06/1K chars × 900K chars. |
| OpenAI Realtime (output, bundled) | $108/month | 15 min/day × $0.24/min × 30 days. Includes LLM + TTS. |
Recommendation:
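The voice prompt goes through the existing Claude API call in TITAN. Voice queries are typically short (1–3 sentences prompt, 2–5 sentences response), roughly 200 tokens in and 200 tokens out per turn. At 10 turns/day that is 4,000 tokens/day × 30 = 120,000 tokens/month, split evenly into 60,000 input and 60,000 output tokens. At Claude Sonnet pricing ($3/1M input + $15/1M output): ~$0.18 + $0.90 = ~$1.08/month marginal LLM cost for voice. The $1.26 figure used in the summary tables prices all 120,000 tokens at the input rate ($0.36) before adding the output cost, so it is a slightly conservative upper bound.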
| Tier | STT | TTS | LLM | Total |
|---|---|---|---|---|
| Tier 1 MVP | $0 (Web Speech) | $0–$14.40 (Polly free tier / Neural) | $1.26 | $0–$16/month |
| Tier 2 JARVIS | $0 (Web Speech) | $20–$22 (ElevenLabs) | $1.26 | $21–$23/month |
| Tier 3 Full | $0 (Web Speech) | $27 (Polly Gen) or $20 (ElevenLabs) | $1.26 | $22–$28/month |
| Tier 3 Premium | OpenAI Realtime (bundled) | Included | Included | ~$135/month |
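The marginal LLM estimate is token volume times price per million tokens, with input and output priced separately. The sketch below uses the memo's assumptions (10 turns/day, roughly 200 tokens in and 200 tokens out per turn, Sonnet at $3 and $15 per 1M tokens). llm_monthly_cost is an illustrative name, and the persona prompt and any state-file context would add input tokens that are not counted here.

```python
def llm_monthly_cost(
    turns_per_day: int,
    tokens_in_per_turn: int,
    tokens_out_per_turn: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
    days: int = 30,
) -> float:
    """Monthly LLM cost in USD, pricing input and output tokens separately."""
    input_tokens = turns_per_day * tokens_in_per_turn * days
    output_tokens = turns_per_day * tokens_out_per_turn * days
    cost = (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000
    return round(cost, 2)


if __name__ == "__main__":
    # Assumption: the ~200 tokens per turn apply to input and output each.
    baseline = llm_monthly_cost(10, 200, 200, 3.0, 15.0)
    heavier = llm_monthly_cost(20, 200, 200, 3.0, 15.0)
    print(f"10 turns/day: ${baseline:.2f}/month")
    print(f"20 turns/day: ${heavier:.2f}/month")
```

Even at double the turn count the LLM line stays small next to the TTS line, which is why the tier totals are driven almost entirely by the TTS choice.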
---
| | Web Speech API | OpenAI Realtime | ElevenLabs | Polly Neural | Whisper WASM |
|---|---|---|---|---|---|
| Latency | 200–400ms | 500–800ms | 75–300ms TTFA | 200–400ms (streaming) | 5–15s lag |
| Quality | Good (Google STT) | Excellent | Excellent | Good | Good |
| Monthly Cost | $0 | $108–$135 | $20–$99 | $0–$27 | $0 |
| Setup | 3 lines JS | Medium | Low | Already wired | High |
| Safari | Partial | Full | Full | Full | Partial |
| Chrome | Full | Full | Full | Full | Full |
| Privacy | Google STT servers | OpenAI servers | ElevenLabs via bridge | AWS via bridge | Fully local |
| Best for | Tier 1–2 STT | Tier 3 unified | Tier 2–3 TTS | Tier 1 TTS | Tier 3 privacy mode |
Recommended combos by tier:
| Tier | STT | TTS | Animation | Est. Cost |
|---|---|---|---|---|
| Tier 1 (MVP) | Web Speech API | Polly Brian Neural | CSS gradient pulse | $0–$14/mo |
| Tier 2 (JARVIS look) | Web Speech API | ElevenLabs Daniel Turbo | Three.js particle + rings | $21–$23/mo |
| Tier 3 (Full JARVIS) | Web Speech API (or OpenAI) | ElevenLabs Daniel Turbo | Three.js face + orb + rings | $22–$28/mo |
| Tier 3 Premium | OpenAI Realtime (all-in) | OpenAI Realtime (all-in) | Three.js face + orb + rings | ~$135/mo |
---
When the user clicks "talk to me" or says those words:
You are JARVIS delivering a morning briefing to Harnoor. Speak in natural sentences only — no bullet points, no headers, no markdown. Be brief and sovereign. The structure is: (1) opening frame line, (2) current numbers, (3) top 3 priorities, (4) open question. Maximum 90 seconds of speaking time (~225 words). Be specific with every number. End with an open mic invitation.
Before calling Claude, the bridge reads:
- F:/TITAN/state/harnoor-asks.jsonl → count open asks, count blocked, last reply timestamp
- F:/TITAN/state/inbox-queue.jsonl → queue depth
- F:/TITAN/plans/advisors/ → most recently modified memo filename
- F:/TITAN/state/voice-config.json → today's focus area (if set)
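The state reads above could be gathered into one context dictionary before the Claude call. This is a sketch under stated assumptions: the JSONL field names status and replied_at are hypothetical (the memo does not specify the ledger schema), memos are assumed to be .md files, and build_briefing_context is an illustrative name.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSON Lines file, skipping blank lines; missing file -> empty list."""
    if not path.exists():
        return []
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            rows.append(json.loads(line))
    return rows


def build_briefing_context(root: Path) -> dict:
    """Collect the numbers the briefing prompt needs from the state files."""
    asks = read_jsonl(root / "state" / "harnoor-asks.jsonl")
    queue = read_jsonl(root / "state" / "inbox-queue.jsonl")

    # Assumption: memos are markdown files; newest by modification time wins.
    memos = sorted((root / "plans" / "advisors").glob("*.md"),
                   key=lambda p: p.stat().st_mtime)

    config_path = root / "state" / "voice-config.json"
    config = json.loads(config_path.read_text(encoding="utf-8")) if config_path.exists() else {}

    # Hypothetical schema: each ask has "status" and optionally "replied_at" (ISO 8601).
    reply_times = [a["replied_at"] for a in asks if a.get("replied_at")]
    return {
        "open_asks": sum(1 for a in asks if a.get("status") == "open"),
        "blocked_asks": sum(1 for a in asks if a.get("status") == "blocked"),
        "last_reply": max(reply_times, default=None),
        "queue_depth": len(queue),
        "latest_memo": memos[-1].name if memos else None,
        "today_priorities": config.get("today_priorities", []),
    }
```

In the bridge this would be called as build_briefing_context(Path("F:/TITAN")) and the result formatted into the user message that accompanies the briefing system prompt.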
Opening frame (rotate from pool):
- "The systems are running. Let's take a look at where things stand."
- "All channels are active. Here's your current picture."
- "You asked for a briefing. Here's the intelligence."
- "Running the state of play. A few things worth your attention."
Numbers:
- "[N] open asks, [M] of which are marked S-priority."
- "The queue has [Q] items waiting. Last reply came through [X] minutes ago."
Priorities:
- "Your top three for today: [item 1], [item 2], and [item 3]."
(Read from voice-config.json today_priorities, or derive from ask ledger)
Open question:
- "What would you like to dig into first?"
> "The systems are running. Let's take a look at where things stand. You have 9 open asks. Four are S-priority. The queue has 2 items pending. Last reply came through 6 minutes ago. Your top three for today: the JARVIS build, the Silent Infinity R0211 blocker, and the Polly streaming integration. The latest memo is the client-side STT research brief from this morning. What would you like to dig into first?"
Duration: approximately 25 seconds at a natural speaking pace. Concise, specific, sovereign.
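The template and the duration estimate can be checked mechanically. In the spec Claude writes the briefing, so the local filler below is only a hypothetical stand-in (useful as an offline fallback or a test fixture). The duration check assumes the ~150 words/minute pace used elsewhere in this memo. All function names are illustrative.

```python
import random

OPENING_FRAMES = [
    "The systems are running. Let's take a look at where things stand.",
    "All channels are active. Here's your current picture.",
    "You asked for a briefing. Here's the intelligence.",
    "Running the state of play. A few things worth your attention.",
]


def build_briefing(open_asks, s_priority, queue_depth, minutes_since_reply,
                   priorities, rng=random):
    """Fill the four-part briefing template with concrete numbers."""
    if len(priorities) < 3:
        raise ValueError("need at least three priorities")
    first, second, third = priorities[:3]
    parts = [
        rng.choice(OPENING_FRAMES),  # rotate the opening frame from the pool
        f"{open_asks} open asks, {s_priority} of which are marked S-priority.",
        f"The queue has {queue_depth} items waiting. "
        f"Last reply came through {minutes_since_reply} minutes ago.",
        f"Your top three for today: {first}, {second}, and {third}.",
        "What would you like to dig into first?",
    ]
    return " ".join(parts)


def estimate_seconds(text: str, words_per_minute: float = 150) -> float:
    """Estimate speaking time from the whitespace-separated word count."""
    return len(text.split()) / words_per_minute * 60


def fits_budget(text: str, max_seconds: float = 90) -> bool:
    """True if the text stays within the briefing's speaking-time cap."""
    return estimate_seconds(text) <= max_seconds


if __name__ == "__main__":
    script = build_briefing(9, 4, 2, 6,
                            ["the JARVIS build", "the Silent Infinity R0211 blocker",
                             "the Polly streaming integration"])
    print(script)
    print(f"~{estimate_seconds(script):.0f}s, within budget: {fits_budget(script)}")
```

The same fits_budget check could be run on Claude's response before synthesis to catch a briefing that overruns the 90-second cap.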
---
Create F:/TITAN/state/voice-config.json:
{
  "tts_provider": "polly",
  "tts_voice_id": "Brian",
  "tts_engine": "neural",
  "elevenlabs_voice_id": "onwK4e9ZLuTAKqWW03F9",
  "elevenlabs_model_id": "eleven_turbo_v2_5",
  "elevenlabs_voice_settings": {
    "stability": 0.40,
    "similarity_boost": 0.75,
    "style": 0.35
  },
  "jarvis_persona_prompt": "You are JARVIS, TITAN's voice interface. Speak with calm British authority — precise, brief, occasionally dry-humored. Natural sentences only, no markdown. Lead with the answer. Maximum 3 sentences per response unless detail is explicitly requested.",
  "command_tiles": [
    {"id": "talk_to_me", "label": "talk to me", "prompt_template": "briefing"},
    {"id": "shipped_today", "label": "what shipped today", "prompt_template": "shipped_today"},
    {"id": "whats_blocked", "label": "what's blocked", "prompt_template": "blocked"},
    {"id": "next_move", "label": "next move", "prompt_template": "next_move"},
    {"id": "latest_memo", "label": "latest memo", "prompt_template": "latest_memo"},
    {"id": "quick_pulse", "label": "quick pulse", "prompt_template": "quick_pulse"}
  ],
  "today_priorities": []
}
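To show how the ElevenLabs fields in this config might map onto a synthesis call, the sketch below builds (but does not send) an HTTP request. It assumes the public ElevenLabs REST endpoint /v1/text-to-speech/{voice_id} with an xi-api-key header, which should be checked against the current ElevenLabs API reference before use. build_elevenlabs_request is a hypothetical helper, not an existing bridge function.

```python
import json
import urllib.request

# Assumption: ElevenLabs text-to-speech REST endpoint; verify against current docs.
ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def build_elevenlabs_request(text: str, config: dict, api_key: str) -> urllib.request.Request:
    """Build a POST request for ElevenLabs TTS from voice-config.json fields."""
    url = ELEVENLABS_TTS_URL.format(voice_id=config["elevenlabs_voice_id"])
    payload = {
        "text": text,
        "model_id": config["elevenlabs_model_id"],
        "voice_settings": config["elevenlabs_voice_settings"],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "xi-api-key": api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
        method="POST",
    )


if __name__ == "__main__":
    example_config = {
        "elevenlabs_voice_id": "onwK4e9ZLuTAKqWW03F9",
        "elevenlabs_model_id": "eleven_turbo_v2_5",
        "elevenlabs_voice_settings": {"stability": 0.40, "similarity_boost": 0.75, "style": 0.35},
    }
    request = build_elevenlabs_request("All channels are active.", example_config, "YOUR_API_KEY")
    print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen would return the MP3 bytes on success. The API key should come from the bridge's environment rather than from voice-config.json.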
Add to titan_bridge.py after the existing route handlers:
# ── Voice / JARVIS routes ──────────────────────────────────────────
elif path == "/voice":
    return self._serve_static_file("jarvis/index.html", "text/html")
elif path == "/api/voice" and method == "POST":
    return self._handle_voice_post(body)
elif path.startswith("/api/voice/audio/") and method == "GET":
    audio_id = path.split("/")[-1]
    return self._serve_voice_audio(audio_id)
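Because audio_id comes straight from the URL path, the audio handler should probably confirm it is a well-formed UUID before touching the filesystem. The sketch below is one way to do that. resolve_audio_path is a hypothetical helper, and the voice-audio directory is the one named in the handler outline in this section.

```python
import uuid
from pathlib import Path
from typing import Optional

AUDIO_DIR = Path("F:/TITAN/state/voice-audio")


def resolve_audio_path(audio_id: str, audio_dir: Path = AUDIO_DIR) -> Optional[Path]:
    """Return the mp3 path for a canonical UUID audio_id, otherwise None."""
    try:
        parsed = uuid.UUID(audio_id)
    except ValueError:
        # Not a UUID at all (e.g. "../secret"): refuse to build a path.
        return None
    # uuid.UUID also accepts braces, URN prefixes and bare hex; only allow
    # the canonical hyphenated form that the bridge itself generates.
    if str(parsed) != audio_id.lower():
        return None
    return audio_dir / f"{parsed}.mp3"


if __name__ == "__main__":
    print(resolve_audio_path(str(uuid.uuid4())))
    print(resolve_audio_path("../../state/voice-config.json"))
```

A None result would map to a 404 response, so a crafted path can never escape the audio directory.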
Key functions to add:
def _handle_voice_post(self, body: dict) -> dict:
    """
    Accepts {prompt, session_id, tile_id?}
    Returns {text, audio_url, duration_ms}
    """
    # 1. Resolve prompt (tile fast-path or freeform)
    # 2. Build context from state files if briefing tile
    # 3. Call Claude API with JARVIS persona prepended
    # 4. Call _synthesize_voice(text) -> audio_path
    # 5. Store audio at F:/TITAN/state/voice-audio/<uuid>.mp3
    # 6. Schedule deletion after 60s
    # 7. Return {text, audio_url: /api/voice/audio/<uuid>, duration_ms}
    raise NotImplementedError("outline only; steps 1-7 still to be written")

def _synthesize_voice(self, text: str) -> Path:
    """
    Routes to Polly or ElevenLabs based on voice-config.json.
    Returns path to synthesized audio file.
    """
    # json, pathlib.Path and _TITAN_ROOT are assumed to exist at module level.
    config = json.loads((_TITAN_ROOT / "state" / "voice-config.json").read_text())
    provider = config["tts_provider"]
    if provider == "polly":
        return self._polly_synthesize(text, config)
    if provider == "elevenlabs":
        return self._elevenlabs_synthesize(text, config)
    # Fail loudly instead of silently returning None for an unknown provider.
    raise ValueError(f"Unsupported tts_provider: {provider!r}")
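Steps 5 to 7 of the handler outline (store the audio under a UUID, schedule deletion, return the URL) could look roughly like this. store_audio is a hypothetical helper. It uses threading.Timer for the 60-second cleanup, which is an assumption about how the bridge schedules work, and timers do not survive a bridge restart, so a startup sweep of the directory would still be needed.

```python
import threading
import uuid
from pathlib import Path


def store_audio(audio_bytes: bytes, audio_dir: Path, ttl_seconds: float = 60.0) -> dict:
    """Write audio to <audio_dir>/<uuid>.mp3 and delete it after ttl_seconds."""
    audio_dir.mkdir(parents=True, exist_ok=True)
    audio_id = str(uuid.uuid4())
    audio_path = audio_dir / f"{audio_id}.mp3"
    audio_path.write_bytes(audio_bytes)

    # Daemon timer: removes the file later without blocking interpreter exit.
    timer = threading.Timer(ttl_seconds, lambda: audio_path.unlink(missing_ok=True))
    timer.daemon = True
    timer.start()

    return {
        "audio_id": audio_id,
        "audio_path": audio_path,
        "audio_url": f"/api/voice/audio/{audio_id}",
    }


if __name__ == "__main__":
    import tempfile
    import time

    result = store_audio(b"not-really-mp3", Path(tempfile.mkdtemp()), ttl_seconds=0.5)
    print(result["audio_url"], result["audio_path"].exists())
    time.sleep(1.0)
    print("after TTL:", result["audio_path"].exists())
```

The short lifetime keeps synthesized speech from accumulating on disk while still giving the browser time to fetch and play it.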
F:/TITAN/static/jarvis/
├── index.html          # Full-viewport JARVIS page, loads Three.js + voice.js
├── voice.js            # SpeechRecognition, tile handlers, fetch /api/voice, audio playback
├── scene.js            # Three.js scene: particles, rings, orb (Tier 2)
├── audio-bridge.js     # Web Audio API AnalyserNode → exports getAmplitude()
├── style.css           # Dark theme, glassmorphic tiles, transcript bar, state colors
└── lib/
    └── three.module.js # Three.js bundled (or CDN link in index.html)
index.html structure (Tier 1):
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>TITAN — JARVIS</title>
  <link rel="stylesheet" href="style.css">
</head>
<body class="jarvis-root">
  <canvas id="jarvis-canvas"></canvas>  <!-- Three.js target (Tier 2+) -->
  <div id="transcript-bar"></div>       <!-- Scrolling transcript -->
  <div id="tile-strip">                 <!-- Command tiles -->
    <!-- tiles injected by voice.js from voice-config.json -->
  </div>
  <button id="mic-btn" class="tile tile--mic">ask anything</button>
  <button id="close-btn" onclick="window.location='/'">×</button>
  <script type="module" src="voice.js"></script>
</body>
</html>
// scene.js: key exports
export function initScene(canvas) { /* ... */ }

export function setAmplitude(value) {
  // Updates uAmplitude uniform on particle system
  // Updates emissiveIntensity on central orb
  // Updates ring glow
}

export function setState(state) {
  // state: 'idle' | 'listening' | 'processing' | 'speaking'
  // Lerps color uniforms to state palette
}
voice.js calls setState('listening') when mic activates, setState('processing') while waiting for API, setState('speaking') while audio plays, setState('idle') when audio ends.
titan_bridge.py reads voice-config.json on every /api/voice call (no restart needed). Harnoor can edit the JSON to swap voice IDs, change persona, or reorder tiles without restarting the bridge.
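Reading the file on every request is what makes edits take effect without a restart. It also means a typo in the JSON surfaces on the next voice call, so a light validation step helps. The sketch below assumes the key names from the voice-config.json shown in this section. load_voice_config and the specific checks are illustrative, not existing bridge code.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("tts_provider", "tts_voice_id", "command_tiles")
KNOWN_PROVIDERS = ("polly", "elevenlabs")


def load_voice_config(path: Path) -> dict:
    """Load and sanity-check voice-config.json; called once per /api/voice request."""
    config = json.loads(path.read_text(encoding="utf-8"))

    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"voice-config.json is missing keys: {', '.join(missing)}")

    if config["tts_provider"] not in KNOWN_PROVIDERS:
        raise ValueError(f"Unsupported tts_provider: {config['tts_provider']!r}")

    return config


if __name__ == "__main__":
    import tempfile

    config_path = Path(tempfile.mkdtemp()) / "voice-config.json"
    base = {"tts_provider": "polly", "tts_voice_id": "Brian", "command_tiles": []}

    config_path.write_text(json.dumps(base), encoding="utf-8")
    print(load_voice_config(config_path)["tts_provider"])

    # Editing the file changes the next call's result; no restart involved.
    config_path.write_text(json.dumps({**base, "tts_provider": "elevenlabs"}), encoding="utf-8")
    print(load_voice_config(config_path)["tts_provider"])
```

At voice-query volumes the per-request file read is negligible. A cache keyed on the file's modification time would be the next step if it ever showed up in profiling.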
| Point | File | Status |
|---|---|---|
| /voice GET route | titan_bridge.py | To add |
| /api/voice POST route | titan_bridge.py | To add |
| /api/voice/audio/<id> GET | titan_bridge.py | To add |
| Polly synthesis function | titan_bridge.py | Extend existing Polly usage |
| ElevenLabs synthesis function | titan_bridge.py | New (Tier 2) |
| Voice config JSON | F:/TITAN/state/voice-config.json | New |
| JARVIS static files | F:/TITAN/static/jarvis/ | New directory |
| Three.js scene | F:/TITAN/static/jarvis/scene.js | New (Tier 2) |
| Bridge auth check | Existing _check_auth() in bridge | Already covers new routes |
---
1. OpenAI API Pricing — Realtime API audio pricing: $100/1M input tokens, $200/1M output tokens (~$0.06/min input, $0.24/min output)
2. Introducing gpt-realtime — OpenAI — Realtime API production announcement
3. ElevenLabs Pricing 2026 — Flexprice — Plan tiers, credits, overage rates
4. ElevenLabs Pricing — pxlpeak — Agent per-minute pricing confirmed
5. ElevenLabs Pricing Official — Canonical plan page
6. Amazon Polly Pricing — Neural $16/1M, Generative $30/1M chars
7. Amazon Polly Bidirectional Streaming — AWS Blog — 39% latency improvement announcement
8. Amazon Polly Expands Generative TTS — March 2026 — New voices + Bidirectional Streaming GA
9. Amazon Polly Neural Voices — AWS Docs — Brian, Arthur voice specs
10. Web Speech API — MDN — Official spec and usage
11. Speech Recognition API — Can I Use — Browser compatibility table
12. Whisper.cpp stream.wasm — ggml.io — Real-time WASM STT demo
13. whisper.cpp GitHub — ggml-org — Source, WASM examples, MIT license
14. Anthropic Claude Voice Mode — TechCrunch — Voice mode is Claude Code only, not a public API (March 2026)
15. harsh-raj00/my-jarvis — GitHub — MIT, React + Three.js JARVIS implementation
16. tgcnzn/Interactive-Particles-Music-Visualizer — GitHub — MIT, audio-reactive Three.js particles
17. Codrops — Audio-Reactive Particles Three.js — Curl noise technique tutorial
18. Codrops — 3D Audio Visualizer Three.js GSAP — 2025 update
19. dcyoung/r3f-audio-visualizer — GitHub — React Three Fiber audio visualizer
20. ektogamat/threejs-vanilla-holographic-material — GitHub — MIT holographic shader
21. ektogamat/threejs-holographic-material — GitHub — MIT React Three Fiber version
22. Humprt/particula — GitHub — Multi-sphere frequency visualizer
23. Three.js Audio Reactive Particles Demo — Official Three.js demo
24. ElevenLabs Daniel voice — json2video — Daniel voice ID confirmation: onwK4e9ZLuTAKqWW03F9
25. ElevenLabs Antoni voice — json2video — Antoni voice ID confirmation: ErXwobaYiN019PkySvjV
26. MDN — Visualizations with Web Audio API — AnalyserNode amplitude visualization
27. steffenpharai/Jarvis — GitHub — Offline Iron Man JARVIS reference architecture
28. Three.js Journey — Hologram Shader — Scanline HUD shader technique
---
SCOUT A085 | Generated 2026-04-27 | F:/TITAN/plans/advisors/JARVIS-VOICE-CHAT-SPEC-2026-04-27.md