Client-Side STT — Research Brief

Ask ID: A013

Declared by: Harnoor

Declared at: 2026-04-25T13:40 UTC

Priority: S-tier

Status: QUEUED — research begins after current ship queue clears

Bookmark: YES (Harnoor explicitly asked for visible tracking)

---

Harnoor's exact words (2026-04-25)

> "android an iPhone have the features where people can transcribe locally try to use that intelligently so we don't have to process the text on the server side. People could send text from their side right so can the chat be integrated in a way where when they speak it uses their native CPU on their iPhone or whatever software and then the ability to transfer on their side and then send it over that can save us some time some CPU cost. I don't know if it's possible to intelligently included into the chat somehow, but you had that feature regardless I would love to see some innovation where you can actually include that to save us 50% of the cost and it will also reduce latency since it's being processed locally thank you so much so yeah this is something that you need to research the pros and cons of look at Claude code delete code as well as what's in the interface and see what they're doing while they're advocate thank you."

What Harnoor is actually asking

Use on-device speech-to-text (Web Speech API in browsers, native dictation on mobile) so the user's voice is transcribed on their own device, not by a Lambda → Deepgram/Whisper round-trip.

Two things get better:

1. Cost: every voice turn currently spends Deepgram or Whisper minutes. Harnoor's target is "50% of the cost." That is plausible only if voice STT is a meaningful share of total spend, which the cost model in the research scope has to confirm.

2. Latency: removes the upload + STT round-trip, so text can appear in the compose box as soon as the user finishes speaking.

The constraint: must not regress emotion-aware experiences (Hume EVI prosody, voice tone, frustration detection). That work currently leans on server-side voice.

---

What needs to be answered (research scope)

1. Capability survey (must research)

Web Speech API support matrix in 2026: Chrome (Android, desktop), Safari (iOS 14.5+, macOS), Firefox, Edge, Samsung Internet, Brave
iOS Safari specifically — does it use Apple's on-device dictation or send audio to Apple servers?
Android Chrome specifically — does it use Google's on-device or cloud STT?
Accuracy benchmarks for emotional / mid-sentence-pause speech (not commands)
Which languages are supported on-device vs cloud-fallback
Privacy implications: when does voice leave the device?

2. Competitive scan (Harnoor explicitly asked)

Claude Code — does its CLI use any client-side STT? (Almost certainly not — CLI is text-only, but check.)
Claude.ai desktop + mobile — does Anthropic use Web Speech API for the voice mode? What does the network tab show?
ChatGPT mobile + web — what's their voice path? (They use Whisper server-side as of 2025; check 2026.)
Pi.ai — voice-first product; how do they balance client/server?
Hume EVI playground — they're prosody-heavy; would they sacrifice emotion-data for client STT?
Replika, Character.AI, Poe — what's the dominant pattern in 2026?
Apple Intelligence dictation — can a web app invoke it explicitly?
Android live captions — can a web app invoke them?

3. Implementation paths (must compare)

Pure Web Speech API (SpeechRecognition) — interim + final results, single-shot
Hybrid: Web Speech API for transcription, server-side for emotion analysis on raw audio in parallel (best of both)
WebAssembly Whisper (small model; the ~100ms latency and ~80MB download figures are estimates to verify): runs client-side, so no audio leaves the device, at the price of the model download
Capability detection + graceful fallback to server STT when unsupported
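
As a starting point for comparing these paths, the capability check and fallback decision could look roughly like the sketch below. It only tests for the standard `SpeechRecognition` constructor and its `webkitSpeechRecognition` prefixed form. Every other name here (`SttPath`, `SttDecision`, `detectSttPath`, `needsEmotionAnalysis`, `uploadAudioForEmotion`) is hypothetical and not an existing Innerverse API. Finding the constructor does not prove recognition runs on-device; that remains an open question for the capability survey.

```typescript
// Hypothetical sketch: all type and function names below are illustrative only.

type SttPath = "client-web-speech" | "server-stt";

interface SttDecision {
  path: SttPath;
  // True when transcription is client-side but raw audio should still be
  // uploaded in parallel for emotion analysis (the hybrid path).
  uploadAudioForEmotion: boolean;
  reason: string;
}

interface DetectOptions {
  // Whether the current experience depends on prosody / tone analysis.
  needsEmotionAnalysis: boolean;
}

// `scope` is the global object to probe (window in a browser). Passing it in
// keeps the function testable outside a browser.
function detectSttPath(
  scope: Record<string, unknown>,
  opts: DetectOptions
): SttDecision {
  // Some browsers expose the constructor only under the webkit prefix.
  const ctor = scope["SpeechRecognition"] ?? scope["webkitSpeechRecognition"];

  if (typeof ctor !== "function") {
    // Fallback: keep the existing server STT path. The audio is uploaded for
    // transcription anyway, so no separate emotion upload is needed.
    return {
      path: "server-stt",
      uploadAudioForEmotion: false,
      reason: "SpeechRecognition constructor not found",
    };
  }

  return {
    path: "client-web-speech",
    uploadAudioForEmotion: opts.needsEmotionAnalysis,
    reason: "SpeechRecognition constructor available",
  };
}
```

In a browser this would be called as `detectSttPath(window as unknown as Record<string, unknown>, { needsEmotionAnalysis: true })`. Keeping the decision in one pure function means the fallback contract can be unit-tested without a microphone or a real browser.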

4. Cost model

Current Deepgram/Whisper cost per voice turn
Estimated ratio of voice vs text turns on Innerverse
True cost saving if 80% of voice turns become client-side
Account for the cost of audio uploaded for emotion-only analysis (if hybrid)
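
The four inputs above combine into a simple estimate, sketched below. The function and field names (`CostInputs`, `estimateMonthlySttSavings`, and so on) are hypothetical, and any numbers fed into it are placeholders until the real Deepgram/Whisper bills and turn counts are pulled. It assumes a flat per-turn cost and covers STT spend only, not total infrastructure cost.

```typescript
// Hypothetical sketch: names and the flat per-turn cost assumption are illustrative.

interface CostInputs {
  voiceTurnsPerMonth: number;
  // Current server STT cost per voice turn, in USD.
  serverSttCostPerTurn: number;
  // Fraction (0..1) of voice turns that move to client-side transcription.
  clientSideShare: number;
  // Cost per client-side turn whose audio is still uploaded for emotion-only
  // analysis, in USD. Use 0 when the hybrid path is not in play.
  emotionUploadCostPerTurn: number;
}

interface CostEstimate {
  baseline: number;       // monthly STT spend today
  projected: number;      // monthly STT spend after the change
  saving: number;         // baseline - projected
  savingFraction: number; // saving as a fraction of baseline
}

function estimateMonthlySttSavings(c: CostInputs): CostEstimate {
  const baseline = c.voiceTurnsPerMonth * c.serverSttCostPerTurn;

  const clientTurns = c.voiceTurnsPerMonth * c.clientSideShare;
  const serverTurns = c.voiceTurnsPerMonth - clientTurns;

  // Turns that stay on the server keep their full cost; client-side turns
  // only pay for the optional emotion-analysis upload.
  const projected =
    serverTurns * c.serverSttCostPerTurn +
    clientTurns * c.emotionUploadCostPerTurn;

  const saving = baseline - projected;
  const savingFraction = baseline === 0 ? 0 : saving / baseline;

  return { baseline, projected, saving, savingFraction };
}
```

Equivalently, the saving fraction is clientSideShare × (1 − emotionUploadCostPerTurn / serverSttCostPerTurn). With placeholder inputs of an 80% client-side share and an emotion upload costing one fifth of a server STT turn, that is 0.8 × 0.8, or roughly 64% of STT spend. Whether this reaches Harnoor's 50% of overall cost depends on how much of the total bill STT represents.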

5. Trade-off matrix

Quality regression risk per competitor
Latency improvement (estimate ms saved)
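Privacy posture (better only where recognition truly runs on-device; some browser implementations may send audio to vendor servers, which the capability survey must settle)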
Battery drain on mobile (on-device recognition keeps the microphone and recognizer active, so the cost is non-trivial and needs measuring)
What we lose: word-level timestamps, prosody, voice fingerprinting

---

Deliverables (from SCOUT, when run)

1. Memo at F:/TITAN/plans/advisors/CLIENT-SIDE-STT-RESEARCH-2026-04-25.md (3500-5000 words)

2. Implementation spec with capability detection JS snippet + fallback contract

3. R-number queue — recommend next 1-3 R-numbers to ship the feature

4. SES email to Harnoor per AGENT-REPORTS-EMAIL-DAILY directive — top findings + recommendation + memo path

---

Execution Order

Per Harnoor's own words: "this should be the last thing you do for other things."

Current queue ahead of A013:

1. CloudWatch alarm on upstream_error EMF metric (in flight)

2. Migrate titan-daily-pa-email + titan-daily-newsletter from Gmail draft to SES send

3. BUILD livegroweveryday dashboard (after Harnoor sign-off on plan)

A013 SCOUT spawn: after item 3 completes, or when Harnoor explicitly says "go on the STT research."

Bookmark

This brief lives at F:/TITAN/plans/advisors/research-queue/. The dashboard /research/queue page (when built) will surface it prominently as the next scheduled research item. The ask ledger entry A013 also points back to this brief.

— TITAN · 2026-04-25