Classification: Internal — Founder + TITAN Advisors
Author: DARWIN (TITAN Architecture Agent)
Date: 2026-04-21
Status: READY FOR HARNOOR REVIEW (Task T011)
Version: 1.0
---
6. Tier-Per-Turn-Class Recommendation Table
7. Budget Flow: Sankey Analysis at 200k Turns/Month
10. Pre-Experiment Instrumentation Checklist
11. Recommended variants.py Changes (Staged, Not Applied)
12. References
---
Silent Infinity's conversation loop spans seven functionally distinct turn classes — from sub-100ms guardrail checks to multi-paragraph therapeutic mirroring. Today, a single model (sonnet-4.6-prod, 95% traffic) handles all of them, while haiku-4.5-cheap is at 5% experimental with no class-specific targeting, and opus-4.7-premium is staged at 0%.
This memo argues that applying the same model to all turn classes is architecturally equivalent to running a passenger jet engine in a lawnmower because the hardware happens to be available. The economic and latency inefficiencies are substantial: at 200,000 turns per month and the pricing figures specified in pricing.py, a naive all-Sonnet fleet costs approximately $3,582/month in generation costs (per the budget analysis at 200k turns/month). A tiered architecture reduces this to approximately $1,419/month, a 60% cost reduction, while simultaneously improving latency on high-frequency turns and preserving or improving quality on high-stakes turns.
The key recommendations are:
1. Route crisis-detection and Chat Sentinel to Haiku 4.5 with structured JSON-output prompts. These are classification tasks, not generative tasks: Haiku's structured-output capability is sufficient, its latency advantage is decisive (~400ms vs ~900ms p50 for Sonnet), and cost per turn drops from ~$0.0059 on the all-Sonnet baseline to ~$0.0003 (crisis screen) and ~$0.0006 (Sentinel).
2. Route greeting/small-talk and post-session synthesis to Haiku 4.5 with aggressive 1h prompt-cache on the system prompt. At 90% cache-read rate, effective input cost is $0.08/M tokens — six cents per thousand turns.
3. Stage Opus 4.7 for crisis-handling conversational flow behind a 5% canary gate, gated on response-completion-rate and NPS differential. Crisis turns are low-volume (~2% of traffic), high-stakes, and the one place where Opus 4.7's constitutional reasoning (Bai et al., 2022) may produce meaningfully better de-escalation language.
---
Chen, Zaharia, and Zou (2023) introduced FrugalGPT as a formal framework for the intuition that "not every query needs the most powerful model." The cascade model routes queries sequentially: a cheap model answers first; a scorer assesses confidence; only if confidence falls below threshold does the query escalate to a more expensive model. Chen et al. demonstrated up to 98% cost reduction against GPT-4 parity on QA benchmarks by routing commodity queries to smaller models and reserving large-model capacity for ambiguous or complex inputs.
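The cascade described above can be sketched as follows. This is a minimal illustration, not FrugalGPT's implementation: the tier callables, the scorer, and the 0.8 threshold are all hypothetical stand-ins chosen for the example.

```python
from typing import Callable, Sequence, Tuple

# Each tier is (name, generate_fn); tiers are ordered cheapest first.
Tier = Tuple[str, Callable[[str], str]]


def cascade(query: str,
            tiers: Sequence[Tier],
            score: Callable[[str, str], float],
            threshold: float = 0.8) -> Tuple[str, str]:
    """Return (tier_name, answer) from the first tier whose answer scores
    at or above `threshold`. The last tier's answer is always accepted."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for name, generate in tiers[:-1]:
        answer = generate(query)
        if score(query, answer) >= threshold:
            return name, answer
    name, generate = tiers[-1]
    return name, generate(query)


# Toy stand-ins (hypothetical): the cheap model only handles short queries.
def cheap(query: str) -> str:
    return "short answer" if len(query) < 20 else ""


def expensive(query: str) -> str:
    return "detailed answer"


def scorer(query: str, answer: str) -> float:
    return 1.0 if answer else 0.0


tiers = [("cheap", cheap), ("expensive", expensive)]
result_short = cascade("hi there", tiers, scorer)
result_long = cascade("a much longer and harder question", tiers, scorer)
```

Every escalated query pays for both the cheap call and the scorer before the expensive call starts, which is the latency overhead that static class routing avoids.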
The Silent Infinity architecture instantiates a structural variant of this cascade: turn class is determined statically from context (mode tag, pipeline stage) rather than post-hoc confidence scoring, which eliminates the latency overhead of a separate scorer and the risk of false escalation on short, high-confidence Haiku responses. This is closer to the "learned routing" variant Chen et al. describe — we have ground-truth class labels from the application protocol itself, which is stronger supervision than any post-hoc confidence score.
Operative principle: Route by task class, not by content complexity. The system knows at dispatch time whether a turn is a guardrail check (classification) or a reflective mirror (generation). Use that signal.
Shazeer et al. (2017) introduced the Mixture-of-Experts (MoE) layer as a learned sparse gating mechanism that routes tokens to specialized sub-networks. Fedus, Zoph, and Shazeer (2022) scaled this to the Switch Transformer, demonstrating that expert specialization at training time produces models with better parameter efficiency per FLOP.
Silent Infinity's tiering strategy implements a model-level MoE at inference time rather than a layer-level MoE at forward-pass time. The routing function is a deterministic dispatch table (turn class → model tier) rather than a learned gate. The theoretical argument is analogous: specialization reduces wasted compute. A 200M-parameter model that is well-prompted for JSON-schema guardrail classification will outperform a 1B-parameter generalist prompted for the same task on both latency and cost axes, because the generalist model spends capacity on generative diversity that is actively harmful in a classification setting.
Anthropic's prompt caching API (5-minute and 1-hour TTLs) allows the system prompt to be written once and read at 10% of the base input rate on subsequent calls within the TTL window. At a 90% cache-read rate:
The system prompt for Silent Infinity v1 is approximately 2,400 tokens (prompts/system_v1.md). At 200,000 turns/month, this prompt is loaded once per Lambda cold start, but with 1h cache TTL and typical Lambda warm-pool behavior, cache-read rates exceeding 85% are achievable for high-frequency turn classes. Post-session synthesis (batch job) will achieve lower cache-read rates due to temporal dispersion but compensates with Batch API 50% discounts.
Bai et al. (2022) demonstrated that Constitutional AI — training models against a set of explicit constitutional principles with self-critique and revision — produces models that are both more helpful and more harmless than RLHF-alone baselines. Anthropic's Opus-class models are trained with deeper constitutional reasoning capability, reflecting greater capacity for self-critique within a single forward pass.
For crisis-handling turns, the generation task is fundamentally different from mode-2 mirroring: the model must simultaneously (a) avoid reinforcing catastrophizing cognitions, (b) name the resource (988, Crisis Text Line) with precise wording, (c) maintain warmth and non-abandonment, and (d) not project prognosis. This is a constitutional navigation problem — there are multiple competing constraints that must be satisfied jointly. Wei et al.'s (2022) Chain-of-Thought prompting work establishes that larger models benefit more from chain-of-thought reasoning, with smaller models showing minimal improvement or regression. Haiku 4.5's constitutional reasoning capacity is under-tested for this case; the risk of a structurally correct but tonally wrong crisis response is asymmetric in its harm potential.
Leviathan et al. (2023) formalized speculative decoding: a small draft model proposes several candidate tokens autoregressively; the large target model then verifies all of them in parallel in a single forward pass and keeps the accepted prefix. For autoregressive generation, this achieves 2x-3x wall-clock speedup with identical output distribution. Anthropic has not publicly confirmed whether Bedrock's ConverseStream API uses speculative decoding internally for same-family model pairs (e.g., Haiku drafting Sonnet). However, the structural implication for our tiering design is important: even if speculative decoding were available, it would only help if the same user turn needed to be processed by both a small and a large model. Static class routing eliminates this dependency, because the small model handles its class entirely without target-model verification.
---
Silent Infinity's conversation pipeline produces the following turn classes, derived from handler.py mode dispatch and the crisis detection module:
| # | Turn Class | Source | Frequency (est. % of all turns) | Latency Tolerance | Quality Stakes |
|---|---|---|---|---|---|
| TC1 | Crisis-detection screen | Every incoming message, pre-LLM | 100% (every turn) | Hard <200ms | Safety-critical |
| TC2 | Chat Sentinel observation | Async observer, every turn | 100% async (non-blocking) | Soft <500ms | Analytics/signal |
| TC3 | Greeting / small-talk / registration matching | Mode 0-1 short exchange | ~25% | Soft <800ms | Low |
| TC4 | Reflective mirroring | Mode 2 (2 short paragraphs) | ~40% | Medium <1200ms | High |
| TC5 | Question-answering / teaching | Mode 3-4 (multi-paragraph) | ~30% | Relaxed <2000ms | High |
| TC6 | Crisis-handling conversational flow | Post-crisis-detection exchange | ~2% | Medium <1500ms | Maximal |
| TC7 | Post-session synthesis / weekly summary | Batch job, async | ~3% (by token volume: ~15%) | Batch (hours) | Medium-High |
Frequency note: Percentages represent share of LLM invocations, not user messages. TC1 and TC2 fire on every user message; TC3-TC7 fire based on mode detection. At 200k turns/month, TC1 and TC2 each execute ~200k times; TC4 executes ~80k times; TC6 executes ~4k times.
---
- us.anthropic.claude-haiku-4-5: cost_tracker.py carries $0.80/$4.00, consistent with AWS Bedrock's lower on-demand rates. This memo uses the pricing.py figures throughout for consistency with the production cost model.
- us.anthropic.claude-sonnet-4-6
- us.anthropic.claude-opus-4-7: cost_tracker.py documents that Opus 4.7's tokenizer produces 1.2-1.35x more tokens than Opus 4.6 for the same prompt. Dollar cost remains the apples-to-apples metric. variants.py notes that Opus 4.7 rejects temperature, top_p, and top_k parameters; bedrock_client._sanitize_body_for_model strips these at the adapter layer.
---
The routing decision for each turn class is made on three axes:
Axis 1 — Task structure. Is the output schema fixed (JSON classification) or open-ended (natural language prose)? Fixed-schema outputs benefit from smaller, faster models with explicit output constraints. Open-ended prose generation benefits from larger models with richer latent representations of register and tone.
Axis 2 — Stakes asymmetry. What is the worst-case output of the wrong model? For TC1 (crisis detection), a Haiku false-negative that fails to flag a suicidal ideation utterance is a safety incident. For TC3 (small-talk), a Haiku response that is slightly flat is a minor UX degradation. The stakes asymmetry determines how far the model tier should be above the minimum-viable threshold.
Axis 3 — Volume x unit cost. The product of frequency and per-turn cost determines budget exposure. High-frequency turns (TC1, TC2, TC3) at elevated cost produce the largest budget line items; these are the strongest candidates for model-tier downscaling.
| Turn Class | Task Structure | Stakes | Volume | Tier Decision |
|---|---|---|---|---|
| TC1 Crisis screen | Fixed JSON | Safety-critical | Very high | Haiku — with structured prompt + explicit schema |
| TC2 Chat Sentinel | Fixed JSON | Analytics | Very high | Haiku (already assigned) |
| TC3 Small-talk | Short prose | Low | High | Haiku with 1h cache |
| TC4 Reflective mirror | Open prose | High | Very high | Sonnet (quality-sensitive persona) |
| TC5 Teaching | Open prose | High | High | Sonnet (multi-step reasoning) |
| TC6 Crisis flow | Open prose | Maximal | Low | Sonnet now, Opus 4.7 canary |
| TC7 Batch synthesis | Open prose | Medium | Low/batch | Haiku + Batch API |
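The tier decisions in the table above amount to a static dispatch table. A minimal sketch follows; the `TurnClass` enum and `model_for` helper are hypothetical names for illustration, while the model IDs are the ones used elsewhere in this memo. TC6 maps to the Sonnet baseline here, with the Opus canary left to the variant layer.

```python
from enum import Enum

HAIKU = "us.anthropic.claude-haiku-4-5"
SONNET = "us.anthropic.claude-sonnet-4-6"


class TurnClass(Enum):
    TC1_CRISIS_SCREEN = 1
    TC2_CHAT_SENTINEL = 2
    TC3_SMALL_TALK = 3
    TC4_REFLECTIVE_MIRROR = 4
    TC5_TEACHING = 5
    TC6_CRISIS_FLOW = 6
    TC7_BATCH_SYNTHESIS = 7


# Deterministic routing: turn class -> model ID (no learned gate, no scorer).
TIER_BY_CLASS = {
    TurnClass.TC1_CRISIS_SCREEN: HAIKU,
    TurnClass.TC2_CHAT_SENTINEL: HAIKU,
    TurnClass.TC3_SMALL_TALK: HAIKU,
    TurnClass.TC4_REFLECTIVE_MIRROR: SONNET,
    TurnClass.TC5_TEACHING: SONNET,
    TurnClass.TC6_CRISIS_FLOW: SONNET,   # baseline; Opus canary handled separately
    TurnClass.TC7_BATCH_SYNTHESIS: HAIKU,
}


def model_for(turn_class: TurnClass) -> str:
    """Look up the model tier for a turn class; raises KeyError if unmapped."""
    return TIER_BY_CLASS[turn_class]
```

Keeping the mapping as plain data makes it easy to assert in a unit test that every turn class has an assignment.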
---
All costs computed with: 90% cache-read rate on system prompt (1h TTL), system prompt = 2,400 tokens, user message = 80 tokens (median estimate), output tokens as specified per class.
Cache assumptions per turn:
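The 536-token "blended" input figure used in the per-class tables appears to follow from the stated assumptions if cached system-prompt tokens are counted at 10% weight, giving a full-rate-equivalent token count. The sketch below is a reconstruction under that assumption, not the memo's stated method; the function name and parameters are hypothetical, and cache-write surcharges are ignored.

```python
def blended_input_tokens(system_tokens: int,
                         user_tokens: int,
                         cache_read_rate: float,
                         cached_price_fraction: float = 0.10) -> float:
    """Full-rate-equivalent input tokens for one turn.

    Cached system-prompt tokens are weighted by `cached_price_fraction`
    (cache reads are billed at 10% of the base input rate); uncached
    system-prompt tokens and the user message count at full weight.
    """
    cached = system_tokens * cache_read_rate * cached_price_fraction
    uncached = system_tokens * (1.0 - cache_read_rate)
    return cached + uncached + user_tokens


# 2,400-token system prompt, 80-token user message, 90% cache-read rate.
blended = blended_input_tokens(2400, 80, 0.90)

# With no caching at all, the full 2,480 tokens are billed at the base rate.
uncached_total = blended_input_tokens(2400, 80, 0.0)
```

Under this reading the three components are 216 (cached reads), 240 (cache misses), and 80 (user message) tokens.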
---
| Field | Value |
|---|---|
| Recommended model | Claude Haiku 4.5 (haiku-4.5-cheap) |
| Rationale | Binary/multi-label classification with explicit JSON schema. Haiku 4.5 achieves structured-output classification at parity with Sonnet on well-defined taxonomies (Wei et al. 2022 CoT does not apply — no chain required for label prediction). |
| Max output tokens | 64 (JSON object: {"crisis": bool, "tier": str, "flags": [...]}) |
| Temperature | 0.0 (deterministic classification) |
| Prompt-caching strategy | 1h TTL on system prompt. Crisis classifier system prompt is static; near-100% cache-read rate achievable. |
| Estimated unit cost | Input: 536 blended tokens × $0.08/M = $0.000043. Output: 64 × $4.00/M = $0.000256. Total: ~$0.0003/turn |
| Expected p50 latency | ~350ms (Bedrock ConverseStream, ARM64 Lambda, short output) |
| Fallback rule | Provider timeout (>3s): fall through to regex-only crisis patterns in guardrails.py. Log degraded-mode event to logs/guardrails.jsonl. Never block the user message. |
---
| Field | Value |
|---|---|
| Recommended model | Claude Haiku 4.5 (already assigned — confirm and formalize) |
| Rationale | Extraction of emotion vector, frustration score, and speech-act class from completed turn. Fixed JSON schema. Runs async (non-blocking). Already the correct assignment; this memo formalizes the spec. |
| Max output tokens | 128 (JSON: {"emotion": str, "valence": float, "speech_act": str, "frustration": 0-1}) |
| Temperature | 0.1 (slight variation tolerable for emotion labeling) |
| Prompt-caching strategy | 1h TTL on system prompt. Sentinel system prompt is static; near-100% cache-read rate. |
| Estimated unit cost | Input: 536 blended + 200 (assistant turn to analyze) = 736 tokens × $0.08/M = $0.000059. Output: 128 × $4.00/M = $0.000512. Total: ~$0.00057/turn |
| Expected p50 latency | ~380ms (non-blocking — does not affect user-perceived latency) |
| Fallback rule | Silent skip on timeout. Sentinel data is analytics — missing 1% of records does not affect safety or user experience. |
---
| Field | Value |
|---|---|
| Recommended model | Claude Haiku 4.5 (haiku-4.5-cheap) |
| Rationale | Short-form, low-stakes, rapid output. The quality floor for "good morning / how are you feeling today" exchanges is low. The primary UX requirement is low latency and warmth of tone. Haiku 4.5 with a well-crafted system prompt produces adequate warmth. Risk: Haiku mirroring can be slightly more formulaic — acceptable for TC3. |
| Max output tokens | 128 (two sentences maximum for TC3) |
| Temperature | 0.8 (warmth requires some stochastic variation) |
| Prompt-caching strategy | 1h TTL on full system prompt + TC3-specific instruction block. High repetition rate justifies 1h TTL. |
| Estimated unit cost | Input: 536 blended × $0.08/M = $0.000043. Output: 128 × $4.00/M = $0.000512. Total: ~$0.00055/turn |
| Expected p50 latency | ~400ms |
| Fallback rule | Escalate to Sonnet 4.6 if Haiku returns empty response or triggers guardrail flag. |
---
| Field | Value |
|---|---|
| Recommended model | Claude Sonnet 4.6 (sonnet-4.6-prod) — retain current default |
| Rationale | Mode 2 is Silent Infinity's core value-delivery turn. Two warm, emotionally resonant paragraphs that reflect the user's language back without interpretation or advice. Quality asymmetry is high: Haiku mirroring produces adequate structure but measurably reduced tonal nuance (see Risk section §9.1). Sonnet 4.6 at $3/$15 is the correct quality-cost tradeoff for the highest-frequency substantive turn class. |
| Max output tokens | 384 (two paragraphs ≈ 250-350 tokens; headroom for longer user context) |
| Temperature | 0.7 (current production default — sufficient variation for warmth without drift) |
| Prompt-caching strategy | 1h TTL on system prompt. User message is never cached (unique per turn). Effective cache-read rate ~88% given Lambda warm-pool behavior. |
| Estimated unit cost | Input: 536 blended × $0.30/M = $0.000161. Output: 384 × $15.00/M = $0.00576. Total: ~$0.0059/turn |
| Expected p50 latency | ~850ms (Sonnet 4.6, 350-token output, Bedrock ConverseStream) |
| Fallback rule | Bedrock provider error: retry once with exponential backoff (150ms), then surface a soft error ("I'm here — just a moment") to the client while retrying in background. |
---
| Field | Value |
|---|---|
| Recommended model | Claude Sonnet 4.6 (sonnet-4.6-prod) |
| Rationale | Multi-paragraph substantive generation requiring sustained logical coherence across paragraphs, accurate psychological framing, and tonal consistency with the Silent Infinity persona. Wei et al. (2022) demonstrate that chain-of-thought capability scales with model size — Sonnet's multi-step reasoning is meaningfully better than Haiku's for this class. |
| Max output tokens | 768 (3-5 paragraphs; hard cap prevents runaway outputs) |
| Temperature | 0.65 (slightly reduced from default to improve factual grounding) |
| Prompt-caching strategy | 1h TTL on system prompt. Mode-specific instruction blocks benefit from 5m TTL on the per-mode prefix (appended to the cached system prompt base). |
| Estimated unit cost | Input: 536 blended × $0.30/M = $0.000161. Output: 768 × $15.00/M = $0.01152. Total: ~$0.0117/turn |
| Expected p50 latency | ~1100ms (Sonnet 4.6, 700-token output) |
| Fallback rule | Bedrock timeout (>8s): return partial streamed content with client-side "I'll continue in a moment" buffer. |
---
| Field | Value |
|---|---|
| Recommended model | Claude Sonnet 4.6 (current) → Opus 4.7 canary at 5% |
| Rationale | Crisis turns are low-frequency (~2%, ~4k/month at 200k MVP volume) but maximal-stakes. The constitutional navigation required (Bai et al. 2022), balancing de-escalation language, resource referral, non-abandonment, and harm avoidance simultaneously, is the strongest candidate for Opus 4.7's extended reasoning. At 4,000 turns/month, the incremental cost of Opus over Sonnet for the 5% canary (200 turns/month) is ~$6/month (200 × ($0.039 − $0.0078)), well within acceptable MVP experimentation budget. Gate the canary on response-completion-rate (proxy for coherence length) and blind NPS differential. |
| Max output tokens | 512 (crisis responses must be warm but not overwhelming — verbosity is a risk) |
| Temperature | 0.5 (reduced for constitutional consistency; Opus 4.7 rejects this parameter — adapter strips it per _sanitize_body_for_model) |
| Prompt-caching strategy | 5m TTL on crisis system prompt (lower-frequency turn class; 1h TTL wastes cache allocation). |
| Estimated unit cost (Sonnet baseline) | Input: 536 blended × $0.30/M = $0.000161. Output: 512 × $15.00/M = $0.00768. Total: ~$0.0078/turn |
| Estimated unit cost (Opus 4.7 canary) | Input: 536 blended × $1.50/M = $0.000804. Output: 512 × $75.00/M = $0.0384. Total: ~$0.039/turn |
| Expected p50 latency (Sonnet) | ~1000ms |
| Expected p50 latency (Opus 4.7) | ~2500ms |
| Fallback rule | Opus provider error: fall back to Sonnet immediately (no retry delay on safety-critical turns). Log fallback event. |
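The parameter stripping and the immediate Opus-to-Sonnet fallback from the table above could look roughly like this. It is a sketch only: `sanitize_body_sketch` is a hypothetical stand-in for `bedrock_client._sanitize_body_for_model` (whose real signature is not shown in this memo), and `invoke` and `log` are injected callables invented for the example.

```python
OPUS_MODEL_ID = "us.anthropic.claude-opus-4-7"
SONNET_MODEL_ID = "us.anthropic.claude-sonnet-4-6"
UNSUPPORTED_ON_OPUS = ("temperature", "top_p", "top_k")


def sanitize_body_sketch(model_id: str, body: dict) -> dict:
    """Return a copy of `body`, dropping sampling params when targeting Opus 4.7."""
    if model_id != OPUS_MODEL_ID:
        return dict(body)
    return {k: v for k, v in body.items() if k not in UNSUPPORTED_ON_OPUS}


def crisis_flow_call(invoke, body: dict, log):
    """Try Opus first; on any provider error fall back to Sonnet with no delay."""
    try:
        return invoke(OPUS_MODEL_ID, sanitize_body_sketch(OPUS_MODEL_ID, body))
    except Exception as exc:  # broad on purpose: any failure must not block the turn
        log({"event": "opus_fallback", "error": repr(exc)})
        return invoke(SONNET_MODEL_ID, sanitize_body_sketch(SONNET_MODEL_ID, body))


# Demonstration with a fake provider that always fails for Opus.
events = []


def fake_invoke(model_id: str, body: dict):
    if model_id == OPUS_MODEL_ID:
        raise RuntimeError("simulated provider error")
    return model_id, body


request = {"max_tokens": 512, "temperature": 0.5}
served_by, sent_body = crisis_flow_call(fake_invoke, request, events.append)
opus_body = sanitize_body_sketch(OPUS_MODEL_ID, request)
```

Sanitizing per target model means the Sonnet fallback still receives `temperature: 0.5` even though the Opus attempt did not.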
---
| Field | Value |
|---|---|
| Recommended model | Claude Haiku 4.5 + Batch API |
| Rationale | Synthesis is a batch job with no user-blocking latency constraint. The task — "summarize this session's themes and emotional trajectory in 200-300 words for the user's weekly digest" — is within Haiku 4.5's capability envelope with a well-structured prompt. The Batch API provides a 50% discount on input and output tokens. At Haiku batch pricing: $0.40/M input, $2.00/M output. |
| Max output tokens | 512 (weekly summary target ~300-400 words) |
| Temperature | 0.6 (summary benefits from controlled variation) |
| Prompt-caching strategy | Batch jobs run asynchronously; 5m TTL is impractical (batch may span >5 minutes). Use full-price input for batch jobs — the 50% Batch API discount is the primary cost lever here. |
| Estimated unit cost (Batch) | Input: 2,480 (full uncached, batch doesn't benefit from cache TTL at async pace) × $0.40/M = $0.000992. Output: 512 × $2.00/M = $0.001024. Total: ~$0.0020/turn |
| Expected latency | Batch SLA: <24h. Actual: typically 1-4h for small batches. |
| Fallback rule | Batch job failure: re-queue once. After second failure: skip weekly summary for that user, log to logs/synthesis-errors.jsonl. Never block session start. |
---
| Turn Class | Volume/month | % |
|---|---|---|
| TC1 Crisis screen | 200,000 | 100% of messages |
| TC2 Chat Sentinel | 200,000 | 100% async |
| TC3 Small-talk | 50,000 | 25% |
| TC4 Reflective mirror | 80,000 | 40% |
| TC5 Teaching | 60,000 | 30% |
| TC6 Crisis flow | 4,000 | 2% |
| TC7 Batch synthesis | 6,000 | 3% |
Note: TC1 and TC2 execute on every user message; TC3-TC7 represent the modal LLM turn distribution within those messages. Total unique user messages: 200,000. Total LLM calls: 600,000 (200k × TC1 + 200k × TC2 + 200k distributed across TC3-TC7).
TC1 (Crisis screen, Haiku): 200,000 × $0.0003 = $60.00
TC2 (Chat Sentinel, Haiku): 200,000 × $0.00057 = $114.00
TC3 (Small-talk, Haiku): 50,000 × $0.00055 = $27.50
TC4 (Reflective mirror, Sonnet): 80,000 × $0.0059 = $472.00
TC5 (Teaching, Sonnet): 60,000 × $0.0117 = $702.00
TC6 (Crisis flow, Sonnet): 4,000 × $0.0078 = $31.20
TC7 (Batch synthesis, Haiku): 6,000 × $0.0020 = $12.00
──────────
TOTAL (tiered) ≈ $1,418.70
TC1+TC2 (screen+sentinel, Sonnet): 400,000 × $0.0059 = $2,360.00
TC3-TC5 (chat turns, Sonnet): 190,000 × $0.0059 = $1,121.00
TC6 (crisis flow, Sonnet): 4,000 × $0.0078 = $31.20
TC7 (synthesis, Sonnet): 6,000 × $0.0117 = $70.20
──────────
TOTAL (all-Sonnet) ≈ $3,582.40
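The two totals above can be reproduced directly from the volume and unit-cost figures. The snippet below is a small arithmetic check using only numbers already stated in this memo; the variable and function names are illustrative.

```python
# (monthly volume, unit cost in USD per turn), copied from the lines above.
TIERED = {
    "TC1": (200_000, 0.0003),
    "TC2": (200_000, 0.00057),
    "TC3": (50_000, 0.00055),
    "TC4": (80_000, 0.0059),
    "TC5": (60_000, 0.0117),
    "TC6": (4_000, 0.0078),
    "TC7": (6_000, 0.0020),
}
ALL_SONNET = {
    "TC1+TC2": (400_000, 0.0059),
    "TC3-TC5": (190_000, 0.0059),
    "TC6": (4_000, 0.0078),
    "TC7": (6_000, 0.0117),
}


def monthly_total(lines: dict) -> float:
    """Sum volume x unit cost over all line items, rounded to cents."""
    return round(sum(volume * cost for volume, cost in lines.values()), 2)


tiered = monthly_total(TIERED)
baseline = monthly_total(ALL_SONNET)
reduction_pct = round(100 * (1 - tiered / baseline), 1)
```

On these inputs the tiered plan comes out at roughly 60% below the all-Sonnet baseline.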
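Monthly Inference Budget: ~$1,419 (tiered) vs ~$3,582 (all-Sonnet)
| Model tier | Turn class | Monthly cost | Share of tiered budget |
|---|---|---|---|
| Haiku 4.5 | TC1 Crisis screen | $60.00 | 4% |
| Haiku 4.5 | TC2 Chat Sentinel | $114.00 | 8% |
| Haiku 4.5 | TC3 Small-talk | $27.50 | 2% |
| Haiku 4.5 | TC7 Batch synthesis | $12.00 | <1% |
| Haiku 4.5 | Subtotal | $213.50 | 15% |
| Sonnet 4.6 | TC4 Reflective mirror | $472.00 | 33% |
| Sonnet 4.6 | TC5 Teaching | $702.00 | 49% |
| Sonnet 4.6 | TC6 Crisis flow (Sonnet baseline) | $31.20 | 2% |
| Sonnet 4.6 | Subtotal | $1,205.20 | 85% |
| Opus 4.7 | TC6 canary at 5% | ~$7.80 | canary only, not in total |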
Key insight: TC4 (reflective mirroring) and TC5 (teaching) together account for 82% of the total tiered budget despite being served by Sonnet, not Opus. These are where quality matters most and where the $3/$15 pricing is the correct tier. The Haiku tiers (TC1, TC2, TC3, TC7) cost only $213.50/month combined, 15% of budget, while handling 76% of LLM invocations (456,000 of 600,000).
---
The existing variants.py registry (T010, bootstrapped 2026-04-21) supports a four-stage lifecycle: experimental → staged → canary → production. The following rollout plan maps each turn-class model assignment to a gate condition and status transition.
| Variant ID | Status | Rollout | Notes |
|---|---|---|---|
| sonnet-4.6-prod | production | 95% | All turn classes by default |
| haiku-4.5-cheap | experimental | 5% | No class-specific targeting |
| opus-4.7-premium | staged | 0% | Locked behind flag |
Problem with Stage 0: Haiku 5% is class-agnostic. A user landing in haiku-4.5-cheap may get Haiku for TC4 (reflective mirroring), which is the highest-quality-requirement class. This is inverted from the desired policy.
Action: Wire TC1 and TC2 to always use Haiku 4.5 regardless of llm_model variant. These are not A/B test classes — they are infrastructure decisions. Register them as separate categories in variants.py: crisis_detector_model and sentinel_model.
Gate condition for completion: Crisis-detection false-negative rate (measured against regex baseline in guardrails.py) remains ≤ 0.1% over 7-day window. Sentinel JSON parse success rate ≥ 99%.
New variant entries needed (do not apply — recommendation only):
Variant(id="haiku-crisis-v1", category="crisis_detector_model",
        rollout_pct=100, status="production",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 64, "temperature": 0.0},
        description="TC1 crisis screen: structured JSON classification")

Variant(id="haiku-sentinel-v1", category="sentinel_model",
        rollout_pct=100, status="production",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 128, "temperature": 0.1},
        description="TC2 Chat Sentinel: async emotion/speech-act extraction")
Action: Register small_talk_model and synthesis_model categories. Ramp Haiku to 20% for TC3, 100% for TC7 (batch, no user impact).
Gate condition (TC3 20% ramp):
Gate condition for TC3 → production (100%):
Action: Register crisis_flow_model category. Stage Opus 4.7 canary at 5% of TC6 traffic.
Gate condition for Opus canary:
Gate condition for Opus → production (TC6 only):
| Turn Class | Model | Status |
|---|---|---|
| TC1 | Haiku 4.5 | production |
| TC2 | Haiku 4.5 | production |
| TC3 | Haiku 4.5 | production |
| TC4 | Sonnet 4.6 | production |
| TC5 | Sonnet 4.6 | production |
| TC6 | Opus 4.7 (or Sonnet if canary gates fail) | production or reverted |
| TC7 | Haiku 4.5 (Batch API) | production |
| Metric | Definition | Gate threshold |
|---|---|---|
| Response-completion-rate | % of turns where user sends at least one more message within session | ≤ 5% drop vs baseline |
| NPS differential | End-of-session NPS (0-10) split by model tier assignment | p > 0.05 on Wilcoxon rank-sum (Mann-Whitney U), since the tier groups are independent samples |
| p50 latency | Median wall-clock milliseconds from Lambda invocation to first streamed token | Per-class targets (§6) |
| Cost-per-conversation | Total Bedrock spend / unique sessions | ≤ $0.08/session (Sonnet baseline) |
| JSON parse success rate | % of TC1/TC2 outputs that parse without error | ≥ 99.5% |
| Crisis false-negative rate | % of crisis-flagged sessions missed by TC1 screen vs regex baseline | ≤ 0.1% |
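The threshold-style gates in the table above can be checked mechanically before a promotion. The sketch below is illustrative: the metric key names are hypothetical, the NPS significance test and per-class latency targets are left out, and a missing metric is treated as a failure so that promotion fails closed.

```python
# metric key -> (comparison, limit); limits copied from the table above.
GATES = {
    "completion_rate_drop": ("<=", 0.05),
    "cost_per_session_usd": ("<=", 0.08),
    "json_parse_success_rate": (">=", 0.995),
    "crisis_false_negative_rate": ("<=", 0.001),
}


def failed_gates(metrics: dict) -> list:
    """Return the names of gates that are missing or outside their limit."""
    failed = []
    for name, (op, limit) in GATES.items():
        value = metrics.get(name)
        if value is None:
            failed.append(name)  # fail closed on missing instrumentation
            continue
        ok = value >= limit if op == ">=" else value <= limit
        if not ok:
            failed.append(name)
    return failed


healthy = failed_gates({
    "completion_rate_drop": 0.02,
    "cost_per_session_usd": 0.05,
    "json_parse_success_rate": 0.998,
    "crisis_false_negative_rate": 0.0005,
})
degraded = failed_gates({
    "completion_rate_drop": 0.02,
    "cost_per_session_usd": 0.05,
    "json_parse_success_rate": 0.98,
})
```

An empty list means every listed gate passed; any non-empty result should block the status transition.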
---
Failure mode: A user in a vulnerable emotional state receives a Haiku-generated reflective response that is structurally correct but tonally thin — shorter, more formulaic, missing the specific verbal mirroring of their language that is Silent Infinity's primary clinical-adjacent differentiator.
How it occurs: If llm_model variant assignment (currently 5% Haiku) routes a user to Haiku without class-specific routing guards, TC4 turns will occasionally be served by Haiku.
Concrete example: User: "I keep circling back to that moment when my mother looked at me and said nothing. The silence felt like a judgment I could never appeal." Sonnet: mirrors "the silence felt like a judgment you could never appeal" with extended resonance. Haiku: "That sounds like a painful memory. It seems like you're reflecting on a moment that still has emotional weight. What do you notice when you sit with that image?"
The Haiku response is not harmful, but the verbal mirroring failure — not echoing the specific phrase "silence felt like a judgment I could never appeal" — breaks the reflective function that users come for. Repeat exposure to this failure pattern degrades retention.
Mitigation: Class-specific model categories (per §8 Stage 1) prevent this. The global llm_model variant must not be applied to TC4. If using a unified routing flag, add a class guard: if turn_class in (TC4, TC5): override to sonnet regardless of llm_model variant.
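The class guard in the mitigation above can be sketched as a small resolver. This is an assumption-laden illustration: `resolve_model_id` and the string turn-class labels are hypothetical and do not reflect the actual `bedrock_client` code.

```python
SONNET_MODEL_ID = "us.anthropic.claude-sonnet-4-6"

# Turn classes that must never be downgraded by the global llm_model variant.
GUARDED_CLASSES = frozenset({"TC4", "TC5"})


def resolve_model_id(turn_class: str, variant_model_id: str) -> str:
    """Apply the class guard: TC4/TC5 always get Sonnet, regardless of the
    model the user's llm_model variant bucket would otherwise select."""
    if turn_class in GUARDED_CLASSES:
        return SONNET_MODEL_ID
    return variant_model_id


haiku_bucket = "us.anthropic.claude-haiku-4-5"
mirror_model = resolve_model_id("TC4", haiku_bucket)
smalltalk_model = resolve_model_id("TC3", haiku_bucket)
```

The guard is a stopgap for a unified routing flag; dedicated per-class categories make it unnecessary.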
Failure mode: If Sonnet is used for TC1 with an under-constrained prompt, it may produce a multi-paragraph analysis rather than a compact JSON classification. This introduces two sub-risks: (a) latency blows out to >1000ms on a turn that precedes every single LLM call, and (b) parsing failures if the response is not valid JSON.
How it occurs: No explicit max_tokens: 64 constraint on the crisis screen. Sonnet's default generation length is 800-1200 tokens when unconstrained on open-ended prompts.
Mitigation: Set max_tokens: 64 and include explicit JSON schema in the system prompt. Use response_format: {"type": "json_object"} where the Bedrock ConverseStream API supports it. Wrap parse logic in a try/except with regex fallback.
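A minimal sketch of the try/except parse with regex fallback mentioned above. The pattern list is a placeholder for illustration only (the production patterns live in guardrails.py and are not reproduced here), and the function name and the returned `parse_ok` field are assumptions.

```python
import json
import re

# Placeholder patterns for illustration; NOT the production guardrail set.
FALLBACK_PATTERNS = re.compile(
    r"\b(kill myself|end my life|suicid\w*)\b", re.IGNORECASE
)


def parse_crisis_screen(raw_output: str, user_message: str) -> dict:
    """Parse the classifier's JSON; on any parse or schema failure, fall back
    to a regex check on the user message so the turn is never blocked."""
    try:
        parsed = json.loads(raw_output)
        if not isinstance(parsed, dict) or not isinstance(parsed.get("crisis"), bool):
            raise ValueError("missing or non-boolean 'crisis' field")
        return {
            "crisis": parsed["crisis"],
            "tier": parsed.get("tier"),
            "flags": parsed.get("flags", []),
            "parse_ok": True,
        }
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return {
            "crisis": bool(FALLBACK_PATTERNS.search(user_message)),
            "tier": None,
            "flags": ["regex_fallback"],
            "parse_ok": False,
        }


ok = parse_crisis_screen('{"crisis": false, "tier": "none", "flags": []}', "hello")
bad = parse_crisis_screen("Here is a multi-paragraph analysis...", "I want to end my life")
```

Logging `parse_ok` on every call gives the JSON parse success rate directly.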
Failure mode: A mode-4 teaching turn with a deeply engaged user triggers multi-pass chain-of-thought internally, producing a 2,000-token output. At Sonnet output pricing ($15/M), a single such turn costs $0.03, roughly 2.6x the estimated unit cost of $0.0117. If 10% of TC5 turns (6,000 of 60,000) blow out to 2,000 tokens, the monthly TC5 budget increases from $702 to roughly $813 (54,000 × $0.0117 + 6,000 × $0.0302).
Mitigation: Hard max_tokens: 768 cap on TC5. Monitor p95 output token distribution in llm-costs.jsonl. Add a weekly CloudWatch alarm: if TC5 mean output tokens exceeds 600, trigger a Harnoor review.
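The mean-output-token check behind the proposed alarm could be computed from the cost log along these lines. This is a sketch under assumptions: it presumes each llm-costs.jsonl record carries a `turn_class` label and an `output_tokens` count, and both field names are hypothetical until the instrumentation is in place.

```python
import json
from statistics import mean


def tc5_mean_output_tokens(jsonl_lines):
    """Mean output tokens over TC5 records, or None if there are none."""
    counts = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("turn_class") == "TC5":
            counts.append(record["output_tokens"])
    return mean(counts) if counts else None


def needs_review(mean_tokens, limit: int = 600) -> bool:
    """True when the TC5 mean exceeds the review threshold."""
    return mean_tokens is not None and mean_tokens > limit


sample_log = [
    '{"turn_class": "TC5", "output_tokens": 700}',
    '{"turn_class": "TC5", "output_tokens": 620}',
    '{"turn_class": "TC4", "output_tokens": 300}',
    "",
]
tc5_mean = tc5_mean_output_tokens(sample_log)
```

The same loop extends naturally to a p95 if the full distribution is wanted rather than the mean.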
Failure mode: A user in active crisis receives the first token of an Opus 4.7 response at 2,500ms p50. The silence after submitting a crisis message may itself be dysregulating.
Mitigation: Implement a UI-side "presence indicator" (per the Perceived Latency Animation plan) that activates immediately on crisis-flagged turns. Show "I hear you" acknowledgment text client-side within 300ms (pre-computed, not LLM-generated) while the Opus response streams. This is consistent with speculative decoding principles (Leviathan et al. 2023) at the UX layer — "speculate" the acknowledgment while the full response generates.
Failure mode: A traffic spike causes Lambda to scale to 10+ concurrent instances. If the prompt cache behaves as instance-local or per-connection, each new cold-start instance misses the cache on its first TC4/TC5 turn. At Sonnet's uncached input rate ($3.00/M), each miss on the 2,400-token system prompt costs about $0.0072, so 10 cold starts produce roughly $0.072 in unplanned input costs.
Mitigation: Accept as known noise at MVP scale. At 200k turns/month, cold-start cache-miss events are <0.5% of traffic. Wire the cost_tracker to flag per-turn cache-miss rate (via cache_read_input_tokens == 0). If cache-miss rate exceeds 5%, investigate Lambda concurrency scaling triggers.
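The per-turn cache-miss flag described above reduces to a simple ratio. In this sketch the records are plain dicts with a `cache_read_input_tokens` field, as named in the mitigation; the function name and the treatment of a missing field as a miss are assumptions.

```python
def cache_miss_rate(records) -> float:
    """Fraction of calls whose cache_read_input_tokens is zero (or absent)."""
    total = 0
    misses = 0
    for record in records:
        total += 1
        if record.get("cache_read_input_tokens", 0) == 0:
            misses += 1
    return misses / total if total else 0.0


# 19 cache hits and 1 miss -> 5% miss rate, right at the investigation threshold.
sample = [{"cache_read_input_tokens": 2400}] * 19 + [{"cache_read_input_tokens": 0}]
rate = cache_miss_rate(sample)
should_investigate = rate > 0.05
```

Computing the rate per turn class as well as overall would show whether misses cluster on the low-frequency classes.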
---
The following metrics must be operational before any variant is promoted beyond staged status:
1. turn_class label in llm-costs.jsonl: every LLM call logs which turn class (TC1-TC7) it belongs to. Currently absent; required for per-class cost attribution.
2. model_id label in llm-costs.jsonl: confirm that the field is populated from settings.claude_model (it is, per claude.py line 85). Verify for Bedrock calls (bedrock_client.py).
3. [...] time.perf_counter() deltas in the handler.
4. [...] llm_model bucket) is logged alongside the NPS score so A/B analysis is possible.
5. [...] {"parse_ok": bool} per call.
6. [...] guardrails.py. When Haiku says "no crisis" and regex says "potential crisis", log as crisis_discordance event.
7. [...] _sanitize_body_for_model correctly strips temperature for Opus before the first canary turn. Write an integration test: assert "temperature" not in body_sent_to_bedrock_for_opus.
---
The following changes are described for Harnoor review. No production code has been modified.
# TC1: Crisis detection (dedicated category, not in llm_model pool)
Variant(id="haiku-crisis-v1", category="crisis_detector_model",
        rollout_pct=100, cohort_filter=None, status="staged",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 64, "temperature": 0.0,
                "output_schema": "crisis_v1_json"},
        description="TC1: structured JSON crisis classification. Promote to production after 7-day false-negative monitoring.")

# TC2: Sentinel (already Haiku; formalize as dedicated category)
Variant(id="haiku-sentinel-v1", category="sentinel_model",
        rollout_pct=100, cohort_filter=None, status="staged",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 128, "temperature": 0.1},
        description="TC2: Chat Sentinel async observer. Formalize existing Haiku assignment.")

# TC3: Small-talk (20% Haiku canary)
Variant(id="haiku-smalltalk-v1", category="small_talk_model",
        rollout_pct=20, cohort_filter=None, status="experimental",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 128, "temperature": 0.8},
        description="TC3: Haiku small-talk canary. Gate: response-completion-rate delta <= 5%.")
Variant(id="sonnet-smalltalk-baseline", category="small_talk_model",
        rollout_pct=80, cohort_filter=None, status="production",
        config={"model_id": "us.anthropic.claude-sonnet-4-6",
                "max_tokens": 128, "temperature": 0.7},
        description="TC3: Sonnet baseline for small-talk A/B.")

# TC6: Crisis flow (5% Opus canary)
Variant(id="opus-crisis-v1", category="crisis_flow_model",
        rollout_pct=5, cohort_filter=None, status="experimental",
        config={"model_id": "us.anthropic.claude-opus-4-7",
                "max_tokens": 512},
        description="TC6: Opus 4.7 5% crisis-flow canary. Gate: 200 turns + blind NPS review.")
Variant(id="sonnet-crisis-baseline", category="crisis_flow_model",
        rollout_pct=95, cohort_filter=None, status="production",
        config={"model_id": "us.anthropic.claude-sonnet-4-6",
                "max_tokens": 512, "temperature": 0.5},
        description="TC6: Sonnet 4.6 crisis-flow baseline.")

# TC7: Batch synthesis (Haiku + Batch API, 100%)
Variant(id="haiku-synthesis-v1", category="synthesis_model",
        rollout_pct=100, cohort_filter=None, status="staged",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 512, "temperature": 0.6,
                "use_batch_api": True},
        description="TC7: Haiku batch synthesis. No user latency impact; promote to production immediately.")
llm_model category
The global llm_model category should be scoped to TC4/TC5 only. Annotate this in the description and add a code comment in bedrock_client._resolve_model_id() clarifying that TC1, TC2, TC3, TC6, TC7 consult their own category keys, not llm_model.
---
1. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
2. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
3. Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176. Published in Transactions on Machine Learning Research (TMLR), 2024.
4. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
5. Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39.
6. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. Proceedings of the 40th International Conference on Machine Learning (ICML 2023). arXiv:2211.17192.
7. Anthropic. (2025, October). Claude Haiku 4.5 System Card. https://www.anthropic.com/claude-haiku-4-5-system-card
8. Anthropic. (2026). Models overview — Claude API documentation. https://platform.claude.com/docs/en/about-claude/models/overview
9. Anthropic. (2026). Pricing — Claude API documentation. https://platform.claude.com/docs/en/about-claude/pricing
10. Amazon Web Services. (2026). Claude Haiku 4.5 — Amazon Bedrock model card. https://docs.aws.amazon.com/bedrock/latest/userguide/model-card-anthropic-claude-haiku-4-5.html
---
This memo is an architectural proposal. No production code has been modified. Recommended variants.py changes in §11 require Harnoor sign-off before implementation. See Task T011 in the TITAN Task Registry.