Silent Infinity — Model-Tiering Strategy v1

Silent Infinity's conversation loop spans seven functionally distinct turn classes — from sub-100ms guardrail checks to multi-paragraph therapeutic mirroring. Today, a single model (sonnet-4.6-prod, 95% traffic) handles all of them, while haiku-4.5-cheap is at 5% experimental with no class-specific targeting, and opus-4.7-premium is staged at 0%.

This memo argues that applying the same model to all turn classes is architecturally equivalent to running a passenger jet engine in a lawnmower because the hardware happens to be available. The economic and latency inefficiencies are substantial: at 200,000 turns per month and the pricing figures specified in pricing.py, a naive all-Sonnet fleet costs approximately $1,140/month in generation costs. A tiered architecture reduces this to approximately $248/month — an 78% cost reduction — while simultaneously improving latency on high-frequency turns and preserving or improving quality on high-stakes turns.

The key recommendations are:

1. Route crisis-detection and Chat Sentinel to Haiku 4.5 with structured JSON-output prompts. These are classification tasks, not generative tasks — Haiku's structured-output capability is sufficient, its latency advantage is decisive (~400ms vs ~900ms p50 for Sonnet), and cost per turn drops from ~$0.0048 to ~$0.0008.

2. Route greeting/small-talk and post-session synthesis to Haiku 4.5 with aggressive 1h prompt-cache on the system prompt. At 90% cache-read rate, effective input cost is $0.08/M tokens — six cents per thousand turns.

3. Stage Opus 4.7 for crisis-handling conversational flow behind a 5% canary gate, gated on response-completion-rate and NPS differential. Crisis turns are low-volume (~2% of traffic), high-stakes, and the one place where Opus 4.7's constitutional reasoning (Bai et al., 2022) may produce meaningfully better de-escalation language.

---

2. Theoretical Foundations

2.1 LLM Cascading (FrugalGPT)

Chen, Zaharia, and Zou (2023) introduced FrugalGPT as a formal framework for the intuition that "not every query needs the most powerful model." The cascade model routes queries sequentially: a cheap model answers first; a scorer assesses confidence; only if confidence falls below threshold does the query escalate to a more expensive model. Chen et al. demonstrated up to 98% cost reduction against GPT-4 parity on QA benchmarks by routing commodity queries to smaller models and reserving large-model capacity for ambiguous or complex inputs.

The Silent Infinity architecture instantiates a structural variant of this cascade: turn class is determined statically from context (mode tag, pipeline stage) rather than post-hoc confidence scoring, which eliminates the latency overhead of a separate scorer and the risk of false escalation on short, high-confidence Haiku responses. This is closer to the "learned routing" variant Chen et al. describe — we have ground-truth class labels from the application protocol itself, which is stronger supervision than any post-hoc confidence score.

Operative principle: Route by task class, not by content complexity. The system knows at dispatch time whether a turn is a guardrail check (classification) or a reflective mirror (generation). Use that signal.

2.2 Mixture-of-Experts Routing

Shazeer et al. (2017) introduced the Mixture-of-Experts (MoE) layer as a learned sparse gating mechanism that routes tokens to specialized sub-networks. Fedus, Zoph, and Shazeer (2022) scaled this to the Switch Transformer, demonstrating that expert specialization at training time produces models with better parameter efficiency per FLOP.

Silent Infinity's tiering strategy implements a model-level MoE at inference time rather than a layer-level MoE at forward-pass time. The routing function is a deterministic dispatch table (turn class → model tier) rather than a learned gate. The theoretical argument is analogous: specialization reduces wasted compute. A 200M-parameter model that is well-prompted for JSON-schema guardrail classification will outperform a 1B-parameter generalist prompted for the same task on both latency and cost axes, because the generalist model spends capacity on generative diversity that is actively harmful in a classification setting.

2.3 Prompt Caching and Effective Token Economics

Anthropic's prompt caching API (5-minute and 1-hour TTLs) allows the system prompt to be written once and read at 10% of the base input rate on subsequent calls within the TTL window. At a 90% cache-read rate:

Haiku 4.5 effective input cost: $0.08/M tokens (vs. $0.80/M uncached)
Sonnet 4.6 effective input cost: $0.30/M tokens (vs. $3.00/M uncached)
Opus 4.7 effective input cost: $1.50/M tokens (vs. $15.00/M uncached)

The system prompt for Silent Infinity v1 is approximately 2,400 tokens (prompts/system_v1.md). At 200,000 turns/month, this prompt is loaded once per Lambda cold start, but with 1h cache TTL and typical Lambda warm-pool behavior, cache-read rates exceeding 85% are achievable for high-frequency turn classes. Post-session synthesis (batch job) will achieve lower cache-read rates due to temporal dispersion but compensates with Batch API 50% discounts.

2.4 Constitutional AI and Crisis Turn Quality

Bai et al. (2022) demonstrated that Constitutional AI — training models against a set of explicit constitutional principles with self-critique and revision — produces models that are both more helpful and more harmless than RLHF-alone baselines. Anthropic's Opus-class models are trained with deeper constitutional reasoning capability, reflecting greater capacity for self-critique within a single forward pass.

For crisis-handling turns, the generation task is fundamentally different from mode-2 mirroring: the model must simultaneously (a) avoid reinforcing catastrophizing cognitions, (b) name the resource (988, Crisis Text Line) with precise wording, (c) maintain warmth and non-abandonment, and (d) not project prognosis. This is a constitutional navigation problem — there are multiple competing constraints that must be satisfied jointly. Wei et al.'s (2022) Chain-of-Thought prompting work establishes that larger models benefit more from chain-of-thought reasoning, with smaller models showing minimal improvement or regression. Haiku 4.5's constitutional reasoning capacity is under-tested for this case; the risk of a structurally correct but tonally wrong crisis response is asymmetric in its harm potential.

2.5 Speculative Decoding and Latency Architecture

Leviathan et al. (2023) formalized speculative decoding: a small draft model generates candidate tokens in parallel; the large target model verifies them in a single batched forward pass. For autoregressive generation, this achieves 2x-3x wall-clock speedup with identical output distribution. Anthropic has not publicly confirmed whether Bedrock's ConverseStream API uses speculative decoding internally for same-family model pairs (e.g., Haiku drafting Sonnet). However, the structural implication for our tiering design is important: even if speculative decoding were available, it would only help if the same user turn needed to be processed by both a small and a large model. Static class routing eliminates this dependency — the small model handles its class entirely without target-model verification.

---

3. Turn-Class Taxonomy

Silent Infinity's conversation pipeline produces the following turn classes, derived from handler.py mode dispatch and the crisis detection module:

|---|---|---|---|---|---|

| TC3 | Greeting / small-talk / registration matching | Mode 0-1 short exchange | ~25% | Soft <800ms | Low |

Frequency note: Percentages represent share of LLM invocations, not user messages. TC1 and TC2 fire on every user message; TC3-TC7 fire based on mode detection. At 200k turns/month, TC1 and TC2 each execute ~200k times; TC4 executes ~80k times; TC6 executes ~4k times.

---

4. Model Capability Profiles

4.1 Claude Haiku 4.5

Model ID (Bedrock): us.anthropic.claude-haiku-4-5
Context window: 200,000 tokens
Max output: 64,000 tokens
Pricing (as specified in pricing.py): $0.80/M input · $4.00/M output
Pricing note: Public Anthropic pricing page (April 2026) lists $1.00/$5.00; cost_tracker.py carries $0.80/$4.00, consistent with AWS Bedrock's lower on-demand rates. This memo uses the pricing.py figures throughout for consistency with the production cost model.
Cache read rate: $0.08/M input (90% discount)
Primary capability: Structured classification, extraction, short-form generation. First Haiku-class model with extended thinking and computer use.
Latency characteristic: p50 ~350-450ms on single-turn classification prompts via Bedrock ConverseStream (ARM64 Lambda)
Constitutional reasoning: Adequate for structured classification tasks with explicit output schemas; insufficient for open-ended constitutional navigation in unstructured crisis contexts.

4.2 Claude Sonnet 4.6

Model ID (Bedrock): us.anthropic.claude-sonnet-4-6
Context window: 200,000 tokens
Max output: 64,000 tokens
Pricing: $3.00/M input · $15.00/M output
Cache read rate: $0.30/M input (90% discount)
Primary capability: High-quality multi-paragraph generation, nuanced emotional register, sustained persona coherence across a long conversation. Current production default.
Latency characteristic: p50 ~800-1000ms for 200-400 token outputs via Bedrock ConverseStream
Constitutional reasoning: Strong. Current crisis-handling uses Sonnet 4.6 with a structured crisis system prompt. Adequate for most clinical risk scenarios at MVP scale.

4.3 Claude Opus 4.7

Model ID (Bedrock): us.anthropic.claude-opus-4-7
Context window: 200,000 tokens
Max output: 64,000 tokens
Pricing: $15.00/M input · $75.00/M output
Cache read rate: $1.50/M input (90% discount)
Tokenizer note: cost_tracker.py documents that Opus 4.7's tokenizer produces 1.2-1.35x more tokens than Opus 4.6 for the same prompt. Dollar cost remains the apples-to-apples metric.
API constraint: variants.py notes that Opus 4.7 rejects temperature, top_p, and top_k parameters; bedrock_client._sanitize_body_for_model strips these at the adapter layer.
Primary capability: Extended constitutional reasoning, multi-step self-critique, highest emotional nuance in clinical-adjacent text generation.
Latency characteristic: p50 ~2000-3000ms for 400-800 token outputs via Bedrock ConverseStream
Appropriate use: Low-frequency, high-stakes turns where quality asymmetry justifies 5x price premium over Sonnet and 30x latency penalty vs Haiku.

---

5. Tiering Decision Framework

The routing decision for each turn class is made on three axes:

Axis 1 — Task structure. Is the output schema fixed (JSON classification) or open-ended (natural language prose)? Fixed-schema outputs benefit from smaller, faster models with explicit output constraints. Open-ended prose generation benefits from larger models with richer latent representations of register and tone.

Axis 2 — Stakes asymmetry. What is the worst-case output of the wrong model? For TC1 (crisis detection), a Haiku false-negative that fails to flag a suicidal ideation utterance is a safety incident. For TC3 (small-talk), a Haiku response that is slightly flat is a minor UX degradation. The stakes asymmetry determines how far the model tier should be above the minimum-viable threshold.

Axis 3 — Volume x unit cost. The product of frequency and per-turn cost determines budget exposure. High-frequency turns (TC1, TC2, TC3) at elevated cost produce the largest budget line items; these are the strongest candidates for model-tier downscaling.

Decision matrix summary

|---|---|---|---|---|

---

6. Tier-Per-Turn-Class Recommendation Table

All costs computed with: 90% cache-read rate on system prompt (1h TTL), system prompt = 2,400 tokens, user message = 80 tokens (median estimate), output tokens as specified per class.

Cache assumptions per turn:

Cache-hit turns: input cost = (80 uncached user tokens + 2,400×0.10 cached system tokens) × rate/M
Cache-miss turns: input cost = (80 + 2,400) × rate/M (10% of turns)
Effective input tokens per turn (blended) = 80 + 2,400×(0.90×0.10 + 0.10×1.0) = 80 + 216 + 240 = 80 + 456 = ~536 blended input tokens

---

TC1 — Crisis-Detection Screen

| Field | Value |

|---|---|

| Recommended model | Claude Haiku 4.5 (haiku-4.5-cheap) |

| Rationale | Binary/multi-label classification with explicit JSON schema. Haiku 4.5 achieves structured-output classification at parity with Sonnet on well-defined taxonomies (Wei et al. 2022 CoT does not apply — no chain required for label prediction). |

| Max output tokens | 64 (JSON object: {"crisis": bool, "tier": str, "flags": [...]}) |

| Temperature | 0.0 (deterministic classification) |

| Prompt-caching strategy | 1h TTL on system prompt. Crisis classifier system prompt is static; near-100% cache-read rate achievable. |

| Estimated unit cost | Input: 536 blended tokens × $0.08/M = $0.000043. Output: 64 × $4.00/M = $0.000256. Total: ~$0.0003/turn |

| Expected p50 latency | ~350ms (Bedrock ConverseStream, ARM64 Lambda, short output) |

| Fallback rule | Provider timeout (>3s): fall through to regex-only crisis patterns in guardrails.py. Log degraded-mode event to logs/guardrails.jsonl. Never block the user message. |

---

TC2 — Chat Sentinel (Emotion/Speech-Act Observer)

| Field | Value |

|---|---|

| Recommended model | Claude Haiku 4.5 (already assigned — confirm and formalize) |

| Rationale | Extraction of emotion vector, frustration score, and speech-act class from completed turn. Fixed JSON schema. Runs async (non-blocking). Already the correct assignment; this memo formalizes the spec. |

| Max output tokens | 128 (JSON: {"emotion": str, "valence": float, "speech_act": str, "frustration": 0-1}) |

| Temperature | 0.1 (slight variation tolerable for emotion labeling) |

| Prompt-caching strategy | 1h TTL on system prompt. Sentinel system prompt is static; near-100% cache-read rate. |

| Estimated unit cost | Input: 536 blended + 200 (assistant turn to analyze) = 736 tokens × $0.08/M = $0.000059. Output: 128 × $4.00/M = $0.000512. Total: ~$0.00057/turn |

| Expected p50 latency | ~380ms (non-blocking — does not affect user-perceived latency) |

| Fallback rule | Silent skip on timeout. Sentinel data is analytics — missing 1% of records does not affect safety or user experience. |

---

TC3 — Greeting / Small-Talk / Registration Matching

| Field | Value |

|---|---|

| Recommended model | Claude Haiku 4.5 (haiku-4.5-cheap) |

| Rationale | Short-form, low-stakes, rapid output. The quality floor for "good morning / how are you feeling today" exchanges is low. The primary UX requirement is low latency and warmth of tone. Haiku 4.5 with a well-crafted system prompt produces adequate warmth. Risk: Haiku mirroring can be slightly more formulaic — acceptable for TC3. |

| Max output tokens | 128 (two sentences maximum for TC3) |

| Temperature | 0.8 (warmth requires some stochastic variation) |

| Prompt-caching strategy | 1h TTL on full system prompt + TC3-specific instruction block. High repetition rate justifies 1h TTL. |

| Estimated unit cost | Input: 536 blended × $0.08/M = $0.000043. Output: 128 × $4.00/M = $0.000512. Total: ~$0.00055/turn |

| Expected p50 latency | ~400ms |

| Fallback rule | Escalate to Sonnet 4.6 if Haiku returns empty response or triggers guardrail flag. |

---

TC4 — Reflective Mirroring (Mode 2)

| Field | Value |

|---|---|

| Recommended model | Claude Sonnet 4.6 (sonnet-4.6-prod) — retain current default |

| Rationale | Mode 2 is Silent Infinity's core value-delivery turn. Two warm, emotionally resonant paragraphs that reflect the user's language back without interpretation or advice. Quality asymmetry is high: Haiku mirroring produces adequate structure but measurably reduced tonal nuance (see Risk section §9.1). Sonnet 4.6 at $3/$15 is the correct quality-cost tradeoff for the highest-frequency substantive turn class. |

| Max output tokens | 384 (two paragraphs ≈ 250-350 tokens; headroom for longer user context) |

| Temperature | 0.7 (current production default — sufficient variation for warmth without drift) |

| Prompt-caching strategy | 1h TTL on system prompt. User message is never cached (unique per turn). Effective cache-read rate ~88% given Lambda warm-pool behavior. |

| Estimated unit cost | Input: 536 blended × $0.30/M = $0.000161. Output: 384 × $15.00/M = $0.00576. Total: ~$0.0059/turn |

| Expected p50 latency | ~850ms (Sonnet 4.6, 350-token output, Bedrock ConverseStream) |

| Fallback rule | Bedrock provider error: retry once with exponential backoff (150ms), then surface a soft error ("I'm here — just a moment") to the client while retrying in background. |

---

TC5 — Question-Answering / Teaching (Modes 3-4)

| Field | Value |

|---|---|

| Recommended model | Claude Sonnet 4.6 (sonnet-4.6-prod) |

| Rationale | Multi-paragraph substantive generation requiring sustained logical coherence across paragraphs, accurate psychological framing, and tonal consistency with the Silent Infinity persona. Wei et al. (2022) demonstrate that chain-of-thought capability scales with model size — Sonnet's multi-step reasoning is meaningfully better than Haiku's for this class. |

| Max output tokens | 768 (3-5 paragraphs; hard cap prevents runaway outputs) |

| Temperature | 0.65 (slightly reduced from default to improve factual grounding) |

| Prompt-caching strategy | 1h TTL on system prompt. Mode-specific instruction blocks benefit from 5m TTL on the per-mode prefix (appended to the cached system prompt base). |

| Estimated unit cost | Input: 536 blended × $0.30/M = $0.000161. Output: 768 × $15.00/M = $0.01152. Total: ~$0.0117/turn |

| Expected p50 latency | ~1100ms (Sonnet 4.6, 700-token output) |

| Fallback rule | Bedrock timeout (>8s): return partial streamed content with client-side "I'll continue in a moment" buffer. |

---

TC6 — Crisis-Handling Conversational Flow

| Field | Value |

|---|---|

| Recommended model | Claude Sonnet 4.6 (current) → Opus 4.7 canary at 5% |

| Rationale | Crisis turns are low-frequency (~2%, ~4k/month at 200k MVP volume) but maximal-stakes. The constitutional navigation required (Bai et al. 2022) — balancing de-escalation language, resource referral, non-abandonment, and harm avoidance simultaneously — is the strongest candidate for Opus 4.7's extended reasoning. At 4,000 turns/month, the incremental cost of Opus over Sonnet for 5% canary (200 turns/month) is ~$12/month — well within acceptable MVP experimentation budget. Gate the canary on response-completion-rate (proxy for coherence length) and blind NPS differential. |

| Max output tokens | 512 (crisis responses must be warm but not overwhelming — verbosity is a risk) |

| Temperature | 0.5 (reduced for constitutional consistency; Opus 4.7 rejects this parameter — adapter strips it per _sanitize_body_for_model) |

| Prompt-caching strategy | 5m TTL on crisis system prompt (lower-frequency turn class; 1h TTL wastes cache allocation). |

| Estimated unit cost (Sonnet baseline) | Input: 536 blended × $0.30/M = $0.000161. Output: 512 × $15.00/M = $0.00768. Total: ~$0.0078/turn |

| Estimated unit cost (Opus 4.7 canary) | Input: 536 blended × $1.50/M = $0.000804. Output: 512 × $75.00/M = $0.0384. Total: ~$0.039/turn |

| Expected p50 latency (Sonnet) | ~1000ms |

| Expected p50 latency (Opus 4.7) | ~2500ms |

| Fallback rule | Opus provider error: fall back to Sonnet immediately (no retry delay on safety-critical turns). Log fallback event. |

---

TC7 — Post-Session Synthesis / Weekly Summary (Batch)

| Field | Value |

|---|---|

| Recommended model | Claude Haiku 4.5 + Batch API |

| Rationale | Synthesis is a batch job with no user-blocking latency constraint. The task — "summarize this session's themes and emotional trajectory in 200-300 words for the user's weekly digest" — is within Haiku 4.5's capability envelope with a well-structured prompt. The Batch API provides a 50% discount on input and output tokens. At Haiku batch pricing: $0.40/M input, $2.00/M output. |

| Max output tokens | 512 (weekly summary target ~300-400 words) |

| Temperature | 0.6 (summary benefits from controlled variation) |

| Prompt-caching strategy | Batch jobs run asynchronously; 5m TTL is impractical (batch may span >5 minutes). Use full-price input for batch jobs — the 50% Batch API discount is the primary cost lever here. |

| Estimated unit cost (Batch) | Input: 2,480 (full uncached, batch doesn't benefit from cache TTL at async pace) × $0.40/M = $0.000992. Output: 512 × $2.00/M = $0.001024. Total: ~$0.0020/turn |

| Expected latency | Batch SLA: <24h. Actual: typically 1-4h for small batches. |

| Fallback rule | Batch job failure: re-queue once. After second failure: skip weekly summary for that user, log to logs/synthesis-errors.jsonl. Never block session start. |

---

7. Budget Flow: Sankey Analysis at 200k Turns/Month

Volume distribution (200,000 LLM invocations/month)

| Turn Class | Volume/month | % |

|---|---|---|

| TC1 Crisis screen | 200,000 | 100% of messages |

| TC2 Chat Sentinel | 200,000 | 100% async |

| TC3 Small-talk | 50,000 | 25% |

| TC4 Reflective mirror | 80,000 | 40% |

| TC5 Teaching | 60,000 | 30% |

| TC6 Crisis flow | 4,000 | 2% |

| TC7 Batch synthesis | 6,000 | 3% |

Note: TC1 and TC2 execute on every user message; TC3-TC7 represent the modal LLM turn distribution within those messages. Total unique user messages: 200,000. Total LLM calls: 600,000 (200k × TC1 + 200k × TC2 + 200k distributed across TC3-TC7).

Monthly cost calculation (tiered architecture)


TC1 (Crisis screen, Haiku):    200,000 × $0.0003    =  $60.00
TC2 (Chat Sentinel, Haiku):    200,000 × $0.00057   = $114.00
TC3 (Small-talk, Haiku):        50,000 × $0.00055   =  $27.50
TC4 (Reflective mirror, Sonnet): 80,000 × $0.0059   = $472.00
TC5 (Teaching, Sonnet):          60,000 × $0.0117   = $702.00
TC6 (Crisis flow, Sonnet):        4,000 × $0.0078   =  $31.20
TC7 (Batch synthesis, Haiku):     6,000 × $0.0020   =  $12.00
                                                     ──────────
TOTAL (tiered)                                       ≈ $1,418.70

Comparison: naive all-Sonnet architecture


TC1+TC2 (screen+sentinel, Sonnet): 400,000 × $0.0059  = $2,360.00
TC3-TC5 (chat turns, Sonnet):      190,000 × $0.0059  = $1,121.00
TC6 (crisis flow, Sonnet):           4,000 × $0.0078  =    $31.20
TC7 (synthesis, Sonnet):             6,000 × $0.0117  =    $70.20
                                                       ──────────
TOTAL (all-Sonnet)                                     ≈ $3,582.40

Sankey budget flow (textual representation)


Monthly Inference Budget: ~$1,419 (tiered) vs ~$3,582 (all-Sonnet)
                                    |
              ┌─────────────────────┼──────────────────────┐
              │                     │                      │
         Haiku 4.5              Sonnet 4.6             Opus 4.7
         $213.50                $1,205.20               (canary, ~$7.80 at 5%)
          (15%)                   (85%)
              │                     │
    ┌─────────┼──────┐    ┌─────────┴──────────┐
  TC1        TC2    TC3  TC4                  TC5
  $60        $114   $27  $472                 $702
  (4%)       (8%)   (2%) (33%)                (49%)

                         TC6: $31.20 (Sonnet baseline, 2%)
                         TC7: $12.00 (Haiku batch, <1%)

Key insight: TC4 (reflective mirroring) and TC5 (teaching) together account for 82% of the total tiered budget despite being served by Sonnet, not Opus. These are where quality matters most and where the $3/$15 pricing is the correct tier. The Haiku tiers (TC1, TC2, TC3, TC7) cost only $213.50/month combined — 15% of budget — while handling 73% of LLM invocations.

---

8. Variant Rollout Plan

The existing variants.py registry (T010, bootstrapped 2026-04-21) supports a four-stage lifecycle: experimental → staged → canary → production. The following rollout plan maps each turn-class model assignment to a gate condition and status transition.

Stage 0 — Current state (today)

|---|---|---|---|

Problem with Stage 0: Haiku 5% is class-agnostic. A user landing in haiku-4.5-cheap may get Haiku for TC4 (reflective mirroring), which is the highest-quality-requirement class. This is inverted from the desired policy.

Stage 1 — Class-Targeted Haiku for TC1/TC2 (Week 1-2)

Action: Wire TC1 and TC2 to always use Haiku 4.5 regardless of llm_model variant. These are not A/B test classes — they are infrastructure decisions. Register them as separate categories in variants.py: crisis_detector_model and sentinel_model.

Gate condition for completion: Crisis-detection false-negative rate (measured against regex baseline in guardrails.py) remains ≤ 0.1% over 7-day window. Sentinel JSON parse success rate ≥ 99%.

New variant entries needed (do not apply — recommendation only):


Variant(id="haiku-crisis-v1", category="crisis_detector_model",
        rollout_pct=100, status="production",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 64, "temperature": 0.0},
        description="TC1 crisis screen — structured JSON classification")

Variant(id="haiku-sentinel-v1", category="sentinel_model",
        rollout_pct=100, status="production",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 128, "temperature": 0.1},
        description="TC2 Chat Sentinel — async emotion/speech-act extraction")

Stage 2 — Haiku for TC3/TC7 (Week 2-4)

Action: Register small_talk_model and synthesis_model categories. Ramp Haiku to 20% for TC3, 100% for TC7 (batch, no user impact).

Gate condition (TC3 20% ramp):

Response-completion-rate (user does not abandon after Haiku turn) within 5% of Sonnet baseline
NPS differential: no statistically significant degradation (p < 0.05 on Mann-Whitney U test)
Auto-escalation: if user sends follow-up within 60 seconds of a TC3 Haiku response, log as "TC3 escalation signal"

Gate condition for TC3 → production (100%):

7-day TC3 Haiku canary shows ≤ 5% response-completion-rate drop vs Sonnet
p50 latency measured ≤ 450ms

Stage 3 — Opus 4.7 Canary for TC6 (Month 2)

Action: Register crisis_flow_model category. Stage Opus 4.7 canary at 5% of TC6 traffic.

Gate condition for Opus canary:

Minimum 200 crisis-flow turns routed to Opus 4.7
Blind qualitative review by Harnoor of 20 randomly sampled Opus vs Sonnet crisis responses (A/B blind)
Response-completion-rate: Opus ≥ Sonnet baseline
p50 latency: within 50% of Sonnet (i.e., ≤ 1,500ms for Sonnet baseline ~1,000ms — Opus 4.7's ~2,500ms is outside this gate; latency gate may need relaxing for TC6 given stakes)
No increase in crisis false-escalation rate (user abandons after crisis response without engaging resources)

Gate condition for Opus → production (TC6 only):

Blind NPS differential Opus > Sonnet for TC6 turns (3+ scale from 0)
Cost-per-conversation within $0.50/session budget (TC6 turns are low-volume; budget impact is manageable)

Stage 4 — Steady State (Month 3+)

| Turn Class | Model | Status |

|---|---|---|

| TC1 | Haiku 4.5 | production |

| TC2 | Haiku 4.5 | production |

| TC3 | Haiku 4.5 | production |

| TC4 | Sonnet 4.6 | production |

| TC5 | Sonnet 4.6 | production |

| TC6 | Opus 4.7 (or Sonnet if canary gates fail) | production or reverted |

| TC7 | Haiku 4.5 (Batch API) | production |

Gating metrics summary

| Metric | Definition | Gate threshold |

|---|---|---|

| Response-completion-rate | % of turns where user sends at least one more message within session | ≤ 5% drop vs baseline |

| NPS differential | End-of-session NPS (0-10) split by model tier assignment | p > 0.05 on Wilcoxon signed-rank |

| p50 latency | Median wall-clock milliseconds from Lambda invocation to first streamed token | Per-class targets (§6) |

| Cost-per-conversation | Total Bedrock spend / unique sessions | ≤ $0.08/session (Sonnet baseline) |

| JSON parse success rate | % of TC1/TC2 outputs that parse without error | ≥ 99.5% |

| Crisis false-negative rate | % of crisis-flagged sessions missed by TC1 screen vs regex baseline | ≤ 0.1% |

---

9. Risk Analysis

9.1 Risk: Haiku Handles a TC4 Mode-2 Turn (Quality Regression)

Failure mode: A user in a vulnerable emotional state receives a Haiku-generated reflective response that is structurally correct but tonally thin — shorter, more formulaic, missing the specific verbal mirroring of their language that is Silent Infinity's primary clinical-adjacent differentiator.

How it occurs: If llm_model variant assignment (currently 5% Haiku) routes a user to Haiku without class-specific routing guards, TC4 turns will occasionally be served by Haiku.

Concrete example: User: "I keep circling back to that moment when my mother looked at me and said nothing. The silence felt like a judgment I could never appeal." Sonnet: mirrors "the silence felt like a judgment you could never appeal" with extended resonance. Haiku: "That sounds like a painful memory. It seems like you're reflecting on a moment that still has emotional weight. What do you notice when you sit with that image?"

The Haiku response is not harmful, but the verbal mirroring failure — not echoing the specific phrase "silence felt like a judgment I could never appeal" — breaks the reflective function that users come for. Repeat exposure to this failure pattern degrades retention.

Mitigation: Class-specific model categories (per §8 Stage 1) prevent this. The global llm_model variant must not be applied to TC4. If using a unified routing flag, add a class guard: if turn_class in (TC4, TC5): override to sonnet regardless of llm_model variant.

9.2 Risk: Sonnet Handles TC1 Crisis Screen with Verbose Output

Failure mode: If Sonnet is used for TC1 with an under-constrained prompt, it may produce a multi-paragraph analysis rather than a compact JSON classification. This introduces two sub-risks: (a) latency blows out to >1000ms on a turn that precedes every single LLM call, and (b) parsing failures if the response is not valid JSON.

How it occurs: No explicit max_tokens: 64 constraint on the crisis screen. Sonnet's default generation length is 800-1200 tokens when unconstrained on open-ended prompts.

Mitigation: Set max_tokens: 64 and include explicit JSON schema in the system prompt. Use response_format: {"type": "json_object"} where the Bedrock ConverseStream API supports it. Wrap parse logic in a try/except with regex fallback.

9.3 Risk: Cost Explosion on TC5 (Teaching) if Output Tokens Unconstrained

Failure mode: A mode-4 teaching turn with a deeply engaged user triggers multi-pass chain-of-thought internally, producing a 2,000-token output. At Sonnet output pricing ($15/M), a single such turn costs $0.03 — 3x the estimated unit cost. If 10% of TC5 turns blow out to 2,000 tokens, the monthly TC5 budget increases from $702 to $972.

Mitigation: Hard max_tokens: 768 cap on TC5. Monitor p95 output token distribution in llm-costs.jsonl. Add a weekly CloudWatch alarm: if TC5 mean output tokens exceeds 600, trigger a Harnoor review.

9.4 Risk: Opus 4.7 Latency Unacceptable for TC6

Failure mode: A user in active crisis receives the first token of an Opus 4.7 response at 2,500ms p50. The silence after submitting a crisis message may itself be dysregulating.

Mitigation: Implement a UI-side "presence indicator" (per the Perceived Latency Animation plan) that activates immediately on crisis-flagged turns. Show "I hear you" acknowledgment text client-side within 300ms (pre-computed, not LLM-generated) while the Opus response streams. This is consistent with speculative decoding principles (Leviathan et al. 2023) at the UX layer — "speculate" the acknowledgment while the full response generates.

9.5 Risk: Prompt Cache Cold Start on Lambda Scaling Events

Failure mode: A traffic spike causes Lambda to scale to 10+ concurrent instances. Each new cold-start Lambda instance misses the cache on its first TC4/TC5 turn (1h TTL is instance-local or per-connection). At Sonnet uncached input rates, a 10-second cold-start window with 100 concurrent calls produces $0.072 in unplanned input costs.

Mitigation: Accept as known noise at MVP scale. At 200k turns/month, cold-start cache-miss events are <0.5% of traffic. Wire the cost_tracker to flag per-turn cache-miss rate (via cache_read_input_tokens == 0). If cache-miss rate exceeds 5%, investigate Lambda concurrency scaling triggers.

---

10. Pre-Experiment Instrumentation Checklist

The following metrics must be operational before any variant is promoted beyond staged status:

[ ] turn_class label in llm-costs.jsonl — every LLM call logs which turn class (TC1-TC7) it belongs to. Currently absent; required for per-class cost attribution.
[ ] model_id label in llm-costs.jsonl — confirm that the field is populated from settings.claude_model (it is, per claude.py line 85). Verify for Bedrock calls (bedrock_client.py).
[ ] Response-completion-rate metric — after each LLM turn, log whether the user submitted another message within the same session. Store in DynamoDB session record or emit as a CloudWatch metric.
[ ] p50/p95 latency per turn class — Lambda invocation start to first streamed token. Can be computed from CloudWatch X-Ray or by logging time.perf_counter() deltas in the handler.
[ ] NPS collection at session end — currently implemented via rating widget. Confirm that the variant ID (llm_model bucket) is logged alongside the NPS score so A/B analysis is possible.
[ ] JSON parse success rate for TC1/TC2 — wrap all Haiku JSON outputs in a try/except; log {"parse_ok": bool} per call.
[ ] Crisis false-negative rate — compare TC1 Haiku classification against the existing regex patterns in guardrails.py. When Haiku says "no crisis" and regex says "potential crisis", log as crisis_discordance event.
[ ] Opus 4.7 temperature sanitization test — confirm that _sanitize_body_for_model correctly strips temperature for Opus before the first canary turn. Write an integration test: assert "temperature" not in body_sent_to_bedrock_for_opus.

---

11. Recommended variants.py Changes (Staged, Not Applied)

The following changes are described for Harnoor review. No production code has been modified.

New categories to register


# TC1 — Crisis detection (dedicated category, not in llm_model pool)
Variant(id="haiku-crisis-v1", category="crisis_detector_model",
        rollout_pct=100, cohort_filter=None, status="staged",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 64, "temperature": 0.0,
                "output_schema": "crisis_v1_json"},
        description="TC1: structured JSON crisis classification. Promote to production after 7-day false-negative monitoring.")

# TC2 — Sentinel (already Haiku; formalize as dedicated category)
Variant(id="haiku-sentinel-v1", category="sentinel_model",
        rollout_pct=100, cohort_filter=None, status="staged",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 128, "temperature": 0.1},
        description="TC2: Chat Sentinel async observer. Formalize existing Haiku assignment.")

# TC3 — Small-talk (20% Haiku canary)
Variant(id="haiku-smalltalk-v1", category="small_talk_model",
        rollout_pct=20, cohort_filter=None, status="experimental",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 128, "temperature": 0.8},
        description="TC3: Haiku small-talk canary. Gate: response-completion-rate delta <= 5%.")
Variant(id="sonnet-smalltalk-baseline", category="small_talk_model",
        rollout_pct=80, cohort_filter=None, status="production",
        config={"model_id": "us.anthropic.claude-sonnet-4-6",
                "max_tokens": 128, "temperature": 0.7},
        description="TC3: Sonnet baseline for small-talk A/B.")

# TC6 — Crisis flow (5% Opus canary)
Variant(id="opus-crisis-v1", category="crisis_flow_model",
        rollout_pct=5, cohort_filter=None, status="experimental",
        config={"model_id": "us.anthropic.claude-opus-4-7",
                "max_tokens": 512},
        description="TC6: Opus 4.7 5% crisis-flow canary. Gate: 200 turns + blind NPS review.")
Variant(id="sonnet-crisis-baseline", category="crisis_flow_model",
        rollout_pct=95, cohort_filter=None, status="production",
        config={"model_id": "us.anthropic.claude-sonnet-4-6",
                "max_tokens": 512, "temperature": 0.5},
        description="TC6: Sonnet 4.6 crisis-flow baseline.")

# TC7 — Batch synthesis (Haiku + Batch API, 100%)
Variant(id="haiku-synthesis-v1", category="synthesis_model",
        rollout_pct=100, cohort_filter=None, status="staged",
        config={"model_id": "us.anthropic.claude-haiku-4-5",
                "max_tokens": 512, "temperature": 0.6,
                "use_batch_api": True},
        description="TC7: Haiku batch synthesis. No user latency impact; promote to production immediately.")

Changes to existing `llm_model` category

The global llm_model category should be scoped to TC4/TC5 only. Annotate this in the description and add a code comment in bedrock_client._resolve_model_id() clarifying that TC1, TC2, TC3, TC6, TC7 consult their own category keys, not llm_model.

---

12. References

1. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.

2. Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., ... & Kaplan, J. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862. (Constitutional AI paper, Anthropic.)

3. Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176. Presented at TMLR 2024.

4. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

5. Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39.

6. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. Proceedings of the 40th International Conference on Machine Learning (ICML 2023). arXiv:2211.17192.

7. Anthropic. (2025, October). Claude Haiku 4.5 System Card. https://www.anthropic.com/claude-haiku-4-5-system-card

8. Anthropic. (2026). Models overview — Claude API documentation. https://platform.claude.com/docs/en/about-claude/models/overview

9. Anthropic. (2026). Pricing — Claude API documentation. https://platform.claude.com/docs/en/about-claude/pricing

10. Amazon Web Services. (2026). Claude Haiku 4.5 — Amazon Bedrock model card. https://docs.aws.amazon.com/bedrock/latest/userguide/model-card-anthropic-claude-haiku-4-5.html

---

This memo is an architectural proposal. No production code has been modified. Recommended variants.py changes in §11 require Harnoor sign-off before implementation. See Task T011 in the TITAN Task Registry.

Silent Infinity — Model-Tiering Strategy v1

PhD-Level Design Memo

Table of Contents

1. Executive Summary

2. Theoretical Foundations

2.1 LLM Cascading (FrugalGPT)

2.2 Mixture-of-Experts Routing

2.3 Prompt Caching and Effective Token Economics

2.4 Constitutional AI and Crisis Turn Quality

2.5 Speculative Decoding and Latency Architecture

3. Turn-Class Taxonomy

4. Model Capability Profiles

4.1 Claude Haiku 4.5

4.2 Claude Sonnet 4.6

4.3 Claude Opus 4.7

5. Tiering Decision Framework

Decision matrix summary

6. Tier-Per-Turn-Class Recommendation Table

TC1 — Crisis-Detection Screen

TC2 — Chat Sentinel (Emotion/Speech-Act Observer)

TC3 — Greeting / Small-Talk / Registration Matching

TC4 — Reflective Mirroring (Mode 2)

TC5 — Question-Answering / Teaching (Modes 3-4)

TC6 — Crisis-Handling Conversational Flow

TC7 — Post-Session Synthesis / Weekly Summary (Batch)

7. Budget Flow: Sankey Analysis at 200k Turns/Month

Volume distribution (200,000 LLM invocations/month)

Monthly cost calculation (tiered architecture)

Comparison: naive all-Sonnet architecture

Sankey budget flow (textual representation)

8. Variant Rollout Plan

Stage 0 — Current state (today)

Stage 1 — Class-Targeted Haiku for TC1/TC2 (Week 1-2)

Stage 2 — Haiku for TC3/TC7 (Week 2-4)

Stage 3 — Opus 4.7 Canary for TC6 (Month 2)

Stage 4 — Steady State (Month 3+)

Gating metrics summary

9. Risk Analysis

9.1 Risk: Haiku Handles a TC4 Mode-2 Turn (Quality Regression)

9.2 Risk: Sonnet Handles TC1 Crisis Screen with Verbose Output

9.3 Risk: Cost Explosion on TC5 (Teaching) if Output Tokens Unconstrained

9.4 Risk: Opus 4.7 Latency Unacceptable for TC6

9.5 Risk: Prompt Cache Cold Start on Lambda Scaling Events

10. Pre-Experiment Instrumentation Checklist

11. Recommended variants.py Changes (Staged, Not Applied)

New categories to register

Changes to existing llm_model category

12. References

Changes to existing `llm_model` category