
Silent Infinity Model-Tiering Strategy v1 — One-Page Summary

Date: 2026-04-21 | Full memo: DARWIN-MODEL-TIERING-PROPOSAL-v1-2026-04-21.md

---

Tier-Per-Turn-Class Table

| Turn Class | Model | Max Tokens | Temp | Cache Strategy | Est. Unit Cost | p50 Latency |
|---|---|---|---|---|---|---|
| TC1 Crisis screen | Haiku 4.5 | 64 | 0.0 | 1h TTL on system prompt | $0.0003 | ~350ms |
| TC2 Chat Sentinel | Haiku 4.5 | 128 | 0.1 | 1h TTL on system prompt | $0.00057 | ~380ms (async) |
| TC3 Small-talk | Haiku 4.5 | 128 | 0.8 | 1h TTL on system prompt | $0.00055 | ~400ms |
| TC4 Reflective mirror | Sonnet 4.6 | 384 | 0.7 | 1h TTL on system prompt | $0.0059 | ~850ms |
| TC5 Teaching | Sonnet 4.6 | 768 | 0.65 | 1h + 5m on mode prefix | $0.0117 | ~1100ms |
| TC6 Crisis flow | Sonnet 4.6 / Opus 4.7 (5% canary) | 512 | 0.5 / N/A | 5m TTL on crisis prompt | $0.0078 / $0.039 | ~1000ms / ~2500ms |
| TC7 Batch synthesis | Haiku 4.5 + Batch API | 512 | 0.6 | No cache (batch async) | $0.0020 | <4h batch |

Fallback: all Haiku failures escalate to Sonnet. Opus failures fall back to Sonnet immediately (no retry delay on TC6).
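The tier table and fallback rule can be sketched as a small routing function. This is a minimal sketch under stated assumptions: the `Tier`, `TIERS`, `route`, and `ModelCall` names are hypothetical, the model strings are the memo's labels rather than API model IDs, the caller decides TC6 canary membership, Haiku fallbacks are assumed to reuse the class's token and temperature limits, and TC7's Batch API path is simplified to a synchronous call.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple

# Hypothetical client hook: (model, prompt, max_tokens, temperature) -> text.
# It is expected to raise on failure (timeout, API error, unparseable output).
ModelCall = Callable[[str, str, int, Optional[float]], str]


@dataclass(frozen=True)
class Tier:
    model: str                    # memo label, not an API model ID
    max_tokens: int
    temperature: Optional[float]  # None where the table lists N/A


# Rows of the tier table; TC6 lists its default Sonnet tier here.
TIERS = {
    "TC1": Tier("Haiku 4.5", 64, 0.0),
    "TC2": Tier("Haiku 4.5", 128, 0.1),
    "TC3": Tier("Haiku 4.5", 128, 0.8),
    "TC4": Tier("Sonnet 4.6", 384, 0.7),
    "TC5": Tier("Sonnet 4.6", 768, 0.65),
    "TC6": Tier("Sonnet 4.6", 512, 0.5),
    "TC7": Tier("Haiku 4.5", 512, 0.6),
}
TC6_OPUS_CANARY = Tier("Opus 4.7", 512, None)


def route(turn_class: str, prompt: str, call: ModelCall,
          opus_canary: bool = False) -> Tuple[str, str]:
    """Return (model_used, text), falling back at most once with no retry delay."""
    if turn_class == "TC6" and opus_canary:
        primary = TC6_OPUS_CANARY
        fallback = TIERS["TC6"]  # Opus failure: straight to Sonnet, no retry
    else:
        primary = TIERS[turn_class]
        # Haiku failures escalate to Sonnet (reusing the class's limits is an
        # assumption). Sonnet tiers have no stated fallback, so errors re-raise.
        fallback = (replace(primary, model="Sonnet 4.6")
                    if primary.model == "Haiku 4.5" else None)
    try:
        return primary.model, call(primary.model, prompt,
                                   primary.max_tokens, primary.temperature)
    except Exception:
        if fallback is None:
            raise
        return fallback.model, call(fallback.model, prompt,
                                    fallback.max_tokens, fallback.temperature)
```

Injecting `call` keeps the sketch independent of any particular SDK; a real client would map the memo labels to concrete model IDs and decide which exceptions count as "failures".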

---

Budget Flow: 200k turns/month


TIERED:   $1,419/month   vs   ALL-SONNET: $3,582/month  (60% cost reduction)

Haiku (TC1+TC2+TC3+TC7):  $213.50  — 73% of invocations, 15% of budget
Sonnet (TC4+TC5+TC6):   $1,205.20  — 27% of invocations, 85% of budget
Opus canary (TC6 5%):       ~$7.80  — 0.1% of invocations, <1% of budget

Largest line items: TC5 Teaching $702 (49%) | TC4 Mirroring $472 (33%)
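The budget percentages can be re-derived from the figures above. This is a minimal arithmetic check; the constant names are illustrative, and the implied TC4/TC5 turn volumes are back-calculated from the table's unit costs, not stated in the memo.

```python
# Figures copied from the budget summary (USD/month at 200k turns/month).
TIERED_TOTAL = 1419.00
ALL_SONNET_TOTAL = 3582.00
HAIKU_SUBTOTAL = 213.50
SONNET_SUBTOTAL = 1205.20
OPUS_CANARY = 7.80
TC5_TEACHING = 702.00
TC4_MIRRORING = 472.00

# Per-turn unit costs from the tier table.
UNIT_COST = {"TC4": 0.0059, "TC5": 0.0117}


def pct(part: float, whole: float) -> int:
    """Whole-number percentage using Python's round()."""
    return round(100 * part / whole)


def budget_check() -> dict:
    return {
        "reduction_pct": pct(ALL_SONNET_TOTAL - TIERED_TOTAL, ALL_SONNET_TOTAL),
        "haiku_share_pct": pct(HAIKU_SUBTOTAL, TIERED_TOTAL),
        "sonnet_share_pct": pct(SONNET_SUBTOTAL, TIERED_TOTAL),
        "tc5_share_pct": pct(TC5_TEACHING, TIERED_TOTAL),
        "tc4_share_pct": pct(TC4_MIRRORING, TIERED_TOTAL),
        "opus_share_pct": round(100 * OPUS_CANARY / TIERED_TOTAL, 2),
        # Back-calculated monthly volumes: an inference, not stated in the memo.
        "tc4_turns_implied": round(TC4_MIRRORING / UNIT_COST["TC4"]),
        "tc5_turns_implied": round(TC5_TEACHING / UNIT_COST["TC5"]),
        # Haiku + Sonnet subtotals alone, to compare with the headline total.
        "haiku_plus_sonnet": round(HAIKU_SUBTOTAL + SONNET_SUBTOTAL, 2),
    }


print(budget_check())
```

This reproduces the 60% reduction and the 15% / 85% / 49% / 33% shares, and implies about 80,000 TC4 and 60,000 TC5 turns per month. The Haiku and Sonnet subtotals alone sum to $1,418.70, so whether the ~$7.80 Opus canary sits inside the $1,419 headline is worth confirming against the full memo.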

---

Rollout Plan (4 stages)

| Stage | What | Gate condition | Timeline |
|---|---|---|---|
| 1 | Haiku for TC1/TC2 (dedicated categories) | Crisis false-negative rate ≤ 0.1% over 7 days; JSON parse rate ≥ 99% | Weeks 1-2 |
| 2 | Haiku for TC3 (20% canary), TC7 (100% batch) | TC3 response-completion-rate delta ≤ 5%; p50 ≤ 450ms | Weeks 2-4 |
| 3 | Opus 4.7 at 5% of TC6 | ≥ 200 crisis turns served; blind NPS Opus ≥ Sonnet; no false-escalation increase | Month 2 |
| 4 | Steady-state (promote or revert based on gates) | All metrics green | Month 3+ |
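The gate conditions for Stages 1 to 3 can be sketched as plain predicates over collected metrics. This is a hedged sketch: the dataclass and function names are hypothetical, rates are fractions (0.001 = 0.1%), "delta ≤ 5%" is read as an absolute drop of at most 5 points in completion rate, "no false-escalation increase" is read as the Opus rate not exceeding Sonnet's, and the 200 crisis turns are assumed to be counted on the Opus arm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage1:
    crisis_fn_rate: float    # crisis false-negative rate over the window
    window_days: int         # days of data behind the rate
    json_parse_rate: float   # share of TC1/TC2 outputs that parse as JSON


@dataclass(frozen=True)
class Stage2:
    completion_rate_control: float  # TC3 completion rate, current model
    completion_rate_canary: float   # TC3 completion rate, Haiku 20% canary
    p50_latency_ms: float           # canary p50 latency


@dataclass(frozen=True)
class Stage3:
    opus_crisis_turns: int
    nps_opus: float
    nps_sonnet: float
    false_escalation_opus: float
    false_escalation_sonnet: float


def stage1_passes(m: Stage1) -> bool:
    return (m.window_days >= 7
            and m.crisis_fn_rate <= 0.001
            and m.json_parse_rate >= 0.99)


def stage2_passes(m: Stage2) -> bool:
    # Assumption: "delta <= 5%" means an absolute drop of at most 5 points.
    delta = m.completion_rate_control - m.completion_rate_canary
    return delta <= 0.05 and m.p50_latency_ms <= 450


def stage3_passes(m: Stage3) -> bool:
    return (m.opus_crisis_turns >= 200
            and m.nps_opus >= m.nps_sonnet
            and m.false_escalation_opus <= m.false_escalation_sonnet)
```

Keeping each gate a pure function of recorded metrics makes the Stage 4 "promote or revert" decision reproducible from the logged data.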

Wire before any experiment: turn_class label in llm-costs.jsonl, response-completion-rate metric, p50/p95 latency per class, NPS with variant ID, crisis false-negative discordance log.
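A turn_class label in llm-costs.jsonl is enough to compute per-class latency percentiles. This is a minimal sketch: only turn_class comes from the memo; the other field names and the `cost_record` / `latency_by_class` helpers are hypothetical, and nearest-rank percentiles are used for simplicity rather than interpolation.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def cost_record(turn_class: str, model: str, variant_id: str,
                latency_ms: float, cost_usd: float) -> str:
    """One llm-costs.jsonl line; every field except turn_class is hypothetical."""
    return json.dumps({
        "turn_class": turn_class,
        "model": model,
        "variant_id": variant_id,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }, sort_keys=True)


def nearest_rank(sorted_vals: list, p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100 (list must be sorted)."""
    k = max(1, (p * len(sorted_vals) + 99) // 100)  # ceil(p * n / 100)
    return sorted_vals[k - 1]


def latency_by_class(lines: Iterable[str]) -> Dict[str, dict]:
    """Group llm-costs.jsonl lines by turn_class and report p50/p95 latency."""
    by_class = defaultdict(list)
    for line in lines:
        if line.strip():
            rec = json.loads(line)
            by_class[rec["turn_class"]].append(rec["latency_ms"])
    report = {}
    for turn_class, vals in by_class.items():
        vals.sort()
        report[turn_class] = {"n": len(vals),
                              "p50": nearest_rank(vals, 50),
                              "p95": nearest_rank(vals, 95)}
    return report
```

Logging the variant ID on the same record lets the NPS and latency comparisons be joined per variant without a second log.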

---

Top 3 Recommendations

1. Immediately separate TC1/TC2 into dedicated variant categories (crisis_detector_model, sentinel_model) and lock both to Haiku 4.5. This eliminates the risk that the global llm_model A/B bucket accidentally routes a vulnerable user's crisis screen through an unconfigured model.

2. Do not test Haiku on TC4 (reflective mirroring) until the Stage 2 gates pass on TC3. A quality regression on TC4 is the failure mode with the highest retention risk, and Haiku's verbal-mirroring capability has not been validated for Silent Infinity's core use case.

3. Stage Opus 4.7 for the TC6 crisis flow in Month 2, but inject a pre-computed acknowledgment ("I hear you") client-side within 300ms to mask Opus's ~2,500ms p50 latency, roughly 1,500ms slower than Sonnet's ~1,000ms. The constitutional reasoning advantage justifies the 5x cost premium ($0.039 vs $0.0078 per turn) only if the latency penalty is mitigated at the UI layer.
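Recommendation 1 can be sketched as a lookup in which pinned categories short-circuit any A/B assignment. The category names crisis_detector_model, sentinel_model, and llm_model come from the memo; `PINNED_CATEGORIES`, `resolve_model`, and the shape of the assignment dict are hypothetical and are not the variants.py entries referenced in the full memo.

```python
from typing import Dict

# Hypothetical pins: these categories never read the global llm_model bucket.
PINNED_CATEGORIES: Dict[str, str] = {
    "crisis_detector_model": "Haiku 4.5",  # TC1 crisis screen
    "sentinel_model": "Haiku 4.5",         # TC2 chat sentinel
}


def resolve_model(category: str, ab_assignment: Dict[str, str]) -> str:
    """Pick the model for a variant category.

    Pinned categories ignore the user's A/B assignment entirely, so an
    experiment on llm_model cannot reach the crisis screen or sentinel.
    Unpinned categories use their own assignment, else the global llm_model.
    """
    if category in PINNED_CATEGORIES:
        return PINNED_CATEGORIES[category]
    return ab_assignment.get(category, ab_assignment["llm_model"])
```

Because the pin is checked before the assignment, even a misconfigured experiment that writes to crisis_detector_model leaves TC1 on Haiku 4.5.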

No production code modified. Full recommendations and variants.py entries in T011 / full memo.