
TITAN self-sustaining — failure modes, gaps, hardening plan

Memo for Harnoor (A031), 2026-04-26

---

TL;DR

---

13 Failure Modes

1. Claude Code update breaks scheduled-tasks plugin — TITAN cron silent-fails

2. cloudflared disconnects — titan.livegroweveryday.com drops to 502

3. Laptop reboots without bridge auto-start — dashboard offline until manual restart

4. AWS credentials expire / IAM key rotates — SES + Bedrock + S3 silently fail

5. SES bounces (reputation score / send-quota) — daily emails stop

6. Bedrock rate-limit — Innerverse chat slow / breaks

7. GitHub mirror push fails — off-site backup misses

8. S3 cost spike — could blow $150/mo cap

9. Scanner OOM on huge .md scan — runaway memory

10. Bridge crashes — dashboard down

11. inbox-queue.jsonl corrupts — commands lost

12. Prime directive memo deleted / overwritten — TITAN forgets the rules

13. Hooks regression in Claude Code update — settings.json edits prompt forever

---

What's already covered (9 mitigations)

| # | Failure mode | Mitigation in place |
|---|---|---|
| 2 | cloudflared disconnect | Registered as Windows Service; auto-restarts on crash |
| 3 | Laptop reboot / bridge not starting | cloudflared + select scheduled tasks as Windows Services |
| 7 | GitHub mirror push fails | Nightly mirror script in scheduled-tasks |
| 8 | S3 cost spike | AWS Budget alert at $150/mo; S3 versioning + nightly backup |
| 9 | Scanner OOM | swarm-health-orchestrator monitors process health; partial mitigation |
| 10 | Bridge crashes | titan-bridge-watchdog auto-restarts bridge on crash |
| 11 | Queue corruption | JSONL append model (reduces risk); ask ledger as JSONL |
| 12 | Prime directive overwritten | titan_skill_writer.py allowlist enforces write permissions |
| 13 | Hooks regression | titan_skill_writer.py wrapper bypasses manual approval prompt |
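The JSONL append model in row 11 reduces risk because each line is an independent record, so a reader can skip one damaged line instead of losing the whole queue. The following is a minimal sketch of such a tolerant reader; the function name `read_jsonl_tolerant` and the record shape are hypothetical and do not describe how TITAN actually reads inbox-queue.jsonl.

```python
import json
from typing import Any, Iterable, List, Tuple


def read_jsonl_tolerant(lines: Iterable[str]) -> Tuple[List[Any], List[int]]:
    """Parse JSONL input, skipping lines that are not valid JSON.

    Returns the parsed records and the 1-based numbers of the lines
    that could not be parsed, so they can be logged or quarantined.
    """
    records: List[Any] = []
    bad_lines: List[int] = []
    for lineno, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # blank lines are ignored, not treated as corruption
        try:
            records.append(json.loads(text))
        except json.JSONDecodeError:
            bad_lines.append(lineno)
    return records, bad_lines
```

Passing an open file handle works because iterating a text file yields its lines. A non-empty `bad_lines` result is a useful signal in its own right: it means a write was truncated or interleaved and the affected commands need to be re-issued.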

---

Critical gaps (4)

Gap 1 — No out-of-band liveness monitor (SMS/push)

TITAN has no mechanism to alert Harnoor via a channel independent of the laptop (SMS, push notification, phone call) when TITAN has been silent for more than 60 minutes during expected-active hours. If the laptop dies, the bridge crashes in a loop, or the cron daemon silently fails, Harnoor has no signal until he checks manually. This is the most dangerous gap.
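The alerting rule described here can be sketched as a small pure function. This is an illustration under stated assumptions: the 60-minute threshold comes from this memo, but the active-hours window (08:00 to 22:00) and all names are hypothetical. To close the gap, the check would have to run somewhere other than the laptop, fed by a heartbeat timestamp that TITAN pushes to an external store.

```python
from datetime import datetime, time, timedelta

SILENCE_LIMIT = timedelta(minutes=60)  # threshold stated in the memo
ACTIVE_START = time(8, 0)              # assumed start of expected-active hours
ACTIVE_END = time(22, 0)               # assumed end of expected-active hours


def in_active_hours(now: datetime) -> bool:
    """True when `now` falls inside the expected-active window."""
    return ACTIVE_START <= now.time() < ACTIVE_END


def should_alert(last_heartbeat: datetime, now: datetime) -> bool:
    """True when TITAN has been silent longer than the limit during active hours."""
    if not in_active_hours(now):
        return False
    return now - last_heartbeat > SILENCE_LIMIT
```

Keeping the decision separate from the delivery channel (SMS, push, phone call) means the rule can be tested without sending anything, and the channel can be swapped later.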

Gap 2 — No automated regression smoke after Claude Code updates

Claude Code is updated without version pinning. Any update can regress the scheduled-tasks plugin, the hooks schema, or skill-writer behavior. There is no automated smoke test that fires within 30 minutes of a Claude Code version change to verify that: (a) cron jobs still fire, (b) memory reads/writes work, (c) hook approval prompts are suppressed.
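One possible shape for such a smoke test is sketched below. It assumes the current version string is obtained elsewhere and passed in, and that each of the checks (a) to (c) is supplied as a zero-argument callable returning True on success. The function names and the state-file approach are hypothetical, not an existing TITAN component.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def version_changed(current: str, state_file: Path) -> bool:
    """Compare the current version against the last version that passed smoke."""
    previous = state_file.read_text().strip() if state_file.exists() else ""
    return current != previous


def run_smoke(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every check; a check that raises counts as a failure."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def smoke_after_update(
    current: str, state_file: Path, checks: Dict[str, Callable[[], bool]]
) -> Optional[Dict[str, bool]]:
    """Run the smoke checks only when the version changed.

    Returns None when nothing changed. The new version is recorded only
    if every check passes, so a failing version is re-tested on the next run.
    """
    if not version_changed(current, state_file):
        return None
    results = run_smoke(checks)
    if all(results.values()):
        state_file.write_text(current)
    return results
```

Because the scheduled-tasks plugin is one of the things being tested, the job that calls `smoke_after_update` would need to be triggered by something outside that plugin; which scheduler to use is left open here.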

Gap 3 — No disk-resident recovery runbook

If TITAN goes completely dark (laptop destroyed, bridge unrecoverable, memory corrupted), there is no single document, on disk or in a known external location, that tells future-Harnoor how to reconstruct TITAN from scratch on a new machine. The institutional knowledge lives in agent memory and in Harnoor's head, and neither can be relied on in a true disaster.

Gap 4 — No automated failover to a second device

The laptop is a single point of failure for all of TITAN: the bridge, the dashboard, the scheduled tasks, and the local file system that backs memory. There is no hot or warm standby. A hardware failure means days of downtime and potential permanent data loss for anything not mirrored to S3/GitHub.

---

Alternative architectures if Claude Code fails

Aider (recommended first failover)

Cursor CLI

Custom Anthropic SDK script

n8n + Anthropic API

Mac Mini lights-out remote

---

Concrete 30-day hardening plan

Action 1 — Heartbeat monitor + SMS alert (est. 3 hr)

Action 2 — Disk-resident RECOVERY-RUNBOOK.md (est. 1 hr)

Action 3 — Pin Claude Code version + weekly smoke regression (est. 2 hr)

Action 4 — AWS cost circuit-breaker extension (est. 2 hr)

Action 5 — Full restore-from-S3 drill (est. 4 hr, schedule within 30 days)
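For Action 4, one way a cost circuit-breaker could decide when to trip is to project month-end spend from month-to-date spend. The sketch below uses the $150/mo cap from this memo; the linear projection and the function names are assumptions for illustration, and fetching the actual spend figure from AWS is out of scope here.

```python
import calendar
from datetime import date

MONTHLY_CAP_USD = 150.0  # cap stated in the memo


def projected_month_spend(spend_to_date: float, today: date) -> float:
    """Linearly extrapolate month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def should_trip(spend_to_date: float, today: date, cap: float = MONTHLY_CAP_USD) -> bool:
    """True when the projected month-end spend exceeds the cap."""
    return projected_month_spend(spend_to_date, today) > cap
```

A linear projection is noisy early in the month, when one expensive day dominates the average, so a real breaker might also require a minimum number of elapsed days before acting.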

---

Risk matrix (post-hardening)

| Failure mode | Pre-hardening risk | Post-hardening risk (if all 5 actions done) |
|---|---|---|
| Claude Code update breaks cron | HIGH | LOW |
| cloudflared disconnect | LOW | LOW |
| Laptop reboot, no auto-start | MEDIUM | MEDIUM (Gap 4 not closed) |
| AWS creds expire | MEDIUM | MEDIUM (no expiry alert yet) |
| SES bounce | LOW | LOW |
| Bedrock rate-limit | LOW | LOW |
| GitHub mirror fails | LOW | LOW |
| S3 cost spike | LOW | LOW |
| Scanner OOM | MEDIUM | MEDIUM |
| Bridge crash-loop | MEDIUM | LOW (watchdog exists) |
| Queue corruption | LOW | LOW |
| Prime directive lost | LOW | LOW |
| Hooks regression | LOW | LOW |
| TITAN goes dark, no alert | CRITICAL | LOW (Action 1 closes this) |

---

Appendix: files referenced

---

— TITAN SCOUT, 2026-04-26. Research complete. No fabricated sources. All mitigations cross-referenced against known F:/TITAN file structure.