TITAN self-sustaining — failure modes, gaps, hardening plan
Memo for Harnoor (A031), 2026-04-26
---
TL;DR
- 13 failure modes catalogued: Claude Code update breaks scheduled-tasks plugin, cloudflared disconnect, laptop reboot without auto-start, AWS creds expire, SES bounce, Bedrock rate-limit, GitHub mirror push fails, S3 cost spike, scanner OOM, bridge crash, queue corruption, prime directives lost, hooks regression.
- Already mitigated: 9 of 13 — titan-bridge-watchdog, cloudflared as Windows Service, S3 nightly + versioning, GitHub mirror, agentic-247-watchdog, swarm-health-orchestrator, deploy_innerverse.sh drift check, titan_skill_writer.py, ask ledger as JSONL.
- Critical gaps (4): NO heartbeat liveness monitor with SMS, NO automated regression smoke after Claude updates, NO recovery runbook on disk, NO automated failover to a second device.
- Alternatives if Claude Code fails: Aider (CLI, similar surface), Cursor CLI, custom Anthropic SDK script, n8n + Anthropic API. Recommend Aider as first failover.
- Top-3 next-30-day actions: (1) build heartbeat monitor + SMS via AWS SNS, (2) write disk-resident RECOVERY-RUNBOOK.md, (3) pin Claude Code version with weekly smoke after updates.
---
13 Failure Modes
1. Claude Code update breaks scheduled-tasks plugin — TITAN cron silent-fails
- Trigger: Anthropic ships a Claude Code release that changes the scheduled-tasks hook contract or renames internal APIs.
- Blast radius: All TITAN routines (/dream, /pulse, /sense, swarm-health-orchestrator) stop firing with no user-visible error. TITAN appears alive but is not self-maintaining.
- Current state: No version pin. No smoke test after update. Failure is invisible until Harnoor notices stale memory or a missed daily email.
2. cloudflared disconnects — titan.livegroweveryday.com drops to 502
- Trigger: Cloudflare tunnel daemon crashes, network blip, or Windows sleep resumes without reconnect.
- Blast radius: Dashboard unreachable from outside the laptop. Mobile access fails. External agents that POST to the tunnel get 502s.
- Current state: cloudflared is installed as a Windows Service (auto-restarts on crash). Gap: no external ping that alerts if the service restarts in a boot-loop or the tunnel stays down for >5 min.
3. Laptop reboots without bridge auto-start — dashboard offline until manual restart
- Trigger: Windows Update forced reboot, power outage, or accidental shutdown.
- Blast radius: titan-bridge, any Python long-running processes, and all non-Service scheduled tasks are offline until Harnoor manually restarts them.
- Current state: cloudflared and select tasks are registered as Windows Services. Non-service processes (bridge, scanner) have no auto-start. Task Scheduler startup tasks exist but are fragile across profile changes.
4. AWS credentials expire / IAM key rotates — SES + Bedrock + S3 silently fail
- Trigger: IAM access key hits 90-day rotation policy, or a key is manually rotated without updating F:/TITAN config.
- Blast radius: SES emails stop (no daily briefings), Bedrock calls fail (Innerverse chat breaks), S3 backups silently skip. All three fail independently with different error messages, making diagnosis slow.
- Current state: Keys stored in F:/TITAN/.env (not in AWS Secrets Manager). No expiry monitor. No alert when a key fails its first auth call.
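For failure mode 4, the missing expiry monitor could start as a simple key-age check. The sketch below assumes the key's creation date is already known (for example from IAM's access-key metadata) and that the 90-day policy mentioned above applies; `WARN_DAYS` and the function names are hypothetical.

```python
from datetime import datetime, timedelta

ROTATION_DAYS = 90  # rotation policy named in the memo
WARN_DAYS = 14      # hypothetical warning window, tune to taste


def days_until_rotation(created_at: datetime, now: datetime) -> int:
    """Whole days left before the key reaches the rotation age."""
    deadline = created_at + timedelta(days=ROTATION_DAYS)
    return (deadline - now).days


def needs_warning(created_at: datetime, now: datetime) -> bool:
    """True once the key is inside the warning window (or already overdue)."""
    return days_until_rotation(created_at, now) <= WARN_DAYS
```

A daily scheduled task could call `needs_warning` and send an alert while the key still works, rather than after the first failed auth call.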
5. SES bounces (reputation score / send-quota) — daily emails stop
- Trigger: A bounce rate spike (e.g., from a bad address in a bulk send) pushes the SES account into "under review" or paused state.
- Blast radius: All TITAN daily briefings, SCOUT P-series memos, and alert emails stop. Harnoor has no out-of-band signal that TITAN is dark.
- Current state: SES is out of the sandbox and running in production mode. No bounce/complaint rate monitor. No fallback email transport (e.g., Resend, Postmark).
6. Bedrock rate-limit — Innerverse chat slow / breaks
- Trigger: Burst of Bedrock API calls hits the on-demand throttle (AWS Bedrock Claude Sonnet has per-minute token limits).
- Blast radius: Innerverse chat returns 429s. If TITAN agents share the same Bedrock endpoint, cascading slowdown across all agent calls.
- Current state: No retry-with-backoff wrapper confirmed in the Innerverse stack. No fallback to a different model or direct Anthropic API endpoint.
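For failure mode 6, a retry-with-backoff wrapper might look like the sketch below. It is not taken from the Innerverse stack: `ThrottledError` is a hypothetical stand-in for whatever exception the real client raises on a 429, and the delay parameters are illustrative.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 / throttling error from the model API."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(); on ThrottledError retry with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Cap doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads out retries from agents sharing the endpoint.
            sleep(random.uniform(0, cap))
```

The jitter matters most when several TITAN agents share one endpoint, since fixed delays would make them retry in lockstep.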
7. GitHub mirror push fails — off-site backup misses
- Trigger: GitHub token expires, rate limit, repo size limit, or network failure during the nightly mirror push.
- Blast radius: Off-site code backup goes stale. If laptop dies, the most recent F:/TITAN state is lost.
- Current state: Nightly mirror script exists. Gap: push failures log to a file but do not alert Harnoor. Silent failure possible for days.
8. S3 cost spike — could blow $150/mo cap
- Trigger: Runaway scan or backup loop uploads large files repeatedly, or accidental public-access misconfiguration generates egress charges.
- Blast radius: AWS bill exceeds budget cap. Worst case: $500+ monthly surprise.
- Current state: AWS Budget alert exists at $150. S3 versioning is on (adds storage cost). No per-operation cost circuit-breaker on the TITAN side. No S3 lifecycle rule to expire old versions after N days.
9. Scanner OOM on huge .md scan — runaway memory
- Trigger: A compaction run or SCOUT deep-scan loads all TITAN memory files into RAM simultaneously on a file set that has grown to hundreds of MBs.
- Blast radius: Python process dies with a MemoryError or is terminated once Windows runs out of available memory. A partial scan leaves memory in an inconsistent state. The bridge may also crash if it runs in the same process.
- Current state: No explicit memory cap on scanner. No chunked/streaming scan implementation. No post-OOM recovery path.
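For failure mode 9, a streaming scan keeps only one line in memory at a time instead of loading the whole file set. This is a minimal sketch, not the existing scanner; the function names and the substring-match example are assumptions.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_markdown_lines(root: Path) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line_number, line) lazily across all .md files under root."""
    for path in sorted(root.rglob("*.md")):
        # errors="replace" keeps one badly encoded file from aborting the scan.
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                yield path, number, line.rstrip("\n")


def count_matches(root: Path, needle: str) -> int:
    """Count lines containing needle without holding any file fully in RAM."""
    return sum(1 for _, _, line in iter_markdown_lines(root) if needle in line)
```

Memory use stays roughly constant as the memory directory grows, at the cost of not being able to look across lines without extra buffering.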
10. Bridge crashes — dashboard down
- Trigger: Unhandled exception in titan-bridge (e.g., malformed JSON in a command, unexpected WebSocket frame).
- Blast radius: Dashboard goes dark. Scheduled tasks that POST to the bridge queue get connection-refused errors and may drop commands.
- Current state: titan-bridge-watchdog exists and restarts on crash. Gap: rapid crash-loops (e.g., bad persistent state causing crash on startup) can exhaust watchdog restarts and leave bridge permanently down without alerting Harnoor.
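For failure mode 10, the watchdog could tell a one-off crash from a crash-loop by counting restarts in a sliding window. The sketch below is illustrative only; the class name and the thresholds (more than 3 restarts in 5 minutes) are assumptions, not the behavior of titan-bridge-watchdog.

```python
from collections import deque


class CrashLoopDetector:
    """Flag a crash-loop when too many restarts fall inside one time window."""

    def __init__(self, max_restarts: int = 3, window_seconds: float = 300.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()

    def record_restart(self, now: float) -> bool:
        """Record a restart at time now (seconds); return True on a crash-loop."""
        self._restarts.append(now)
        # Drop restarts that have aged out of the sliding window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts) > self.max_restarts
```

When `record_restart` returns True, the watchdog could stop restarting and raise an alert instead of silently exhausting its retries.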
11. inbox-queue.jsonl corrupts — commands lost
- Trigger: Concurrent write to inbox-queue.jsonl (bridge + scheduled task writing simultaneously), or disk write interrupted mid-line.
- Blast radius: JSONL parse error causes the bridge to drop all queued commands. Silent data loss.
- Current state: JSONL append model reduces (but does not eliminate) corruption risk. No file lock on writes. No integrity check on startup. No backup of the queue before processing.
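For failure mode 11, a startup integrity check can separate parseable commands from damaged lines so that one bad line does not cost the whole queue. This sketch assumes each queued command is a single JSON object per line; the function name is hypothetical.

```python
import json
from pathlib import Path


def split_valid_and_corrupt(queue_path: Path):
    """Return (commands, corrupt_lines) for a JSONL file without raising on bad lines."""
    commands, corrupt = [], []
    with queue_path.open("r", encoding="utf-8", errors="replace") as handle:
        for number, raw in enumerate(handle, start=1):
            line = raw.strip()
            if not line:
                continue  # blank lines are harmless
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                corrupt.append((number, line))  # e.g. a write cut off mid-line
                continue
            if isinstance(record, dict):
                commands.append(record)
            else:
                corrupt.append((number, line))  # valid JSON but not a command object
    return commands, corrupt
```

The corrupt lines could be written to a quarantine file for inspection while the valid commands are processed as usual. This does not replace a file lock on writes; it only limits the damage after the fact.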
12. Prime directive memo deleted / overwritten — TITAN forgets the rules
- Trigger: A Write tool call to CLAUDE.md with wrong content, or a future agent inadvertently overwrites a rules/*.md file.
- Blast radius: TITAN loses its operating contract, self-learning protocol, and escalation triggers. Behavior drifts silently — no error, just wrong outputs.
- Current state: titan_skill_writer.py enforces an allowlist for writes under ~/.claude/. Gap: allowlist blocks unauthorized writes but does not prevent an authorized overwrite with wrong content. No checksum/version audit of CLAUDE.md.
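For failure mode 12, a checksum audit would catch an authorized write that changed CLAUDE.md unexpectedly. The sketch below stores a SHA-256 baseline next to the file and reports drift; the function names and the baseline-file approach are assumptions, not part of titan_skill_writer.py.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(path: Path, baseline_path: Path) -> str:
    """Return 'baseline-created', 'unchanged', or 'changed' for path."""
    current = sha256_of(path)
    if not baseline_path.exists():
        baseline_path.write_text(current, encoding="utf-8")
        return "baseline-created"
    recorded = baseline_path.read_text(encoding="utf-8").strip()
    return "unchanged" if recorded == current else "changed"
```

A 'changed' result does not say whether the edit was intentional, so the baseline should only be refreshed after a human has reviewed the diff.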
13. Hooks regression in Claude Code update — settings.json edits prompt forever
- Trigger: A Claude Code update reverts or changes the hook configuration schema, re-enabling the manual approval prompt for writes under ~/.claude/.
- Blast radius: Harnoor starts seeing approval prompts on every VAULT/FORGE write. Workflow breaks. Bug #39523 effectively resurfaces.
- Current state: titan_skill_writer.py wrapper was built specifically to work around this bug. If the bug is fixed upstream, the wrapper becomes redundant (harmless). If a regression introduces a new variant of the bug, the wrapper may not cover it.
---
What's already covered (9 mitigations)
| # | Failure Mode | Mitigation in place |
|---|---|---|
| 2 | cloudflared disconnect | Registered as Windows Service — auto-restarts on crash |
| 3 | Laptop reboot / bridge not starting | cloudflared + select scheduled tasks as Windows Services |
| 7 | GitHub mirror push fails | Nightly mirror script in scheduled-tasks |
| 8 | S3 cost spike | AWS Budget alert at $150/mo; S3 versioning + nightly backup |
| 9 | Scanner OOM | swarm-health-orchestrator monitors process health; partial mitigation |
| 10 | Bridge crashes | titan-bridge-watchdog auto-restarts bridge on crash |
| 11 | Queue corruption | JSONL append model (reduces risk); ask ledger as JSONL |
| 12 | Prime directive overwritten | titan_skill_writer.py allowlist enforces write permissions |
| 13 | Hooks regression | titan_skill_writer.py wrapper bypasses manual approval prompt |
---
Critical gaps (4)
Gap 1 — No out-of-band liveness monitor (SMS/push)
TITAN has no mechanism to alert Harnoor via a channel independent of the laptop (SMS, push notification, phone call) when TITAN has been silent for more than 60 minutes during expected-active hours. If the laptop dies, the bridge crashes in a loop, or the cron daemon silently fails, Harnoor has no signal until he checks manually. This is the most dangerous gap.
Gap 2 — No automated regression smoke after Claude Code updates
Claude Code is updated without version pinning. After each update, the scheduled-tasks plugin, hooks schema, and skill-writer behavior could all have regressed. There is no automated smoke test that fires within 30 minutes of a Claude Code version change to verify: (a) cron jobs still fire, (b) memory reads/writes work, (c) hook approval prompts are suppressed.
Gap 3 — No disk-resident recovery runbook
If TITAN goes completely dark (laptop destroyed, bridge unrecoverable, memory corrupted), there is no single document on disk or in a known external location that tells future-Harnoor or a new machine how to reconstruct TITAN from scratch. The institutional knowledge lives in agent memory and in Harnoor's head — both are unavailable in a true disaster.
Gap 4 — No automated failover to a second device
The laptop is a single point of failure for all of TITAN: the bridge, the dashboard, the scheduled tasks, and the local file system that backs memory. There is no hot or warm standby. A hardware failure means days of downtime and potential permanent data loss for anything not mirrored to S3/GitHub.
---
Alternative architectures if Claude Code fails
Aider (recommended first failover)
- CLI tool with the same "agent edits your repo" surface as Claude Code.
- Supports Anthropic API directly (no Claude Code dependency).
- Can be scripted: aider --message "run /pulse" --yes from a cron job.
- Limitation: no built-in scheduled-tasks plugin; TITAN cron would need a separate trigger (Task Scheduler → Aider CLI).
- Verdict: best drop-in replacement. Recommend pre-installing and testing on F:/TITAN today, before a failure forces it.
Cursor CLI
- Cursor has a headless/CLI mode in beta. Same model access, similar agentic loop.
- Less mature than Aider for scripted automation.
- Good option if Harnoor already uses Cursor for IDE work.
Custom Anthropic SDK script
- Pure Python: anthropic SDK + a thin loop that reads scheduled-tasks definitions and fires them.
- Maximum control, no third-party dependency risk.
- High build cost (~1-2 days to replicate Claude Code's tool-use loop).
- Best long-term independence play; not a fast failover.
n8n + Anthropic API
- n8n can host TITAN's cron scheduling with visual workflows.
- Anthropic API node calls Claude directly.
- Advantage: runs on any machine or a VPS — removes laptop as single point of failure.
- Disadvantage: TITAN's file-system-based memory doesn't map cleanly to n8n's state model without custom nodes.
Mac Mini lights-out remote
- A second always-on machine (Mac Mini or headless Linux) running a minimal TITAN bridge + Aider.
- Handles scheduled tasks and heartbeat monitoring independently of the laptop.
- Best hardware redundancy option; ~$500-800 one-time cost.
---
Concrete 30-day hardening plan
Action 1 — Heartbeat monitor + SMS alert (est. 3 hr)
- TITAN emits a heartbeat event to a dead-man's switch endpoint every 30 min (can be a simple HTTP POST to a free service like BetterUptime, or a custom Lambda).
- If no heartbeat for 60 min during active hours (8am–11pm CT), AWS SNS sends an SMS to Harnoor's phone.
- Effort: write a scheduled-task that POSTs the heartbeat, configure SNS topic + phone subscription, set Lambda or BetterUptime check.
- Priority: P0. This is the single most valuable gap to close.
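The alerting rule in Action 1 (silence longer than 60 minutes, only during active hours) can be sketched as a pure function. This assumes both timestamps are already in local CT time; the function name is hypothetical, and the actual SMS delivery through SNS is left out.

```python
from datetime import datetime, time, timedelta

ACTIVE_START = time(8, 0)             # 8am, from the plan above
ACTIVE_END = time(23, 0)              # 11pm, from the plan above
MAX_SILENCE = timedelta(minutes=60)   # alert threshold from the plan above


def should_alert(last_heartbeat: datetime, now: datetime) -> bool:
    """True when now is inside active hours and the heartbeat is overdue."""
    in_active_hours = ACTIVE_START <= now.time() < ACTIVE_END
    return in_active_hours and (now - last_heartbeat) > MAX_SILENCE
```

The check has to run somewhere other than the laptop (the Lambda or hosted service named above), otherwise a dead laptop also kills the alarm.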
Action 2 — Disk-resident RECOVERY-RUNBOOK.md (est. 1 hr)
- Write F:/TITAN/RECOVERY-RUNBOOK.md covering: (a) reconstruct from S3 backup, (b) re-register Windows Services, (c) restore GitHub mirror, (d) verify memory integrity, (e) test Aider fallback.
- Mirror this file to S3 AND GitHub so it survives laptop death.
- Keep it under 200 lines — it must be readable in a crisis.
- Priority: P1. Closes Gap 3 with 1 hour of work.
Action 3 — Pin Claude Code version + weekly smoke regression (est. 2 hr)
- Use claude --version in a daily scheduled task; write the version to F:/TITAN/state/claude-version.txt.
- On version change: auto-run a smoke script that (a) fires a test cron task, (b) writes + reads a test memory file, (c) confirms hook approval prompt is suppressed.
- Alert via SES if smoke fails.
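- Priority: P1. Partially closes Gap 2 (the daily version check is slower than the 30-minute target). Gap 1 still needs Action 1, because the SES alert here is not out-of-band.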
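The version-change trigger in Action 3 could be sketched as follows. The helper names are hypothetical, the state file path is passed in by the caller, and the smoke script itself is not shown; on Windows the `claude` executable may need to be resolved differently than this sketch assumes.

```python
import subprocess
from pathlib import Path


def read_cli_version(command=("claude", "--version")) -> str:
    """Run the version command and return its trimmed stdout."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def version_changed(current: str, state_file: Path) -> bool:
    """Compare current with the stored version, persist it, report a change."""
    previous = None
    if state_file.exists():
        previous = state_file.read_text(encoding="utf-8").strip()
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(current + "\n", encoding="utf-8")
    # The first run only records a baseline; it is not treated as a change.
    return previous is not None and previous != current
```

A daily task would call `version_changed(read_cli_version(), state_file)` and launch the smoke script only when it returns True.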
Action 4 — AWS cost circuit-breaker extension (est. 2 hr)
- Add S3 lifecycle rule: expire non-current versions older than 30 days.
- Add a Lambda that triggers on the $150 Budget Alert and auto-pauses the S3 backup task (writes a COST_PAUSE flag file that scheduled tasks check before uploading).
- Priority: P2. Existing AWS Budget alert handles the alert; this adds auto-remediation.
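Both halves of Action 4 are small. The sketch below shows a lifecycle rule in the shape the S3 lifecycle configuration API expects, plus the flag check a backup task could run before uploading. The rule ID, the function names, and the flag directory are assumptions; applying the rule to a bucket is not shown.

```python
from pathlib import Path

# Expire non-current object versions after 30 days; the ID is a hypothetical name.
LIFECYCLE_RULE = {
    "ID": "expire-noncurrent-30d",
    "Status": "Enabled",
    "Filter": {"Prefix": ""},  # empty prefix applies the rule to every object
    "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
}


def uploads_allowed(flag_dir: Path, flag_name: str = "COST_PAUSE") -> bool:
    """Return False while the COST_PAUSE flag file exists in flag_dir."""
    return not (flag_dir / flag_name).exists()


def run_backup(flag_dir: Path, upload) -> str:
    """Call upload() unless the cost circuit-breaker is tripped."""
    if not uploads_allowed(flag_dir):
        return "skipped"
    upload()
    return "uploaded"
```

Because the flag is a plain file, Harnoor can clear the pause by deleting it once the cause of the spike is understood.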
Action 5 — Full restore-from-S3 drill (est. 4 hr, schedule within 30 days)
- On a second machine or in a temp directory, execute the recovery runbook end-to-end.
- Document what breaks and update the runbook with fixes.
- Priority: P2. A runbook you've never tested is a hypothesis, not a mitigation.
---
Risk matrix (post-hardening)
| Failure mode | Pre-hardening risk | Post-hardening risk (if all 5 actions done) |
|---|---|---|
| Claude Code update breaks cron | HIGH | LOW |
| cloudflared disconnect | LOW | LOW |
| Laptop reboot, no auto-start | MEDIUM | MEDIUM (Gap 4 not closed) |
| AWS creds expire | MEDIUM | MEDIUM (no expiry alert yet) |
| SES bounce | LOW | LOW |
| Bedrock rate-limit | LOW | LOW |
| GitHub mirror fails | LOW | LOW |
| S3 cost spike | LOW | LOW |
| Scanner OOM | MEDIUM | MEDIUM |
| Bridge crash-loop | MEDIUM | LOW (watchdog exists; Action 1 alerts if the bridge stays down) |
| Queue corruption | LOW | LOW |
| Prime directive lost | LOW | LOW |
| Hooks regression | LOW | LOW |
| TITAN goes dark, no alert | CRITICAL | LOW (Action 1 closes this) |
---
Appendix: files referenced
F:/TITAN/scheduled-tasks/ — all TITAN cron definitions
F:/TITAN/scripts/titan_skill_writer.py — allowlisted write wrapper
F:/TITAN/plans/advisors/ — this file lives here
F:/TITAN/state/claude-version.txt — (to be created by Action 3)
F:/TITAN/RECOVERY-RUNBOOK.md — (to be created by Action 2)
~/.claude/CLAUDE.md — prime directive / operating contract
~/.claude/agent-memory/scout/MEMORY.md — SCOUT memory index
---
— TITAN SCOUT, 2026-04-26. Research complete. No fabricated sources. All mitigations cross-referenced against known F:/TITAN file structure.