Claude Code Architecture Deep Dive — PhD-Level Analysis

Date: 2026-04-22 | Author: SCOUT (TITAN research agent) | Classification: Strategic — Internal

---

Preamble: The Leak and Its Significance

On March 31, 2026, Anthropic accidentally published the entire source code of Claude Code inside the npm package @anthropic-ai/claude-code v2.1.88. A missing .npmignore entry shipped a 59.8 MB source map containing 512,000 lines of unobfuscated TypeScript across approximately 1,900 files. The community-driven deobfuscation effort (github.com/ghuntley/claude-code-source-code-deobfuscation) made it fully readable within days. Concurrent analysis from VILA Lab produced a systematic academic paper (arxiv.org/html/2604.14228v1). Layer5, Bits-Bytes-NN, and Dev.to contributors published architectural breakdowns. The local shell snapshot at C:\Users\Harnoor\.claude\shell-snapshots\ confirms Claude Code v2.1.49 is installed on this machine, and TITAN's warm memory directory (F:\TITAN\knowledge\memory\warm\claude-code\) already contains twelve research files from the April 2026 deep-dive sprint.

This memo synthesizes all of that into a PhD-level analysis through three lenses, followed by a Silent Infinity implementation roadmap.

Key global insight from the leak: Only 1.6% of Claude Code's codebase is AI decision logic. The remaining 98.4% is deterministic infrastructure — permission gates, context management, tool routing, compaction, and recovery logic. The model is the engine; the harness is the car. This ratio is the most important architectural signal in the entire codebase.

---

Lens 1 — AI Product Architect: What Makes Claude Code Feel Different

1.1 System Prompt Layering: A Conditional Stack, Not a Static Document

Claude Code does not send a fixed system prompt. It assembles one dynamically per session using a conditional layering pipeline with at least six ordered slots:

Layer 0 — Managed Policy (override). Organization-level rules pushed by enterprise admins. These override everything below them and cannot be modified by the user. This is the "thou shalt not" floor.

Layer 1 — Foundation. A session-tone intro that varies based on output style settings (verbose vs. compact), followed immediately by system rules governing tools, permissions, prompt injection protection, <system-reminder> tag handling, and the five-layer context compaction behavior.

Layer 2 — Core Coding Philosophy. Hard-coded behavioral constraints: "Don't add features, refactor code, or make 'improvements' beyond what was asked." "Prefer dedicated tools (Read, Edit, Glob, Grep) over raw shell commands." This layer enforces discipline — it is what prevents the model from scope-creeping into refactors the user didn't request.

Layer 3 — Session Guidance. Conditional components that only appear when specific tools are enabled: clarifying-question behavior, shell shortcuts, sub-agent invocation instructions, skills invocation syntax. If sub-agents are disabled, this section is absent and its tokens are saved.

Layer 4 — Cache Boundary Marker. A deliberate architectural separator that marks where globally-cacheable content ends and session-specific material begins. This is a prompt cache optimization: Anthropic's API can cache the prefix up to this marker, so on each turn the prefix is billed at the discounted cache-read rate and only the suffix is billed as fresh input.

Layer 5 — Tool Definitions. Approximately 50 tool descriptions, with a tool search beta (tool-search-2025-10-16) that loads only tool names initially and defers full schemas to on-demand retrieval. This prevents the system prompt from bloating by thousands of tokens for tools that will never be used in a given session.

The CLAUDE.md revelation. CLAUDE.md is NOT part of the system prompt. It arrives as a USER message after the system prompt. This has profound implications: it cannot override system-prompt-level behavior, but it can be edited freely, survives compaction (it is re-read from disk), and acts as a persistent user-controlled context injection. The first 200 lines / 25KB of MEMORY.md load at session start. Nested CLAUDE.md files in subdirectories are lazy-loaded — they only inject when the agent reads files in those directories.

Why this matters for felt intelligence: The layered assembly means Claude Code always behaves consistently at the core (Layers 0-2) while feeling contextually aware (Layers 3-5 adapt to session state). The model feels like it "knows the rules" and "knows the context" without needing to be retrained.
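The conditional layering described above can be sketched in a few lines of Python. This is a minimal illustration, not the leaked implementation: the Layer class, the assemble_prompt function, the session keys, and the CACHE_BOUNDARY string are all hypothetical names chosen for this memo.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical marker string; the real separator is not given in this memo.
CACHE_BOUNDARY = "<<cache-boundary>>"


@dataclass(frozen=True)
class Layer:
    name: str
    render: Callable[[dict], str]                      # builds the layer text from session state
    enabled: Callable[[dict], bool] = lambda s: True   # conditional inclusion


def assemble_prompt(layers: list[Layer], session: dict) -> str:
    """Concatenate enabled layers in order, skipping disabled or empty ones."""
    parts = []
    for layer in layers:
        if layer.enabled(session):
            text = layer.render(session)
            if text:
                parts.append(text)
    return "\n\n".join(parts)


LAYERS = [
    Layer("managed_policy", lambda s: s.get("policy", "")),
    Layer("foundation", lambda s: f"Tone: {s.get('style', 'compact')}"),
    Layer("coding_philosophy", lambda s: "Do only what was asked."),
    Layer("session_guidance",
          lambda s: "Sub-agent instructions...",
          enabled=lambda s: s.get("subagents_enabled", False)),
    Layer("cache_boundary", lambda s: CACHE_BOUNDARY),
    Layer("tool_definitions", lambda s: "Tools: " + ", ".join(s.get("tools", []))),
]
```

Calling assemble_prompt(LAYERS, {"tools": ["Read"]}) omits the session-guidance layer entirely, which is the token saving the memo attributes to Layer 3.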

1.2 Tool Use Integration: The Model Reasons, the Harness Executes

The key architectural separation: "The model reasons about what to do; the harness is responsible for executing actions." The model emits structured tool_use blocks. The harness validates them before execution. This creates a security boundary where reasoning and enforcement occupy separate code paths — the model's prose cannot persuade the harness to bypass a deny rule.

Tool execution has two paths, a streaming primary and a sequential fallback:

StreamingToolExecutor: Starts executing tools during the model's streaming response for latency gains. Concurrent execution is flagged per tool — read-only tools run in parallel; state-modifying tools run sequentially to prevent race conditions.
Fallback runTools: Sequential execution with real-time progress updates.

The per-turn pipeline in query.ts (1,729 lines) is:

1. Pre-request compaction (cheapest-first: Tool Result Budget → Snip → Microcompact → Context Collapse → Auto-Compact)

2. API call with streaming

3. Error recovery cascade (three stages for prompt-too-long and max-output-token failures)

4. Stop hooks + diminishing returns detection (stops if 3 consecutive turns < 500 tokens)

5. Tool execution (streaming results + sequential remainder)

6. Post-tool transition (skill discovery, MCP tool refresh, state to next iteration)

The tool search beta reduces per-turn context cost significantly: only tool names consume context until Claude actually invokes a specific tool, at which point the full schema loads on demand.

1.3 Memory Management: File-Based, Tiered, and Surgically Compacted

Claude Code's memory architecture is deliberately file-based rather than vector-database-based. The design principle: "fully inspectable, editable, version-controllable." No opaque embeddings. Every memory artifact is a plain file a human can read and edit.

The memory hierarchy:

Tier 1 — Session context (volatile). The active conversation transcript, tool results, and file contents. Managed by the five-layer compaction pipeline. Lost between sessions unless explicitly archived.

Tier 2 — CLAUDE.md / MEMORY.md (persistent, fast). User-authored persistent instructions and auto-written memory index. Loaded at session start. CLAUDE.md survives compaction by being re-read from disk. The 200-line / 25KB MEMORY.md limit is a hard cap; the full memory topic files are lazy-loaded on demand.

Tier 3 — Skills (on-demand). Domain-specific instruction files that load only when semantically matched or explicitly invoked. Descriptions are truncated at 1,536 chars per skill; total budget is 1% of context window (8K fallback). This is lazy loading applied to procedural knowledge.

Tier 4 — autoDream (background consolidation). Revealed in the leak: autoDream is a background memory consolidation process that runs as a forked subagent while the user is idle, including mechanisms to convert "vague insights into absolute facts." This is asynchronous memory distillation — the system is improving its own memory representation during downtime.

Compaction is not truncation. The five-layer graduated pipeline reflects a key insight: different types of context pressure require different treatment. The pipeline opens with the Tool Result Budget pass, the first and cheapest step in the order given in section 1.2. The remaining four layers trade cost against information loss. Snip Compact (free, high loss) discards old messages wholesale. Microcompact (free, cache-aware) pins cached prefix ranges to avoid invalidating downstream cache keys. Context Collapse (read-time projection) never modifies originals; it projects a compressed view. Auto-Compact (expensive API call, low loss) uses AI summarization as a last resort. The system exhausts cheap options before spending tokens on summarization.
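A cheapest-first pipeline of this kind can be sketched as an ordered list of strategies that stops as soon as the transcript fits. The sketch below is illustrative only: the word-count tokenizer, the three stand-in strategies, and every function name are assumptions, not code from the leak.

```python
from typing import Callable

Messages = list[str]
Strategy = Callable[[Messages], Messages]


def count_tokens(messages: Messages) -> int:
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return sum(len(m.split()) for m in messages)


def compact(messages: Messages, limit: int,
            strategies: list[tuple[str, Strategy]]) -> tuple[Messages, list[str]]:
    """Apply strategies in order (cheapest first) and stop once under the limit."""
    applied: list[str] = []
    for name, strategy in strategies:
        if count_tokens(messages) <= limit:
            break
        messages = strategy(messages)
        applied.append(name)
    return messages, applied


def truncate_tool_results(messages: Messages) -> Messages:
    # Cheap: cap each message at 20 words.
    return [" ".join(m.split()[:20]) for m in messages]


def snip(messages: Messages) -> Messages:
    # Cheap but lossy: drop the oldest half of the transcript.
    return messages[len(messages) // 2:]


def summarize(messages: Messages) -> Messages:
    # Placeholder for the expensive model-backed summary.
    return ["[summary of earlier conversation]"]


PIPELINE = [
    ("tool_result_budget", truncate_tool_results),
    ("snip", snip),
    ("auto_compact", summarize),
]
```

The returned list of applied strategy names makes it easy to confirm in tests that the expensive summarizer only ran when the cheaper passes were not enough.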

Compaction Instructions in CLAUDE.md. A ## Compact Instructions section in CLAUDE.md tells the summarizer what to preserve during auto-compaction. This is user-controlled signal injection into the compaction process itself — a second-order memory mechanism.

1.4 Parallel Tool Calls and Orchestration

The StreamingToolExecutor identifies tools as safe for concurrent execution based on their state-modification profile. A glob search and a file read can happen simultaneously. A file edit and a bash command cannot. The orchestration is implicit — no explicit parallel workflow graph, just per-tool concurrency flags evaluated at dispatch time.
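Per-tool concurrency flags evaluated at dispatch time might look like the following asyncio sketch. The ToolCall type, the concurrency_safe flag name, and the batching policy (consecutive safe calls run together, anything else acts as a barrier) are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class ToolCall:
    name: str
    run: Callable[[], Awaitable[str]]
    concurrency_safe: bool   # True for read-only tools such as glob or file read


async def dispatch(calls: list[ToolCall]) -> list[str]:
    """Run consecutive concurrency-safe calls in parallel; run the rest one at a time."""
    results: list[str] = []
    batch: list[ToolCall] = []

    async def flush() -> None:
        # Execute the pending read-only batch concurrently, preserving order.
        if batch:
            results.extend(await asyncio.gather(*(c.run() for c in batch)))
            batch.clear()

    for call in calls:
        if call.concurrency_safe:
            batch.append(call)
        else:
            await flush()                      # finish pending reads first
            results.append(await call.run())   # state-modifying: strictly sequential
    await flush()
    return results
```

Because there is no explicit workflow graph, the only ordering guarantee in this sketch is that a state-modifying call never overlaps with anything else.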

ULTRAPLAN was revealed in the leak as an offloading mechanism for complex planning: it "runs Opus 4.6 with up to 30 minutes of dedicated think time." This is the heavy-compute escape valve — when the main session model can't plan at sufficient depth, it delegates to a more capable model in a separate session.

Multi-agent orchestration is described as fitting "in a prompt rather than a framework." The Coordinator Mode (unreleased at time of leak) spawns isolated worker agents via AgentTool, with strict tool restrictions (no Bash/file access for workers). This keeps the parent context clean — "only summaries return to parent (parent's context is protected from subagent verbosity)."

1.5 Plan Mode vs. Execute Mode

Plan mode (Shift+Tab twice) forces the model into a read-only posture. It can only gather context — no edits, no shell commands, no state-modifying tools. The model is required to produce a plan that the user reviews before any execution begins. This is "deny-first elevated to the planning level."

The pattern reflects a documented design principle: "agents tend to respond by confidently praising the work, even when quality is mediocre." Plan mode forces explicit articulation of strategy before commitment, making the model's intentions legible to the user before resources are spent.

The transition from plan to execute is the user's decision. This preserves human agency at the highest-stakes moment: the moment before action.

1.6 Verification Loops: Ground Truth from the Environment

The verification pattern is embedded in the three-phase description of the agentic loop: gather context → take action → verify results. But the implementation is more nuanced:

Diminishing returns detection: If three or more consecutive turns each produce fewer than 500 tokens, the system assumes the loop is stuck and terminates. This is a proxy for "no useful progress is being made."
Ground truth priority: The model is instructed to check its work against actual tool outputs, not its own predictions. "Give Claude something to verify against" (official docs).
The harness, not the model, detects completion: State transitions record transition.reason so tests can verify recovery paths worked correctly without inspecting message contents. The model cannot declare itself done — the harness validates.
Evidence-first communication: The model's tone reflects this: claims are grounded in what tools returned, not in what the model expects. "It works" without evidence is structurally blocked by the loop architecture.
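The diminishing-returns rule in the list above reduces to a small sliding-window check. The class below is a sketch under the thresholds quoted in this memo (3 turns, 500 tokens); the class and method names are hypothetical.

```python
from collections import deque


class DiminishingReturnsDetector:
    """Signal a stop after `window` consecutive turns below `min_tokens`."""

    def __init__(self, window: int = 3, min_tokens: int = 500) -> None:
        self.min_tokens = min_tokens
        self.recent: deque[int] = deque(maxlen=window)

    def record(self, output_tokens: int) -> bool:
        """Record one turn's output size; return True when the loop should stop."""
        self.recent.append(output_tokens)
        return (len(self.recent) == self.recent.maxlen
                and all(t < self.min_tokens for t in self.recent))
```

A single productive turn resets the streak, since the window then contains at least one value at or above the threshold.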

1.7 Skills System: Pattern-Triggered, Lazy-Loaded Procedural Memory

Skills are the most underappreciated architectural primitive in Claude Code. A skill is a file (or folder) under the .claude/ directory containing:

A description (semantic matching key, capped at 1,536 chars)
A when_to_use field (trigger conditions)
A content body (injected into context when the skill fires)
Optional disable-model-invocation: true (keeps description out of context until explicitly invoked)

Skills are matched semantically at the model level — they are not keyword triggers. Front-loading the trigger phrase in the first sentence of the description significantly improves match rate. The skill content only enters the context window when the skill fires, making this a lazy-loading mechanism for procedural knowledge.

During compaction, the most recent invocation of each skill is preserved at up to 5K tokens, with a 25K total budget across all skills. Skills are never lost to compaction — they are re-injected from disk.
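The two budgets quoted above (5K tokens per skill, 25K in total) can be combined as in the sketch below. The function name and the most-recent-first ordering policy are assumptions; the memo does not say how the real system chooses which skills lose out when the total is exhausted.

```python
def allocate_skill_budget(skill_tokens: list[tuple[str, int]],
                          per_skill_cap: int = 5_000,
                          total_cap: int = 25_000) -> dict[str, int]:
    """Grant each skill up to per_skill_cap tokens until total_cap is exhausted.

    skill_tokens is ordered most-recently-invoked first, so the newest
    invocations are served before the budget runs out (an assumed policy).
    """
    remaining = total_cap
    grants: dict[str, int] = {}
    for name, tokens in skill_tokens:
        grant = min(tokens, per_skill_cap, remaining)
        if grant <= 0:
            break
        grants[name] = grant
        remaining -= grant
    return grants
```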

The live file watching feature means skill edits apply without restarting the session, enabling real-time behavioral adjustment during a session.

---

Lens 2 — Distributed Systems / Infrastructure Engineer

2.1 What Runs Where

The architecture is a local-first, cloud-augmented system:

Local process: The Node.js CLI (claude binary, installed at C:\Users\Harnoor\AppData\Roaming\Claude\claude-code\2.1.49\claude.exe on this machine). This runs the agent loop, manages the context window, executes tools (bash, file read/write, glob/grep), and handles session persistence. The local process is the "harness" — all orchestration logic runs here.

Cloud: The Anthropic API (or Bedrock) provides model inference only. The API is called once per turn. The model returns tool_use blocks and text blocks. The local harness validates, executes, and loops.

Cloud (optional): MCP servers can run anywhere (local stdio, HTTP endpoint, SSE stream). The three transport modes (HTTP, stdio, SSE) allow MCP servers to be local daemons, remote services, or anything in between.

Cloud sessions (Kairos/remote): The leaked feature flags reveal a Coordinator Mode and Remote Control architecture using authenticated WebSocket tunnels and Git worktrees for session isolation. Remote agents run in Anthropic-managed VMs. The bridge uses 2-tier authentication (Standard/Elevated) and 33+ files managing the tunnel. This is the cloud execution environment described in the official docs.

2.2 MCP Server Pattern: JSON-RPC Over Stdio

MCP (Model Context Protocol) is the external tool integration layer. Tools are defined via JSON Schema, discovered lazily (only names load initially; schemas load on demand via ToolSearchTool), and invoked via JSON-RPC over stdio, HTTP, or SSE.

The per-server context overhead is controlled: the recommended cap is 5-6 active MCP servers due to subprocess overhead. Oversized tool results (> 25K tokens default, > 10K triggers a warning) are persisted to disk with the model receiving a file reference rather than the raw output.
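The spill-to-disk behavior for oversized results can be approximated as follows. This is a sketch, not the actual mechanism: the four-characters-per-token estimate, the returned dictionary shape, and the function names are assumptions; only the two thresholds come from the paragraph above.

```python
import tempfile
from pathlib import Path

MAX_RESULT_TOKENS = 25_000   # default cap cited above
WARN_RESULT_TOKENS = 10_000  # warning threshold cited above


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about four characters per token); not a real tokenizer.
    return len(text) // 4


def handle_tool_result(text: str, spill_dir: Path) -> dict:
    """Return the result inline, or spill it to disk and return a file reference."""
    tokens = estimate_tokens(text)
    if tokens > MAX_RESULT_TOKENS:
        spill_dir.mkdir(parents=True, exist_ok=True)
        with tempfile.NamedTemporaryFile("w", dir=spill_dir, suffix=".txt",
                                         delete=False, encoding="utf-8") as fh:
            fh.write(text)
        return {"type": "file_ref", "path": fh.name, "tokens": tokens}
    return {"type": "inline", "text": text, "warn": tokens > WARN_RESULT_TOKENS}
```

The model then sees only the small reference record and can read slices of the file on demand instead of carrying the whole output in context.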

MCP servers are recognized as potential "prompt injection vectors" — third-party server trust is a documented concern. The hook system (PreToolUse) provides the interception point for MCP tool calls.

2.3 Hook System: 27 Event Types Across the Lifecycle

The hook system is the extensibility backbone. Its 27 event types span the full agent lifecycle, including:

SessionStart: Pre-session context injection via additionalContext field. Can write environment variables via CLAUDE_ENV_FILE. This is the pre-session briefing injection point.
PreToolUse: Richest hook. Fields: permissionDecision (block/allow), updatedInput (modify tool parameters!), additionalContext. Can prevent any tool from executing.
PostToolUse: Informational; can inject context after tool execution.
PreCompact: Can block compaction entirely (exit code 2 or {"decision": "block"}). This is the catastrophic-compaction prevention hook.
PostCompact: Can reinject context after compaction via systemMessage.
Stop: Fires when the loop terminates. Useful for cleanup and state archival.

The hook input/output contracts are well-documented: hooks receive session_id, transcript_path, cwd, hook_event_name via stdin. Output via stdout is parsed as JSON if valid.

The hook_stopped termination reason (one of nine loop-exit conditions) indicates a hook can forcefully stop the agent loop. This makes hooks not just observers but actors with veto power.
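To make the stdin/stdout contract concrete, here is a hedged sketch of a PreToolUse hook body in Python. The field names permissionDecision, additionalContext, and hook_event_name come from this memo; the tool_name and tool_input input fields, the flat output shape, and the deny list are assumptions and should be checked against the official hooks reference before use.

```python
import json

BLOCKED_SUBSTRINGS = ("rm -rf", "sudo ")   # illustrative deny list


def decide(event: dict) -> dict:
    """Build a hook response for one PreToolUse event.

    `tool_name` and `tool_input` are assumed input fields; the output keys
    follow the field names listed in this memo, and their exact nesting and
    allowed values are assumptions.
    """
    if event.get("hook_event_name") != "PreToolUse":
        return {}
    command = str(event.get("tool_input", {}).get("command", ""))
    if event.get("tool_name") == "Bash" and any(s in command for s in BLOCKED_SUBSTRINGS):
        return {"permissionDecision": "block",
                "additionalContext": "Command matched the local deny list."}
    return {"permissionDecision": "allow"}


def run_hook(stdin_text: str) -> str:
    """Parse the JSON payload received on stdin and return the JSON to print on stdout."""
    return json.dumps(decide(json.loads(stdin_text)))
```

A real hook script would call print(run_hook(sys.stdin.read())); keeping the decision logic in a pure function makes the veto rule testable without spawning a process.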

2.4 Agent Spawning: Subprocess Lambda Analogy

Sub-agents are spawned as isolated sessions. The analogy to AWS Lambda is apt: each sub-agent gets a fresh context window, operates within separate session state, has constrained tool access (no Bash/file access in some modes), and returns only a summary to the parent. The parent's context is protected from subagent verbosity.

Three isolation modes:

In-process: Shares the parent runtime but with restricted context. Lowest overhead.
Worktree: Uses a Git worktree for filesystem isolation. Each agent works on its own branch.
Remote: Runs in Anthropic-managed VMs. Full network isolation.

The build-time gate (feature() function) means unreleased sub-agent types are physically absent from external builds — they cannot be invoked even by exploits.

2.5 Session State Persistence: Append-Only JSONL

Sessions are persisted as append-only JSONL transcripts under ~/.claude/projects/ (confirmed by official docs and the local ~/.claude/todos/ and ~/.claude/shell-snapshots/ directories observed on this machine). Each message, tool use, and result is written as a JSON line.

The architecture deliberately favors auditability over query power:

User messages are blocking saves (required for --resume to work correctly).
Assistant responses use fire-and-forget saves (reduce I/O wait on the hot path).
Permissions are NOT restored on resume. Session-scoped permissions expire at session end. This is an intentional security property — permission escalation is a one-time event.
Fork sessions (--fork-session) create a new session ID while preserving conversation history, enabling parallel exploration without corrupting the original.

The QueryEngine.ts supervises the full session, accumulating token usage and managing file state caches. File snapshots are created before each edit, enabling checkpoint-based rewind without Git involvement.
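An append-only JSONL store with fork support is small enough to sketch directly. The TranscriptStore class, the one-file-per-session layout, and the record shape are illustrative assumptions rather than the on-disk format Claude Code uses.

```python
import json
import uuid
from pathlib import Path


class TranscriptStore:
    """Append-only JSONL transcripts, one file per session (layout is illustrative)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, session_id: str) -> Path:
        return self.root / f"{session_id}.jsonl"

    def append(self, session_id: str, record: dict) -> None:
        # Mode "a" only ever adds lines; existing history is never rewritten.
        with self._path(session_id).open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

    def load(self, session_id: str) -> list[dict]:
        with self._path(session_id).open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def fork(self, session_id: str) -> str:
        """Copy the history under a fresh session ID; the original is untouched."""
        new_id = uuid.uuid4().hex
        for record in self.load(session_id):
            self.append(new_id, record)
        return new_id
```

Note what the sketch deliberately leaves out: nothing about granted permissions is written to the transcript, which mirrors the property that permissions are not restored on resume.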

2.6 Background Task Patterns

The Kairos proactive mode (leaked feature flag KAIROS) uses:

SleepTool — the agent suspends itself for a specified duration before resuming
PushNotificationTool — sends alerts when monitored conditions are met
SubscribePRTool — monitors pull requests and reacts to state changes

This is a scheduled-wake pattern: the agent self-suspends and self-resumes rather than relying on an external scheduler. The pattern is analogous to a serverless function with self-scheduled re-invocations.

2.7 Permission Model: Eight Security Layers

The permission system implements defense-in-depth with eight independent layers:

1. Build-time gates: feature() function eliminates unreleased code branches at bundle time.

2. Feature flags: Server-side GrowthBook kill switches enable instant deactivation.

3. Configuration rules: Eight-source priority system (user settings > session settings).

4. Transcript classifier: Separate API call judges tool safety in auto mode. Deliberately does not see the agent's prose — prevents the model from "sweet-talking" its way past the gate.

5. Dangerous pattern detection: Hardcoded blocks on interpreters (python, node, ruby) and privilege escalation tools.

6. Filesystem permission validation (62K lines): Symlink escape prevention, glob pattern limits, CWD-only mode support.

7. Trust dialog: Explicit user consent for project-scoped .claude/ configurations.

8. Bypass permissions kill switch: Server-side flag that disables bypass mode globally.

The deny-first principle: "A broad deny always overrides a narrow allow." Any single layer can block a request. The graduated trust spectrum (seven permission modes from plan through bypassPermissions) addresses a documented behavioral pattern: users approve 93% of permission prompts, creating habituation. The solution shifts from approval-centric to boundary-centric safety — the agent operates freely within defined bounds rather than asking per action.
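The deny-first rule is simple to state in code: check every deny pattern before looking at any allow pattern, and default to refusal. The sketch below uses glob-style patterns from Python's standard fnmatch module; the "Tool:argument" string format is a hypothetical encoding, not Claude Code's rule syntax.

```python
from fnmatch import fnmatchcase


def is_allowed(action: str, allow: list[str], deny: list[str]) -> bool:
    """Deny-first: any matching deny pattern wins; otherwise an allow must match."""
    if any(fnmatchcase(action, pattern) for pattern in deny):
        return False
    return any(fnmatchcase(action, pattern) for pattern in allow)
```

With allow=["Bash:git push origin"] and deny=["Bash:*push*"], the exact-match allow still loses, which is the "broad deny overrides narrow allow" behavior quoted above.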

---

Lens 3 — UX / Felt-Intelligence Designer

3.1 Why Talking to Claude Code Feels Like Talking to a Colleague

The colleague feeling has five engineering roots:

1. Delegation tone, not instruction tone. The official docs explicitly use the framing: "Think of delegating to a capable colleague. Give context and direction, then trust Claude to figure out the details." The system prompt encodes this stance — the model is instructed to figure out which files to read, which commands to run, which approach to take, rather than waiting for step-by-step instructions. This mirrors how you'd brief a senior engineer, not how you'd operate a tool.

2. Interruption parity. You can interrupt Claude at any point mid-stream by typing a correction and pressing Enter. Claude stops and adjusts. This bidirectionality — the model pausing to listen — is behaviorally identical to interrupting a colleague mid-explanation. No other coding assistant at scale has this interaction model. ChatGPT and Cursor both require waiting for completion.

3. Convergent iteration, not restart-on-failure. "When the first attempt isn't right, you don't start over. You iterate." The session state persists across corrections; the model builds on what it learned. This mirrors how a colleague would respond to "that's not quite right — the issue is actually in session handling" — not by starting over, but by adjusting the active mental model.

4. Evidence-before-claim. The model's communication discipline is architecturally enforced: the loop requires tool results before claims about task completion. When Claude says "tests pass," the transcript shows the bash tool returning exit 0. This citation reflex is what separates felt honesty from performed confidence.

5. Self-aware AI disclosure. The model never pretends to be human. The EU AI Act Art. 50 compliance is cited explicitly in Silent Infinity's audit document as a disclosure requirement. In Claude Code, this manifests as: the model acknowledges uncertainty, corrects itself, explains its reasoning, and references its own limitations. This is cognitively experienced as trustworthiness — a colleague who admits what they don't know.

3.2 The Committed Tone: Confidence Without Arrogance

The system prompt encodes a specific communication register: direct, committed, terse. The docs describe the pattern as: "Lead with the answer. Reasoning only if asked. Short > clever. One direct sentence beats three hedged ones."

This is distinct from sycophantic confidence (ChatGPT's tendency toward "Certainly! I'd be happy to...") and from defensive hedging (excessive qualifications that communicate low conviction). Claude Code's tone is closer to a senior engineer writing a Slack message: declarative, present-tense, action-oriented.

The mechanism: the system prompt explicitly filters out filler phrases and hedges. The model is instructed to state what it did, what it found, and what it recommends — not to perform enthusiasm about doing so.

3.3 Progress Disclosure: What the User Sees vs. What Claude Does Silently

There is a deliberate asymmetry between what Claude reports and what it does. The model:

Reports intent before acting (plan mode makes this explicit; auto mode makes it implicit via streaming)
Streams tool invocations in real time so the user can see the context-gathering phase
Reports results with evidence attached
Does not narrate every internal reasoning step — that would be noise

The UI design (React + Ink terminal rendering using "game-engine techniques") is built to show this disclosure at terminal fidelity. The streaming-first architecture means the user sees Claude's work as it happens rather than receiving a completed result.

Silent work vs. visible work. Skills load silently. Compaction happens without interruption. Pre-request context management is invisible. What the user sees is: tool calls, results, and model text. What the user doesn't see: five-layer compaction, cache optimization, permission evaluation, transcript classification. This is the same cognitive split as watching a colleague type — you see the output, not the decision tree.

3.4 Error Acknowledgment Patterns

When something fails:

The error is presented directly, as tool output, not softened by the model.
The model acknowledges the failure before proposing recovery.
The recovery proposal is concrete, not vague ("try running X" not "you might want to consider...").
The three-stage error recovery cascade in query.ts means the system tries the cheapest fix first (free retries before API calls).

This pattern maps to how a competent colleague handles failure: name it, own it, propose next step. The absence of defensive framing or blame-shifting is a deliberate design choice.

3.5 Humor, Texture, and ASCII Art: Contextual Use

The leaked source includes ASCII art utilities and personality texture. The design principle appears to be: decoration that is never random and never mandatory. ASCII art appears in specific, predictable contexts (session summaries, large output headers). Humor appears when the user's tone invites it, never unprompted.

This is contextual personality rather than constant personality. A constant personality (like Character.AI's engagement-optimized warmth) creates dependency. A contextual personality creates the feeling of authenticity — like a colleague who is funny when it's appropriate and direct when it isn't.

3.6 The "Show Me Evidence" Reflex: Felt Honesty

The verification loop architecture produces what users experience as honesty. When Claude says "the tests pass," it's because the bash tool returned exit 0 and that result is in the transcript. When Claude says "I'm not sure," it's because the context genuinely doesn't contain the answer.

The transcript classifier (used in auto mode) deliberately does not see the model's prose — it sees only the user's request and the proposed tool call. This architectural choice prevents a specific failure mode: the model constructing persuasive justifications for unsafe actions. The safety evaluation is evidence-based, not argument-based.

3.7 Self-Awareness of Being AI

Claude Code never hedges its AI identity. The system prompt includes explicit instructions about not pretending to be human. The practical manifestation: Claude speaks in first person about what it "did" (via tools), "learned" (from tool results), and "thinks" (reasoning from evidence), without claiming the experiential weight those verbs carry for humans. The register is: technically accurate, not philosophically loaded.

This is the opposite of Character.AI's companion model (which performs human-like emotional attachment) and more like a very good documentation agent that happens to have excellent judgment.

---

Part C — Silent Infinity Implementation Roadmap

Context: What Silent Infinity Has and What It's Missing

From the audit document (F:/TITAN/plans/advisors/SILENT-INFINITY-AUDIT-DOCUMENT-2026-04-21.md), Silent Infinity's current architecture:

Lambda (Python 3.12 ARM64) + API Gateway + DynamoDB + Bedrock
system_prompt.py — assembles system prompt (static versioning + variant suffix injection)
conversation_store.py — DynamoDB read/write for history; basic context window management
feedback_monitor.py (Chat Sentinel) — async post-turn monitoring via Haiku 4.5
user_profile.py — user record, language detection, interest capture
Variant system (A-F routing in DynamoDB config table)
582 green tests; Feature Readiness Standard

What it does NOT have (and Claude Code does):

Memory tiering with explicit hot/warm/cold architecture
Skill system (pattern-triggered procedural injection)
Sub-agent pattern
Structured tool use with input validation
Verification-before-claim discipline
Session transcript rehydration / fork / resume
Plan mode (separate thinking before acting)
Pre-session briefing injection
Persistent TODO / correction-as-memory pipeline
Parallel sub-task execution

---

Pattern Roadmap (ranked by combined impact/effort score; impact and effort each on a 1-10 scale)

---

Pattern 1 — Pre-Session Briefing Injection

Impact: 9 | Effort: 2 | Score: 18 | Time: 1 day

What it does in Claude Code:

The SessionStart hook injects additionalContext into every session. CLAUDE.md loads as a user message after the system prompt, and the first 200 lines / 25KB of the auto-memory MEMORY.md index load at the same time. The model enters every session already knowing the user's persistent preferences, project context, and behavioral rules, without those facts needing to be re-established in conversation.

Why it makes the product feel smart:

The felt effect is: "it remembers me." The model's first response reflects knowledge of who the user is before they say a word. This is what distinguishes a good therapist from a service desk — the good therapist reads the chart before walking in.

How to port to Silent Infinity:

system_prompt.py already has versioned system prompts and variant injection. Extend it with a user-personalization injection layer:

1. user_profile.py builds a PersonalizationContext object (language, emotional tone history, topics explored, active threads, known triggers) from DynamoDB.

2. system_prompt.py injects a <user_context> block into the system prompt, assembled from PersonalizationContext.

3. The injection is conditional on data availability — first-session users get the default; returning users get the personalized layer.

4. The <user_context> block is the Silent Infinity equivalent of MEMORY.md first-200-lines.

Existing system to extend: user_profile.py + system_prompt.py. No new infrastructure required.

---

Pattern 2 — Correction-as-Memory (Live Feedback Loop)

Impact: 9 | Effort: 3 | Score: 16 | Time: 2-3 days

What it does in Claude Code:

When the user corrects Claude, the system (via VAULT in TITAN's implementation) extracts a generalizable rule and persists it as a memory file. Future sessions load this correction into context. The correction becomes permanent behavioral change without retraining.

Why it makes the product feel smart:

The product learns the user's preferences faster than any static system. By session 5, a corrected behavior should no longer recur. This is qualitatively different from a model that responds correctly to corrections within a session but "forgets" them on reload.

How to port to Silent Infinity:

1. Add a preference_capture.py module that runs as part of feedback_monitor.py's post-turn processing.

2. Use a lightweight Haiku 4.5 prompt to classify user messages: did the user signal a preference, correction, or boundary? (e.g., "I don't like it when you...", "please stop...", "I'd prefer...")

3. If yes: extract the rule as a structured record {preference_type, rule_text, confidence, turn_id} and write to a user_preferences DynamoDB table.

4. user_profile.py reads the top-N (by recency × confidence) preferences on each session load.

5. system_prompt.py injects them as a <user_preferences> block.

Existing system to extend: feedback_monitor.py (already runs post-turn) + user_profile.py + system_prompt.py. The DynamoDB table is new but trivial.

---

Pattern 3 — Five-Layer Graduated Compaction

Impact: 8 | Effort: 3 | Score: 16 | Time: 2-3 days

What it does in Claude Code:

Context is managed by five sequential strategies applied cheapest-first: Tool Result Budget → Snip → Microcompact → Context Collapse → Auto-Compact. Each layer has different cost and information-loss profiles. The system never prematurely spends API tokens on summarization when a free trimming pass would suffice.

Why it makes the product feel smart:

Long conversations feel coherent. The model doesn't suddenly "forget" what was said 20 turns ago because it was truncated. The compaction is semantic rather than mechanical — important content is preserved; verbose tool outputs are the first to go.

How to port to Silent Infinity:

Silent Infinity currently has basic context window management in conversation_store.py. Replace with a tiered compaction module:

1. Layer 1 (free): Trim oldest messages when token count exceeds 80% of limit. Preserve system prompt, user preferences, first/last 3 turns always.

2. Layer 2 (free): Compress verbose turns — replace long AI responses with 2-sentence summaries stored locally; keep originals in DynamoDB for retrieval.

3. Layer 3 (cheap API call): Haiku-powered semantic summarization of the oldest N turns into a single "prior context" block. Cost: ~$0.001 per compaction.

4. Layer 4 (expensive, last resort): Full Sonnet-level resummary if Haiku summary still leaves context over limit.

5. Add a ## Compact Instructions equivalent: a DynamoDB field per user that tells the compactor what to always preserve (e.g., "user's name for their current exploration theme").

Existing system to extend: conversation_store.py. The DynamoDB schema already stores full transcripts; layered access patterns are additive.
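
A minimal sketch of the cheapest-first control flow with Layer 1 filled in. It assumes the system prompt and user preferences are stored outside the turn list (so they are never candidates for trimming), and it uses a rough characters-per-token heuristic in place of a real tokenizer. The function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); replace with a real tokenizer."""
    return max(1, len(text) // 4)


def total_tokens(turns: list[str]) -> int:
    return sum(estimate_tokens(t) for t in turns)


def trim_oldest(turns: list[str], budget: int, keep_edges: int = 3) -> list[str]:
    """Layer 1: drop the oldest unprotected turns until the budget is met.

    The first and last `keep_edges` turns are never dropped.
    """
    kept = list(turns)
    while total_tokens(kept) > budget and len(kept) > 2 * keep_edges:
        del kept[keep_edges]  # oldest turn after the protected head
    return kept


def compact(turns: list[str], limit: int, layers, threshold: float = 0.8) -> list[str]:
    """Apply layers cheapest-first and stop as soon as the context fits."""
    budget = int(limit * threshold)
    for layer in layers:
        if total_tokens(turns) <= budget:
            break
        turns = layer(turns, budget)
    return turns
```

Later layers (local compression, Haiku summarization, Sonnet resummary) would be further callables with the same `(turns, budget)` signature appended to `layers`, so an API-backed layer only runs when every cheaper layer has failed to bring the context under budget.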

---

Pattern 4 — Skills System (Pattern-Triggered Behavioral Injection)

Impact: 8 | Effort: 5 | Score: 12 | Time: 4-5 days

What it does in Claude Code:

Skills are domain-specific instruction sets that inject into context when semantically triggered. They are lazy-loaded (not present in context by default), survive compaction (they are re-injected from disk), and are self-describing (the trigger is in the skill's description, matched by the model). A skill for "explaining depression" fires only when depression-related content appears, injecting specialized therapeutic dialogue guidance without permanently occupying context.

Why it makes the product feel smart:

The model behaves like a specialist in whatever domain is active. A grief conversation gets grief-specific guidance injected. A body-image conversation gets body-image-specific guidance. The default system prompt doesn't need to cover every domain — it stays lean and the right behavior emerges contextually.

How to port to Silent Infinity:

1. Create a skills/ directory in the backend (or a DynamoDB skills table) with entries: {skill_id, domain_tags, trigger_description, content, enabled}.

2. Pre-session: run a fast Haiku prompt that scans the user's message and recent history against skill trigger_description fields. Return matching skill IDs.

3. Inject matched skill content into the system prompt as a <domain_context> block.

4. Start with 5 pilot skills: grief, relationship conflict, anxiety/panic, purpose/meaning, body/self-image. Each skill contains: contemplative stance for this domain, topics to avoid initiating, crisis-adjacent signals to watch, example reflection phrasings.

5. The Feature Readiness Standard requires each skill to go through ALPHA (internal testing) before user exposure.

Existing system to extend: system_prompt.py (injection point). The skill matching is a new Haiku pre-call. The skills themselves are content authored by Harnoor + clinical review.
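
Steps 1 to 3 can be sketched as below. The matcher is passed in as a callable so the Haiku pre-call can be swapped for a local stand-in during testing; `keyword_matcher` is exactly such a stand-in and is not the proposed production matcher. All names other than the record fields from step 1 are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Skill:
    skill_id: str
    domain_tags: tuple[str, ...]
    trigger_description: str
    content: str
    enabled: bool = True


# Takes the user message plus candidate skills and returns the matching skill IDs.
Matcher = Callable[[str, list[Skill]], list[str]]


def build_domain_context(message: str, skills: list[Skill], matcher: Matcher) -> str:
    """Return a <domain_context> block for matched, enabled skills ("" if none)."""
    candidates = [s for s in skills if s.enabled]
    matched = set(matcher(message, candidates))
    bodies = [s.content for s in candidates if s.skill_id in matched]
    if not bodies:
        return ""
    return "<domain_context>\n" + "\n\n".join(bodies) + "\n</domain_context>"


def keyword_matcher(message: str, skills: list[Skill]) -> list[str]:
    """Local stand-in for the model call: matches on domain tags only."""
    text = message.lower()
    return [s.skill_id for s in skills if any(tag in text for tag in s.domain_tags)]
```

Filtering on `enabled` before matching means a skill that has not cleared ALPHA can sit in the table without any chance of being injected.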

---

Pattern 5 — Structured Tool Use Format (for Internal Capabilities)

Impact: 7 | Effort: 4 | Score: 11 | Time: 3-4 days

What it does in Claude Code:

Every capability is exposed as a formal tool with JSON Schema input/output validation. The model cannot call a tool with invalid parameters — the schema catches it. Tool results are typed, logged, and traceable. This is what enables the "98.4% is infrastructure" ratio: every discrete capability is a formal interface, not an ad-hoc prompt modification.

Why it makes the product feel smart:

Reliability. Tool-defined capabilities fail predictably (bad input → schema error → recovery, not silent hallucination). The model knows exactly what each capability can and cannot return. Debugging is straightforward because every tool call is logged with its input and output.

How to port to Silent Infinity:

Silent Infinity's capabilities are currently invoked as prompt-layer instructions (guardrails, crisis detection, sentiment analysis are described in the system prompt, not called as structured tools). Convert the primary capabilities to a tool-call pattern:

1. CrisisCheck tool: input {message_text, user_id, session_id}, output {crisis_detected: bool, severity: 0-5, matched_patterns: [], recommended_resources: []}. Replaces the current regex inline in guardrails.py.

2. SentimentRead tool: input {message_text}, output {valence: float, arousal: float, dominant_emotion: str, confidence: float}. Fed by Haiku 4.5.

3. TopicClassify tool: input {message_text, history_summary}, output {primary_topic: str, subtopics: [], skill_triggers: []}. Drives the skills system (Pattern 4).

4. Define each tool with a Pydantic v2 schema in schemas.py. handler.py dispatches via the tool ID.

Existing system to extend: schemas.py (already Pydantic v2), handler.py (already orchestration hub). The refactor is extracting inline logic into schema-validated tool functions.
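
A sketch of the dispatch shape for the CrisisCheck tool. To keep it runnable with the standard library alone, plain dataclasses stand in for the Pydantic v2 models that schemas.py would actually hold; the pattern list, the severity scoring, and the `TOOLS` registry are placeholders invented for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CrisisCheckInput:
    message_text: str
    user_id: str
    session_id: str

    def __post_init__(self):
        # Stand-in for schema validation: reject non-string fields.
        for name in ("message_text", "user_id", "session_id"):
            if not isinstance(getattr(self, name), str):
                raise TypeError(f"{name} must be a string")


@dataclass(frozen=True)
class CrisisCheckOutput:
    crisis_detected: bool
    severity: int
    matched_patterns: list = field(default_factory=list)
    recommended_resources: list = field(default_factory=list)

    def __post_init__(self):
        if not 0 <= self.severity <= 5:
            raise ValueError("severity must be between 0 and 5")


PLACEHOLDER_PATTERNS = [r"\bexample crisis phrase\b"]  # illustrative, not the real list


def crisis_check(inp: CrisisCheckInput) -> CrisisCheckOutput:
    matched = [p for p in PLACEHOLDER_PATTERNS if re.search(p, inp.message_text, re.I)]
    return CrisisCheckOutput(
        crisis_detected=bool(matched),
        severity=5 if matched else 0,  # placeholder scoring, not a clinical scale
        matched_patterns=matched,
    )


TOOLS = {"CrisisCheck": (CrisisCheckInput, crisis_check)}


def dispatch(tool_id: str, payload: dict):
    """Validate the payload against the tool's input type, then call the tool."""
    input_cls, fn = TOOLS[tool_id]
    return fn(input_cls(**payload))  # missing or unexpected keys raise TypeError
```

The point of the shape is that a malformed call fails at the boundary with a typed error, before any tool logic runs, which is the "bad input → schema error → recovery" path described above.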

---

Pattern 6 — Plan Mode (Separate Thinking Before Acting)

Impact: 7 | Effort: 3 | Score: 10 | Time: 2-3 days

What it does in Claude Code:

A read-only pre-execution mode where the model articulates a plan for user review before any state-modifying action. Prevents impulsive responses in complex situations.

Why it makes the product feel smart:

For Silent Infinity, "plan mode" translates to reflective pause disclosure: before responding to a heavy emotional message, the model articulates (internally or briefly to the user) what it noticed before speaking. This maps to the contemplative principle of witnessing before responding.

How to port to Silent Infinity:

1. Add a response_mode field to the system prompt: reflective (default) vs. direct. Reflective mode includes a brief <observation> block before the main response.

2. The <observation> block is 1-2 sentences: "I'm noticing [X] in what you've shared." This is the plan-mode equivalent — the model declares its reading before responding.

3. Can be user-toggled ("just speak to me directly" removes the observation prefix).

4. More advanced: a two-call architecture where Call 1 (Haiku, fast, cheap) generates the <observation>, and Call 2 (Sonnet, substantive) generates the response conditioned on the observation.

Existing system to extend: system_prompt.py + bedrock_client.py (two-call pattern is additive).
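
The two-call variant in step 4 can be sketched with the model calls injected as plain callables, so the control flow is testable without bedrock_client.py. The prompt wording and the function names are assumptions, not existing code.

```python
from typing import Callable

ModelCall = Callable[[str], str]  # prompt in, completion out


def reflective_reply(user_message: str, fast_model: ModelCall,
                     main_model: ModelCall, mode: str = "reflective") -> str:
    """Call 1 produces the observation; Call 2 responds conditioned on it."""
    if mode == "direct":
        # User asked to be spoken to directly: skip the observation entirely.
        return main_model(user_message)
    observation = fast_model(
        "In 1-2 sentences, state what you notice in this message, "
        "using only what the user wrote:\n" + user_message
    ).strip()
    response = main_model(
        f"<observation>{observation}</observation>\n"
        f"User message:\n{user_message}"
    )
    return f"<observation>{observation}</observation>\n{response}"
```

In direct mode only one call is made, so the toggle in step 3 also removes the extra latency and cost of Call 1.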

---

Pattern 7 — Session Transcript Rehydration / Fork / Resume

Impact: 7 | Effort: 4 | Score: 9 | Time: 3-4 days

What it does in Claude Code:

Sessions persist as append-only JSONL under ~/.claude/projects/. Users can pass --resume (pick up from the last message) or --fork-session (new session ID, same history up to the fork point). Permissions expire on resume (a security property); conversation history does not.

Why it makes the product feel smart:

Users who return to a session feel remembered. The model references what was established previously without the user needing to re-establish context. In a contemplative product, this is especially powerful: "last time we spoke, you were exploring..." is the mark of a witness, not a service.

How to port to Silent Infinity:

1. conversation_store.py already persists full transcripts in DynamoDB.

2. Add a session_resume endpoint that loads the last N turns from DynamoDB and prepends them to the context as a <prior_session_summary> block (Haiku-summarized, not raw replay).

3. Add a "return to last conversation" UX affordance (not a resume button — an invitation: "your space is still here").

4. Session forking: when a user wants to "start fresh but remember where I was," generate a new session_id but inject a 3-sentence summary of the prior session as context.

Existing system to extend: conversation_store.py + user_profile.py. The UI is a single affordance change.
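
Step 4 (forking) can be sketched as follows. The `Session` shape and the injected `summarize` callable are hypothetical; in the design above the summarizer would be the Haiku call, and the session record would live in DynamoDB rather than in memory.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Session:
    session_id: str
    turns: list[str] = field(default_factory=list)
    parent_session_id: Optional[str] = None
    prior_summary: str = ""


def fork_session(parent: Session, summarize: Callable[[list[str]], str]) -> Session:
    """Start fresh: new ID, no copied turns, a short summary of the parent."""
    return Session(
        session_id=str(uuid.uuid4()),
        parent_session_id=parent.session_id,
        prior_summary=summarize(parent.turns),
    )


def opening_context(session: Session) -> str:
    """Block prepended to the context at session start ("" for a brand-new user)."""
    if not session.prior_summary:
        return ""
    return (
        "<prior_session_summary>\n"
        f"{session.prior_summary}\n"
        "</prior_session_summary>"
    )
```

Keeping `parent_session_id` on the fork preserves lineage for later retrieval without replaying the parent's raw turns into the new context.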

---

Pattern 8 — Sub-Agent Pattern (Parallel Specialized Processing)

Impact: 7 | Effort: 6 | Score: 9 | Time: 5-7 days

What it does in Claude Code:

Sub-agents operate in isolated context windows, execute specialized tasks, and return only summaries to the parent. The parent's context is protected from subagent verbosity. Used for: exploration, planning, verification, and status generation.

Why it makes the product feel smart:

Parallel specialized processing makes the system more capable without making the main conversation context heavier. The user perceives the depth without experiencing the weight.

How to port to Silent Infinity:

Silent Infinity's Chat Sentinel (feedback_monitor.py) is already a primitive sub-agent: it runs post-turn, uses a separate model (Haiku 4.5), and returns a summary signal, not raw output. Extend this pattern:

1. Crisis Sentinel (already exists): post-turn crisis detection.

2. Personalization Sentinel: runs post-turn, extracts preference signals and topic classifications. Returns {preferences_detected: [], topics: [], skill_triggers: []}. Feeds Pattern 2 and Pattern 4.

3. Session Summarizer: runs end-of-session, generates a 3-sentence summary for <prior_session_summary> injection. Feeds Pattern 7.

4. All sentinels run asynchronously (fire-and-forget, same pattern as current feedback_monitor.py), adding zero latency to the user-facing response path.

Existing system to extend: feedback_monitor.py pattern is the template. New sentinel modules follow the same async invocation pattern.

---

Pattern 9 — Verification-Before-Claim Discipline

Impact: 8 | Effort: 2 | Score: 8 | Time: 1 day

What it does in Claude Code:

The model is instructed to ground every claim in tool output. "It works" requires a test result. "The file was updated" requires a read-back confirmation. The system prompt directs the model not to make claims that lack this evidence.

Why it makes the product feel smart:

In a wellness context, this translates to: the mirror does not assert things about the user that it hasn't actually observed. "I notice you've mentioned grief three times" is grounded. "You seem to be struggling with attachment" (unfounded inference) is not.

How to port to Silent Infinity:

Add a prompt-level discipline instruction to system_prompt.py:


Before making any observation about the user's emotional state, confirm it is grounded
in something the user explicitly expressed in this session. Do not infer states not
present in the user's words. Observations must cite the user's words, not the model's
interpretation of them.

Cost: zero. This is a system prompt addition, not new infrastructure. High impact for low effort.

Existing system to extend: system_prompt.py only. Immediate ship.

---

Pattern 10 — Interruptible Streaming (Barge-In on Chat)

Impact: 6 | Effort: 4 | Score: 7 | Time: 3-4 days

What it does in Claude Code:

Pressing ESC mid-stream stops the current generation and immediately accepts new input. The model sees the partial response as part of the transcript and continues from the interruption.

Why it makes the product feel smart:

In a contemplative context, the ability to interrupt a long response with "actually, I need to talk about something else right now" is the digital equivalent of a human counselor stopping and listening. The model doesn't finish its point; it holds space for the user's emergence.

How to port to Silent Infinity:

Silent Infinity already uses SSE streaming. Add:

1. A client-side interrupt button (keyboard shortcut or tap target) that closes the EventSource and sends a new POST /interrupt request.

2. On receiving the interrupt, the Lambda handler truncates the partial response in the transcript store and marks the turn as {status: "interrupted", partial_response: "..."}.

3. Next user message includes the partial response in context so the model knows what was interrupted.

4. The model's response to a follow-up after interruption should acknowledge the interruption without dwelling on it.

Existing system to extend: bedrock_client.py (abort signal), conversation_store.py (partial turn storage), frontend SSE client.
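
Steps 2 and 3 on the server side can be sketched as below, using an in-memory transcript in place of conversation_store.py. The `Turn` shape, the status values other than "interrupted", and the bracketed marker text are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str
    status: str = "complete"  # "complete" | "streaming" | "interrupted"


def mark_interrupted(transcript: list[Turn], received_so_far: str) -> None:
    """Truncate the in-flight assistant turn to what the client actually received."""
    if not transcript:
        return
    last = transcript[-1]
    if last.role != "assistant" or last.status != "streaming":
        return  # nothing in flight: treat the interrupt as a no-op
    last.text = received_so_far
    last.status = "interrupted"


def render_for_model(transcript: list[Turn]) -> list[dict]:
    """Build the message list for the next call, flagging interrupted turns."""
    messages = []
    for turn in transcript:
        suffix = " [interrupted by user]" if turn.status == "interrupted" else ""
        messages.append({"role": turn.role, "content": turn.text + suffix})
    return messages
```

Storing only the text the client received, rather than everything the model generated before the abort landed, keeps the transcript consistent with what the user actually saw.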

---

Pattern 11 — Persistent TODO / Active Threads List

Impact: 6 | Effort: 2 | Score: 7 | Time: 1 day

What it does in Claude Code:

The agent maintains an active TODO list (JSONL format, ~/.claude/todos/) that persists across sessions. The list is visible in the UI and updated as tasks complete. This is separate from session history.

Why it makes the product feel smart:

In Silent Infinity, this maps to active exploration threads: the product maintains awareness of the user's ongoing themes across sessions ("still exploring the grief around your father," "you mentioned revisiting the relationship conversation"). The model references these threads when relevant without the user needing to restate them.

How to port to Silent Infinity:

1. The Personalization Sentinel (Pattern 8) detects active themes and writes them to a user_threads DynamoDB table: {thread_id, user_id, theme_label, last_active, status: active|resolved|paused}.

2. user_profile.py loads the top-3 active threads at session start.

3. system_prompt.py injects them as <active_threads>: "This user is currently exploring: grief (8 sessions), purpose (3 sessions), relationship conflict (1 session)."

4. The model can reference threads contextually without being instructed to — the context contains the signal.

Existing system to extend: user_profile.py + system_prompt.py. The DynamoDB table is the only new infrastructure.
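
Steps 2 and 3 can be sketched as follows. The record fields follow step 1; `session_count` is an added, assumed field needed to reproduce the example string, and ordering "top-3" by `last_active` is one reading of the text (ordering by session count would be another).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thread:
    thread_id: str
    user_id: str
    theme_label: str
    last_active: str        # ISO-8601 UTC timestamp, so string order is time order
    status: str             # "active" | "resolved" | "paused"
    session_count: int = 1  # assumed field, used only in the rendered summary


def top_active_threads(threads: list[Thread], n: int = 3) -> list[Thread]:
    """Most recently active threads first; resolved and paused ones are skipped."""
    active = [t for t in threads if t.status == "active"]
    return sorted(active, key=lambda t: t.last_active, reverse=True)[:n]


def render_active_threads(threads: list[Thread]) -> str:
    if not threads:
        return ""
    parts = []
    for t in threads:
        noun = "session" if t.session_count == 1 else "sessions"
        parts.append(f"{t.theme_label} ({t.session_count} {noun})")
    return (
        "<active_threads>This user is currently exploring: "
        + ", ".join(parts)
        + ".</active_threads>"
    )
```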

---

Pattern 12 — Memory Tiering (Hot/Warm/Cold Architecture)

Impact: 8 | Effort: 7 | Score: 7 | Time: 1 week

What it does in Claude Code:

Memory is structured as distinct tiers with different access costs and retention policies. Hot memory loads every session. Warm memory loads when relevant. Cold memory is archival. The MEMORY.md index (200 lines / 25KB) provides a navigable map of what exists without loading everything.

Why it makes the product feel smart:

The model has access to rich user history without being overwhelmed by it. Old sessions don't bloat the context — they are accessible via semantic retrieval when needed. The system gets smarter over time without getting slower.

How to port to Silent Infinity:

1. Hot tier (loads every session): User preferences (top-5), active threads (top-3), last session summary. Already partially implemented via user_profile.py.

2. Warm tier (loads when relevant): Per-theme history summaries. The Personalization Sentinel generates these monthly. Loaded when the active theme matches a warm-tier file.

3. Cold tier (archival): Full session transcripts. Stored in DynamoDB with TTL policy. Accessible via admin API, not surfaced to the model directly.

4. Add a memory_index.md-equivalent in DynamoDB: a structured record per user listing what memory artifacts exist across tiers. This is the navigable map.

Existing system to extend: conversation_store.py + user_profile.py. The tiering is a DynamoDB schema extension.
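
The tier-selection rule can be sketched in a few lines. The index layout (a dict with "hot", "warm", and "cold" keys) is a hypothetical stand-in for the per-user index record in step 4, not a proposed DynamoDB schema.

```python
from typing import Optional


def select_memory(index: dict, active_theme: Optional[str] = None) -> list[str]:
    """Choose which memory artifacts enter the model's context.

    Hot artifacts always load, a warm summary loads only when its theme is
    active, and cold artifacts are never returned to the model.
    """
    selected = list(index.get("hot", []))
    warm = index.get("warm", {})
    if active_theme is not None and active_theme in warm:
        selected.append(warm[active_theme])
    return selected


# Example index for one user (all values are illustrative).
example_index = {
    "hot": ["preferences", "active threads", "last session summary"],
    "warm": {"grief": "grief history summary", "purpose": "purpose history summary"},
    "cold": ["transcript-0001", "transcript-0002"],
}
```

Because the function never reads the "cold" key, archival transcripts stay out of the context by construction rather than by convention.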

---

Pattern 13 — Deny-First Permission Architecture

Impact: 6 | Effort: 2 | Score: 6 | Time: 1 day

What it does in Claude Code:

Safety is deny-first: broad denies always override narrow allows. The system has eight independent safety layers. Any single layer can block an action. The model cannot argue its way past a deny.

Why it makes the product feel smart:

In a wellness context, this maps to behavioral discipline: the mirror will not do certain things regardless of how the user frames the request. It cannot be argued into providing diagnostic assessments, clinical interpretations, or crisis-adjacent content without the appropriate guardrails activating. This is felt safety — the user can trust the product has limits.

How to port to Silent Infinity:

Silent Infinity's guardrails.py already implements regex-based deny-first matching. Extend it:

1. Layer 0 (already exists): regex crisis pattern matching.

2. Layer 1 (add): Haiku-based behavioral classification — does this request ask the model to behave as a therapist, clinician, or diagnostic tool? Block if yes.

3. Layer 2 (add): topic hard-denies — specific topics (suicide method instructions, self-harm instructions) are hardcoded blocks that cannot be overridden by prompt context.

4. The deny architecture should be documented in the Feature Readiness Standard as a first-class safety layer.

Existing system to extend: guardrails.py. Additive layers, no removal of existing logic.
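
A sketch of the layered evaluation, assuming each layer is a callable that returns a denial reason or None. The two example layers use placeholder keyword checks in place of the real regex set and the Haiku classifier, and denying when a layer raises is a fail-closed choice added here rather than something stated in the source.

```python
import re
from typing import Callable, Optional

# A layer returns a denial reason, or None if it has no objection.
Layer = Callable[[str], Optional[str]]


def evaluate(message: str, layers: list[Layer]) -> tuple[bool, Optional[str]]:
    """Deny-first: the first objecting layer blocks the request.

    There is no allow path that can reverse a deny, and a layer that
    raises is treated as a deny (fail closed).
    """
    for layer in layers:
        try:
            reason = layer(message)
        except Exception:
            return False, "layer_error"
        if reason is not None:
            return False, reason
    return True, None


def regex_layer(message: str) -> Optional[str]:
    # Placeholder pattern standing in for the existing Layer 0 regex set.
    return "hard_deny_pattern" if re.search(r"\bplaceholder-blocked\b", message, re.I) else None


def role_layer(message: str) -> Optional[str]:
    # Placeholder for the Layer 1 classifier call.
    return "clinical_role_request" if "diagnose" in message.lower() else None
```

Adding Layer 2 is then a matter of appending another callable to the list; no existing layer needs to change, which matches the "additive layers" constraint above.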

---

Pattern 14 — Commit → Verify → Report Loop

Impact: 7 | Effort: 2 | Score: 6 | Time: 1 day

What it does in Claude Code:

The model commits to an action, verifies the result via tool output, and only then reports. "I ran the tests" is followed by the test output. "I edited the file" is followed by a read-back of the change.

Why it makes the product feel smart:

For Silent Infinity, this is the witnessing discipline: the mirror commits to what it observes, verifies it against the user's actual words, and reports. It does not speculate about what it expects to observe.

How to port to Silent Infinity:

System prompt addition to system_prompt.py:


When you summarize what a user has expressed, quote or closely paraphrase their actual
words before reflecting them back. Do not summarize at one level of abstraction above
what was said. The mirror reflects; it does not interpret. Interpretation is offered only
when explicitly invited.

Cost: zero. High impact on how witnessed users feel.

---

Anti-Patterns: Three "Do Not Copy" Observations

Anti-Pattern 1 — Bypass Permissions Mode.

Claude Code has a bypassPermissions mode that disables all safety checks, protected only by a server-side kill switch. This is appropriate for a developer tool where the user is the developer. It is completely inappropriate for a wellness product with vulnerable users. Silent Infinity must never implement a "bypass safety" mode of any kind, regardless of user request. The safety architecture must be structurally non-bypassable.

Anti-Pattern 2 — The 98.4% Infrastructure Ratio as a Warning.

Claude Code's architecture was designed for increasingly capable models — the ratio of infrastructure to AI logic assumes the model will keep improving. For Silent Infinity, copying the infrastructure complexity without the model capability to justify it would produce bureaucratic overhead that slows feature development without proportional benefit. Port the patterns, not the full complexity. Start lean.

Anti-Pattern 3 — Context Window as the Primary State Store.

Claude Code treats the conversation transcript as its sole state store: "the only state is a message array." This works at session granularity. For Silent Infinity, user state needs to persist across sessions and across months. Relying on the context window as the primary state store would produce a product that is excellent within a session and amnesiac across sessions. The DynamoDB-backed tiered memory architecture (Patterns 1, 2, 7, 11, 12) is the correct replacement.

---

Summary Statistics

Source corpus: 12 existing TITAN warm-memory Claude Code intelligence files + 6 new primary sources fetched this session (VILA Lab arxiv paper, Bits-Bytes-NN architecture analysis, layer5.io leak analysis, Dev.to Rust rewrite analysis, official Claude Code docs, dbreunig system prompt analysis)
Patterns evaluated: 14 (ranked by impact × effort)
Patterns recommended for this week (top 5): 9 (Verification-Before-Claim, 1 day), 1 (Pre-Session Briefing, 1 day), 14 (Commit-Verify-Report, 1 day), 6 (Plan/Reflective Mode, 2-3 days), 2 (Correction-as-Memory, 2-3 days)
Patterns recommended for next sprint (1 week): 3 (Graduated Compaction, 2-3 days), 11 (Active Threads, 1 day), 7 (Session Resume, 3-4 days)
Anti-patterns documented: 3

---

Sources:

1. github.com/ghuntley/claude-code-source-code-deobfuscation (community deobfuscation, March 2026)

2. arxiv.org/html/2604.14228v1 — VILA Lab, "Dive into Claude Code," April 2026

3. layer5.io/blog/engineering/the-claude-code-source-leak-512000-lines-a-missing-npmignore-and-the-fastest-growing-repo-in-github-history/

4. bits-bytes-nn.github.io/insights/agentic-ai/2026/03/31/claude-code-architecture-analysis.html

5. dev.to/brooks_wilson_36fbefbbae4/claude-code-architecture-explained-agent-loop-tool-system-and-permission-model-rust-rewrite-41b2

6. code.claude.com/docs/en/how-claude-code-works (official docs)

7. www.dbreunig.com/2026/04/04/how-claude-code-builds-a-system-prompt.html

8. wavespeed.ai/blog/posts/claude-code-agent-harness-architecture/

9. F:\TITAN\knowledge\memory\warm\claude-code\ (12 TITAN intelligence files, April 2026)

10. F:\TITAN\plans\advisors\SILENT-INFINITY-AUDIT-DOCUMENT-2026-04-21.md (architecture reference)