Author: TITAN / SCOUT Research Arm
Date: 2026-04-21
Status: Draft v1.0 — Print-Ready
Classification: Internal Engineering — Advisor-Grade
---
> "The purpose of software architecture is to minimize the human resources required to build and maintain the required system."
> — Robert C. Martin, Clean Architecture (2017)
---
Silent Infinity is a production mental-wellness conversational AI system currently deployed on AWS using Lambda, Bedrock, DynamoDB, CloudFront, and API Gateway. This memo presents a complete, PhD-level architectural blueprint for transforming Silent Infinity into a fully modular, variant-driven, cloud-agnostic, and deeply observable system. The design is grounded in eleven canonical software-engineering principles — from the 12-Factor App to Hexagonal Architecture to OpenTelemetry — and produces a concrete implementation roadmap deliverable in five engineer-weeks. The resulting system will support simultaneous A/B variant experiments across every flip-able dimension of the product (model, prompt, UI, audio, pricing), expose a Cognito-gated admin dashboard for runtime control without redeploy, capture per-turn telemetry across every active variant, instrument distributed tracing via OpenTelemetry and AWS X-Ray, and maintain a clean portability layer that allows full migration off AWS to Docker, Kubernetes, Cloudflare Workers, or Fly.io in a structured seven-day playbook.
---
Adam Wiggins' The Twelve-Factor App (2012) remains the most operationally rigorous methodology for building software-as-a-service that is portable, scalable, and maintainable across deployment targets. Of the twelve factors, six are directly load-bearing for Silent Infinity's portability ambition.
Factor III — Config: Configuration that varies between deploys (dev, staging, prod, Docker, Lambda) must live in environment variables — not in code, not in config files committed to the repository. Today, Silent Infinity is approximately 80% compliant: model IDs, prompt sources, and DynamoDB table names are already env-var-driven. The remaining 20% — CDN origin URLs, Polly voice IDs, crisis-detection thresholds — must be extracted. Every string that a non-engineer should be able to change without a code deploy belongs in an env var or in the variant registry (Section 3).
Factor VI — Processes: Processes are stateless and share nothing. Lambda enforces this structurally, but the principle must be preserved when migrating to Docker or Kubernetes. Conversation state lives in DynamoDB (the ConversationStore), never in-process memory. This is already true for Silent Infinity and must remain true across all deployment targets.
Factor VIII — Concurrency: Scale out via the process model. Lambda auto-scales horizontally by design. When running under FastAPI/Docker, horizontal pod autoscaling in Kubernetes or fly.io's machine scaling replicates this behavior. The codebase must never assume singleton state.
Factor IX — Disposability: Fast startup, graceful shutdown. Lambda cold starts are already minimized. The FastAPI adapter (Section 6) must implement SIGTERM handling, draining in-flight requests before shutdown, ensuring the same disposability contract.
Factor XI — Logs as event streams: Logs are streams of time-ordered events, emitted to stdout, consumed by an aggregator. Silent Infinity already emits EMF-structured CloudWatch logs. The OpenTelemetry integration (Section 5) extends this to vendor-neutral OTLP, so that the same log stream can feed CloudWatch today and Grafana/Loki tomorrow with zero code change.
Factor X — Dev/prod parity: Keep development, staging, and production as similar as possible. The adapter pattern (Section 2, Section 6) enables local development using SQLite + local Whisper + local Kokoro TTS instead of DynamoDB + Transcribe + Polly, while exercising identical business-logic code paths. This closes the parity gap that has historically caused "works on my machine" production failures.
Alistair Cockburn's Hexagonal Architecture (2005), also called "Ports and Adapters," places the application's domain logic at the center of a hexagon. Each edge of the hexagon is a port — an interface that the domain exposes or consumes. External systems (databases, cloud APIs, HTTP servers, message queues) attach to these ports via adapters. The domain logic has no import of any external library; it only knows about its own interfaces.
Applied to Silent Infinity, the domain hexagon contains: system prompt construction, guardrail evaluation, crisis-pattern matching, conversation memory shape, response formatting, and pricing computation. These modules — system_prompt.py, guardrails.py, crisis_archive.py, pricing.py, feedback_monitor.py — are already written in pure Python with no AWS SDK imports. They constitute the existing healthy core.
The ports are: LLMProvider, STTProvider, TTSProvider, ConversationStore, ObjectStore, TraceExporter, FeatureFlagProvider. The adapters implement these ports for each concrete backend (Bedrock, DynamoDB, Polly, S3 on AWS; Anthropic direct, Postgres, ElevenLabs, Cloudflare R2 off-AWS). Swapping the cloud provider means swapping an adapter — the domain is untouched.
Robert C. Martin's dependency rule states: source code dependencies must point only inward. Outer layers (frameworks, databases, UI) depend on inner layers (use cases, domain entities). Inner layers must never import outer layers. This is the formalization of Cockburn's intuition.
For Silent Infinity, the dependency hierarchy is:
1. Entities — conversation turn, crisis event, variant assignment, cost record (pure dataclasses, zero dependencies)
2. Use Cases — process_turn(), evaluate_guardrails(), assign_variant(), record_telemetry() (imports only entities and port interfaces)
3. Interface Adapters — adapter implementations, API controllers, DynamoDB repositories (imports use cases and entities)
4. Frameworks/Drivers — Lambda handler, FastAPI app, Cognito middleware, DynamoDB SDK (imports adapters)
Violations of this rule are the primary source of portability friction. Every import boto3 in a use-case layer is a dependency-rule violation that must be refactored to an adapter call.
Pete Hodgson's canonical treatment of feature toggles (Martin Fowler's blog, 2017) classifies flags on two axes: longevity (transient vs. permanent) and dynamism (deploy-time vs. runtime). Silent Infinity requires two categories:
Hodgson's critical warning applies: toggle debt is real. Every flag that is not retired after its experiment concludes becomes a permanent branch in the code that accumulates cognitive load. The variant registry schema (Section 3) enforces a status field with a retired state and a cap of five simultaneous active experiments to manage this risk.
The Strategy Pattern (GoF, 1994) defines a family of algorithms, encapsulates each one, and makes them interchangeable. The client code selects a strategy at runtime without knowing its implementation. This is the object-oriented formalization of what the adapter layer does for Silent Infinity's LLM, STT, and TTS providers. bedrock_client.py and gemini_client.py are already nascent strategy implementations; the refactor formalizes them behind a common LLMProvider interface.
The Adapter Pattern converts the interface of a class into another interface clients expect. This is distinct from the Strategy Pattern in that it wraps an existing, unchangeable external API (Bedrock's invoke_model, Polly's synthesize_speech) behind the interface the domain expects. Every cloud-service wrapper in adapters/ is an Adapter in GoF terms.
Martin Fowler's Event Sourcing (2005) stores application state as an append-only log of immutable events rather than mutable rows. The current state of any aggregate is derived by replaying events from the beginning. For Silent Infinity, this means: every conversation turn is an immutable event record in innerverse-turn-events, never updated, only appended. Analytics queries replay events to compute aggregates (cost per variant, latency per model, crisis rate per prompt version). This enables retroactive analysis: if a new metric is invented after the fact, it can be computed by replaying historical events rather than being forever absent from pre-existing rows.
Greg Young's Command Query Responsibility Segregation (2010) separates write models (commands that change state) from read models (queries that return data). For Silent Infinity's telemetry pipeline: the write path (DynamoDB innerverse-turn-events, appended per turn) is optimized for high-throughput writes. The read path (DynamoDB Streams → Kinesis Firehose → S3 → Glue → Athena) is a separate materialized projection optimized for analytical queries. These two paths evolve independently.
OpenTelemetry (CNCF, 2019) is the vendor-neutral standard for distributed tracing, metrics, and logs. It defines a common SDK, a wire protocol (OTLP), and a collector that fans out to any backend. For Silent Infinity, OTel provides the escape hatch from AWS vendor lock-in on observability: the same instrumentation code emits to AWS X-Ray today and to Honeycomb, Jaeger, or Grafana Tempo tomorrow by changing the exporter configuration — zero code change.
---
The following modules in Silent Infinity are already vendor-agnostic. They contain business logic that belongs at the center of Cockburn's hexagon and must be preserved intact across all deployment targets:
system_prompt.py — loads prompt from a Markdown file, injects session context, constructs the full system message. Zero AWS dependencies. Portable as-is.guardrails.py — topic filtering, safe-messaging protocol compliance, output sanitization. Pure Python string and regex logic. Portable as-is.crisis_archive.py — crisis-pattern matching, severity-level computation (0–4), safe-exit phrase detection. Pure Python. Portable as-is.pricing.py — per-token cost computation, session cost accumulation, model-rate lookup. Pure Python. Portable as-is.feedback_monitor.py — rating collection, sentiment heuristics, longitudinal engagement scoring. Pure Python. Portable as-is.{role, content} dicts passed to the LLM. Pure Python. Portable as-is.The following are concrete AWS service calls that violate the hexagonal architecture boundary and must be wrapped behind interfaces:
DynamoDB → ConversationStore interface
Stores conversation history, session state, and turn events. Adapter targets: PostgreSQL (via psycopg3), SQLite (for local dev), Cloudflare KV (for edge-native deployment). The interface exposes: get_session(session_id), put_turn(turn_record), get_history(session_id, limit), delete_session(session_id).
Bedrock → LLMProvider interface
Invokes foundation models for response generation. Adapter targets: Anthropic direct API (anthropic Python SDK), OpenAI API, Ollama (local models), vLLM (self-hosted inference). The interface exposes: complete(messages, model_id, system_prompt, max_tokens, stream) → async generator of text chunks.
Polly → TTSProvider interface
Synthesizes speech from text. Adapter targets: ElevenLabs API, OpenAI TTS API, Kokoro (local open-weights model), Coqui TTS (self-hosted). The interface exposes: synthesize(text, voice_id, speed, format) → bytes (MP3 or Opus).
Transcribe → STTProvider interface
Transcribes audio to text. Adapter targets: OpenAI Whisper (API or local), Deepgram API, Groq Whisper (fast inference). The interface exposes: transcribe(audio_bytes, language, format) → TranscriptionResult(text, confidence, duration_ms).
Lambda → HTTPServer interface
Receives HTTP events and dispatches to handlers. Adapter targets: FastAPI (ASGI server for Docker/VPS), Cloudflare Workers (edge-native via Pyodide or Rust port), Fly.io (single binary). The interface exposes: register_route(method, path, handler), start(port).
CloudFront → CDN interface
Serves static assets and caches API responses. Adapter targets: Cloudflare (via Workers + Cache API), Fastly (via VCL configuration), Bunny CDN. The interface exposes: invalidate(paths), get_signed_url(key, ttl).
S3 → ObjectStore interface
Stores audio files, prompt Markdown files, and analytics exports. Adapter targets: Cloudflare R2 (S3-compatible API, zero egress fees), Backblaze B2 (S3-compatible), MinIO (self-hosted). The interface exposes: put(key, data, content_type), get(key), delete(key), presigned_url(key, ttl).
The refactoring rule is mechanical: search silent_infinity/ for any import boto3, import botocore, or direct instantiation of boto3.client(...) outside of adapters/. Each occurrence is a dependency-rule violation. The fix is: extract the call into a method on the relevant adapter, replace the call site with a call to the injected interface, and register the adapter in the dependency injection container at the composition root (Lambda handler or FastAPI startup).
---
Every dimension of Silent Infinity that influences user experience — which LLM model responds, which system prompt is active, which TTS voice speaks, how the compose box is positioned, how many starter topics appear, what the pricing tier looks like — is currently hardcoded as an env var or a Python constant. This means that comparing two versions of any dimension requires a full code deploy, a traffic split at the infrastructure layer (e.g., weighted Lambda aliases or CloudFront origin groups), and a manual join of CloudWatch logs to correlate outcomes with the variant. It is fragile, slow, and analytically weak.
The variant registry is a single source of truth for every flip-able choice in the product. It is a variants.py Python module (loaded at Lambda cold-start, cached in-process) backed by an innerverse-variants DynamoDB table (authoritative, writable at runtime via the admin dashboard API). Variants are assigned per-user-session at session initialization and recorded on every subsequent turn event, enabling clean per-variant cohort analysis with no post-hoc reconstruction.
@dataclass
class Variant:
id: str # "A", "B", "C", ... or "prompt-v3-sha-abc123"
category: VariantCategory # Enum — see 3.3
description: str # Human-readable: "Claude Haiku 4.5 — cost reduction test"
status: VariantStatus # Enum: experimental / staged / canary / production / retired
rollout_percent: int # 0–100; traffic fraction assigned this variant
target_cohort: TargetCohort # Enum: all / new_users / returning / paid / free
config: dict # Variant-specific config blob — see 3.4
created_at: datetime
created_by: str # "admin:harnoor" or "system:rollout-automation"
last_modified: datetime
parent_variant_id: str | None # For branching experiments off a prior variant
pass_criteria: dict # SLO thresholds that must hold for promotion
class VariantCategory(str, Enum):
LLM_MODEL = "LLM_MODEL"
SYSTEM_PROMPT = "SYSTEM_PROMPT"
CHAT_UI_LAYOUT = "CHAT_UI_LAYOUT"
COMPOSE_POSITION = "COMPOSE_POSITION"
GHOST_CHAR_STYLE = "GHOST_CHAR_STYLE"
RATING_VARIANT = "RATING_VARIANT"
VOICE_PROVIDER = "VOICE_PROVIDER"
VOICE_LLM = "VOICE_LLM"
STARTER_POOL_SIZE = "STARTER_POOL_SIZE"
TOPIC_HIERARCHY = "TOPIC_HIERARCHY"
MANDALA_FACE = "MANDALA_FACE"
PRICING_TIER_STRUCTURE = "PRICING_TIER_STRUCTURE"
Known variants per category (initial registry population):
| Category | Variants |
|---|---|
| LLM_MODEL | claude-sonnet-4-6 (production), claude-haiku-4-5 (staged), claude-opus-4-7 (experimental), llama-3-70b (experimental), mistral-large (experimental) |
| SYSTEM_PROMPT | v1-sha-baseline (retired), v2-sha-current (production), v3-sha-empathy-rewrite (staged) |
| CHAT_UI_LAYOUT | current (production), simplified (staged), sidebar (experimental), mobile-only (experimental) |
| COMPOSE_POSITION | bottom-pill (production), top-fixed (staged), inline (experimental) |
| GHOST_CHAR_STYLE | scramble (production), dots (staged), shimmer (experimental), none (experimental) |
| RATING_VARIANT | 40 variants registered (production pool, random selection) |
| VOICE_PROVIDER | polly-generative-ruth (production), polly-neural-ruth (staged), elevenlabs (experimental), kokoro (experimental) |
| VOICE_LLM | sonnet (production), haiku (staged), opus (experimental) |
| STARTER_POOL_SIZE | 7 (production), 5 (staged), 10 (experimental) |
| TOPIC_HIERARCHY | flat (production), drill-down (staged), tabs (experimental) |
| MANDALA_FACE | on (production), simple-orb (staged), none (experimental) |
| PRICING_TIER_STRUCTURE | v1 (production), v2-claude-aligned (staged) |
The config dict is category-specific and schema-validated at write time:
# LLM_MODEL config
{"model_id": "us.anthropic.claude-sonnet-4-6-20251101-v1:0",
"max_tokens": 1024,
"temperature": 0.7,
"cache_system_prompt": True}
# SYSTEM_PROMPT config
{"prompt_sha": "abc123def456",
"s3_key": "prompts/system_v3.md",
"version_label": "v3-empathy-rewrite"}
# VOICE_PROVIDER config
{"provider": "polly",
"voice_id": "Ruth",
"engine": "generative",
"output_format": "mp3",
"sample_rate": "24000"}
At session initialization, the variant assignment engine runs once and caches the assignment in the session record. The algorithm:
def assign_variants(user_id: str, session_id: str, user_profile: UserProfile) -> VariantAssignment:
assignments = {}
for category in VariantCategory:
eligible = [v for v in registry.active_variants(category)
if v.status != VariantStatus.RETIRED
and cohort_matches(v.target_cohort, user_profile)]
# Deterministic hash-based assignment: same user always gets same variant
# for the same registry state. Use session_id for session-level randomization.
bucket = hash(f"{session_id}:{category}") % 100
chosen = select_by_rollout(eligible, bucket)
assignments[category] = chosen.id
return VariantAssignment(session_id=session_id, assignments=assignments)
Critical invariant: The crisis path (crisis_flag_level >= 2) always uses production-default variants, regardless of active experiments. Crisis safety must never be A/B tested.
experimental (1%) → staged (5%) → canary (10–25%) → production (50–100%) → retired (0%)
↑ ↑
internal team only automated rollout
manual promotion + approval gate
Each promotion gate requires:
Demotion is automatic: if a canary variant triggers a crisis-detection regression, it is immediately demoted to experimental status and an alert fires to the admin Slack channel.
innerverse-variants
PK: VARIANT#{category}
SK: {variant_id}
Attributes: all Variant fields, serialized as DynamoDB AttributeMap
GSI: status-index (PK: status, SK: created_at) — for "show all active experiments" queries
TTL: none — variants are never auto-deleted; status="retired" is the tombstone
---
The admin dashboard is served at /admin/variants by the existing CloudFront distribution, routing to a dedicated Lambda function (or FastAPI router in Docker). Authentication is enforced by Amazon Cognito: users must belong to the admin group in the innerverse-user-pool, with MFA required for the admin group. The Cognito authorizer is attached to the API Gateway route; the Lambda never processes a request without a valid, MFA-verified JWT.
Every mutation (config edit, rollout slider change, promotion, demotion, rollback) is written to an innerverse-audit-log DynamoDB table with: timestamp, actor (Cognito sub), action, variant_id, before_state, after_state. The audit log is append-only; no mutation can delete or modify an audit record.
List View (/admin/variants):
A paginated table, grouped by category. Each row: variant_id, description, status (color-coded badge), rollout_percent (progress bar), p95_latency_7d (sparkline), cost_per_turn_7d (sparkline), satisfaction_delta_7d (sparkline vs. production default), return_rate_7d (sparkline). Columns are sortable. Quick-action buttons: Promote, Demote, Pause (set rollout to 0 without retiring), Edit.
Detail View (/admin/variants/{category}/{variant_id}):
[all, new_users, returning, paid, free]p95_latency_ms, cost_usd_per_turn, satisfaction_score, return_rate, crisis_flag_rate — all filtered to sessions assigned this variantCompare View (/admin/variants/compare?a={id}&b={id}):
Side-by-side metric comparison of two variants over a user-specified date range. Statistical significance indicator (two-proportion z-test for rates, Welch's t-test for latency means). A "promote winner" button is enabled when the test reaches p < 0.05 and sample size > 500 sessions per variant.
The dashboard is a React SPA (or Next.js page added to the existing frontend) with the following stack: React Query for data fetching (5-minute cache with manual invalidation on mutation), Recharts for sparklines and time series, Tailwind CSS for layout (reusing the existing design system), React JSON Schema Form for config blob editing with live validation. The dashboard is built as a separate route (/admin/*) that is code-split from the main app bundle and only loaded for authenticated admin users.
GET /admin/api/variants → list all variants (paginated, filterable)
GET /admin/api/variants/{category}/{id} → single variant detail + 7d metrics
PUT /admin/api/variants/{category}/{id} → update config or rollout_percent
POST /admin/api/variants/{category} → create new variant
POST /admin/api/variants/{category}/{id}/promote → advance status one stage
POST /admin/api/variants/{category}/{id}/demote → retreat status one stage
POST /admin/api/variants/{category}/{id}/rollback → set rollout=0, status=experimental
GET /admin/api/variants/compare?a=X&b=Y → side-by-side metrics comparison
GET /admin/api/audit-log?variant_id=X → audit trail for a variant
All endpoints return JSON; all mutations require a reason string in the request body (persisted to audit log). Idempotency keys are enforced on mutations to prevent double-writes.
---
Every conversation turn — text or voice — appends one record to the innerverse-turn-events DynamoDB table. The record is immutable. The schema:
@dataclass
class TurnEvent:
# Identity
turn_id: str # UUID v7 (time-sortable)
user_id_hash: str # SHA-256 of Cognito sub — never raw PII
conversation_id: str # UUID v4
session_id: str # UUID v4 (groups turns within one browser session)
timestamp: datetime # ISO 8601, UTC
# Variant snapshot (what was active for this turn)
active_variants: dict[str, str] # {category → variant_id} — full snapshot
# Latency breakdown (milliseconds)
latency_ms: LatencyBreakdown # see below
# Token accounting
tokens_input: int
tokens_output: int
tokens_cache_read: int
tokens_cache_write: int
# Cost (computed by pricing.py)
cost_usd: Decimal
# Model and prompt identity
model_id: str # full model ARN or API model name
prompt_sha: str # SHA-256 of system prompt content at turn time
voice_id: str | None # if voice turn
# Page context
page_url: str # path only, no query params (privacy)
referrer_class: str # "direct" / "organic" / "paid" / "internal"
device_class: str # "mobile" / "tablet" / "desktop"
browser: str # "safari" / "chrome" / "firefox" / "other"
region: str # AWS region or ISO country code
# Safety signals
crisis_flag_level: int # 0–4 (0=none, 4=immediate risk)
guardrail_triggered: bool
guardrail_rule_id: str | None
# User feedback (if collected on this turn)
rating: float | None # 1.0–5.0
rating_variant_id: str | None # which rating UI variant was shown
feedback_text: str | None # free-text (if submitted)
# Error capture
error_class: str | None # exception class name
error_message: str | None # sanitized message (no PII)
retry_count: int # number of retries before success or failure
@dataclass
class LatencyBreakdown:
total_ms: int
stt_ms: int | None # None for text turns
llm_ttft_ms: int # time to first token from LLM
llm_complete_ms: int # time to last token from LLM
tts_ms: int | None # None for text turns
network_ms: int # client-estimated round-trip (sent from frontend)
guardrail_ms: int
db_read_ms: int
db_write_ms: int
DynamoDB innerverse-turn-events
│
▼ (DynamoDB Streams, real-time, ~200ms lag)
Kinesis Firehose
│
▼ (Parquet, partitioned by date/hour)
S3 Bronze Tier (raw, append-only, no schema enforcement)
│
▼ (AWS Glue ETL job, hourly)
S3 Silver Tier (validated, deduplicated, Parquet columnar)
│
▼ (Glue Catalog + Athena)
Gold Views (pre-aggregated per-variant metrics, materialized daily)
│
▼
Admin Dashboard API reads Gold views for sparklines
Athena ad-hoc queries read Silver tier for deep analysis
Athena query example — p95 latency per LLM_MODEL variant over 7 days:
SELECT
active_variants['LLM_MODEL'] AS model_variant,
APPROX_PERCENTILE(latency_ms.total_ms, 0.95) AS p95_latency,
COUNT(*) AS turn_count,
AVG(cost_usd) AS avg_cost_usd
FROM silver.turn_events
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
AND crisis_flag_level = 0 -- exclude crisis turns from A/B analysis
GROUP BY 1
ORDER BY 2;
AWS X-Ray traces every Lambda invocation end-to-end. The trace spans are:
turn.guardrails — guardrail evaluationturn.llm — Bedrock invocation (sub-spans: llm.ttft, llm.stream_complete)turn.stt — Transcribe call (voice turns only)turn.tts — Polly synthesis (voice turns only)turn.db_read — DynamoDB history fetchturn.db_write — DynamoDB turn event writeturn.variant_assignment — variant registry lookupX-Ray annotations (indexed, filterable in console):
xray_recorder.put_annotation("variant_llm_model", assignment["LLM_MODEL"])
xray_recorder.put_annotation("variant_system_prompt", assignment["SYSTEM_PROMPT"])
xray_recorder.put_annotation("crisis_flag_level", str(crisis_level))
xray_recorder.put_annotation("model_id", model_id)
This allows queries like "show me all traces where variant_llm_model = haiku-staged AND crisis_flag_level >= 2" directly in the X-Ray console, enabling rapid debugging of variant regressions.
The OTel SDK is initialized at Lambda cold-start. Exporters are configured via env var TRACE_EXPORTER:
xray → AWS X-Ray exporter (current)otlp → OTLP gRPC exporter (for Jaeger, Tempo, Honeycomb)stdout → JSON to stdout (for local dev)The instrumentation code is identical regardless of exporter. When Silent Infinity migrates off AWS, changing TRACE_EXPORTER=otlp and setting OTEL_EXPORTER_OTLP_ENDPOINT=https://tempo.internal:4317 completes the observability migration with zero code change.
Step Functions are deliberately scoped to two workflows where the overhead is justified:
1. Nightly Analytics Rollup — orchestrates: Glue ETL (Silver refresh) → Athena Gold view refresh → SNS alert if any variant SLO breached. A state machine with retry logic and DLQ for failed steps.
2. Variant Promotion Workflow — orchestrates: collect 72-hour sample → run statistical significance test → if p < 0.05 and SLOs pass, request human approval via SNS email → wait for approval token → execute promotion (update DDB variant record) → notify Slack. The approval gate is a Step Functions waitForTaskToken pattern. This workflow is NOT in the per-turn path; it runs asynchronously on a schedule.
Per-turn flow remains Lambda-direct. Step Functions overhead (~100ms state transition) would add unacceptable latency if inserted into the hot path.
---
All deployment-variant configuration is expressed as environment variables. The complete set:
# Infrastructure layer
DATABASE_URL=dynamodb://us-east-1/innerverse-sessions # or postgres://user:pass@host/db
LLM_PROVIDER=bedrock # bedrock / anthropic / openai / ollama
LLM_MODEL_ID=us.anthropic.claude-sonnet-4-6-20251101-v1:0 # provider-specific model ID
STT_PROVIDER=transcribe # transcribe / whisper / deepgram / groq
TTS_PROVIDER=polly # polly / elevenlabs / kokoro / openai
STORAGE_BACKEND=s3 # s3 / r2 / b2 / local
CDN_BACKEND=cloudfront # cloudfront / cloudflare / fastly / bunny
TRACE_EXPORTER=xray # xray / otlp / stdout
# Provider-specific credentials (all via env, never hardcoded)
AWS_REGION=us-east-1
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
ELEVENLABS_API_KEY=...
DEEPGRAM_API_KEY=...
# Variant registry
VARIANT_REGISTRY_TABLE=innerverse-variants
VARIANT_CACHE_TTL_SECONDS=60
# Feature flags (deploy-time)
FEATURE_VOICE_ENABLED=true
FEATURE_MANDALA_ENABLED=true
FEATURE_PRICING_V2=false
silent_infinity/
├── domain/ # Pure business logic — zero AWS imports
│ ├── entities.py # Dataclasses: Turn, Session, CrisisEvent, VariantAssignment
│ ├── use_cases/
│ │ ├── process_turn.py
│ │ ├── evaluate_guardrails.py
│ │ ├── assign_variant.py
│ │ └── record_telemetry.py
│ └── interfaces/ # Abstract base classes (the "ports")
│ ├── llm_provider.py # LLMProvider ABC
│ ├── stt_provider.py # STTProvider ABC
│ ├── tts_provider.py # TTSProvider ABC
│ ├── conversation_store.py # ConversationStore ABC
│ ├── object_store.py # ObjectStore ABC
│ └── trace_exporter.py # TraceExporter ABC
│
├── adapters/ # Concrete implementations of interfaces
│ ├── llm/
│ │ ├── bedrock.py # BedrockLLMProvider(LLMProvider)
│ │ ├── anthropic_direct.py # AnthropicLLMProvider(LLMProvider)
│ │ ├── openai.py # OpenAILLMProvider(LLMProvider)
│ │ └── ollama.py # OllamaLLMProvider(LLMProvider)
│ ├── stt/
│ │ ├── transcribe.py # TranscribeSTTProvider(STTProvider)
│ │ ├── whisper.py # WhisperSTTProvider(STTProvider) — local or API
│ │ └── deepgram.py # DeepgramSTTProvider(STTProvider)
│ ├── tts/
│ │ ├── polly.py # PollyTTSProvider(TTSProvider)
│ │ ├── elevenlabs.py # ElevenLabsTTSProvider(TTSProvider)
│ │ └── kokoro.py # KokoroTTSProvider(TTSProvider) — local
│ ├── storage/
│ │ ├── s3.py # S3ObjectStore(ObjectStore)
│ │ ├── r2.py # R2ObjectStore(ObjectStore) — S3-compatible
│ │ └── local_fs.py # LocalFSObjectStore(ObjectStore)
│ ├── db/
│ │ ├── dynamodb.py # DynamoDBConversationStore(ConversationStore)
│ │ ├── postgres.py # PostgresConversationStore(ConversationStore)
│ │ └── sqlite.py # SQLiteConversationStore(ConversationStore)
│ └── http_server/
│ ├── lambda_handler.py # AWS Lambda entry point
│ ├── fastapi_app.py # FastAPI ASGI app (Docker/VPS)
│ └── cloudflare_worker.py # Cloudflare Workers adapter (future)
│
├── modules/ # Existing portable modules (unchanged)
│ ├── system_prompt.py
│ ├── guardrails.py
│ ├── crisis_archive.py
│ ├── pricing.py
│ ├── feedback_monitor.py
│ └── variants.py # NEW — variant registry + assignment engine
│
└── composition_root.py # Reads env vars, instantiates adapters, wires DI
The composition root is the single location where concrete adapter implementations are selected based on environment variables and injected into use cases. It runs once at cold-start (Lambda) or at application startup (FastAPI):
# composition_root.py
def build_container() -> Container:
llm = {
"bedrock": BedrockLLMProvider,
"anthropic": AnthropicLLMProvider,
"openai": OpenAILLMProvider,
"ollama": OllamaLLMProvider,
}[os.environ["LLM_PROVIDER"]]()
db = {
"dynamodb://": DynamoDBConversationStore,
"postgres://": PostgresConversationStore,
"sqlite://": SQLiteConversationStore,
}[url_scheme(os.environ["DATABASE_URL"])](os.environ["DATABASE_URL"])
# ... same pattern for STT, TTS, ObjectStore, TraceExporter
return Container(llm=llm, db=db, stt=stt, tts=tts, storage=storage, tracer=tracer)
This is the only file that imports both domain/ and adapters/. Every other file in the project imports either domain interfaces (for domain code) or adapters (for adapter code) — never both.
aws-lambda (current): adapters/http_server/lambda_handler.py is the entry point. No changes needed. Cold-start time target: < 800ms.
docker-compose (local dev and VPS): adapters/http_server/fastapi_app.py exposes the same endpoints. docker-compose.yml sets all env vars and mounts a local SQLite DB and local filesystem for object storage. A developer can run the full system locally with docker-compose up in under two minutes, with no AWS credentials required.
# docker-compose.yml (abridged)
services:
api:
build: .
environment:
LLM_PROVIDER: ollama
DATABASE_URL: sqlite:///./dev.db
STT_PROVIDER: whisper
TTS_PROVIDER: kokoro
STORAGE_BACKEND: local
TRACE_EXPORTER: stdout
ports:
- "8000:8000"
ollama:
image: ollama/ollama
volumes:
- ollama_data:/root/.ollama
kubernetes helm chart: A Helm chart wraps the Docker image with a Deployment, HPA, ConfigMap (env vars), and ExternalSecret (pulling API keys from AWS Secrets Manager or Vault). Supports both AWS EKS and bare-metal clusters.
fly.io: Single binary deployment using the FastAPI adapter. fly.toml sets env vars; Fly's persistent volumes replace S3 for small deployments. Fly's global anycast edge reduces latency without CloudFront.
cloudflare-workers (eventual): Requires either a Rust port of the hot path (recommended for latency-critical voice turns) or a Python-to-WASM compilation via Pyodide. The adapter interface is pre-designed to support this target; the adapter implementation is deferred.
This is the operational runbook for migrating Silent Infinity off AWS in the event of a vendor decision, cost optimization, or compliance requirement.
Day 1 — Data Replication:
innerverse-sessions and innerverse-turn-eventsDATABASE_URL env var to Postgres URL in a shadow Lambda (not yet serving traffic)Day 2 — CDN and Static Assets:
Day 3 — LLM Provider Swap:
AnthropicLLMProvider adapterLLM_PROVIDER=anthropic in a canary Lambda alias (10% traffic)LLM_PROVIDER=anthropic on 100% of trafficDay 5 — API Server Migration:
Day 7 — Postgres Primary:
DATABASE_URL to Postgres URL on all trafficDay 30 — AWS Footprint Reduction:
---
This architecture synthesizes eleven bodies of prior work, each contributing a load-bearing principle.
Wiggins (2012) — "The Twelve-Factor App" establishes the operational contract for cloud-native software: config from environment, stateless processes, disposability, dev/prod parity. Silent Infinity's portability layer (Section 6) is a direct implementation of Factors III, VI, IX, X, and XI. Available at https://12factor.net.
Cockburn (2005) — "Hexagonal Architecture" provides the structural principle for the adapter layer: domain logic is shielded from all external dependencies by explicit interface ports. The domain/interfaces/ module hierarchy in Section 6 is Cockburn's hexagon concretized.
Martin (2012) — "Clean Architecture" formalizes the dependency rule and the layer hierarchy (entities → use cases → interface adapters → frameworks). The composition_root.py pattern (Section 6.3) is Martin's composition root — the single permitted location for breaking the dependency rule.
Hodgson/Fowler (2017) — "Feature Toggles" on martinfowler.com provides the taxonomy of toggle types, the warning about toggle debt, and the recommended management strategies. The five-experiment cap and retired status in the variant schema (Section 3) directly address Hodgson's toggle debt risk.
Gang of Four (1994) — "Design Patterns" — specifically the Strategy and Adapter patterns (Chapter 4) — provides the object-oriented formalization of the interchangeable-provider design. The LLMProvider ABC and its concrete implementations are Strategy + Adapter in GoF terms.
Fowler (2005) — "Event Sourcing" — the innerverse-turn-events table is an event store in Fowler's sense: immutable, append-only, replayable. The Bronze → Silver analytics pipeline (Section 5.2) is a projection derived from event replay.
Young (2010) — "CQRS Documents" — the separation of the DynamoDB write path from the Athena/Glue read path (Section 5.2) is CQRS applied to analytics.
CNCF (2019) — "OpenTelemetry Specification" — the OTel SDK integration (Section 5.4) implements the CNCF specification, ensuring vendor-neutral telemetry portability.
Burns (2016) — "Designing Distributed Systems" — the sidecar and adapter patterns described by Burns directly inform the OTel collector architecture and the Cloudflare Workers proxy pattern in the migration playbook.
Kleppmann (2017) — "Designing Data-Intensive Applications" — the Bronze/Silver/Gold pipeline architecture mirrors Kleppmann's batch processing and stream processing patterns from Part III of DDIA. The Kinesis Firehose → S3 → Glue → Athena pipeline is a lambda architecture implementation.
Nygard (2007) — "Release It!" — the automatic demotion trigger on crisis regression (Section 3.6) is a circuit-breaker pattern in Nygard's sense: the system self-heals by cutting off a failing variant before it affects the full user population.
---
| Component | Estimate | Notes |
|---|---|---|
| Variant registry (variants.py + DDB table + CRUD API) | 3 days | Schema + assignment algorithm + CRUD |
| Admin dashboard (React + API) | 4 days | List, detail, compare views + auth |
| Adapter interfaces (all 7 ABCs) | 1 day | Python ABCs + type signatures |
| Refactor bedrock_client.py to BedrockLLMProvider | 1 day | Existing code, clean wrapper |
| Refactor voice.py to PollyTTSProvider + TranscribeSTTProvider | 1 day | Existing code |
| AnthropicLLMProvider adapter (proof of concept) | 2 days | First non-AWS LLM adapter |
| OpenAILLMProvider adapter | 2 days | |
| OllamaLLMProvider adapter | 2 days | |
| PostgresConversationStore adapter | 3 days | Schema design + migration tooling |
| SQLiteConversationStore adapter | 1 day | Local dev only |
| ElevenLabsTTSProvider adapter | 2 days | |
| KokoroTTSProvider adapter | 2 days | Local model integration |
| WhisperSTTProvider adapter | 2 days | API + local model variants |
| DeepgramSTTProvider adapter | 2 days | |
| OpenTelemetry SDK + X-Ray exporter | 2 days | |
| OTel OTLP exporter | 1 day | |
| DynamoDB Streams → Kinesis → S3 → Glue pipeline | 2 days | |
| Docker + FastAPI deployment target | 1 day | |
| docker-compose local dev environment | 1 day | |
| Kubernetes Helm chart | 3 days | |
| Fly.io deployment target | 2 days | |
| Cloudflare Workers adapter (future) | 5 days | Defer |
| Step Functions: nightly rollup | 1 day | |
| Step Functions: variant promotion workflow | 2 days | |
| Total (excluding Cloudflare Workers) | ~42 days = ~8.5 engineer-weeks | |
| Priority subset (Sections 9 Week 1+2) | ~10 days = 2 engineer-weeks | |
Current AWS monthly spend (estimated baseline):
Additional costs from this architecture:
Cost savings from portability (if migrated to Fly.io + Anthropic direct):
Each adapter requires approximately 0.5 days/month of maintenance (API version updates, authentication changes, model deprecations). At the full 12-adapter build:
---
Day 1–2: variants.py + DynamoDB table
VariantCategory, VariantStatus, TargetCohort enumsVariant dataclass and JSON serializationinnerverse-variants DynamoDB table with GSIVariantRegistry class: get_variant(), list_active(), assign_variants()Day 3: CRUD API
/admin/api/variants/** endpointsDay 4–5: Tag existing turn events with active_variants
active_variants field to the EMF log structure in the Lambda handleractive_variants fieldDay 6–7 (weekend stretch goal): Admin dashboard v0
/admin/variants — list view onlyDay 8–9: Adapter interfaces + LLM refactor
domain/interfaces/*.py ABCs with full type signaturesbedrock_client.py → adapters/llm/bedrock.py implementing LLMProvidercomposition_root.py with LLM_PROVIDER env var selectionLLM_PROVIDER=bedrock behaves identically to currentDay 10–11: Anthropic direct adapter (proof of concept)
adapters/llm/anthropic_direct.py using anthropic Python SDKLLM_PROVIDER=anthropic locally (docker-compose + SQLite)Day 12: Voice adapter refactor
voice.py → adapters/tts/polly.py + adapters/stt/transcribe.pyTTSProvider and STTProvider ABCsTTS_PROVIDER and STT_PROVIDER env varsDay 13–14: OpenTelemetry SDK
opentelemetry-sdk, opentelemetry-exporter-otlp, aws-opentelemetry-distroprocess_turn.py use case with OTel spansTRACE_EXPORTER env var selection in composition root---
Risk: Building the full adapter set, Kubernetes Helm chart, and Cloudflare Workers support before they are needed creates a large maintenance surface with no immediate return.
Mitigation: Strictly implement only adapters that enable a capability we need today or within the next 90 days. The priority order: (1) Anthropic direct (LLM cost leverage), (2) PostgreSQL (Postgres is cheaper than DynamoDB at scale and enables richer queries), (3) FastAPI/Docker (local dev parity), (4) everything else. Kubernetes, Cloudflare Workers, and Fly.io are documented but not built until a concrete migration decision is made.
Risk: With 12 variant categories and dozens of variants per category, the combinatorial space of simultaneous experiments grows exponentially. Analyzing a session with 12 active variants simultaneously is statistically intractable (insufficient sample size per cell).
Mitigation: Hard cap of 5 simultaneous non-production experiments (enforced by the admin dashboard: the "create variant" button is disabled when 5 experiments are active). Experiments are sequential, not simultaneous, wherever possible. Categories are grouped: model and prompt experiments run together (they interact); UI experiments run in isolation.
Risk: An experimental variant (e.g., a simplified UI layout) could inadvertently affect the crisis detection flow, causing a safety regression in production.
Mitigation: The variant assignment engine has a hard override: any turn where crisis_flag_level >= 2 immediately switches to production-default variants for all categories. This override is unit-tested and integration-tested in every deploy. Crisis detection regression is a P0 incident trigger regardless of variant status.
Risk: With 12 categories and 50+ variants, the number of unique active_variants combinations in the turn-events table could reach thousands, making Athena queries expensive and partition pruning ineffective.
Mitigation: The analytics pipeline groups turn events by single-dimension variant analysis (one category at a time), not full cross-product. The Gold aggregation layer pre-computes per-category-per-variant metrics, not per-combination metrics. Athena query cost is bounded by Silver tier Parquet compression and partition pruning by date. Estimated cost: < $5/month for 1M turns/month.
Risk: The admin dashboard exposes production variant controls and audit logs. A compromised admin account could roll back safety-critical variants or expose user behavior data.
Mitigation: Cognito admin group with MFA enforced at the user pool level (cannot be bypassed). Every mutation requires a human-readable reason string (audit trail). Rollout changes > 25 percentage points require a confirmation dialog. The audit log is append-only (no admin can delete audit records). Critical variants (crisis-related prompt, guardrails) have an additional confirmation step with a 5-minute cooldown before taking effect.
Risk: The AnthropicLLMProvider and BedrockLLMProvider adapters, while implementing the same interface, may exhibit subtle behavioral differences (streaming format differences, error code differences, token counting differences) that cause silent failures in the domain layer.
Mitigation: The LLMProvider interface includes a health_check() method and a validate_response() method. Integration tests run against each adapter with a standardized test suite of prompts and validate that responses meet the same behavioral contract. CI/CD runs this test suite against both adapters on every push. Any behavioral divergence fails the build.
---
LaunchDarkly is the commercial gold standard for feature-flag infrastructure. Its architecture (flag rules engine, targeting by user attributes, real-time streaming of flag updates via Server-Sent Events, and an audit log) directly informed the variant registry design in Section 3. The LaunchDarkly engineering blog (2020–2024) documents their approach to flag targeting, gradual rollouts, and experiment analysis at scale.
Unleash is the leading open-source feature-flag server (Go backend, React admin UI). It implements the OpenFeature standard and supports the same lifecycle (variants, gradual rollouts, activation strategies by cohort). Silent Infinity's variant registry is a purpose-built subset of Unleash's model, specialized for the product's categories. Deploying Unleash OSS as a backend for the variant registry is a viable alternative to the bespoke DynamoDB implementation — trade-off: operational overhead vs. richer UI and SDK ecosystem.
OpenFeature (CNCF, 2022) is a vendor-neutral standard for feature-flag evaluation, analogous to OpenTelemetry for observability. The FeatureFlagProvider interface in Silent Infinity's adapter layer is designed to be compatible with the OpenFeature provider spec, enabling a drop-in swap from the bespoke variant registry to an OpenFeature-compatible backend (LaunchDarkly, Unleash, Flagsmith) if needed.
Netflix A/B Testing at Scale — Netflix's experiment platform (documented via engineering blog, 2016–2022) handles thousands of simultaneous experiments across hundreds of millions of users. Key lessons applied here: (1) deterministic hash-based assignment (same user always in same bucket for same experiment), (2) network effects awareness (users who share households may influence each other), (3) statistical significance tooling built into the admin dashboard. Netflix's Raven framework is the closest architectural analogue to the admin dashboard + analytics pipeline described in Section 4 and Section 5.
Stripe's Experiment Framework — documented via Stripe Engineering Blog (2020), Stripe's experiment framework emphasizes clean separation between experiment assignment (at request time, deterministic) and experiment analysis (async, in a data warehouse). The active_variants field on the TurnEvent record follows Stripe's pattern of snapshotting the full experiment state at the time of the event, enabling retrospective analysis without requiring joins against a separate assignment log.
Google's Overlapping Experiment Infrastructure — Kohavi et al. (2013) "Online Controlled Experiments at Large Scale" documents Google's approach to running overlapping experiments across multiple dimensions simultaneously, using orthogonal layers to avoid interaction effects. The variant category system in Section 3 implements a simplified version of Google's layer model: each category is an independent layer, and experiments within a layer are mutually exclusive.
Honeycomb.io's "Observability-Driven Development" (Majors, Fong-Jones, Miranda, 2022) advocates for high-cardinality event-based observability as opposed to pre-aggregated metrics. The TurnEvent schema in Section 5.1 implements this philosophy: every turn is a rich, high-cardinality event with all context attached, enabling arbitrary slicing and dicing in Athena without pre-defining metrics in advance.
AWS X-Ray and OpenTelemetry Integration — AWS's own documentation (2023) recommends using the AWS Distro for OpenTelemetry (ADOT) as the preferred way to instrument Lambda functions, enabling simultaneous export to X-Ray and any OTel-compatible backend. Section 5.4 follows this recommendation.
---
# domain/interfaces/llm_provider.py
from abc import ABC, abstractmethod
from typing import AsyncIterator
class LLMProvider(ABC):
@abstractmethod
async def complete(
self,
messages: list[dict],
model_id: str,
system_prompt: str,
max_tokens: int,
temperature: float,
stream: bool = True,
) -> AsyncIterator[str]: ...
@abstractmethod
async def health_check(self) -> bool: ...
@abstractmethod
def token_count(self, text: str) -> int: ...
# domain/interfaces/conversation_store.py
from abc import ABC, abstractmethod
class ConversationStore(ABC):
@abstractmethod
async def get_history(self, session_id: str, limit: int) -> list[dict]: ...
@abstractmethod
async def put_turn(self, turn_event: "TurnEvent") -> None: ...
@abstractmethod
async def get_session(self, session_id: str) -> dict | None: ...
@abstractmethod
async def delete_session(self, session_id: str) -> None: ...
# domain/interfaces/tts_provider.py
from abc import ABC, abstractmethod
class TTSProvider(ABC):
@abstractmethod
async def synthesize(
self,
text: str,
voice_id: str,
speed: float,
output_format: str,
) -> bytes: ...
# domain/interfaces/stt_provider.py
from abc import ABC, abstractmethod
from dataclasses import dataclass
@dataclass
class TranscriptionResult:
text: str
confidence: float
duration_ms: int
class STTProvider(ABC):
@abstractmethod
async def transcribe(
self,
audio_bytes: bytes,
language: str,
format: str,
) -> TranscriptionResult: ...
---
VariantsTable:
Type: AWS::DynamoDB::Table
Properties:
TableName: innerverse-variants
BillingMode: PAY_PER_REQUEST
AttributeDefinitions:
- AttributeName: pk
AttributeType: S
- AttributeName: sk
AttributeType: S
- AttributeName: status
AttributeType: S
- AttributeName: created_at
AttributeType: S
KeySchema:
- AttributeName: pk
KeyType: HASH
- AttributeName: sk
KeyType: RANGE
GlobalSecondaryIndexes:
- IndexName: status-index
KeySchema:
- AttributeName: status
KeyType: HASH
- AttributeName: created_at
KeyType: RANGE
Projection:
ProjectionType: ALL
PointInTimeRecoverySpecification:
PointInTimeRecoveryEnabled: true
SSESpecification:
SSEEnabled: true
---
End of Document
Silent Infinity — Modularity, Portability, and Variant Architecture — v1.0 — 2026-04-21
Prepared by SCOUT / TITAN Research Arm
Word count: ~7,200 words