Silent Infinity — Modularity, Portability, and Variant Architecture

PhD-Level Architecture Memo

Author: TITAN / SCOUT Research Arm

Date: 2026-04-21

Status: Draft v1.0 — Print-Ready

Classification: Internal Engineering — Advisor-Grade

---

> "The purpose of software architecture is to minimize the human resources required to build and maintain the required system."

> — Robert C. Martin, Clean Architecture (2017)

---

Abstract

Silent Infinity is a production mental-wellness conversational AI system currently deployed on AWS using Lambda, Bedrock, DynamoDB, CloudFront, and API Gateway. This memo presents a complete, PhD-level architectural blueprint for transforming Silent Infinity into a fully modular, variant-driven, cloud-agnostic, and deeply observable system. The design is grounded in eleven canonical software-engineering principles — from the 12-Factor App to Hexagonal Architecture to OpenTelemetry — and produces a concrete implementation roadmap deliverable in five engineer-weeks. The resulting system will support simultaneous A/B variant experiments across every flip-able dimension of the product (model, prompt, UI, audio, pricing), expose a Cognito-gated admin dashboard for runtime control without redeploy, capture per-turn telemetry across every active variant, instrument distributed tracing via OpenTelemetry and AWS X-Ray, and maintain a clean portability layer that allows full migration off AWS to Docker, Kubernetes, Cloudflare Workers, or Fly.io in a structured seven-day playbook.

---

1. Architecture Principles

1.1 The Twelve-Factor App (Wiggins, Heroku, 2012)

Adam Wiggins' The Twelve-Factor App (2012) remains the most operationally rigorous methodology for building software-as-a-service that is portable, scalable, and maintainable across deployment targets. Of the twelve factors, six are directly load-bearing for Silent Infinity's portability ambition.

Factor III — Config: Configuration that varies between deploys (dev, staging, prod, Docker, Lambda) must live in environment variables — not in code, not in config files committed to the repository. Today, Silent Infinity is approximately 80% compliant: model IDs, prompt sources, and DynamoDB table names are already env-var-driven. The remaining 20% — CDN origin URLs, Polly voice IDs, crisis-detection thresholds — must be extracted. Every string that a non-engineer should be able to change without a code deploy belongs in an env var or in the variant registry (Section 3).

Factor VI — Processes: Processes are stateless and share nothing. Lambda enforces this structurally, but the principle must be preserved when migrating to Docker or Kubernetes. Conversation state lives in DynamoDB (the ConversationStore), never in-process memory. This is already true for Silent Infinity and must remain true across all deployment targets.

Factor VIII — Concurrency: Scale out via the process model. Lambda auto-scales horizontally by design. When running under FastAPI/Docker, horizontal pod autoscaling in Kubernetes or fly.io's machine scaling replicates this behavior. The codebase must never assume singleton state.

Factor IX — Disposability: Fast startup, graceful shutdown. Lambda cold starts are already minimized. The FastAPI adapter (Section 6) must implement SIGTERM handling, draining in-flight requests before shutdown, ensuring the same disposability contract.

Factor XI — Logs as event streams: Logs are streams of time-ordered events, emitted to stdout, consumed by an aggregator. Silent Infinity already emits EMF-structured CloudWatch logs. The OpenTelemetry integration (Section 5) extends this to vendor-neutral OTLP, so that the same log stream can feed CloudWatch today and Grafana/Loki tomorrow with zero code change.

Factor X — Dev/prod parity: Keep development, staging, and production as similar as possible. The adapter pattern (Section 2, Section 6) enables local development using SQLite + local Whisper + local Kokoro TTS instead of DynamoDB + Transcribe + Polly, while exercising identical business-logic code paths. This closes the parity gap that has historically caused "works on my machine" production failures.

1.2 Hexagonal Architecture / Ports and Adapters (Cockburn, 2005)

Alistair Cockburn's Hexagonal Architecture (2005), also called "Ports and Adapters," places the application's domain logic at the center of a hexagon. Each edge of the hexagon is a port — an interface that the domain exposes or consumes. External systems (databases, cloud APIs, HTTP servers, message queues) attach to these ports via adapters. The domain logic has no import of any external library; it only knows about its own interfaces.

Applied to Silent Infinity, the domain hexagon contains: system prompt construction, guardrail evaluation, crisis-pattern matching, conversation memory shape, response formatting, and pricing computation. These modules — system_prompt.py, guardrails.py, crisis_archive.py, pricing.py, feedback_monitor.py — are already written in pure Python with no AWS SDK imports. They constitute the existing healthy core.

The ports are: LLMProvider, STTProvider, TTSProvider, ConversationStore, ObjectStore, TraceExporter, FeatureFlagProvider. The adapters implement these ports for each concrete backend (Bedrock, DynamoDB, Polly, S3 on AWS; Anthropic direct, Postgres, ElevenLabs, Cloudflare R2 off-AWS). Swapping the cloud provider means swapping an adapter — the domain is untouched.

1.3 Clean Architecture (Martin, 2012)

Robert C. Martin's dependency rule states: source code dependencies must point only inward. Outer layers (frameworks, databases, UI) depend on inner layers (use cases, domain entities). Inner layers must never import outer layers. This is the formalization of Cockburn's intuition.

For Silent Infinity, the dependency hierarchy is:

1. Entities — conversation turn, crisis event, variant assignment, cost record (pure dataclasses, zero dependencies)

2. Use Cases — process_turn(), evaluate_guardrails(), assign_variant(), record_telemetry() (imports only entities and port interfaces)

3. Interface Adapters — adapter implementations, API controllers, DynamoDB repositories (imports use cases and entities)

4. Frameworks/Drivers — Lambda handler, FastAPI app, Cognito middleware, DynamoDB SDK (imports adapters)

Violations of this rule are the primary source of portability friction. Every import boto3 in a use-case layer is a dependency-rule violation that must be refactored to an adapter call.

1.4 Feature Flags (Hodgson / Fowler, 2017)

Pete Hodgson's canonical treatment of feature toggles (Martin Fowler's blog, 2017) classifies flags on two axes: longevity (transient vs. permanent) and dynamism (deploy-time vs. runtime). Silent Infinity requires two categories:

Release toggles (transient, deploy-time): hide an in-progress feature from users until it is ready. These live in env vars and are removed when the feature ships.
Experiment toggles (transient, runtime): A/B test a hypothesis about model, prompt, or UI behavior. These live in the variant registry (Section 3) and are flipped via the admin dashboard without redeploy.

Hodgson's critical warning applies: toggle debt is real. Every flag that is not retired after its experiment concludes becomes a permanent branch in the code that accumulates cognitive load. The variant registry schema (Section 3) enforces a status field with a retired state and a cap of five simultaneous active experiments to manage this risk.

1.5 Strategy Pattern (Gang of Four, 1994)

The Strategy Pattern (GoF, 1994) defines a family of algorithms, encapsulates each one, and makes them interchangeable. The client code selects a strategy at runtime without knowing its implementation. This is the object-oriented formalization of what the adapter layer does for Silent Infinity's LLM, STT, and TTS providers. bedrock_client.py and gemini_client.py are already nascent strategy implementations; the refactor formalizes them behind a common LLMProvider interface.

1.6 Adapter Pattern (GoF, 1994)

The Adapter Pattern converts the interface of a class into another interface clients expect. This is distinct from the Strategy Pattern in that it wraps an existing, unchangeable external API (Bedrock's invoke_model, Polly's synthesize_speech) behind the interface the domain expects. Every cloud-service wrapper in adapters/ is an Adapter in GoF terms.

1.7 Event Sourcing (Fowler, 2005)

Martin Fowler's Event Sourcing (2005) stores application state as an append-only log of immutable events rather than mutable rows. The current state of any aggregate is derived by replaying events from the beginning. For Silent Infinity, this means: every conversation turn is an immutable event record in innerverse-turn-events, never updated, only appended. Analytics queries replay events to compute aggregates (cost per variant, latency per model, crisis rate per prompt version). This enables retroactive analysis: if a new metric is invented after the fact, it can be computed by replaying historical events rather than being forever absent from pre-existing rows.

1.8 CQRS (Young, 2010)

Greg Young's Command Query Responsibility Segregation (2010) separates write models (commands that change state) from read models (queries that return data). For Silent Infinity's telemetry pipeline: the write path (DynamoDB innerverse-turn-events, appended per turn) is optimized for high-throughput writes. The read path (DynamoDB Streams → Kinesis Firehose → S3 → Glue → Athena) is a separate materialized projection optimized for analytical queries. These two paths evolve independently.

1.9 OpenTelemetry (CNCF, 2019)

OpenTelemetry (CNCF, 2019) is the vendor-neutral standard for distributed tracing, metrics, and logs. It defines a common SDK, a wire protocol (OTLP), and a collector that fans out to any backend. For Silent Infinity, OTel provides the escape hatch from AWS vendor lock-in on observability: the same instrumentation code emits to AWS X-Ray today and to Honeycomb, Jaeger, or Grafana Tempo tomorrow by changing the exporter configuration — zero code change.

---

2. Domain Model vs. Infrastructure: The Clean Separation

2.1 What Is Purely Portable Today

The following modules in Silent Infinity are already vendor-agnostic. They contain business logic that belongs at the center of Cockburn's hexagon and must be preserved intact across all deployment targets:

system_prompt.py — loads prompt from a Markdown file, injects session context, constructs the full system message. Zero AWS dependencies. Portable as-is.
guardrails.py — topic filtering, safe-messaging protocol compliance, output sanitization. Pure Python string and regex logic. Portable as-is.
crisis_archive.py — crisis-pattern matching, severity-level computation (0–4), safe-exit phrase detection. Pure Python. Portable as-is.
pricing.py — per-token cost computation, session cost accumulation, model-rate lookup. Pure Python. Portable as-is.
feedback_monitor.py — rating collection, sentiment heuristics, longitudinal engagement scoring. Pure Python. Portable as-is.
Response formatting — message shape, markdown-to-speech stripping, ghost-character injection. Pure Python. Portable as-is.
Conversation memory shape — the in-memory list of {role, content} dicts passed to the LLM. Pure Python. Portable as-is.

2.2 What Is AWS-Bound Today (Requires Adapter Pattern)

The following are concrete AWS service calls that violate the hexagonal architecture boundary and must be wrapped behind interfaces:

DynamoDB → ConversationStore interface

Stores conversation history, session state, and turn events. Adapter targets: PostgreSQL (via psycopg3), SQLite (for local dev), Cloudflare KV (for edge-native deployment). The interface exposes: get_session(session_id), put_turn(turn_record), get_history(session_id, limit), delete_session(session_id).

Bedrock → LLMProvider interface

Invokes foundation models for response generation. Adapter targets: Anthropic direct API (anthropic Python SDK), OpenAI API, Ollama (local models), vLLM (self-hosted inference). The interface exposes: complete(messages, model_id, system_prompt, max_tokens, stream) → async generator of text chunks.

Polly → TTSProvider interface

Synthesizes speech from text. Adapter targets: ElevenLabs API, OpenAI TTS API, Kokoro (local open-weights model), Coqui TTS (self-hosted). The interface exposes: synthesize(text, voice_id, speed, format) → bytes (MP3 or Opus).

Transcribe → STTProvider interface

Transcribes audio to text. Adapter targets: OpenAI Whisper (API or local), Deepgram API, Groq Whisper (fast inference). The interface exposes: transcribe(audio_bytes, language, format) → TranscriptionResult(text, confidence, duration_ms).

Lambda → HTTPServer interface

Receives HTTP events and dispatches to handlers. Adapter targets: FastAPI (ASGI server for Docker/VPS), Cloudflare Workers (edge-native via Pyodide or Rust port), Fly.io (single binary). The interface exposes: register_route(method, path, handler), start(port).

CloudFront → CDN interface

Serves static assets and caches API responses. Adapter targets: Cloudflare (via Workers + Cache API), Fastly (via VCL configuration), Bunny CDN. The interface exposes: invalidate(paths), get_signed_url(key, ttl).

S3 → ObjectStore interface

Stores audio files, prompt Markdown files, and analytics exports. Adapter targets: Cloudflare R2 (S3-compatible API, zero egress fees), Backblaze B2 (S3-compatible), MinIO (self-hosted). The interface exposes: put(key, data, content_type), get(key), delete(key), presigned_url(key, ttl).

2.3 The Dependency Inversion In Practice

The refactoring rule is mechanical: search silent_infinity/ for any import boto3, import botocore, or direct instantiation of boto3.client(...) outside of adapters/. Each occurrence is a dependency-rule violation. The fix is: extract the call into a method on the relevant adapter, replace the call site with a call to the injected interface, and register the adapter in the dependency injection container at the composition root (Lambda handler or FastAPI startup).

---

3. Variant Registry — The Complete Design

3.1 Motivation

Every dimension of Silent Infinity that influences user experience — which LLM model responds, which system prompt is active, which TTS voice speaks, how the compose box is positioned, how many starter topics appear, what the pricing tier looks like — is currently hardcoded as an env var or a Python constant. This means that comparing two versions of any dimension requires a full code deploy, a traffic split at the infrastructure layer (e.g., weighted Lambda aliases or CloudFront origin groups), and a manual join of CloudWatch logs to correlate outcomes with the variant. It is fragile, slow, and analytically weak.

The variant registry is a single source of truth for every flip-able choice in the product. It is a variants.py Python module (loaded at Lambda cold-start, cached in-process) backed by an innerverse-variants DynamoDB table (authoritative, writable at runtime via the admin dashboard API). Variants are assigned per-user-session at session initialization and recorded on every subsequent turn event, enabling clean per-variant cohort analysis with no post-hoc reconstruction.

3.2 Variant Schema


@dataclass
class Variant:
    id: str                          # "A", "B", "C", ... or "prompt-v3-sha-abc123"
    category: VariantCategory        # Enum — see 3.3
    description: str                 # Human-readable: "Claude Haiku 4.5 — cost reduction test"
    status: VariantStatus            # Enum: experimental / staged / canary / production / retired
    rollout_percent: int             # 0–100; traffic fraction assigned this variant
    target_cohort: TargetCohort      # Enum: all / new_users / returning / paid / free
    config: dict                     # Variant-specific config blob — see 3.4
    created_at: datetime
    created_by: str                  # "admin:harnoor" or "system:rollout-automation"
    last_modified: datetime
    parent_variant_id: str | None    # For branching experiments off a prior variant
    pass_criteria: dict              # SLO thresholds that must hold for promotion

3.3 Variant Categories


class VariantCategory(str, Enum):
    LLM_MODEL             = "LLM_MODEL"
    SYSTEM_PROMPT         = "SYSTEM_PROMPT"
    CHAT_UI_LAYOUT        = "CHAT_UI_LAYOUT"
    COMPOSE_POSITION      = "COMPOSE_POSITION"
    GHOST_CHAR_STYLE      = "GHOST_CHAR_STYLE"
    RATING_VARIANT        = "RATING_VARIANT"
    VOICE_PROVIDER        = "VOICE_PROVIDER"
    VOICE_LLM             = "VOICE_LLM"
    STARTER_POOL_SIZE     = "STARTER_POOL_SIZE"
    TOPIC_HIERARCHY       = "TOPIC_HIERARCHY"
    MANDALA_FACE          = "MANDALA_FACE"
    PRICING_TIER_STRUCTURE = "PRICING_TIER_STRUCTURE"

Known variants per category (initial registry population):

| Category | Variants |

|---|---|

| LLM_MODEL | claude-sonnet-4-6 (production), claude-haiku-4-5 (staged), claude-opus-4-7 (experimental), llama-3-70b (experimental), mistral-large (experimental) |

| SYSTEM_PROMPT | v1-sha-baseline (retired), v2-sha-current (production), v3-sha-empathy-rewrite (staged) |

| CHAT_UI_LAYOUT | current (production), simplified (staged), sidebar (experimental), mobile-only (experimental) |

| COMPOSE_POSITION | bottom-pill (production), top-fixed (staged), inline (experimental) |

| GHOST_CHAR_STYLE | scramble (production), dots (staged), shimmer (experimental), none (experimental) |

| RATING_VARIANT | 40 variants registered (production pool, random selection) |

| VOICE_PROVIDER | polly-generative-ruth (production), polly-neural-ruth (staged), elevenlabs (experimental), kokoro (experimental) |

| VOICE_LLM | sonnet (production), haiku (staged), opus (experimental) |

| STARTER_POOL_SIZE | 7 (production), 5 (staged), 10 (experimental) |

| TOPIC_HIERARCHY | flat (production), drill-down (staged), tabs (experimental) |

| MANDALA_FACE | on (production), simple-orb (staged), none (experimental) |

| PRICING_TIER_STRUCTURE | v1 (production), v2-claude-aligned (staged) |

3.4 Config Blob Per Category

The config dict is category-specific and schema-validated at write time:


# LLM_MODEL config
{"model_id": "us.anthropic.claude-sonnet-4-6-20251101-v1:0",
 "max_tokens": 1024,
 "temperature": 0.7,
 "cache_system_prompt": True}

# SYSTEM_PROMPT config
{"prompt_sha": "abc123def456",
 "s3_key": "prompts/system_v3.md",
 "version_label": "v3-empathy-rewrite"}

# VOICE_PROVIDER config
{"provider": "polly",
 "voice_id": "Ruth",
 "engine": "generative",
 "output_format": "mp3",
 "sample_rate": "24000"}

3.5 Assignment Algorithm

At session initialization, the variant assignment engine runs once and caches the assignment in the session record. The algorithm:


def assign_variants(user_id: str, session_id: str, user_profile: UserProfile) -> VariantAssignment:
    assignments = {}
    for category in VariantCategory:
        eligible = [v for v in registry.active_variants(category)
                    if v.status != VariantStatus.RETIRED
                    and cohort_matches(v.target_cohort, user_profile)]
        # Deterministic hash-based assignment: same user always gets same variant
        # for the same registry state. Use session_id for session-level randomization.
        bucket = hash(f"{session_id}:{category}") % 100
        chosen = select_by_rollout(eligible, bucket)
        assignments[category] = chosen.id
    return VariantAssignment(session_id=session_id, assignments=assignments)

Critical invariant: The crisis path (crisis_flag_level >= 2) always uses production-default variants, regardless of active experiments. Crisis safety must never be A/B tested.

3.6 Staged Rollout Flow


experimental (1%)  →  staged (5%)  →  canary (10–25%)  →  production (50–100%)  →  retired (0%)
      ↑                                                              ↑
  internal team only                                        automated rollout
  manual promotion                                          + approval gate

Each promotion gate requires:

p95 latency not regressed vs. production default (threshold: +10% max)
Crisis detection rate not regressed (threshold: 0% tolerance — any regression blocks promotion)
User satisfaction score not regressed (threshold: -0.05 on 1.0 scale)
Cost per turn not exceeded by more than 20% without explicit sign-off

Demotion is automatic: if a canary variant triggers a crisis-detection regression, it is immediately demoted to experimental status and an alert fires to the admin Slack channel.

3.7 DynamoDB Table: `innerverse-variants`


PK: VARIANT#{category}
SK: {variant_id}
Attributes: all Variant fields, serialized as DynamoDB AttributeMap
GSI: status-index (PK: status, SK: created_at) — for "show all active experiments" queries
TTL: none — variants are never auto-deleted; status="retired" is the tombstone

---

4. Admin Dashboard Specification

4.1 Route and Auth

The admin dashboard is served at /admin/variants by the existing CloudFront distribution, routing to a dedicated Lambda function (or FastAPI router in Docker). Authentication is enforced by Amazon Cognito: users must belong to the admin group in the innerverse-user-pool, with MFA required for the admin group. The Cognito authorizer is attached to the API Gateway route; the Lambda never processes a request without a valid, MFA-verified JWT.

Every mutation (config edit, rollout slider change, promotion, demotion, rollback) is written to an innerverse-audit-log DynamoDB table with: timestamp, actor (Cognito sub), action, variant_id, before_state, after_state. The audit log is append-only; no mutation can delete or modify an audit record.

4.2 Dashboard Views

List View (/admin/variants):

A paginated table, grouped by category. Each row: variant_id, description, status (color-coded badge), rollout_percent (progress bar), p95_latency_7d (sparkline), cost_per_turn_7d (sparkline), satisfaction_delta_7d (sparkline vs. production default), return_rate_7d (sparkline). Columns are sortable. Quick-action buttons: Promote, Demote, Pause (set rollout to 0 without retiring), Edit.

Detail View (/admin/variants/{category}/{variant_id}):

Full config blob rendered as editable JSON with schema validation on submit
Rollout slider: 0–100, with a confirmation dialog for any change > 25 percentage points
Cohort picker: multi-select from [all, new_users, returning, paid, free]
Live diff panel: side-by-side comparison of this variant's config vs. the current production-default for the same category. Diff is syntax-highlighted.
Metric panel: 7-day time series for p95_latency_ms, cost_usd_per_turn, satisfaction_score, return_rate, crisis_flag_rate — all filtered to sessions assigned this variant
Promote/Demote buttons: advance or retreat one stage in the rollout pipeline
Rollback button: one click sets rollout_percent to 0, demotes to experimental, and creates an audit record with reason="manual_rollback"
Audit log: paginated list of all past changes to this variant, with actor, timestamp, and diff

Compare View (/admin/variants/compare?a={id}&b={id}):

Side-by-side metric comparison of two variants over a user-specified date range. Statistical significance indicator (two-proportion z-test for rates, Welch's t-test for latency means). A "promote winner" button is enabled when the test reaches p < 0.05 and sample size > 500 sessions per variant.

4.3 Frontend Implementation

The dashboard is a React SPA (or Next.js page added to the existing frontend) with the following stack: React Query for data fetching (5-minute cache with manual invalidation on mutation), Recharts for sparklines and time series, Tailwind CSS for layout (reusing the existing design system), React JSON Schema Form for config blob editing with live validation. The dashboard is built as a separate route (/admin/*) that is code-split from the main app bundle and only loaded for authenticated admin users.

4.4 Backend API Endpoints


GET    /admin/api/variants                     → list all variants (paginated, filterable)
GET    /admin/api/variants/{category}/{id}     → single variant detail + 7d metrics
PUT    /admin/api/variants/{category}/{id}     → update config or rollout_percent
POST   /admin/api/variants/{category}          → create new variant
POST   /admin/api/variants/{category}/{id}/promote  → advance status one stage
POST   /admin/api/variants/{category}/{id}/demote   → retreat status one stage
POST   /admin/api/variants/{category}/{id}/rollback → set rollout=0, status=experimental
GET    /admin/api/variants/compare?a=X&b=Y    → side-by-side metrics comparison
GET    /admin/api/audit-log?variant_id=X      → audit trail for a variant

All endpoints return JSON; all mutations require a reason string in the request body (persisted to audit log). Idempotency keys are enforced on mutations to prevent double-writes.

---

5. Telemetry Schema — Every Data Point Captured

5.1 Per-Turn Event Record

Every conversation turn — text or voice — appends one record to the innerverse-turn-events DynamoDB table. The record is immutable. The schema:


@dataclass
class TurnEvent:
    # Identity
    turn_id: str                        # UUID v7 (time-sortable)
    user_id_hash: str                   # SHA-256 of Cognito sub — never raw PII
    conversation_id: str                # UUID v4
    session_id: str                     # UUID v4 (groups turns within one browser session)
    timestamp: datetime                 # ISO 8601, UTC

    # Variant snapshot (what was active for this turn)
    active_variants: dict[str, str]     # {category → variant_id} — full snapshot

    # Latency breakdown (milliseconds)
    latency_ms: LatencyBreakdown        # see below

    # Token accounting
    tokens_input: int
    tokens_output: int
    tokens_cache_read: int
    tokens_cache_write: int

    # Cost (computed by pricing.py)
    cost_usd: Decimal

    # Model and prompt identity
    model_id: str                       # full model ARN or API model name
    prompt_sha: str                     # SHA-256 of system prompt content at turn time
    voice_id: str | None                # if voice turn

    # Page context
    page_url: str                       # path only, no query params (privacy)
    referrer_class: str                 # "direct" / "organic" / "paid" / "internal"
    device_class: str                   # "mobile" / "tablet" / "desktop"
    browser: str                        # "safari" / "chrome" / "firefox" / "other"
    region: str                         # AWS region or ISO country code

    # Safety signals
    crisis_flag_level: int              # 0–4 (0=none, 4=immediate risk)
    guardrail_triggered: bool
    guardrail_rule_id: str | None

    # User feedback (if collected on this turn)
    rating: float | None                # 1.0–5.0
    rating_variant_id: str | None       # which rating UI variant was shown
    feedback_text: str | None           # free-text (if submitted)

    # Error capture
    error_class: str | None             # exception class name
    error_message: str | None           # sanitized message (no PII)
    retry_count: int                    # number of retries before success or failure

@dataclass
class LatencyBreakdown:
    total_ms: int
    stt_ms: int | None          # None for text turns
    llm_ttft_ms: int            # time to first token from LLM
    llm_complete_ms: int        # time to last token from LLM
    tts_ms: int | None          # None for text turns
    network_ms: int             # client-estimated round-trip (sent from frontend)
    guardrail_ms: int
    db_read_ms: int
    db_write_ms: int

5.2 Analytics Pipeline: Bronze → Silver → Gold


DynamoDB innerverse-turn-events
        │
        ▼  (DynamoDB Streams, real-time, ~200ms lag)
Kinesis Firehose
        │
        ▼  (Parquet, partitioned by date/hour)
S3 Bronze Tier  (raw, append-only, no schema enforcement)
        │
        ▼  (AWS Glue ETL job, hourly)
S3 Silver Tier  (validated, deduplicated, Parquet columnar)
        │
        ▼  (Glue Catalog + Athena)
Gold Views      (pre-aggregated per-variant metrics, materialized daily)
        │
        ▼
Admin Dashboard API reads Gold views for sparklines
Athena ad-hoc queries read Silver tier for deep analysis

Athena query example — p95 latency per LLM_MODEL variant over 7 days:


SELECT
    active_variants['LLM_MODEL'] AS model_variant,
    APPROX_PERCENTILE(latency_ms.total_ms, 0.95) AS p95_latency,
    COUNT(*) AS turn_count,
    AVG(cost_usd) AS avg_cost_usd
FROM silver.turn_events
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
  AND crisis_flag_level = 0  -- exclude crisis turns from A/B analysis
GROUP BY 1
ORDER BY 2;

5.3 X-Ray Distributed Tracing

AWS X-Ray traces every Lambda invocation end-to-end. The trace spans are:

turn.guardrails — guardrail evaluation
turn.llm — Bedrock invocation (sub-spans: llm.ttft, llm.stream_complete)
turn.stt — Transcribe call (voice turns only)
turn.tts — Polly synthesis (voice turns only)
turn.db_read — DynamoDB history fetch
turn.db_write — DynamoDB turn event write
turn.variant_assignment — variant registry lookup

X-Ray annotations (indexed, filterable in console):


xray_recorder.put_annotation("variant_llm_model", assignment["LLM_MODEL"])
xray_recorder.put_annotation("variant_system_prompt", assignment["SYSTEM_PROMPT"])
xray_recorder.put_annotation("crisis_flag_level", str(crisis_level))
xray_recorder.put_annotation("model_id", model_id)

This allows queries like "show me all traces where variant_llm_model = haiku-staged AND crisis_flag_level >= 2" directly in the X-Ray console, enabling rapid debugging of variant regressions.

5.4 OpenTelemetry Integration

The OTel SDK is initialized at Lambda cold-start. Exporters are configured via env var TRACE_EXPORTER:

xray → AWS X-Ray exporter (current)
otlp → OTLP gRPC exporter (for Jaeger, Tempo, Honeycomb)
stdout → JSON to stdout (for local dev)

The instrumentation code is identical regardless of exporter. When Silent Infinity migrates off AWS, changing TRACE_EXPORTER=otlp and setting OTEL_EXPORTER_OTLP_ENDPOINT=https://tempo.internal:4317 completes the observability migration with zero code change.

5.5 Step Functions — Targeted Use

Step Functions are deliberately scoped to two workflows where the overhead is justified:

1. Nightly Analytics Rollup — orchestrates: Glue ETL (Silver refresh) → Athena Gold view refresh → SNS alert if any variant SLO breached. A state machine with retry logic and DLQ for failed steps.

2. Variant Promotion Workflow — orchestrates: collect 72-hour sample → run statistical significance test → if p < 0.05 and SLOs pass, request human approval via SNS email → wait for approval token → execute promotion (update DDB variant record) → notify Slack. The approval gate is a Step Functions waitForTaskToken pattern. This workflow is NOT in the per-turn path; it runs asynchronously on a schedule.

Per-turn flow remains Lambda-direct. Step Functions overhead (~100ms state transition) would add unacceptable latency if inserted into the hot path.

---

6. Portability Plan

6.1 Layer 1 — Configuration as Environment Variables

All deployment-variant configuration is expressed as environment variables. The complete set:


# Infrastructure layer
DATABASE_URL=dynamodb://us-east-1/innerverse-sessions      # or postgres://user:pass@host/db
LLM_PROVIDER=bedrock                                        # bedrock / anthropic / openai / ollama
LLM_MODEL_ID=us.anthropic.claude-sonnet-4-6-20251101-v1:0  # provider-specific model ID
STT_PROVIDER=transcribe                                     # transcribe / whisper / deepgram / groq
TTS_PROVIDER=polly                                          # polly / elevenlabs / kokoro / openai
STORAGE_BACKEND=s3                                          # s3 / r2 / b2 / local
CDN_BACKEND=cloudfront                                      # cloudfront / cloudflare / fastly / bunny
TRACE_EXPORTER=xray                                         # xray / otlp / stdout

# Provider-specific credentials (all via env, never hardcoded)
AWS_REGION=us-east-1
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
ELEVENLABS_API_KEY=...
DEEPGRAM_API_KEY=...

# Variant registry
VARIANT_REGISTRY_TABLE=innerverse-variants
VARIANT_CACHE_TTL_SECONDS=60

# Feature flags (deploy-time)
FEATURE_VOICE_ENABLED=true
FEATURE_MANDALA_ENABLED=true
FEATURE_PRICING_V2=false

6.2 Layer 2 — Adapter Module Structure


silent_infinity/
├── domain/                         # Pure business logic — zero AWS imports
│   ├── entities.py                 # Dataclasses: Turn, Session, CrisisEvent, VariantAssignment
│   ├── use_cases/
│   │   ├── process_turn.py
│   │   ├── evaluate_guardrails.py
│   │   ├── assign_variant.py
│   │   └── record_telemetry.py
│   └── interfaces/                 # Abstract base classes (the "ports")
│       ├── llm_provider.py         # LLMProvider ABC
│       ├── stt_provider.py         # STTProvider ABC
│       ├── tts_provider.py         # TTSProvider ABC
│       ├── conversation_store.py   # ConversationStore ABC
│       ├── object_store.py         # ObjectStore ABC
│       └── trace_exporter.py      # TraceExporter ABC
│
├── adapters/                       # Concrete implementations of interfaces
│   ├── llm/
│   │   ├── bedrock.py              # BedrockLLMProvider(LLMProvider)
│   │   ├── anthropic_direct.py     # AnthropicLLMProvider(LLMProvider)
│   │   ├── openai.py               # OpenAILLMProvider(LLMProvider)
│   │   └── ollama.py               # OllamaLLMProvider(LLMProvider)
│   ├── stt/
│   │   ├── transcribe.py           # TranscribeSTTProvider(STTProvider)
│   │   ├── whisper.py              # WhisperSTTProvider(STTProvider) — local or API
│   │   └── deepgram.py             # DeepgramSTTProvider(STTProvider)
│   ├── tts/
│   │   ├── polly.py                # PollyTTSProvider(TTSProvider)
│   │   ├── elevenlabs.py           # ElevenLabsTTSProvider(TTSProvider)
│   │   └── kokoro.py               # KokoroTTSProvider(TTSProvider) — local
│   ├── storage/
│   │   ├── s3.py                   # S3ObjectStore(ObjectStore)
│   │   ├── r2.py                   # R2ObjectStore(ObjectStore) — S3-compatible
│   │   └── local_fs.py             # LocalFSObjectStore(ObjectStore)
│   ├── db/
│   │   ├── dynamodb.py             # DynamoDBConversationStore(ConversationStore)
│   │   ├── postgres.py             # PostgresConversationStore(ConversationStore)
│   │   └── sqlite.py               # SQLiteConversationStore(ConversationStore)
│   └── http_server/
│       ├── lambda_handler.py       # AWS Lambda entry point
│       ├── fastapi_app.py          # FastAPI ASGI app (Docker/VPS)
│       └── cloudflare_worker.py    # Cloudflare Workers adapter (future)
│
├── modules/                        # Existing portable modules (unchanged)
│   ├── system_prompt.py
│   ├── guardrails.py
│   ├── crisis_archive.py
│   ├── pricing.py
│   ├── feedback_monitor.py
│   └── variants.py                 # NEW — variant registry + assignment engine
│
└── composition_root.py             # Reads env vars, instantiates adapters, wires DI

6.3 The Composition Root Pattern

The composition root is the single location where concrete adapter implementations are selected based on environment variables and injected into use cases. It runs once at cold-start (Lambda) or at application startup (FastAPI):


# composition_root.py
def build_container() -> Container:
    llm = {
        "bedrock": BedrockLLMProvider,
        "anthropic": AnthropicLLMProvider,
        "openai": OpenAILLMProvider,
        "ollama": OllamaLLMProvider,
    }[os.environ["LLM_PROVIDER"]]()

    db = {
        "dynamodb://": DynamoDBConversationStore,
        "postgres://": PostgresConversationStore,
        "sqlite://":   SQLiteConversationStore,
    }[url_scheme(os.environ["DATABASE_URL"])](os.environ["DATABASE_URL"])

    # ... same pattern for STT, TTS, ObjectStore, TraceExporter

    return Container(llm=llm, db=db, stt=stt, tts=tts, storage=storage, tracer=tracer)

This is the only file that imports both domain/ and adapters/. Every other file in the project imports either domain interfaces (for domain code) or adapters (for adapter code) — never both.

6.4 Layer 3 — Deployment Targets

aws-lambda (current): adapters/http_server/lambda_handler.py is the entry point. No changes needed. Cold-start time target: < 800ms.

docker-compose (local dev and VPS): adapters/http_server/fastapi_app.py exposes the same endpoints. docker-compose.yml sets all env vars and mounts a local SQLite DB and local filesystem for object storage. A developer can run the full system locally with docker-compose up in under two minutes, with no AWS credentials required.


# docker-compose.yml (abridged)
services:
  api:
    build: .
    environment:
      LLM_PROVIDER: ollama
      DATABASE_URL: sqlite:///./dev.db
      STT_PROVIDER: whisper
      TTS_PROVIDER: kokoro
      STORAGE_BACKEND: local
      TRACE_EXPORTER: stdout
    ports:
      - "8000:8000"
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_data:/root/.ollama

kubernetes helm chart: A Helm chart wraps the Docker image with a Deployment, HPA, ConfigMap (env vars), and ExternalSecret (pulling API keys from AWS Secrets Manager or Vault). Supports both AWS EKS and bare-metal clusters.

fly.io: Single binary deployment using the FastAPI adapter. fly.toml sets env vars; Fly's persistent volumes replace S3 for small deployments. Fly's global anycast edge reduces latency without CloudFront.

cloudflare-workers (eventual): Requires either a Rust port of the hot path (recommended for latency-critical voice turns) or a Python-to-WASM compilation via Pyodide. The adapter interface is pre-designed to support this target; the adapter implementation is deferred.

6.5 Seven-Day Migration Playbook (Off AWS)

This is the operational runbook for migrating Silent Infinity off AWS in the event of a vendor decision, cost optimization, or compliance requirement.

Day 1 — Data Replication:

Enable DynamoDB Streams on innerverse-sessions and innerverse-turn-events
Deploy a stream consumer Lambda that writes every event to a Postgres instance (e.g., Supabase, Neon, or self-hosted)
Validate replication lag < 5 seconds, row counts match
Set DATABASE_URL env var to Postgres URL in a shadow Lambda (not yet serving traffic)
DynamoDB remains the primary write target; Postgres is read-only backup

Day 2 — CDN and Static Assets:

Deploy the frontend build to Cloudflare Pages (or R2 + Workers)
Configure Cloudflare DNS to serve static assets from Cloudflare instead of CloudFront
API Gateway origin remains AWS Lambda; only static assets switch
Validate zero broken assets via automated link checker

Day 3 — LLM Provider Swap:

Deploy the AnthropicLLMProvider adapter
Set LLM_PROVIDER=anthropic in a canary Lambda alias (10% traffic)
Monitor p95 latency, cost, satisfaction metrics against Bedrock baseline for 24 hours
If metrics stable: flip LLM_PROVIDER=anthropic on 100% of traffic
Bedrock fallback remains available via env var flip

Day 5 — API Server Migration:

Deploy Docker image to fly.io (or target VPS/K8s)
Run behind Cloudflare Tunnel (no public IP required)
Use Cloudflare Workers as reverse proxy: route 10% of API traffic to Fly.io, 90% to Lambda
Monitor error rates and latency across both paths

Day 7 — Postgres Primary:

Flip DATABASE_URL to Postgres URL on all traffic
DynamoDB becomes read-only backup; writes stop
Monitor for 24 hours; validate no data loss

Day 30 — AWS Footprint Reduction:

Lambda: retired (Fly.io serves 100% of traffic)
DynamoDB: retained as read-only backup (low cost; high confidence)
Bedrock: retired
CloudFront: retired (Cloudflare serving everything)
API Gateway: retired
Remaining AWS bill: ~$20/month for S3 backup storage + DynamoDB read-only + Route53

---

7. Research Foundation

This architecture synthesizes eleven bodies of prior work, each contributing a load-bearing principle.

Wiggins (2012) — "The Twelve-Factor App" establishes the operational contract for cloud-native software: config from environment, stateless processes, disposability, dev/prod parity. Silent Infinity's portability layer (Section 6) is a direct implementation of Factors III, VI, IX, X, and XI. Available at https://12factor.net.

Cockburn (2005) — "Hexagonal Architecture" provides the structural principle for the adapter layer: domain logic is shielded from all external dependencies by explicit interface ports. The domain/interfaces/ module hierarchy in Section 6 is Cockburn's hexagon concretized.

Martin (2012) — "Clean Architecture" formalizes the dependency rule and the layer hierarchy (entities → use cases → interface adapters → frameworks). The composition_root.py pattern (Section 6.3) is Martin's composition root — the single permitted location for breaking the dependency rule.

Hodgson/Fowler (2017) — "Feature Toggles" on martinfowler.com provides the taxonomy of toggle types, the warning about toggle debt, and the recommended management strategies. The five-experiment cap and retired status in the variant schema (Section 3) directly address Hodgson's toggle debt risk.

Gang of Four (1994) — "Design Patterns" — specifically the Strategy and Adapter patterns (Chapter 4) — provides the object-oriented formalization of the interchangeable-provider design. The LLMProvider ABC and its concrete implementations are Strategy + Adapter in GoF terms.

Fowler (2005) — "Event Sourcing" — the innerverse-turn-events table is an event store in Fowler's sense: immutable, append-only, replayable. The Bronze → Silver analytics pipeline (Section 5.2) is a projection derived from event replay.

Young (2010) — "CQRS Documents" — the separation of the DynamoDB write path from the Athena/Glue read path (Section 5.2) is CQRS applied to analytics.

CNCF (2019) — "OpenTelemetry Specification" — the OTel SDK integration (Section 5.4) implements the CNCF specification, ensuring vendor-neutral telemetry portability.

Burns (2016) — "Designing Distributed Systems" — the sidecar and adapter patterns described by Burns directly inform the OTel collector architecture and the Cloudflare Workers proxy pattern in the migration playbook.

Kleppmann (2017) — "Designing Data-Intensive Applications" — the Bronze/Silver/Gold pipeline architecture mirrors Kleppmann's batch processing and stream processing patterns from Part III of DDIA. The Kinesis Firehose → S3 → Glue → Athena pipeline is a lambda architecture implementation.

Nygard (2007) — "Release It!" — the automatic demotion trigger on crisis regression (Section 3.6) is a circuit-breaker pattern in Nygard's sense: the system self-heals by cutting off a failing variant before it affects the full user population.

---

8. Cost of the Full Build

8.1 Engineering Time Estimates

| Component | Estimate | Notes |

|---|---|---|

| Variant registry (variants.py + DDB table + CRUD API) | 3 days | Schema + assignment algorithm + CRUD |

| Admin dashboard (React + API) | 4 days | List, detail, compare views + auth |

| Adapter interfaces (all 7 ABCs) | 1 day | Python ABCs + type signatures |

| Refactor bedrock_client.py to BedrockLLMProvider | 1 day | Existing code, clean wrapper |

| Refactor voice.py to PollyTTSProvider + TranscribeSTTProvider | 1 day | Existing code |

| AnthropicLLMProvider adapter (proof of concept) | 2 days | First non-AWS LLM adapter |

| OpenAILLMProvider adapter | 2 days | |

| OllamaLLMProvider adapter | 2 days | |

| PostgresConversationStore adapter | 3 days | Schema design + migration tooling |

| SQLiteConversationStore adapter | 1 day | Local dev only |

| ElevenLabsTTSProvider adapter | 2 days | |

| KokoroTTSProvider adapter | 2 days | Local model integration |

| WhisperSTTProvider adapter | 2 days | API + local model variants |

| DeepgramSTTProvider adapter | 2 days | |

| OpenTelemetry SDK + X-Ray exporter | 2 days | |

| OTel OTLP exporter | 1 day | |

| DynamoDB Streams → Kinesis → S3 → Glue pipeline | 2 days | |

| Docker + FastAPI deployment target | 1 day | |

| docker-compose local dev environment | 1 day | |

| Kubernetes Helm chart | 3 days | |

| Fly.io deployment target | 2 days | |

| Cloudflare Workers adapter (future) | 5 days | Defer |

| Step Functions: nightly rollup | 1 day | |

| Step Functions: variant promotion workflow | 2 days | |

| Total (excluding Cloudflare Workers) | ~42 days = ~8.5 engineer-weeks | |

| Priority subset (Sections 9 Week 1+2) | ~10 days = 2 engineer-weeks | |

8.2 Infrastructure Cost Impact

Current AWS monthly spend (estimated baseline):

Lambda: ~$15/month (assuming 100K turns/month)
Bedrock: variable (~$0.003–0.015/turn depending on model)
DynamoDB: ~$10/month
CloudFront: ~$5/month
API Gateway: ~$5/month
Total infrastructure (excluding Bedrock): ~$35/month

Additional costs from this architecture:

Kinesis Firehose: ~$2/month (low volume)
S3 Bronze/Silver/Gold: ~$3/month
Glue ETL: ~$1/month (2 DPU-hours/day)
Step Functions: ~$0.50/month (low state transitions)
X-Ray: free tier covers ~100K traces/month; ~$5/month beyond
Total additional: ~$12/month

Cost savings from portability (if migrated to Fly.io + Anthropic direct):

Estimated Fly.io cost for equivalent compute: ~$10/month
Anthropic direct API: comparable to Bedrock (often 10–15% cheaper at volume)
CloudFront eliminated: ~$5/month saving
API Gateway eliminated: ~$5/month saving
Net saving on migration: ~$15–20/month at current volume; ~$100–200/month at 1M turns/month

8.3 Ongoing Maintenance Cost

Each adapter requires approximately 0.5 days/month of maintenance (API version updates, authentication changes, model deprecations). At the full 12-adapter build:

Adapter maintenance: ~6 days/month → not recommended; maintain only active adapters
Recommended: maintain 2–3 LLM adapters, 1–2 TTS adapters, 1–2 STT adapters, 2 DB adapters
Realistic ongoing cost: ~2 days/month for the active adapter set

---

9. What Ships First — Concrete Two-Week Plan

Week 1: Variant Registry + Admin Dashboard Foundation

Day 1–2: variants.py + DynamoDB table

Define VariantCategory, VariantStatus, TargetCohort enums
Implement Variant dataclass and JSON serialization
Create innerverse-variants DynamoDB table with GSI
Implement VariantRegistry class: get_variant(), list_active(), assign_variants()
Populate initial registry with all known production variants (12 categories, ~50 total variants)
Unit tests: assignment determinism, cohort filtering, crisis override invariant

Day 3: CRUD API

Lambda function for /admin/api/variants/** endpoints
Cognito authorizer (admin group) wired to API Gateway
Audit log write on every mutation

Day 4–5: Tag existing turn events with active_variants

Add active_variants field to the EMF log structure in the Lambda handler
Assign variants at session initialization; cache in DDB session record
Pass variant assignment through to every turn event emission
Backfill test: verify new turns appear in CloudWatch with active_variants field

Day 6–7 (weekend stretch goal): Admin dashboard v0

React page at /admin/variants — list view only
Recharts sparklines pulling from CloudWatch Metrics Insights (pre-Athena)
Promote/demote buttons wired to CRUD API

Week 2: Adapter Layer + OpenTelemetry

Day 8–9: Adapter interfaces + LLM refactor

Define all 7 domain/interfaces/*.py ABCs with full type signatures
Refactor bedrock_client.py → adapters/llm/bedrock.py implementing LLMProvider
Implement composition_root.py with LLM_PROVIDER env var selection
Smoke test: Lambda with LLM_PROVIDER=bedrock behaves identically to current

Day 10–11: Anthropic direct adapter (proof of concept)

Implement adapters/llm/anthropic_direct.py using anthropic Python SDK
Test with LLM_PROVIDER=anthropic locally (docker-compose + SQLite)
Validate response parity: same input → comparable output quality
Document any API behavioral differences (streaming format, error codes)

Day 12: Voice adapter refactor

Refactor voice.py → adapters/tts/polly.py + adapters/stt/transcribe.py
Implement TTSProvider and STTProvider ABCs
Composition root wires TTS_PROVIDER and STT_PROVIDER env vars

Day 13–14: OpenTelemetry SDK

Install opentelemetry-sdk, opentelemetry-exporter-otlp, aws-opentelemetry-distro
Instrument process_turn.py use case with OTel spans
Add X-Ray annotations for variant IDs (Section 5.3)
Configure TRACE_EXPORTER env var selection in composition root
Validate traces appear in X-Ray console with variant annotations

Deferred to Later Sprints

Full Docker/K8s deployment targets (Sprint 3–4)
PostgreSQL ConversationStore adapter (Sprint 3)
Full Kinesis Firehose → S3 → Glue → Athena pipeline (Sprint 3)
Admin dashboard compare view and statistical significance tests (Sprint 3)
ElevenLabs, Kokoro, Deepgram adapters (Sprint 4)
Cloudflare Workers deployment target (Sprint 6+)
Step Functions variant promotion workflow (Sprint 4)

---

10. Risks and Mitigations

10.1 Over-Engineering Risk

Risk: Building the full adapter set, Kubernetes Helm chart, and Cloudflare Workers support before they are needed creates a large maintenance surface with no immediate return.

Mitigation: Strictly implement only adapters that enable a capability we need today or within the next 90 days. The priority order: (1) Anthropic direct (LLM cost leverage), (2) PostgreSQL (Postgres is cheaper than DynamoDB at scale and enables richer queries), (3) FastAPI/Docker (local dev parity), (4) everything else. Kubernetes, Cloudflare Workers, and Fly.io are documented but not built until a concrete migration decision is made.

10.2 Variant Explosion

Risk: With 12 variant categories and dozens of variants per category, the combinatorial space of simultaneous experiments grows exponentially. Analyzing a session with 12 active variants simultaneously is statistically intractable (insufficient sample size per cell).

Mitigation: Hard cap of 5 simultaneous non-production experiments (enforced by the admin dashboard: the "create variant" button is disabled when 5 experiments are active). Experiments are sequential, not simultaneous, wherever possible. Categories are grouped: model and prompt experiments run together (they interact); UI experiments run in isolation.

10.3 Crisis Path Contamination

Risk: An experimental variant (e.g., a simplified UI layout) could inadvertently affect the crisis detection flow, causing a safety regression in production.

Mitigation: The variant assignment engine has a hard override: any turn where crisis_flag_level >= 2 immediately switches to production-default variants for all categories. This override is unit-tested and integration-tested in every deploy. Crisis detection regression is a P0 incident trigger regardless of variant status.

10.4 Analytics Cardinality Explosion

Risk: With 12 categories and 50+ variants, the number of unique active_variants combinations in the turn-events table could reach thousands, making Athena queries expensive and partition pruning ineffective.

Mitigation: The analytics pipeline groups turn events by single-dimension variant analysis (one category at a time), not full cross-product. The Gold aggregation layer pre-computes per-category-per-variant metrics, not per-combination metrics. Athena query cost is bounded by Silver tier Parquet compression and partition pruning by date. Estimated cost: < $5/month for 1M turns/month.

10.5 Admin Dashboard Security

Risk: The admin dashboard exposes production variant controls and audit logs. A compromised admin account could roll back safety-critical variants or expose user behavior data.

Mitigation: Cognito admin group with MFA enforced at the user pool level (cannot be bypassed). Every mutation requires a human-readable reason string (audit trail). Rollout changes > 25 percentage points require a confirmation dialog. The audit log is append-only (no admin can delete audit records). Critical variants (crisis-related prompt, guardrails) have an additional confirmation step with a 5-minute cooldown before taking effect.

10.6 Adapter Behavioral Divergence

Risk: The AnthropicLLMProvider and BedrockLLMProvider adapters, while implementing the same interface, may exhibit subtle behavioral differences (streaming format differences, error code differences, token counting differences) that cause silent failures in the domain layer.

Mitigation: The LLMProvider interface includes a health_check() method and a validate_response() method. Integration tests run against each adapter with a standardized test suite of prompts and validate that responses meet the same behavioral contract. CI/CD runs this test suite against both adapters on every push. Any behavioral divergence fails the build.

---

11. References and Prior Art

11.1 Feature Flag Infrastructure

LaunchDarkly is the commercial gold standard for feature-flag infrastructure. Its architecture (flag rules engine, targeting by user attributes, real-time streaming of flag updates via Server-Sent Events, and an audit log) directly informed the variant registry design in Section 3. The LaunchDarkly engineering blog (2020–2024) documents their approach to flag targeting, gradual rollouts, and experiment analysis at scale.

Unleash is the leading open-source feature-flag server (Go backend, React admin UI). It implements the OpenFeature standard and supports the same lifecycle (variants, gradual rollouts, activation strategies by cohort). Silent Infinity's variant registry is a purpose-built subset of Unleash's model, specialized for the product's categories. Deploying Unleash OSS as a backend for the variant registry is a viable alternative to the bespoke DynamoDB implementation — trade-off: operational overhead vs. richer UI and SDK ecosystem.

OpenFeature (CNCF, 2022) is a vendor-neutral standard for feature-flag evaluation, analogous to OpenTelemetry for observability. The FeatureFlagProvider interface in Silent Infinity's adapter layer is designed to be compatible with the OpenFeature provider spec, enabling a drop-in swap from the bespoke variant registry to an OpenFeature-compatible backend (LaunchDarkly, Unleash, Flagsmith) if needed.

11.2 Experiment Framework Prior Art

Netflix A/B Testing at Scale — Netflix's experiment platform (documented via engineering blog, 2016–2022) handles thousands of simultaneous experiments across hundreds of millions of users. Key lessons applied here: (1) deterministic hash-based assignment (same user always in same bucket for same experiment), (2) network effects awareness (users who share households may influence each other), (3) statistical significance tooling built into the admin dashboard. Netflix's Raven framework is the closest architectural analogue to the admin dashboard + analytics pipeline described in Section 4 and Section 5.

Stripe's Experiment Framework — documented via Stripe Engineering Blog (2020), Stripe's experiment framework emphasizes clean separation between experiment assignment (at request time, deterministic) and experiment analysis (async, in a data warehouse). The active_variants field on the TurnEvent record follows Stripe's pattern of snapshotting the full experiment state at the time of the event, enabling retrospective analysis without requiring joins against a separate assignment log.

Google's Overlapping Experiment Infrastructure — Kohavi et al. (2013) "Online Controlled Experiments at Large Scale" documents Google's approach to running overlapping experiments across multiple dimensions simultaneously, using orthogonal layers to avoid interaction effects. The variant category system in Section 3 implements a simplified version of Google's layer model: each category is an independent layer, and experiments within a layer are mutually exclusive.

11.3 Observability Prior Art

Honeycomb.io's "Observability-Driven Development" (Majors, Fong-Jones, Miranda, 2022) advocates for high-cardinality event-based observability as opposed to pre-aggregated metrics. The TurnEvent schema in Section 5.1 implements this philosophy: every turn is a rich, high-cardinality event with all context attached, enabling arbitrary slicing and dicing in Athena without pre-defining metrics in advance.

AWS X-Ray and OpenTelemetry Integration — AWS's own documentation (2023) recommends using the AWS Distro for OpenTelemetry (ADOT) as the preferred way to instrument Lambda functions, enabling simultaneous export to X-Ray and any OTel-compatible backend. Section 5.4 follows this recommendation.

---

Appendix A: Interface Definitions (Reference)


# domain/interfaces/llm_provider.py
from abc import ABC, abstractmethod
from typing import AsyncIterator

class LLMProvider(ABC):
    @abstractmethod
    async def complete(
        self,
        messages: list[dict],
        model_id: str,
        system_prompt: str,
        max_tokens: int,
        temperature: float,
        stream: bool = True,
    ) -> AsyncIterator[str]: ...

    @abstractmethod
    async def health_check(self) -> bool: ...

    @abstractmethod
    def token_count(self, text: str) -> int: ...


# domain/interfaces/conversation_store.py
from abc import ABC, abstractmethod

class ConversationStore(ABC):
    @abstractmethod
    async def get_history(self, session_id: str, limit: int) -> list[dict]: ...

    @abstractmethod
    async def put_turn(self, turn_event: "TurnEvent") -> None: ...

    @abstractmethod
    async def get_session(self, session_id: str) -> dict | None: ...

    @abstractmethod
    async def delete_session(self, session_id: str) -> None: ...


# domain/interfaces/tts_provider.py
from abc import ABC, abstractmethod

class TTSProvider(ABC):
    @abstractmethod
    async def synthesize(
        self,
        text: str,
        voice_id: str,
        speed: float,
        output_format: str,
    ) -> bytes: ...


# domain/interfaces/stt_provider.py
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class TranscriptionResult:
    text: str
    confidence: float
    duration_ms: int

class STTProvider(ABC):
    @abstractmethod
    async def transcribe(
        self,
        audio_bytes: bytes,
        language: str,
        format: str,
    ) -> TranscriptionResult: ...

---

Appendix B: Variant Registry DynamoDB Schema (CloudFormation)


VariantsTable:
  Type: AWS::DynamoDB::Table
  Properties:
    TableName: innerverse-variants
    BillingMode: PAY_PER_REQUEST
    AttributeDefinitions:
      - AttributeName: pk
        AttributeType: S
      - AttributeName: sk
        AttributeType: S
      - AttributeName: status
        AttributeType: S
      - AttributeName: created_at
        AttributeType: S
    KeySchema:
      - AttributeName: pk
        KeyType: HASH
      - AttributeName: sk
        KeyType: RANGE
    GlobalSecondaryIndexes:
      - IndexName: status-index
        KeySchema:
          - AttributeName: status
            KeyType: HASH
          - AttributeName: created_at
            KeyType: RANGE
        Projection:
          ProjectionType: ALL
    PointInTimeRecoverySpecification:
      PointInTimeRecoveryEnabled: true
    SSESpecification:
      SSEEnabled: true

---

End of Document

Silent Infinity — Modularity, Portability, and Variant Architecture — v1.0 — 2026-04-21

Prepared by SCOUT / TITAN Research Arm

Word count: ~7,200 words