Silent Infinity Engineering Runbook

This document is the operational ground truth for the Silent Infinity system. Open it when you are deploying, rolling back, diagnosing latency, investigating cost anomalies, or responding to a crisis flag storm. It is terse by design. It describes what to do and why. For the actual commands, open the git repository.

---

1. System Overview

Architecture in Prose

A user opens silentinfinity.com in their browser. The browser sends all traffic — HTML, JavaScript, API requests — through a single Amazon CloudFront distribution (ID: E2M8T6S9SM3OQY). CloudFront is the front door. It handles TLS termination, global edge caching for static assets, and routing for dynamic API calls.

Static assets (HTML, CSS, JavaScript, audio) are served from an S3 origin bucket behind CloudFront. These are cached aggressively and should constitute the majority of all edge hits. Dynamic API requests — those hitting the /invoke path — are forwarded by CloudFront to API Gateway, which proxies them to the Lambda function named innerverse-mirror.

innerverse-mirror is the monolithic application handler. It holds all routing logic, guardrail evaluation, conversation history assembly, Bedrock invocation, observation logging, and response streaming. When a user turn arrives, the handler: (1) loads conversation history from DynamoDB, (2) assembles the full prompt including system prompt from system_v1.md, (3) evaluates incoming text against guardrails, (4) invokes Amazon Bedrock with the assembled payload, (5) streams the model response back through API Gateway to the browser, and (6) asynchronously writes the updated conversation turn to DynamoDB.

Bedrock is the inference layer. The default model is us.anthropic.claude-sonnet-4-6. Bedrock receives the assembled prompt and returns a streaming response. Prompt caching is enabled; the system prompt and early conversation turns are candidates for cache read if the payload matches a prior cached prefix.

Three DynamoDB tables back the system. innerverse-conversations stores per-turn conversation history keyed by uid and session. innerverse-users stores user profile and preference data. innerverse-observations stores Chat Sentinel observation records — one record per flagged or notable turn. All three tables use PAY_PER_REQUEST billing mode.

Concern Map

Handler routing lives inside innerverse-mirror. Guardrail logic lives in guardrails.py and evaluates on every inbound turn before any Bedrock call. Crisis patterns live in crisis-patterns-v1.json. The system prompt lives in system_v1.md. The crisis archive — historical flagged conversations retained for clinical review — is written into innerverse-observations with a crisis=true attribute.

Region and Domain

Everything runs in us-east-1. There is no multi-region setup today. Justification is required before expanding. The production domain is silentinfinity.com. There is no staging domain today; that is a tracked TODO. All destructive changes should be validated against a local Lambda test invocation before pushing to production.

---

2. Deploy Procedure

Who Does This

Any engineer with valid AWS credentials scoped to the silentinfinity deployment IAM role. Today that is Harnoor. Future engineers require IAM onboarding before any deploy access is granted.

Two Deploy Paths

There are two distinct deploy paths depending on what changed. The first path handles infrastructure changes — anything that touches the CDK stack definition, IAM roles, Lambda configuration, DynamoDB table definitions, CloudFront behaviors, or any resource outside the Lambda code package itself. The second path handles Lambda-only code changes — changes to Python source files that do not alter the CDK stack.

Always determine which path applies before starting. Using the wrong path wastes time or misses infrastructure changes.

Path 1: Infrastructure Deploy via CDK

This path takes 8–15 minutes end-to-end. It is slower but comprehensive.

1. Confirm your AWS credentials are active and scoped to the correct account. Run a dry-run diff using the CDK diff command to see exactly which of the 15 resources will change. Read the diff carefully before proceeding.

2. If the diff contains anything unexpected — a resource deletion you did not intend, a policy change you did not write — stop. Investigate the discrepancy in the CDK source before continuing.

3. If the diff looks correct, run the CDK deploy command. CDK will synthesize the CloudFormation template, upload the Lambda asset package, and execute a CloudFormation change set.

4. Watch the CloudFormation console or the CDK output for the stack status. A successful deploy ends with UPDATE_COMPLETE. Any other terminal status is a failure — do not proceed.

5. After a successful stack deploy, confirm the Lambda function version was updated in the Lambda console. Note the new version number.

6. Run the 16-test smoke suite against the live endpoint to confirm functionality. Do not close this deploy as done until at least 12 of 16 pass.

Path 2: Lambda-Only Code Deploy

This path takes approximately 90 seconds end-to-end. It is faster but only valid when no CDK stack resources changed.

1. Confirm your code changes are limited to Python source files inside the Lambda package directory. If any cdk.json, stack definition, or infrastructure file changed, switch to Path 1.

2. In your local environment, rebuild the Lambda deployment ZIP. The ZIP must include all dependencies from the requirements layer. Use the PowerShell rebuild script in the repo root — it handles dependency bundling and file inclusion.

3. Validate the ZIP was created successfully and its file size is within 5% of the previous version. A dramatic size change indicates a missing or extra dependency.

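4. Push the new ZIP to Lambda using the AWS CLI update-function-code command, targeting the innerverse-mirror function in us-east-1. The command returns as soon as the update is accepted; it does not poll. Run the AWS CLI wait function-updated command and continue only after it returns, which means the function's LastUpdateStatus is Successful. If production traffic reaches the function through the LIVE alias, also publish the new code as a version (the --publish flag on update-function-code does this) and point LIVE at it. Updating $LATEST alone does not change what the alias serves.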

5. Immediately run a CloudFront invalidation against the /* path on distribution E2M8T6S9SM3OQY. This clears any cached API responses or stale asset references at the edge. Note that CloudFront invalidation takes up to 10 minutes to fully propagate globally. Do not expect immediate effect at all edge locations.

6. During that propagation window, do a local hard-refresh test against the production URL using a browser that has no local cache for the domain (incognito works). Confirm the response reflects the new code behavior.

7. Run the 16-test smoke suite. A Lambda-only deploy is done when at least 12 of 16 pass.
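The size check in step 3 above can be scripted. Below is a minimal sketch. It assumes you keep a copy of the previously deployed ZIP on disk. The function names and paths are hypothetical and are not part of the repo's PowerShell rebuild script.

```python
import os


def size_within_tolerance(new_size: int, old_size: int, tolerance: float = 0.05) -> bool:
    """True when new_size differs from old_size by at most `tolerance` (a fraction)."""
    if old_size <= 0:
        raise ValueError("previous ZIP size must be positive")
    return abs(new_size - old_size) / old_size <= tolerance


def zip_size_within_tolerance(new_zip: str, previous_zip: str, tolerance: float = 0.05) -> bool:
    """Compare two ZIP files on disk using the 5% rule from step 3."""
    return size_within_tolerance(
        os.path.getsize(new_zip), os.path.getsize(previous_zip), tolerance
    )
```

Example call with hypothetical paths: zip_size_within_tolerance("build/innerverse-mirror.zip", "build/previous.zip"). A False result means a dependency is probably missing or was added by accident.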

Git Discipline

All changes — code and infra — must be committed to git before deploying. A deploy from an uncommitted working tree is prohibited. The commit SHA is the source of truth for what is running. When a deploy succeeds, tag the commit with the date and a short descriptor.

Rollback

Lambda keeps every published function version until it is explicitly deleted. It does not prune old versions on its own, so retain at least the last five deployed versions for rollback. The alias named LIVE points at the currently running version. To roll back: open the Lambda console, navigate to the LIVE alias, and point it to the previous version number. This takes under 2 minutes and does not require CDK or a new ZIP. After re-pointing the alias, run the smoke suite to confirm the rollback is functioning. CloudFront does not need to be invalidated for a Lambda rollback unless the static assets also changed in the failed deploy.
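The console steps above can also be scripted with boto3. This sketch makes two assumptions:

- Production traffic goes through the LIVE alias.
- "Roll back" means pointing LIVE at the newest published version older than the one it currently serves.

The helper names are hypothetical. The Lambda client is passed in, so the logic can be exercised without AWS credentials.

```python
def previous_version(published: list[str], current: str) -> str:
    """Return the newest numbered version older than `current`. Version numbers may have gaps."""
    if not current.isdigit():
        raise ValueError(f"LIVE points at {current!r}, not a published version")
    older = sorted(int(v) for v in published if v.isdigit() and int(v) < int(current))
    if not older:
        raise LookupError(f"no published version older than {current}")
    return str(older[-1])


def roll_back_live(lambda_client, function_name: str = "innerverse-mirror", alias: str = "LIVE") -> str:
    """Re-point the alias to the previous published version and return that version."""
    current = lambda_client.get_alias(FunctionName=function_name, Name=alias)["FunctionVersion"]
    # ListVersionsByFunction is paginated; walk every page so recent versions are not missed.
    paginator = lambda_client.get_paginator("list_versions_by_function")
    published = [
        v["Version"]
        for page in paginator.paginate(FunctionName=function_name)
        for v in page["Versions"]
    ]
    target = previous_version(published, current)
    lambda_client.update_alias(FunctionName=function_name, Name=alias, FunctionVersion=target)
    return target
```

Example usage: roll_back_live(boto3.client("lambda", region_name="us-east-1")). Then run the smoke suite as usual.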

---

3. Daily Health Checks

Run these checks every morning before doing any other work on the system. They take under 10 minutes.

Lambda Error Rate

Open CloudWatch and navigate to the innerverse-mirror Lambda function metrics. Look at the Errors and Invocations metrics over the last 24 hours. Errors is a count, so compute the error rate as Errors divided by Invocations for each period. An error rate above 1% sustained over any 5-minute window is the threshold for investigation. Normal is under 0.1%. If errors are elevated, check the Lambda logs for the error message before taking any other action. Common causes include malformed DynamoDB responses, Bedrock throttling, and handler exceptions from unexpected input shapes.
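Because Errors is a count, the 1% threshold has to be applied to Errors divided by Invocations for each window. This minimal sketch assumes you have already pulled both metrics as Sum statistics at a 5-minute period. You can get them with the CloudWatch get_metric_data API, or in the console with a metric math expression such as 100 * errors / invocations. The function name is hypothetical.

```python
def breached_windows(errors: list[float], invocations: list[float], threshold_pct: float = 1.0) -> list[int]:
    """Return indices of 5-minute windows whose error rate exceeds threshold_pct.

    `errors` and `invocations` are per-window Sum values in the same time order.
    Windows with zero invocations are skipped rather than treated as 0% or 100%.
    """
    if len(errors) != len(invocations):
        raise ValueError("errors and invocations must cover the same windows")
    return [
        i
        for i, (err, inv) in enumerate(zip(errors, invocations))
        if inv > 0 and 100.0 * err / inv > threshold_pct
    ]
```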

Future state: once automated paging exists (see Section 12), a CloudWatch alarm on this metric will page the on-call engineer when the threshold is crossed. Today, the check is manual.

Bedrock Throttling (429 Errors)

Bedrock throttling appears in Lambda logs as ThrottlingException errors (HTTP 429) from the Bedrock runtime endpoint. Normal count is zero. More than 5 throttle events per minute warrants investigation. At low traffic volumes this almost always means a runaway session repeatedly invoking the model, or a load test someone forgot to turn off. Identify the uid in the logs and suspend the session if necessary. Then check the Bedrock quotas for us-east-1 in the Service Quotas console.
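Here is a sketch of the per-minute count. It assumes the handler logs one JSON object per line with timestamp (ISO 8601), uid, and error_code fields. That log shape is an assumption, so adapt the field names to what innerverse-mirror actually emits. The same aggregation can also be written as a CloudWatch Logs Insights query.

```python
import json
from collections import Counter


def throttle_hotspots(log_lines, per_minute_limit: int = 5, top_n: int = 3):
    """Return (minutes over the limit, top uids by throttle count).

    Assumes each line is a JSON object with hypothetical fields
    "timestamp" (ISO 8601), "uid", and "error_code".
    """
    per_minute: Counter = Counter()
    per_uid: Counter = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as START/END/REPORT
        if not isinstance(record, dict) or record.get("error_code") != "ThrottlingException":
            continue
        minute = record["timestamp"][:16]  # "YYYY-MM-DDTHH:MM"
        per_minute[minute] += 1
        per_uid[record.get("uid", "unknown")] += 1
    hot_minutes = sorted(m for m, n in per_minute.items() if n > per_minute_limit)
    return hot_minutes, per_uid.most_common(top_n)
```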

DynamoDB Throttling

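All three tables use PAY_PER_REQUEST billing. Throttling is rare under this billing mode, but it can still happen in two cases:

- A single partition key receives more traffic than one partition can serve (a hot partition).
- Traffic more than doubles the table's previous peak within 30 minutes.

Check the ThrottledRequests, ReadThrottleEvents, and WriteThrottleEvents metrics for each of the three tables. Also look for ProvisionedThroughputExceededException errors in the Lambda logs. Any throttling here is unusual enough to warrant immediate investigation. The most likely cause is a scan-without-index operation hitting a large table, so check for any new code paths that read DynamoDB without a key-based lookup.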

CloudFront 5xx Rate

In CloudFront metrics, look at the 5xxErrorRate for distribution E2M8T6S9SM3OQY over the last 24 hours. A 5xx rate above 2% sustained over 5 minutes is the threshold. Isolated spikes (under 5 events) are normal during Lambda cold starts after low-traffic periods. A sustained elevation means the Lambda function is unhealthy or the API Gateway integration has broken.

Synthetic Canary

A synthetic canary should hit silentinfinity.com every 5 minutes, check for an HTTP 200 response, and verify that expected HTML markers are present in the response body. Today this canary is not yet wired up in CloudWatch Synthetics — it is a tracked next-sprint item. Until it is live, the manual morning check substitutes: load the production URL in a clean browser session and confirm the UI loads and an initial model response is returned within 10 seconds.

---

4. Diagnosing Latency

Latency Budget

End-to-end latency from user keystroke to first audio word has the following major segments: browser processing and WebSocket send (typically under 20ms), CloudFront edge routing (typically 5–40ms depending on POP), API Gateway overhead (typically 5–10ms), Lambda initialization (0ms if warm, up to 800ms if cold due to SnapStart restore), Bedrock first-token latency (typically 400–900ms for Sonnet 4.6 at p50), and streaming transmission of tokens back to the browser. Total p50 is typically 1.5–2.5 seconds to first audio word. p95 is typically 3.5–5 seconds. p99 exceeds 6 seconds only when Bedrock throttling or cold starts coincide.

Reading EMF Logs

Every Lambda invocation emits structured CloudWatch EMF metrics including per-turn latency broken into segments: guardrail_latency_ms, bedrock_first_token_ms, history_load_ms, and total_handler_ms. When a user reports a slow turn, pull the specific request ID from the Lambda logs, find the associated EMF record, and read those fields directly. This tells you exactly which segment was slow without guessing.
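For reference, this is roughly what such a record looks like. The sketch follows the CloudWatch Embedded Metric Format structure: an _aws metadata block plus top-level values. The SilentInfinity namespace and the FunctionName dimension are illustrative assumptions, not taken from the handler. The slowest-segment helper is also hypothetical.

```python
import time

SEGMENTS = ("guardrail_latency_ms", "bedrock_first_token_ms", "history_load_ms")
FIELDS = SEGMENTS + ("total_handler_ms",)


def emf_record(values: dict, request_id: str, namespace: str = "SilentInfinity") -> dict:
    """Build one EMF record; Lambda publishes the metrics when it is printed as a single JSON line."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [["FunctionName"]],
                    "Metrics": [{"Name": f, "Unit": "Milliseconds"} for f in FIELDS],
                }
            ],
        },
        "FunctionName": "innerverse-mirror",
        "request_id": request_id,  # non-metric property: searchable in Logs Insights
        **{f: values[f] for f in FIELDS},
    }


def slowest_segment(record: dict) -> str:
    """Name of the segment that consumed the most time in this turn."""
    return max(SEGMENTS, key=lambda f: record[f])
```

When a user reports a slow turn, load that request's record and call slowest_segment on it. The answer tells you which of the subsections below applies.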

Cold Start Diagnosis

Lambda SnapStart was originally limited to Java, but it now also supports Python 3.12 and later runtimes. innerverse-mirror runs on Python, so check the function configuration to confirm whether SnapStart is enabled. Without SnapStart, Python cold starts are mitigated by keeping the function warm and by minimizing the import surface. A cold start shows up when TTFB is elevated and the Lambda REPORT log line for that request includes an Init Duration field (or a Restore Duration field when SnapStart is enabled). The remediation options are:

- Enable Lambda provisioned concurrency for the function (adds cost).
- Reduce the package size to speed the init phase.
- Accept it as a low-frequency event.

Do not over-engineer cold start mitigation unless p99 TTFB is consistently above 8 seconds under normal traffic.
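Cold starts can be picked out of the Lambda REPORT log lines. This sketch assumes the standard REPORT format, where an Init Duration field appears only on invocations that ran an INIT phase. SnapStart functions report Restore Duration instead. The helper name is hypothetical.

```python
import re

# Matches "Init Duration: 300.12 ms" or "Restore Duration: 521.70 ms",
# but not the "Billed Restore Duration" field that SnapStart also emits.
_COLD_START = re.compile(r"(?<!Billed )(Init|Restore) Duration: ([\d.]+) ms")


def cold_start(report_line: str):
    """Return ("Init" | "Restore", milliseconds) for a cold start, or None for a warm invocation."""
    match = _COLD_START.search(report_line)
    if match is None:
        return None
    return match.group(1), float(match.group(2))
```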

Bedrock Throttling as Latency Source

When Bedrock returns a 429, the handler retries with exponential backoff. Each retry adds 1–4 seconds to the visible latency. If you see elevated latency that correlates with 429 events in the logs, the fix is to address the throttling cause, not the latency itself. See Bedrock Throttling in Section 3 for throttling investigation.
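Here is a sketch of retry with capped exponential backoff and full jitter. It makes two assumptions:

- The Bedrock call is wrapped in a zero-argument callable.
- Throttles surface as a botocore ClientError with error code ThrottlingException.

The helper names are hypothetical, and the handler's actual retry policy may differ. The sleep function can be swapped out so Drill 3 runs without real waits.

```python
import random
import time


def is_throttle(exc: Exception) -> bool:
    """botocore ClientError exposes the service error code at exc.response["Error"]["Code"]."""
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") == "ThrottlingException"


def call_with_backoff(fn, max_attempts: int = 4, base_delay: float = 1.0,
                      max_delay: float = 4.0, sleep=time.sleep):
    """Call fn(), retrying only on throttles; re-raise anything else or the final throttle."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_throttle(exc) or attempt == max_attempts - 1:
                raise
            # The cap grows 1s, 2s, 4s; full jitter spreads out retries from concurrent sessions.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

botocore also has built-in retry modes (standard and adaptive), configured through botocore.config.Config(retries=...). If those are enabled on the Bedrock client, application-level retries like these stack on top of them. Count both layers when reasoning about worst-case latency.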

Prompt Cache Miss

Bedrock returns a cache_read_input_tokens field in the usage dictionary when a cache hit occurred. If this value is zero for a long conversation, the prompt cache missed. On a cache miss the full token payload is priced and processed from scratch, which adds both latency and cost. Common causes of cache misses:

- The prompt prefix up to the cache checkpoint changed. For example, history trimming dropped early turns, or per-request content was inserted ahead of the checkpoint.
- The system prompt was recently updated and the new SHA has not yet built a warm cache.
- The cache TTL (5 minutes on Bedrock, reset on each cache hit) expired during a low-traffic window.

Cache misses are not errors. They are expected on first invocations and after breaks in traffic.
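This sketch classifies one response's usage block. It assumes the Anthropic Messages response body returned by InvokeModel, whose usage carries input_tokens, cache_read_input_tokens, and cache_creation_input_tokens. The Converse API reports the equivalents as cacheReadInputTokens and cacheWriteInputTokens. The function names are hypothetical.

```python
def cache_status(usage: dict) -> str:
    """Classify prompt-cache behaviour for one Bedrock response."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    if read > 0:
        return "hit"
    if written > 0:
        return "miss: prefix written to cache"  # expected on a first turn or after TTL expiry
    return "miss: no cache activity"  # check whether a cache checkpoint is set at all


def cached_fraction(usage: dict) -> float:
    """Share of input tokens served from cache (input_tokens excludes cached tokens)."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    total = read + written + (usage.get("input_tokens") or 0)
    return read / total if total else 0.0
```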

Large History Payload

Conversations that have run for many turns accumulate large history payloads. Once history exceeds approximately 80,000 tokens, the handler trims it. Even before that limit, large payloads add transmission overhead and Bedrock processing time. If a specific user's turns are consistently slow and their session is long, check the estimated token count of their conversation history in DynamoDB. The per-turn EMF history_tokens metric reflects this count.
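Here is a sketch of oldest-first trimming against the 80,000-token budget. It makes two assumptions, and the handler's actual trimming logic is not shown here:

- History is a list of {"role", "content"} turns with string content.
- A rough four-characters-per-token estimate stands in for a real tokenizer.

Dropping early turns changes the prompt prefix, so the turn after a trim will miss the prompt cache.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token. Not a tokenizer."""
    return max(1, len(text) // 4)


def trim_history(turns: list[dict], budget_tokens: int = 80_000) -> list[dict]:
    """Drop oldest turns until the estimate fits the budget, keeping at least the newest turn.

    Leading assistant turns are also dropped so the trimmed history opens with a user turn.
    """
    kept = list(turns)
    total = sum(estimate_tokens(t["content"]) for t in kept)
    while len(kept) > 1 and total > budget_tokens:
        total -= estimate_tokens(kept.pop(0)["content"])
    while len(kept) > 1 and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```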

When to Route to Haiku

Haiku is appropriate for turns that are clearly low-stakes: brief acknowledgments, factual retrievals, menu navigation, and administrative commands. It is not appropriate for emotionally sensitive turns, crisis-adjacent content, or any turn where the system prompt's nuanced framing matters. If latency is causing user experience degradation on a specific turn class, routing those turns to Haiku in the handler's model selection logic is acceptable. Document the routing rule and its rationale in the handler comments. Do not route to Haiku silently without a logged reason.
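Here is a sketch of an explicit routing rule that logs its reason. It assumes an upstream step has already labelled the turn with an intent, a crisis flag, and an emotional-sensitivity flag. The intent names, function name, and logger name are hypothetical. Model IDs are passed in rather than hard-coded, so they stay in configuration.

```python
import logging

logger = logging.getLogger("innerverse-mirror.routing")

# Hypothetical intent labels for the turn classes this runbook treats as low-stakes.
LOW_STAKES_INTENTS = frozenset({"acknowledgment", "factual_lookup", "menu_navigation", "admin_command"})


def select_model(intent: str, crisis_flagged: bool, emotionally_sensitive: bool,
                 default_model: str, haiku_model: str) -> str:
    """Return the model ID for this turn; every Haiku route is logged with its reason."""
    if crisis_flagged or emotionally_sensitive:
        return default_model  # never downgrade sensitive or crisis-adjacent turns
    if intent in LOW_STAKES_INTENTS:
        logger.info("model_route=haiku reason=low_stakes_intent intent=%s", intent)
        return haiku_model
    return default_model
```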

X-Ray Tracing

AWS X-Ray is available for distributed latency tracing across Lambda, API Gateway, and DynamoDB. Enable X-Ray active tracing on the Lambda function and API Gateway stage when investigating a latency regression. The X-Ray service map will show you the relative time each downstream call consumed. Disable active tracing after investigation — it adds overhead and cost at scale.

---

5. Diagnosing Cost Anomalies

Weekly SAGE Cost Report

Every week, run the SAGE cost analysis: pull the AWS Cost Explorer breakdown for the silentinfinity cost allocation tag, segment by service (Bedrock, Lambda, DynamoDB, CloudFront, API Gateway, S3). Compare to the prior week. A week-over-week increase under 15% is expected during growth. An increase over 30% warrants investigation. The per-turn EMF cost estimate fields (estimated_input_cost_usd, estimated_output_cost_usd) allow you to correlate Cost Explorer totals with specific turn volumes.
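Here is a sketch of the week-over-week comparison. It assumes per-service weekly totals have already been exported from Cost Explorer into plain dictionaries. The runbook does not define an action for increases between 15% and 30%. As an assumption, the sketch labels that band "watch".

```python
def week_over_week(prev: dict[str, float], curr: dict[str, float],
                   expected_pct: float = 15.0, investigate_pct: float = 30.0) -> dict[str, tuple[float, str]]:
    """Percent change per service plus a status label."""
    report = {}
    for service in sorted(set(prev) | set(curr)):
        before, after = prev.get(service, 0.0), curr.get(service, 0.0)
        if before > 0:
            change = 100.0 * (after - before) / before
        else:
            change = float("inf") if after > 0 else 0.0  # new spend with no baseline
        if change > investigate_pct:
            status = "investigate"
        elif change > expected_pct:
            status = "watch"  # assumption: the runbook leaves 15-30% undefined
        else:
            status = "expected"
        report[service] = (round(change, 1), status)
    return report
```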

Runaway Session or Abuse

If a single uid accounts for more than 10x the average per-uid cost in a given week, investigate that session immediately. Pull the turn history from DynamoDB for that uid. Common causes: a user running extremely long sessions repeatedly in a single day, an automated script hitting the invoke endpoint, or a bug in the client that loops the same request. If it looks like abuse, the uid can be suspended by adding it to the block list in the handler configuration. If it looks like a bug, reproduce it locally and fix it.
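Here is a sketch of the 10x check. It assumes per-turn records that carry the uid and the two EMF cost fields. One caveat: a single uid's cost can only exceed 10x the mean when more than ten uids are active that week. The reason is that one uid's cost is at most the total, while the mean is the total divided by the number of uids. With small user counts, comparing against the median is a reasonable adjustment, and the baseline parameter allows it.

```python
from collections import defaultdict
from statistics import mean, median


def costly_uids(turns, multiple: float = 10.0, baseline=mean) -> list[str]:
    """Return uids whose weekly cost exceeds `multiple` times the baseline per-uid cost."""
    per_uid: dict[str, float] = defaultdict(float)
    for turn in turns:
        per_uid[turn["uid"]] += turn["estimated_input_cost_usd"] + turn["estimated_output_cost_usd"]
    if not per_uid:
        return []
    reference = baseline(list(per_uid.values()))
    return sorted(uid for uid, cost in per_uid.items() if cost > multiple * reference)
```

For a small cohort, call costly_uids(turns, baseline=median).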

Accidental Opus Routing

If Bedrock costs spike sharply and turn volume has not increased proportionately, check for accidental routing to a more expensive model. The most likely cause is a stale or overridden BEDROCK_MODEL_ID environment variable pointing at Opus instead of Sonnet, or a code change that hard-coded a model ID. Check the Lambda environment variable in the console. Pull the last five deploys' commit diffs and search for any model-ID string changes. If Opus is running where Sonnet should be, update the environment variable immediately; this does not require a code redeploy. Published Lambda versions freeze their configuration, including environment variables. If the LIVE alias points at a published version, the corrected value takes effect only after you publish a new version and point LIVE at it.

DynamoDB Cost Spike

DynamoDB cost under PAY_PER_REQUEST should be nearly linear with turn volume. A spike that outpaces turn volume suggests one of three causes: a table scan that is reading every item (usually caused by a missing index on a new query pattern), a hot partition key (for example, a uid of test that receives synthetic traffic and accumulates millions of items), or an accidental recursive write loop. Check the DynamoDB CloudWatch metrics for ConsumedWriteCapacityUnits and ConsumedReadCapacityUnits broken down by table. Identify which table and which operation is causing the spike.

CloudFront Cost Spike

CloudFront is billed on data transfer and request count. A spike in CloudFront costs usually has one of two causes: a sudden increase in traffic from a crawler or bot, or a cache TTL set too low on static assets, causing excessive origin fetches. A cache behavior that wrongly caches dynamic API responses does not apply here, because API responses are pass-through. Check the CloudFront access logs bucket. Note that the bucket exists but logging is not yet enabled on the distribution (see Section 10). Until logging is enabled, use CloudFront metrics in the console to identify which path patterns are generating the most requests.

---

6. Diagnosing Crisis Flag Storms

What a Flag Storm Looks Like

A crisis flag storm is when the guard_in.crisis guardrail is firing on a high proportion of turns that do not represent genuine user crisis. Normal false-positive rate should be under 1% of all turns. If you see crisis flags on more than 5% of turns in a given hour, and the flagged turns do not contain crisis-relevant language upon manual inspection, that is a storm.
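Here is a sketch of the hourly rate check. It assumes (timestamp, crisis_flagged) pairs pulled from the turn log, with ISO 8601 timestamps. The function name is hypothetical. An hour that trips this check is only a candidate storm. Reading the flagged turns, the next step, still decides.

```python
from collections import defaultdict


def storm_hours(turns, storm_pct: float = 5.0) -> list[str]:
    """turns: iterable of (iso_timestamp, crisis_flagged). Returns "YYYY-MM-DDTHH" hours above storm_pct."""
    totals: dict[str, int] = defaultdict(int)
    flagged: dict[str, int] = defaultdict(int)
    for timestamp, crisis in turns:
        hour = timestamp[:13]
        totals[hour] += 1
        flagged[hour] += int(bool(crisis))
    return sorted(h for h, n in totals.items() if 100.0 * flagged[h] / n > storm_pct)
```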

First Diagnosis Step

Before assuming a code problem, manually read at least 10 of the flagged turns from innerverse-observations. If the language genuinely could be interpreted as crisis-adjacent by a reasonable clinical reviewer, the guardrail may be working correctly and user communication patterns may have shifted. Do not declare a false-positive storm based on volume alone — read the content.

Tracing the Pattern Match

If the flagged turns do not contain crisis language, open crisis-patterns-v1.json and guardrails.py. Check git log for any recent changes to either file. Compare the current patterns against the flagged turns to identify which specific pattern is matching incorrectly. A regex that is too broad, for example one that matches any sentence containing the word "end", will produce exactly this symptom.
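Here is a sketch for finding which pattern fires on a flagged turn. It makes two assumptions, so adjust to the real schema and matching rules in those files:

- crisis-patterns-v1.json is a JSON list of objects with id and regex fields.
- guardrails.py applies the patterns case-insensitively with re.search.

The two example patterns are illustrative only and are not clinically validated.

```python
import json
import re


def matching_patterns(patterns_json: str, text: str) -> list[str]:
    """Return the ids of every pattern that matches `text` (hypothetical schema)."""
    patterns = json.loads(patterns_json)
    return [p["id"] for p in patterns if re.search(p["regex"], text, re.IGNORECASE)]


# Illustrative only: a bare "end" matches ordinary phrases like "end of the week",
# while word boundaries and a required object narrow the match.
EXAMPLE_PATTERNS = json.dumps([
    {"id": "too_broad", "regex": r"end"},
    {"id": "narrower", "regex": r"\bend (it all|my life)\b"},
])
```

Run each flagged turn through matching_patterns. A pattern id that shows up across turns with no crisis content is the one to refine.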

Chat Sentinel Correlation

Cross-reference the flag timestamps with Chat Sentinel observation records. If Sentinel is also producing observations on these turns with a rationale that does not align with crisis content, the entire guardrail pipeline may have a shared upstream signal error. If Sentinel observations are not firing or are flagging different turns than the crisis guardrail, the issue is isolated to the crisis pattern regex.

Remediation

If the false-positive cause is confirmed, the correct remediation is to roll back the last change to crisis-patterns-v1.json or guardrails.py. Do this immediately — do not attempt to patch the specific bad pattern while leaving other recent changes in place, because you cannot be certain the other recent changes are clean. Roll back the entire change set, restore the baseline, confirm the storm stops, then re-introduce the intended pattern changes one at a time with individual testing.

Never permanently silence a pattern — meaning, never delete a pattern from crisis-patterns-v1.json — without clinical review. The patterns exist because they were validated against real crisis presentations. A pattern that is producing false positives needs to be refined, not removed. Any pattern deletion must be approved by a clinical reviewer before the change is committed.

---

7. Rolling the System Prompt

Versioning Convention

The system prompt lives in system_v1.md. Version is tracked by SHA-256 hash of the file contents, not by a human-assigned version number. The Lambda handler reads the prompt at cold-start initialization and logs the SHA in the startup EMF record. If you need to know what system prompt is currently running in production, pull the SHA from the most recent Lambda startup log and compare it against the git history.

SHA-Bump Procedure

When you have a new version of the system prompt ready for review:

1. Compute the SHA-256 hash of the new system_v1.md file and record it.

2. Update any reference to the prior SHA in the handler configuration or constants file.

3. Commit both the updated prompt file and the SHA reference update in a single atomic commit. The commit message must include the phrase system-prompt-update and a one-line description of what changed and why.

4. Run the offline regression suite against the new prompt locally before deploying. Do not skip this step.
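Steps 1 and 2 above can be checked mechanically. This sketch uses Python's hashlib. It assumes the SHA recorded in the handler's constants is a hex SHA-256 digest of the raw file bytes. The function names are hypothetical. Because the hash covers raw bytes, a line-ending conversion changes the SHA even when the text looks identical. CRLF introduced by a Windows checkout is a typical example.

```python
import hashlib
from pathlib import Path


def prompt_sha256(path: str) -> str:
    """Lowercase hex SHA-256 of the file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def assert_prompt_sha(path: str, expected_sha: str) -> None:
    """Fail loudly if the prompt file and the recorded SHA have drifted apart."""
    actual = prompt_sha256(path)
    if actual != expected_sha.lower():
        raise ValueError(f"{path}: SHA {actual} does not match recorded {expected_sha}")
```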

Feature Readiness Tiers

If the system is at General Availability tier or above, the following requirements apply before any system prompt change reaches production:

First, run the offline regression against the golden conversation set (50 conversations, spanning normal, emotional, and crisis-adjacent content). The regression pass rate must be 100% on safety-critical behaviors and above 90% on character/tone behaviors before proceeding.

Second, run the ECHO 22-turn crisis simulation pass. ECHO exercises a scripted set of 22 turns that probe crisis-adjacent content, identity disclosure requests, boundary-testing, and emotional escalation. All 22 turns must produce behavior within the expected range.

Third, obtain HERALD signoff. HERALD reviews the prompt change for any wording that could affect user safety, therapeutic frame integrity, or legal exposure. HERALD signoff must be a written approval in the change ticket.

Fourth, deploy the new prompt to a 10% shadow environment for 24 hours minimum. Monitor innerverse-observations for any unexpected flag patterns.

Fifth, after shadow observation is clean, proceed to full production deploy.

Before the system reaches GA tier, prompt changes require only the offline regression and a manual review by the developer making the change. Even so, treat safety-critical wording with the same rigor as if GA requirements were in force.

---

8. Rolling the Model

Current State

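The default inference model is us.anthropic.claude-sonnet-4-6. This model ID is stored in the BEDROCK_MODEL_ID Lambda environment variable. The handler reads this variable at invocation time, so you can change the active model without redeploying code by updating the environment variable in the Lambda console. However, published Lambda versions freeze their environment variables. If the LIVE alias points at a published version, publish a new version after the change and point LIVE at it, or the change will not reach production traffic.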

Candidate: Opus 4.7 for Crisis Path

The model designated for potential crisis-path specialization is anthropic.claude-opus-4-7. Opus 4.7 is approximately 20% more expensive than Sonnet 4.6 per token. It is not the default model. It is a candidate for routing on crisis-flagged turns only, where the additional cost may be justified by improved response quality. Do not activate Opus 4.7 routing without completing the Switch Gate process.

The Switch Gate

Any model change, whether to a new version of Sonnet or a switch to Opus, must pass through the three-phase Switch Gate process before the new model serves full production traffic.

Phase 1 is offline evaluation. Run both the current model and the candidate model against the golden conversation set and the ECHO crisis pass. Score each response across the defined evaluation rubrics. The candidate must not score worse than the current model on any safety dimension. It also must not score more than 5% below the current model on any overall quality dimension.
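Here is a sketch of the Phase 1 comparison. It assumes each rubric dimension is scored so that higher is better, and that "5% below" means a relative drop from the current model's score. The dimension names and function name are hypothetical.

```python
def phase1_passes(current: dict[str, float], candidate: dict[str, float],
                  safety_dims: set[str], max_quality_drop_pct: float = 5.0) -> bool:
    """True if the candidate clears the Phase 1 gate (higher scores are better)."""
    for dim, cur in current.items():
        cand = candidate.get(dim)
        if cand is None:
            return False  # an unscored dimension cannot pass the gate
        if dim in safety_dims:
            if cand < cur:
                return False  # any safety regression fails
        elif cur > 0 and 100.0 * (cur - cand) / cur > max_quality_drop_pct:
            return False  # quality dropped more than 5% relative to the current model
    return True
```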

Phase 2 is shadow mode for 72 hours minimum. Run the candidate model in parallel with the current model on real production traffic. The candidate receives the same inputs but its outputs are not returned to the user — they are logged to a separate observation stream. Review the shadow log for any response quality regressions, unexpected refusals, or safety deviations.

Phase 3 is graduated A/B rollout. Move 10% of production traffic to the candidate model. Hold for 24 hours and monitor all health metrics. If metrics are stable, move to 50% and hold for another 24 hours. If metrics are stable, move to 100%.

At each phase gate, if any metric degrades beyond threshold, halt the rollout and return to the current model immediately.
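Here is a sketch of deterministic cohort assignment for the 10%, 50%, and 100% steps. It assumes the uid is the unit of assignment, so each user sees one model consistently. The salt and function name are hypothetical. Hashing into fixed buckets means every uid in the 10% cohort stays in the cohort at 50%, so raising the percentage only adds users.

```python
import hashlib


def in_candidate_cohort(uid: str, percent: int, salt: str = "switch-gate-candidate") -> bool:
    """Deterministically place `percent`% of uids in the candidate cohort."""
    digest = hashlib.sha256(f"{salt}:{uid}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket 0..99 per uid
    return bucket < percent
```

Change the salt for each new candidate model. Otherwise the same users are always the first ones exposed.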

Variant Registry

A planned variants.py module will formalize per-request model routing based on turn characteristics. Until that module is implemented, model routing logic lives inline in the handler. The /invoke endpoint will eventually support a ?v=X query parameter to request a specific registered variant. Do not build ad-hoc routing outside the variant system once it is live.

---

9. Incident Bridge

When the Incident Response Playbook Activates

This runbook covers diagnosis, standard operations, and known procedures. When an incident reaches P0 (total production outage) or P1 (significant degradation affecting more than 20% of users), the Incident Response Playbook takes over from this runbook.

The handoff point is: an Incident Commander has been identified, the communication channel has been opened, and roles (IC, scribe, technical lead, communications) have been assigned. At that point, this runbook moves to reference-only status. The IC drives from the Incident Response Playbook.

What This Runbook Contributes During an Incident

The diagnostic procedures in Sections 4, 5, and 6 of this runbook are valid reference material during an incident. The IC or technical lead may use them to identify the failure source faster. Sections 2 and 8 contain rollback procedures that may be executed under IC direction.

Post-Incident Runbook Update

After every P0 or P1 incident, conduct a post-incident review within 48 hours. If the incident surfaced a diagnostic pattern or failure mode that is not covered in this runbook, add it. Specifically: if you invented a diagnostic step during the incident that turned out to be effective, write it into the appropriate section here before closing the incident ticket. The runbook grows from incidents, not from anticipation alone.

---

10. Known Issues and Workarounds

iOS Safari WebSocket Drops on Screen Lock

When an iOS Safari user locks their device mid-session, the WebSocket connection drops. Upon unlock, the client does not automatically re-establish the connection, and the user sees a stale UI. A partial workaround is in place: the visibilitychange event handler detects when the document returns to the visible state and triggers an audio re-initialization sequence. This does not fully restore a dropped WebSocket in all cases. The correct fix is to implement connection liveness checks with exponential backoff reconnect on the client. This is tracked as a known bug. Until fixed, users can recover by refreshing the page.

Browser Auto-Play Policy

Modern browsers block audio auto-play until the user has made a gesture in the page context. The kickAudio handler addresses this: it attaches to the first user interaction event (click or keydown) and initializes the audio context within that event handler, satisfying the browser's user-gesture requirement. If a user reports that they hear no audio on first load, the most common cause is that the kickAudio handler did not fire before the first model response arrived. The workaround is to instruct the user to click anywhere in the UI before speaking.

DynamoDB Streams Not Yet Enabled

The Bronze tier data pipeline depends on DynamoDB Streams being enabled on innerverse-conversations. Streams are not currently enabled. The Bronze tier data flow is blocked until this is wired up. No data is being lost — conversations are being written to DynamoDB normally — but the downstream analytics and memory consolidation pipeline that reads from Streams is not running. This is a tracked next-sprint item.

CloudFront Access Logs Bucket Exists But Logging Not Enabled

The S3 bucket designated for CloudFront access logs was created and is correctly configured for log delivery. However, the CloudFront distribution E2M8T6S9SM3OQY does not yet have logging enabled in its distribution settings. As a result, no access logs are flowing to S3. This means historical traffic analysis must use CloudFront metrics (which are aggregated and less detailed) rather than raw log files. Enable logging on the distribution in the next infrastructure sprint.

Pack Archive Size

The most recent titan-pack.zip is 9.6 GB, primarily because the ~/.claude/projects directory was included in the archive. The exclusion list in the pack script needs to be tightened to omit that directory and any other large non-essential paths. Until tightened, restore drills using the archive will be slow on limited-bandwidth connections.

---

11. Ops Drills to Run Quarterly

Run these four drills once per quarter, at minimum. Document the results each time. If a drill reveals a gap, fix it and re-run the drill before marking it complete.

Drill 1: Full System Restore from Pack Archive

Take a clean machine — one that has no existing Silent Infinity configuration or credentials on it. Restore the most recent titan-pack.zip to that machine. Reinstall dependencies from the requirements file. Configure AWS credentials from scratch. Run the full 16-test smoke suite against the live production endpoint. The target: all 16 tests pass within two hours of starting from a completely clean state. If you cannot restore and pass the smoke suite within two hours, identify what blocked you and fix the runbook, the pack contents, or the smoke suite setup procedure.

Drill 2: Lambda Cold Start Measurement

Scale the Lambda function's reserved concurrency to zero using the Lambda console. Reserved concurrency of zero throttles every invocation, so the production endpoint is unavailable until you restore it. Schedule this drill for a low-traffic window. Wait 5 minutes for all existing warm instances to expire. Then immediately restore concurrency to its normal setting and send a test request to the production endpoint. Measure the time from request send to first byte of response. This is the worst-case TTFB under cold start conditions. Record the measurement. If it exceeds 5 seconds, investigate the package size and import chain for optimization opportunities.

Drill 3: Bedrock 429 Graceful Degradation

In the handler's routing layer, inject a mock error that returns a 429 response for all Bedrock calls. Do this in a local Lambda test environment, not production. Send a test conversation turn through the local handler. Verify that the handler: (1) retries with backoff rather than immediately returning an error, (2) returns a graceful fallback response to the caller rather than an unhandled exception, (3) logs the 429 event with the correct EMF metric, and (4) does not corrupt the conversation history in DynamoDB. Remove the mock error injection after the drill is complete.

Drill 4: CloudFront Configuration Validation

Simulate a CloudFront behavioral misconfiguration, for example by temporarily setting a cache behavior TTL to an extreme value. Verify that the system detection path (manual check or future synthetic canary) catches it before real users are affected. Then restore the correct configuration, run a CloudFront invalidation, and confirm it clears the misconfigured state. This drill validates your ability to detect and recover from infrastructure configuration drift.

---

12. On-Call Rotation (Future)

Current State

Harnoor is the sole on-call responder today. There is no rotation. All alerts, whether from CloudWatch alarms (when configured), synthetic canary failures (when wired), or user-reported issues, route to Harnoor directly.

Future Rotation Design

When the engineering team reaches three or more engineers, establish a standard three-person rotation. Rotation length should be one week per engineer. No engineer should be on-call for more than two consecutive weeks.

Severity Response Times

These are the target response times by severity classification, effective immediately and binding for any future on-call engineer:

P0, total production outage: acknowledge within 15 minutes, incident declared within 30 minutes. P1, significant degradation: acknowledge within 2 hours. P2, minor degradation or non-critical feature failure: acknowledge within one business day. P3, cosmetic issues or low-priority bugs: acknowledge within one week.

Pager Integration (Future)

When the team is large enough to warrant automated paging, the recommended path is to wire CloudWatch alarms to SNS topics, which trigger either OpsGenie or PagerDuty. The choice between OpsGenie and PagerDuty should be made based on existing team tooling. Do not set up a paging system before there are at least two engineers who can receive and respond to pages. A pager that rings to a single person who is already aware of the incident adds no value and creates alert fatigue.

---

13. References

The following references informed the design of this runbook and should be consulted when making architectural decisions about reliability, operational posture, or incident management:

Google SRE Book — The foundational text for site reliability engineering practice. Particularly relevant: the chapters on Service Level Objectives, error budgets, eliminating toil, and the on-call chapter. Available at sre.google/sre-book.

AWS Well-Architected Framework, Reliability and Operational Excellence Pillars — AWS's own documentation for building systems that are resilient and operationally sound. The Reliability pillar covers failure management, backup, and recovery. The Operational Excellence pillar covers runbook and playbook design, change management, and monitoring. Available at docs.aws.amazon.com/wellarchitected.

OWASP LLM Top 10 — The ten highest-risk vulnerability categories specific to large language model applications. Relevant to Silent Infinity because the system processes unconstrained user input and routes it to a commercial LLM. The top concerns include prompt injection, insecure output handling, and training data poisoning. Guardrail and observability design should be reviewed against this list annually. Available at owasp.org/www-project-top-10-for-large-language-model-applications.

---

End of runbook. Last updated: 2026-04-21. Next scheduled review: 2026-07-21. Owner: Harnoor Singh.