
Silent Infinity — Incident Response Playbook

Version: 1.0

Effective Date: 2026-04-21

Owner: Harnoor Singh (harnoors@gmail.com)

Status: Active

Next Review: 2026-07-21 (or after first P0/P1 incident, whichever comes first)

---

> How to use this document. If you are reading this in the middle of an incident: go directly to the relevant Playbook section (A–F). The Scope and Severity Matrix sections exist for orientation and training — not for reading during a live crisis.

---

1. Scope + Philosophy

What Counts as an Incident

An incident is any unplanned event that causes or credibly risks:

Incidents do NOT include: anticipated maintenance windows, minor cosmetic defects with no user-facing impact, or speculative "what if" concerns with no concrete trigger.

The Clinical Weight of This Product

Silent Infinity is a wellness and mental-health-support application. This changes the stakes of every incident class listed above. An outage that would be a minor inconvenience on a productivity app can interrupt a user's only available emotional support at 2 a.m. A harmful AI output that would be embarrassing on a content platform can, in the context of a vulnerable user, contribute to genuine psychological harm.

Every engineer, every decision-maker, and every contractor working under this playbook is expected to hold that clinical weight consciously — especially when under pressure to move fast.

Blameless Post-Mortem Culture

This organization follows a blameless post-mortem standard. The premise, articulated clearly in Kim Etherington's 2004 work on restorative practice and operationalized at Etsy in John Allspaw's landmark 2012 post "Blameless PostMortems and a Just Culture," is as follows:

People who cause incidents are not bad actors. They are people operating under conditions — time pressure, incomplete information, unclear ownership, inherited technical debt — that made the wrong outcome more likely. Blame extinguishes the information that would have prevented the next incident. Blamelessness does not mean absence of accountability; it means the focus is on system conditions, not individual fault.

In practice this means:

Violation of blameless culture — e.g., using a post-mortem to publicly shame a contributor — is itself an organizational incident.

Severity Levels

| Level | Label | Plain Summary |
|-------|-------|---------------|
| P0 | Life threat | Imminent risk to a user's physical safety or life |
| P1 | Serious harm risk | Credible harm risk to user wellbeing or significant data exposure |
| P2 | Degraded service | Product works, but meaningfully impaired for many users |
| P3 | Minor | Single-user issue, cosmetic, or non-blocking |

---

2. Severity Matrix

The following matrix governs classification. When in doubt, escalate to the higher severity level. Downgrading is always possible once you have more information; under-classifying a P0 as a P1 wastes the critical first minutes.

| Severity | Trigger Examples | Response Window | Who Is Paged |
|----------|-----------------|-----------------|--------------|
| P0 | User in active suicidal crisis; crisis-detection system silent; active PII data breach with confirmed exfiltration; site completely unresponsive during peak mental-health hours | < 15 minutes | IC (Harnoor) + Clinical Advisor + Legal (if breach) + Technical Lead |
| P1 | User reports distress caused by AI output; crisis-pattern false negative (detected in retrospect); payment data potentially exposed (unconfirmed); latency > 8 seconds sustained > 30 min; Lambda error rate > 15% | < 1 hour | IC + Technical Lead + Clinical Advisor (clinical dimension) |
| P2 | Non-safety bug affecting > 10% of users; session memory corruption for a cohort; onboarding flow broken; sustained > 3 second latency; pattern of negative feedback across > 5 users in 24 hours | < 4 hours | Technical Lead notifies IC |
| P3 | Single-user display bug; cosmetic regression; non-blocking error message; minor copy error | Next business day | Filed in bug tracker; no page |

Classification Notes

Crisis-detection false negatives are always at least P1. If the system failed to trigger a crisis response when one was warranted — even if the user is now safe — the system failed in its core safety function. Treat this with the same urgency as a harm report.

Ambiguous user reports escalate to P1 by default. If a user sends a message suggesting they were harmed or distressed by an AI output and the cause is unclear, classify P1 until the clinical review determines otherwise. Do not assume a misunderstanding and self-downgrade without evidence.

Data incidents are P0 until scope is confirmed. If you believe there has been unauthorized data access but do not yet know scope, treat it as P0 while you investigate. Once you confirm it affected no live user data, you may downgrade.

Compound incidents multiply. A P2 outage simultaneously with a P2 harmful-output report should be treated as P1 minimum — multiple failure modes operating at once indicate a deeper system issue.
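The classification notes above can be read as a set of severity floors applied on top of the matrix. The following is a minimal sketch of that reading, not an implemented system: the `Report` fields and both function names are hypothetical, and the compound rule is interpreted here as "two or more concurrent P2 incidents become P1."

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower number means more severe, matching the P0-P3 labels.
    P0 = 0
    P1 = 1
    P2 = 2
    P3 = 3


@dataclass
class Report:
    # Hypothetical fields that mirror the classification notes.
    initial: Severity                        # first-pass level from the matrix
    crisis_false_negative: bool = False      # crisis detection failed to trigger
    ambiguous_harm_report: bool = False      # user reports harm, cause unclear
    data_access_scope_unknown: bool = False  # suspected access, scope unconfirmed


def classify(report: Report) -> Severity:
    """Apply the classification-note floors to a single report."""
    severity = report.initial
    if report.crisis_false_negative or report.ambiguous_harm_report:
        # Never less severe than P1.
        severity = min(severity, Severity.P1)
    if report.data_access_scope_unknown:
        # Data incidents stay at P0 until scope is confirmed.
        severity = Severity.P0
    return severity


def classify_compound(reports: list[Report]) -> Severity:
    """Classify simultaneous incidents (expects at least one report)."""
    levels = [classify(r) for r in reports]
    worst = min(levels)
    # Assumed reading of the compound rule: concurrent P2s escalate to P1.
    if worst == Severity.P2 and levels.count(Severity.P2) >= 2:
        return Severity.P1
    return worst
```

For example, `classify_compound([Report(Severity.P2), Report(Severity.P2)])` yields `Severity.P1`, while a single P2 alongside a P3 stays at P2. The floors only ever raise severity; downgrading remains a human decision once more information is available.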

---

3. Roles

Incident Commander (IC)

Default: Harnoor Singh

Backup: To be designated as team grows

The IC owns the incident from detection to closure. The IC does not need to personally fix anything — the IC coordinates, decides, and communicates. Responsibilities:

The IC should not be the one writing code during a live P0/P1 incident. If Harnoor is the only technical person available, bias toward communication and containment over deep diagnosis — a partially broken app that users know is broken is better than a silent failure.

Technical Lead

Default: Harnoor Singh (until dedicated engineering headcount)

Responsibilities:

Communications Lead

Default: Harnoor Singh (until dedicated team)

Responsibilities:

Clinical Advisor

Default: Placeholder — engage via American Foundation for Suicide Prevention (AFSP) partnership or licensed clinical psychologist on retainer. Deadline: [engage by 2026-06-01].

When engaged: Any P0 or P1 with a clinical dimension (crisis-detection failure, user harm report, harmful AI output in mental health context)

Responsibilities:

Legal Counsel

Default: Placeholder — engage by [2026-06-01]

When engaged: Data breach (any tier), legal threat (subpoena, C&D, plaintiff), P0 user-harm incident

Responsibilities:

Future Role: On-Call Operator

As Silent Infinity grows, a rotating on-call schedule will be established. The on-call operator assumes first-response duties during off-hours: incident detection, initial severity classification, paging of IC if P0/P1. This role does not yet exist; until it does, IC and Technical Lead monitoring responsibilities fall to Harnoor.

---

4. Playbook A — Crisis-User Harm

Trigger: A user reports (via /feedback form, direct email to harnoors@gmail.com, social media, or third-party press) that a Silent Infinity AI output caused distress or contributed to harm — including self-harm, suicidal crisis, or psychological deterioration.

Immediate Response (< 15 minutes from detection)

Short-Term Response (1 – 24 hours)

Medium-Term Response (24 – 72 hours)

What Never to Do

---

5. Playbook B — Data Breach

Trigger: Confirmed or credibly suspected unauthorized access to any tier of user data: conversation history, PII (name, email, phone), payment data, session tokens, or infrastructure credentials.

Immediate Response (< 1 hour)

Containment

Legal Obligations

| Jurisdiction | Obligation | Deadline |
|---|---|---|
| EU (GDPR Art. 33) | Notify supervisory authority if breach risks rights and freedoms of individuals | 72 hours from awareness |
| EU (GDPR Art. 34) | Notify affected data subjects if high risk | Without undue delay |
| California (CCPA) | Notify California AG if > 500 CA residents affected | Most expedient time possible |
| State laws | 50-state review required; most states have breach notification laws | 30–60 days typical; confirm with Legal |

Legal Counsel owns regulatory notification. IC owns user communication. The 72-hour GDPR clock starts from when you became aware of a probable breach — not from when you confirmed scope.
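Because the 72-hour clock runs from awareness, it can help to compute the deadline the moment a probable breach is recognized. This is a minimal sketch; the function names are illustrative and the result is a planning aid, not legal advice.

```python
from datetime import datetime, timedelta, timezone

GDPR_ART33_WINDOW = timedelta(hours=72)


def gdpr_notification_deadline(aware_at: datetime) -> datetime:
    """Return the Art. 33 deadline in UTC, counted from awareness."""
    if aware_at.tzinfo is None:
        # Naive timestamps are ambiguous during an incident; reject them.
        raise ValueError("aware_at must be timezone-aware")
    return aware_at.astimezone(timezone.utc) + GDPR_ART33_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline; negative once it has passed.

    Both arguments must be timezone-aware.
    """
    return (gdpr_notification_deadline(aware_at) - now).total_seconds() / 3600
```

If awareness is logged at 2026-05-01 09:30 UTC, the deadline is 2026-05-04 09:30 UTC regardless of when scope is later confirmed.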

User Communication

Root-Cause Analysis (72-hour target)

---

6. Playbook C — Outage

Trigger: silentinfinity.com returning 5xx errors at > 5% rate for > 2 minutes, or completely unresponsive to synthetic canary checks.
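The error-rate half of this trigger can be expressed as a small check over timestamped samples. This is a sketch under assumptions: the sample shape `(epoch_seconds, requests, errors_5xx)` and the function name are hypothetical, and duration is measured from the first breaching sample to the latest one in an unbroken run.

```python
def outage_triggered(samples, rate_threshold=0.05, min_duration_s=120):
    """Return True if the 5xx rate stayed above the threshold for longer
    than min_duration_s.

    samples: iterable of (epoch_seconds, requests, errors_5xx), time-ordered.
    """
    breach_start = None
    for ts, requests, errors in samples:
        # A sample with zero requests counts as "no breach" here; total
        # unresponsiveness is covered by the canary check, not this function.
        rate = errors / requests if requests else 0.0
        if rate > rate_threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start > min_duration_s:
                return True
        else:
            # Any healthy sample resets the run.
            breach_start = None
    return False
```

With one sample per minute, four consecutive samples at a 10% error rate trip the check, since the run then spans 180 seconds. A single healthy minute in between resets the clock.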

Auto-Detection

Immediate Response (< 15 minutes)

Diagnosis Path

Work the chain front-to-back. Stop at the first layer where the errors originate, not merely the first layer that relays errors from a downstream hop.


CloudFront → API Gateway → Lambda → Bedrock (Claude) → DynamoDB

1. CloudFront: Check distribution error rates in CloudWatch. Is the origin returning errors, or is CloudFront itself the issue (e.g., SSL cert expiry)?

2. API Gateway: Check 5xx rate by route. Is it all routes or one? Throttling events?

3. Lambda: Check invocation errors, duration (timeout?), concurrency limits hit, code exception rate. CloudWatch Logs Insights: filter @message like "ERROR" across the function log group.

4. Bedrock (Claude API): Check for AWS service health events at health.aws.amazon.com. Check Bedrock-specific throttle errors in Lambda logs.

5. DynamoDB: Check consumed capacity, throttle events, table-level error metrics.

6. X-Ray: If the above doesn't isolate the issue, pull a service map trace from X-Ray for a representative failed request.
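The ordering in the diagnosis path can be sketched as a loop over the chain that stops at the first layer found to be at fault. The check callables here are hypothetical stand-ins for the manual CloudWatch and health-dashboard checks in steps 1 to 5; nothing in this sketch calls AWS.

```python
from typing import Callable, Dict, Optional

# Front-to-back order taken from the chain above.
CHAIN = ["CloudFront", "API Gateway", "Lambda", "Bedrock (Claude)", "DynamoDB"]


def first_faulty_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run per-layer checks in chain order and return the first faulty layer.

    checks maps a layer name to a callable that returns True when that
    layer itself is at fault. Layers without a check are skipped.
    """
    for layer in CHAIN:
        check = checks.get(layer)
        if check is not None and check():
            return layer  # stop here; deeper layers are not checked
    # Nothing isolated: fall back to an X-Ray trace (step 6).
    return None
```

Returning `None` rather than raising keeps the "nothing isolated" outcome explicit, which maps to the X-Ray step.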

Rollback Procedure

Silent Infinity uses blue/green Lambda deployments via weighted aliases.
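With weighted aliases, a rollback amounts to pointing the alias back at the last known-good version and clearing any weighted routing. The sketch below shows one way to do that with boto3's `update_alias`; the function name `silent-infinity-api`, the alias `live`, and the version number are hypothetical placeholders, not values from this playbook.

```python
def rollback_alias_kwargs(function_name: str, alias: str, stable_version: str) -> dict:
    """Build update_alias arguments that send all traffic to stable_version."""
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": stable_version,
        # An empty weights map removes any weighted routing to a second version.
        "RoutingConfig": {"AdditionalVersionWeights": {}},
    }


def rollback(function_name: str, alias: str, stable_version: str) -> dict:
    """Apply the rollback. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("lambda")
    return client.update_alias(
        **rollback_alias_kwargs(function_name, alias, stable_version)
    )


# Hypothetical usage:
# rollback("silent-infinity-api", "live", "41")
```

Separating the argument builder from the API call makes the rollback parameters easy to review or dry-run before anything touches the live alias.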

User Communication

Post-Outage

---

7. Playbook D — Harmful AI Output

Trigger: Claude generates content that violates Silent Infinity's published safety commitments. Categories: (a) clinical-level medical or psychiatric advice; (b) harmful instructions (self-harm methods, dangerous substance use guidance); (c) explicit sexual or violent content; (d) contemptuous, demeaning, or discriminatory language toward a user.

Detection surfaces:

Immediate Response (< 1 hour)

Short-Term: Root-Cause Classification (1 – 24 hours)

Determine which of three failure modes applies:

Failure Mode 1: Prompt engineering bug

The system prompt failed to establish a constraint, or established it ambiguously. Example: system prompt says "avoid detailed discussion of medication overdose" but does not define "detailed," and model produced a technically compliant but clinically dangerous response.

Failure Mode 2: Model behavior outside prompt

The system prompt was clear and correct; the model violated it anyway. This happens occasionally with large language models under adversarial or edge-case prompts.

Failure Mode 3: User induction (adversarial prompting)

User deliberately crafted input to circumvent safety guardrails (jailbreak, role-play framing, multi-step manipulation).

User Communication

---

8. Playbook E — Legal / Regulatory Threat

Trigger: Receipt of a subpoena, civil investigative demand, cease-and-desist letter, regulatory inquiry letter (FTC, state AG, HIPAA enforcement), or a communication from a plaintiff's attorney.

Immediate Response (< 4 hours)

Legal Hold

Regulatory Notification Intersections

If the legal threat arises from a data incident or user-harm incident, the notification obligations from Playbooks A and B apply concurrently. Do not allow the legal process to delay required user notifications — consult Legal Counsel, but understand that GDPR 72-hour timelines do not pause for legal strategy.

Evaluating Liability Defense

Silent Infinity maintains the following in service of product-liability defense:

These records are the foundation of any good-faith defense. Legal Counsel should be briefed on this architecture at engagement.

Valid Subpoena: Compliance and User Notification

---

9. Playbook F — Press Crisis

Trigger: A negative press story about Silent Infinity is published or clearly in progress (reporter inquiry); a viral complaint thread on social media (> 500 engagements or picked up by a journalist); a coordinated inauthentic attack campaign (flagged by pattern of similar accounts).

Immediate Response (< 1 hour)

Substantive Complaint (real harm alleged)

A substantive complaint means the negative coverage alleges something that could be true — a real user who experienced real harm, a genuine safety failure, a factual claim about our data practices.

Adversarial Campaign (misrepresentation or bad-faith attack)

An adversarial campaign means the claims being made are false, distorted, or being amplified in a coordinated inauthentic manner.

If a Reporter Contacts Us

---

10. Post-Mortem Template

Copy this template for each incident post-mortem. File location: F:/TITAN/post-mortems/[DATE]-[SEVERITY]-[SLUG].md
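A small helper can keep file names consistent with the `[DATE]-[SEVERITY]-[SLUG].md` pattern. This is a sketch under assumptions: the pattern does not specify a date format or slug style, so ISO dates (YYYY-MM-DD) and lowercase hyphenated slugs are assumed here, and the function name is illustrative.

```python
import re
from datetime import date


def postmortem_filename(incident_date: date, severity: str, title: str) -> str:
    """Build a file name of the form DATE-SEVERITY-SLUG.md."""
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{incident_date.isoformat()}-{severity}-{slug}.md"
```

For instance, `postmortem_filename(date(2026, 5, 1), "P1", "Crisis detection: false negative!")` returns `2026-05-01-P1-crisis-detection-false-negative.md`.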

---


# Post-Mortem: [Incident Slug]

**Date of Incident:** [YYYY-MM-DD]
**Severity:** [P0 / P1 / P2 / P3]
**Incident Commander:** [Name]
**Technical Lead:** [Name]
**Total Duration:** [HH:MM from detection to resolution]
**Users Affected:** [number or "unknown"]
**Publication Status:** [Internal Only / Published at /safety/transparency]

---

## Summary

[2–4 sentences. What happened, who was affected, how it was resolved.]

---

## Timeline

All times in [timezone]. Times are approximate.

| Time | Event |
|------|-------|
| HH:MM | First detection / alert fired |
| HH:MM | IC declared incident, severity assigned |
| HH:MM | [action taken] |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed / rollback executed |
| HH:MM | Incident resolved; monitoring continues |
| HH:MM | Post-mortem convened |

---

## Root Cause

### 5 Whys

1. **Why did the incident occur?**
   [First-order cause]

2. **Why did [first-order cause] happen?**
   [Second-order cause]

3. **Why did [second-order cause] happen?**
   [Third-order cause]

4. **Why did [third-order cause] happen?**
   [Fourth-order cause]

5. **Why did [fourth-order cause] happen?**
   [Root cause — this is what goes in the summary]

---

## Contributing Factors

- [System condition, process gap, or tooling limitation that made this outcome more likely]
- [...]

---

## What Went Well

- [Things that worked: detection speed, containment, communication, team coordination]
- [...]

---

## What Didn't Go Well

- [Things that slowed response, increased impact, or created confusion]
- [...]

---

## Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Specific, testable action] | [Name] | [YYYY-MM-DD] | Open |
| [...] | [...] | [...] | [...] |

---

## Appendix

[Full logs, screenshots, or other evidence — linked, not pasted inline]

---

Publication Policy

| Severity | Publication Obligation |
|---|---|
| P0 | Always published at /safety/transparency within 14 days of incident close |
| P1 | Always published within 14 days |
| P2 | Published if incident lasted > 4 hours or affected > 100 users |
| P3 | Internal only |

Publication means the full post-mortem, edited only to protect user privacy (no PII, no uid/cid). Clinical or legal details that are subject to privilege may be redacted with a notation explaining the redaction.
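The publication table reduces to a short decision function. This is an illustrative sketch: the function name is hypothetical, and it assumes a known user count, so an incident recorded with "unknown" users affected still needs a human decision.

```python
from datetime import timedelta


def must_publish(severity: str, duration: timedelta, users_affected: int) -> bool:
    """Return True if the post-mortem must be published under the policy table."""
    if severity in ("P0", "P1"):
        return True
    if severity == "P2":
        # Strict inequalities, as written in the table.
        return duration > timedelta(hours=4) or users_affected > 100
    if severity == "P3":
        return False
    raise ValueError(f"unknown severity: {severity!r}")
```

A P2 incident lasting exactly 4 hours with exactly 100 affected users is not published under this reading, because both thresholds in the table are strict.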

---

11. Contact Sheet

Internal

| Role | Name | Contact | Backup Contact |
|---|---|---|---|
| Incident Commander / Founder | Harnoor Singh | harnoors@gmail.com | Phone: [to be added] |
| Technical Lead | Harnoor Singh (current) | harnoors@gmail.com | - |
| Clinical Advisor | [To be engaged by 2026-06-01] | Via AFSP: afsp.org | - |
| Legal Counsel | [To be engaged by 2026-06-01] | [TBD] | - |

AWS Support

Crisis Lines (for direct reference in user communications)

| Service | Contact |
|---|---|
| 988 Suicide & Crisis Lifeline (US) | Call or text 988 |
| Emergency Services (US) | 911 |
| International Crisis Lines | findahelpline.com |
| SAMHSA Treatment Locator | findtreatment.gov |
| Crisis Text Line (US) | Text HOME to 741741 |

These lines should be included verbatim in any Playbook A user communication. Do not paraphrase. Do not omit.

---

12. Testing the Playbook

Quarterly Tabletop Exercise

Once per quarter (suggested schedule: Jan, Apr, Jul, Oct), the team conducts a structured tabletop walkthrough of one playbook scenario.

Format (90 minutes):

1. IC selects a playbook scenario and assigns roles (30 minutes prep)

2. IC reads a scenario prompt aloud: "It is 2:00 a.m. on a Tuesday. You receive the following message: [scenario]"

3. Each role-holder talks through what they would do, in order, referencing the playbook

4. IC pauses the walkthrough when a gap is identified: "We don't have a step for this. What should we do?"

5. Gaps are documented and converted to pull requests against this document

Rotation: Cover Playbook A (harm) and Playbook B (breach) at least annually. The others rotate.

Output: A brief gap log, filed at F:/TITAN/post-mortems/tabletop-[DATE].md

Annual Production Drill

Once per year, schedule a staged degraded-mode rehearsal on a staging deployment slot:

This drill should be scheduled at least 2 weeks in advance, announced internally, and should never touch the production environment without explicit IC authorization.

---

13. References

1. Allspaw, J. (2012). "Blameless PostMortems and a Just Culture." Etsy Engineering Blog (Code as Craft). Foundation of our blameless culture standard.

2. Allspaw, J. & Robbins, J. (Eds.). Web Operations: Keeping the Data on Time. O'Reilly Media, 2010. Practical incident response patterns from which the role structure in this playbook is adapted.

3. Beyer, B., Jones, C., Petoff, J., & Murphy, N.R. (Eds.). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016. Canonical reference for on-call practice, post-mortem culture, and error budget frameworks. Available free at sre.google/sre-book.

4. Cichonski, P., Millar, T., Grance, T., & Scarfone, K. Computer Security Incident Handling Guide (NIST SP 800-61 Revision 2). NIST, 2012. Authoritative federal guidance on forensic preservation, chain of custody, and incident classification. Informs our breach containment sequencing.

5. Etherington, K. (2004). Trauma Counselling and Narrative Therapy. Jessica Kingsley Publishers. Foundational work on restorative and blameless frameworks in human-centered contexts.

6. GDPR Article 33–34. Regulation (EU) 2016/679. Binding legal standard for breach notification obligations for EU data subjects.

7. CCPA / CPRA. California Consumer Privacy Act and California Privacy Rights Act. Governs breach notification obligations for California residents.

---

This playbook is a living document. It must be reviewed following every P0 or P1 incident, and on a scheduled quarterly basis regardless of incident activity. The most dangerous version of this document is one that is out of date and trusted anyway.

Last Updated: 2026-04-21

Updated By: Harnoor Singh

Next Scheduled Review: 2026-07-21