Silent Infinity — Incident Response Playbook

Version: 1.0

Effective Date: 2026-04-21

Owner: Harnoor Singh (harnoors@gmail.com)

Status: Active

Next Review: 2026-07-21 (or after first P0/P1 incident, whichever comes first)

---

> How to use this document. If you are reading this in the middle of an incident: go directly to the relevant Playbook section (A–F). The Scope and Severity Matrix sections exist for orientation and training — not for reading during a live crisis.

---

1. Scope + Philosophy

What Counts as an Incident

An incident is any unplanned event that causes or credibly risks:

User harm — physical, psychological, or financial damage to a Silent Infinity user, including distress triggered by AI output
Data breach — unauthorized access to, exfiltration of, or destruction of user data (conversations, PII, payment records, session metadata)
Service outage — the product is unavailable or functionally degraded to a degree that disrupts active users
Harmful AI output — the model generates content that violates our published safety commitments (crisis-response failures, medical advice, explicit content, contempt)
Legal threat — subpoena, cease-and-desist, regulatory inquiry, or a credible plaintiff's legal action
Press crisis — negative media coverage, viral complaint thread, or a coordinated reputational attack

Incidents do NOT include: anticipated maintenance windows, minor cosmetic defects with no user-facing impact, or speculative "what if" concerns with no concrete trigger.

The Clinical Weight of This Product

Silent Infinity is a wellness and mental-health-support application. This changes the stakes of every incident class listed above. An outage that would be a minor inconvenience on a productivity app can interrupt a user's only available emotional support at 2 a.m. A harmful AI output that would be embarrassing on a content platform can, in the context of a vulnerable user, contribute to genuine psychological harm.

Every engineer, every decision-maker, and every contractor working under this playbook is expected to hold that clinical weight consciously — especially when under pressure to move fast.

Blameless Post-Mortem Culture

This organization follows a blameless post-mortem standard. The premise, articulated in Kim Etherington's 2004 work on restorative practice and operationalized by John Allspaw at Etsy in his 2012 post "Blameless PostMortems and a Just Culture," is as follows:

People who cause incidents are not bad actors. They are people operating under conditions — time pressure, incomplete information, unclear ownership, inherited technical debt — that made the wrong outcome more likely. Blame extinguishes the information that would have prevented the next incident. Blamelessness does not mean absence of accountability; it means the focus is on system conditions, not individual fault.

In practice this means:

Post-mortem reports do not name individuals as causes
Engineers are encouraged to share full information, including embarrassing mistakes, without fear of punishment
Action items target systems, documentation, tooling, and processes — not people
Harnoor, as IC and founder, is equally subject to this standard

Violation of blameless culture — e.g., using a post-mortem to publicly shame a contributor — is itself an organizational incident.

Severity Levels

| Level | Label | Plain Summary |
|-------|-------|---------------|
| P0 | Life threat | Imminent risk to a user's physical safety or life |
| P1 | Serious harm risk | Credible harm risk to user wellbeing or significant data exposure |
| P2 | Degraded service | Product works, but meaningfully impaired for many users |
| P3 | Minor | Single-user issue, cosmetic, or non-blocking |

---

2. Severity Matrix

The following matrix governs classification. When in doubt, escalate to the higher severity level. Downgrading is always possible once you have more information; under-classifying a P0 as a P1 wastes the critical first minutes.

| Severity | Example Triggers | Response Target | Roles Paged |
|----------|-----------------|-----------------|--------------|
| P0 | User in active suicidal crisis; crisis-detection system silent; active PII data breach with confirmed exfiltration; site completely unresponsive during peak mental-health hours | < 15 minutes | IC (Harnoor) + Clinical Advisor + Legal (if breach) + Technical Lead |
| P1 | User reports distress caused by AI output; crisis-pattern false negative (detected in retrospect); payment data potentially exposed (unconfirmed); latency > 8 seconds sustained > 30 min; Lambda error rate > 15% | < 1 hour | IC + Technical Lead + Clinical Advisor (clinical dimension) |
| P2 | Non-safety bug affecting > 10% of users; session memory corruption for a cohort; onboarding flow broken; sustained > 3 second latency; pattern of negative feedback across > 5 users in 24 hours | < 4 hours | Technical Lead notifies IC |

Classification Notes

Crisis-detection false negatives are always at least P1. If the system failed to trigger a crisis response when one was warranted — even if the user is now safe — the system failed in its core safety function. Treat this with the same urgency as a harm report.

Ambiguous user reports escalate to P1 by default. If a user sends a message suggesting they were harmed or distressed by an AI output and the cause is unclear, classify P1 until the clinical review determines otherwise. Do not assume a misunderstanding and self-downgrade without evidence.

Data incidents are P0 until scope is confirmed. If you believe there has been unauthorized data access but do not yet know scope, treat it as P0 while you investigate. Once you confirm it affected no live user data, you may downgrade.

Compound incidents multiply. A P2 outage simultaneously with a P2 harmful-output report should be treated as P1 minimum — multiple failure modes operating at once indicate a deeper system issue.
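The classification notes above can be read as a small set of floor rules applied on top of an initial assessment. The sketch below is one possible encoding, not an implemented tool: the `Severity` enum, the `classify` function, and its flag names are hypothetical, and the compound rule is encoded only for the P2-plus-P2 case the note describes.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower value means more severe, matching the P0-P3 labels."""
    P0 = 0
    P1 = 1
    P2 = 2
    P3 = 3


def classify(findings: list[Severity],
             crisis_false_negative: bool = False,
             ambiguous_harm_report: bool = False,
             data_scope_unconfirmed: bool = False) -> Severity:
    """Apply the Classification Notes to one or more concurrent findings."""
    if not findings:
        raise ValueError("at least one finding is required")

    # Start from the most severe individual finding.
    worst = min(findings)

    # Compound rule: two concurrent P2 findings are treated as P1 minimum.
    if sum(1 for s in findings if s == Severity.P2) >= 2:
        worst = min(worst, Severity.P1)

    # Crisis-detection false negatives and ambiguous harm reports
    # are always at least P1.
    if crisis_false_negative or ambiguous_harm_report:
        worst = min(worst, Severity.P1)

    # Data incidents are P0 until scope is confirmed.
    if data_scope_unconfirmed:
        worst = Severity.P0

    return worst
```

Under these assumptions, `classify([Severity.P2, Severity.P2])` returns `Severity.P1`, and any call with `data_scope_unconfirmed=True` returns `Severity.P0` regardless of the other inputs. Downgrading remains a human decision made once more information is available.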

---

3. Roles

Incident Commander (IC)

Default: Harnoor Singh

Backup: To be designated as team grows

The IC owns the incident from detection to closure. The IC does not need to personally fix anything — the IC coordinates, decides, and communicates. Responsibilities:

Declare the incident and assign its severity level
Page the appropriate roles based on the severity matrix
Make the final call on user-facing communications (timing, tone, content)
Authorize any change to production (rollbacks, emergency patches, service disablement)
Convene and chair the post-mortem
Approve the final post-mortem for publication

The IC should not be the one writing code during a live P0/P1 incident. If Harnoor is the only technical person available, bias toward communication and containment over deep diagnosis — a partially broken app that users know is broken is better than a silent failure.

Technical Lead

Default: Harnoor Singh (until dedicated engineering headcount)

Responsibilities:

Diagnose root cause using CloudWatch, X-Ray, CloudTrail, and application logs
Execute containment actions (rollback, traffic shift, credential rotation, feature flag disable)
Communicate technical status to IC in plain English every 15 minutes during P0/P1
Lead root-cause analysis for post-mortem
Own technical action items in post-mortem

Communications Lead

Default: Harnoor Singh (until dedicated team)

Responsibilities:

Draft all user-facing messages (email, in-app notification, status page)
Draft any press-facing statements — must be reviewed by IC before publication
Monitor social channels for escalating user sentiment during an incident
Update status page (silentinfinity.com/status — to be built) at agreed intervals
Never publish communications without IC sign-off

Clinical Advisor

Default: Placeholder — engage via American Foundation for Suicide Prevention (AFSP) partnership or licensed clinical psychologist on retainer. Deadline: [engage by 2026-06-01].

When engaged: Any P0 or P1 with a clinical dimension (crisis-detection failure, user harm report, harmful AI output in mental health context)

Responsibilities:

Review the specific conversation or output that triggered the incident
Advise on adequacy of our immediate user response (tone, resources offered)
Recommend clinical protocol changes if the gap exposed is pattern-level
Provide expert opinion for any regulatory or legal response involving clinical claims

Legal Counsel

Default: Placeholder — engage by [2026-06-01]

When engaged: Data breach (any tier), legal threat (subpoena, C&D, plaintiff), P0 user-harm incident

Responsibilities:

Advise on regulatory notification obligations (GDPR 72-hour rule, CCPA, HIPAA-adjacent)
Review any public statement before publication in a legal-threat context
Coordinate legal hold issuance
Interface with plaintiff's counsel or regulatory authority

Future Role: On-Call Operator

As Silent Infinity grows, a rotating on-call schedule will be established. The on-call operator assumes first-response duties during off-hours: incident detection, initial severity classification, paging of IC if P0/P1. This role does not yet exist; until it does, IC and Technical Lead monitoring responsibilities fall to Harnoor.

---

4. Playbook A — Crisis-User Harm

Trigger: A user reports (via /feedback form, direct email to harnoors@gmail.com, social media, or third-party press) that a Silent Infinity AI output caused distress or contributed to harm — including self-harm, suicidal crisis, or psychological deterioration.

Immediate Response (< 15 minutes from detection)

[ ] IC declares incident, assigns severity (P0 if user is in active crisis; P1 if harm is reported retrospectively)
[ ] Send acknowledgment to user within 15 minutes: "We've received your message. A real person is reviewing this now. If you need immediate support, please contact 988 (call or text), 911, or findahelpline.com."
[ ] Do NOT send an automated or templated-sounding response. Write it as a human.
[ ] Page Clinical Advisor. Do not wait for business hours.
[ ] Retrieve the full conversation log: DynamoDB sessions table, uid + cid. Do not alter any record.
[ ] Assign Technical Lead to pull guardrails.py activation log for the relevant session

Short-Term Response (1 – 24 hours)

[ ] Technical Lead reviews full conversation turn-by-turn against guardrails.py and crisis-patterns-v1.json
[ ] Determine the failure mode: (a) known pattern gap: the harm case was not in our patterns; (b) guardrail present but did not trigger correctly (it stayed silent or fired the wrong response); (c) guardrail present, triggered, but insufficient; (d) user circumvented guardrail through adversarial prompting
[ ] Clinical Advisor assesses: given the conversation, was our response clinically adequate? What would a licensed counselor have done differently?
[ ] IC drafts a personal follow-up to the affected user. Honest, specific, no legalese. Offer: "If you'd like to speak with me directly, I will make time for a call." — and mean it.
[ ] If user is at ongoing risk: provide direct warm-transfer language to human crisis support. Do not leave the user with only a list of phone numbers.

Medium-Term Response (24 – 72 hours)

[ ] If a pattern gap was identified: update crisis-patterns-v1.json with the new pattern, test against the 50-case regression suite, deploy with SHA-versioned commit
[ ] If a prompt engineering gap was identified: update system prompt, deploy, document in changelog
[ ] Schedule a clinical review with Clinical Advisor: is there a category of harm we have systematically underweighted?
[ ] Document the full incident in the quarterly transparency report. Do not omit it. Transparency is a commitment, not a marketing decision.
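The pattern update and regression step above could be automated along the following lines. This is a minimal sketch under stated assumptions: the schema of crisis-patterns-v1.json is not documented here, so the `{"patterns": [...]}` shape, the helper names, and the sample phrases are all hypothetical.

```python
import json
import re


def load_patterns(raw_json: str) -> list[re.Pattern]:
    """Compile regexes from JSON shaped like {"patterns": ["...", ...]} (assumed schema)."""
    data = json.loads(raw_json)
    return [re.compile(p, re.IGNORECASE) for p in data["patterns"]]


def is_flagged(text: str, patterns: list[re.Pattern]) -> bool:
    """Return True if any pattern matches anywhere in the text."""
    return any(p.search(text) for p in patterns)


def run_regression(patterns: list[re.Pattern],
                   cases: list[tuple[str, bool]]) -> list[tuple[str, bool]]:
    """Return the cases whose actual flag differs from the expected flag.

    Each case is (message_text, expected_flag). An empty result means
    the pattern set passes the suite.
    """
    return [(text, expected) for text, expected in cases
            if is_flagged(text, patterns) != expected]
```

A deploy gate could refuse to ship a new pattern file unless `run_regression` returns an empty list for the full regression suite, including the case that exposed the gap.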

What Never to Do

Never deny or minimize. "Our AI did not cause harm" is not a statement you can make without a full clinical review. If you say it prematurely and it turns out to be wrong, you have created a much worse secondary incident.
Never gag. Do not ask users to sign NDAs or stay quiet in exchange for compensation. The legal and reputational cost of this approach was demonstrated clearly when Character.AI faced regulatory scrutiny in January 2026 following reports of harm to minor users; their handling of user communications became as much of the story as the underlying failure.
Never route a distressed user through a ticket queue. Every person who reaches out after a harm incident is a person first.

---

5. Playbook B — Data Breach

Trigger: Confirmed or credibly suspected unauthorized access to any tier of user data: conversation history, PII (name, email, phone), payment data, session tokens, or infrastructure credentials.

Immediate Response (< 1 hour)

[ ] IC declares incident P0
[ ] Page Legal Counsel immediately. Do not wait for scope confirmation.
[ ] Do NOT delete, overwrite, or rotate anything until scope is confirmed: you may destroy forensic evidence
[ ] Technical Lead pulls CloudTrail: identify anomalous API calls, access patterns, source IPs, and IAM principals in the last 72 hours
[ ] Confirm scope: what data store was accessed? DynamoDB table(s)? S3 bucket(s)? Lambda environment variables? Secrets Manager?
[ ] Confirm access method: compromised IAM credential? Misconfigured bucket policy? Injected code? Third-party dependency?
[ ] Once scope is confirmed: rotate ALL credentials that could have been exposed. In order of priority: (1) AWS root account if applicable, (2) IAM access keys for affected roles, (3) Secrets Manager values, (4) third-party API keys stored in environment variables
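The CloudTrail review in the checklist above can be partly mechanized once events have been exported as JSON records. The sketch below is illustrative only. It relies on the standard CloudTrail record fields `eventTime`, `eventName`, `sourceIPAddress`, and `userIdentity.arn`; the `known_ips` allowlist and the function names are assumptions. Note that `sourceIPAddress` can also hold an AWS service hostname, which an allowlist would need to account for.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def parse_event_time(value: str) -> datetime:
    """Parse a CloudTrail eventTime such as 2026-04-21T02:00:00Z (UTC)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def anomalous_calls(events: list[dict], known_ips: set[str],
                    now: datetime, window_hours: int = 72) -> dict:
    """Group recent calls from unrecognized sources by IAM principal ARN.

    Returns {principal_arn: [(event_name, source_ip), ...]} for events
    inside the look-back window whose source is not in known_ips.
    """
    cutoff = now - timedelta(hours=window_hours)
    by_principal = defaultdict(list)
    for event in events:
        if parse_event_time(event["eventTime"]) < cutoff:
            continue  # older than the look-back window
        if event["sourceIPAddress"] in known_ips:
            continue  # expected source
        arn = event.get("userIdentity", {}).get("arn", "unknown")
        by_principal[arn].append((event["eventName"], event["sourceIPAddress"]))
    return dict(by_principal)
```

This only reads exported records, so it is compatible with the instruction not to alter anything before scope is confirmed.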

Containment

[ ] If a Lambda execution role is confirmed compromised: revoke it. Accept that Lambda functions will error until re-credentialed. This is the correct tradeoff.
[ ] Rotate DynamoDB KMS customer-managed keys if table-level access is suspected
[ ] Enable S3 Object Lock on any bucket not already protected
[ ] Revoke any active sessions that could have been minted by compromised credentials

Legal Obligations

| Jurisdiction | Obligation | Deadline |
|---|---|---|
| EU (GDPR Art. 33) | Notify supervisory authority if breach risks rights and freedoms of individuals | 72 hours from awareness |
| EU (GDPR Art. 34) | Notify affected data subjects if high risk | Without undue delay |
| California (breach notification statute, Civ. Code § 1798.82) | Notify California AG if > 500 CA residents affected | Most expedient time possible; confirm with Legal |
| State laws | 50-state review required; most have breach notification laws | 30–60 days typical; confirm with Legal |

Legal Counsel owns regulatory notification. IC owns user communication. The 72-hour GDPR clock starts from when you became aware of a probable breach — not from when you confirmed scope.
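Because the 72-hour clock runs from awareness, it helps to record the awareness time once and derive the deadline from it. A minimal sketch, assuming timestamps are kept in UTC; the function names are hypothetical and this is not legal advice on when awareness begins.

```python
from datetime import datetime, timedelta, timezone

GDPR_NOTIFICATION_WINDOW = timedelta(hours=72)


def gdpr_deadline(aware_at: datetime) -> datetime:
    """Return the Art. 33 notification deadline for a given awareness time."""
    if aware_at.tzinfo is None:
        raise ValueError("aware_at must be timezone-aware")
    return aware_at + GDPR_NOTIFICATION_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline; negative once it has passed."""
    return (gdpr_deadline(aware_at) - now).total_seconds() / 3600
```

For an awareness time of 2026-04-21 02:00 UTC, the deadline falls at 2026-04-24 02:00 UTC, whether or not scope has been confirmed by then.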

User Communication

[ ] Email all affected users within 72 hours. Required content: what happened, what data was accessed, what we have done, what users should do, who to contact with questions
[ ] Post public disclosure on /safety/transparency page
[ ] If financial data (payment card or bank information) was accessed: offer free credit monitoring service (12 months minimum) to all affected users
[ ] Tone: factual, direct, non-defensive. Do not use the phrase "we take security seriously" — show it, don't say it.

Root-Cause Analysis (72-hour target)

[ ] How was the breach achieved (attack vector)?
[ ] What data was accessed, for how long, by whom?
[ ] How many users are affected, and what tier of data?
[ ] What controls failed, and why?
[ ] What controls prevented further damage?
[ ] Full post-mortem published within 14 days (P0/P1 obligation)

---

6. Playbook C — Outage

Trigger: silentinfinity.com returning 5xx errors at > 5% rate for > 2 minutes, or completely unresponsive to synthetic canary checks.

Auto-Detection

CloudWatch alarm on API Gateway 5xx rate > 5% for 2 consecutive minutes → SNS → email + SMS to IC
CloudWatch synthetic canary: GET /api/health every 60 seconds — alarm on 2 consecutive failures
CloudFront error rate alarm: > 3% 5xx across all distributions
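CloudWatch evaluates these alarms itself. The sketch below only mirrors the "threshold exceeded for N consecutive periods" logic so that thresholds can be reasoned about or replayed against exported metrics during a drill. The function names are hypothetical.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def should_alarm(rates: list[float], threshold: float = 0.05,
                 consecutive: int = 2) -> bool:
    """Return True if `consecutive` back-to-back datapoints exceed the threshold.

    `rates` is an ordered series with one datapoint per evaluation period
    (one minute for the API Gateway alarm described above).
    """
    run = 0
    for rate in rates:
        # Extend the current run of breaching datapoints, or reset it.
        run = run + 1 if rate > threshold else 0
        if run >= consecutive:
            return True
    return False
```

With the defaults, a single one-minute spike does not alarm, while two breaching minutes in a row do. The comparison is strict, so a rate of exactly 5% does not count as a breach.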

Immediate Response (< 15 minutes)

[ ] IC acknowledges alarm and declares incident, assigns severity
[ ] Technical Lead begins diagnosis (see Diagnosis Path below)
[ ] Communications Lead updates status page to "Investigating" within 5 minutes of IC declaration
[ ] If diagnosis is not complete within 15 minutes and users are clearly affected: post initial public update acknowledging the issue

Diagnosis Path

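Work the chain front-to-back. At each layer, check whether the errors originate there or are being passed up from a layer behind it, and stop at the first layer where they originate.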


CloudFront → API Gateway → Lambda → Bedrock (Claude) → DynamoDB

1. CloudFront: Check distribution error rates in CloudWatch. Is the origin returning errors, or is CloudFront itself the issue (e.g., SSL cert expiry)?

2. API Gateway: Check 5xx rate by route. Is it all routes or one? Throttling events?

3. Lambda: Check invocation errors, duration (timeout?), concurrency limits hit, code exception rate. CloudWatch Logs Insights: filter @message like "ERROR" across the function log group.

4. Bedrock (Claude API): Check for AWS service health events at health.aws.amazon.com. Check Bedrock-specific throttle errors in Lambda logs.

5. DynamoDB: Check consumed capacity, throttle events, table-level error metrics.

6. X-Ray: If the above doesn't isolate the issue, pull a service map trace from X-Ray for a representative failed request.
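The diagnosis path above amounts to an ordered walk with early exit. A minimal sketch, assuming each layer has a check function that returns True when errors originate at that layer; the `CHAIN` constant and the check callables are hypothetical stand-ins for the manual CloudWatch checks.

```python
from typing import Callable, Optional

# Request path, front to back, as listed in the diagnosis path.
CHAIN = ["CloudFront", "API Gateway", "Lambda", "Bedrock", "DynamoDB"]


def first_failing_layer(checks: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run checks in chain order and return the first layer reporting errors.

    Layers without a registered check are skipped. Later checks are not
    run once a failing layer is found.
    """
    for layer in CHAIN:
        check = checks.get(layer)
        if check is not None and check():
            return layer
    # Nothing isolated: fall back to an X-Ray trace of a failed request.
    return None
```

Stopping early keeps the Technical Lead from spending the first minutes of an outage on layers behind the one that is actually failing.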

Rollback Procedure

Silent Infinity uses blue/green Lambda deployments via weighted aliases.

[ ] Identify last-known-good Lambda version from deployment history
[ ] Update Lambda alias traffic weight: 100% to previous version
[ ] Target: < 5 minutes from rollback decision to full traffic on previous version
[ ] Verify rollback effectiveness: watch 5xx rate in CloudWatch for 3 minutes post-shift
[ ] If rollback does not resolve: the issue is probably infrastructure (API Gateway, DynamoDB, Bedrock, CloudFront) rather than the latest application deploy. Escalate to AWS Support.
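The alias shift in the rollback procedure could be scripted roughly as follows. This is a sketch, not the deployed tooling: it assumes the boto3 Lambda client's `update_alias` call, and the function name, alias name, and version number in the usage note are hypothetical. The client is passed in so the logic can be exercised with a stub during a tabletop exercise.

```python
def rollback_alias_params(function_name: str, alias: str,
                          previous_version: str) -> dict:
    """Build update_alias arguments that send 100% of traffic to previous_version."""
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": previous_version,
        # An empty weights map clears any weighted split toward another version.
        "RoutingConfig": {"AdditionalVersionWeights": {}},
    }


def rollback(lambda_client, function_name: str, alias: str,
             previous_version: str) -> dict:
    """Apply the rollback using a boto3 Lambda client or a compatible stub."""
    params = rollback_alias_params(function_name, alias, previous_version)
    return lambda_client.update_alias(**params)
```

With real credentials, a call might look like `rollback(boto3.client("lambda"), "chat-handler", "live", "41")`. The 3-minute 5xx watch after the shift still applies.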

User Communication

[ ] Status page updated every 30 minutes from IC declaration until resolution
[ ] Post on primary social channel (Twitter/X: @silentinfinity — to be created) at incident open, midpoint if > 1 hour, and resolution
[ ] Format: "[Time] We're investigating reports of [issue]. We'll update in 30 minutes." — then follow through
[ ] At resolution: post clear "resolved" update with approximate duration and a commitment to post-mortem

Post-Outage

[ ] Post-mortem within 7 days for any outage > 5 minutes affecting users
[ ] Published post-mortem for P0/P1 outages within 14 days

---

7. Playbook D — Harmful AI Output

Trigger: Claude generates content that violates Silent Infinity's published safety commitments. Categories: (a) clinical-level medical or psychiatric advice; (b) harmful instructions (self-harm methods, dangerous substance use guidance); (c) explicit sexual or violent content; (d) contemptuous, demeaning, or discriminatory language toward a user.

Detection surfaces:

User report via /feedback form or reaction emoji (😶 "missed the mark")
Chat Sentinel automated flag (when implemented) on output post-processing
Internal review during routine audit

Immediate Response (< 1 hour)

[ ] Log the exact turn: uid, cid, timestamp, full prompt, full completion
[ ] Do NOT alter the log. Legal and clinical review depends on the exact text.
[ ] Classify: does this output meet the threshold for P0/P1 (user at risk) or P2 (policy violation without acute harm)?
[ ] If user is identified and reachable: send a direct, human response acknowledging what happened. Apologize without deflecting to the model.
[ ] Check whether guardrails.py output filters were active for this session (could have been disabled by a configuration error)
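One way to honor both "log the exact turn" and "do NOT alter the log" is to capture the turn in an immutable record and store a digest alongside it, so later reviewers can confirm the text is unchanged. A minimal sketch; the `TurnRecord` class and its `digest` method are hypothetical, and only the field list comes from the checklist above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TurnRecord:
    """Snapshot of one conversation turn; frozen so fields cannot be reassigned."""
    uid: str
    cid: str
    timestamp: str  # ISO 8601, UTC (assumed format)
    prompt: str
    completion: str

    def digest(self) -> str:
        """SHA-256 over a canonical JSON encoding of the record."""
        canonical = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Recording the digest in the incident notes at capture time gives Legal and the Clinical Advisor a simple check that the text they review is the text that was logged.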

Short-Term: Root-Cause Classification (1 – 24 hours)

Determine which of three failure modes applies:

Failure Mode 1: Prompt engineering bug

The system prompt failed to establish a constraint, or established it ambiguously. Example: system prompt says "avoid detailed discussion of medication overdose" but does not define "detailed," and model produced a technically compliant but clinically dangerous response.

Immediate: SHA-bump system prompt with corrected language; deploy to production
Document: what was the ambiguity? How was it resolved?

Failure Mode 2: Model behavior outside prompt

The system prompt was clear and correct; the model violated it anyway. This happens occasionally with large language models under adversarial or edge-case prompts.

Immediate: document the exact input that caused the violation
Medium: submit to Anthropic safety team with full context. This is a contractual and ethical obligation under our Anthropic usage agreement.
Add the input pattern to input-side guardrail regex if reproducible

Failure Mode 3: User induction (adversarial prompting)

User deliberately crafted input to circumvent safety guardrails (jailbreak, role-play framing, multi-step manipulation).

Immediate: document the pattern
Medium: add to input-guardrail regex and test against the regression suite
Decide: does this pattern indicate a vulnerability that needs architectural treatment (e.g., separate safety classifier before prompt reaches Claude)?
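Failure Mode 1 calls for a SHA-bump of the system prompt. If that SHA is a content hash of the prompt text (an assumption; if it refers to a git commit SHA, the commit already plays this role), a version ID can be derived directly from the text, as in this sketch. The `prompt_version` helper is hypothetical.

```python
import hashlib


def prompt_version(prompt_text: str, length: int = 12) -> str:
    """Derive a short, content-addressed version ID from the exact prompt text.

    Any change to the text, including whitespace, yields a different ID,
    so a logged ID identifies exactly which prompt was live.
    """
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return digest[:length]
```

Logging this ID with each session would let a reviewer tie a harmful output to the precise prompt wording in effect, which is what the "what was the ambiguity?" question needs.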

User Communication

[ ] Direct apology to affected user — from Harnoor, not an automated system
[ ] Offer: session memory rollback if the harmful output was stored and the user wants it removed from their history
[ ] If the user is distressed: follow Playbook A protocols for clinical response

---

8. Playbook E — Legal / Regulatory Threat

Trigger: Receipt of a subpoena, civil investigative demand, cease-and-desist letter, regulatory inquiry letter (FTC, state AG, HIPAA enforcement), or a communication from a plaintiff's attorney.

Immediate Response (< 4 hours)

[ ] Do NOT respond substantively to any legal communication before speaking with Legal Counsel. Even a polite acknowledgment of receipt can constitute a legal position in some jurisdictions.
[ ] Forward the complete, unaltered document to Legal Counsel within 4 hours of receipt
[ ] Note the exact date and time of receipt. Response deadlines typically run from the date of service or receipt.
[ ] IC notified immediately (if IC is not the person who received the document)

Legal Hold

[ ] Issue a legal hold immediately upon receipt of any subpoena or credible legal threat
[ ] Scope: all CloudWatch logs, DynamoDB records, S3 objects, Lambda deployment history, and email records relevant to the matter
[ ] Suspend any automated log retention deletion policies that would destroy relevant data
[ ] Document in writing what was held, when, and by whom — Legal Counsel directs this

Regulatory Notification Intersections

If the legal threat arises from a data incident or user-harm incident, the notification obligations from Playbooks A and B apply concurrently. Do not allow the legal process to delay required user notifications — consult Legal Counsel, but understand that GDPR 72-hour timelines do not pause for legal strategy.

Evaluating Liability Defense

Silent Infinity maintains the following in service of product-liability defense:

Feature Readiness Standard tier labels on all features, documenting clinical review status at launch
OpenTimestamps-anchored archive of system prompt versions, safety configuration changes, and deployment records — providing tamper-evident proof of what the system was doing at any given time

These records are the foundation of any good-faith defense. Legal Counsel should be briefed on this architecture at engagement.

Valid Subpoena: Compliance and User Notification

[ ] Comply with valid legal process as directed by Legal Counsel
[ ] Notify the affected user(s) of the disclosure unless the subpoena includes a lawful gag order
[ ] If gag order is present: follow Legal Counsel guidance; document that notification was legally prevented
[ ] Public disclosure: if legally permitted, include summary in the quarterly transparency report (anonymized as required)

---

9. Playbook F — Press Crisis

Trigger: A negative press story about Silent Infinity is published or clearly in progress (reporter inquiry); a viral complaint thread on social media (> 500 engagements or picked up by a journalist); a coordinated inauthentic attack campaign (flagged by pattern of similar accounts).

Immediate Response (< 1 hour)

[ ] IC + Communications Lead convene immediately
[ ] Do NOT engage individually on social media. No employee or contractor responds to negative posts on personal accounts about Silent Infinity. One voice, one time.
[ ] Compile: full text of complaint or article, number of people amplifying, any named individuals, any claimed facts
[ ] Determine: is this complaint substantive or adversarial?

Substantive Complaint (real harm alleged)

A substantive complaint means the negative coverage alleges something that could be true — a real user who experienced real harm, a genuine safety failure, a factual claim about our data practices.

[ ] Do not treat this as a PR problem. Treat it as an incident of the underlying type (Playbook A, B, C, or D) with a communications component.
[ ] Acknowledge publicly, early, and without defensiveness. Example: "We've seen this account of [event]. We are investigating urgently and will update here within [time]."
[ ] Link to concrete action taken once the underlying incident is resolved
[ ] Maya Shankar's research on negative feedback cycles in digital health contexts is instructive: attempts to suppress or deflect valid criticism amplify the harm signal. Transparent acknowledgment interrupts the cycle.

Adversarial Campaign (misrepresentation or bad-faith attack)

An adversarial campaign means the claims being made are false, distorted, or being amplified in a coordinated inauthentic manner.

[ ] One measured public response on our own /safety/transparency page. Factual, non-emotional. State what is accurate.
[ ] Do NOT engage on Twitter/X in a back-and-forth. As Charlie Warzel has written extensively on tech-press dynamics, platform fights create content that favors the aggressor regardless of who is right. Our credibility lives on our own owned channels.
[ ] If defamatory: Legal Counsel evaluates, does not automatically demand retraction (which amplifies), focuses on factual correction first
[ ] Monitor resolution. Most adversarial campaigns have a 48–72 hour peak. Consistent inaction (posting nothing) is sometimes the correct choice after the initial single response.

If a Reporter Contacts Us

[ ] All media inquiries to IC (Harnoor) only. No other team member speaks to press without IC approval.
[ ] Respond within the reporter's stated deadline if possible — "no comment" decisions should be active, not by default
[ ] On-record statements reviewed by Legal Counsel before transmission if the topic intersects with active legal matters
[ ] Background conversations are permissible with IC approval; always be clear about on/off-record status

---

10. Post-Mortem Template

Copy this template for each incident post-mortem. File location: F:/TITAN/post-mortems/[DATE]-[SEVERITY]-[SLUG].md
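To keep filenames consistent, the path can be generated rather than typed. A small sketch, assuming [DATE] means YYYY-MM-DD (as in the template's date field) and that the slug is lowercase with hyphens; both helper names are hypothetical.

```python
import re
from datetime import date

SEVERITIES = {"P0", "P1", "P2", "P3"}


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one letter or digit")
    return slug


def postmortem_path(incident_date: date, severity: str, title: str,
                    root: str = "F:/TITAN/post-mortems") -> str:
    """Build [ROOT]/[DATE]-[SEVERITY]-[SLUG].md for a post-mortem file."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    return f"{root}/{incident_date.isoformat()}-{severity}-{slugify(title)}.md"
```

For example, `postmortem_path(date(2026, 4, 21), "P1", "Lambda Timeout Spike")` yields `F:/TITAN/post-mortems/2026-04-21-P1-lambda-timeout-spike.md`.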

---


# Post-Mortem: [Incident Slug]

**Date of Incident:** [YYYY-MM-DD]
**Severity:** [P0 / P1 / P2 / P3]
**Incident Commander:** [Name]
**Technical Lead:** [Name]
**Total Duration:** [HH:MM from detection to resolution]
**Users Affected:** [number or "unknown"]
**Publication Status:** [Internal Only / Published at /safety/transparency]

---

## Summary

[2–4 sentences. What happened, who was affected, how it was resolved.]

---

## Timeline

All times in [timezone]. Times are approximate.

| Time | Event |
|------|-------|
| HH:MM | First detection / alert fired |
| HH:MM | IC declared incident, severity assigned |
| HH:MM | [action taken] |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed / rollback executed |
| HH:MM | Incident resolved; monitoring continues |
| HH:MM | Post-mortem convened |

---

## Root Cause

### 5 Whys

1. **Why did the incident occur?**
   [First-order cause]

2. **Why did [first-order cause] happen?**
   [Second-order cause]

3. **Why did [second-order cause] happen?**
   [Third-order cause]

4. **Why did [third-order cause] happen?**
   [Fourth-order cause]

5. **Why did [fourth-order cause] happen?**
   [Root cause — this is what goes in the summary]

---

## Contributing Factors

- [System condition, process gap, or tooling limitation that made this outcome more likely]
- [...]

---

## What Went Well

- [Things that worked: detection speed, containment, communication, team coordination]
- [...]

---

## What Didn't Go Well

- [Things that slowed response, increased impact, or created confusion]
- [...]

---

## Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Specific, testable action] | [Name] | [YYYY-MM-DD] | Open |
| [...] | [...] | [...] | [...] |

---

## Appendix

[Full logs, screenshots, or other evidence — linked, not pasted inline]

---

Publication Policy

| Severity | Publication Obligation |
|---|---|
| P0 | Always published at /safety/transparency within 14 days of incident close |
| P1 | Always published within 14 days |
| P2 | Published if incident lasted > 4 hours or affected > 100 users |
| P3 | Internal only |

Publication means the full post-mortem, edited only to protect user privacy (no PII, no uid/cid). Clinical or legal details that are subject to privilege may be redacted with a notation explaining the redaction.
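The publication table reduces to a short decision rule. The sketch below encodes it as written; the function name and parameters are hypothetical, and the 14-day deadline is not modeled.

```python
def must_publish(severity: str, duration_hours: float = 0.0,
                 users_affected: int = 0) -> bool:
    """Return True if the post-mortem must be published under the policy table."""
    if severity in ("P0", "P1"):
        return True  # always published
    if severity == "P2":
        # Published only past either threshold.
        return duration_hours > 4 or users_affected > 100
    if severity == "P3":
        return False  # internal only
    raise ValueError(f"unknown severity: {severity}")
```

Both P2 thresholds are strict, so a P2 incident lasting exactly 4 hours with exactly 100 affected users stays internal under this reading.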

---

11. Contact Sheet

Internal

|---|---|---|---|

| Legal Counsel | [To be engaged by 2026-06-01] | [TBD] | — |

AWS Support

AWS Support Case Creation: console.aws.amazon.com → Support → Create Case
For P0 incidents: Upgrade to Business or Enterprise Support plan. Enterprise Support provides < 15-minute response SLA for production-down cases and a dedicated Technical Account Manager.
AWS Security Response: Report compromised credentials at aws.amazon.com/security/vulnerability-reporting or via the Security Bulletin contact
AWS Service Health Dashboard: health.aws.amazon.com — check this before assuming application-layer root cause

Crisis Lines (for direct reference in user communications)

| Service | Contact |
|---|---|
| 988 Suicide & Crisis Lifeline (US) | Call or text 988 |
| Emergency Services (US) | 911 |
| International Crisis Lines | findahelpline.com |
| SAMHSA Treatment Locator | findtreatment.gov |
| Crisis Text Line (US) | Text HOME to 741741 |

These lines should be included verbatim in any Playbook A user communication. Do not paraphrase. Do not omit.

---

12. Testing the Playbook

Quarterly Tabletop Exercise

Once per quarter (suggested schedule: Jan, Apr, Jul, Oct), the team conducts a structured tabletop walkthrough of one playbook scenario.

Format (90 minutes):

1. IC selects a playbook scenario and assigns roles (30 minutes prep)

2. IC reads a scenario prompt aloud: "It is 2:00 a.m. on a Tuesday. You receive the following message: [scenario]"

3. Each role-holder talks through what they would do, in order, referencing the playbook

4. IC pauses the walkthrough when a gap is identified: "We don't have a step for this. What should we do?"

5. Gaps are documented and converted to pull requests against this document

Rotation: Cover Playbook A (harm) and Playbook B (breach) at least annually. The others rotate.

Output: A brief gap log, filed at F:/TITAN/post-mortems/tabletop-[DATE].md

Annual Production Drill

Once per year, schedule a staged degraded-mode rehearsal on a staging deployment slot:

Simulate a Lambda error rate spike (controlled traffic injection)
Verify CloudWatch alarms fire within the expected window
Verify status page update workflow functions
Verify rollback procedure completes within the 5-minute target
Document results and any gaps discovered

This drill should be scheduled at least 2 weeks in advance, announced internally, and should never touch the production environment without explicit IC authorization.

---

13. References

1. Allspaw, J. (2012). "Blameless PostMortems and a Just Culture." Code as Craft (Etsy engineering blog). Foundation of our blameless culture standard.

2. Allspaw, J. & Robbins, J. (Eds.). Web Operations: Keeping the Data on Time. O'Reilly Media, 2010. Practical incident response patterns from which the role structure in this playbook is adapted.

3. Beyer, B., Jones, C., Petoff, J., & Murphy, N.R. (Eds.). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016. Canonical reference for on-call practice, post-mortem culture, and error budget frameworks. Available free at sre.google/sre-book.

4. Cichonski, P., Millar, T., Grance, T., & Scarfone, K. Computer Security Incident Handling Guide (NIST SP 800-61 Revision 2). NIST, 2012. Authoritative federal guidance on forensic preservation, chain of custody, and incident classification. Informs our breach containment sequencing.

5. Etherington, K. (2004). Trauma Counselling and Narrative Therapy. Jessica Kingsley Publishers. Foundational work on restorative and blameless frameworks in human-centered contexts.

6. GDPR Article 33–34. Regulation (EU) 2016/679. Binding legal standard for breach notification obligations for EU data subjects.

7. CCPA / CPRA and California Civil Code § 1798.82. The California Consumer Privacy Act and California Privacy Rights Act, together with the state breach notification statute, govern breach-related obligations for California residents.

---

This playbook is a living document. It must be reviewed following every P0 or P1 incident, and on a scheduled quarterly basis regardless of incident activity. The most dangerous version of this document is one that is out of date and trusted anyway.

Last Updated: 2026-04-21

Updated By: Harnoor Singh

Next Scheduled Review: 2026-07-21