Silent Infinity analytics & data-collection architecture

Version: v1 · 2026-04-21 · HERALD

Authority: design doc, build M1 on approval

Rough-Ask: R0112

Companion: USER-FEEDBACK-SYSTEM-2026-04-21.md

> Harnoor: "dashboard from the data analyst — per user cost, average user cost, maximum cost, days/times, how many users, demographics. All the data should be collected. CloudFront may be a good place. Comprehensive plan. Nice format. Three tiers. PhD-level."

---

1. Why this matters now

Silent Infinity has six users and 189 conversations. We have just enough data to start instrumenting — and we are at the exact inflection point where instrumenting now prevents every founder's nightmare twelve months from now: "we scaled to 10k users and don't know what any of them are doing."

Three PhD-stacked disciplines govern what we build:

1. Information architecture (Ralph Kimball 1996, The Data Warehouse Toolkit) — the discipline of modeling events, facts, and dimensions into a queryable shape.

2. Behavioral analytics (Amy Heineike's work on product instrumentation; Hofmann 2014 on implicit feedback) — what events to capture, what to ignore.

3. Privacy-by-design (Cavoukian 2010, Privacy by Design; GDPR Art. 25) — data minimization, purpose limitation, storage limitation. A wellness app cannot treat user data casually.

These three must balance. Over-collect and we inherit liability + breach surface. Under-collect and we fly blind. The architecture below is the middle path.

---

2. Three-tier data lake (what you asked for as "three packets")

The standard warehouse pattern is Bronze → Silver → Gold (Databricks / Lakehouse pattern; Ralph Kimball's variant is Staging → Integration → Presentation). Same idea.

Tier 1 — BRONZE (raw ingest, 90-day retention)

What's in it:

Every CloudFront request log (JSON lines, S3 innerverse-logs-bronze/cf/)
Every Lambda invocation log (CloudWatch → Kinesis Firehose → S3 innerverse-logs-bronze/lambda/)
Every Bedrock model invocation (token counts, model ID, latency, error)
Every DynamoDB write event (streams → S3)
Every /feedback form submission (full payload)
Every Chat Sentinel observation (per §4 of feedback memo)

Format: newline-delimited JSON, one line per event, partitioned by dt=YYYY-MM-DD/hour=HH. Gzip-compressed. Parquet-converted on daily rollup.
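A minimal sketch of that layout, for illustration only. In production Firehose would do the partitioning and compression; the function names, the `source` prefix argument, and the choice of UTC for partition boundaries are assumptions, not something this memo has fixed.

```python
import gzip
import json
from datetime import datetime, timezone


def bronze_key(source: str, ts: datetime) -> str:
    """Build a key prefix partitioned by dt=YYYY-MM-DD/hour=HH (UTC assumed)."""
    ts = ts.astimezone(timezone.utc)
    return f"{source}/dt={ts:%Y-%m-%d}/hour={ts:%H}/"


def encode_events(events: list) -> bytes:
    """Serialize events as gzip-compressed newline-delimited JSON, one line per event."""
    lines = "".join(json.dumps(e, separators=(",", ":")) + "\n" for e in events)
    return gzip.compress(lines.encode("utf-8"))
```

Partitioning on event time in UTC keeps Athena partition pruning predictable; if we later want local-time reporting, that belongs in Silver or Gold, not in the Bronze key.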

Retention: 90 days raw, then aggregated to Silver and deleted. This implements GDPR's storage-limitation principle (Art. 5(1)(e)) and supports data minimization (Art. 5(1)(c)).
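One way to enforce the 90-day window is an S3 lifecycle rule, sketched below. The helper name and rule ID are hypothetical; the dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts. Note the rule expires objects on age alone, so it does not check that the Silver rollup actually ran first.

```python
def bronze_lifecycle(prefix: str = "", days: int = 90) -> dict:
    """Return an S3 lifecycle configuration that expires objects after `days`."""
    return {
        "Rules": [
            {
                "ID": f"expire-bronze-after-{days}d",  # hypothetical rule name
                "Filter": {"Prefix": prefix},          # "" applies to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="innerverse-logs-bronze",
#       LifecycleConfiguration=bronze_lifecycle(),
#   )
```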

Owner: automatic (Kinesis Firehose writes from CloudFront + Lambda + DDB Streams).

Tier 2 — SILVER (cleaned + joined, 2-year retention)

What's in it:

sessions fact table — one row per conversation session: uid, cid, start_ts, end_ts, turn_count, region, device_type, referrer, total_tokens_in/out, cost_usd, crisis_flag
turns fact table — one row per turn: uid, cid, turn_index, ts, user_char_count, assistant_char_count, model_id, latency_ms, tokens_in/out, cost_usd, sentiment, emotion (from Sentinel), frustration_flags, feature_wishes
users dim table — uid, first_seen, last_seen, total_sessions, total_turns, total_cost_usd, region, device_type, consent_state, opt_out_flags, cohort_week
feedback fact table — ts, uid, type (form|reaction|rating|sentinel_observation), love, bad, wish, mood, email_hash, related_turn_id
costs fact table — ts, service (bedrock|transcribe|polly|cloudfront|lambda|dynamodb), component, usd, uid, cid

Format: Parquet in S3 innerverse-logs-silver/, partitioned by table name + dt=YYYY-MM-DD. AWS Glue catalog for schema. Queryable via Athena.

PII handling:

UIDs are hashes of the original (SHA-256 salted). Raw uids never enter Silver.
Raw user message text is NEVER in Silver — only char counts + sentiment + emotion tags.
Emails are hashed (SHA-256) for joins; the raw email stays only in Bronze, for its 90-day window.
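The hashing rules above could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the salt is passed in as bytes (where it is stored is not decided here), and emails are normalized before hashing so the same address always joins. The memo only says emails are SHA-256 hashed; salting them as well is my suggestion, since unsalted email hashes are easy to reverse by dictionary lookup.

```python
import hashlib


def pseudonymize(value: str, salt: bytes) -> str:
    """Return the hex SHA-256 digest of salt + value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def pseudonymize_email(email: str, salt: bytes) -> str:
    """Normalize case and whitespace first so one address maps to one hash."""
    return pseudonymize(email.strip().lower(), salt)
```

The salt has to stay stable for joins to keep working across ETL runs, so rotating it means re-deriving every hashed key in Silver.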

Retention: 2 years. Then aggregated to Gold and deleted.

Owner: nightly Glue ETL job that reads Bronze → transforms → writes Silver.
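As a rough picture of what the Bronze → Silver transform does to one turn, the sketch below projects a raw event onto the Silver turns columns. The Bronze field names `user_text` and `assistant_text` are assumptions (the Bronze event shape is not specified in this memo), and the real job would run in Glue rather than as a plain function.

```python
def to_silver_turn(bronze: dict, hashed_uid: str, hashed_cid: str) -> dict:
    """Project a raw Bronze turn event onto the Silver turns schema.

    Raw text is reduced to character counts and is not copied into the output.
    """
    return {
        "uid": hashed_uid,
        "cid": hashed_cid,
        "turn_index": bronze["turn_index"],
        "ts": bronze["ts"],
        "user_char_count": len(bronze.get("user_text", "")),
        "assistant_char_count": len(bronze.get("assistant_text", "")),
        "model_id": bronze["model_id"],
        "latency_ms": bronze["latency_ms"],
        "tokens_in": bronze["tokens_in"],
        "tokens_out": bronze["tokens_out"],
        "cost_usd": bronze["cost_usd"],
    }
```

Building the output from an explicit column list, rather than deleting fields from a copy of the input, means a new Bronze field cannot leak into Silver by accident.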

Tier 3 — GOLD (analytics-ready, forever)

What's in it:

Cohort retention matrix (aggregated — no user row)
Daily/weekly/monthly usage metrics (DAU, WAU, MAU, session-length distribution)
Cost projections (rolling 30/90/365-day forecast)
Feature-wish leaderboard (ranked by frequency × Kano tier × confidence)
Emotion mix by day (% sad, % joyful, % frustrated across all sessions)
Retention curves by cohort week
Crisis-path metrics (anonymous counts, disposition, escalation path)
Notable moments reel (quotable user lines, anonymized + consented)

Format: Parquet + materialized views in S3 innerverse-logs-gold/. Also exposed to QuickSight dashboard + the internal /sage dashboard (we build that).

Retention: forever. These are aggregates with no per-user rows, so re-identification risk is low; it is not zero while cohorts are small (see §8, item 5).

Owner: weekly SAGE rollup job. Publishes to the SAGE dashboard and to the quarterly transparency report.

---

3. What we're collecting — the event schema

3.1 CloudFront access logs (standard format, enabled today)

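CloudFront standard logging is off by default and has to be enabled on the distribution (see the action item below). Fields we care about:

date + time, c-ip (truncated to /24 for privacy), cs(Referer), cs(User-Agent), cs-uri-stem, sc-status, time-taken, x-edge-location (the edge location that served the request, a coarse proxy for viewer geography)

This gives us: traffic volume, geographic distribution, device types, referrers, error rates. CloudFront does not charge for standard logs; S3 storage and request costs for the log files still apply.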

Action: check the current state with aws cloudfront get-distribution-config --id E2M8T6S9SM3OQY and make sure DistributionConfig.Logging.Enabled is true and a log bucket is set. We need this TODAY.
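The /24 truncation applied to client IPs can be sketched with the standard library. The function name is hypothetical, and the /48 width for IPv6 is my assumption; this memo only specifies /24 for IPv4.

```python
import ipaddress


def truncate_ip(ip: str) -> str:
    """Zero the host bits: /24 for IPv4, /48 for IPv6 (IPv6 width is an assumption)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)
```

For example, `truncate_ip("203.0.113.77")` yields `"203.0.113.0"`. Truncation should happen before the value is written anywhere that outlives Bronze.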

3.2 Lambda structured logs (EMF format)

Every Lambda invocation emits an Embedded Metric Format (EMF) JSON blob to CloudWatch:


```json
{
  "_aws": {"Timestamp": 1745222400000, "CloudWatchMetrics": [{"Namespace": "Innerverse", "Dimensions": [["model_id", "route"]], "Metrics": [{"Name": "latency_ms"}, {"Name": "tokens_in"}, {"Name": "tokens_out"}, {"Name": "cost_usd"}]}]},
  "model_id": "us.anthropic.claude-sonnet-4-6",
  "route": "/invoke",
  "uid": "sha256(raw_uid)",
  "cid": "sha256(raw_cid)",
  "turn_index": 7,
  "latency_ms": 1280,
  "tokens_in": 4532,
  "tokens_out": 287,
  "cost_usd": 0.0179,
  "region": "us-west",
  "device": "mobile",
  "crisis_flag": null
}
```

Partially there already (EMF in handler.py). Need to add: cost_usd computation per turn, device detection, region from CloudFront header.
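A sketch of how a handler might assemble that record. This is not the code in handler.py; the `emf_record` helper and its parameters are hypothetical. It relies on the documented EMF behavior that only keys listed under `Dimensions` become metric dimensions, so high-cardinality fields such as uid stay searchable in Logs without creating a metric per user.

```python
import json
import time
from typing import Optional


def emf_record(model_id: str, route: str, metrics: dict, properties: dict,
               now_ms: Optional[int] = None) -> dict:
    """Build one EMF log record: metric values plus extra non-dimension properties."""
    return {
        "_aws": {
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "Innerverse",
                "Dimensions": [["model_id", "route"]],
                "Metrics": [{"Name": name} for name in metrics],
            }],
        },
        "model_id": model_id,
        "route": route,
        **metrics,      # metric values live at the top level of the record
        **properties,   # searchable in Logs, but not metric dimensions
    }


record = emf_record(
    "us.anthropic.claude-sonnet-4-6", "/invoke",
    metrics={"latency_ms": 1280, "tokens_in": 4532, "tokens_out": 287, "cost_usd": 0.0179},
    properties={"turn_index": 7, "device": "mobile"},
    now_ms=1745222400000,
)
print(json.dumps(record))  # in Lambda, one JSON object per stdout line reaches CloudWatch Logs
```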

3.3 DynamoDB Streams

Enable streams on innerverse-users, innerverse-conversations, innerverse-feedback. Kinesis Firehose delivers to S3 Bronze. This captures every write event for audit + analytics without requiring Lambda to double-write.

3.4 Chat Sentinel observations (new, per feedback memo §4)

Every user turn → Haiku observation → DynamoDB innerverse-observations → stream → S3.

3.5 User-agent + region derivation

Derived from the CloudFront-Viewer-Country request header plus User-Agent string parsing (ua-parser library). Gives us device class, browser, country (not city, for privacy). No fingerprinting.
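To make the derivation concrete, here is a deliberately crude stand-in written with regular expressions so it runs without dependencies. It is not ua-parser and would be replaced by it in the real handler; the function names and patterns are my own. CloudFront only forwards CloudFront-Viewer-Country when the distribution's origin request policy includes it, which I am assuming is (or will be) configured.

```python
import re

# Heuristic patterns only; ua-parser would be far more thorough.
_TABLET = re.compile(r"iPad|Tablet|Android(?!.*Mobile)", re.I)
_MOBILE = re.compile(r"Mobi|iPhone|iPod", re.I)


def device_class(user_agent: str) -> str:
    """Classify a User-Agent string as tablet, mobile, or desktop."""
    if _TABLET.search(user_agent):
        return "tablet"
    if _MOBILE.search(user_agent):
        return "mobile"
    return "desktop"


def viewer_country(headers: dict) -> str:
    """Read the country code from request headers, matching the name case-insensitively."""
    for name, value in headers.items():
        if name.lower() == "cloudfront-viewer-country":
            return value
    return "unknown"
```

One known blind spot: recent iPadOS Safari identifies itself as a Mac by default, so some tablets will be counted as desktop whichever parser we use.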

3.6 What we do NOT collect

Raw user message content in any tier beyond Bronze (90-day)
IP addresses at finer granularity than /24
Keystroke-level telemetry
Mouse-movement heatmaps (Hotjar/FullStory style — invasive)
Third-party analytics trackers (Google Analytics, Mixpanel, etc. — we self-host)
Any data from users who have opted out of observation (per Privacy Policy §X)

---

4. The SAGE dashboard — what Harnoor sees

Two URLs, one private, one public.

4.1 `/sage` (private — admin only, IP-restricted or Cognito group)

Landing page with these panels:

Cost panel

Total spend today / this week / this month (bedrock + transcribe + polly + cloudfront + lambda + ddb, line chart)
Cost per user table: top 10 most expensive users, average, median, max, p99
Cost per session distribution histogram
Projection: 30/90/365-day at current growth rate

Usage panel

DAU / WAU / MAU counts
Session-length distribution (turn count per session at p50/p95/p99)
Days/times heatmap — 7x24 grid showing when conversations happen (useful for staffing when voice ships)
Retention curves by signup cohort (1-day, 7-day, 30-day retention)
Churn signals — users with declining engagement trending toward 0
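The days/times heatmap in the list above reduces to bucketing session start times into a 7x24 grid. A minimal sketch, assuming timezone-aware timestamps and UTC bucketing (the dashboard may later want per-user local time; the function name is hypothetical):

```python
from datetime import datetime, timezone


def usage_heatmap(session_starts: list) -> list:
    """Count sessions per (weekday, hour) cell; row 0 is Monday, column 0 is 00:00 UTC."""
    grid = [[0] * 24 for _ in range(7)]
    for ts in session_starts:
        ts = ts.astimezone(timezone.utc)  # timestamps must be timezone-aware
        grid[ts.weekday()][ts.hour] += 1
    return grid
```

With 189 conversations most cells will be empty or hold a single session, so the grid is best read as raw counts for now rather than as a pattern.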

Demographic panel

Geographic map (country-level, from CloudFront)
Device class (mobile / tablet / desktop %)
Browser mix (Safari / Chrome / Firefox / etc.)
Peak usage hours by timezone cluster

Sentiment panel (from Chat Sentinel)

Emotion mix today (pie: joy / sadness / anger / fear / surprise / trust / etc.)
Frustration heatmap — which user-inputs correlate with frustration signals
Feature-wish cloud — top 20 requested features ranked

Crisis panel (anonymous)

Flagged turn count by severity today / week / month
Disposition (resolved in-session, directed to 988, escalated)
Time-to-escalation distribution
Zero raw content — only aggregate counts

4.2 `/safety/transparency` (public quarterly)

Per Innovation 5. Shows users we're honest:

Aggregate user count, session count, countries served
Aggregate sentiment mix
Feature-wishes we're building vs not
Crisis-flag counts + disposition (no PII)
Incidents (any P0/P1 that affected users)
Changes to /safety policy

---

5. The pricing / cost calculations (per-turn)

Each turn's cost is computed synchronously after the Bedrock response:


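```python
# USD per 1M tokens, keyed by Bedrock model ID.
PRICING = {
    "us.anthropic.claude-sonnet-4-6":            {"in": 3.00,  "out": 15.00},
    "anthropic.claude-opus-4-7":                 {"in": 18.00, "out": 90.00},
    "anthropic.claude-opus-4-6-v1":              {"in": 15.00, "out": 75.00},
    "anthropic.claude-haiku-4-5-20251001-v1:0":  {"in": 0.80,  "out": 4.00},
}
# Unrecognized model IDs fall back to Sonnet rates rather than raising.
DEFAULT_PRICE = {"in": 3.00, "out": 15.00}


def compute_turn_cost(model_id: str, tokens_in: int, tokens_out: int) -> float:
    """Return the USD cost of one turn from its input and output token counts."""
    p = PRICING.get(model_id, DEFAULT_PRICE)
    return (tokens_in / 1_000_000) * p["in"] + (tokens_out / 1_000_000) * p["out"]
```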

Pricing table stored in F:/projects/innerverse/backend/src/pricing.py + version-controlled so historical data is accurate when Anthropic changes rates.

Per-user cost = sum(turn.cost) across all their turns.

Average user cost = sum(turn.cost) / count(distinct uid) over window.

Max user cost = max(user.total_cost) over window.

Cost percentiles = p50/p90/p99 distribution of per-user cost (lets us spot outlier "power users" who consume disproportionate resources).
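The four definitions above can be computed from a flat list of per-turn costs. This is a sketch, not the Athena query the dashboard would actually run; the function name and the `(uid, cost_usd)` tuple shape are assumptions.

```python
from collections import defaultdict
from statistics import quantiles


def user_cost_summary(turns: list) -> dict:
    """Aggregate (uid, cost_usd) turn records into per-user cost statistics."""
    per_user = defaultdict(float)
    for uid, cost in turns:
        per_user[uid] += cost          # per-user cost = sum of that user's turn costs
    totals = sorted(per_user.values())
    if not totals:
        return {"users": 0}
    summary = {
        "users": len(totals),
        "average": sum(totals) / len(totals),  # total cost / distinct users
        "max": totals[-1],
    }
    if len(totals) >= 2:               # percentiles need at least two users
        cuts = quantiles(totals, n=100, method="inclusive")
        summary.update(p50=cuts[49], p90=cuts[89], p99=cuts[98])
    return summary
```

At six users, p90 and p99 are interpolations between the top two values and say little; they become meaningful only as the user count grows.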

---

6. Implementation — build order

|---|---|---|---|

Total incremental infra cost at 1,000 DAU: ~$50-80/month. Trivial vs the value of actually knowing what's happening.

---

7. Privacy + compliance controls

| Concern | Control |
|---|---|
| GDPR Art. 6 (lawful basis) | Consent at clickwrap + Privacy Policy §4 (already live) |
| GDPR Art. 17 (right to erasure) | DELETE route removes from Silver + Gold within 30 days; Bronze auto-expires at 90 days |
| GDPR Art. 25 (privacy by design) | Bronze → Silver transform drops raw message content and truncates IPs to /24; neither raw text nor full IPs enter Silver |
| CCPA §1798.105 (right to delete) | Same as GDPR above; honored for all CA users automatically |
| COPPA | 13+ attestation in clickwrap gate; no identifiers collected for users we can't verify |
| CA SB 243 | /safety page already discloses + /safety/transparency will show aggregates |
| SOC 2 (future) | Audit trail in Bronze (immutable S3 object-lock) is the basis |

---

8. The hard things I'd watch for

1. Goodhart's law. The moment a metric becomes a target, it ceases to be a good metric. We must NEVER target NPS, MAU, or any single number. Use the triangulation rules from the feedback memo.

2. Sentinel over-reach. The AI monitor will want to classify everything. Constrain its schema tightly + audit 5% of outputs weekly.

3. Creepy personalization. Just because we can tag every emotion doesn't mean we should remember them. Memory should serve the user, not showcase how much we know about them.

4. False precision. A dashboard that says "sentiment trending -3.2%" invites false confidence. Always show error bars or confidence intervals.

5. Cohort collapse. At 6 users, single interesting users dominate averages. Don't publish per-cohort metrics until we have n ≥ 50 per cohort.
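Items 4 and 5 can be enforced in one place: a rate is published only when the cohort is large enough, and always with an interval. The sketch below uses the Wilson score interval, which is my choice and not something the memo prescribes; the function names and the returned field names are hypothetical, and `min_n=50` comes from item 5.

```python
import math
from typing import Optional


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def publishable_rate(successes: int, n: int, min_n: int = 50) -> Optional[dict]:
    """Return the rate with its interval, or None when the cohort is too small to publish."""
    if n < min_n:
        return None
    low, high = wilson_interval(successes, n)
    return {"rate": successes / n, "ci_low": low, "ci_high": high, "n": n}
```

With today's six users every cohort returns None, which is the intended behavior: the dashboard should show "insufficient data" rather than a percentage.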

---

9. References

Kimball, R. (1996, 2013). The Data Warehouse Toolkit.
Cavoukian, A. (2010). "Privacy by Design: The 7 Foundational Principles."
Linstedt, D., & Olschimke, M. (2015). Building a Scalable Data Warehouse with Data Vault 2.0.
Hofmann, K., et al. (2014). "Implicit feedback for interactive information retrieval." FnTIR.
Snowflake / Databricks Bronze-Silver-Gold lakehouse pattern (2020).
European Parliament. (2016). General Data Protection Regulation (GDPR).
NIST. (2023). AI Risk Management Framework (AI RMF 1.0).

— HERALD

2026-04-21
