
Silent Infinity analytics & data-collection architecture

Version: v1 · 2026-04-21 · HERALD

Authority: design doc, build M1 on approval

Rough-Ask: R0112

Companion: USER-FEEDBACK-SYSTEM-2026-04-21.md

> Harnoor: "dashboard from the data analyst — per user cost, average user cost, maximum cost, days/times, how many users, demographics. All the data should be collected. CloudFront may be a good place. Comprehensive plan. Nice format. Three tiers. PhD-level."

---

1. Why this matters now

Silent Infinity has six users and 189 conversations. We have just enough data to start instrumenting — and we are at the exact inflection point where instrumenting now prevents every founder's nightmare twelve months from now: "we scaled to 10k users and don't know what any of them are doing."

Three PhD-stacked disciplines govern what we build:

1. Information architecture (Ralph Kimball 1996, The Data Warehouse Toolkit) — the discipline of modeling events, facts, and dimensions into a queryable shape.

2. Behavioral analytics (Amy Heineike's work on product instrumentation; Hofmann 2014 on implicit feedback) — what events to capture, what to ignore.

3. Privacy-by-design (Cavoukian 2010, Privacy by Design; GDPR Art. 25) — data minimization, purpose limitation, storage limitation. A wellness app cannot treat user data casually.

These three must balance. Over-collect and we inherit liability + breach surface. Under-collect and we fly blind. The architecture below is the middle path.

---

2. Three-tier data lake (what you asked for as "three packets")

The standard warehouse pattern is Bronze → Silver → Gold (Databricks / Lakehouse pattern; Ralph Kimball's variant is Staging → Integration → Presentation). Same idea.

Tier 1 — BRONZE (raw ingest, 90-day retention)

What's in it:

Format: newline-delimited JSON, one line per event, partitioned by dt=YYYY-MM-DD/hour=HH. Gzip-compressed. Parquet-converted on daily rollup.
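A minimal sketch of the Bronze layout described above, assuming events arrive as Python dicts with an epoch-millisecond timestamp. The `source` prefix, `batch_id`, and the function names are hypothetical; Firehose would normally produce these objects itself, so this only illustrates the shape of the keys and payloads.

```python
import gzip
import json
from datetime import datetime, timezone


def bronze_key(source: str, ts_ms: int, batch_id: str) -> str:
    """Build a dt=/hour= partitioned object key from an epoch-millisecond timestamp (UTC)."""
    ts = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return f"{source}/dt={ts:%Y-%m-%d}/hour={ts:%H}/{batch_id}.json.gz"


def encode_batch(events: list) -> bytes:
    """Serialize events as gzip-compressed newline-delimited JSON, one event per line."""
    body = "".join(json.dumps(e, separators=(",", ":")) + "\n" for e in events)
    return gzip.compress(body.encode("utf-8"))


def decode_batch(blob: bytes) -> list:
    """Inverse of encode_batch, useful for spot-checking a Bronze object."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]
```

Partitioning on UTC keeps the daily rollup boundaries unambiguous; local-time partitions would need a timezone decision the memo does not make.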

Retention: 90 days raw, then aggregated to Silver and deleted. This implements GDPR "storage limitation" (Art. 5(1)(e)): raw data is kept no longer than it is needed.

Owner: automatic (Kinesis Firehose writes from CloudFront + Lambda + DDB Streams).

Tier 2 — SILVER (cleaned + joined, 2-year retention)

What's in it:

Format: Parquet in S3 innerverse-logs-silver/, partitioned by table name + dt=YYYY-MM-DD. AWS Glue catalog for schema. Queryable via Athena.

PII handling:

Retention: 2 years. Then aggregated to Gold and deleted.

Owner: nightly Glue ETL job that reads Bronze → transforms → writes Silver.
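The two transform rules stated in §7 (drop raw message content, truncate IPs to /24) could be sketched as below. This is an illustration, not the Glue job: the field names `message`, `content`, and `ip` are assumptions, and the /48 prefix for IPv6 is an assumption as well, since the memo only specifies /24 for IPv4.

```python
import ipaddress

# Assumed names for fields that carry raw message text; adjust to the real schema.
DROPPED_FIELDS = {"message", "content"}


def truncate_ip(ip: str) -> str:
    """Zero the host bits: IPv4 is cut to /24, IPv6 to /48 (assumed prefix)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def to_silver(event: dict) -> dict:
    """Return a copy of a Bronze event that is safe to write to Silver."""
    out = {k: v for k, v in event.items() if k not in DROPPED_FIELDS}
    if "ip" in out:
        out["ip"] = truncate_ip(out["ip"])
    return out
```

Applying the rules as an allow-through filter on a copy means the Bronze record is never mutated, which keeps the audit trail intact.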

Tier 3 — GOLD (analytics-ready, forever)

What's in it:

Format: Parquet + materialized views in S3 innerverse-logs-gold/. Also exposed to QuickSight dashboard + the internal /sage dashboard (we build that).

Retention: forever. These are aggregates, so no single user should be re-identifiable as long as small cohorts are suppressed (see the n ≥ 50 rule in §8).

Owner: weekly SAGE rollup job. Publishes to the SAGE dashboard and to the quarterly transparency report.
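One way to enforce the tier retention windows is an S3 lifecycle expiration rule per bucket. The sketch below only builds the configuration dict in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the rule IDs are hypothetical, and 730 days is an assumed reading of "2 years".

```python
# Retention per tier in days. Gold is kept forever, so it has no entry.
RETENTION_DAYS = {"bronze": 90, "silver": 730}  # 730 = 2 years (assumption)


def lifecycle_configuration(tier: str, prefix: str = "") -> dict:
    """Build an S3 lifecycle configuration that expires objects after the tier's window."""
    if tier not in RETENTION_DAYS:
        raise ValueError(f"tier {tier!r} has no expiry")
    return {
        "Rules": [
            {
                "ID": f"expire-{tier}",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": RETENTION_DAYS[tier]},
            }
        ]
    }
```

Lifecycle expiry runs on a schedule regardless of whether the rollup succeeded, so the aggregation job needs to finish comfortably inside the window or data is lost before it reaches the next tier.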

---

3. What we're collecting — the event schema

3.1 CloudFront access logs (standard format, enabled today)

CloudFront records these once standard logging is turned on; it is off by default. Fields we care about:

This gives us: traffic volume, geographic distribution, device types, referrers, error rates. Standard logs carry no extra CloudFront charge; the only cost is S3 storage.

Action: enable if not already: aws cloudfront get-distribution-config --id E2M8T6S9SM3OQY → ensure Logging.Enabled = true + S3 bucket for logs. We need this TODAY.
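The check can be scripted against the JSON that the CLI command above prints. A small sketch, assuming the output is captured as a string; the function name and the bucket in the usage example are hypothetical.

```python
import json


def logging_status(cli_output: str) -> tuple:
    """Return (enabled, bucket) from `aws cloudfront get-distribution-config` JSON output."""
    config = json.loads(cli_output)["DistributionConfig"]
    logging_cfg = config.get("Logging", {})
    return bool(logging_cfg.get("Enabled", False)), logging_cfg.get("Bucket", "")
```

For example, with a hand-written sample response:

```python
sample = '{"ETag": "E1", "DistributionConfig": {"Logging": {"Enabled": true, "Bucket": "example-logs.s3.amazonaws.com", "Prefix": "cf/"}}}'
enabled, bucket = logging_status(sample)
```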

3.2 Lambda structured logs (EMF format)

Every Lambda invocation emits an Embedded Metric Format (EMF) JSON blob to CloudWatch:


```json
{
  "_aws": {
    "Timestamp": 1745222400000,
    "CloudWatchMetrics": [
      {
        "Namespace": "Innerverse",
        "Dimensions": [["model_id", "route"]],
        "Metrics": [
          {"Name": "latency_ms"},
          {"Name": "tokens_in"},
          {"Name": "tokens_out"},
          {"Name": "cost_usd"}
        ]
      }
    ]
  },
  "model_id": "us.anthropic.claude-sonnet-4-6",
  "route": "/invoke",
  "uid": "sha256(raw_uid)",
  "cid": "sha256(raw_cid)",
  "turn_index": 7,
  "latency_ms": 1280,
  "tokens_in": 4532,
  "tokens_out": 287,
  "cost_usd": 0.0179,
  "region": "us-west",
  "device": "mobile",
  "crisis_flag": null
}
```

Partially there already (EMF in handler.py). Need to add: cost_usd computation per turn, device detection, region from CloudFront header.
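A sketch of how the handler might assemble that record. It is not the code in handler.py: the function names and keyword arguments are assumptions, and the plain (unsalted) SHA-256 mirrors the `sha256(raw_uid)` placeholder in the example rather than a confirmed hashing scheme.

```python
import hashlib
import json
import time

METRIC_NAMES = ["latency_ms", "tokens_in", "tokens_out", "cost_usd"]


def hash_id(raw: str) -> str:
    """Hash an identifier so the raw value never reaches the logs."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def build_emf_record(*, raw_uid, raw_cid, model_id, route, turn_index,
                     latency_ms, tokens_in, tokens_out, cost_usd,
                     region=None, device=None, crisis_flag=None, ts_ms=None) -> dict:
    """Assemble one EMF record; every declared metric and dimension is a top-level key."""
    return {
        "_aws": {
            "Timestamp": ts_ms if ts_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "Innerverse",
                "Dimensions": [["model_id", "route"]],
                "Metrics": [{"Name": name} for name in METRIC_NAMES],
            }],
        },
        "model_id": model_id,
        "route": route,
        "uid": hash_id(raw_uid),
        "cid": hash_id(raw_cid),
        "turn_index": turn_index,
        "latency_ms": latency_ms,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": cost_usd,
        "region": region,
        "device": device,
        "crisis_flag": crisis_flag,
    }


def emit(record: dict) -> None:
    """In Lambda, a JSON line on stdout lands in CloudWatch Logs, where EMF is extracted."""
    print(json.dumps(record))
```

Keeping `uid` and `cid` out of `Dimensions` matters: each distinct dimension value creates a separate CloudWatch metric, so high-cardinality fields belong in the log body only.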

3.3 DynamoDB Streams

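Enable streams on innerverse-users, innerverse-conversations, innerverse-feedback, with Kinesis Firehose delivering to S3 Bronze. Firehose does not read DynamoDB Streams directly, so the path needs either Kinesis Data Streams for DynamoDB as the Firehose source or a small forwarding Lambda. Either way, this captures every write event for audit + analytics without requiring the application Lambda to double-write.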
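Stream records carry DynamoDB's typed attribute format (`{"S": ...}`, `{"N": ...}`), which is awkward to query in Athena. A minimal sketch of flattening one record before it is written to Bronze, assuming a stream view type that includes the new image; the function names and output shape are hypothetical, and only the common attribute types are handled.

```python
def from_attr(attr: dict):
    """Convert one DynamoDB typed attribute value into a plain Python value."""
    (tag, value), = attr.items()
    if tag == "S":
        return value
    if tag == "N":
        return int(value) if value.lstrip("-").isdigit() else float(value)
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    if tag == "L":
        return [from_attr(v) for v in value]
    if tag == "M":
        return {k: from_attr(v) for k, v in value.items()}
    raise ValueError(f"unsupported attribute type: {tag}")


def flatten_record(record: dict) -> dict:
    """Reduce a stream record to event name, key, and the plain new image."""
    ddb = record["dynamodb"]
    image = ddb.get("NewImage") or {}  # absent on REMOVE events
    return {
        "event": record["eventName"],
        "keys": {k: from_attr(v) for k, v in ddb["Keys"].items()},
        "item": {k: from_attr(v) for k, v in image.items()},
    }
```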

3.4 Chat Sentinel observations (new, per feedback memo §4)

Every user turn → Haiku observation → DynamoDB innerverse-observations → stream → S3.

3.5 User-agent + region derivation

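Country comes from the CloudFront-Viewer-Country header, which CloudFront adds to origin requests when the header is included in the origin request policy. Device class and browser come from parsing the User-Agent string (ua-parser library). We keep country, not city, for privacy. No fingerprinting.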
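As a rough illustration of the derivation, the sketch below reads the country header and assigns a coarse device class. It deliberately uses a few substring checks as a stand-in for the ua-parser library named above, so it is far less accurate than the real parser; the function names and the three-way device classification are assumptions.

```python
from typing import Optional


def viewer_country(headers: dict) -> Optional[str]:
    """Return the two-letter country code CloudFront attached, if any (case-insensitive lookup)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("cloudfront-viewer-country")


def device_class(user_agent: str) -> str:
    """Coarse device class from a User-Agent string: tablet, mobile, or desktop."""
    ua = user_agent or ""
    # Tablets are checked first because iPad strings also contain "Mobile".
    if "iPad" in ua or "Tablet" in ua or ("Android" in ua and "Mobi" not in ua):
        return "tablet"
    if "Mobi" in ua:
        return "mobile"
    return "desktop"
```

Only the derived class and country are kept; storing the full User-Agent string alongside them would reintroduce the fingerprinting surface this section rules out.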

3.6 What we do NOT collect

---

4. The SAGE dashboard — what Harnoor sees

Two URLs, one private, one public.

4.1 /sage (private — admin only, IP-restricted or Cognito group)

Landing page with these panels:

Cost panel

Usage panel

Demographic panel

Sentiment panel (from Chat Sentinel)

Crisis panel (anonymous)

4.2 /safety/transparency (public quarterly)

Per Innovation 5. Shows users we're honest:

---

5. The pricing / cost calculations (per-turn)

Each turn's cost is computed synchronously after the Bedrock response:


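```python
# USD per 1M tokens, keyed by Bedrock model ID. Verify against current Bedrock pricing.
PRICING = {
    "us.anthropic.claude-sonnet-4-6":            {"in": 3.00,  "out": 15.00},
    "anthropic.claude-opus-4-7":                 {"in": 18.00, "out": 90.00},
    "anthropic.claude-opus-4-6-v1":              {"in": 15.00, "out": 75.00},
    "anthropic.claude-haiku-4-5-20251001-v1:0":  {"in": 0.80,  "out": 4.00},
}
# Unknown model IDs fall back to Sonnet rates, so a new model is mis-costed until it is added.
DEFAULT_PRICING = {"in": 3.00, "out": 15.00}


def compute_turn_cost(model_id: str, tokens_in: int, tokens_out: int) -> float:
    """Return the USD cost of one turn from its input and output token counts."""
    p = PRICING.get(model_id, DEFAULT_PRICING)
    return (tokens_in / 1_000_000) * p["in"] + (tokens_out / 1_000_000) * p["out"]
```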

Pricing table stored in F:/projects/innerverse/backend/src/pricing.py + version-controlled so historical data is accurate when Anthropic changes rates.

Per-user cost = sum(turn.cost) across all their turns.

Average user cost = sum(turn.cost) / count(distinct uid) over window.

Max user cost = max(user.total_cost) over window.

Cost percentiles = p50/p90/p99 distribution of per-user cost (lets us spot outlier "power users" who consume disproportionate resources).
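The four definitions above can be sketched over a list of `(uid, cost_usd)` turn rows. In production this would be an Athena query over Silver; the function names here are hypothetical, and the nearest-rank percentile is one reasonable choice among several definitions.

```python
import math
from collections import defaultdict


def per_user_cost(turns) -> dict:
    """Sum turn costs per user. `turns` is an iterable of (uid, cost_usd) pairs."""
    totals = defaultdict(float)
    for uid, cost in turns:
        totals[uid] += cost
    return dict(totals)


def percentile(values, q: float) -> float:
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def cost_summary(turns) -> dict:
    """Average, maximum, and p50/p90/p99 of per-user cost over the window."""
    costs = list(per_user_cost(turns).values())
    if not costs:
        return {"users": 0}
    return {
        "users": len(costs),
        "avg": sum(costs) / len(costs),
        "max": max(costs),
        "p50": percentile(costs, 50),
        "p90": percentile(costs, 90),
        "p99": percentile(costs, 99),
    }
```

With only a handful of users, p90 and p99 collapse onto the maximum, so the percentile panels only become informative once the user count grows.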

---

6. Implementation — build order

| Step | Ship in | Cost | Owner |
|---|---|---|---|
| 1. Enable CloudFront access logging to S3 | today | ~$0.50/mo S3 | FORGE |
| 2. Per-turn cost computation in handler.py → EMF | this week | 0 | FORGE |
| 3. DynamoDB Streams → Kinesis Firehose → S3 Bronze | this week | ~$2/mo | FORGE |
| 4. Glue ETL: Bronze → Silver nightly | next week | ~$5/mo | FORGE |
| 5. Athena + QuickSight dashboard (SAGE MVP) | next week | ~$12/mo QuickSight reader seat | SAGE |
| 6. Chat Sentinel (per feedback memo) | 2 weeks out | ~$30/mo Haiku | SCOUT |
| 7. /sage private dashboard page | 3 weeks | 0 (reuses QuickSight embed) | FORGE |
| 8. Public /safety/transparency page | 4 weeks | 0 | HERALD |
| 9. Weekly SAGE rollup job → Gold | month out | 0 | SAGE |

Total incremental infra cost at 1,000 DAU: ~$50-80/month. Trivial vs the value of actually knowing what's happening.

---

7. Privacy + compliance controls

| Concern | Control |
|---|---|
| GDPR Art. 6 (lawful basis) | Consent at clickwrap + Privacy Policy §4 (already live) |
| GDPR Art. 17 (right to erasure) | DELETE route removes from Silver + Gold within 30 days; Bronze auto-expires at 90 days |
| GDPR Art. 25 (privacy by design) | Bronze → Silver transform drops raw message content + IP below /24; never enters Silver |
| CCPA §1798.105 (right to delete) | Same as GDPR above; honored for all CA users automatically |
| COPPA | 13+ attestation in clickwrap gate; no identifiers collected for users we can't verify |
| CA SB 243 | /safety page already discloses + /safety/transparency will show aggregates |
| SOC 2 (future) | Audit trail in Bronze (immutable S3 object-lock) is the basis |

---

8. The hard things I'd watch for

1. Goodhart's law. The moment a metric becomes a target, it ceases to be a good metric. We must NEVER target NPS, MAU, or any single number. Use the triangulation rules from the feedback memo.

2. Sentinel over-reach. The AI monitor will want to classify everything. Constrain its schema tightly + audit 5% of outputs weekly.

3. Creepy personalization. Just because we can tag every emotion doesn't mean we should remember them. Memory should serve the user, not show them off.

4. False precision. A dashboard that says "sentiment trending -3.2%" invites false confidence. Always show error bars or confidence intervals.

5. Cohort collapse. At 6 users, single interesting users dominate averages. Don't publish per-cohort metrics until we have n ≥ 50 per cohort.
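Items 4 and 5 can be enforced in the rollup itself rather than left to dashboard discipline. A minimal sketch, assuming each cohort is a list of numeric values: cohorts under the n ≥ 50 threshold return nothing, and the rest report a mean with a 95% interval. The function name is hypothetical, and the normal approximation is a simplification that suits large cohorts better than skewed cost data.

```python
import math
import statistics
from typing import Optional

MIN_COHORT_SIZE = 50  # threshold from item 5
Z_95 = 1.96           # normal-approximation multiplier for a 95% interval


def cohort_summary(values: list) -> Optional[dict]:
    """Mean with a 95% confidence interval, or None when the cohort is too small to publish."""
    n = len(values)
    if n < MIN_COHORT_SIZE:
        return None
    mean = statistics.fmean(values)
    half_width = Z_95 * statistics.stdev(values) / math.sqrt(n)
    return {"n": n, "mean": mean, "ci_low": mean - half_width, "ci_high": mean + half_width}
```

Returning `None` instead of a flagged number makes it hard for a dashboard panel to display a suppressed cohort by accident.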

---

9. References

— HERALD

2026-04-21