Date: 2026-05-13 · Status: drafting (scout research running in parallel)
> Problem identified by Harnoor: newsletters repeat the same news across days. No indexed research. No timestamps. No multimedia variety. Same titles + summaries appear repeatedly. The shell looks nice but the engine is broken.
> Goal: every newsletter must feel fresh every single day. Same story across days OK — same title/angle/summary across days NEVER.
---
Current architecture:
Each newsletter script (openclaw_newsletter.py etc.) starts fresh each run, with no memory of what it sent yesterday.

---
File: F:/TITAN/state/newsletter-stories.sqlite
Schema:

    CREATE TABLE stories (
        story_id     TEXT PRIMARY KEY,   -- sha256(entity_normalized + headline_keywords)
        first_seen   DATETIME NOT NULL,
        last_sent    DATETIME,
        send_count   INTEGER DEFAULT 0,
        sent_in      TEXT,               -- JSON array of newsletter slugs that have used it
        entity       TEXT,               -- "Anthropic", "OpenAI", "Linear" etc
        headline     TEXT,
        summary_hash TEXT,               -- sha256 of summary so we never re-use exact text
        source_url   TEXT,
        story_kind   TEXT,               -- release, partnership, hire, funding, leak, opinion
        freshness    REAL,               -- 0.0–1.0 decay score
        tags         TEXT                -- JSON array
    );
    CREATE INDEX idx_freshness ON stories(freshness DESC, last_sent);
    CREATE INDEX idx_entity_date ON stories(entity, first_seen);
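A minimal sketch of how the story_id fingerprint from the schema comment might be computed. The plan only specifies sha256(entity_normalized + headline_keywords); the normalization rule, the stop-word list, the keyword sorting, and the function names below are all assumptions for illustration.

```python
import hashlib
import re

# Hypothetical stop-word list; the plan does not define one.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "for", "and", "on", "with"}


def normalize_entity(entity: str) -> str:
    """Lowercase and keep only letters and digits (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "", entity.lower())


def headline_keywords(headline: str) -> list:
    """Sorted, de-duplicated keywords, so word order does not change the id."""
    words = re.findall(r"[a-z0-9]+", headline.lower())
    return sorted({w for w in words if w not in STOP_WORDS})


def compute_story_id(entity: str, headline: str) -> str:
    """sha256 over the normalized entity plus the headline keywords."""
    payload = normalize_entity(entity) + "|" + " ".join(headline_keywords(headline))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Under these assumptions, two headlines that use the same keywords in a different order map to the same story_id, while a headline that introduces new keywords gets a new id. Looser matching would have to come from the Gemini fingerprint step.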
Dedup rule on send, applied before any item lands in a newsletter draft:
1. Compute story_id from entity + headline-keywords
2. Look up in ledger
3. If sent in last 3 days and no new angle → SKIP
4. If sent in last 7 days but has a new development → ALLOWED, but Gemini must rewrite headline + summary completely (compare summary_hash to ensure they differ)
5. Insert/update row after send
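The rule above can be sketched as a small decision function. This is an illustration under stated assumptions, not the final module: the function names and return labels are hypothetical, and the plan does not say what happens to stories sent 3 to 7 days ago with no new development, or more than 7 days ago. The sketch treats both as "rewrite" so that old text is never reused.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional

SKIP_WINDOW = timedelta(days=3)     # rule 3
REWRITE_WINDOW = timedelta(days=7)  # rule 4


def dedup_decision(last_sent: Optional[datetime], has_new_development: bool,
                   now: Optional[datetime] = None) -> str:
    """Return "send", "skip", or "rewrite" for a candidate story."""
    now = now or datetime.now(timezone.utc)
    if last_sent is None:
        return "send"  # not in the ledger yet
    age = now - last_sent
    if age <= SKIP_WINDOW and not has_new_development:
        return "skip"  # rule 3
    if age <= REWRITE_WINDOW and has_new_development:
        return "rewrite"  # rule 4
    # Assumption: cases the plan leaves open also require a full rewrite.
    return "rewrite"


def summary_hash(summary: str) -> str:
    """sha256 of the stripped summary text, matching the summary_hash column."""
    return hashlib.sha256(summary.strip().encode("utf-8")).hexdigest()


def rewrite_is_valid(new_summary: str, previous_hash: str) -> bool:
    """True when the rewritten summary differs from the stored hash."""
    return summary_hash(new_summary) != previous_hash
```

A hash comparison only catches exact reuse; a lightly edited summary would still pass, so a similarity check is a natural companion.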
Script: F:/TITAN/scripts/newsletter_research_indexer.py (new)
Runs at 05:00 UTC daily, BEFORE any newsletter is generated.
Tasks:
1. Perplexity Pro queries (via pplx.py) — 6 queries per newsletter topic:
   - "what shipped in {topic} in last 24 hours, with source URLs and timestamps"
   - "what's trending on /r/{subreddit} in {topic} this morning"
   - "GitHub trending repos in {topic} last 24h"
   - "latest YouTube uploads from {channel-list} in {topic}"
   - "latest blog posts from {author-list}"
   - "biggest opinion / hot take in {topic} from past day"
2. Gemini Pro dedupes results across the 6 queries, scores freshness, generates a fingerprint per story
3. YouTube Data API → pull thumbnail + duration for any embedded video (plain oEmbed returns the thumbnail and embed HTML, but not the duration)
4. Screenshot service — for product launches, capture og:image or a Playwright screenshot
5. Output: F:/TITAN/state/research-index-{date}.jsonl — every story with id/entity/headline/summary/source-url/timestamp/media
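As a rough sketch of tasks 1 and 5, the six query templates can be filled per topic and the resulting stories written one JSON object per line. The function names and the example field values are hypothetical. The {channel-list} and {author-list} placeholders are renamed to {channels} and {authors} here only to keep the format keys simple. The Perplexity and Gemini calls themselves are left out.

```python
import json
from pathlib import Path

# Templates taken from the task list; placeholder names simplified (assumption).
QUERY_TEMPLATES = [
    "what shipped in {topic} in last 24 hours, with source URLs and timestamps",
    "what's trending on /r/{subreddit} in {topic} this morning",
    "GitHub trending repos in {topic} last 24h",
    "latest YouTube uploads from {channels} in {topic}",
    "latest blog posts from {authors}",
    "biggest opinion / hot take in {topic} from past day",
]


def build_queries(topic, subreddit, channels, authors):
    """Fill all six templates for one newsletter topic."""
    values = {
        "topic": topic,
        "subreddit": subreddit,
        "channels": ", ".join(channels),
        "authors": ", ".join(authors),
    }
    return [template.format(**values) for template in QUERY_TEMPLATES]


def write_index(stories, out_dir, date):
    """Write stories to research-index-{date}.jsonl, one JSON object per line."""
    path = Path(out_dir) / f"research-index-{date}.jsonl"
    with path.open("w", encoding="utf-8") as fh:
        for story in stories:
            fh.write(json.dumps(story, ensure_ascii=False) + "\n")
    return path
```

Each newsletter script can then stream the file line by line with json.loads instead of loading the whole index at once.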
Each newsletter script reads the indexed research, filters via ledger, picks N stories with diversity rules:
After SES send, the script:
1. Updates the ledger row for each story it sent, including send_count
2. Archives the issue HTML to newsletter-archive/<slug>/<date>.html

---
| Type | Approach | Copyright |
|---|---|---|
| YouTube video | Embed <iframe> via oEmbed; thumbnail as fallback | Fair use — single embedded clip, attribution required, no re-upload |
| Product screenshot | Playwright/og:image scrape, attribute source, max 600×400 | Fair-use commentary; always credit + link |
| Quote | <15 words verbatim only; otherwise paraphrase | Already in copyright rules |
| Chart | Generate our own via Chart.js or Imagen 4 if data quoted | Avoid republishing source charts |
| Tweet/X post | Native blockquote with attribution | Standard embed rules |
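Three of the rules in this table are mechanical enough to sketch as helpers: building the YouTube oEmbed request URL, keeping screenshots within 600×400, and enforcing the under-15-words limit on verbatim quotes. The helper names are hypothetical, and fetching the oEmbed response is left out of the sketch.

```python
from urllib.parse import urlencode

YOUTUBE_OEMBED_ENDPOINT = "https://www.youtube.com/oembed"
MAX_SCREENSHOT = (600, 400)  # from the table: max 600×400
QUOTE_WORD_LIMIT = 15        # verbatim quotes must stay under this


def youtube_oembed_url(video_url: str) -> str:
    """Build the oEmbed request URL for a YouTube video page URL."""
    query = urlencode({"url": video_url, "format": "json"})
    return f"{YOUTUBE_OEMBED_ENDPOINT}?{query}"


def fit_within(width: int, height: int, max_size=MAX_SCREENSHOT):
    """Scale down to fit max_size, preserving aspect ratio, never upscaling."""
    scale = min(max_size[0] / width, max_size[1] / height, 1.0)
    return round(width * scale), round(height * scale)


def quote_is_allowed(quote: str) -> bool:
    """True when a verbatim quote is under the word limit."""
    return len(quote.split()) < QUOTE_WORD_LIMIT
```

Anything that fails quote_is_allowed would be paraphrased instead, per the Quote row.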
---
                            ┌────────────────────────────────┐
                            │ newsletter_research_indexer.py │
                            │ (cron 05:00 UTC daily)         │
                            └────────────┬───────────────────┘
                                         │
                      ┌──────────────────┼──────────────────┐
                      │                  │                  │
              ┌───────▼──────┐  ┌────────▼─────┐  ┌─────────▼────────┐
              │ Perplexity   │  │ Gemini Pro   │  │ YouTube oEmbed   │
              │ (6 queries)  │  │ (dedupe +    │  │ + Playwright     │
              │              │  │ fingerprint) │  │ screenshots      │
              └───────┬──────┘  └────────┬─────┘  └─────────┬────────┘
                      │                  │                  │
                      └──────────────────┼──────────────────┘
                                         │
                        ┌────────────────▼─────────────────┐
                        │ research-index-{date}.jsonl      │
                        │ story_id · entity · headline ·   │
                        │ summary · source · media · ts    │
                        └────────────────┬─────────────────┘
                                         │
              ┌──────────────────────────┼──────────────────────────┐
              │                          │                          │
    ┌─────────▼────────┐       ┌─────────▼─────────┐      ┌─────────▼─────────┐
    │ openclaw_news.py │       │ agentic_ai_news.py│      │ claude_news.py    │
    │ 08:00 UTC        │       │ 08:15 UTC         │      │ 08:30 UTC         │
    └─────────┬────────┘       └─────────┬─────────┘      └─────────┬─────────┘
              │                          │                          │
              └───────── filter via STORY LEDGER (sqlite) ──────────┘
                                         │
                          ┌──────────────▼──────────────┐
                          │ SES send → ledger update    │
                          │ archive HTML → S3           │
                          └─────────────────────────────┘
---
| Item | Cost |
|---|---|
| Perplexity Pro (covered by Harnoor's existing plan) | $0 |
| Gemini Pro dedup + fingerprint (~50k tokens) | $0.15 |
| Gemini Flash subject-line + headline rewrites | $0.05 |
| YouTube oEmbed + Playwright screenshots | $0 (free + local) |
| SES sends (4 emails × 1 subscriber; Harnoor only for now) | $0.0004 |
| Total daily | ~$0.20 |
At 1,000 subscribers, SES rises to about $0.40/day (4,000 emails at the same per-email rate), so the all-in cost is roughly $0.60/day. Cheap.
---
1. newsletter-stories.sqlite with schema above — empty DB ready
2. newsletter_research_indexer.py — Perplexity + Gemini + writes JSONL daily
3. newsletter_ledger.py — shared module for dedup lookup + update by every newsletter script
4. Retrofit openclaw_newsletter.py to use ledger
5. Retrofit agentic-ai + claude + agent-stack newsletter scripts
6. Add YouTube oEmbed support to the newsletter templates (4 already picked: Rolling Stone / Pop-Sci / Anthropic-brand / Comic-panel)
7. Add Playwright screenshot capture for product launches
8. Smart subject-line generator (Gemini Flash, dynamic per top story)
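For item 3, a minimal sketch of the post-send update that newsletter_ledger.py might expose, assuming the stories table from the schema already exists. The function name and signature are hypothetical; only columns named in the schema are touched.

```python
import json
import sqlite3
from datetime import datetime, timezone


def record_send(conn: sqlite3.Connection, story_id: str, slug: str,
                entity: str, headline: str, summary_hash: str) -> None:
    """Insert a ledger row on first send, or update it on a repeat send."""
    now = datetime.now(timezone.utc).isoformat()
    row = conn.execute(
        "SELECT sent_in FROM stories WHERE story_id = ?", (story_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO stories (story_id, first_seen, last_sent, send_count,"
            " sent_in, entity, headline, summary_hash)"
            " VALUES (?, ?, ?, 1, ?, ?, ?, ?)",
            (story_id, now, now, json.dumps([slug]), entity, headline, summary_hash),
        )
    else:
        # sent_in is a JSON array of newsletter slugs that have used the story.
        slugs = json.loads(row[0] or "[]")
        if slug not in slugs:
            slugs.append(slug)
        conn.execute(
            "UPDATE stories SET last_sent = ?, send_count = send_count + 1,"
            " sent_in = ?, headline = ?, summary_hash = ? WHERE story_id = ?",
            (now, json.dumps(slugs), headline, summary_hash, story_id),
        )
    conn.commit()
```

Storing the latest headline and summary_hash on each update gives the next rewrite something to compare against.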
---
F:/TITAN/plans/audits/NEWSLETTER-DEDUP-RESEARCH-2026-05-13.md (memo + emailed)

When SCOUT delivers, I'll merge its findings into this plan and then ship the P0 components.
---
| Metric | Target |
|---|---|
| Same headline appearing twice in same week across any 2 newsletters | 0 |
| Same summary text (>50% similarity) | 0 |
| Issues with embedded video | ≥1 per newsletter per week |
| Issues with screenshot | ≥1 per newsletter per week |
| Subject-line click-through (when we have analytics) | +30% vs current "Issue #N" |
| Reader survey "felt fresh" rating | ≥4/5 |
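The ">50% similarity" metric needs a concrete measure before it can be checked automatically. One possible sketch uses word-level matching from Python's difflib. The choice of SequenceMatcher, the word-level tokenization, and the function names are assumptions; the plan does not name a similarity measure.

```python
from difflib import SequenceMatcher

SIMILARITY_LIMIT = 0.5  # from the metrics table: >50% similarity is a failure


def summary_similarity(a: str, b: str) -> float:
    """Word-level similarity ratio between two summaries, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def too_similar(a: str, b: str) -> bool:
    """True when two summaries exceed the similarity limit."""
    return summary_similarity(a, b) > SIMILARITY_LIMIT
```

Running this over every pair of summaries sent in the same week would give a direct count for the second row of the table.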