HERALD Tuesday Red-Team — CloudWatch Insights Queries

Purpose: Automated production audit of response quality + anti-pattern detection. Run against LogGroup /aws/lambda/innerverse-mirror every Tuesday 09:00 ET as part of the HERALD rhythm. Results emailed to harnoors@gmail.com.

Metrics source: _emit_emf_metrics() emission path in handler.py (R0156). Every chat turn now includes ResponseLenChars, ResponseParaCount, FrameworkMentionCount.
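For orientation, here is a minimal sketch of what an EMF record carrying these three fields could look like. It is not the code in handler.py: the function name build_emf_record, the namespace, the framework term list, and the blank-line paragraph rule are all assumptions made for illustration.

```python
import json
import re
import time

# Hypothetical term list; the real attribution logic lives in handler.py.
FRAMEWORK_TERMS = ("CBT", "Stoicism", "attachment theory")


def build_emf_record(response_text, namespace="Innerverse/Mirror", now_ms=None):
    """Build one CloudWatch Embedded Metric Format record for a chat turn."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", response_text) if p.strip()]
    lowered = response_text.lower()
    mentions = sum(lowered.count(term.lower()) for term in FRAMEWORK_TERMS)
    metric_names = ("ResponseLenChars", "ResponseParaCount", "FrameworkMentionCount")
    return {
        # Metric values come first so each one is followed by a comma in the
        # serialized line, which is what the '"Name": *,' parse patterns need.
        "ResponseLenChars": len(response_text),
        "ResponseParaCount": len(paragraphs),
        "FrameworkMentionCount": mentions,
        "_aws": {
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,  # hypothetical namespace
                "Dimensions": [[]],
                "Metrics": [{"Name": n, "Unit": "Count"} for n in metric_names],
            }],
        },
    }


def emit(record):
    # In Lambda, one JSON line on stdout is picked up as an EMF log event.
    # json.dumps uses ': ' and ', ' separators by default.
    print(json.dumps(record))
```

One consequence worth checking against the real emitter: the parse patterns in the queries below end in a comma, so a metric serialized as the last key of the JSON object (followed by a closing brace instead) would not be extracted.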

---

Query 1 — Short-response rate (anti-pattern: replies under 200 chars)


filter @message like /ResponseLenChars/
| parse @message '"ResponseLenChars": *,' as responseLen
# sum() over the comparison counts only matching events;
# count(expr) counts every event where the expression is non-null
| stats count() as total,
        sum(responseLen < 200) as short_count,
        (sum(responseLen < 200) * 100.0 / count()) as short_pct
  by bin(1h)
| sort @timestamp desc

Healthy target: short_pct < 10% per hour. Spike = sage prompt compliance slipping.

---

Query 2 — No-framework-cited rate (anti-pattern: sage without attribution)


filter @message like /FrameworkMentionCount/
| parse @message '"FrameworkMentionCount": *,' as frameworks
| parse @message '"ResponseLenChars": *,' as responseLen
# exclude short replies (under 200 chars), which Query 1 already tracks
| filter responseLen >= 200
| stats count() as total,
        sum(frameworks = 0) as unsourced,
        (sum(frameworks = 0) * 100.0 / count()) as unsourced_pct
  by bin(1h)
| sort @timestamp desc

Healthy target: unsourced_pct < 20% per hour. If higher, add 2-3 few-shot examples to the prompt.

---

Query 3 — Paragraph count distribution


filter @message like /ResponseParaCount/
| parse @message '"ResponseParaCount": *,' as paras
# sum() over each comparison counts only the matching events
| stats count() as total,
        sum(paras = 1) as one_para,
        sum(paras = 2) as two_para,
        sum(paras >= 3) as three_plus,
        (sum(paras >= 3) * 100.0 / count()) as three_plus_pct
  by bin(6h)
| sort @timestamp desc

Healthy target: three_plus is at least 80% of total in each 6h bucket. Enforcement rule (C) of the core_behavior_rule mandates 3+ paragraphs in reflective mode.

---

Query 4 — Voice turn p50/p95 latency


filter @message like /VoiceTurnTotalMs/
| parse @message '"VoiceTurnTotalMs": *,' as totalMs
| parse @message '"SttMs": *,' as sttMs
| parse @message '"LlmFirstTokenMs": *,' as llmMs
| parse @message '"TtsFirstAudioMs": *,' as ttsMs
| stats
    avg(totalMs) as avg_total_ms,
    pct(totalMs, 50) as p50_total_ms,
    pct(totalMs, 95) as p95_total_ms,
    pct(sttMs, 50) as p50_stt_ms,
    pct(ttsMs, 50) as p50_tts_ms
  by bin(1h)
| sort @timestamp desc

Healthy target: p50_total_ms < 4000, p50_stt_ms < 2000, p50_tts_ms < 3500. R0148 PCM16 path should hold these.

---

Query 5 — Cache-hit rate (prompt caching working?)


filter @message like /CacheHit/
| parse @message '"CacheHit": *,' as hit
| stats avg(hit) * 100 as cache_hit_pct,
        count() as total
  by bin(1h)
| sort @timestamp desc

Healthy target: cache_hit_pct > 70% after warmup. Low hit rate = system prompt cache checkpoint not firing.

---

Query 6 — Error codes over time


filter ispresent(ErrorCode)
| stats count() as n by ErrorCode, bin(6h)
| sort @timestamp desc

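Healthy target: no single ErrorCode accounts for more than 5% of all invocations. The query returns raw counts per code, so the percentage needs the invocation total for the same window. stt_empty should drop to ~0% now that PCM16 has landed (R0148).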
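Since Query 6 yields counts rather than shares, the 5% check has to be done outside the query. A minimal sketch of that arithmetic follows; the function name error_code_breaches and the idea of passing in the invocation total separately are assumptions, not part of the HERALD wrapper as documented here.

```python
def error_code_breaches(error_counts, total_invocations, max_share_pct=5.0):
    """Return {code: share_pct} for codes above max_share_pct of invocations.

    error_counts: mapping of ErrorCode to its count in one time window.
    total_invocations: all invocations in the same window (errors or not).
    """
    if total_invocations <= 0:
        # No traffic in the window, so nothing can be in breach.
        return {}
    breaches = {}
    for code, n in error_counts.items():
        share = n * 100.0 / total_invocations
        if share > max_share_pct:
            breaches[code] = round(share, 2)
    return breaches
```

For example, with 200 invocations in a window, 12 llm_timeout errors (a made-up code) come to 6% and are flagged, while 2 stt_empty errors come to 1% and are not.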

---

Automation hook

These queries are registered in herald-cron-registration.json under the tuesday-redteam cron entry. HERALD's wrapper:

1. Runs each query via aws logs start-query

2. Waits for Complete status

3. Aggregates results into a single weekly email

4. Compares each metric to the "healthy target" threshold

5. Flags any breach in the email subject
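The five steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual HERALD wrapper: the helper names, the subject-line wording, and the polling defaults are hypothetical. The client is passed in so the sketch stays self-contained; in practice it would be boto3.client("logs"), whose start_query and get_query_results calls are used here with their documented parameters.

```python
import time


def run_insights_query(logs_client, log_group, query, start_ts, end_ts,
                       poll_seconds=1.0, timeout_seconds=120.0):
    """Start a Logs Insights query and block until it reaches Complete."""
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_ts,   # epoch seconds
        endTime=end_ts,       # epoch seconds
        queryString=query,
    )["queryId"]
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        resp = logs_client.get_query_results(queryId=query_id)
        status = resp["status"]
        if status == "Complete":
            # Each result row is a list of {"field": ..., "value": ...} dicts.
            return [{c["field"]: c["value"] for c in row}
                    for row in resp["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete in time")


def breached_buckets(rows, column, is_healthy):
    """Return the rows whose value in `column` fails the healthy-target test."""
    return [r for r in rows
            if column in r and not is_healthy(float(r[column]))]


def email_subject(breaches_by_metric):
    """Build the weekly subject line, naming any metric with a breach."""
    flagged = sorted(m for m, rows in breaches_by_metric.items() if rows)
    if not flagged:
        return "HERALD Tuesday Red-Team: all healthy"  # hypothetical wording
    return "HERALD Tuesday Red-Team: BREACH " + ", ".join(flagged)
```

A call such as breached_buckets(rows, "short_pct", lambda v: v < 10) would apply the Query 1 target to each hourly bucket. Values come back from Logs Insights as strings, which is why the sketch converts them with float() before comparing.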

Version history

2026-04-21 v1 — Initial query set. Metrics stream only started emitting R0156 depth metrics at this date; prior turns will show zero for the new fields (not a bug).