Author: SCOUT (TITAN Research Arm)
Date: 2026-04-23
Status: FINAL
Commissioned by: Harnoor (verbatim trigger: "audit why asks have to be mentioned again and again this is so frustrating")
Target length: ~4,000 words
---
The repeated-ask problem is not a bug. It is an architectural deficit. AI systems that fail to prevent re-asking share one root property: they treat state as implicit — stored in the context window and thus subject to compaction, session boundary erasure, and attention dilution — rather than explicit, structured, and persisted. Best-in-class systems (Linear, Devin, Mem0, HZL, LangGraph) solve this with one consistent pattern: externalize state into append-only stores that survive context loss. TITAN already has the scaffolding (hot memory, task registry, MASTER-TODO) but lacks the enforcement layer — the "ask ledger" — that closes the loop between what was asked and what was verified as seen. This memo describes the full architecture for that layer and ranks the ten most impactful interventions TITAN can make.
---
Six distinct failure modes cause repeated asks in AI assistants. They compound — any single one can drop an ask, but all six often co-occur.
When a Claude Code session approaches its context limit, the system compacts: it summarizes prior conversation into a compressed block and discards the originals. The problem is asymmetric: compaction optimizes for factual density, not for open commitments. An ask embedded in 12 messages of discussion ("make the word-rotation animation visible") compresses to a general summary about animations, losing the specific unresolved state. The Chroma 2025 benchmark tested 18 frontier models (including GPT-4.1, Claude Opus 4, and Gemini 2.5) and found that every one exhibits performance degradation at every input length increment tested, meaning that even without hard compaction, recall degrades as context grows. Anthropic's own engineering blog notes that "overly aggressive compaction can result in the loss of subtle but critical context whose importance only becomes apparent later." The TITAN workspace has experienced this directly: asks made early in the day in long exploratory sessions are gone by the session's third compaction.
Evidence from TITAN workspace: The word-rotation streaming animation was requested across 10+ sessions. Each new session started without access to prior-session state, treating each request as if it were the first.
The AI declares "done" when code is written or deployed. The user experiences "done" only when they can see and interact with the change. This gap — between deployment success and perceptual verification — is the source of the single most frustrating TITAN failure pattern. Code writing is not shipping. Deploying to a branch is not shipping. The system has no mechanism to distinguish "I wrote the function" from "Harnoor can see the animation playing in his browser right now."
Replit's agent architecture illustrates the right model: it tests functionality automatically and shows a video recap of the app being tested, making Replit iterations "slightly longer, but the tradeoff of getting code that actually works on the first try is a no-brainer." TITAN has no equivalent. There is no post-deploy screenshot step, no DOM verification, no user-confirmation checkpoint before an ask is marked closed.
Every new Claude Code session begins with only what is loaded from memory files. If an ask was captured only in the prior session's TodoWrite (ephemeral by design), it is gone. The user re-enters the ask. The AI processes it as new. This is not a model failure — it is a storage failure. Ephemeral task state has no path to durable storage unless TITAN explicitly writes it out.
Industry data: by early 2025, "virtually all" agent frameworks share this flaw — they "forget everything when the session ends," losing "the user's name, preferences, what was discussed yesterday, or what it committed to do." ChatGPT's memory feature (2024), Claude's memory system (September 2025), and Mem0 (raised $24M October 2025) all emerged specifically to solve this category.
New asks are urgent and emotionally salient. Old unfinished asks are silent and invisible. In any context window that contains both, the new ask crowds out the old. This is not a bug in the model — it is a rational response to recency bias and user energy. The model interprets a new question as what the user currently wants, discarding the stale background intent that was never explicitly closed.
The AI agent anti-patterns research calls this "invisible state": "LLMs do not maintain structured state. They approximate it. And approximation creates drift." The fix requires that every open ask be as visible to the model as the new incoming ask — something that is only possible with an external state store that is loaded fresh at each turn, not inferred from history.
When a user reverses an earlier direction ("actually, skip the bubbles cap" → three sessions later: "wait why isn't the bubbles cap working?"), both the original ask and its reversal live in history. After compaction, the model may retain either one, depending on which got emphasized in the summary. This is structurally unresolvable through context management alone — only a ledger with explicit "superseded by ask_id" relationships can track the actual current state of a requirement.
Voice-to-text transcription errors and casual phrasing introduce parse ambiguity. "Bubbles-cap-at-3" is a precise technical requirement; "make the things not go too high" is not. When the AI parses an ambiguous ask as a general aesthetic comment rather than a functional requirement, it may acknowledge the ask ("I'll keep that in mind") without ever treating it as a deliverable. The ask is registered as conversation, not as a task, and thus never enters any tracking system.
---
Linear's core innovation is that every piece of work is a state machine node, not a conversation artifact. Issues move through Backlog → Todo → In Progress → In Review → Done, with Cancelled as an alternative terminal state. Critically, a state transition requires an explicit actor (human or automation). An AI cannot mark an issue Done without a PR being merged. A webhook from GitHub fires on pull_request.merged and Linear moves the issue; without that webhook, the issue stays In Progress.
This creates structural accountability: "done" is a state that requires evidence. Linear's PR-linked issue system means a re-open is detectable (GitHub shows the PR was reverted; Linear sees it). The "no silent closure" design is the direct antidote to TITAN's "shipped vs. visible" problem.
Linear as an AI task hub is now a documented pattern: "Upon merge, a webhook triggers Linear to move the issue to Done." The state machine prevents any claim of completion without the underlying event.
Asana's task completion model is similar to Linear's, with one additional relevant feature: task dependencies and "waiting on" states. An ask that is "waiting on user confirmation" cannot be auto-closed. The user must explicitly mark it done. For TITAN's use case, the equivalent would be: any ask that touches user-visible UI enters a "waiting on visibility confirmation" state that cannot be closed by the AI — only by the user saying "yes, I see it."
The GitHub-Asana integration enforces this: "Only closing tasks when PRs are merged to the main branch" — not when a feature branch exists, not when the code is written.
Devin's 2025 annual performance review reveals one key architectural insight: Devin handles "clear upfront scoping well, but not mid-task requirement changes" and "usually performs worse when you keep telling it more after it starts." This is an honest acknowledgment that iterative correction is structurally hard for long-running agents.
More importantly, Devin 2.0's multi-agent architecture uses a task ledger with lease-based assignment: when an agent claims a task, it takes a time-limited lease. If the agent dies, crashes, or gets context-compacted, the lease expires and another agent picks it up. This is the formal solution to session-boundary loss: tasks are not owned by a session, they are leased by whoever is currently running, and the ledger persists independent of any session.
For long-running migrations, "the agent can keep a running to-do list of subtasks and chip away at them over hours or days" — the list persists in external storage, not in context.
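The lease mechanism can be sketched in a few lines. This is an illustration of the idea only, not Cognition's implementation: the `TaskLedger` class, the `claim` method, and the session names are hypothetical, and the ledger is held in memory here where a real one would be persisted.

```python
import time
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str        # hypothetical agent/session identifier
    expires_at: float  # epoch seconds


class TaskLedger:
    """In-memory stand-in for a persistent ledger (a real one would live on disk)."""

    def __init__(self):
        self.leases = {}  # task_id -> Lease

    def claim(self, task_id, holder, ttl_s, now=None):
        """Grant the lease if the task is free or its previous lease has expired."""
        now = time.time() if now is None else now
        lease = self.leases.get(task_id)
        if lease and lease.expires_at > now and lease.holder != holder:
            return False  # another holder still has a live lease
        self.leases[task_id] = Lease(holder, now + ttl_s)
        return True


ledger = TaskLedger()
first = ledger.claim("ASK-0001", "session-a", ttl_s=60, now=0)        # granted
blocked = ledger.claim("ASK-0001", "session-b", ttl_s=60, now=30)     # still leased
taken_over = ledger.claim("ASK-0001", "session-b", ttl_s=60, now=61)  # lease expired
```

The task is never owned by the session that started it: once `session-a` stops renewing, `session-b` can pick the work up without any handoff.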
HZL is the most directly applicable prior art. It is a backend-first, CLI-native, model-agnostic task ledger built specifically for AI agent workflows.
HZL's design explicitly acknowledges "that agents lack persistent memory and face context limitations" and engineers around it rather than hoping the model will remember. The task ledger is the primary state, not the context window.
Mem0 (production-ready AI memory layer, $24M raised October 2025, 91% faster retrieval vs. full-context, 90% lower token usage) implements a three-tier memory model that is directly applicable to TITAN:
1. User-level memory — persists across all sessions (preferences, recurring needs, technical background)
2. Session-level memory — task-specific, cleared on session close
3. Agent-level memory — behavioral patterns, successful approaches
The critical design: user-level memory is never cleared. Session-level is ephemeral by design. Asks belong in user-level memory (or higher) — not session-level — because they represent durable commitments, not conversation state.
On the LOCOMO benchmark, Mem0 scored 26% higher than OpenAI's built-in memory, specifically because of selective retrieval — it surfaces the most relevant prior context rather than dumping everything into the window.
Anthropic's own Claude Code implements a three-layer memory architecture: MEMORY.md (always-loaded index), topic files (on-demand markdown), and session transcripts (grep-only JSON). The CLAUDE.md system provides persistent project-level instructions. This is precisely what TITAN uses.
The gap: Anthropic's context engineering guidance recommends "structured note-taking" where agents "regularly write notes persisted to memory outside of the context window" — but this is advisory, not enforced. Claude Code has no hook that fires before a "done" declaration and checks whether there are unverified open asks. That enforcement layer is missing and must be built.
The broader industry convergence (LangGraph, Semantic Kernel, Bee Agent Framework) around explicit state objects reflects the same insight: don't let the model infer what has happened — tell it explicitly. These frameworks pass structured state objects between agent steps. Every action reads and writes defined fields. "Agents never have to infer what has happened — they know."
For TITAN, this translates to: the ask ledger file (F:/TITAN/state/harnoor-asks.jsonl) is the state object. Every agent action that relates to a known ask must read and update that file, not rely on context.
---
Research on chatbot UX (Clutch.co, 2024) documents the behavioral loop: users who must re-ask enter a negative feedback spiral. The first re-ask is interpreted as "the AI missed it once, that's acceptable." The second re-ask generates irritation ("I already told it this"). By the third or fourth re-ask, users develop compensatory behaviors: writing longer, more emphatic initial requests ("IMPORTANT: please make sure the animation is ACTUALLY visible, not just in the code"); front-loading repeat asks before any new asks; and eventually, testing the AI before trusting any "done" claim.
45% of consumers report receiving irrelevant answers when communicating with AI support systems. The dominant response is not to clarify — it is to disengage or repeat with higher emphasis. Neither helps.
Multi-session studies on human-AI collaboration (ACM CHI 2024, arXiv:2603.13717) show that "trust in AI is known to accumulate gradually, influenced by past experiences and evolving expectations." Re-asking inverts this: each failure to retain an ask makes the user less willing to treat new "done" claims as reliable. The AI's confidence in its own outputs begins to feel inversely correlated with actual delivery.
For TITAN specifically, the five documented re-ask patterns (word-rotation, growing tree, Instagram share, bubbles cap, emotion-as-state) represent not just five dropped tasks — they represent the compounding cost of five trust-damaging incidents that make Harnoor less likely to believe the next "done" claim.
The CHI 2025 paper "AI on My Shoulder" (Swain et al.) documents that when AI systems fail repeatedly, users absorb the cognitive and emotional labor of tracking state themselves — essentially becoming the AI's external memory. The user writes their own MASTER-TODO, maintains their own task registry, and then has to re-enter that state into the AI on each session. This is a direct inversion of the intended dynamic: the AI should be the user's external memory, not the other way around.
---
File: F:/TITAN/state/harnoor-asks.jsonl (append-only, one JSON object per line)
    {
      "ask_id": "ASK-0001",
      "ts": "2026-04-23T13:55:00Z",
      "verbatim": "make the word rotation animation actually visible",
      "paraphrase": "Streaming text rotation animation must be visually perceptible in the live Trillionaire app",
      "feature_area": "animation/streaming",
      "priority": "P1",
      "status": "DEPLOYED_UNVERIFIED",
      "deploy_sha": null,
      "verified_at": null,
      "verified_by": "user",
      "re_ask_count": 4,
      "re_ask_history": ["2026-04-10T...", "2026-04-14T...", "2026-04-19T...", "2026-04-23T..."],
      "superseded_by": null,
      "supersedes": null,
      "notes": "Previously reported as done in sessions 4, 7, 9. Each time not visible in prod."
    }
Status values: OPEN | IN_PROGRESS | DEPLOYED_UNVERIFIED | VERIFIED | SUPERSEDED | CANCELLED
The ledger is append-only. Status changes are new entries with the same ask_id and a new timestamp — the full history is always recoverable.
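A minimal sketch of the append-only behavior, assuming the latest line for an ask_id is its current state. The helper names `append_entry` and `current_state` are hypothetical, the second timestamp is invented for the demo, and the demo writes to a temporary file, not the real ledger path.

```python
import json
import os
import tempfile


def append_entry(path, entry):
    """Append one JSON object as a single line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def current_state(path):
    """Fold the log into {ask_id: latest entry}; later lines win."""
    latest = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                entry = json.loads(line)
                latest[entry["ask_id"]] = entry
    return latest


# Demo against a temporary file (hypothetical data).
demo_path = os.path.join(tempfile.mkdtemp(), "asks.jsonl")
append_entry(demo_path, {"ask_id": "ASK-0001", "ts": "2026-04-23T13:55:00Z",
                         "status": "OPEN"})
append_entry(demo_path, {"ask_id": "ASK-0001", "ts": "2026-04-23T15:10:00Z",
                         "status": "DEPLOYED_UNVERIFIED"})
state = current_state(demo_path)
with open(demo_path, encoding="utf-8") as f:
    history_len = sum(1 for _ in f)
```

Because nothing is overwritten, the earlier OPEN entry stays on disk and the full status history can be replayed at any time.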
Every TITAN agent, before declaring any ask complete, must:
1. Load harnoor-asks.jsonl and filter for status IN (OPEN, IN_PROGRESS, DEPLOYED_UNVERIFIED)
2. Fuzzy-match the current task against open asks (semantic similarity on paraphrase field)
3. If a match is found with re_ask_count >= 2, escalate: "This ask has been reported done before without verification — proceeding to visibility check before closing."
4. Write a new ledger entry with status: DEPLOYED_UNVERIFIED, deploy_sha: <current git SHA>
The pre-work check runs from a Claude Code PreToolUse hook on any TodoWrite call. Final-response generation is not a tool call, so covering it requires a separate hook on the Stop event.
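The four steps above might look like the sketch below. Two assumptions are worth flagging: `difflib.SequenceMatcher` is a lexical stand-in for the semantic similarity the memo calls for, and the `min_similarity` cutoff, the function names, the SHA, and the timestamp are hypothetical.

```python
from difflib import SequenceMatcher

OPEN_STATUSES = {"OPEN", "IN_PROGRESS", "DEPLOYED_UNVERIFIED"}


def similarity(a, b):
    """Lexical stand-in for semantic similarity (assumption: no embedding model)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pre_close_check(task_text, asks, deploy_sha, ts, min_similarity=0.6):
    """Return (new ledger entry, escalate flag), or (None, False) if nothing matches."""
    open_asks = [a for a in asks if a["status"] in OPEN_STATUSES]
    best = max(open_asks, key=lambda a: similarity(task_text, a["paraphrase"]),
               default=None)
    if best is None or similarity(task_text, best["paraphrase"]) < min_similarity:
        return None, False
    entry = {**best, "ts": ts, "status": "DEPLOYED_UNVERIFIED",
             "deploy_sha": deploy_sha}
    return entry, best["re_ask_count"] >= 2  # step 3: escalate repeat offenders


asks = [{"ask_id": "ASK-0001", "status": "OPEN", "re_ask_count": 4,
         "paraphrase": "Streaming text rotation animation must be visible"}]
entry, escalate = pre_close_check(
    "streaming text rotation animation must be visible", asks,
    deploy_sha="abc1234", ts="2026-04-23T16:00:00Z")  # hypothetical SHA and time
```

The returned entry is meant to be appended to the ledger, not written over the old one, so the earlier status remains in the history.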
After any deploy that touches a user-visible feature:
1. TITAN runs a Playwright screenshot of the live URL
2. Screenshot is diff'd against the last verified golden master stored at F:/TITAN/state/golden-masters/<feature_id>.png
3. If diff exceeds 1% pixel change in the target region, the change is considered "visible"
4. TITAN surfaces: "I can see [X change] in the live app at [URL]. Can you confirm you see this too?"
5. Only on explicit user confirmation ("yes", "looks good", "I see it") does the ledger entry transition to status: VERIFIED, verified_at: <ts>
Without that confirmation, the ask stays DEPLOYED_UNVERIFIED and is surfaced again at the next session start.
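The gate between DEPLOYED_UNVERIFIED and VERIFIED can be sketched as follows. This is an illustration under stated assumptions: images are modeled as flat pixel sequences (a real check would use the Playwright screenshots), the function names are hypothetical, and requiring both a visible diff and a confirmation is one reading of steps 3 to 5.

```python
CONFIRMATIONS = {"yes", "looks good", "i see it"}  # the phrases quoted in step 5


def changed_ratio(before, after):
    """Fraction of differing pixels between two equal-length pixel sequences."""
    if len(before) != len(after) or not before:
        raise ValueError("images must be non-empty and the same size")
    return sum(1 for b, a in zip(before, after) if b != a) / len(before)


def next_status(before, after, user_reply, ts, threshold=0.01):
    """VERIFIED only when the diff is visible AND the user explicitly confirms."""
    visible = changed_ratio(before, after) > threshold
    confirmed = (user_reply is not None
                 and user_reply.strip().lower() in CONFIRMATIONS)
    if visible and confirmed:
        return {"status": "VERIFIED", "verified_at": ts, "verified_by": "user"}
    return {"status": "DEPLOYED_UNVERIFIED", "verified_at": None}


golden = [0] * 100         # stand-in for golden-master pixels
live = [0] * 95 + [1] * 5  # 5% of pixels changed
unconfirmed = next_status(golden, live, None, "2026-04-23T16:05:00Z")
confirmed = next_status(golden, live, "Yes", "2026-04-23T16:05:00Z")
```

A visible diff alone never closes the ask; without a reply from the user the status stays DEPLOYED_UNVERIFIED.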
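At the start of every Claude Code session, the UserPromptSubmit hook does the following (Claude Code also has a SessionStart event, which fires once per session rather than on every prompt):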
1. Load harnoor-asks.jsonl, filter for status IN (OPEN, IN_PROGRESS, DEPLOYED_UNVERIFIED)
2. Sort by re_ask_count DESC (most-repeated asks surface first)
3. Prepend to the system context: "OPEN ASKS REQUIRING ATTENTION: [list with re_ask counts and last action]"
4. Cap at 5 items to avoid context bloat — if more than 5, surface the top 5 by re_ask_count * priority_weight
This ensures every session begins with full awareness of unfinished business, regardless of what was in the prior session.
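A sketch of the replay ranking, for illustration. The memo does not define priority_weight, so the `PRIORITY_WEIGHT` table is an assumption, as is the function name. For simplicity the sketch always ranks by re_ask_count * priority_weight, including when five or fewer asks are open.

```python
OPEN_STATUSES = {"OPEN", "IN_PROGRESS", "DEPLOYED_UNVERIFIED"}
PRIORITY_WEIGHT = {"P0": 4, "P1": 3, "P2": 2, "P3": 1}  # hypothetical weights


def replay_block(asks, limit=5):
    """Build the context preamble from the highest-ranked open asks."""
    open_asks = [a for a in asks if a["status"] in OPEN_STATUSES]
    open_asks.sort(
        key=lambda a: a["re_ask_count"] * PRIORITY_WEIGHT.get(a["priority"], 1),
        reverse=True,
    )
    lines = ["OPEN ASKS REQUIRING ATTENTION:"]
    for a in open_asks[:limit]:
        lines.append(f"- {a['ask_id']} (re-asks: {a['re_ask_count']}, "
                     f"status: {a['status']}): {a['paraphrase']}")
    return "\n".join(lines)


# Seven open asks plus one verified ask that must not be replayed (demo data).
asks = [{"ask_id": f"ASK-{i:04d}", "paraphrase": f"demo ask {i}", "priority": "P1",
         "status": "OPEN", "re_ask_count": i} for i in range(1, 8)]
asks.append({"ask_id": "ASK-9999", "paraphrase": "already closed", "priority": "P1",
             "status": "VERIFIED", "re_ask_count": 99})
block = replay_block(asks)
```

Printing `block` from the hook is one way to get it into context; the cap keeps the preamble at six lines regardless of how long the ledger grows.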
When a user submits a new message, TITAN checks it against the open asks. If semantic similarity > 0.85:
1. Increment re_ask_count on the matched ask
2. Append a new ledger entry: {...previous ask..., re_ask_count: N+1, ts: <now>}
---
Code deployment and user perception are different events. TITAN has consistently conflated them. The word-rotation animation was "deployed" multiple times — the code existed in the codebase — but the animation was not perceptible due to CSS conflicts, build failures, wrong branch, or production-vs-development environment mismatch. None of these failures were detectable by code review alone.
Playwright's visual comparison API (native since Playwright 1.20) supports:
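await expect(page).toHaveScreenshot('word-rotation-animation.png', {
  // Fail when more than 2% of pixels differ from the golden master.
  // (`threshold` is the per-pixel color tolerance, not a share of pixels.)
  maxDiffPixelRatio: 0.02,
  // Leave CSS animations running instead of fast-forwarding them.
  animations: 'allow'
});
The first run creates the golden master. Subsequent runs compare pixel-by-pixel, and the test fails when the share of differing pixels exceeds maxDiffPixelRatio. The output includes expected, actual, and diff images.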
For TITAN:
Golden masters: F:/TITAN/state/golden-masters/<feature_id>_<date>.png
Test files: F:/TITAN/verify/visual-tests/
Trigger: Bash tool calls containing npm run build or git push
On failure: VISIBLE_REGRESSION, ask stays DEPLOYED_UNVERIFIED
For any ask with re_ask_count >= 2, the visual test alone is insufficient. TITAN must request explicit user confirmation:
Template: "The Playwright test shows [feature] is rendering. Here is a screenshot of the live app: [path/URL]. Please confirm you can see [specific element] so I can close this ask."
This confirmation step is non-optional for high-re-ask items. It transfers the verification burden from inference to explicit acknowledgment.
For interactive features (animations, share flows, emotion tracking), a Playwright bot can:
1. Navigate to the live URL
2. Post a test message
3. Capture a video of the response animation firing
4. Compare the animation frame count against expected minimum (e.g., >= 30 frames for a 1s animation)
5. Log result to ledger
This is the highest-confidence verification — it proves the feature works end-to-end in the actual user environment, not just "the code is there."
---
Claude Code's context compaction runs at ~80% capacity (configurable). When it fires, the prior conversation is summarized by a secondary model call. That summary is optimized for factual content — past decisions, code changes, architecture choices. It is not optimized for open commitments with no delivery evidence. An ask buried in turn 47 of a 200-turn session, briefly mentioned and never followed up, will be summarized into nothing.
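The Claude Code PreToolUse hook can intercept any tool use that typically precedes compaction (large Bash commands, multi-file edits). Claude Code also has a PreCompact hook event, which fires directly before a compaction and could serve as a second trigger. Before proceeding, the hook should: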
1. Check if context usage > 70%
2. If yes, read harnoor-asks.jsonl for all asks with status IN (OPEN, IN_PROGRESS, DEPLOYED_UNVERIFIED)
3. Write a compact summary to ~/.claude/knowledge/memory/hot/feedback/OPEN-ASKS-SNAPSHOT-<date>.md in hot memory
4. Format: one line per open ask, sorted by re_ask_count, max 10 items
This ensures that post-compaction, the next context load will immediately surface the open asks through the hot memory tier, which is always loaded first.
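The hoist can be sketched as two small functions. How the hook learns the current token usage is not specified in this memo, so `used_tokens` and `max_tokens` are assumed inputs. The function names and the snapshot line format are hypothetical, and SUPERSEDED and CANCELLED asks are treated as closed.

```python
CLOSED = {"VERIFIED", "SUPERSEDED", "CANCELLED"}


def should_hoist(used_tokens, max_tokens, threshold=0.70):
    """True once context usage crosses the hoist threshold."""
    return used_tokens / max_tokens > threshold


def snapshot_markdown(asks, date, limit=10):
    """One line per open ask, most re-asked first, capped at `limit` items."""
    open_asks = sorted((a for a in asks if a["status"] not in CLOSED),
                       key=lambda a: a["re_ask_count"], reverse=True)
    header = f"# Open asks snapshot {date}"
    rows = [f"- {a['ask_id']} [{a['status']}] x{a['re_ask_count']}: {a['paraphrase']}"
            for a in open_asks[:limit]]
    return "\n".join([header, *rows]) + "\n"


# Twelve open demo asks; only the ten most re-asked make the snapshot.
demo = [{"ask_id": f"ASK-{i:04d}", "status": "OPEN", "re_ask_count": i,
         "paraphrase": f"demo ask {i}"} for i in range(1, 13)]
snapshot = snapshot_markdown(demo, "2026-04-23")
```

The returned text is what would be written to the OPEN-ASKS-SNAPSHOT file in hot memory.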
After any compaction event, TITAN should automatically run a synthesis check:
1. Compare the new summarized context against the hot-memory OPEN-ASKS snapshot
2. For any ask not mentioned in the new summary: prepend it to the working context as a system note
3. This prevents the compacted-context from being the only source of truth about open work
The synthesis check is essentially a merge conflict resolver for implicit state vs. explicit ledger state.
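A minimal sketch of that merge, assuming an ask counts as "mentioned" only if its ask_id appears verbatim in the summary. That is a deliberately strict rule, since a summary could paraphrase an ask without naming its ID. The function names and the note format are hypothetical.

```python
def missing_asks(summary, open_asks):
    """Open asks whose ID does not appear anywhere in the compacted summary."""
    return [a for a in open_asks if a["ask_id"] not in summary]


def restore(summary, open_asks):
    """Prepend a system note for every open ask the summary dropped."""
    dropped = missing_asks(summary, open_asks)
    if not dropped:
        return summary
    note = "\n".join(f"[system note] Open ask {a['ask_id']}: {a['paraphrase']}"
                     for a in dropped)
    return note + "\n\n" + summary


summary = "Discussed animations at length. ASK-0002 is in progress."
open_asks = [
    {"ask_id": "ASK-0001", "paraphrase": "word rotation animation must be visible"},
    {"ask_id": "ASK-0002", "paraphrase": "growing tree must render"},
]
restored = restore(summary, open_asks)
```

Erring on the strict side means an ask is occasionally restated twice, which is cheaper than dropping it.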
---
When re_ask_count >= 3 for any ask, TITAN must shift its behavior:
1. Acknowledge explicitly, without defensiveness. Not "I thought I handled that" — instead: "This has been asked 3 times and I have not successfully delivered it. I am treating this as a top-priority failure mode."
2. Surface a concrete fix path. Not "I'll try again" — instead: "I am going to [specific action] and will not close this ask until you confirm you can see [specific outcome] in the live app."
3. Request a verification checkpoint. Make the user a partner in closure, not a passive recipient of "done" claims.
When a re-ask counter hits 3+, TITAN uses this pattern:
> "I have logged this as ASK-[ID], re-ask #[N]. The prior [N-1] attempts were marked done without visual verification. I will not close this ask until: (a) a Playwright screenshot confirms [feature] is visible, and (b) you explicitly confirm you see it. Expected completion: [time estimate]."
This template was seeded by the "VERIFY-VISIBILITY" memo created 2026-04-23. It should be generalized into the ledger system so it fires automatically, not only when Harnoor explicitly surfaces a pattern.
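Firing the template automatically could be as simple as the sketch below. The function name and the `feature` and `eta` arguments are hypothetical; the wording is the template quoted above.

```python
ESCALATION = (
    "I have logged this as {ask_id}, re-ask #{n}. The prior {prior} attempts were "
    "marked done without visual verification. I will not close this ask until: "
    "(a) a Playwright screenshot confirms {feature} is visible, and "
    "(b) you explicitly confirm you see it. Expected completion: {eta}."
)


def escalation_message(ask, feature, eta, min_re_asks=3):
    """Render the escalation text once an ask has been repeated often enough."""
    n = ask["re_ask_count"]
    if n < min_re_asks:
        return None  # below the threshold: normal handling applies
    return ESCALATION.format(ask_id=ask["ask_id"], n=n, prior=n - 1,
                             feature=feature, eta=eta)


message = escalation_message({"ask_id": "ASK-0001", "re_ask_count": 4},
                             feature="the word rotation animation",
                             eta="today")  # hypothetical estimate
quiet = escalation_message({"ask_id": "ASK-0002", "re_ask_count": 2},
                           feature="the growing tree", eta="today")
```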
Each verified closure of a high-re-ask item is an opportunity to rebuild trust. When re_ask_count >= 3 and the ask reaches VERIFIED:
> "Ask [ID] — [paraphrase] — is now verified closed after [N] attempts. Previous attempts failed due to [logged root cause]. I've added a [specific fix] to prevent this class of failure going forward."
The explicit acknowledgment of the failure pattern, combined with the concrete prevention step, signals that the system is learning — not just completing tasks.
---
Scoring: Impact (1-5) × Buildability (1-5, the inverse of effort) = Priority Score
---
H1 (Ask Ledger). Description: Create F:/TITAN/state/harnoor-asks.jsonl as an append-only JSON Lines file with the schema in Section D. Backfill the five known re-ask items immediately.
Effort: S (2-4 hours). The schema is defined. The file can be written by hand in 30 minutes and maintained by TITAN going forward.
Blast radius: All future asks — every task TITAN handles will be trackable and re-ask detectable.
Prereqs: None. Can be done in the current session.
Concrete action:
F:/TITAN/state/harnoor-asks.jsonl — create now, backfill ASK-0001 through ASK-0005
---
H2 (Session-Start Replay). Description: At every session start, TITAN reads the ask ledger and prepends unverified asks to context. Implemented as a Claude Code UserPromptSubmit hook in ~/.claude/hooks/session-start-replay.py.
Effort: S (1-2 hours). One Python script, reads JSONL, formats output, injects into pre-tool context.
Blast radius: Every session. Immediate.
Prereqs: H1 (ask ledger must exist).
Concrete action:
# ~/.claude/hooks/session-start-replay.py
# Reads F:/TITAN/state/harnoor-asks.jsonl
# Filters status in (OPEN, IN_PROGRESS, DEPLOYED_UNVERIFIED)
# Sorts by re_ask_count DESC
# Prints top 5 to stdout for context injection
---
H3 (Deployed-Unverified Status). Description: Add a DEPLOYED_UNVERIFIED ledger status. Any time TITAN deploys a user-visible feature, it must transition the ask to this status and explicitly ask the user to confirm visibility before moving to VERIFIED.
Effort: S (under 2 hours). Protocol change + ledger write. No tooling required beyond the ledger.
Blast radius: All visual feature asks. Eliminates the "shipped vs. visible" conflation.
Prereqs: H1.
Concrete action: TITAN adds to its response protocol: after any deploy, always write ledger entry status: DEPLOYED_UNVERIFIED and ask "Can you confirm you see [X] in the live app?"
---
H4. Description: Implement a Claude Code PreToolUse hook that fires when context usage crosses 70%, reads open asks from the ledger, and writes a snapshot to ~/.claude/knowledge/memory/hot/feedback/OPEN-ASKS-SNAPSHOT-<date>.md.
Effort: M (2-4 hours). Requires context-usage detection, file read/write, and hot-memory format compliance.
Blast radius: All sessions that run long. Prevents compaction-driven ask loss.
Prereqs: H1, H2.
Concrete action:
Hook file: ~/.claude/hooks/pre-compaction-hoist.py
Output: ~/.claude/knowledge/memory/hot/feedback/OPEN-ASKS-SNAPSHOT-2026-04-23.md
---
H5. Description: When a new user message semantically matches an open ask, increment re_ask_count in the ledger and trigger the failure acknowledgment template (Section G2) for any ask where re_ask_count >= 3.
Effort: M (3-5 hours). Requires semantic similarity matching — can use simple embedding or keyword overlap as a first approximation.
Blast radius: All repeat asks. Prevents silent "I'll try again" behavior on high-re-ask items.
Prereqs: H1.
Concrete action: Add re-ask detection to TITAN's response generation pre-hook. Match against open asks; if similarity > 0.85 (the threshold used in Section D), increment counter and fire escalation template.
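A first approximation of the detector, using the "keyword overlap" fallback the Effort line allows. `difflib.SequenceMatcher` measures lexical similarity, not semantic similarity, so it will miss paraphrased re-asks. The function name is hypothetical and the default threshold follows the 0.85 figure in the ledger specification.

```python
from difflib import SequenceMatcher


def detect_re_ask(message, open_asks, now, threshold=0.85):
    """Return an incremented copy of the best-matching open ask, or None."""
    def score(ask):
        return SequenceMatcher(None, message.lower(), ask["verbatim"].lower()).ratio()

    best = max(open_asks, key=score, default=None)
    if best is None or score(best) <= threshold:
        return None
    return {
        **best,
        "ts": now,
        "re_ask_count": best["re_ask_count"] + 1,
        "re_ask_history": [*best.get("re_ask_history", []), now],
    }


open_asks = [{"ask_id": "ASK-0001", "re_ask_count": 2, "status": "OPEN",
              "verbatim": "make the word rotation animation actually visible"}]
hit = detect_re_ask("Make the word rotation animation actually visible!",
                    open_asks, now="2026-04-23T17:00:00Z")
miss = detect_re_ask("zzz", open_asks, now="2026-04-23T17:00:00Z")
needs_escalation = hit is not None and hit["re_ask_count"] >= 3
```

The returned dictionary is a new ledger line to append; the stored entry is left untouched, in keeping with the append-only rule.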
---
H6. Description: Create F:/TITAN/verify/visual-tests/ with Playwright tests for each user-visible feature. Golden masters stored at F:/TITAN/state/golden-masters/. Run on every deploy.
Effort: L (4-8 hours to bootstrap, then ongoing per feature).
Blast radius: All visual features. Eliminates "code exists but not visible" failures.
Prereqs: Node.js, Playwright installed. H3 for status integration.
Concrete action:
F:/TITAN/verify/visual-tests/animation-word-rotation.spec.ts
F:/TITAN/verify/visual-tests/bubbles-cap.spec.ts
F:/TITAN/verify/visual-tests/growing-tree.spec.ts
---
H7. Description: When a user reverses an earlier direction, the ledger records superseded_by: ASK-XXXX on the original ask and creates a new entry. This prevents both the old and new direction from being "active" simultaneously.
Effort: S (1-2 hours). Schema change + write procedure.
Blast radius: Any ask that conflicts with a prior ask.
Prereqs: H1.
Concrete action: Add supersedes and superseded_by fields to ask schema. When TITAN detects a reversal ("actually, don't do X, do Y"), it creates ASK-N+1 with supersedes: ASK-N and sets ASK-N status: SUPERSEDED.
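The write procedure can be sketched as a pair of entries appended together. The function names, the ask IDs, and the starting re_ask_count of 0 are assumptions made for the example.

```python
def supersede(old, new_id, verbatim, ts):
    """Return (closing entry for the old ask, entry for the replacement ask)."""
    closed = {**old, "ts": ts, "status": "SUPERSEDED", "superseded_by": new_id}
    replacement = {
        "ask_id": new_id, "ts": ts, "verbatim": verbatim, "status": "OPEN",
        "re_ask_count": 0, "supersedes": old["ask_id"], "superseded_by": None,
    }
    return closed, replacement


def active(entries):
    """Latest entry per ask_id, excluding anything that has been superseded."""
    latest = {}
    for e in entries:
        latest[e["ask_id"]] = e
    return [e for e in latest.values() if e["status"] != "SUPERSEDED"]


# Hypothetical IDs: the cap request is reversed by a later ask.
old = {"ask_id": "ASK-0004", "ts": "2026-04-10T09:00:00Z", "status": "OPEN",
       "verbatim": "cap the bubbles at 3", "re_ask_count": 1,
       "supersedes": None, "superseded_by": None}
log = [old]
closed, replacement = supersede(old, "ASK-0006", "actually, skip the bubbles cap",
                                ts="2026-04-23T18:00:00Z")
log += [closed, replacement]
active_ids = [e["ask_id"] for e in active(log)]
```

After the two appends only the replacement is active, while the link back to the original direction survives in both entries.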
---
H8. Description: A daily scheduled task reads the ask ledger and generates a one-line summary of open asks, delivered at session start. Stores output in ~/.claude/scheduled-tasks/ask-digest-<date>.md.
Effort: S (1-2 hours). Extend existing scheduled-task infrastructure.
Blast radius: Every day. Provides passive awareness without requiring active query.
Prereqs: H1, existing scheduled-task infrastructure.
Concrete action:
~/.claude/scheduled-tasks/ask-digest.py
Runs: daily at 09:00 local
Output: ~/.claude/scheduled-tasks/ask-digest-2026-04-23.md
---
H9. Description: For asks that were elevated to "prime directive" status (like emotion-as-persistent-state), create a separate ledger category: type: PRIME_DIRECTIVE. These asks can never be auto-closed; they require explicit review at each major feature change.
Effort: M (add ledger type, add review check to all feature-deploy hooks).
Blast radius: Any prime directive ask.
Prereqs: H1, H3.
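One possible shape for the review check is sketched below. The function names are hypothetical, the ask ID is invented for the example, and keeping VERIFIED prime directives in the review queue is an assumption about what "review at each major feature change" means.

```python
def can_auto_close(ask):
    """Prime directives are never eligible for automatic closure."""
    return ask.get("type") != "PRIME_DIRECTIVE"


def review_queue(asks, is_major_change):
    """IDs of prime directives to re-review when a deploy is a major change."""
    if not is_major_change:
        return []
    return [a["ask_id"] for a in asks
            if a.get("type") == "PRIME_DIRECTIVE" and a["status"] != "CANCELLED"]


asks = [
    {"ask_id": "ASK-0001", "status": "DEPLOYED_UNVERIFIED"},
    {"ask_id": "ASK-0005", "status": "VERIFIED", "type": "PRIME_DIRECTIVE"},
]
to_review = review_queue(asks, is_major_change=True)
nothing = review_queue(asks, is_major_change=False)
```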
---
H10. Description: A Playwright bot that navigates to the live Trillionaire app, posts a test message, and verifies interactive features (animations fire, bubbles cap at 3, growing tree renders) by inspecting the DOM and capturing video. Highest-confidence verification.
Effort: L (8-16 hours to build a robust bot). High value but non-trivial.
Blast radius: All interactive features in the live app.
Prereqs: H6, stable production URL.
---
| Rank | Fix | Time | Key Action |
|------|-----|------|------------|
| 1 | Ask Ledger (H1) | 30 min | Create F:/TITAN/state/harnoor-asks.jsonl, backfill 5 known re-asks |
| 2 | Session-Start Replay (H2) | 60-90 min | Write ~/.claude/hooks/session-start-replay.py |
| 3 | Deployed-Unverified Status (H3) | 30 min | Protocol change: never say "done" without asking "can you see it?" |
---
1. Why Your AI Assistant Keeps Forgetting (And How to Fix It) — ForgeWorkflows, accessed 2026-04-23
2. Cognition | Devin's 2025 Performance Review — Cognition AI, 2025
3. HZL: A Task Ledger for AI Agents — Trevin Chow, accessed 2026-04-23
4. Effective context engineering for AI agents — Anthropic Engineering, accessed 2026-04-23
5. AI Agent Anti-Patterns (Part 1) — Allen Chan, Medium, March 2026
6. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory — arXiv, April 2025
7. Mem0 — The Memory Layer for your AI Apps — accessed 2026-04-23
8. Linear as an AI Task Hub for Agent Workflows — ClawList, accessed 2026-04-23
9. Context Rot: Why LLMs Degrade as Context Grows — Morph, 2025
10. Building an internal agent: Context window compaction — Will Larson (Irrational Exuberance), accessed 2026-04-23
11. Playwright Visual Regression Testing — Playwright official docs, accessed 2026-04-23
12. Claude Memory: A Deep Dive — Skywork AI, accessed 2026-04-23
13. AI on My Shoulder: Supporting Emotional Labor in Front-Office — CHI 2025
14. Building Durable AI Agents with Restate + Vercel AI SDK — Restate, accessed 2026-04-23
15. Solving Context Window Overflow in AI Agents — arXiv, 2024
16. Why Your Chatbot UX Is Annoying Users (and How to Fix It) — Clutch.co, accessed 2026-04-23
17. UX Research on Conversational Human-AI Interaction — ACM DL, accessed 2026-04-23
18. Multi-Session Study of UX Evaluators Collaborating with Conversational AI — arXiv, 2026