Wait at least 30 minutes before retrying ANY boto3-using script or AWS CLI command (titan_email.send, aws_cost_daily.py, aws sts get-caller-identity, anything that imports boto3) when output is silent. Credential resolution can take 15-20+ minutes on cold start on this Windows host.
Why: On 2026-05-03, the titan-daily-newsletter scheduled task hung silently during SES send for ~18 minutes on a cold-start boto3 invocation. The orchestrator (me) declared failure at the 10-15 minute mark, retried the script, and sent Issue #015 to Harnoor's inbox twice (MessageIds 0100019dee3c5630-... and 0100019dee3e555d-...). Both AWS CLI and boto3 client creation exhibited the same hang — likely IMDS lookup or SSO refresh holding the credential chain. The retry was premature because the first invocation was still in-flight.
Diagnostic update (2026-05-03): Root cause investigation found:
- get_send_quota() call took 1258 seconds (20+ min)
- boto3.client('ses') instantiation took 1219 seconds
- AWS_EC2_METADATA_DISABLED=true took 1323 seconds
- ... /ship for deep-dive)

How to apply: Any caller of titan_email.send (especially scheduled tasks like /newsletter, /briefing, agent report senders) should log the start time, monitor output, and only retry after 30+ minutes of confirmed no-output silence. Prefer a single foreground invocation with an extended timeout (30+ min) over aggressive retry logic on cold-start SES calls. Do NOT attempt the AWS_EC2_METADATA_DISABLED workaround: it does not resolve the hang.
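The "log the start time, run once, wait 30+ minutes" rule could be sketched roughly as below. This is a minimal illustration, not the actual orchestrator code: the helper name run_once, the returned dict shape, and the example command are all assumptions made for this note.

```python
import subprocess
import sys
import time
from datetime import datetime, timezone

# 30 minutes, matching the minimum silence window in this note.
COLD_START_TIMEOUT_S = 30 * 60


def run_once(cmd, timeout_s=COLD_START_TIMEOUT_S):
    """Run cmd exactly once in the foreground. Hypothetical helper.

    Silence is never treated as failure: the only exits are the child
    finishing on its own or the full timeout elapsing.
    """
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    print(f"[{started}] starting: {cmd}", flush=True)
    t0 = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point,
        # so a retry can no longer overlap with the first attempt.
        elapsed = time.monotonic() - t0
        print(f"no result after {elapsed:.0f}s", flush=True)
        return {"status": "timeout", "elapsed_s": elapsed,
                "returncode": None, "stdout": ""}
    elapsed = time.monotonic() - t0
    status = "ok" if proc.returncode == 0 else "failed"
    print(f"finished with {status} after {elapsed:.0f}s", flush=True)
    return {"status": status, "elapsed_s": elapsed,
            "returncode": proc.returncode, "stdout": proc.stdout}
```

A caller might use it as `run_once([sys.executable, "aws_cost_daily.py"])` (the exact command line is assumed here) and consider a retry only when the returned status is "timeout".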
Repeat incident 2026-05-03 (aws-cost-daily-150-cap scheduled task): I killed aws_cost_daily.py at 60s and 120s thinking it was hung. It wasn't — it completed successfully ~10+ minutes later (tier=GREEN, MTD=$11.78). Then I sent a "fallback failure email" via titan_email.send while the original was still in flight, then a third retraction email. Result: three emails to Harnoor's inbox where one would have sufficed. The lesson is the same as the original incident — silence is not failure on this host. Default behavior for ANY boto3 script: launch in background, set the timeout to 30 minutes, do NOT kill early, do NOT start a fallback while the first is still running.
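One way to enforce "do NOT start a fallback while the first is still running" is an in-flight marker file that any would-be retry or fallback sender must check first. The sketch below is an assumption-laden illustration: the function names try_acquire and release, the lock path, and the use of file modification time as the start time are all hypothetical, not part of the existing scripts.

```python
import os
import time

# A run younger than this is presumed still in flight, even if silent.
IN_FLIGHT_WINDOW_S = 30 * 60


def try_acquire(lock_path, now=None, window_s=IN_FLIGHT_WINDOW_S):
    """Return True if the caller may start a run, False if one is in flight."""
    now = time.time() if now is None else now
    try:
        # O_EXCL makes creation atomic: only one caller can create the file.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        age = now - os.path.getmtime(lock_path)
        if age < window_s:
            # An earlier run started less than 30 minutes ago: do not
            # retry, do not kill it, do not send a fallback email.
            return False
        # The marker is older than the window; take it over.
        os.utime(lock_path, (now, now))
        return True


def release(lock_path):
    """Remove the marker once the run has finished."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

Under this scheme the scheduled task would call try_acquire before launching the script and release after it exits, and a fallback sender that gets False back would simply wait instead of emailing.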