Testing & Iteration Log

External testing, stress tests, and the version-by-version fixes that shaped the system

Testing Approach

Email Brain was tested through two channels: internal daily use (running scheduled tasks against a real inbox with 18 active contacts) and external first-user testing (a colleague installing from scratch and running the system independently, including intentional stress tests).

External testing proved more valuable. Setup gaps were invisible to me because I already knew how the system worked; first-user testing revealed what a new user actually experiences.

Version Timeline

v0.1.0 — March 3, 2026 (Initial release)

Five modes operational: Draft, Inbox Scan, Daily Briefing, Decision Extraction, Resource Scanner. Notion context system connected. Gmail integration working. Pre-draft email filtering and basic context retrieval in place.

v0.2.0 — March 6, 2026 (Critical bug fix)

Discovered that Notion's semantic search API was silently returning partial results — 10 of 18 contacts, with no error or warning. Eight active clients were simply excluded from every scan. Root cause: platform limitation in Notion's search, not a logic error.

Fix: the Complete Contact Retrieval Protocol, which requires a mandatory direct database fetch, local filtering, fallback searches by client code, count verification, and deduplication. After the fix, all 18 contacts were retrieved consistently.

This was the most important bug found during the entire project. Silent partial retrieval is a critical failure mode for any AI system that depends on retrieval — it doesn't break loudly, it just quietly misses things.
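
The steps of the Complete Contact Retrieval Protocol can be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's actual implementation: the record shape and the names retrieve_all_contacts, fetch_database, and search_by_code are hypothetical stand-ins for whatever Notion calls the system really makes.

```python
from typing import Callable, Iterable

# Hypothetical record shape: {"id": str, "client_code": str, "active": bool}
Contact = dict


def retrieve_all_contacts(
    fetch_database: Callable[[], Iterable[Contact]],
    search_by_code: Callable[[str], Iterable[Contact]],
    known_codes: Iterable[str],
    expected_count: int,
) -> list[Contact]:
    """Return every active contact, or raise if the final count is off."""
    # Direct database fetch plus local filtering. Keying by id deduplicates.
    found = {c["id"]: c for c in fetch_database() if c.get("active")}

    # Fallback: search by client code for any code not yet represented.
    seen_codes = {c["client_code"] for c in found.values()}
    for code in known_codes:
        if code in seen_codes:
            continue
        for c in search_by_code(code):
            if c.get("active"):
                found.setdefault(c["id"], c)

    # Count verification: fail loudly instead of returning a partial set.
    if len(found) != expected_count:
        raise RuntimeError(
            f"Expected {expected_count} contacts, retrieved {len(found)}"
        )
    return list(found.values())
```

The design point is the final check: a retrieval step that raises on a count mismatch turns a silent partial result into a visible failure.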

v0.3.0 — March 6, 2026 (First external feedback round)

Brett installed the system from scratch and ran his first morning scan. Findings:

Gmail draft threading — Drafts weren't landing in the correct email thread. The skill was creating drafts without passing the thread identifier. Fixed by adding threadId to the gmail_create_draft call. Verified in the smoke test checklist.
Permission prompts — The "always run" permission prompts weren't clearly flagged as something the user needed to approve during setup. Added a guided test run step so users see and approve these prompts before enabling scheduled tasks.
Task monitor sidebar — Collapsed by default in the app. Users couldn't see what was running. Added an explicit step to open the sidebar during setup.
[VERIFY] tag visibility — These flags for human review weren't prominent enough. Users need to actually see them before hitting send.
Token usage documentation — Added expectations (~25-30% of Pro plan on first run, lighter daily) so users aren't surprised by consumption.
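
The draft threading fix above can be illustrated with a minimal sketch of a threaded draft payload. The exact parameters of gmail_create_draft are not shown in this log, so this example instead follows the request body shape of the Gmail REST API's drafts.create method, on the assumption that the tool forwards threadId in a similar way. The function name build_threaded_draft and all sample values are hypothetical.

```python
import base64
from email.message import EmailMessage


def build_threaded_draft(
    thread_id: str, in_reply_to: str, to: str, subject: str, body: str
) -> dict:
    """Build a drafts.create-style request body that lands in an existing thread."""
    msg = EmailMessage()
    msg["To"] = to
    # Gmail expects the subject to match the thread's subject.
    msg["Subject"] = subject
    # Reply headers point at the Message-ID of the email being answered.
    msg["In-Reply-To"] = in_reply_to
    msg["References"] = in_reply_to
    msg.set_content(body)

    # The Gmail API takes the full RFC 2822 message, base64url-encoded.
    raw = base64.urlsafe_b64encode(msg.as_bytes()).decode("ascii")

    # Without threadId, the draft is created as a new conversation.
    return {"message": {"raw": raw, "threadId": thread_id}}
```

A smoke test for this fix would create a draft reply and confirm that it appears inside the original conversation, not as a separate one.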