The Email Triage Agent (Phase 2, v0.23.0) needs a synthetic .mbox dataset to validate the triage engine, MIME parsing, email threading, and categorization logic without requiring real email accounts or OAuth setup. This unblocks offline development and CI testing of the email pipeline described in the email & calendar integration plan.
Why
No real inboxes in CI — tests cannot depend on live IMAP/OAuth credentials
Reproducible triage evaluation — ground-truth labels let us measure categorization accuracy against the 80% target
MIME edge cases — multipart, attachments, inline images, non-UTF-8 encodings, and nested forwarding need coverage upfront
Threading correctness — In-Reply-To / References chains must parse correctly for conversation grouping
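To make the threading requirement concrete, here is a minimal sketch of grouping parsed messages into conversations. It assumes the first entry in References identifies the thread root, and falls back to In-Reply-To and then to the message's own Message-ID. The names thread_root and group_threads are illustrative and do not come from the plan.

```python
from collections import defaultdict
from email.message import Message


def thread_root(msg: Message) -> str:
    """Best-effort thread key: first References entry, else In-Reply-To, else own ID."""
    references = str(msg.get("References", "")).split()
    if references:
        return references[0]
    parent = str(msg.get("In-Reply-To", "")).strip()
    # Assumption: without References, the direct parent is the best available root guess.
    return parent or str(msg.get("Message-ID", "")).strip()


def group_threads(messages: list[Message]) -> dict[str, list[str]]:
    """Map each thread key to the Message-IDs that belong to it, in input order."""
    threads: dict[str, list[str]] = defaultdict(list)
    for msg in messages:
        threads[thread_root(msg)].append(str(msg.get("Message-ID", "")).strip())
    return dict(threads)
```

A real grouping pass would probably also need to handle replies whose parent is missing from the mailbox; this sketch simply keys them by whatever ID they reference.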
Proposed Dataset
Generate a synthetic .mbox file (Python mailbox + email stdlib) with ~200-250 messages covering a realistic inbox mix.
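As a rough illustration of the stdlib approach, the sketch below builds one plain-text message and writes it to an .mbox file. The helper names build_message and write_mbox, the fixed Date, and the Message-ID value are assumptions made for the example, not part of the proposal.

```python
import mailbox
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a plain-text message with fixed headers so output is reproducible."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg["Date"] = "Mon, 06 Jan 2025 09:05:00 +0000"  # hypothetical fixed timestamp
    msg["Message-ID"] = "<root-0001@acme-corp.example.com>"  # hypothetical ID
    msg.set_content(body)
    return msg


def write_mbox(path: str, messages: list[EmailMessage]) -> None:
    """Append messages to an mbox file (created if missing) and close it."""
    box = mailbox.mbox(path)
    try:
        for msg in messages:
            box.add(msg)
        box.flush()
    finally:
        box.close()
```

Reading the result back with mailbox.mbox(path) and iterating over it is a quick way to confirm the file is well formed.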
Triage categories (ground truth labels in metadata)
The plan (section 4.2) defines 4 triage categories. This dataset adds a pre-triage spam/phishing class — the triage engine should filter these before categorization, not treat them as a 5th category. Ground truth marks them with is_spam: true / is_phishing: true separately from category.
| Category | Count | Examples |
| --- | --- | --- |
| Urgent | ~20 | Boss escalation, client contract deadline, prod incident alert, security advisory |
| Actionable | ~45 | PR review requests, meeting invites, direct questions, expense approvals |
| Informational | ~55 | Newsletters, order receipts, shipping notifications, internal announcements |
| Low priority | ~30 | Marketing promos, social media notifications, cold outreach |
Messages within each category are drawn from the corporate, personal, and spam flavors described below — the flavors are content types within categories, not separate pools.
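The separation between the pre-triage filter and the four categories could look roughly like the sketch below. The lowercase category strings, the "filtered" bucket name, and the function name triage_bucket are assumptions chosen for illustration; the issue only fixes the field names category, is_spam, and is_phishing.

```python
# Assumed spellings for the four triage categories from the plan.
CATEGORIES = {"urgent", "actionable", "informational", "low_priority"}


def triage_bucket(entry: dict) -> str:
    """Return 'filtered' for spam/phishing, otherwise the entry's triage category."""
    # Spam and phishing are removed before categorization, so they never
    # compete with the four triage categories.
    if entry.get("is_spam") or entry.get("is_phishing"):
        return "filtered"
    category = entry["category"]
    if category not in CATEGORIES:
        raise ValueError(f"unknown triage category: {category!r}")
    return category
```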
Recurring sender personas
Define 8-10 recurring senders so threading and sender-importance learning can be tested across multiple messages from the same person:
| Persona | Domain | Role | Typical category |
| --- | --- | --- | --- |
| Sarah Chen (boss) | @acme-corp.example.com | VP Engineering | Urgent / Actionable |
| Alex Kumar (direct report) | @acme-corp.example.com | Senior Engineer | Actionable |
| Jordan Lee (cross-team) | @acme-corp.example.com | Product Manager | Actionable / Informational |
| IT Systems | noreply@acme-corp.example.com | Automated | Informational |
| HR Team | hr@acme-corp.example.com | Automated | Informational |
| Maria Santos (client) | @globaltech.example.net | External partner | Urgent / Actionable |
| DevOps Bot | alerts@acme-corp.example.com | CI/CD / PagerDuty | Urgent / Informational |
| Newsletter senders | various @*.example.com | Marketing / news | Informational / Low priority |
Each recurring sender should appear in 3-8 messages (mix of thread roots and replies) to enable sender-frequency and response-pattern analysis.
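One way to check the 3-8 messages-per-sender rule is to count the sender_persona field across ground-truth entries. This is only a sketch: it assumes the manifest is a dict keyed by Message-ID whose values carry sender_persona, and the function name personas_out_of_range is made up for the example.

```python
from collections import Counter


def personas_out_of_range(manifest: dict[str, dict], low: int = 3, high: int = 8) -> dict[str, int]:
    """Return personas whose message count falls outside [low, high]."""
    counts = Counter(
        entry["sender_persona"]
        for entry in manifest.values()
        if entry.get("sender_persona")  # one-off senders may have no persona
    )
    return {persona: n for persona, n in counts.items() if not low <= n <= high}
```

An empty result would mean every recurring persona appears often enough for sender-frequency analysis without dominating the inbox.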
Commercial spam — unsolicited product pitch, SEO services, fake invoice attachment, "limited time offer"
Social engineering — fake LinkedIn connection, spoofed coworker name with external domain, urgent wire transfer request from "CEO"
Delivery scams — fake UPS/FedEx tracking, "package held at customs", Amazon order you did not place
Spam messages should include realistic spam signals: mismatched From display name vs address, suspicious domains, urgency language, misspellings, suspicious attachment names, missing List-Unsubscribe, etc.
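The sketch below shows how a generator might plant two of these signals (a spoofed display name on an external domain and a Reply-To mismatch), along with small helpers a test could use to confirm they are present. All addresses, the misspelled body text, and the helper names are illustrative assumptions.

```python
from email.message import EmailMessage
from email.utils import formataddr, parseaddr


def domain_of(header_value: str) -> str:
    """Extract the lowercase domain from an address header value."""
    _, address = parseaddr(header_value)
    return address.rpartition("@")[2].lower()


def build_ceo_wire_phish() -> EmailMessage:
    """Phishing sample: known display name, look-alike domain, diverging Reply-To."""
    msg = EmailMessage()
    msg["From"] = formataddr(("Sarah Chen", "s.chen@acme-c0rp.example.net"))
    msg["Reply-To"] = "payments@wire-desk.example.org"
    msg["To"] = "user@acme-corp.example.com"
    msg["Subject"] = "URGENT: wire transfer needed today"
    msg.set_content("Please proccess the attached invoice imediately.")  # deliberate typos
    return msg


def reply_to_mismatch(msg: EmailMessage) -> bool:
    """True when Reply-To points at a different domain than From."""
    reply_to = msg.get("Reply-To")
    return bool(reply_to) and domain_of(str(reply_to)) != domain_of(str(msg["From"]))


def display_name_spoofed(msg: EmailMessage, known_domains: dict[str, str]) -> bool:
    """True when a known persona's display name is used with an unexpected domain."""
    name, _ = parseaddr(str(msg["From"]))
    expected = known_domains.get(name)
    return expected is not None and domain_of(str(msg["From"])) != expected
```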
Personal / consumer email types (~30 messages)
Order confirmations and shipping updates (Amazon, retailer receipts)
Intentionally hard-to-classify emails that stress-test the triage engine decision boundaries. Mark these with ambiguous: true in ground truth, plus a rationale field explaining the intended classification:
Meeting invite from unknown external contact (Actionable or Low priority?)
Vendor invoice with no prior relationship (Urgent or Informational?)
Automated JIRA ticket from a project the user is not on (Actionable or Informational?)
Newsletter from a tool you actively use vs. one signed up for and forgotten (Informational or Low priority?)
"Quick question" from someone never emailed before (Actionable or Low priority?)
Internal compliance email that requires acknowledgment by EOD (Urgent or Actionable?)
Reply-all on a thread where the user was Cc'd but not addressed (Actionable or Informational?)
Malformed / edge-case messages (~5-10 messages)
Real inboxes contain broken email. Include parser-robustness tests:
Realistic Date headers spanning ~2 weeks, clustered around 9 AM and 4-5 PM on weekdays to simulate real arrival patterns, with a batch arriving overnight Saturday to Monday for "overnight triage" testing
Proper Message-ID, In-Reply-To, References for threading
Mix of To, Cc, Bcc patterns (including large Cc lists for corporate threads)
List-Unsubscribe headers on marketing/newsletter emails
X-Priority / Importance headers on some urgent emails
Corporate email disclaimers in footers ("This email is confidential...")
Reply-To mismatches on phishing emails
Received header chains simulating realistic relay paths
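For the Date clustering, a seeded generator along the following lines would keep the fixture reproducible. It is a sketch under stated assumptions: the peaks are modeled as Gaussians centered on 9:00 and 16:30 with a 45-minute spread, only weekdays are sampled, and the weekend "overnight" batch is left out. The function name clustered_dates is hypothetical.

```python
import random
from datetime import datetime, timedelta
from email.utils import format_datetime


def clustered_dates(seed: int, count: int, start: datetime, days: int = 14) -> list[str]:
    """Return `count` RFC 5322 Date strings on weekdays, clustered near 9:00 and 16:30."""
    rng = random.Random(seed)  # seeded so repeated runs give identical output
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    candidates = [midnight + timedelta(days=offset) for offset in range(days)]
    weekdays = [day for day in candidates if day.weekday() < 5]
    peaks = (9.0, 16.5)  # assumed peak hours
    stamps = []
    for _ in range(count):
        day = rng.choice(weekdays)
        hour = rng.gauss(rng.choice(peaks), 0.75)
        hour = min(max(hour, 0.0), 23.99)  # clamp so the stamp stays on the chosen day
        stamps.append(day + timedelta(hours=hour))
    return [format_datetime(stamp) for stamp in sorted(stamps)]
```

Passing a timezone-aware start (for example UTC) yields explicit offsets such as +0000 in the generated headers.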
Deliverables
Generator script — tests/fixtures/email/generate_mbox.py that produces the .mbox deterministically (seeded RNG)
Pre-built fixture — tests/fixtures/email/synthetic_inbox.mbox checked into repo (must be under 1 MB)
Ground truth manifest — tests/fixtures/email/ground_truth.json mapping Message-ID to { category, priority, is_thread_root, thread_id, has_attachment, is_spam, is_phishing, ambiguous, rationale, sender_persona }
Pytest fixtures — in tests/fixtures/email/conftest.py:
synthetic_mbox — loads the .mbox file
email_ground_truth — loads the ground truth JSON
single_email(category) — returns one message of a given triage category
spam_emails — returns all spam/phishing messages for filter testing
ambiguous_emails — returns borderline messages for boundary testing
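The fixtures listed above could be thin wrappers around plain helper functions such as these. The sketch assumes the manifest layout from the ground truth deliverable (a JSON object keyed by Message-ID); the helper names are illustrative, and the pytest decorators are omitted so the code runs without pytest installed.

```python
import json
from pathlib import Path


def load_ground_truth(path: Path) -> dict[str, dict]:
    """Load the ground truth manifest keyed by Message-ID."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def is_filtered(entry: dict) -> bool:
    """True for messages the pre-triage filter should remove."""
    return bool(entry.get("is_spam") or entry.get("is_phishing"))


def spam_ids(manifest: dict[str, dict]) -> list[str]:
    """Message-IDs of all spam/phishing entries, sorted for stable test order."""
    return sorted(mid for mid, entry in manifest.items() if is_filtered(entry))


def ambiguous_ids(manifest: dict[str, dict]) -> list[str]:
    """Message-IDs of all entries flagged ambiguous."""
    return sorted(mid for mid, entry in manifest.items() if entry.get("ambiguous"))


def single_id(manifest: dict[str, dict], category: str) -> str:
    """First (by sorted Message-ID) non-filtered entry in the given category."""
    for mid in sorted(manifest):
        entry = manifest[mid]
        if entry.get("category") == category and not is_filtered(entry):
            return mid
    raise LookupError(f"no message with category {category!r}")
```

In conftest.py, each helper could then be exposed through a pytest fixture that resolves the returned Message-IDs against the loaded .mbox.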
Acceptance criteria
python tests/fixtures/email/generate_mbox.py produces a valid .mbox readable by Python mailbox.mbox()
All 4 triage categories represented with ground truth labels, plus separate is_spam/is_phishing flags (spam is a pre-triage filter, not a 5th category)
Corporate email types cover IT, HR, exec comms, cross-team threads, and automated systems
Spam/phishing emails include realistic indicators (domain mismatches, urgency, suspicious attachments)
~15 ambiguous/borderline messages included with ambiguous: true and rationale in ground truth
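Since the ground truth exists to measure categorization accuracy against the 80% target, a scoring helper might look like the sketch below. It assumes spam and phishing entries are excluded from the accuracy denominator because they are filtered before triage; that choice, and the function name categorization_accuracy, are assumptions rather than something the issue specifies.

```python
def categorization_accuracy(predicted: dict[str, str], manifest: dict[str, dict]) -> float:
    """Fraction of non-filtered messages whose predicted category matches ground truth."""
    scored = [
        mid
        for mid, entry in manifest.items()
        if not (entry.get("is_spam") or entry.get("is_phishing"))
    ]
    if not scored:
        raise ValueError("manifest contains no triage-eligible messages")
    # A missing prediction counts as wrong rather than being skipped.
    correct = sum(1 for mid in scored if predicted.get(mid) == manifest[mid]["category"])
    return correct / len(scored)
```

A test could then assert that the returned value is at least 0.8 for the triage engine under evaluation.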
Corporate / enterprise email types (~60-70 messages)
The primary target is enterprise users on Outlook/Exchange. These messages are distributed across the triage categories above:
[...] Cc chains with 4-6 participants, "loop in [name]" forwards, top-posted reply style
[...] .ics), meeting cancellation, room booking confirmation, recurring meeting update
Spam / phishing (~20 messages, pre-triage filter)
Realistic spam that the triage engine should filter before categorization:
Ambiguous / borderline messages (~15 messages)
Malformed / edge-case messages (~5-10 messages)
Real inboxes contain broken email. Include parser-robustness tests:
[...] Subject header
[...] Date header (unparseable format)
[...] =?UTF-8?B?...?= wrapping already-encoded text)
[...] Subject (>500 chars)
[...] From header
Email structure variety
[...] .pdf, .csv, .png, .docx; 1-5 KB each, total .mbox under 1 MB)
[...] Content-Disposition: inline)
[...] message/rfc822)
[...] In-Reply-To / References headers)
[...] .ics attachments)
[...] > quoted original (Outlook style)
Realistic metadata
From addresses using the recurring sender personas above
[...] @acme-corp.example.com, @globaltech.example.net (RFC 2606 reserved)
X-Mailer headers (Outlook, Thunderbird, Gmail, automated systems)
Acceptance criteria
[...] .example.com / .example.net)
.mbox file size under 1 MB; individual dummy attachments 1-5 KB each
[...] python tests/fixtures/email/generate_mbox.py --verify)
[...] tests/unit/test_synthetic_mbox.py passes)
References
docs/plans/email-calendar-integration.mdx section 4 (Email Triage Agent)