feat(email): synthetic .mbox dataset for email triage agent testing #848

Description

@kovtcharov-amd

Summary

The Email Triage Agent (Phase 2, v0.23.0) needs a synthetic .mbox dataset to validate the triage engine, MIME parsing, email threading, and categorization logic without requiring real email accounts or OAuth setup. This unblocks offline development and CI testing of the email pipeline described in the email & calendar integration plan.

Why

  • No real inboxes in CI: tests cannot depend on live IMAP/OAuth credentials
  • Reproducible triage evaluation: ground-truth labels let us measure categorization accuracy against the 80% target
  • MIME edge cases: multipart, attachments, inline images, non-UTF-8 encodings, and nested forwarding need coverage upfront
  • Threading correctness: In-Reply-To / References chains must parse correctly for conversation grouping

Proposed Dataset

Generate a synthetic .mbox file (Python mailbox + email stdlib) with ~200-250 messages covering a realistic inbox mix.
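
A minimal sketch of the write and read-back step, assuming only the stdlib modules named above. The helper names (build_message, write_mbox, read_subjects), the recipient address, and the sample date are illustrative and not part of the plan.

```python
import mailbox
from email.message import EmailMessage


def build_message(msg_id: str, sender: str, subject: str, body: str) -> EmailMessage:
    """Build one plain-text message; recipient and date are placeholders."""
    msg = EmailMessage()
    msg["Message-ID"] = msg_id
    msg["From"] = sender
    msg["To"] = "user@acme-corp.example.com"  # hypothetical inbox owner
    msg["Subject"] = subject
    msg["Date"] = "Mon, 06 Jan 2025 09:05:00 +0000"
    msg.set_content(body)
    return msg


def write_mbox(path: str, messages: list) -> None:
    """Append messages to an mbox file, creating the file if it is missing."""
    box = mailbox.mbox(path)
    try:
        for msg in messages:
            box.add(msg)
    finally:
        box.close()  # close() flushes pending writes


def read_subjects(path: str) -> list:
    """Read the file back with mailbox.mbox(), as the acceptance criteria require."""
    box = mailbox.mbox(path, create=False)
    try:
        return [str(msg["Subject"]) for msg in box]
    finally:
        box.close()
```

A real generator would loop over the category and persona plans rather than hand-written messages; this only shows the file round trip.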

Triage categories (ground truth labels in metadata)

The plan (section 4.2) defines 4 triage categories. This dataset adds a pre-triage spam/phishing class — the triage engine should filter these before categorization, not treat them as a 5th category. Ground truth marks them with is_spam: true / is_phishing: true separately from category.

Category | Count | Examples
Urgent | ~20 | Boss escalation, client contract deadline, prod incident alert, security advisory
Actionable | ~45 | PR review requests, meeting invites, direct questions, expense approvals
Informational | ~55 | Newsletters, order receipts, shipping notifications, internal announcements
Low priority | ~30 | Marketing promos, social media notifications, cold outreach
Spam/phishing (pre-triage filter) | ~20 | Phishing attempts, fake invoices, scams, fake delivery notices
Ambiguous/borderline | ~15 | Hard-to-classify messages (see section below)

Messages within each category are drawn from the corporate, personal, and spam flavors described below — the flavors are content types within categories, not separate pools.
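
One way to keep the spam/phishing flags separate from the triage category is sketched below. The category strings, the choice of a null category for filtered messages, and the helper names are assumptions for illustration; the plan only fixes the field names.

```python
import json

# Assumed spellings for the four triage categories from the plan.
TRIAGE_CATEGORIES = ("urgent", "actionable", "informational", "low_priority")


def make_entry(category, *, is_spam=False, is_phishing=False,
               ambiguous=False, rationale=None) -> dict:
    """Return one ground-truth record (a subset of the manifest fields)."""
    return {
        "category": category,  # assumption: None when spam/phishing is filtered
        "is_spam": is_spam,
        "is_phishing": is_phishing,
        "ambiguous": ambiguous,
        "rationale": rationale,
    }


def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry is consistent."""
    problems = []
    filtered = entry["is_spam"] or entry["is_phishing"]
    if filtered and entry["category"] is not None:
        problems.append("spam/phishing must not carry a triage category")
    if not filtered and entry["category"] not in TRIAGE_CATEGORIES:
        problems.append(f"unknown category: {entry['category']!r}")
    if entry["ambiguous"] and not entry["rationale"]:
        problems.append("ambiguous entries need a rationale")
    return problems


# Hypothetical Message-IDs, shown only to illustrate the manifest shape.
MANIFEST = {
    "<incident-1@acme-corp.example.com>": make_entry("urgent"),
    "<phish-1@mail.example.org>": make_entry(None, is_spam=True, is_phishing=True),
}
MANIFEST_JSON = json.dumps(MANIFEST, indent=2, sort_keys=True)
```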

Recurring sender personas

Define 8-10 recurring senders so threading and sender-importance learning can be tested across multiple messages from the same person:

Persona | Domain | Role | Typical category
Sarah Chen (boss) | @acme-corp.example.com | VP Engineering | Urgent / Actionable
Alex Kumar (direct report) | @acme-corp.example.com | Senior Engineer | Actionable
Jordan Lee (cross-team) | @acme-corp.example.com | Product Manager | Actionable / Informational
IT Systems | noreply@acme-corp.example.com | Automated | Informational
HR Team | hr@acme-corp.example.com | Automated | Informational
Maria Santos (client) | @globaltech.example.net | External partner | Urgent / Actionable
DevOps Bot | alerts@acme-corp.example.com | CI/CD / PagerDuty | Urgent / Informational
Newsletter senders | various @*.example.com | Marketing / news | Informational / Low priority

Each recurring sender should appear in 3-8 messages (mix of thread roots and replies) to enable sender-frequency and response-pattern analysis.
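
A possible shape for the persona definitions and the per-sender message budget, as a sketch. The Persona class, the keys, and the local parts of the human addresses (for example sarah.chen) are hypothetical; the issue only fixes the domains and the automated addresses.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    key: str
    display_name: str
    address: str
    typical_categories: tuple


# A subset of the personas; human local parts are assumed, not specified.
PERSONAS = (
    Persona("boss", "Sarah Chen", "sarah.chen@acme-corp.example.com", ("urgent", "actionable")),
    Persona("direct_report", "Alex Kumar", "alex.kumar@acme-corp.example.com", ("actionable",)),
    Persona("client", "Maria Santos", "maria.santos@globaltech.example.net", ("urgent", "actionable")),
    Persona("devops_bot", "DevOps Bot", "alerts@acme-corp.example.com", ("urgent", "informational")),
)


def plan_message_counts(rng: random.Random, low: int = 3, high: int = 8) -> dict:
    """Give every persona between low and high messages (inclusive)."""
    return {p.key: rng.randint(low, high) for p in PERSONAS}


def pick_category(rng: random.Random, persona: Persona) -> str:
    """Choose one of the persona's typical categories for a new message."""
    return rng.choice(persona.typical_categories)
```

Passing a seeded random.Random instance, rather than using the module-level functions, keeps the plan reproducible from run to run.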

Corporate / enterprise email types (~60-70 messages)

The primary target is enterprise users on Outlook/Exchange. These messages are distributed across the triage categories above:

  • IT / Ops notifications — password expiry warnings, VPN maintenance windows, system outage alerts, new software rollout announcements
  • HR / People Ops — benefits enrollment reminders, PTO policy updates, org announcements, new hire intros, mandatory training deadlines
  • Executive comms — all-hands meeting recap, quarterly earnings summary, CEO update, reorg announcement
  • Cross-team threads — multi-person Cc chains with 4-6 participants, "loop in [name]" forwards, top-posted reply style
  • Calendar-related — meeting invite (.ics), meeting cancellation, room booking confirmation, recurring meeting update
  • Compliance / Legal — NDA reminder, export control notice, data retention policy update, audit request
  • Automated systems — JIRA ticket assignment, CI/CD build failure, Confluence page update, PagerDuty alert, Salesforce lead notification
  • Expense / Finance — expense report approval, purchase order confirmation, travel booking receipt, budget review request

Spam / phishing (~20 messages, pre-triage filter)

Realistic spam that the triage engine should filter before categorization:

  • Phishing — fake password reset, "verify your account", spoofed IT department, fake DocuSign
  • Scam — Nigerian prince, lottery winner, inheritance notification, crypto "opportunity"
  • Commercial spam — unsolicited product pitch, SEO services, fake invoice attachment, "limited time offer"
  • Social engineering — fake LinkedIn connection, spoofed coworker name with external domain, urgent wire transfer request from "CEO"
  • Delivery scams — fake UPS/FedEx tracking, "package held at customs", Amazon order you did not place

Spam messages should include realistic spam signals: mismatched From display name vs address, suspicious domains, urgency language, misspellings, suspicious attachment names, missing List-Unsubscribe, etc.
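
The sketch below builds one phishing message carrying several of these signals and includes a small checker the generator could use to confirm the signals are actually present. The domains, the misspelled wording, and the function names are made up for illustration; the checker is a fixture sanity check, not the triage engine's filter.

```python
from email.message import EmailMessage
from email.utils import parseaddr

TRUSTED_DOMAIN = "acme-corp.example.com"
URGENCY_WORDS = ("urgent", "immediately", "verify", "expires")


def build_phish() -> EmailMessage:
    """Spoofed IT notice: trusted-looking display name, external domain."""
    msg = EmailMessage()
    msg["From"] = '"Acme IT Support" <it-support@acme-c0rp-security.example.net>'
    msg["Reply-To"] = "helpdesk@reset-portal.example.org"  # Reply-To mismatch
    msg["To"] = "user@acme-corp.example.com"
    msg["Subject"] = "URGENT: Your pasword expires in 1 hour"  # deliberate typo
    msg["Message-ID"] = "<phish-001@acme-c0rp-security.example.net>"
    msg.set_content("Verify your account now or lose access.\n")
    return msg  # no List-Unsubscribe header on purpose


def spam_signals(msg: EmailMessage) -> dict:
    """Report which spam indicators a generated message carries."""
    display, from_addr = parseaddr(str(msg["From"] or ""))
    _, reply_addr = parseaddr(str(msg["Reply-To"] or ""))
    from_domain = from_addr.rpartition("@")[2].lower()
    reply_domain = reply_addr.rpartition("@")[2].lower()
    subject = str(msg["Subject"] or "").lower()
    external = from_domain != TRUSTED_DOMAIN
    return {
        "external_domain": external,
        "display_name_mismatch": external and "acme" in display.lower(),
        "reply_to_mismatch": bool(reply_domain) and reply_domain != from_domain,
        "urgency_language": any(word in subject for word in URGENCY_WORDS),
        "missing_list_unsubscribe": msg["List-Unsubscribe"] is None,
    }
```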

Personal / consumer email types (~30 messages)

  • Order confirmations and shipping updates (Amazon, retailer receipts)
  • Subscription newsletters (tech blogs, news digests)
  • Social media notifications (LinkedIn, GitHub stars/follows)
  • Bank/financial alerts (transaction notification, statement ready)
  • Travel confirmations (flight, hotel, car rental)
  • App notifications (Slack digest, calendar reminders)

Ambiguous / borderline messages (~15 messages)

Intentionally hard-to-classify emails that stress-test the triage engine's decision boundaries. Mark each with ambiguous: true in the ground truth, plus a rationale field explaining the intended classification:

  • Meeting invite from unknown external contact (Actionable or Low priority?)
  • Vendor invoice with no prior relationship (Urgent or Informational?)
  • Automated JIRA ticket from a project the user is not on (Actionable or Informational?)
  • Newsletter from a tool you actively use vs. one signed up for and forgotten (Informational or Low priority?)
  • "Quick question" from someone never emailed before (Actionable or Low priority?)
  • Internal compliance email that requires acknowledgment by EOD (Urgent or Actionable?)
  • Reply-all on a thread where the user was Cc'd but not addressed (Actionable or Informational?)

Malformed / edge-case messages (~5-10 messages)

Real inboxes contain broken email. Include parser-robustness tests:

  • Missing Subject header
  • Empty body (headers only)
  • Truncated multipart (missing closing boundary)
  • Invalid Date header (unparseable format)
  • Double-encoded UTF-8 subject (=?UTF-8?B?...?= wrapping already-encoded text)
  • Base64-encoded body with incorrect padding
  • Extremely long Subject (>500 chars)
  • Message with no From header

Email structure variety

  • Plain text only
  • HTML only (with realistic corporate email templates)
  • Multipart (text + HTML)
  • With attachments (small dummy .pdf, .csv, .png, .docx; 1-5 KB each, total .mbox under 1 MB)
  • Inline images (Content-Disposition: inline)
  • Forwarded messages (nested message/rfc822)
  • Reply chains (3-5 deep with proper In-Reply-To / References headers)
  • Non-ASCII subjects and bodies (UTF-8, ISO-8859-1)
  • Calendar invites (.ics attachments)
  • Top-posted replies with > quoted original (Outlook style)
  • HTML signature blocks with logos and legal disclaimers
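
Several of these shapes can be built with email.message.EmailMessage alone. The sketch below covers a text + HTML message with a small attachment and a non-ASCII subject, plus a forward that nests the original as message/rfc822. Names, addresses, and the dummy attachment bytes are placeholders.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_multipart_with_attachment() -> EmailMessage:
    """multipart/mixed wrapping multipart/alternative plus one attachment."""
    msg = EmailMessage()
    msg["From"] = "Jordan Lee <jordan.lee@acme-corp.example.com>"
    msg["To"] = "user@acme-corp.example.com"
    msg["Subject"] = "Résumé du sprint"  # non-ASCII subject, RFC 2047 encoded on output
    msg["Message-ID"] = "<mime-demo-1@acme-corp.example.com>"
    msg.set_content("Plain-text part.\n")
    msg.add_alternative("<html><body><p>HTML part.</p></body></html>", subtype="html")
    msg.add_attachment(
        b"%PDF-1.4 dummy bytes, not a real PDF",
        maintype="application",
        subtype="pdf",
        filename="report.pdf",
    )
    return msg


def build_forward(original: EmailMessage) -> EmailMessage:
    """Forward that carries the original as a nested message/rfc822 part."""
    fwd = EmailMessage()
    fwd["From"] = "Sarah Chen <sarah.chen@acme-corp.example.com>"
    fwd["To"] = "user@acme-corp.example.com"
    fwd["Subject"] = "Fwd: " + str(original["Subject"])
    fwd["Message-ID"] = "<mime-demo-2@acme-corp.example.com>"
    fwd.set_content("FYI, see the forwarded message.\n")
    fwd.add_attachment(original)  # Message objects are attached as message/rfc822
    return fwd


def roundtrip(msg: EmailMessage) -> EmailMessage:
    """Serialize and re-parse, as a reader of the .mbox would."""
    return message_from_bytes(msg.as_bytes(), policy=policy.default)
```

Round-tripping through bytes before asserting on structure is a cheap way to check that what was built is what a parser will see.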

Realistic metadata

  • Varied From addresses using the recurring sender personas above
  • Corporate domains: @acme-corp.example.com, @globaltech.example.net (RFC 2606 reserved)
  • Realistic Date headers spanning ~2 weeks, clustered around 9 AM and 4-5 PM on weekdays to simulate real arrival patterns, plus a batch arriving outside working hours between Saturday and Monday for "overnight triage" testing
  • Proper Message-ID, In-Reply-To, References for threading
  • Mix of To, Cc, Bcc patterns (including large Cc lists for corporate threads)
  • List-Unsubscribe headers on marketing/newsletter emails
  • X-Priority / Importance headers on some urgent emails
  • X-Mailer headers (Outlook, Thunderbird, Gmail, automated systems)
  • Corporate email disclaimers in footers ("This email is confidential...")
  • Reply-To mismatches on phishing emails
  • Received header chains simulating realistic relay paths
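
The threading headers and the clustered Date values could be produced along these lines. The start date, the jitter width, the Message-ID format, and the function names are assumptions; only the header names and the weekday clustering come from the list above.

```python
import random
from datetime import datetime, timedelta, timezone
from email.message import EmailMessage
from email.utils import format_datetime

START = datetime(2025, 1, 6, tzinfo=timezone.utc)  # a Monday; hypothetical window start
PEAKS_MINUTES = (9 * 60, 16 * 60 + 30)  # around 9 AM and 4-5 PM


def sample_weekday_arrival(rng: random.Random) -> datetime:
    """Pick a weekday time in a two-week window, clustered near a peak."""
    weekdays = [d for d in range(14) if (START + timedelta(days=d)).weekday() < 5]
    jitter = max(-90, min(90, round(rng.gauss(0, 30))))  # clamp to +/- 90 minutes
    return START + timedelta(days=rng.choice(weekdays),
                             minutes=rng.choice(PEAKS_MINUTES) + jitter)


def build_thread(rng: random.Random, subject: str, senders: list, depth: int = 3) -> list:
    """Build a reply chain; each reply references every earlier message."""
    messages, references = [], []
    sent_at = sample_weekday_arrival(rng)
    for i in range(depth):
        msg = EmailMessage()
        msg_id = f"<thread-{rng.getrandbits(32):08x}-{i}@acme-corp.example.com>"
        msg["Message-ID"] = msg_id
        msg["From"] = senders[i % len(senders)]
        msg["To"] = senders[(i + 1) % len(senders)]
        msg["Subject"] = subject if i == 0 else f"Re: {subject}"
        msg["Date"] = format_datetime(sent_at)
        if references:
            msg["In-Reply-To"] = references[-1]      # direct parent only
            msg["References"] = " ".join(references)  # full ancestry, oldest first
        msg.set_content(f"Message {i} in the thread.\n")
        messages.append(msg)
        references.append(msg_id)
        sent_at += timedelta(minutes=rng.randint(5, 120))
    return messages
```

The root has neither In-Reply-To nor References, which gives the ground truth a natural is_thread_root signal.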

Deliverables

  1. Generator script: tests/fixtures/email/generate_mbox.py that produces the .mbox deterministically (seeded RNG)
  2. Pre-built fixture: tests/fixtures/email/synthetic_inbox.mbox checked into repo (must be under 1 MB)
  3. Ground truth manifest: tests/fixtures/email/ground_truth.json mapping Message-ID to { category, priority, is_thread_root, thread_id, has_attachment, is_spam, is_phishing, ambiguous, rationale, sender_persona }
  4. Pytest fixtures in tests/fixtures/email/conftest.py:
    • synthetic_mbox — loads the .mbox file
    • email_ground_truth — loads the ground truth JSON
    • single_email(category) — returns one message of a given triage category
    • spam_emails — returns all spam/phishing messages for filter testing
    • ambiguous_emails — returns borderline messages for boundary testing
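
The selection logic behind these fixtures might look like the plain helpers below, which a conftest.py could wrap with @pytest.fixture. This is a sketch: the helper signatures are assumptions, and pytest itself is left out so the logic can be exercised on its own.

```python
import json
import mailbox


def load_ground_truth(path: str) -> dict:
    """Load the Message-ID -> labels mapping."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def load_messages(path: str) -> list:
    """Read every message into memory; fails if the fixture is missing."""
    box = mailbox.mbox(path, create=False)
    try:
        return list(box)
    finally:
        box.close()


def select(messages: list, truth: dict, predicate) -> list:
    """Keep messages whose ground-truth entry satisfies the predicate."""
    return [m for m in messages if predicate(truth.get(m["Message-ID"], {}))]


def single_email(messages: list, truth: dict, category: str):
    """Return the first message labelled with the given triage category."""
    matches = select(messages, truth, lambda e: e.get("category") == category)
    if not matches:
        raise LookupError(f"no message with category {category!r}")
    return matches[0]


def spam_emails(messages: list, truth: dict) -> list:
    """All messages flagged as spam or phishing."""
    return select(messages, truth, lambda e: e.get("is_spam") or e.get("is_phishing"))


def ambiguous_emails(messages: list, truth: dict) -> list:
    """All borderline messages."""
    return select(messages, truth, lambda e: e.get("ambiguous"))
```

Messages without a Message-ID, or with one missing from the manifest, fall through to an empty entry and match none of the selectors, which keeps the malformed cases from breaking the helpers.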

Acceptance criteria

  • python tests/fixtures/email/generate_mbox.py produces a valid .mbox readable by Python mailbox.mbox()
  • All 4 triage categories represented with ground truth labels, plus separate is_spam/is_phishing flags (spam is a pre-triage filter, not a 5th category)
  • Corporate email types cover IT, HR, exec comms, cross-team threads, and automated systems
  • Spam/phishing emails include realistic indicators (domain mismatches, urgency, suspicious attachments)
  • ~15 ambiguous/borderline messages included with ambiguous: true and rationale in ground truth
  • ~5-10 malformed messages test parser robustness (missing headers, truncated MIME, encoding errors)
  • Recurring sender personas appear in 3-8 messages each for sender-importance testing
  • Multipart, attachment, threading, and encoding edge cases covered
  • No real PII — all names, addresses, domains, and content are synthetic (use .example.com / .example.net)
  • Total .mbox file size under 1 MB; individual dummy attachments 1-5 KB each
  • Pre-built fixture matches generator output (hash check in CI — python tests/fixtures/email/generate_mbox.py --verify)
  • Pytest can load and iterate the dataset (tests/unit/test_synthetic_mbox.py passes)
  • Generator is deterministic (same seed produces identical output)
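
A sketch of how determinism and the hash check could fit together, assuming a hypothetical default seed and helper names. One detail worth noting: when a message has no envelope line, mailbox.mbox stamps the "From " separator with the current time, so the sketch sets it explicitly with set_unixfrom().

```python
import hashlib
import mailbox
import os
import random
import tempfile
from email.message import EmailMessage

SEED = 848  # hypothetical default seed
UNIX_FROM = "From generator@acme-corp.example.com Mon Jan  6 09:00:00 2025"


def generate(path: str, seed: int = SEED, count: int = 5) -> None:
    """Write a small mbox whose bytes depend only on the seed."""
    rng = random.Random(seed)  # private RNG, independent of global state
    if os.path.exists(path):
        os.remove(path)  # start clean so reruns do not append
    box = mailbox.mbox(path)
    try:
        for i in range(count):
            msg = EmailMessage()
            msg["Message-ID"] = f"<{rng.getrandbits(64):016x}@acme-corp.example.com>"
            msg["From"] = "noreply@acme-corp.example.com"
            msg["To"] = "user@acme-corp.example.com"
            msg["Subject"] = f"Synthetic message {i}"
            msg["Date"] = "Mon, 06 Jan 2025 09:00:00 +0000"
            msg.set_content(f"Body {i}\n")
            msg.set_unixfrom(UNIX_FROM)  # fixed envelope line instead of "now"
            box.add(msg)
    finally:
        box.close()


def sha256_of(path: str) -> str:
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def verify(fixture_path: str, seed: int = SEED) -> bool:
    """Regenerate into a temp file and compare hashes with the fixture."""
    with tempfile.TemporaryDirectory() as tmp:
        fresh = os.path.join(tmp, "regenerated.mbox")
        generate(fresh, seed)
        return sha256_of(fresh) == sha256_of(fixture_path)
```

Two further sources of drift are likely to need attention in the real generator: multipart boundaries, which the stdlib generates unless they are set explicitly (for example with set_boundary()), and line endings, since mailbox writes the platform's line separator and a hash computed on one OS may not match another.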

Labels

agent; domain:automation (Scheduler, autonomy, RAG, web search, watchers, research); enhancement (New feature or request); track:consumer-app (Consumer product track, mobile-first: voice + messaging + memory + skills)
