feat(email): synthetic .mbox dataset for email triage agent testing

## Summary

The Email Triage Agent (Phase 2, v0.23.0) needs a synthetic `.mbox` dataset to validate the triage engine, MIME parsing, email threading, and categorization logic without requiring real email accounts or OAuth setup. This unblocks offline development and CI testing of the email pipeline described in the [email & calendar integration plan](https://github.com/amd/gaia/blob/main/docs/plans/email-calendar-integration.mdx).

## Why

- **No real inboxes in CI** — tests cannot depend on live IMAP/OAuth credentials
- **Reproducible triage evaluation** — ground-truth labels let us measure categorization accuracy against the [80% target](https://github.com/amd/gaia/blob/main/docs/plans/email-calendar-integration.mdx#12-success-metrics)
- **MIME edge cases** — multipart, attachments, inline images, non-UTF-8 encodings, and nested forwarding need coverage upfront
- **Threading correctness** — `In-Reply-To` / `References` chains must parse correctly for conversation grouping

## Proposed Dataset

Generate a synthetic `.mbox` file (Python `mailbox` + `email` stdlib) with **~200-250 messages** covering a realistic inbox mix.
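A minimal sketch of the write path, assuming the generator builds messages with `email.message.EmailMessage` and stores them through `mailbox.mbox`. The helper names, the recipient address, and the base date are hypothetical. One detail matters for reproducibility: when a message carries no envelope line, `mailbox.mbox.add()` stamps the `From ` separator with the current time, so this sketch wraps each message in `mailbox.mboxMessage` and pins that timestamp with `set_from()`.

```python
import mailbox
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime

# Hypothetical base timestamp; any fixed value keeps the output reproducible.
FIXED_TIME = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)


def build_message(msg_id, sender, subject, body, when=FIXED_TIME):
    """Build one plain-text message with an explicit Message-ID and Date."""
    msg = EmailMessage()
    msg["Message-ID"] = msg_id
    msg["From"] = sender
    msg["To"] = "user@acme-corp.example.com"  # hypothetical inbox owner
    msg["Subject"] = subject
    msg["Date"] = format_datetime(when)
    msg.set_content(body)
    return msg


def write_mbox(path, messages, when=FIXED_TIME):
    """Write messages to a new mbox file with a pinned 'From ' separator line."""
    box = mailbox.mbox(path)
    try:
        for msg in messages:
            wrapped = mailbox.mboxMessage(msg)
            # Without this, the separator line would contain the current time.
            wrapped.set_from("MAILER-DAEMON", when.timetuple())
            box.add(wrapped)
    finally:
        box.close()  # close() flushes pending writes
```

Writing the same message list to two fresh paths should then yield byte-identical files, which is what the hash check in the acceptance criteria relies on.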

### Triage categories (ground truth labels in metadata)

The plan (section 4.2) defines 4 triage categories. This dataset adds a **pre-triage spam/phishing class** — the triage engine should filter these *before* categorization, not treat them as a 5th category. Ground truth marks them with `is_spam: true` / `is_phishing: true` separately from `category`.

| Category | Count | Examples |
|----------|-------|---------|
| **Urgent** | ~20 | Boss escalation, client contract deadline, prod incident alert, security advisory |
| **Actionable** | ~45 | PR review requests, meeting invites, direct questions, expense approvals |
| **Informational** | ~55 | Newsletters, order receipts, shipping notifications, internal announcements |
| **Low priority** | ~30 | Marketing promos, social media notifications, cold outreach |
| **Spam/phishing** (pre-triage filter) | ~20 | Phishing attempts, fake invoices, scams, fake delivery notices |
| **Ambiguous/borderline** | ~15 | Hard-to-classify messages (see section below) |

Messages within each category are drawn from the corporate, personal, and spam flavors described below — the flavors are **content types within categories**, not separate pools.

### Recurring sender personas

Define 8-10 recurring senders so threading and sender-importance learning can be tested across multiple messages from the same person:

| Persona | Domain | Role | Typical category |
|---------|--------|------|-----------------|
| Sarah Chen (boss) | `@acme-corp.example.com` | VP Engineering | Urgent / Actionable |
| Alex Kumar (direct report) | `@acme-corp.example.com` | Senior Engineer | Actionable |
| Jordan Lee (cross-team) | `@acme-corp.example.com` | Product Manager | Actionable / Informational |
| IT Systems | `noreply@acme-corp.example.com` | Automated | Informational |
| HR Team | `hr@acme-corp.example.com` | Automated | Informational |
| Maria Santos (client) | `@globaltech.example.net` | External partner | Urgent / Actionable |
| DevOps Bot | `alerts@acme-corp.example.com` | CI/CD / PagerDuty | Urgent / Informational |
| Newsletter senders | various `@*.example.com` | Marketing / news | Informational / Low priority |

Each recurring sender should appear in **3-8 messages** (mix of thread roots and replies) to enable sender-frequency and response-pattern analysis.

### Corporate / enterprise email types (~60-70 messages)

The primary target is enterprise users on Outlook/Exchange. These messages are distributed across the triage categories above:

- **IT / Ops notifications** — password expiry warnings, VPN maintenance windows, system outage alerts, new software rollout announcements
- **HR / People Ops** — benefits enrollment reminders, PTO policy updates, org announcements, new hire intros, mandatory training deadlines
- **Executive comms** — all-hands meeting recap, quarterly earnings summary, CEO update, reorg announcement
- **Cross-team threads** — multi-person `Cc` chains with 4-6 participants, "loop in [name]" forwards, top-posted reply style
- **Calendar-related** — meeting invite (`.ics`), meeting cancellation, room booking confirmation, recurring meeting update
- **Compliance / Legal** — NDA reminder, export control notice, data retention policy update, audit request
- **Automated systems** — JIRA ticket assignment, CI/CD build failure, Confluence page update, PagerDuty alert, Salesforce lead notification
- **Expense / Finance** — expense report approval, purchase order confirmation, travel booking receipt, budget review request

### Spam / phishing (~20 messages, pre-triage filter)

Realistic spam that the triage engine should filter before categorization:

- **Phishing** — fake password reset, "verify your account", spoofed IT department, fake DocuSign
- **Scam** — Nigerian prince, lottery winner, inheritance notification, crypto "opportunity"
- **Commercial spam** — unsolicited product pitch, SEO services, fake invoice attachment, "limited time offer"
- **Social engineering** — fake LinkedIn connection, spoofed coworker name with external domain, urgent wire transfer request from "CEO"
- **Delivery scams** — fake UPS/FedEx tracking, "package held at customs", Amazon order you did not place

Spam messages should include realistic spam signals: mismatched `From` display name vs address, suspicious domains, urgency language, misspellings, suspicious attachment names, missing `List-Unsubscribe`, etc.
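One way to keep these signals honest is a small self-check that the generator runs over its own spam fixtures. The sketch below is an assumption about how that could look, not the triage engine's filter: `spam_signals`, the signal names, the "acme" display-name heuristic, and the look-alike domains are all hypothetical.

```python
from email.message import EmailMessage
from email.utils import parseaddr

INTERNAL_DOMAIN = "acme-corp.example.com"  # corporate domain from the persona table


def domain_of(header_value):
    """Return the lowercased domain of the first address in a header, or ''."""
    _, addr = parseaddr(header_value or "")
    if "@" not in addr:
        return ""
    return addr.rpartition("@")[2].lower()


def spam_signals(msg):
    """Report which header-level spam signals a generated message carries."""
    from_header = str(msg.get("From", ""))
    display_name, _ = parseaddr(from_header)
    from_domain = domain_of(from_header)
    reply_to = msg.get("Reply-To")
    return {
        # Display name claims to be internal, address is not.
        "display_name_domain_mismatch": (
            "acme" in display_name.lower() and from_domain != INTERNAL_DOMAIN
        ),
        # Replies are routed to a different domain than the sender's.
        "reply_to_mismatch": (
            reply_to is not None and domain_of(str(reply_to)) != from_domain
        ),
        "missing_list_unsubscribe": msg.get("List-Unsubscribe") is None,
    }


def build_phishing_example():
    """Build one spoofed-IT phishing message using reserved example domains."""
    msg = EmailMessage()
    msg["From"] = "Acme IT Helpdesk <it-support@acme-corp-security.example.net>"
    msg["Reply-To"] = "collector@mail-verify.example.org"
    msg["To"] = "user@acme-corp.example.com"
    msg["Subject"] = "URGENT: your pasword expires in 1 hour"  # misspelling is intentional
    msg.set_content("Verify your account immediately to avoid suspension.\n")
    return msg
```

Running `spam_signals(build_phishing_example())` should flag all three signals, while a legitimate `noreply@acme-corp.example.com` notification should flag at most the missing `List-Unsubscribe`.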

### Personal / consumer email types (~30 messages)

- Order confirmations and shipping updates (Amazon, retailer receipts)
- Subscription newsletters (tech blogs, news digests)
- Social media notifications (LinkedIn, GitHub stars/follows)
- Bank/financial alerts (transaction notification, statement ready)
- Travel confirmations (flight, hotel, car rental)
- App notifications (Slack digest, calendar reminders)

### Ambiguous / borderline messages (~15 messages)

Intentionally hard-to-classify emails that stress-test the triage engine's decision boundaries. Mark these with `ambiguous: true` in ground truth, plus a `rationale` field explaining the intended classification:

- Meeting invite from unknown external contact (Actionable or Low priority?)
- Vendor invoice with no prior relationship (Urgent or Informational?)
- Automated JIRA ticket from a project the user is not on (Actionable or Informational?)
- Newsletter from a tool the user actively uses vs. one they signed up for and forgot (Informational or Low priority?)
- "Quick question" from a sender with no prior email history with the user (Actionable or Low priority?)
- Internal compliance email that requires acknowledgment by EOD (Urgent or Actionable?)
- Reply-all on a thread where the user was Cc'd but not addressed (Actionable or Informational?)

### Malformed / edge-case messages (~5-10 messages)

Real inboxes contain broken email. Include messages that exercise parser robustness:

- Missing `Subject` header
- Empty body (headers only)
- Truncated multipart (missing closing boundary)
- Invalid `Date` header (unparseable format)
- Double-encoded UTF-8 subject (`=?UTF-8?B?...?=` wrapping already-encoded text)
- Base64-encoded body with incorrect padding
- Extremely long `Subject` (>500 chars)
- Message with no `From` header

### Email structure variety

- Plain text only
- HTML only (with realistic corporate email templates)
- Multipart (text + HTML)
- With attachments (small dummy `.pdf`, `.csv`, `.png`, `.docx` — **1-5 KB each, total .mbox under 1 MB**)
- Inline images (`Content-Disposition: inline`)
- Forwarded messages (nested `message/rfc822`)
- Reply chains (3-5 deep with proper `In-Reply-To` / `References` headers)
- Non-ASCII subjects and bodies (UTF-8, ISO-8859-1)
- Calendar invites (`.ics` attachments)
- Top-posted replies with `>` quoted original (Outlook style)
- HTML signature blocks with logos and legal disclaimers
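Several of these shapes can be combined in a single message. The following is a sketch under assumptions, not a required layout: it builds a text + HTML body, attaches a small dummy PDF, and nests a forwarded message as `message/rfc822`, using only `EmailMessage` methods from the stdlib. The function name, the local parts of the addresses, and the attachment bytes are placeholders.

```python
from email.message import EmailMessage

# Placeholder bytes of roughly 1 KB; not a valid PDF, only a stand-in payload.
DUMMY_PDF = b"%PDF-1.4\n" + b"0" * 1024


def build_forwarded_report():
    """Build a multipart message with an attachment and a nested forward."""
    original = EmailMessage()
    original["From"] = "Maria Santos <maria.santos@globaltech.example.net>"
    original["To"] = "Sarah Chen <sarah.chen@acme-corp.example.com>"
    original["Subject"] = "Contract renewal terms"
    original.set_content("Please confirm the renewal terms by Friday.\n")

    outer = EmailMessage()
    outer["From"] = "Sarah Chen <sarah.chen@acme-corp.example.com>"
    outer["To"] = "user@acme-corp.example.com"
    # Non-ASCII subject; it is RFC 2047 encoded when the message is serialized.
    outer["Subject"] = "Fwd: Réunion reportée / contract renewal"
    outer.set_content("Forwarding this for your review.\n")
    outer.add_alternative(
        "<p>Forwarding this for your <b>review</b>.</p>", subtype="html"
    )
    outer.add_attachment(
        DUMMY_PDF, maintype="application", subtype="pdf",
        filename="renewal-terms.pdf",
    )
    # Passing a message object attaches it as message/rfc822.
    outer.add_attachment(original)
    return outer


def content_types(msg):
    """List the content type of every part, depth first."""
    return [part.get_content_type() for part in msg.walk()]
```

`content_types(build_forwarded_report())` gives a quick structural fingerprint that a test can compare against the ground truth `has_attachment` flag.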

### Realistic metadata

- Varied `From` addresses using the recurring sender personas above
- Corporate domains: `@acme-corp.example.com`, `@globaltech.example.net` (RFC 2606 reserved)
- Realistic `Date` headers spanning ~2 weeks, **clustered around 9 AM and 4-5 PM** on weekdays to simulate real arrival patterns, plus a batch that arrives over a weekend (Saturday through Monday morning) for "overnight triage" testing
- Proper `Message-ID`, `In-Reply-To`, `References` for threading
- Mix of `To`, `Cc`, `Bcc` patterns (including large `Cc` lists for corporate threads)
- `List-Unsubscribe` headers on marketing/newsletter emails
- `X-Priority` / `Importance` headers on some urgent emails
- `X-Mailer` headers (Outlook, Thunderbird, Gmail, automated systems)
- Corporate email disclaimers in footers ("This email is confidential...")
- `Reply-To` mismatches on phishing emails
- `Received` header chains simulating realistic relay paths
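Two of the items above, arrival-time clustering and threading headers, are easy to get subtly wrong, so here is a hedged sketch of both. It assumes a seeded `random.Random` instance is passed around (which also serves the determinism requirement); the function names, the 45-minute spread, and the `Message-ID` format are hypothetical. The weekend batch is not covered here.

```python
import random
from datetime import datetime, timedelta, timezone
from email.message import EmailMessage
from email.utils import format_datetime


def arrival_time(rng, start):
    """Pick a weekday timestamp clustered near 09:00 or 16:30."""
    day = start + timedelta(days=rng.randrange(14))
    while day.weekday() >= 5:  # move Saturday/Sunday picks to Monday
        day += timedelta(days=1)
    peak = rng.choice([9 * 60, 16 * 60 + 30])  # minutes after midnight
    minutes = int(rng.gauss(peak, 45))
    minutes = max(0, min(minutes, 23 * 60 + 59))  # stay within the same day
    midnight = day.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(minutes=minutes)


def build_thread(thread_id, senders, subject, depth, first_sent):
    """Build a reply chain whose References header grows by one ID per reply."""
    messages, references = [], []
    for i in range(depth):
        msg_id = f"<{thread_id}-{i}@acme-corp.example.com>"
        msg = EmailMessage()
        msg["Message-ID"] = msg_id
        msg["From"] = senders[i % len(senders)]
        msg["Subject"] = subject if i == 0 else f"Re: {subject}"
        msg["Date"] = format_datetime(first_sent + timedelta(hours=i))
        if references:
            msg["In-Reply-To"] = references[-1]        # direct parent only
            msg["References"] = " ".join(references)   # full ancestry, oldest first
        msg.set_content(f"Message {i + 1} in thread {thread_id}.\n")
        references.append(msg_id)
        messages.append(msg)
    return messages
```

For a chain of depth 3, the last message should carry the second message's ID in `In-Reply-To` and both earlier IDs in `References`, which is the shape conversation grouping is expected to rely on.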

## Deliverables

1. **Generator script** — `tests/fixtures/email/generate_mbox.py` that produces the `.mbox` deterministically (seeded RNG)
2. **Pre-built fixture** — `tests/fixtures/email/synthetic_inbox.mbox` checked into repo (must be under 1 MB)
3. **Ground truth manifest** — `tests/fixtures/email/ground_truth.json` mapping `Message-ID` to `{ category, priority, is_thread_root, thread_id, has_attachment, is_spam, is_phishing, ambiguous, rationale, sender_persona }`
4. **Pytest fixtures** — in `tests/fixtures/email/conftest.py`:
   - `synthetic_mbox` — loads the `.mbox` file
   - `email_ground_truth` — loads the ground truth JSON
   - `single_email(category)` — returns one message of a given triage category
   - `spam_emails` — returns all spam/phishing messages for filter testing
   - `ambiguous_emails` — returns borderline messages for boundary testing
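To make the manifest shape concrete, here is a sketch of two `ground_truth.json` entries plus a consistency check. The field names come from the manifest description above; everything else is an assumption for illustration, including the category spellings, the `priority` values, and the choice of `null` as the `category` of pre-triage spam. The fixtures in `conftest.py` could be thin wrappers around helpers like `select_ids`.

```python
import json

# Assumed spellings for the four triage categories.
CATEGORIES = {"urgent", "actionable", "informational", "low_priority"}
REQUIRED_FIELDS = {
    "category", "priority", "is_thread_root", "thread_id", "has_attachment",
    "is_spam", "is_phishing", "ambiguous", "rationale", "sender_persona",
}

# Hypothetical manifest excerpt keyed by Message-ID.
EXAMPLE_MANIFEST = json.loads("""
{
  "<thread-001-0@acme-corp.example.com>": {
    "category": "urgent", "priority": "high",
    "is_thread_root": true, "thread_id": "thread-001",
    "has_attachment": false, "is_spam": false, "is_phishing": false,
    "ambiguous": false, "rationale": null, "sender_persona": "sarah_chen"
  },
  "<phish-007@acme-corp-security.example.net>": {
    "category": null, "priority": null,
    "is_thread_root": true, "thread_id": "phish-007",
    "has_attachment": true, "is_spam": true, "is_phishing": true,
    "ambiguous": false, "rationale": null, "sender_persona": null
  }
}
""")


def validate_manifest(manifest):
    """Return a list of problems; an empty list means the entries look consistent."""
    problems = []
    for message_id, entry in manifest.items():
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append(f"{message_id}: missing {sorted(missing)}")
            continue
        filtered = entry["is_spam"] or entry["is_phishing"]
        if not filtered and entry["category"] not in CATEGORIES:
            problems.append(f"{message_id}: unknown category {entry['category']!r}")
        if entry["ambiguous"] and not entry["rationale"]:
            problems.append(f"{message_id}: ambiguous entry needs a rationale")
    return problems


def select_ids(manifest, **conditions):
    """Return sorted Message-IDs whose entry matches every field=value condition."""
    return sorted(
        message_id for message_id, entry in manifest.items()
        if all(entry.get(key) == value for key, value in conditions.items())
    )
```

For example, a `spam_emails` fixture could call `select_ids(truth, is_spam=True)` and look the resulting IDs up in the loaded mailbox.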

## Acceptance criteria

- [ ] `python tests/fixtures/email/generate_mbox.py` produces a valid `.mbox` readable by Python `mailbox.mbox()`
- [ ] All 4 triage categories represented with ground truth labels, plus separate `is_spam`/`is_phishing` flags (spam is a pre-triage filter, not a 5th category)
- [ ] Corporate email types cover IT, HR, exec comms, cross-team threads, and automated systems
- [ ] Spam/phishing emails include realistic indicators (domain mismatches, urgency, suspicious attachments)
- [ ] ~15 ambiguous/borderline messages included with `ambiguous: true` and `rationale` in ground truth
- [ ] ~5-10 malformed messages test parser robustness (missing headers, truncated MIME, encoding errors)
- [ ] Recurring sender personas appear in 3-8 messages each for sender-importance testing
- [ ] Multipart, attachment, threading, and encoding edge cases covered
- [ ] No real PII — all names, addresses, domains, and content are synthetic (use `.example.com` / `.example.net`)
- [ ] Total `.mbox` file size under 1 MB; individual dummy attachments 1-5 KB each
- [ ] Pre-built fixture matches generator output (hash check in CI — `python tests/fixtures/email/generate_mbox.py --verify`)
- [ ] Pytest can load and iterate the dataset (`tests/unit/test_synthetic_mbox.py` passes)
- [ ] Generator is deterministic (same seed produces identical output)
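The `--verify` criterion could be implemented as a hash comparison between freshly generated bytes and the committed fixture. The sketch below assumes the generator can return the whole `.mbox` as bytes through a callable; `generate`, `--output`, and the exit-code convention are hypothetical, while `--verify` and the default path are the ones named in this issue.

```python
import argparse
import hashlib
from pathlib import Path

DEFAULT_FIXTURE = Path("tests/fixtures/email/synthetic_inbox.mbox")


def sha256_of(data):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def fixture_matches(fixture_path, generated):
    """True when the committed fixture is byte-identical to generated output."""
    if not fixture_path.is_file():
        return False
    return sha256_of(fixture_path.read_bytes()) == sha256_of(generated)


def main(argv, generate):
    """Write the fixture, or with --verify only compare it; returns an exit code."""
    parser = argparse.ArgumentParser(description="Generate the synthetic mbox fixture")
    parser.add_argument("--verify", action="store_true",
                        help="check the committed fixture instead of writing it")
    parser.add_argument("--output", type=Path, default=DEFAULT_FIXTURE,
                        help="fixture path (hypothetical option)")
    args = parser.parse_args(argv)

    data = generate()  # hypothetical callable returning the mbox as bytes
    if args.verify:
        ok = fixture_matches(args.output, data)
        print("fixture matches generator output" if ok else "fixture is stale")
        return 0 if ok else 1
    args.output.parent.mkdir(parents=True, exist_ok=True)
    args.output.write_bytes(data)
    return 0
```

In CI, a non-zero exit code from the verify run would then fail the job whenever the generator and the checked-in fixture drift apart.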

## References

- Plan: [`docs/plans/email-calendar-integration.mdx`](https://github.com/amd/gaia/blob/main/docs/plans/email-calendar-integration.mdx) section 4 (Email Triage Agent)
- Related: #645 (Email Triage Agent), #663 (Daily briefs)

