
feat(eval): synthetic mbox fixture for email-agent eval (#848)#854

Open
theonlychant wants to merge 7 commits into amd:main from theonlychant:patch-1

Conversation

@theonlychant
Contributor

theonlychant commented Apr 23, 2026

Summary

Adds a synthetic .mbox dataset for email triage agent evaluation, with
a deterministic generator that writes fixtures to a temporary directory
at test time so the source tree is never mutated.

Why

The email triage agent had no test data to validate triage logic in CI.
This fixture provides realistic email threads covering all required
categories, edge cases, and threading headers without requiring a live
mailbox.

Linked issue

Closes #848

Changes

  • Updated test fixtures to use tests/fixtures/email/ground_truth.json
    if it already exists; otherwise the generator writes deterministic
    ground_truth.json and synthetic_inbox.mbox files into a temporary
    directory at test time, so the source tree is never mutated
  • Fixed test_generator_determinism_verify_mode to call
    _ensure_generated() itself so it's no longer order-dependent
  • Tightened filesystem and browser keyword regexes to use word boundaries
    • prevents mis-routing on slashes in URLs, "and/or", "online meeting"
  • Added force_activate(bundle_name) public method to ToolLoader to
    replace direct _state access
  • Wired reset_session() into conversation-start path
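
A minimal sketch of the temporary-directory approach described in the first bullet, using only the standard library. The function name `generate_fixtures`, the message contents, and the ground-truth schema are hypothetical stand-ins, not the generator in this PR. The sketch pins the mbox "From " separator line and sets no Date header, so two runs should produce byte-identical files.

```python
import hashlib
import json
import tempfile
from email.message import EmailMessage
from pathlib import Path

# Fixed separator line so the output does not depend on the current time.
# (Hypothetical value; the real generator may use a different one.)
FIXED_UNIXFROM = "From generator@example.test Thu Jan  1 00:00:00 1970"


def generate_fixtures(out_dir: Path, count: int = 3) -> dict:
    """Write a small synthetic mbox and its ground truth into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks, truth = [], []
    for i in range(count):
        message_id = f"<msg-{i}@example.test>"
        msg = EmailMessage()
        msg.set_unixfrom(FIXED_UNIXFROM)
        msg["From"] = "alice@example.test"
        msg["To"] = "bob@example.test"
        msg["Subject"] = f"Synthetic message {i}"
        msg["Message-ID"] = message_id
        if i > 0:
            # Threading header pointing at the previous message.
            msg["In-Reply-To"] = f"<msg-{i - 1}@example.test>"
        msg.set_content(f"Body of synthetic message {i}.\n")
        # mbox entries are separated by a blank line.
        chunks.append(msg.as_bytes(unixfrom=True) + b"\n")
        truth.append({"message_id": message_id,
                      "category": "fyi" if i % 2 else "action"})

    mbox_path = out_dir / "synthetic_inbox.mbox"
    truth_path = out_dir / "ground_truth.json"
    mbox_path.write_bytes(b"".join(chunks))
    truth_path.write_text(json.dumps(truth, indent=2, sort_keys=True))
    return {"mbox": mbox_path, "ground_truth": truth_path}


def sha256(path: Path) -> str:
    """Hex digest of a file, used to compare two generated copies."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


if __name__ == "__main__":
    # Generate into a throwaway directory; nothing under tests/ is touched.
    with tempfile.TemporaryDirectory() as tmp:
        paths = generate_fixtures(Path(tmp))
        print(sha256(paths["mbox"]))
```

In a pytest suite, the same function could be called from a fixture that receives `tmp_path`, which also lets a determinism test generate its own copy instead of relying on files left behind by earlier tests.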

Test plan

  • pytest tests/unit/test_synthetic_mbox.py - all passing
  • pytest tests/unit/test_synthetic_mbox.py::test_generator_determinism_verify_mode - passes in isolation
  • python util/lint.py --all - no failures

Checklist

  • I have linked a GitHub issue above (Closes #848).
  • I have described why this change is being made, not just what changed.
  • I have run linting and tests locally.
  • Documentation updated where applicable.

@theonlychant
Contributor Author

theonlychant commented Apr 24, 2026

For some reason, pylint keeps running into problems on my tests. It reports:

python3 util/lint.py --pylint
========================================
Running Code Quality Checks
========================================

[3/9] Running Pylint (errors only)...
----------------------------------------
[CMD] pylint src/gaia --rcfile .pylintrc --disable C0103,C0301,W0246,W0221,E1102,R0401,E0401,W0718,W0212

[!] Pylint found critical errors:
************* Module gaia.agents.base.console
src/gaia/agents/base/console.py:909:46: E1101: Module 'PIL.Image' has no 'LANCZOS' member (no-member)
************* Module gaia.mcp.servers.agent_ui_mcp
src/gaia/mcp/servers/agent_ui_mcp.py:441:43: E1101: Module 'PIL.Image' has no 'LANCZOS' member (no-member)



================================================================
                    LINT SUMMARY REPORT                        
================================================================

[STATS] Project Statistics:
   - Python Files: 355
   - Lines of Code: 160,254
   - Directories: src/gaia, tests

[RESULTS] Quality Check Results:

+--------------------------------+------------+-----------+
| Check                          | Status     | Issues    |
+--------------------------------+------------+-----------+
| Critical Errors (Pylint)       | [X] FAIL   | 2 errors  |
+--------------------------------+------------+-----------+

[SUMMARY] Statistics:
   - Total Checks Run: 1
   - Passed: 0 (0.0%)
   - Failed: 1 (100.0%)
   - Warnings: 0 (0.0%)

============================================================
[FAILED] QUALITY CHECKS FAILED
============================================================

[ERROR] Issues Found:
   - 1 critical error(s) - must fix before PR

[TIP] Review the error messages above and fix the issues.
[TIP] Use --fix flag to auto-fix formatting issues:
   python util/lint.py --black --fix
   python util/lint.py --isort --fix

I tried to resolve the issue, but nothing has worked so far. I will try to fix it later.

I made some changes in a PR from a different branch.

@github-actions github-actions Bot added the agents label May 3, 2026
@theonlychant
Contributor Author

I ran this again in my kernel, and the checks passed:

================================================================
                    LINT SUMMARY REPORT                        
================================================================

[STATS] Project Statistics:
   - Python Files: 362
   - Lines of Code: 162,969
   - Directories: src/gaia, tests

[RESULTS] Quality Check Results:

+--------------------------------+------------+-----------+
| Check                          | Status     | Issues    |
+--------------------------------+------------+-----------+
| Code Formatting (Black)        | [OK] PASS  | -         |
| Import Sorting (isort)         | [OK] PASS  | -         |
| Critical Errors (Pylint)       | [OK] PASS  | -         |
| Style Compliance (Flake8)      | [OK] PASS  | -         |
| Type Checking (MyPy)           | [!] WARN   | 1109 warns |
| Import Validation              | [OK] PASS  | -         |
| Security Check (Bandit)        | [!] WARN   | 54 warns  |
| Agent Conventions              | [!] WARN   | 11 warns  |
| Doc Version Consistency        | [OK] PASS  | -         |
+--------------------------------+------------+-----------+

[SUMMARY] Statistics:
   - Total Checks Run: 9
   - Passed: 6 (66.7%)
   - Failed: 0 (0.0%)
   - Warnings: 3 (33.3%)

============================================================
[SUCCESS] ALL QUALITY CHECKS PASSED!
[WARNING] 3 warning(s) found (non-blocking)
============================================================

[OK] Your code meets quality standards!
[OK] Ready for PR submission

@itomek itomek self-assigned this May 4, 2026
Collaborator

@itomek-amd itomek-amd left a comment

Thanks for bundling this — there's real, useful work here (the ToolLoader design is clean, the bundle policies are well-thought-out, and the synthetic-mbox generator is impressively thorough). Requesting changes on a few items before merge.

Findings below are from a code read; I didn't run pytest or lint locally, so the tests may pass cleanly. Disregard the existing PR-check failure (LANCZOS pylint error in screenshot_tools.py) — it reproduces on main and is pre-existing tech debt, not from this PR.

Blocking

  1. Pre-built fixtures (synthetic_inbox.mbox, ground_truth.json) are missing from the repo. Issue #848 explicitly asks for them to be committed. Without them, tests/fixtures/email/conftest.py _ensure_generated() silently generates a ~1MB .mbox into the source tree on first test run — meaning every contributor and CI runner mutates their working copy on import. Either commit the generated artifacts, or move generation to tmp_path so the source tree stays clean.

  2. test_generator_determinism_verify_mode is order-dependent. It runs generate_mbox.py --verify, but if the fixture files aren't present yet the script returns 1 with "Missing pre-built fixtures; run without --verify first.". The test only passes today because earlier tests in the module call _ensure_generated() first. Run this test in isolation (pytest tests/unit/test_synthetic_mbox.py::test_generator_determinism_verify_mode) and it will fail. Fix: ensure the --verify test calls _ensure_generated() itself, or test against a tmp_path-generated copy.

  3. Two keyword regexes in _setup_tool_bundles are too greedy and will mis-route.

    • filesystem includes r"[/\\]" — matches any slash, so "https://example.com", "and/or", "24/7", code paths in error messages all activate filesystem tools.
    • browser includes r"search.*web|google|look\s*up|online" — matches "look it up in the doc", "online meeting", and any sentence containing the substring online.
      Tighten to word-boundary patterns (e.g. r"\bhttps?://", r"\bgoogle\s", r"\blook\s+up\b").
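
To make item 3 concrete, here is a small sketch comparing the two greedy patterns quoted above with word-boundary alternatives. The greedy patterns are copied from the review text; the tightened patterns, the `route` helper, and the bundle names as dictionary keys are assumptions for illustration, not the code in `_setup_tool_bundles`.

```python
import re

# Patterns quoted in the review (the ones flagged as too greedy).
GREEDY = {
    "filesystem": re.compile(r"[/\\]"),
    "browser": re.compile(r"search.*web|google|look\s*up|online"),
}

# Hypothetical tightened patterns: anchored on word boundaries so that
# incidental slashes or the substring "online" no longer trigger a bundle.
TIGHT = {
    "filesystem": re.compile(
        r"\b(?:files?|folders?|director(?:y|ies))\b", re.IGNORECASE
    ),
    "browser": re.compile(
        r"\bhttps?://|\bgoogle\s|\blook\s+up\b|\bsearch\s+the\s+web\b",
        re.IGNORECASE,
    ),
}


def route(text: str, patterns: dict) -> set:
    """Return the names of all bundles whose pattern matches the text."""
    return {name for name, rx in patterns.items() if rx.search(text)}


if __name__ == "__main__":
    samples = [
        "See https://example.com for details",
        "Support is available 24/7 by phone and/or email",
        "Join the online meeting at 3pm",
        "Please look up the release notes",
        "Open the file report.txt",
    ]
    for s in samples:
        print(f"{s!r}: greedy={sorted(route(s, GREEDY))} "
              f"tight={sorted(route(s, TIGHT))}")
```

With the greedy patterns, the URL sample activates filesystem rather than browser because of its slashes, and "online meeting" activates browser; with the tightened patterns neither mis-route occurs.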

Non-blocking

  1. ToolLoader.reset_session() is defined but never called — confirm it's wired into the conversation-start path or drop it.
  2. ChatAgent._setup_tool_bundles writes directly to self.tool_loader._state["rag"].activated = True. Add a public force_activate(bundle_name) method on the loader so this isn't an encapsulation breach.
  3. Agent.rebuild_system_prompt() builds against the full _TOOL_REGISTRY rather than passing through the loader — verify this is intentional (it means the loader's filter only applies on the initial prompt build, not on rebuild).
  4. PR title "feat: PR for issue #848" doesn't follow the repo's conventional-commits convention. Consider feat(eval): synthetic mbox fixture for email-agent eval (#848).
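
For non-blocking items 1 and 2, a rough sketch of what a public activation method and a wired-in session reset could look like. Everything here is hypothetical: the `BundleState` dataclass, the constructor signature, `active_bundles`, and the `start_conversation` helper are assumptions made for illustration; only the names `ToolLoader`, `_state`, `activated`, `force_activate`, and `reset_session` come from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class BundleState:
    """Per-bundle session state (hypothetical shape)."""
    activated: bool = False


class ToolLoader:
    """Simplified stand-in for the loader discussed in this review."""

    def __init__(self, bundle_names):
        self._state = {name: BundleState() for name in bundle_names}

    def force_activate(self, bundle_name: str) -> None:
        """Public way to switch a bundle on without touching _state."""
        if bundle_name not in self._state:
            raise KeyError(f"unknown bundle: {bundle_name!r}")
        self._state[bundle_name].activated = True

    def reset_session(self) -> None:
        """Deactivate every bundle so a new conversation starts clean."""
        for state in self._state.values():
            state.activated = False

    def active_bundles(self) -> list:
        """Sorted names of the currently activated bundles."""
        return sorted(n for n, s in self._state.items() if s.activated)


def start_conversation(loader: ToolLoader, always_on=("rag",)) -> None:
    """Hypothetical conversation-start hook: reset, then re-enable defaults."""
    loader.reset_session()
    for name in always_on:
        loader.force_activate(name)


if __name__ == "__main__":
    loader = ToolLoader(["rag", "filesystem", "browser"])
    loader.force_activate("filesystem")
    start_conversation(loader)
    print(loader.active_bundles())
```

Callers such as `ChatAgent._setup_tool_bundles` would then call `force_activate("rag")` instead of writing to `_state` directly, and an unknown bundle name fails loudly rather than silently creating state.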

Once these land I'm happy to re-review and approve.

Comment thread tests/fixtures/email/conftest.py Outdated
Comment thread tests/unit/test_synthetic_mbox.py Outdated
Comment thread src/gaia/agents/chat/agent.py Outdated
Comment thread src/gaia/agents/chat/agent.py Outdated
Comment thread src/gaia/agents/base/tool_loader.py
@theonlychant theonlychant changed the title from "added-resolve for the issue-feat(email): synthetic .mbox dataset for email triage agent testing #848" to "feat(eval): synthetic mbox fixture for email-agent eval (#848)" May 4, 2026
…d encapsulation

Signed-off-by: theonlychant <sacehenry@gmail.com>
Signed-off-by: theonlychant <sacehenry@gmail.com>

Labels

agents tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(email): synthetic .mbox dataset for email triage agent testing

4 participants