feat(eval): synthetic mbox fixture for email-agent eval (#848) by theonlychant · Pull Request #854 · amd/gaia

theonlychant · 2026-04-23T21:52:51Z

Summary

Adds a synthetic .mbox dataset for email triage agent evaluation, with
a deterministic generator that writes fixtures to a temporary directory
at test time so the source tree is never mutated.

Why

The email triage agent had no test data to validate triage logic in CI.
This fixture provides realistic email threads covering all required
categories, edge cases, and threading headers without requiring a live
mailbox.

Linked issue

Closes #848

Changes

- Updated test fixtures to use tests/fixtures/email/ground_truth.json
  if it already exists; otherwise the generator writes deterministic
  ground_truth.json and synthetic_inbox.mbox into a temporary
  directory at test time, so the source tree is never mutated.
- Fixed test_generator_determinism_verify_mode to call
  _ensure_generated() itself so it is no longer order-dependent.
- Tightened the filesystem and browser keyword regexes to use word
  boundaries, which prevents mis-routing on slashes in URLs, "and/or",
  and "online meeting".
- Added a public force_activate(bundle_name) method to ToolLoader to
  replace direct _state access.
- Wired reset_session() into the conversation-start path.
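
As a rough sketch of the fixture fallback described in the first item (this is not the code in this PR): the helper below reuses a committed fixture directory when ground_truth.json is present and otherwise generates both files into a temporary directory. The names resolve_fixture_dir and generate_fixtures, the seed, the category labels, and the message fields are all hypothetical stand-ins chosen for illustration.

```python
import json
import mailbox
import random
import tempfile
from email.message import EmailMessage
from pathlib import Path


def generate_fixtures(out_dir: Path, seed: int = 848, count: int = 5) -> None:
    """Write a tiny mbox plus labels from a fixed seed (hypothetical generator)."""
    rng = random.Random(seed)  # fixed seed, so labels repeat on every run
    categories = ["urgent", "newsletter", "personal"]  # assumed label set
    truth = {}
    box = mailbox.mbox(str(out_dir / "synthetic_inbox.mbox"))
    try:
        for i in range(count):
            msg = EmailMessage()
            msg_id = f"<msg-{i}@example.test>"
            msg["Message-ID"] = msg_id
            msg["From"] = f"sender{i}@example.test"
            msg["To"] = "me@example.test"
            msg["Subject"] = f"Synthetic message {i}"
            # Fixed dates instead of the current time, to keep output stable.
            msg["Date"] = "Thu, 01 Jan 2026 00:00:00 +0000"
            msg.set_unixfrom("From MAILER-DAEMON Thu Jan  1 00:00:00 2026")
            msg.set_content(f"Body of message {i}\n")
            box.add(msg)
            truth[msg_id] = rng.choice(categories)
    finally:
        box.close()  # close() also flushes pending messages to disk
    (out_dir / "ground_truth.json").write_text(
        json.dumps(truth, indent=2, sort_keys=True), encoding="utf-8"
    )


def resolve_fixture_dir(committed_dir: Path) -> Path:
    """Prefer committed fixtures; otherwise generate into a temp directory."""
    if (committed_dir / "ground_truth.json").exists():
        return committed_dir
    tmp_dir = Path(tempfile.mkdtemp(prefix="email_fixture_"))
    generate_fixtures(tmp_dir)
    return tmp_dir
```

In a pytest suite the same idea would more likely be a session-scoped fixture built on tmp_path_factory, which also cleans up old temporary directories; the plain function form above only keeps the sketch framework-free.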

Test plan

pytest tests/unit/test_synthetic_mbox.py - all passing
pytest tests/unit/test_synthetic_mbox.py::test_generator_determinism_verify_mode - passes in isolation
python util/lint.py --all - no failures

Checklist

I have linked a GitHub issue above (Closes #848).
I have described why this change is being made, not just what changed.
I have run linting and tests locally.
Documentation updated where applicable.

theonlychant · 2026-04-24T21:58:13Z

For some reason, pylint keeps failing on my runs with the following output:

python3 util/lint.py --pylint
========================================
Running Code Quality Checks
========================================

[3/9] Running Pylint (errors only)...
----------------------------------------
[CMD] pylint src/gaia --rcfile .pylintrc --disable C0103,C0301,W0246,W0221,E1102,R0401,E0401,W0718,W0212

[!] Pylint found critical errors:
************* Module gaia.agents.base.console
src/gaia/agents/base/console.py:909:46: E1101: Module 'PIL.Image' has no 'LANCZOS' member (no-member)
************* Module gaia.mcp.servers.agent_ui_mcp
src/gaia/mcp/servers/agent_ui_mcp.py:441:43: E1101: Module 'PIL.Image' has no 'LANCZOS' member (no-member)



================================================================
                    LINT SUMMARY REPORT                        
================================================================

[STATS] Project Statistics:
   - Python Files: 355
   - Lines of Code: 160,254
   - Directories: src/gaia, tests

[RESULTS] Quality Check Results:

+--------------------------------+------------+-----------+
| Check                          | Status     | Issues    |
+--------------------------------+------------+-----------+
| Critical Errors (Pylint)       | [X] FAIL   | 2 errors  |
+--------------------------------+------------+-----------+

[SUMMARY] Statistics:
   - Total Checks Run: 1
   - Passed: 0 (0.0%)
   - Failed: 1 (100.0%)
   - Warnings: 0 (0.0%)

============================================================
[FAILED] QUALITY CHECKS FAILED
============================================================

[ERROR] Issues Found:
   - 1 critical error(s) - must fix before PR

[TIP] Review the error messages above and fix the issues.
[TIP] Use --fix flag to auto-fix formatting issues:
   python util/lint.py --black --fix
   python util/lint.py --isort --fix

I tried resolving the issue, but nothing seems to work right now. I will try to fix it later.

I made some changes in a PR from a different branch.

theonlychant · 2026-05-03T00:45:06Z

I ran this again in my kernel and the checks passed:

================================================================
                    LINT SUMMARY REPORT                        
================================================================

[STATS] Project Statistics:
   - Python Files: 362
   - Lines of Code: 162,969
   - Directories: src/gaia, tests

[RESULTS] Quality Check Results:

+--------------------------------+------------+-----------+
| Check                          | Status     | Issues    |
+--------------------------------+------------+-----------+
| Code Formatting (Black)        | [OK] PASS  | -         |
| Import Sorting (isort)         | [OK] PASS  | -         |
| Critical Errors (Pylint)       | [OK] PASS  | -         |
| Style Compliance (Flake8)      | [OK] PASS  | -         |
| Type Checking (MyPy)           | [!] WARN   | 1109 warns |
| Import Validation              | [OK] PASS  | -         |
| Security Check (Bandit)        | [!] WARN   | 54 warns  |
| Agent Conventions              | [!] WARN   | 11 warns  |
| Doc Version Consistency        | [OK] PASS  | -         |
+--------------------------------+------------+-----------+

[SUMMARY] Statistics:
   - Total Checks Run: 9
   - Passed: 6 (66.7%)
   - Failed: 0 (0.0%)
   - Warnings: 3 (33.3%)

============================================================
[SUCCESS] ALL QUALITY CHECKS PASSED!
[WARNING] 3 warning(s) found (non-blocking)
============================================================

[OK] Your code meets quality standards!
[OK] Ready for PR submission

itomek-amd

Thanks for bundling this — there's real, useful work here (the ToolLoader design is clean, the bundle policies are well-thought-out, and the synthetic-mbox generator is impressively thorough). Requesting changes on a few items before merge.

Findings below are from a code read; I didn't run pytest or lint locally, so the tests may pass cleanly. Disregard the existing PR-check failure (LANCZOS pylint error in screenshot_tools.py) — it reproduces on main and is pre-existing tech debt, not from this PR.

Blocking

Pre-built fixtures (synthetic_inbox.mbox, ground_truth.json) are missing from the repo. Issue #848 explicitly asks for them to be committed. Without them, tests/fixtures/email/conftest.py _ensure_generated() silently generates a ~1MB .mbox into the source tree on first test run — meaning every contributor and CI runner mutates their working copy on import. Either commit the generated artifacts, or move generation to tmp_path so the source tree stays clean.
test_generator_determinism_verify_mode is order-dependent. It runs generate_mbox.py --verify, but if the fixture files aren't present yet the script returns 1 with "Missing pre-built fixtures; run without --verify first.". The test only passes today because earlier tests in the module call _ensure_generated() first. Run this test in isolation (pytest tests/unit/test_synthetic_mbox.py::test_generator_determinism_verify_mode) and it will fail. Fix: ensure the --verify test calls _ensure_generated() itself, or test against a tmp_path-generated copy.
Two keyword regexes in _setup_tool_bundles are too greedy and will mis-route.
- filesystem includes r"[/\\]" — matches any slash, so "https://example.com", "and/or", "24/7", code paths in error messages all activate filesystem tools.
- browser includes r"search.*web|google|look\s*up|online" — matches "look it up in the doc", "online meeting", and any sentence containing the substring online.
  Tighten to word-boundary patterns (e.g. r"\bhttps?://", r"\bgoogle\s", r"\blook\s+up\b").
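
To illustrate the mis-routing, here is a small sketch that runs a few sample messages through the quoted patterns and through tightened variants. The tightened browser pattern combines the three suggestions above; the tightened filesystem pattern (a path-like token that starts with /, ./, ../ or ~/) is an assumption made for this example, as is the activated helper. None of this is the code in _setup_tool_bundles.

```python
import re

# Patterns as quoted in this review.
LOOSE = {
    "filesystem": re.compile(r"[/\\]"),
    "browser": re.compile(r"search.*web|google|look\s*up|online", re.IGNORECASE),
}

# Illustrative tightened variants (the filesystem one is an assumption).
TIGHT = {
    # A whitespace-delimited token that begins like a Unix-style path.
    "filesystem": re.compile(r"(?<!\S)(?:~|\.{1,2})?/[\w.\-]+"),
    # Explicit URLs, "google " as a word, or the phrase "look up".
    "browser": re.compile(r"\bhttps?://|\bgoogle\s|\blook\s+up\b", re.IGNORECASE),
}


def activated(text: str, patterns: dict) -> set:
    """Return the names of the bundles whose pattern matches the text."""
    return {name for name, pattern in patterns.items() if pattern.search(text)}


samples = [
    "Join the online meeting at 3",
    "We support and/or 24/7 coverage",
    "Open https://example.com",
    "Read /tmp/report.txt",
    "Please look up the weather",
]

for text in samples:
    print(f"{text!r}: loose={sorted(activated(text, LOOSE))} "
          f"tight={sorted(activated(text, TIGHT))}")
```

With the loose patterns, the URL activates filesystem rather than browser, because the slash matches and nothing in the browser pattern does. The tightened set routes it to browser and leaves "and/or" and "online meeting" unrouted.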

Non-blocking

ToolLoader.reset_session() is defined but never called — confirm it's wired into the conversation-start path or drop it.
ChatAgent._setup_tool_bundles writes directly to self.tool_loader._state["rag"].activated = True. Add a public force_activate(bundle_name) method on the loader so this isn't an encapsulation breach.
Agent.rebuild_system_prompt() builds against the full _TOOL_REGISTRY rather than passing through the loader — verify this is intentional (it means the loader's filter only applies on the initial prompt build, not on rebuild).
PR title "feat: PR for issue #848" doesn't follow the repo's conventional-commits convention. Consider feat(eval): synthetic mbox fixture for email-agent eval (#848).
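
One possible shape for the force_activate and reset_session suggestions, as a minimal sketch. Only force_activate(bundle_name), reset_session(), _state, and the activated flag come from this thread; the BundleState dataclass, the constructor signature, and is_active are assumptions made for illustration, and the class is a stand-in rather than the loader in src/gaia/agents/base/tool_loader.py.

```python
from dataclasses import dataclass


@dataclass
class BundleState:
    """Per-bundle activation flag (hypothetical structure)."""

    activated: bool = False


class ToolLoader:
    """Illustrative stand-in for the real loader."""

    def __init__(self, bundle_names):
        self._state = {name: BundleState() for name in bundle_names}

    def force_activate(self, bundle_name: str) -> None:
        """Activate a bundle explicitly, without a keyword match."""
        if bundle_name not in self._state:
            raise ValueError(f"Unknown bundle: {bundle_name!r}")
        self._state[bundle_name].activated = True

    def is_active(self, bundle_name: str) -> bool:
        """Read-only view of the activation flag (assumed helper)."""
        return self._state[bundle_name].activated

    def reset_session(self) -> None:
        """Clear all activations; meant to run at conversation start."""
        for state in self._state.values():
            state.activated = False


# Callers go through the public method instead of touching _state.
loader = ToolLoader(["rag", "filesystem"])
loader.force_activate("rag")
print(loader.is_active("rag"))  # True
loader.reset_session()
print(loader.is_active("rag"))  # False
```

Raising on an unknown bundle name is a design choice in this sketch: a typo in the caller fails loudly instead of silently creating state.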

Once these land I'm happy to re-review and approve.

theonlychant added 4 commits April 19, 2026 03:51

trying to address the issue in::Tool-registry scaling: resolve scratc…

2f79fa9

…hpad/memory collision via amd#688 dynamic loading amd#800

fixed::lint-errors

c416cd3

added-resolve for the issue:: feat(email): synthetic .mbox dataset fo…

2f2523b

…r email triage agent testing amd#848

Merge branch 'amd:main' into patch-1

40fcb70

theonlychant requested a review from kovtcharov-amd as a code owner April 23, 2026 21:52

github-actions bot added the agents and tests labels Apr 23, 2026

kovtcharov added this to the v0.18.0 — Agent Eval Benchmark [OSS] milestone Apr 26, 2026

kovtcharov removed the agents label Apr 26, 2026

fix(lint): resolve linting errors

8bcebe6

github-actions Bot added the agents label May 3, 2026

itomek self-assigned this May 4, 2026

itomek-amd requested changes May 4, 2026

Comment thread tests/fixtures/email/conftest.py Outdated

Comment thread tests/unit/test_synthetic_mbox.py Outdated

Comment thread src/gaia/agents/chat/agent.py Outdated

Comment thread src/gaia/agents/chat/agent.py Outdated

Comment thread src/gaia/agents/base/tool_loader.py

theonlychant changed the title ~~added-resolve for the issue-feat(email): synthetic .mbox dataset for email triage agent testing #848~~ feat(eval): synthetic mbox fixture for email-agent eval (#848) May 4, 2026

theonlychant added 2 commits May 4, 2026 13:22

fix(eval): address review feedback — fix fixtures, regex patterns, an…

6c0fe11

…d encapsulation Signed-off-by: theonlychant <sacehenry@gmail.com>

fix(eval): resolve lint errors

1803d83

Signed-off-by: theonlychant <sacehenry@gmail.com>

itomek mentioned this pull request May 5, 2026

feat(agents): dynamic tool-loader with bundle-gated activation for #688 #958

Draft

7 tasks

feat(eval): synthetic mbox fixture for email-agent eval (#848) #854
theonlychant wants to merge 7 commits into amd:main from theonlychant:patch-1

4 participants
