feat(eval): synthetic mbox fixture for email-agent eval (#848) #854
theonlychant wants to merge 7 commits into amd:main
…r email triage agent testing amd#848
For some reason Pylint keeps running into problems on my tests. Running python3 util/lint.py --pylint prints:
========================================
Running Code Quality Checks
========================================
[3/9] Running Pylint (errors only)...
----------------------------------------
[CMD] pylint src/gaia --rcfile .pylintrc --disable C0103,C0301,W0246,W0221,E1102,R0401,E0401,W0718,W0212
[!] Pylint found critical errors:
************* Module gaia.agents.base.console
src/gaia/agents/base/console.py:909:46: E1101: Module 'PIL.Image' has no 'LANCZOS' member (no-member)
************* Module gaia.mcp.servers.agent_ui_mcp
src/gaia/mcp/servers/agent_ui_mcp.py:441:43: E1101: Module 'PIL.Image' has no 'LANCZOS' member (no-member)
================================================================
LINT SUMMARY REPORT
================================================================
[STATS] Project Statistics:
- Python Files: 355
- Lines of Code: 160,254
- Directories: src/gaia, tests
[RESULTS] Quality Check Results:
+--------------------------------+------------+-----------+
| Check | Status | Issues |
+--------------------------------+------------+-----------+
| Critical Errors (Pylint) | [X] FAIL | 2 errors |
+--------------------------------+------------+-----------+
[SUMMARY] Statistics:
- Total Checks Run: 1
- Passed: 0 (0.0%)
- Failed: 1 (100.0%)
- Warnings: 0 (0.0%)
============================================================
[FAILED] QUALITY CHECKS FAILED
============================================================
[ERROR] Issues Found:
- 1 critical error(s) - must fix before PR
[TIP] Review the error messages above and fix the issues.
[TIP] Use --fix flag to auto-fix formatting issues:
python util/lint.py --black --fix
python util/lint.py --isort --fix
I tried resolving the issue, but nothing seems to work right now. I will try to fix it later. I made some changes in a different branch PR.
I ran this again in my kernel and the checks were good:
================================================================
LINT SUMMARY REPORT
================================================================
[STATS] Project Statistics:
- Python Files: 362
- Lines of Code: 162,969
- Directories: src/gaia, tests
[RESULTS] Quality Check Results:
+--------------------------------+------------+-----------+
| Check | Status | Issues |
+--------------------------------+------------+-----------+
| Code Formatting (Black) | [OK] PASS | - |
| Import Sorting (isort) | [OK] PASS | - |
| Critical Errors (Pylint) | [OK] PASS | - |
| Style Compliance (Flake8) | [OK] PASS | - |
| Type Checking (MyPy) | [!] WARN | 1109 warns |
| Import Validation | [OK] PASS | - |
| Security Check (Bandit) | [!] WARN | 54 warns |
| Agent Conventions | [!] WARN | 11 warns |
| Doc Version Consistency | [OK] PASS | - |
+--------------------------------+------------+-----------+
[SUMMARY] Statistics:
- Total Checks Run: 9
- Passed: 6 (66.7%)
- Failed: 0 (0.0%)
- Warnings: 3 (33.3%)
============================================================
[SUCCESS] ALL QUALITY CHECKS PASSED!
[WARNING] 3 warning(s) found (non-blocking)
============================================================
[OK] Your code meets quality standards!
[OK] Ready for PR submission
Thanks for bundling this — there's real, useful work here (the ToolLoader design is clean, the bundle policies are well-thought-out, and the synthetic-mbox generator is impressively thorough). Requesting changes on a few items before merge.
Findings below are from a code read; I didn't run pytest or lint locally, so the tests may pass cleanly. Disregard the existing PR-check failure (LANCZOS pylint error in screenshot_tools.py) — it reproduces on main and is pre-existing tech debt, not from this PR.
Blocking
- Pre-built fixtures (synthetic_inbox.mbox, ground_truth.json) are missing from the repo. Issue #848 explicitly asks for them to be committed. Without them, _ensure_generated() in tests/fixtures/email/conftest.py silently generates a ~1MB .mbox into the source tree on first test run, meaning every contributor and CI runner mutates their working copy on import. Either commit the generated artifacts, or move generation to tmp_path so the source tree stays clean.
- test_generator_determinism_verify_mode is order-dependent. It runs generate_mbox.py --verify, but if the fixture files aren't present yet the script returns 1 with "Missing pre-built fixtures; run without --verify first.". The test only passes today because earlier tests in the module call _ensure_generated() first. Run this test in isolation (pytest tests/unit/test_synthetic_mbox.py::test_generator_determinism_verify_mode) and it will fail. Fix: ensure the --verify test calls _ensure_generated() itself, or test against a tmp_path-generated copy.
- Two keyword regexes in _setup_tool_bundles are too greedy and will mis-route.
  - filesystem includes r"[/\\]", which matches any slash, so "https://example.com", "and/or", "24/7", and code paths in error messages all activate filesystem tools.
  - browser includes r"search.*web|google|look\s*up|online", which matches "look up the doc", "online meeting", and any sentence containing the substring online.
  Tighten to word-boundary patterns (e.g. r"\bhttps?://", r"\bgoogle\s", r"\blook\s+up\b").
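To make the mis-routing concrete, here is a small sketch that runs the two quoted patterns against a few unrelated inputs. The TIGHT_BROWSER pattern is only an assumption assembled from the suggested fragments, not the pattern the PR has to adopt, and the activates helper is a hypothetical name used for illustration.

```python
import re

# Loose patterns as quoted in the review.
LOOSE_FILESYSTEM = re.compile(r"[/\\]")
LOOSE_BROWSER = re.compile(r"search.*web|google|look\s*up|online")

# Hypothetical tightened browser pattern built from the suggested fragments.
TIGHT_BROWSER = re.compile(r"\bhttps?://|\bgoogle\s|\blook\s+up\b")


def activates(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern fires anywhere in the text."""
    return pattern.search(text) is not None


# Inputs that are not requests to touch files or search the web.
unrelated = ["and/or", "24/7", "online meeting"]

for text in unrelated:
    print(
        f"{text!r}: "
        f"filesystem(loose)={activates(LOOSE_FILESYSTEM, text)} "
        f"browser(loose)={activates(LOOSE_BROWSER, text)} "
        f"browser(tight)={activates(TIGHT_BROWSER, text)}"
    )
```

A table of such inputs would also work as a parametrized regression test, so any future pattern change has to keep the false positives out.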
Non-blocking
- ToolLoader.reset_session() is defined but never called. Confirm it's wired into the conversation-start path or drop it.
- ChatAgent._setup_tool_bundles writes directly to self.tool_loader._state["rag"].activated = True. Add a public force_activate(bundle_name) method on the loader so this isn't an encapsulation breach.
- Agent.rebuild_system_prompt() builds against the full _TOOL_REGISTRY rather than passing through the loader. Verify this is intentional (it means the loader's filter only applies on the initial prompt build, not on rebuild).
- PR title "feat: PR for issue #848" doesn't follow the repo's conventional-commits convention. Consider feat(eval): synthetic mbox fixture for email-agent eval (#848).
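For the encapsulation point, a minimal sketch of what a public activation method could look like. The class layout here (a BundleState dataclass, a _state dict keyed by bundle name, an is_active helper) is assumed for illustration and may not match the real ToolLoader.

```python
from dataclasses import dataclass


@dataclass
class BundleState:
    """Per-bundle session state (hypothetical shape)."""
    activated: bool = False


class ToolLoader:
    """Stand-in loader; only the parts relevant to the review are sketched."""

    def __init__(self, bundle_names):
        self._state = {name: BundleState() for name in bundle_names}

    def force_activate(self, bundle_name: str) -> None:
        """Activate a bundle without callers reaching into _state."""
        if bundle_name not in self._state:
            raise KeyError(f"unknown bundle: {bundle_name!r}")
        self._state[bundle_name].activated = True

    def is_active(self, bundle_name: str) -> bool:
        return self._state[bundle_name].activated

    def reset_session(self) -> None:
        """Clear all activations; intended for the conversation-start path."""
        for state in self._state.values():
            state.activated = False


loader = ToolLoader(["rag", "filesystem"])
loader.force_activate("rag")  # replaces loader._state["rag"].activated = True
```

Routing activation through one method also gives a single place to validate bundle names, which direct attribute writes cannot do.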
Once these land I'm happy to re-review and approve.
…d encapsulation Signed-off-by: theonlychant <sacehenry@gmail.com>
Signed-off-by: theonlychant <sacehenry@gmail.com>
Summary
Adds a synthetic .mbox dataset for email triage agent evaluation, with a deterministic generator that writes fixtures to a temporary directory at test time so the source tree is never mutated.
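As a rough illustration of the approach described above, the sketch below writes a seeded mbox into a caller-supplied directory and checks that two runs produce identical bytes. The function name, signature, seed, categories, and message layout are all assumptions for the example, not the PR's actual generator. In a pytest suite the output directory could come from tmp_path or tmp_path_factory.mktemp(...).

```python
import hashlib
import random
import tempfile
from pathlib import Path


def generate_mbox(out_dir: Path, seed: int = 848, count: int = 5) -> Path:
    """Write a small synthetic mbox into out_dir and return its path."""
    rng = random.Random(seed)  # seeded RNG keeps the output reproducible
    categories = ["urgent", "newsletter", "spam"]  # illustrative labels only
    lines = []
    for i in range(count):
        category = rng.choice(categories)
        lines += [
            # Fixed "From " separator line: no wall-clock time leaks in.
            "From sender@example.com Mon Jan  1 00:00:00 2024",
            f"Message-ID: <msg-{i}@example.com>",
            f"Subject: [{category}] synthetic message {i}",
            "",
            f"Body of synthetic message {i}.",
            "",
        ]
    path = out_dir / "synthetic_inbox.mbox"
    path.write_bytes("\n".join(lines).encode("utf-8"))
    return path


def digest(path: Path) -> str:
    """SHA-256 of the file contents, used to compare two runs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


# Two independent temporary directories stand in for two test sessions.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    first = digest(generate_mbox(Path(a)))
    second = digest(generate_mbox(Path(b)))
    same_output = first == second
```

Because nothing is written outside the temporary directories, the working copy stays clean no matter how many times the tests run.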
Why
The email triage agent had no test data to validate triage logic in CI.
This fixture provides realistic email threads covering all required
categories, edge cases, and threading headers without requiring a live
mailbox.
Linked issue
Closes #848
Changes
- Uses tests/fixtures/email/ground_truth.json if it already exists; otherwise the generator writes deterministic ground_truth.json and synthetic_inbox.mbox into a temporary directory at test time, so the source tree is never mutated
- Updated test_generator_determinism_verify_mode to call _ensure_generated() itself so it's no longer order-dependent
- Added force_activate(bundle_name) public method to ToolLoader to replace direct _state access
- Wired reset_session() into conversation-start path
Test plan
- pytest tests/unit/test_synthetic_mbox.py - all passing
- pytest tests/unit/test_synthetic_mbox.py::test_generator_determinism_verify_mode - passes in isolation
- python util/lint.py --all - no failures
Checklist
Closes #848).
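For reference, the order-independence fix for the verify-mode test can be sketched as follows. This is a simplified stand-in under stated assumptions: render_fixture, ensure_generated, verify, and check_verify_mode_in_isolation are hypothetical names, and the real generate_mbox.py --verify logic may compare more than raw bytes.

```python
import tempfile
from pathlib import Path

FIXTURE_NAME = "synthetic_inbox.mbox"


def render_fixture() -> bytes:
    """Stand-in for the deterministic generator: same bytes on every call."""
    return (
        b"From sender@example.com Mon Jan  1 00:00:00 2024\n"
        b"Subject: synthetic\n\nBody.\n"
    )


def ensure_generated(fixture_dir: Path) -> Path:
    """Create the fixture if it is missing, then return its path."""
    path = fixture_dir / FIXTURE_NAME
    if not path.exists():
        path.write_bytes(render_fixture())
    return path


def verify(fixture_dir: Path) -> int:
    """Return 0 if the fixture on disk matches a fresh render, else 1."""
    path = fixture_dir / FIXTURE_NAME
    if not path.exists():
        print("Missing pre-built fixtures; run without --verify first.")
        return 1
    return 0 if path.read_bytes() == render_fixture() else 1


def check_verify_mode_in_isolation(fixture_dir: Path) -> None:
    """The check creates its own precondition instead of relying on test order."""
    ensure_generated(fixture_dir)
    assert verify(fixture_dir) == 0


with tempfile.TemporaryDirectory() as tmp:
    status_before = verify(Path(tmp))  # nothing generated yet
    check_verify_mode_in_isolation(Path(tmp))
    status_after = verify(Path(tmp))
```

The key point is that the check no longer depends on an earlier test having populated the directory.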