Tool-registry scaling: resolve scratchpad/memory collision via #688 dynamic loading

## Context

Two PRs currently in flight add a combined **19 new tools** to ChatAgent and push the total tool count into "too many for reliable selection" territory:

- **PR #495** (chat-agent file navigation) — 14 new tools: 6 filesystem (`browse_directory`, `tree`, `file_info`, `find_files`, `read_file`, `bookmark`), 5 scratchpad (`create_table`, `insert_data`, `query_data`, `list_tables`, `drop_table`), 3 browser (`fetch_page`, `search_web`, `download_file`).
- **PR #606** (agent memory v2) — 5 new tools: `remember`, `recall`, `update_memory`, `forget`, `search_past_conversations`.

After both land, ChatAgent will have ~**27 tools** simultaneously registered. Per issue #688, ~12K tokens of the 32K context (~37%) will be consumed by tool descriptions alone.
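A rough check of these figures, assuming (purely for illustration) that every tool description costs about the same number of tokens and that "32K" means a 32,768-token window:

```python
# Back-of-the-envelope tool-prompt arithmetic.
# Assumptions: "32K" = 32,768 tokens; all 27 descriptions cost about the same.
CONTEXT_TOKENS = 32_768
TOOL_PROMPT_TOKENS = 12_000   # ~12K figure cited from #688
TOOL_COUNT = 27

share = TOOL_PROMPT_TOKENS / CONTEXT_TOKENS   # fraction of context spent on tools
per_tool = TOOL_PROMPT_TOKENS / TOOL_COUNT    # average tokens per description


def prompt_tokens(visible_tools: int) -> float:
    """Estimated tool-prompt size when only `visible_tools` descriptions are rendered."""
    return visible_tools * per_tool


print(f"tool prompt share of context: {share:.0%}")   # 37%
print(f"average per tool: {per_tool:.0f} tokens")
print(f"9 visible tools: {prompt_tokens(9):.0f} tokens")
```

Under that uniform-cost assumption, showing only 9 tools (for example `core` plus `rag`) lands at about 4K tokens, the upper end of the ~3-4K target in the acceptance criteria. Real descriptions vary in length, so measured numbers will differ.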

## The concrete collision

`scratchpad.query_data(sql)` (from #495) and `memory.recall(query)` (from #606) answer different questions, but both are terminal "query" steps. Without disambiguation, the LLM can pick the wrong one whenever the user's phrasing doesn't make the data source obvious:

| User query | Right tool | Why |
|---|---|---|
| "What did I spend on groceries in March?" | `query_data` | Structured data from session pipeline |
| "What did I learn about FTS5 last week?" | `recall` | Semantic concept across history |
| "How many research papers mention transformer attention?" | Ambiguous | Depends on where the data is stored |

## Why this needs to be fixed as part of #688, not in #495 or #606

- Both source PRs are in review/merge-ready state; interrupting them to bolt on dynamic loading delays two shipments
- The fix is structural, not behavioral — it belongs in a dedicated refactor PR
- PR #606 already ships **exactly the substrate** #688 needs (`tool_history` table, `_execute_tool` wrapper, `get_memory_system_prompt` injection point)
- Implementing bundle-based loading isolated from feature work is easier to review

## Proposed resolution (implements #688 Phase 1)

Introduce a `ToolLoader` abstraction that gates tool visibility at prompt-generation time, not at registration time. The module-level `_TOOL_REGISTRY` remains the source of truth for *all* registered tools; the loader picks which subset appears in the LLM prompt each turn.

**Bundles for the current tool set (27 tools):**

| Bundle | Always on? | Activation trigger | Tools |
|---|---|---|---|
| `core` | ✅ | — | `read_file` (RAG-side), `list_files`, `run_shell_command` |
| `rag` | conditional | indexed docs exist | `query_documents`, `query_specific_file`, `index_document`, `list_indexed_documents`, `search_indexed_chunks`, `index_directory` |
| `filesystem` | conditional | file/folder keywords, or recent use | `browse_directory`, `tree`, `file_info`, `find_files`, `read_file` (FS-side), `bookmark` |
| `scratchpad` | conditional | after `create_table` in session, or keyword match | `create_table`, `insert_data`, `query_data`, `list_tables`, `drop_table` |
| `browser` | conditional | URL pattern or web-search keywords | `fetch_page`, `search_web`, `download_file` |
| `memory` | conditional | after any `remember` in session, or keyword match | `remember`, `recall`, `update_memory`, `forget`, `search_past_conversations` |
| `mcp_<server>` | per-server | user enabled the MCP server | dynamic per server |
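The table can be kept as data so membership and triggers live in one place. In this sketch the `Bundle` type, `keyword_match`, and every keyword and URL regex are illustrative assumptions (the issue specifies trigger categories, not concrete words); `mcp_<server>` bundles are left out because they are created per enabled server.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bundle:
    name: str
    tools: tuple
    always_on: bool = False
    # Hypothetical trigger data; the issue names trigger categories, not keywords.
    keywords: tuple = ()
    url_pattern: Optional[str] = None


BUNDLES = {b.name: b for b in [
    Bundle("core", ("read_file", "list_files", "run_shell_command"), always_on=True),
    # rag is gated on "indexed docs exist", a state check rather than a keyword.
    Bundle("rag", ("query_documents", "query_specific_file", "index_document",
                   "list_indexed_documents", "search_indexed_chunks", "index_directory")),
    # read_file also appears here (FS-side variant), as in the table.
    Bundle("filesystem", ("browse_directory", "tree", "file_info", "find_files",
                          "read_file", "bookmark"), keywords=("file", "folder")),
    Bundle("scratchpad", ("create_table", "insert_data", "query_data",
                          "list_tables", "drop_table"), keywords=("table", "sql", "spend")),
    Bundle("browser", ("fetch_page", "search_web", "download_file"),
           keywords=("search the web",), url_pattern=r"https?://\S+"),
    Bundle("memory", ("remember", "recall", "update_memory", "forget",
                      "search_past_conversations"), keywords=("remember", "last week")),
]}


def keyword_match(bundle: Bundle, message: str) -> bool:
    """Naive substring/regex trigger check for a single bundle."""
    text = message.lower()
    if any(k in text for k in bundle.keywords):
        return True
    return bool(bundle.url_pattern and re.search(bundle.url_pattern, message))


unique_tools = {t for b in BUNDLES.values() for t in b.tools}
print(len(unique_tools))  # 27
```

The uniqueness count reproduces the 27-tool figure only because `read_file` is counted once; if the RAG-side and FS-side variants end up registered under distinct names, the total becomes 28.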

**Tool routing decision order:**
1. Always-on bundles load (always `core`)
2. If memory's `tool_history` shows the user used a bundle in the last 24h → keep it warm
3. If the current user message contains activation keywords → activate the bundle
4. If mid-workflow (e.g. just created a scratchpad table) → keep the bundle active for the rest of the session
5. LLM can explicitly request via meta-tool `activate_bundle(name)` (phase 2)
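The decision order above could be composed as a per-turn function roughly like this. It is a sketch under stated assumptions: `SessionState`, `select_bundles`, `STICKY_TOOLS`, and the keyword lists are hypothetical, the `last_used` timestamps stand in for a lookup against memory's `tool_history`, and `activate_bundle` (phase 2) is modeled as a plain method rather than a registered tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

WARM_WINDOW = timedelta(hours=24)

# Hypothetical per-bundle activation keywords (step 3).
KEYWORDS = {
    "filesystem": ("file", "folder"),
    "scratchpad": ("table", "sql"),
    "browser": ("http://", "https://", "search the web"),
    "memory": ("remember", "last week"),
}

# Tools whose use pins their bundle for the rest of the session (step 4).
STICKY_TOOLS = {"create_table": "scratchpad", "remember": "memory"}


@dataclass
class SessionState:
    # bundle -> last time one of its tools ran (stand-in for tool_history rows)
    last_used: dict = field(default_factory=dict)
    sticky: set = field(default_factory=set)
    requested: set = field(default_factory=set)

    def record_tool_call(self, tool: str, bundle: str, now: datetime) -> None:
        self.last_used[bundle] = now
        if tool in STICKY_TOOLS:
            self.sticky.add(STICKY_TOOLS[tool])

    def activate_bundle(self, name: str) -> None:
        """Phase-2 meta-tool: the LLM explicitly asks for a bundle."""
        self.requested.add(name)


def select_bundles(message: str, state: SessionState, now: datetime,
                   has_indexed_docs: bool = False) -> set:
    selected = {"core"}                                    # 1. always on
    if has_indexed_docs:
        selected.add("rag")                                # rag trigger from the table
    selected |= {b for b, t in state.last_used.items()
                 if now - t <= WARM_WINDOW}                # 2. warm from recent use
    text = message.lower()
    selected |= {b for b, kws in KEYWORDS.items()
                 if any(k in text for k in kws)}           # 3. keyword activation
    selected |= state.sticky                               # 4. mid-workflow
    selected |= state.requested                            # 5. explicit request
    return selected


start = datetime(2024, 1, 1, 9, 0)
print(sorted(select_bundles("Show me the files in my Downloads folder", SessionState(), start)))
# ['core', 'filesystem']

state = SessionState()
state.record_tool_call("create_table", "scratchpad", start)
later = start + timedelta(hours=30)  # past the 24h warm window
print(sorted(select_bundles("thanks, now summarize it", state, later)))
# ['core', 'scratchpad']  (kept by the sticky rule, not by recency)
```

Recomputing the set on every turn, rather than caching it, is what lets a session that starts with file browsing and pivots to web research pick up `browser` on the turn where the pivot happens.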

## Acceptance criteria

- [ ] ChatAgent tool prompt drops from ~12K → ~3-4K tokens in typical sessions (#688 target)
- [ ] `scratchpad.query_data` and `memory.recall` are not both in the prompt simultaneously unless the user's message / session history justifies it
- [ ] Core tools always available regardless of heuristic failure
- [ ] Bundle selection decisions are logged to memory's `tool_history` so we can eval and tune
- [ ] End-to-end regression test: a prompt that needs `scratchpad.query_data` picks it (not `recall`), and vice versa
- [ ] Eval suite passes with no regression in tool-selection accuracy
- [ ] Mid-conversation re-evaluation works (session that starts with file browsing and pivots to web research loads the right bundles at each turn)
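For the logging criterion, a sketch of recording each turn's bundle decision and computing one eval signal from it: how often `scratchpad` and `memory` are visible together. It uses the standard-library `sqlite3` module with an in-memory database; the table columns, the `bundle_selection` event name, and both helper functions are assumptions for illustration, since the real `tool_history` schema is defined in #606.

```python
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# Hypothetical columns for illustration; the real tool_history schema ships with #606.
conn.execute("CREATE TABLE tool_history (ts TEXT, session_id TEXT, event TEXT, payload TEXT)")


def log_bundle_selection(session_id: str, message: str, selected: set, reasons: dict) -> None:
    """Record which bundles were visible this turn, and why, for later eval and tuning."""
    payload = {"message": message, "selected": sorted(selected), "reasons": reasons}
    conn.execute(
        "INSERT INTO tool_history VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), session_id, "bundle_selection",
         json.dumps(payload)),
    )


def co_visibility_rate(a: str = "scratchpad", b: str = "memory") -> float:
    """Fraction of logged turns in which bundles `a` and `b` were both in the prompt."""
    rows = conn.execute(
        "SELECT payload FROM tool_history WHERE event = 'bundle_selection'"
    ).fetchall()
    if not rows:
        return 0.0
    both = sum({a, b} <= set(json.loads(p)["selected"]) for (p,) in rows)
    return both / len(rows)


# Illustrative turns, not measured data.
log_bundle_selection("s1", "What did I spend on groceries in March?",
                     {"core", "scratchpad"}, {"scratchpad": "keyword"})
log_bundle_selection("s1", "What did I learn about FTS5 last week?",
                     {"core", "scratchpad", "memory"},
                     {"scratchpad": "sticky after create_table", "memory": "keyword"})
print(co_visibility_rate())  # 0.5
```

A nonzero rate is not by itself a failure: the criteria allow both bundles when the message or session history justifies it, which is why the sketch stores a `reasons` map alongside the selection.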

## Dependencies

- **#495** (merge first — defines filesystem/scratchpad/browser tools)
- **#606** (merge second — ships `MemoryStore` + `tool_history` + `_execute_tool` substrate)
- **#688** (this issue operationalizes #688's plan against the concrete post-495/606 tool set)

## Out of scope

- The ToolLoader itself (that is #688 Phase 1 — this issue is the coordination tracker ensuring the collision is actually fixed, not left as a known gap)
- Migration of non-ChatAgent agents (CodeAgent, JiraAgent, etc.) — each owns its own tool set; handle when / if that becomes a problem

## References

- PR #495 final-state comment: https://github.com/amd/gaia/pull/495#issuecomment-4272297416
- PR #606 coordination comment: https://github.com/amd/gaia/pull/606#issuecomment-4272383985
- Parent design issue: #688

---

/cc @kovtcharov-amd @itomek-amd

Tool-registry scaling: resolve scratchpad/memory collision via #688 dynamic loading #800
