
Cross-model schema sanitizer + tool discovery layer (102-tool load fix, 9.3K→1.8K footprint)#59

Open
drewburchfield wants to merge 5 commits into dev from feature/nas-1307-schema-sanitizer
drewburchfield commented Jun 17, 2026

Owner

Summary

Cross-model compatibility + tool-surface footprint, in two independent-but-stacked pieces, proven by an eval across model families. Design: docs/superpowers/specs/2026-06-17-tool-discovery-layer-design.md. Closes NAS-1307, NAS-1305, NAS-1308; absorbs NAS-1302.

Why (proven by the eval, not assumed)

The "Gemini can't handle many tools" framing was wrong. Gemini rejected the old 102-tool surface with a schema 400 on a single object-level anyOf (structuredConversationFilter); one bad schema kills the whole tools/list. The 102→55 consolidation's apparent "Gemini win" was luck (it deleted the offending tool). The real cross-model issues are schema dialect and tool-definition token cost (55 schemas = 9.3K tokens/request).

A. Cross-model schema sanitizer (NAS-1307)

Pure sanitizeJsonSchema() over every emitted inputSchema: strips object-level anyOf/oneOf/allOf (Gemini), adds additionalProperties:false (OpenAI strict), converts number→integer (matches Zod .int()), inline-derefs $defs. Plus JSON-string arg coercion at dispatch (weak models stringify arrays/objects → -32602). Absorbs NAS-1302.
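For readers who have not opened the diff, the four sanitizer rules can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's `sanitizeJsonSchema()`: the name `sanitizeSketch`, the reading of "object-level" as "a combinator on a node with `type: "object"`", and the assumption that references are non-recursive `#/$defs/...` pointers are all assumptions of this sketch.

```typescript
// Hypothetical sketch of the four sanitizer rules; not the PR's implementation.
type Schema = { [key: string]: unknown };

const COMBINATORS = ["anyOf", "oneOf", "allOf"];

function isSchema(value: unknown): value is Schema {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function sanitizeSketch(node: Schema, defs: Schema = {}): Schema {
  // Local $defs extend the enclosing scope so "#/$defs/Name" can be inlined.
  const localDefs = node.$defs;
  const scope: Schema = isSchema(localDefs) ? { ...defs, ...localDefs } : defs;

  // Inline-deref: replace a $ref node with a sanitized copy of its target.
  // Assumes non-recursive references that use the "#/$defs/" prefix.
  const ref = node.$ref;
  if (typeof ref === "string") {
    const target = scope[ref.replace("#/$defs/", "")];
    return isSchema(target) ? sanitizeSketch(target, scope) : {};
  }

  const out: Schema = {};
  for (const [key, value] of Object.entries(node)) {
    if (key === "$defs") continue; // already inlined above
    if (node.type === "object" && COMBINATORS.includes(key)) continue; // Gemini 400
    if (key === "properties" && isSchema(value)) {
      // Property names are data, not keywords, so only the values are sanitized.
      const props: Schema = {};
      for (const [name, child] of Object.entries(value)) {
        props[name] = isSchema(child) ? sanitizeSketch(child, scope) : child;
      }
      out.properties = props;
    } else if (key === "items" && isSchema(value)) {
      out.items = sanitizeSketch(value, scope);
    } else {
      out[key] = value;
    }
  }

  if (out.type === "number") out.type = "integer"; // mirrors Zod .int()
  if (out.type === "object" && !("additionalProperties" in out)) {
    out.additionalProperties = false; // OpenAI strict; an explicit value is left alone
  }
  return out;
}
```

The function returns a new object at every level, which is what makes it safe to apply once per `tools/list` response without touching the registry.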

B. Discovery layer (NAS-1305)

Default tools/list = ~10 tools (1.8K tokens, 0.9% of context): 7 core read tools + search_tools / get_tool_schema / call_tool. The other ~46 stay reachable:

  • Response-bootstrapped successor hints — hub tools append the logically-next tail tools' full sanitized schemas in _meta.suggestedTools (successor map grounded in the API entity graph; core tools filtered out), so the model's next call is correctly typed with no search step.
  • search_tools (keyword + 12-group Help Scout synonym map) + get_tool_schema as the escape hatch.

Discovery is the opinionated default (no mode flag); one escape hatch HELPSCOUT_EXPOSE_ALL_TOOLS=true returns the flat sanitized 55. The 55-case dispatch was refactored into dispatchTool() so call_tool re-enters cleanly.
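The single-branch surface selection described above might look like the sketch below. `ToolDef`, `listToolsSketch`, and the idea of passing the flag in as a boolean are assumptions made for illustration; in the PR the flag is read from `HELPSCOUT_EXPOSE_ALL_TOOLS`.

```typescript
// Hypothetical sketch of the default-vs-expose-all branch; names are illustrative.
interface ToolDef {
  name: string;
  inputSchema: { [key: string]: unknown };
}

function listToolsSketch(
  allTools: ToolDef[],   // the full sanitized catalog
  coreNames: string[],   // names of the always-listed core read tools
  metaTools: ToolDef[],  // search_tools, get_tool_schema, call_tool
  exposeAll: boolean,    // stands in for HELPSCOUT_EXPOSE_ALL_TOOLS === "true"
): ToolDef[] {
  if (exposeAll) return allTools; // escape hatch: flat catalog, no meta-tools
  const core = allTools.filter((tool) => coreNames.includes(tool.name));
  return [...core, ...metaTools];
}
```

Keeping the flag as a parameter leaves the function pure, so both surfaces can be asserted in a unit test without touching the environment.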

Proof (evals/, NAS-1308)

Per-model LOAD check — does each surface load without a provider 400?

| Surface | gemini-3-flash | gemini-3.1-pro-low | glm-4.7 | gpt |
|---|---|---|---|---|
| control-102 (unsanitized) | FAIL (anyOf) | FAIL | LOAD | n/a (quota) |
| control-102 (sanitized) | LOAD | LOAD | LOAD | n/a |
| discovery-10 | LOAD | LOAD | LOAD | n/a |

→ the sanitizer fixes the load bug; discovery-10 loads on every reachable model.

Agentic tail-walk — can a model drive the 10-tool surface to a non-core tool? 24/24 (100%) across gemini-3-flash / gemini-3.1-pro-low / glm-4.7, avg ~3.4 turns (search → schema → call), with self-recovery from validation errors. gpt quota-blocked (untested; the additionalProperties fix targets its exact requirement).

Testing

type-check, lint, build, npm test (417 pass — only the pre-existing local-.env package-scripts test fails, passes in CI), MCPB validation (18). Eval harnesses live-verified against the proxy + Help Scout API.

Notes

Default surface is now 10 tools; full catalog reachable via meta-tools or HELPSCOUT_EXPOSE_ALL_TOOLS. Static mcp.json/manifest.json document the full 55 + 3 meta-tools. gpt cross-model load confirmation pending quota reset.



Pure sanitizeJsonSchema() pass over every emitted inputSchema (applied in
listTools()): strips object-level anyOf/oneOf/allOf (Gemini 400), adds
additionalProperties:false (OpenAI strict), converts number->integer (API has
no floats; matches Zod .int()), inline-derefs $defs/$ref. Plus
coerceJsonStringArgs() at callTool() dispatch so weak/non-Claude models that
stringify array/object args don't 400 (-32602).
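A minimal sketch of the argument coercion step, assuming it only looks at top-level properties whose declared type is `array` or `object`. The name `coerceArgsSketch` and that scoping are assumptions; the PR's `coerceJsonStringArgs()` may handle more cases.

```typescript
// Hypothetical sketch of JSON-string argument coercion; not the PR's helper.
type ArgBag = { [key: string]: unknown };

function coerceArgsSketch(args: ArgBag, schema: ArgBag): ArgBag {
  const props = schema.properties;
  if (typeof props !== "object" || props === null) return args;

  const out: ArgBag = { ...args }; // never mutate the caller's arguments
  for (const [name, value] of Object.entries(args)) {
    if (typeof value !== "string") continue;
    const expected = (props as { [k: string]: ArgBag | undefined })[name]?.type;
    if (expected !== "array" && expected !== "object") continue;
    try {
      const parsed: unknown = JSON.parse(value);
      const matches =
        expected === "array"
          ? Array.isArray(parsed)
          : typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
      if (matches) out[name] = parsed; // only accept a parse of the declared shape
    } catch {
      // Not valid JSON: leave the string so normal validation reports it.
    }
  }
  return out;
}
```

Strings declared as strings are never parsed, so a legitimate query such as `'["x"]'` passes through untouched.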

Verified runtime: listTools() emits 0 combinators, additionalProperties on all
55 object schemas, all-integer. Live: sanitized 55-tool surface loads+calls on
gemini-3-flash, gemini-3.1-pro-low, glm-4.7 (gpt quota-blocked this run).
Absorbs NAS-1302. 13 new sanitizer tests; 397 pass.

Proven earlier: stripping the anyOf made the full 102-tool surface load on
Gemini, confirming the root cause was schema dialect, not tool count.
…ools

Default tools/list = 7 core read tools + search_tools/get_tool_schema/call_tool
(~1.8K tokens vs 9.3K). The other 46 tools stay reachable: search_tools (keyword
scorer + 12-group Help Scout synonym map, camelCase-aware), get_tool_schema
(returns sanitized schemas), call_tool (dispatches into the existing switch with
arg coercion, rejects unknown names + meta-tool recursion). Refactored the 55-case
switch into private dispatchTool() so call_tool re-enters cleanly; buildToolDefs/
allToolDefs() is the internal registry. HELPSCOUT_EXPOSE_ALL_TOOLS=true returns the
flat sanitized 55 (single escape-hatch branch, no other dual-mode logic).

Verified: default=10, expose-all=55; search('happiness report')->getHappinessReport,
mailbox-synonym->getInbox, call_tool(getServerTime) returns live time. 13 new
discovery tests; 410 pass. Static mcp.json/manifest document full catalog + meta.
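The keyword scorer can be pictured with the sketch below. The weights, the `ToolEntry` shape, and the one-entry synonym table are assumptions for illustration (only the mailbox→inbox pairing is taken from the verification note above); the real map has 12 groups.

```typescript
// Hypothetical sketch of a camelCase-aware keyword scorer with synonym expansion.
interface ToolEntry {
  name: string;
  description: string;
}

// Illustrative single group; the PR's synonym map is larger.
const SYNONYM_SKETCH: { [term: string]: string[] } = { mailbox: ["inbox"] };

function tokenize(text: string): string[] {
  return text
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // getInbox -> "get Inbox"
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

function searchToolsSketch(query: string, tools: ToolEntry[], limit = 5): ToolEntry[] {
  const terms: string[] = [];
  for (const token of tokenize(query)) {
    for (const term of [token, ...(SYNONYM_SKETCH[token] ?? [])]) {
      if (!terms.includes(term)) terms.push(term);
    }
  }
  return tools
    .map((tool) => {
      const nameTokens = tokenize(tool.name);
      const descTokens = tokenize(tool.description);
      let score = 0;
      for (const term of terms) {
        if (nameTokens.includes(term)) score += 2; // name hits outrank description hits
        else if (descTokens.includes(term)) score += 1;
      }
      return { tool, score };
    })
    .filter((ranked) => ranked.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((ranked) => ranked.tool);
}
```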
…dTools)

Hub tools append their logically-next TAIL tools' full sanitized schemas to
result._meta.suggestedTools, so the model can call_tool them with correct args
and no search step. SUCCESSOR_MAP grounded in the API entity graph; successorHintsFor()
filters out core + meta tools (only purely-additive tail surfaces), caps at 3.
Attached in dispatchTool so it rides along through call_tool re-entry; skipped for
meta-tool results and under HELPSCOUT_EXPOSE_ALL_TOOLS; total (try/catch, never
mutates payload).
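As a rough illustration of the hint attachment, the sketch below filters a successor list, caps it, and returns a copy of the result with `_meta.suggestedTools` set. The map entry, the `ALWAYS_LISTED` list, and all function names are assumptions; only the `searchConversations` example and the cap of 3 come from the commit message.

```typescript
// Hypothetical sketch of successor hints; the map contents are illustrative.
interface HintDef {
  name: string;
  inputSchema: { [key: string]: unknown };
}
interface ToolResult {
  content: unknown[];
  _meta?: { suggestedTools?: HintDef[] };
}

const SUCCESSOR_SKETCH: { [hub: string]: string[] } = {
  searchConversations: ["getConversation", "getThreads", "getOriginalSource", "getAttachment"],
};
// Core and meta tools are already in tools/list, so hinting them adds nothing.
const ALWAYS_LISTED = ["getConversation", "getThreads", "search_tools", "get_tool_schema", "call_tool"];

function successorHintsSketch(tool: string, registry: { [name: string]: HintDef }, cap = 3): HintDef[] {
  return (SUCCESSOR_SKETCH[tool] ?? [])
    .filter((name) => !ALWAYS_LISTED.includes(name))
    .map((name) => registry[name])
    .filter((def): def is HintDef => def !== undefined)
    .slice(0, cap);
}

function withHintsSketch(result: ToolResult, tool: string, registry: { [name: string]: HintDef }): ToolResult {
  try {
    const hints = successorHintsSketch(tool, registry);
    if (hints.length === 0) return result;
    // Copy instead of mutating so the original payload is never altered.
    return { ...result, _meta: { ...result._meta, suggestedTools: hints } };
  } catch {
    return result; // hints are best-effort and must never fail the call
  }
}
```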

Verified live: searchConversations -> [getOriginalSource, getAttachment] (not the
core getConversation/getThreads); getServerTime (terminal) -> none. 7 new tests; 417 pass.
…ity metric)

evals/load-check.mjs runs a surface x model matrix asking 'does this tool surface
load without a provider schema-400?' across Gemini/GLM/OpenAI. Proves end-to-end
with production code: control-102 unsanitized FAILS on both Gemini models (the anyOf 400),
control-102 SANITIZED loads -> the sanitizer fixes the cross-model load bug; and
discovery-10 loads on every reachable model. gpt currently n/a (quota cooldown, not
a schema failure - detection distinguishes quota/config from real 400s).
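One way to picture the quota-versus-schema distinction is a status-based classifier like the sketch below. The mapping is an assumption made for illustration and may not match what evals/load-check.mjs inspects; the verdict labels are the ones used in the results table.

```typescript
// Hypothetical sketch: classify a provider response for the load matrix.
type Verdict = "LOAD" | "FAIL" | "n/a";

function classifyProbeSketch(status: number): Verdict {
  if (status >= 200 && status < 300) return "LOAD"; // provider accepted the tool surface
  if (status === 400) return "FAIL";                // treated here as a schema rejection
  return "n/a"; // auth, quota (e.g. 429), or server errors say nothing about the schema
}
```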
…urface

evals/tail-walk.mjs runs a real multi-turn tool-use loop: model gets the 10-tool
discovery surface + a task needing a NON-core tool, and must reach it via
search_tools -> get_tool_schema -> call_tool (executed live against the handler).
Result: gemini-3-flash 8/8, gemini-3.1-pro-low 8/8, glm-4.7 8/8 = 24/24 (100%)
reached the right tail tool with correct args, avg ~3.4 turns. gpt quota-blocked
(n/a). Self-recovery observed on report date-range validation errors. Proves the
discovery surface works across model families.

devin-ai-integration Bot left a comment


Devin Review found 2 potential issues.


Comment thread src/tools/index.ts
Comment on lines +1500 to +1503
arguments: {
type: 'object',
description: "The tool's input arguments object, matching its loaded input schema.",
},


🔴 Sanitizer adds additionalProperties: false to call_tool.arguments, blocking all inner tool arguments for strict-schema models

The call_tool meta-tool's arguments property is defined as { type: 'object', description: '...' } with no properties key (src/tools/index.ts:1500-1502), because it's an open passthrough bag for arbitrary inner-tool arguments. However, the sanitizer at src/utils/schema-sanitizer.ts:127-129 unconditionally adds additionalProperties: false to every node where type === 'object', even those without a properties map. After sanitization the advertised schema becomes { type: 'object', additionalProperties: false } — which per JSON Schema means "accept zero properties." Models or clients that enforce strict schema validation (e.g. OpenAI structured outputs, some Gemini configurations) would either refuse to populate arguments or send an empty object, making the entire discovery layer's ~48 non-core tools unreachable. The existing test at src/__tests__/discovery.test.ts:100-101 only asserts props.arguments.type === 'object' and does not check additionalProperties, so it doesn't catch this.

Suggested change
- arguments: {
-   type: 'object',
-   description: "The tool's input arguments object, matching its loaded input schema.",
- },
+ arguments: {
+   type: 'object',
+   additionalProperties: true,
+   description: "The tool's input arguments object, matching its loaded input schema.",
+ },
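For the suggested change to survive sanitization, the sanitizer step itself would also need to respect an explicit `additionalProperties` value. A minimal sketch of such a guard, together with the kind of assertion the existing test lacks, is shown below; both function names are hypothetical and the guard conditions are one possible policy, not the PR's code.

```typescript
// Hypothetical sketch: only close objects that declare properties and have not
// already chosen an additionalProperties value.
type SchemaNode = { [key: string]: unknown };

function closeObjectSketch(node: SchemaNode): SchemaNode {
  const declaresProps = typeof node.properties === "object" && node.properties !== null;
  if (node.type !== "object" || !declaresProps || "additionalProperties" in node) {
    return node; // open passthrough bags and explicit choices are left untouched
  }
  return { ...node, additionalProperties: false };
}

// Regression-style check: an argument bag must still admit arbitrary keys.
function acceptsArbitraryKeys(node: SchemaNode): boolean {
  return node.type === "object" && node.additionalProperties !== false;
}
```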

Comment thread src/tools/index.ts
Comment on lines +2029 to +2030
const coerced = coerceJsonStringArgs(innerArgs, tool.inputSchema ?? {}) as object;
return this.dispatchTool(name, coerced);


🚩 call_tool bypasses constraint validation and call history tracking

When a tool is invoked through the call_tool meta-tool, it enters via callToolMeta → dispatchTool → dispatchToolInner, which skips callTool()'s HelpScoutAPIConstraints.validateToolCall check and does not record the inner tool name in this.callHistory (src/tools/index.ts:1639). For example, searchConversations invoked via call_tool would skip inbox-ID format validation and date-format checks (src/utils/api-constraints.ts:82-97). The constraints are advisory (suggestions/guidance, not security gates), and the Help Scout API itself would reject malformed requests, so this is not a correctness bug. However, if the constraint system is extended with harder requirements in the future, this bypass path could become a real issue.
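One possible way to close the gap is to route both entry points through a single checked step, sketched below. Everything here is hypothetical: `makeDispatcherSketch`, the validator signature, and the choice to collect warnings rather than throw (chosen because the constraints are described as advisory).

```typescript
// Hypothetical sketch: direct calls and call_tool share validation and history.
type CallArgs = { [key: string]: unknown };
type Validate = (name: string, args: CallArgs) => string[]; // advisory messages
type Run = (name: string, args: CallArgs) => string;

function makeDispatcherSketch(validate: Validate, run: Run) {
  const callHistory: string[] = [];
  const warnings: string[] = [];

  function checkedDispatch(name: string, args: CallArgs): string {
    for (const warning of validate(name, args)) warnings.push(warning);
    callHistory.push(name); // the inner tool name is recorded, not "call_tool"
    return run(name, args);
  }

  function callTool(name: string, args: CallArgs): string {
    if (name !== "call_tool") return checkedDispatch(name, args);
    const inner = String(args.name);
    if (inner === "call_tool") throw new Error("call_tool cannot call itself");
    const innerArgs = args.arguments;
    const isBag = typeof innerArgs === "object" && innerArgs !== null && !Array.isArray(innerArgs);
    return checkedDispatch(inner, isBag ? (innerArgs as CallArgs) : {});
  }

  return { callTool, callHistory, warnings };
}
```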
