
MCP catalog exposure: integration tests and benchmarks#926

Merged
andreasronge merged 2 commits into main from claude/913-catalog-integration-tests-benchmarks on May 12, 2026

Conversation

@andreasronge
Owner

@andreasronge andreasronge commented May 12, 2026

Summary

  • Add 6 integration tests (spec §12) validating the full catalog exposure pipeline end-to-end
  • Add benchmark script (spec §13) measuring description size and rendering latency at 10/30/50/100/200 tools

Integration tests (catalog_integration_test.exs)

  1. Inline rendering in tools/list — small fleet under both thresholds renders inline catalog with tool details
  2. Lazy rendering in tools/list — large fleet over tool threshold renders lazy instructions with discovery builtins
  3. Unloaded upstreams in tools/list — initially unloaded fleet renders lazy instructions
  4. PTC-Lisp discovery flow — catalog/search-tools → catalog/describe-tool → tool/mcp-call pipeline
  5. Unloaded discovery flow — search server-level match → catalog/list-tools → catalog/describe-tool → tool/mcp-call
  6. Direct call in inline mode — tool/mcp-call works without catalog builtins
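The discovery flow exercised by tests 4–5 can be sketched roughly as follows. This is a minimal sketch, not the test file's actual code: `call_builtin` stands in for the real MCP request plumbing, and the query value is an invented example.

```elixir
# Rough sketch of the search -> describe -> call pipeline (tests 4-5).
# `call_builtin` is a hypothetical stand-in for the MCP request helper.
defmodule DiscoverySketch do
  def run(call_builtin) do
    # 1. search the catalog for candidate tools
    {:ok, [%{"tool" => tool} | _]} =
      call_builtin.("catalog/search-tools", %{"query" => "search"})

    # 2. fetch the full schema for the chosen tool
    {:ok, _detail} = call_builtin.("catalog/describe-tool", %{"tool" => tool})

    # 3. invoke the tool through the generic MCP call builtin
    call_builtin.("tool/mcp-call", %{"tool" => tool, "arguments" => %{}})
  end
end
```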

Benchmark script (catalog_bench.exs)

Measures rendered description size (chars) and rendering latency (µs) at 10, 30, 50, 100, and 200 tools for inline vs lazy modes. Outputs a markdown table with threshold analysis to justify/revise the default values (catalog_inline_max_chars=12000, catalog_inline_max_tools=40).

Run: mix run mcp_server/bench/catalog_bench.exs [--runs=N] [--out=PATH]
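The mode selection these thresholds govern can be sketched as below — a hypothetical condensation of the described behavior (inline only while both limits hold, otherwise lazy), not the actual implementation:

```elixir
# Sketch of the threshold check the benchmark is meant to justify.
# Inline mode applies only while BOTH limits hold; exceeding either
# falls back to lazy rendering with the discovery builtins.
defmodule CatalogModeSketch do
  @inline_max_chars 12_000
  @inline_max_tools 40

  def mode(rendered_chars, tool_count)
      when rendered_chars <= @inline_max_chars and tool_count <= @inline_max_tools,
      do: :inline

  def mode(_chars, _tools), do: :lazy
end
```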

Closes #913

Test plan

  • All 6 integration tests pass (mix test mcp_server/test/ptc_runner_mcp/catalog_integration_test.exs)
  • All existing mcp_server tests pass (1009 tests, 0 failures)
  • All root tests pass (5305 tests, 0 failures)
  • Benchmark script runs and produces correct output
  • Format, credo, and compile checks pass

Fix Automation State

Fix attempts: 2/3

Add 6 integration tests validating the full catalog exposure pipeline
(spec §12) and a benchmark script measuring description size and
rendering latency at various tool counts (spec §13).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Contributor

claude Bot commented May 12, 2026

PR Review: MCP catalog exposure — integration tests and benchmarks

Summary

Well-structured PR adding 6 integration tests and a benchmark script for the MCP catalog exposure feature. All 6 integration tests map directly to spec §12 requirements, and the benchmark covers the core measurable aspects of §13. Clean implementation with good test isolation.

What's Good

  • Complete spec coverage: All 6 required integration tests from spec §12 are present and correctly scoped — inline/lazy/unloaded rendering + discovery flows + direct call bypass.
  • Clear test structure: Each test scenario has a descriptive section header referencing the spec section, making it easy to trace requirements.
  • Proper isolation: Tests correctly use async: false, start their own Registry, and clean up via on_exit — no leakage between tests.
  • Benchmark is practical: Outputs a markdown table with threshold analysis at all 5 required tool counts (10, 30, 50, 100, 200), includes CLI options (--runs, --out), and measures both median and p99 latency.
  • No TODOs, FIXMEs, or skipped tests — clean addition.
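The isolation pattern noted above can be sketched as follows. Module, registry, and key names here are illustrative, not the test file's exact code:

```elixir
# Sketch of the test-isolation pattern: async: false because the tests
# mutate global state, a fresh Registry per test, and on_exit cleanup.
defmodule CatalogIntegrationSketch do
  use ExUnit.Case, async: false

  setup do
    # fresh registry per test, torn down automatically by the supervisor
    start_supervised!({Registry, keys: :unique, name: SketchRegistry})

    on_exit(fn ->
      # restore any global config the test touched (hypothetical key)
      :persistent_term.erase({:sketch, :catalog_config})
    end)

    :ok
  end
end
```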

Issues (Must Fix)

None.

Suggestions (Optional)

  1. Benchmark scope vs spec §13 — Nice-to-have

    • Observation: The benchmark covers 2 of 5 spec §13 items (description size, rendering latency). The remaining 3 (approximate token cost, first-shot program correctness, failure modes in lazy mode) are not measured.
    • Assessment: This was explicitly deferred per the issue review (requires real model access / API keys). The deferral is reasonable, but the benchmark script could note which §13 items it covers and which are deferred.
    • Recommendation: Consider adding a brief comment in the script noting the deferred items, or create a tracking issue if one doesn't exist.
  2. Helper duplication across catalog test files — Nice-to-have

    • Observation: make_tools, put_fake, and snapshot helpers appear in slightly different forms across catalog_description_test.exs, catalog_builtins_test.exs, and the new catalog_integration_test.exs.
    • Assessment: Each variant is tailored to its test file's needs (different handler signatures, metadata, snapshot strategies), so this isn't simple duplication. However, if the catalog test suite continues to grow, extracting a shared CatalogTestHelpers module could reduce boilerplate.
    • Recommendation: Out of scope for this PR. Worth a GitHub issue if catalog tests expand further.
  3. Unused impl field in benchmark entries — Cosmetic

    • Observation: make_entries/2 in the benchmark sets impl: PtcRunnerMcp.Upstream.Stdio, but the snapshot_entry type defines impl as optional and render_for_entries/2 never reads it.
    • Impact: None — the field is harmless. But removing it would reduce noise.

Security

No concerns. Tests and benchmarks only — no production code changes, no external calls, no credential handling.

Documentation

No updates needed. The PR adds tests and a benchmark script, both of which are self-documenting. The benchmark script includes usage comments and a --help flag.

Verdict

Approve. All 6 spec §12 integration tests are correctly implemented, the benchmark covers the core measurable aspects of §13, and code quality is solid. The optional suggestions are minor improvements that can be deferred.

@claude
Contributor

claude Bot commented May 12, 2026

@claude please fix:

  1. Unused impl field in benchmark entries: In make_entries/2 in the benchmark script, remove the impl: PtcRunnerMcp.Upstream.Stdio field from the map — it is never read by render_for_entries/2 and adds noise.
  2. Benchmark coverage note: Add a brief comment near the top of the benchmark script (after the module doc or before the run/1 function) noting which spec §13 items the benchmark covers (description size, rendering latency) and which are deferred (approximate token cost, first-shot program correctness, failure modes in lazy mode) with a note that those require real model/API access.

@claude
Contributor

claude Bot commented May 12, 2026

Auto-Triage Summary

| Issue | Decision | Action |
| --- | --- | --- |
| Unused impl field in benchmark make_entries/2 | FIX_NOW | In fix request above |
| Benchmark coverage note for deferred §13 items | FIX_NOW | In fix request above |
| Helper duplication across catalog test files | DEFER_ISSUE | Created #927 |

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@andreasronge
Owner Author

Fix Summary

Issue Fixed: Removed unused impl field from benchmark entries and documented §13 coverage scope.

  • make_entries/2: Removed impl: PtcRunnerMcp.Upstream.Stdio — the field was never read by render_for_entries/2 and added noise to the map.
  • Coverage comment: Added a §13 coverage block in the file header (after the spec reference) listing what this benchmark covers (description size, rendering latency) and what is deferred with model/API access (approximate token cost, first-shot program correctness, failure modes in lazy mode).

@claude
Contributor

claude Bot commented May 12, 2026

PR Review: MCP catalog exposure — integration tests and benchmarks

Summary

Solid PR adding 6 integration tests (spec §12) and a benchmark script (spec §13) for the MCP catalog exposure feature. All §12 integration test requirements are covered, the benchmark is well-structured with CLI options, and the code follows established codebase patterns. No issues found.

What's Good

  • Complete §12 coverage: All 6 required integration tests are present and correctly map to spec requirements: inline rendering (test 1), lazy rendering (test 2), unloaded upstreams (test 3), discovery flow (test 4), unloaded discovery flow (test 5), and direct call bypass (test 6).
  • Spec section headers: Each test block references its spec section (§ 12 integration test N), making requirement traceability trivial.
  • Correct test isolation: async: false (required since tests mutate global state via Registry/persistent_term), fresh Registry per test, comprehensive on_exit cleanup for Registry, Limits, AggregatorConfig, CatalogConfig, and Catalog.frozen.
  • Benchmark design: Measures description size and rendering latency at all 5 spec-required tool counts (10/30/50/100/200). CLI options (--runs, --out, --help) make it usable in CI and manual runs. Threshold analysis output ties results back to the default config values.
  • Explicit deferral documentation: The benchmark header (lines 10–12) clearly documents which §13 items are deferred (token cost, first-shot correctness, failure modes) and why (require real model/API access). This is the right call — trying to measure these without API keys would produce misleading results.
  • No TODOs, FIXMEs, or skipped tests — clean addition.

Issues (Must Fix)

None.

Suggestions (Optional)

  1. stop_existing_registry helper duplication — Out of scope

    • Observation: This identical helper is now copy-pasted across 15 test files in mcp_server/test/. This PR follows the established codebase pattern, so it's not a regression, but it's the most-duplicated test helper in the project.
    • Recommendation: Out of scope for this PR. Consider a future McpTestHelpers module extraction if the pattern continues to spread.
  2. Benchmark §13 tracking — Nice-to-have

    • Observation: The deferred §13 items (token cost, first-shot correctness, failure modes) don't have a tracking issue. If these are expected to be measured later, a lightweight issue would prevent them from being forgotten.
    • Recommendation: Create a tracking issue for the deferred §13 benchmark items, or note in the spec that they're deferred indefinitely.

Security

No concerns. Test and benchmark code only — no production changes, no external calls, no credential handling.

Documentation

No updates needed. The benchmark script is self-documenting with header comments and --help. Integration tests reference spec sections directly.

Verdict

Approve. All 6 spec §12 integration tests are correctly implemented, the benchmark covers the measurable §13 items with clear documentation of deferrals, and code quality matches established codebase patterns. The two suggestions are minor housekeeping items that can be deferred.

@claude
Contributor

claude Bot commented May 12, 2026

Auto-Triage Summary

| Issue | Decision | Action |
| --- | --- | --- |
| stop_existing_registry helper duplicated across 15 test files | DEFER_ISSUE | Created #928 |
| Deferred §13 benchmark items (token cost, first-shot correctness, failure modes) | DEFER_ISSUE | Created #929 |

No FIX_NOW items — added ready-to-merge label.

@andreasronge andreasronge enabled auto-merge (squash) May 12, 2026 21:34
@andreasronge andreasronge merged commit fa2ca45 into main May 12, 2026
8 checks passed
@andreasronge andreasronge deleted the claude/913-catalog-integration-tests-benchmarks branch May 12, 2026 21:34

Labels

ready-to-merge PR is ready to be merged

