
MCP catalog exposure: integration tests and benchmarks#926

Merged
andreasronge merged 2 commits into main from claude/913-catalog-integration-tests-benchmarks on May 12, 2026

Conversation

@andreasronge
Owner

@andreasronge andreasronge commented May 12, 2026

Summary

  • Add 6 integration tests (spec §12) validating the full catalog exposure pipeline end-to-end
  • Add benchmark script (spec §13) measuring description size and rendering latency at 10/30/50/100/200 tools

Integration tests (catalog_integration_test.exs)

  1. Inline rendering in tools/list — small fleet under both thresholds renders inline catalog with tool details
  2. Lazy rendering in tools/list — large fleet over tool threshold renders lazy instructions with discovery builtins
  3. Unloaded upstreams in tools/list — initially unloaded fleet renders lazy instructions
  4. PTC-Lisp discovery flow — catalog/search-tools → catalog/describe-tool → tool/mcp-call pipeline
  5. Unloaded discovery flow — search server-level match → catalog/list-tools → catalog/describe-tool → tool/mcp-call
  6. Direct call in inline mode — tool/mcp-call works without catalog builtins
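The discovery flow exercised by tests 4–5 can be sketched roughly as follows. This is a minimal sketch, not the test file's actual code: `call_builtin` stands in for the real MCP request plumbing, and the query value is an invented example.

```elixir
# Rough sketch of the search -> describe -> call pipeline (tests 4-5).
# `call_builtin` is a hypothetical stand-in for the MCP request helper.
defmodule DiscoverySketch do
  def run(call_builtin) do
    # 1. search the catalog for candidate tools
    {:ok, [%{"tool" => tool} | _]} =
      call_builtin.("catalog/search-tools", %{"query" => "search"})

    # 2. fetch the full schema for the chosen tool
    {:ok, _detail} = call_builtin.("catalog/describe-tool", %{"tool" => tool})

    # 3. invoke the tool through the generic MCP call builtin
    call_builtin.("tool/mcp-call", %{"tool" => tool, "arguments" => %{}})
  end
end
```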

Benchmark script (catalog_bench.exs)

Measures rendered description size (chars) and rendering latency (µs) at 10, 30, 50, 100, and 200 tools for inline vs lazy modes. Outputs a markdown table with threshold analysis to justify/revise the default values (catalog_inline_max_chars=12000, catalog_inline_max_tools=40).

Run: mix run mcp_server/bench/catalog_bench.exs [--runs=N] [--out=PATH]
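The mode selection these thresholds govern can be sketched as below — a hypothetical condensation of the described behavior (inline only while both limits hold, otherwise lazy), not the actual implementation:

```elixir
# Sketch of the threshold check the benchmark is meant to justify.
# Inline mode applies only while BOTH limits hold; exceeding either
# falls back to lazy rendering with the discovery builtins.
defmodule CatalogModeSketch do
  @inline_max_chars 12_000
  @inline_max_tools 40

  def mode(rendered_chars, tool_count)
      when rendered_chars <= @inline_max_chars and tool_count <= @inline_max_tools,
      do: :inline

  def mode(_chars, _tools), do: :lazy
end
```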

Closes #913

Test plan

  • All 6 integration tests pass (mix test mcp_server/test/ptc_runner_mcp/catalog_integration_test.exs)
  • All existing mcp_server tests pass (1009 tests, 0 failures)
  • All root tests pass (5305 tests, 0 failures)
  • Benchmark script runs and produces correct output
  • Format, credo, and compile checks pass

Fix Automation State

Fix attempts: 2/3

Add 6 integration tests validating the full catalog exposure pipeline
(spec §12) and a benchmark script measuring description size and
rendering latency at various tool counts (spec §13).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Contributor

claude Bot commented May 12, 2026

PR Review: MCP catalog exposure — integration tests and benchmarks

Summary

Well-structured PR adding 6 integration tests and a benchmark script for the MCP catalog exposure feature. All 6 integration tests map directly to spec §12 requirements, and the benchmark covers the core measurable aspects of §13. Clean implementation with good test isolation.

What's Good

  • Complete spec coverage: All 6 required integration tests from spec §12 are present and correctly scoped — inline/lazy/unloaded rendering + discovery flows + direct call bypass.
  • Clear test structure: Each test scenario has a descriptive section header referencing the spec section, making it easy to trace requirements.
  • Proper isolation: Tests correctly use async: false, start their own Registry, and clean up via on_exit — no leakage between tests.
  • Benchmark is practical: Outputs a markdown table with threshold analysis at all 5 required tool counts (10, 30, 50, 100, 200), includes CLI options (--runs, --out), and measures both median and p99 latency.
  • No TODOs, FIXMEs, or skipped tests — clean addition.
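The isolation pattern noted above can be sketched as follows. Module, registry, and key names here are illustrative, not the test file's exact code:

```elixir
# Sketch of the test-isolation pattern: async: false because the tests
# mutate global state, a fresh Registry per test, and on_exit cleanup.
defmodule CatalogIntegrationSketch do
  use ExUnit.Case, async: false

  setup do
    # fresh registry per test, torn down automatically by the supervisor
    start_supervised!({Registry, keys: :unique, name: SketchRegistry})

    on_exit(fn ->
      # restore any global config the test touched (hypothetical key)
      :persistent_term.erase({:sketch, :catalog_config})
    end)

    :ok
  end
end
```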

Issues (Must Fix)

None.

Suggestions (Optional)

  1. Benchmark scope vs spec §13 — Nice-to-have

    • Observation: The benchmark covers 2 of 5 spec §13 items (description size, rendering latency). The remaining 3 (approximate token cost, first-shot program correctness, failure modes in lazy mode) are not measured.
    • Assessment: This was explicitly deferred per the issue review (requires real model access / API keys). The deferral is reasonable, but the benchmark script could note which §13 items it covers and which are deferred.
    • Recommendation: Consider adding a brief comment in the script noting the deferred items, or create a tracking issue if one doesn't exist.
  2. Helper duplication across catalog test files — Nice-to-have

    • Observation: make_tools, put_fake, and snapshot helpers appear in slightly different forms across catalog_description_test.exs, catalog_builtins_test.exs, and the new catalog_integration_test.exs.
    • Assessment: Each variant is tailored to its test file's needs (different handler signatures, metadata, snapshot strategies), so this isn't simple duplication. However, if the catalog test suite continues to grow, extracting a shared CatalogTestHelpers module could reduce boilerplate.
    • Recommendation: Out of scope for this PR. Worth a GitHub issue if catalog tests expand further.
  3. Unused impl field in benchmark entries — Cosmetic

    • Observation: make_entries/2 in the benchmark sets impl: PtcRunnerMcp.Upstream.Stdio, but the snapshot_entry type defines impl as optional and render_for_entries/2 never reads it.
    • Impact: None — the field is harmless. But removing it would reduce noise.

Security

No concerns. Tests and benchmarks only — no production code changes, no external calls, no credential handling.

Documentation

No updates needed. The PR adds tests and a benchmark script, both of which are self-documenting. The benchmark script includes usage comments and a --help flag.

Verdict

Approve. All 6 spec §12 integration tests are correctly implemented, the benchmark covers the core measurable aspects of §13, and code quality is solid. The optional suggestions are minor improvements that can be deferred.

@claude
Contributor

claude Bot commented May 12, 2026

@claude please fix:

  1. Unused impl field in benchmark entries: In make_entries/2 in the benchmark script, remove the impl: PtcRunnerMcp.Upstream.Stdio field from the map — it is never read by render_for_entries/2 and adds noise.
  2. Benchmark coverage note: Add a brief comment near the top of the benchmark script (after the module doc or before the run/1 function) noting which spec §13 items the benchmark covers (description size, rendering latency) and which are deferred (approximate token cost, first-shot program correctness, failure modes in lazy mode) with a note that those require real model/API access.

@claude
Contributor

claude Bot commented May 12, 2026

Auto-Triage Summary

| Issue | Decision | Action |
| --- | --- | --- |
| Unused impl field in benchmark make_entries/2 | FIX_NOW | In fix request above |
| Benchmark coverage note for deferred §13 items | FIX_NOW | In fix request above |
| Helper duplication across catalog test files | DEFER_ISSUE | Created #927 |

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@andreasronge
Owner Author

Fix Summary

Issue Fixed: Removed unused impl field from benchmark entries and documented §13 coverage scope.

  • make_entries/2: Removed impl: PtcRunnerMcp.Upstream.Stdio — the field was never read by render_for_entries/2 and added noise to the map.
  • Coverage comment: Added a §13 coverage block in the file header (after the spec reference) listing what this benchmark covers (description size, rendering latency) and what is deferred with model/API access (approximate token cost, first-shot program correctness, failure modes in lazy mode).

@claude
Contributor

claude Bot commented May 12, 2026

PR Review: MCP catalog exposure — integration tests and benchmarks

Summary

Solid PR adding 6 integration tests (spec §12) and a benchmark script (spec §13) for the MCP catalog exposure feature. All §12 integration test requirements are covered, the benchmark is well-structured with CLI options, and the code follows established codebase patterns. No issues found.

What's Good

  • Complete §12 coverage: All 6 required integration tests are present and correctly map to spec requirements: inline rendering (test 1), lazy rendering (test 2), unloaded upstreams (test 3), discovery flow (test 4), unloaded discovery flow (test 5), and direct call bypass (test 6).
  • Spec section headers: Each test block references its spec section (§ 12 integration test N), making requirement traceability trivial.
  • Correct test isolation: async: false (required since tests mutate global state via Registry/persistent_term), fresh Registry per test, comprehensive on_exit cleanup for Registry, Limits, AggregatorConfig, CatalogConfig, and Catalog.frozen.
  • Benchmark design: Measures description size and rendering latency at all 5 spec-required tool counts (10/30/50/100/200). CLI options (--runs, --out, --help) make it usable in CI and manual runs. Threshold analysis output ties results back to the default config values.
  • Explicit deferral documentation: The benchmark header (lines 10–12) clearly documents which §13 items are deferred (token cost, first-shot correctness, failure modes) and why (require real model/API access). This is the right call — trying to measure these without API keys would produce misleading results.
  • No TODOs, FIXMEs, or skipped tests — clean addition.

Issues (Must Fix)

None.

Suggestions (Optional)

  1. stop_existing_registry helper duplication — Out of scope

    • Observation: This identical helper is now copy-pasted across 15 test files in mcp_server/test/. This PR follows the established codebase pattern, so it's not a regression, but it's the most-duplicated test helper in the project.
    • Recommendation: Out of scope for this PR. Consider a future McpTestHelpers module extraction if the pattern continues to spread.
  2. Benchmark §13 tracking — Nice-to-have

    • Observation: The deferred §13 items (token cost, first-shot correctness, failure modes) don't have a tracking issue. If these are expected to be measured later, a lightweight issue would prevent them from being forgotten.
    • Recommendation: Create a tracking issue for the deferred §13 benchmark items, or note in the spec that they're deferred indefinitely.

Security

No concerns. Test and benchmark code only — no production changes, no external calls, no credential handling.

Documentation

No updates needed. The benchmark script is self-documenting with header comments and --help. Integration tests reference spec sections directly.

Verdict

Approve. All 6 spec §12 integration tests are correctly implemented, the benchmark covers the measurable §13 items with clear documentation of deferrals, and code quality matches established codebase patterns. The two suggestions are minor housekeeping items that can be deferred.

@claude
Contributor

claude Bot commented May 12, 2026

Auto-Triage Summary

| Issue | Decision | Action |
| --- | --- | --- |
| stop_existing_registry helper duplicated across 15 test files | DEFER_ISSUE | Created #928 |
| Deferred §13 benchmark items (token cost, first-shot correctness, failure modes) | DEFER_ISSUE | Created #929 |

No FIX_NOW items — added ready-to-merge label.

@andreasronge andreasronge enabled auto-merge (squash) May 12, 2026 21:34
@andreasronge andreasronge merged commit fa2ca45 into main May 12, 2026
8 checks passed
@andreasronge andreasronge deleted the claude/913-catalog-integration-tests-benchmarks branch May 12, 2026 21:34

Labels

ready-to-merge PR is ready to be merged

