WIP: feat(mcp): expand tool surface, add feature flags, FastMCP transforms#66

Open
Jmevorach wants to merge 5 commits into main from feat/mcp-expanded-tools-and-feature-flags

The MCP server (mcp/run_mcp.py) gains a substantially larger tool and resource surface, a documented feature-flag taxonomy, FastMCP Tasks support for long-running operations, FastMCP Context migration with audit-log enrichment, examples and docs discovery, live-state resources, a container image registry, and FastMCP transforms (Resources As Tools, BM25 / Regex / Code Mode catalog replacement).

Tool surface (90 tools by default, up to 111 with feature flags):

  • New 'safe' read-only tools across queue, templates, webhooks, dag, nodepools, analytics, config, plus stacks inspection (stack_diff / stack_outputs / stack_synth / valkey_status / aurora_status) and storage (files_get / files_access_points).
  • New 'low-risk' mutating tools (queue_submit, templates_create / templates_run, webhooks_create, dag_run, FSx / Valkey / Aurora / Analytics enable/disable, analytics_user_add, nodepools_create_odcr).
  • New gated tools — every infrastructure / capacity / model-upload / destructive / image-publish operation sits behind a per-flag environment variable, plus the umbrella GCO_ENABLE_ALL_TOOLS.
  • Existing 'delete_job' and 'delete_inference' moved under GCO_ENABLE_DESTRUCTIVE_OPERATIONS (intentional backward-compat break; documented in mcp/README.md).

Feature flags (mcp/feature_flags.py):

  • GCO_ENABLE_ALL_TOOLS — umbrella; overrides per-tool flags
  • GCO_ENABLE_CAPACITY_PURCHASE — gates reserve_capacity (existing)
  • GCO_ENABLE_MODEL_UPLOAD — gates models_upload
  • GCO_ENABLE_IMAGE_PUBLISH — gates images_build / images_push
  • GCO_ENABLE_INFRASTRUCTURE_DEPLOY — gates deploy_stack / deploy_all / bootstrap_cdk
  • GCO_ENABLE_INFRASTRUCTURE_DESTROY — gates destroy_stack / destroy_all
  • GCO_ENABLE_DESTRUCTIVE_OPERATIONS — gates every destructive tool (delete_job / delete_inference / delete_template / delete_webhook / delete_model / delete_nodepool / analytics_user_remove / cancel_queue_job / images_cleanup / images_prune / images_delete_tag / images_delete_repo)
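
For reviewers, a minimal sketch of how an umbrella flag can override the per-tool flags. It is not the contents of mcp/feature_flags.py: the helper names and the set of accepted truthy spellings are assumptions made for illustration.

```python
import os

UMBRELLA_FLAG = "GCO_ENABLE_ALL_TOOLS"

# Assumption: which spellings count as "enabled" is not stated in this PR.
_TRUTHY = {"1", "true", "yes", "on"}


def _is_truthy(value):
    """Return True when an environment value looks like an opt-in."""
    return value is not None and value.strip().lower() in _TRUTHY


def is_enabled(flag, env=None):
    """Return True when `flag` itself or the umbrella flag is switched on.

    `env` defaults to os.environ; passing a dict keeps the check testable.
    """
    env = os.environ if env is None else env
    return _is_truthy(env.get(UMBRELLA_FLAG)) or _is_truthy(env.get(flag))
```

With this shape, a gated tool would be registered only when is_enabled("GCO_ENABLE_MODEL_UPLOAD") (or the matching flag for that tool) returns True.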

FastMCP integration:

  • Long-running tools (deploy_stack, deploy_all, bootstrap_cdk, destroy_stack, destroy_all, images_build, images_push) use FastMCP Tasks via task=TaskConfig(mode=...) plus a shared async subprocess runner (mcp/tools/_long_task.py) that streams progress, handles cancellation with SIGTERM -> 10s grace -> SIGKILL, and surfaces the partial-CFN-state disclaimer for stack ops.
  • Audit decorator (mcp/audit.py) now dispatches on inspect.iscoroutinefunction so async tools work, captures request_id / client_id / task_id from FastMCP Context, and pulls client_messages / elicitations from a new AuditCaptureMiddleware (mcp/audit_middleware.py).
  • Three FastMCP transforms wired in mcp/server.py: Resources As Tools (always on, exposes synthetic list_resources / read_resource for tool-only clients like Cursor), and one of BM25 / Regex / Code Mode selected via GCO_MCP_TOOL_SEARCH (default 'bm25', mutually exclusive among bm25 / regex / code_mode / off, unknown values fall back to bm25). Code Mode also exposes GCO_MCP_CODE_MODE_MAX_DURATION_SECS and GCO_MCP_CODE_MODE_MAX_MEMORY for the MontySandboxProvider.
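
The SIGTERM -> grace -> SIGKILL escalation described above can be sketched with plain asyncio. This is an illustration only, not the code in mcp/tools/_long_task.py; terminate_with_grace and demo are hypothetical names, and progress streaming is left out.

```python
import asyncio
import sys


async def terminate_with_grace(proc, grace_seconds=10.0):
    """Ask a child process to stop, then force it if the grace period expires.

    Returns the child's exit code.
    """
    if proc.returncode is not None:
        return proc.returncode  # already exited, nothing to do
    try:
        proc.terminate()  # SIGTERM on POSIX
    except ProcessLookupError:
        return await proc.wait()  # exited between the check and the signal
    try:
        return await asyncio.wait_for(proc.wait(), timeout=grace_seconds)
    except asyncio.TimeoutError:
        proc.kill()  # SIGKILL on POSIX
        return await proc.wait()


async def demo():
    """Start a child that would sleep for a minute and cancel it early."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )
    return await terminate_with_grace(proc, grace_seconds=2.0)
```

Calling asyncio.run(demo()) returns a non-zero exit code because the child is stopped by the signal rather than finishing normally.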

Discovery and live state:

  • Examples discovery: 4 new EXAMPLE_METADATA fields (keywords, instance_types, use_cases, related), find_examples tool, docs://gco/examples/by-category/{category} and docs://gco/examples/by-use-case/{use_case} resource paths.
  • Docs discovery: symmetric DOC_METADATA, find_docs tool, docs://gco/docs/by-topic/{topic} and docs://gco/docs/by-related/{doc_name} resource paths.
  • Six live-state resources: gco://jobs/{job_name}, gco://inference/{endpoint_name}, gco://k8s/{namespace}/{kind}/{name}, gco://cluster/{region}/topology, costs://gco/summary/{days_window}, tasks://gco/{task_id}.
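
A rough sketch of how metadata-driven discovery over the new fields could work. The two entries, the filter_examples name, and its parameters are invented for illustration; the real EXAMPLE_METADATA contents and the find_examples signature are not shown in this PR description.

```python
# Hypothetical entries using the four new fields named above.
EXAMPLE_METADATA = {
    "example-training-job": {
        "keywords": ["training", "gpu"],
        "instance_types": ["p5.48xlarge"],
        "use_cases": ["fine-tuning"],
        "related": ["example-inference-endpoint"],
    },
    "example-inference-endpoint": {
        "keywords": ["inference", "gpu"],
        "instance_types": ["g6.xlarge"],
        "use_cases": ["serving"],
        "related": ["example-training-job"],
    },
}


def _contains(values, wanted):
    """Case-insensitive membership test."""
    wanted = wanted.lower()
    return any(value.lower() == wanted for value in values)


def filter_examples(metadata, keyword=None, instance_type=None, use_case=None):
    """Return sorted example names that satisfy every filter that was given."""
    filters = (
        ("keywords", keyword),
        ("instance_types", instance_type),
        ("use_cases", use_case),
    )
    names = []
    for name, fields in metadata.items():
        if all(
            wanted is None or _contains(fields.get(field, []), wanted)
            for field, wanted in filters
        ):
            names.append(name)
    return sorted(names)
```

Filters combine with AND, so adding a second filter can only narrow the result.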

Container image registry:

  • New cli/_container_runtime.py exposing detect_container_runtime (docker > finch > podman) with CDK_DOCKER override.
  • New cli/images.py::ImageManager with full CRUD (build, push, list, tags, describe, uri, init, lifecycle get/set, replication get/sync/status, orphans, delete_tag, delete_repo, cleanup, prune).
  • New 'gco images' CLI subcommand group.
  • GCOGlobalStack provisions the project ECR repo + lifecycle policy + account-wide replication rule for gco/* across every deployed region + a lookup-or-create custom resource Lambda (lambda/image-lookup/handler.py) that adopts existing repos and honours the gco:retain=true tag.
  • Inference deploy rewrites ECR image URIs to the local replica per region.
  • Pre-destroy inventory summary + helpful errors guard against destroying gco-global with non-empty repos.
  • MCP tool surface (mcp/tools/images.py): 11 unconditional tools (7 read-only safe + 4 administrative low-risk), 2 gated by GCO_ENABLE_IMAGE_PUBLISH (build, push), 4 gated by GCO_ENABLE_DESTRUCTIVE_OPERATIONS (cleanup, prune, delete_tag, delete_repo).
  • MCP resource surface (mcp/resources/images.py): images://gco/index, images://gco/{name}/tags, images://gco/{name}/{tag}, images://gco/replication/status.
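
To make the per-region URI rewrite concrete, here is a small sketch that swaps the region in a private ECR image URI and leaves anything else untouched. The function name and regular expression are assumptions for illustration; the project's own helper may add rules this sketch omits (for example, limiting the rewrite to gco/* repositories).

```python
import re

# Private ECR hosts look like <account>.dkr.ecr.<region>.amazonaws.com.
# Assumption: only the standard commercial-partition host shape is handled.
_ECR_URI = re.compile(
    r"(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)"
    r"\.amazonaws\.com(?P<rest>/.*)"
)


def rewrite_ecr_region(image_uri, target_region):
    """Point a private ECR image URI at the replica in `target_region`.

    URIs that are not private ECR URIs are returned unchanged.
    """
    match = _ECR_URI.fullmatch(image_uri)
    if match is None:
        return image_uri
    return (
        f"{match['account']}.dkr.ecr.{target_region}"
        f".amazonaws.com{match['rest']}"
    )
```

Returning non-ECR URIs unchanged keeps public images and third-party registries out of the rewrite path.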

Python version + deprecation cleanup:

  • requires-python bumped to >=3.14 in pyproject.toml; classifiers for 3.10 / 3.11 / 3.12 / 3.13 dropped.
  • README.md, mcp/README.md, CONTRIBUTING.md, QUICKSTART.md, .github/oidc_provider/README.md, docs/TROUBLESHOOTING.md, and demo/DEMO_WALKTHROUGH.md updated.
  • tests/test_integration.py's single re.match call migrated to re.search (3.15 soft-deprecates re.match; the explicit ^ anchor makes the two functions equivalent).
  • New guardrail tests/test_no_python_315_deprecation_surface.py walks the production tree and fails on any 3.15 deprecation surface (typing.ByteString, glob.glob0/1, NamedTuple keyword-arg syntax, TypedDict zero-field syntax, bare re.match, etc.).
  • New tests/test_mcp_python_version.py confirms the un-parenthesized except-tuple in mcp/resources/config.py compiles on 3.14+ and greps every doc for legacy Python version references.
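
A quick check of the re.match -> re.search equivalence claimed above. The pattern and sample strings are made up for the demonstration. The equivalence depends on the leading ^ and on not passing re.MULTILINE, since with that flag ^ also matches after a newline.

```python
import re

# Hypothetical anchored pattern; any pattern with a leading ^ behaves the same.
PATTERN = r"^gco-[a-z0-9-]+$"

SAMPLES = ["gco-global", "x-gco-global", "gco-Global", ""]


def agree(pattern, text, flags=0):
    """Return True when re.match and re.search reach the same verdict."""
    matched = re.match(pattern, text, flags) is not None
    searched = re.search(pattern, text, flags) is not None
    return matched == searched


# Without re.MULTILINE the two functions agree on every sample.
ALL_AGREE = all(agree(PATTERN, sample) for sample in SAMPLES)

# With re.MULTILINE, ^ can match at the start of a later line, so they differ.
MULTILINE_AGREE = agree(r"^b", "a\nb", re.MULTILINE)
```

ALL_AGREE is True and MULTILINE_AGREE is False, which is why the migration is only safe for anchored patterns used without re.MULTILINE.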

Documentation:

  • mcp/README.md reorganized: top-of-file capacity-purchase blockquote replaced with a single-sentence pointer to the new ## Feature Flags section. Every Available Tools subsection table now has Risk Tier and Gated By columns. New subsections for Queue Management, Templates, Webhooks, DAG Pipelines, NodePools, Analytics, Config, Image Registry, Examples Discovery, Docs Discovery, Live State.
  • tests/README.md expanded with detailed entries for every new MCP test file, the image registry test layer (CLI + global stack), and the two codebase guardrails.
  • README.md project-structure tree, MCP-server one-liner, and tool count bumped to 90 (default) / up to 111 (with feature flags).

Tests + verification:

  • 4349 pytest tests pass, 3 skipped (FastMCP API mismatches on certain transform-tag-filter surfaces).
  • Full ruff format + check, mypy strict (mcp/, cli/, scripts/, app.py excluding stacks) all clean.
  • New guardrail tests/test_no_spec_references.py rejects requirements.md / design.md / tasks.md / bugfix.md filenames and prose patterns ('per the spec', 'see the design doc', etc.) in production code, examples, docs, and project READMEs.
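
One possible shape for the scanning core of such a guardrail, shown as a sketch. The pattern list covers only the filenames and the two phrases quoted above, and find_spec_references is a hypothetical helper, not the code in tests/test_no_spec_references.py.

```python
import re

# Filenames and prose patterns quoted in the bullet above; the real
# guardrail may check more than these.
FORBIDDEN = [
    re.compile(r"\b(?:requirements|design|tasks|bugfix)\.md\b"),
    re.compile(r"per the spec", re.IGNORECASE),
    re.compile(r"see the design doc", re.IGNORECASE),
]


def find_spec_references(text):
    """Return (line_number, stripped_line) for every offending line."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in FORBIDDEN):
            hits.append((number, line.strip()))
    return hits
```

A test built on this would read each file under the scanned directories and assert that find_spec_references returns an empty list.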

Diagrams:

  • diagrams/code_diagrams/_targets.py extended with new entries for the branchy MCP modules (audit_logged, _run_long_task, detect_container_runtime, ImageManager.build / push / cleanup, image-lookup Lambda).
  • Both diagram catalogues (code_diagrams/, infra_diagrams/) regenerated; demo PDFs (DEMO_WALKTHROUGH, INFERENCE_WALKTHROUGH) regenerated.

Summary

Type of change

  • feat: New feature (non-breaking)
  • fix: Bug fix (non-breaking)
  • docs: Documentation only
  • refactor: Code refactor (no behavior change)
  • perf: Performance improvement
  • test: Test-only change
  • ci: CI / tooling change
  • chore: Maintenance (dep bumps, etc.)
  • breaking: Breaking change (major version bump)

Testing

  • pytest tests/ passes locally
  • cdk synth succeeds (if CDK code changed)
  • New tests added for new behavior
  • Ran the change against a real AWS account (describe below)

Checklist

  • Documentation updated (README, docs/, inline docstrings) as needed
  • No secrets, credentials, or customer data committed
  • requirements-lock.txt regenerated if pyproject.toml changed
  • Changes align with the architecture described in docs/ARCHITECTURE.md

Related issues

Comment thread cli/_container_runtime.py Fixed
Comment thread cli/images.py Fixed
Comment thread cli/images.py Fixed
Comment thread cli/inference.py Fixed
Comment thread mcp/audit_middleware.py Fixed
Comment thread mcp/run_mcp.py Fixed
Jmevorach added 2 commits May 18, 2026 23:01
- demo/GCO_PRESENTATION.{pdf,pptx}: pick up the updated tool counts
- mcp/README.md: collapse 4 double-blank lines flagged by markdownlint MD012
- requirements-lock.txt: regenerate via Dockerfile.dev to add pins for
  rank-bm25, pydocket and its transitive deps (burner-redis, cloudpickle,
  cronsim, numpy, pydantic-monty, python-json-logger, redis, shellingham,
  typer) introduced by fastmcp[tasks,code-mode]==3.2.4
CodeQL:
- cli/_container_runtime.py: drop redundant _container_runtime_checked
  bool, use a single sentinel-valued cache (_UNCHECKED) instead of a
  pair of globals.
- cli/_image_uri.py: NEW leaf module that hosts rewrite_image_uri_for_region
  and _ECR_HOST_RE. Both cli/images.py and cli/inference.py now
  depend on it instead of on each other, so the module-level dependency
  graph is a DAG (breaks the deferred-import cycle CodeQL flagged).
- cli/images.py: drop the local definition; re-export the helper for
  back-compat with the test suite. Replace the empty except body that
  swallowed RepositoryAlreadyExistsException with a debug-log line
  documenting the idempotent intent.
- mcp/audit_middleware.py: replace the unused _PATCHES_INSTALLED
  global with a marker attribute on the patched method so re-imports
  observe the patch state directly on Context.warning.
- mcp/run_mcp.py: hoist import importlib as _importlib to module
  scope and drop the four redundant in-block re-imports the gated
  reload pattern was emitting.
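
The single-sentinel cache mentioned for cli/_container_runtime.py can be sketched as follows. This is an assumption-laden illustration, not the module itself: detect_runtime and reset_runtime_cache are hypothetical names, and treating CDK_DOCKER as an unconditional override is a guess at the intended behaviour.

```python
import os
import shutil

# A unique sentinel separates "never checked" from "checked, found nothing"
# (None), so one global replaces the value-plus-boolean pair.
_UNCHECKED = object()
_cached_runtime = _UNCHECKED

_CANDIDATES = ("docker", "finch", "podman")  # preference order from the PR


def detect_runtime(which=shutil.which, env=None):
    """Return the container runtime name, or None if none is installed."""
    global _cached_runtime
    if _cached_runtime is not _UNCHECKED:
        return _cached_runtime
    env = os.environ if env is None else env
    override = env.get("CDK_DOCKER")
    if override:
        _cached_runtime = override
    else:
        _cached_runtime = next(
            (name for name in _CANDIDATES if which(name)), None
        )
    return _cached_runtime


def reset_runtime_cache():
    """Forget the cached answer (useful between tests)."""
    global _cached_runtime
    _cached_runtime = _UNCHECKED
```

Because None is a legitimate cached result, a "nothing installed" answer is also remembered and the PATH lookup is not repeated.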

cdk-nag:
- gco/stacks/global_stack.py: add a scoped AwsSolutions-IAM5
  suppression on ImageLookupFunction's default policy. appliesTo
  uses the literal <AWS::AccountId> token cdk-nag emits; the reason
  documents that the arn:aws:ecr:*:<account>:repository/gco/*
  pattern is the documented IAM way to express the function's
  contract (manage every ECR repo under the project prefix).

Trivy:
- dockerfiles/{health,inference,manifest,queue}-monitor + Dockerfile.dev:
  add apt-get upgrade -y before the install step. Pulls the
  trixie-security patches for libcap2 (CVE-2026-4878),
  libsystemd0/libudev1 (CVE-2026-29111) that ship newer than the
  python:3.14.5-slim base image's snapshot. Verified the upgrade lands
  the fixed versions inside a fresh container.
- inference-monitor: had no apt-get block at all; added an
  upgrade-only RUN step so its base image gets the same patches.
- .trivyignore: bump the 6 dated CVE entries from 2026-06-13 to
  2026-07-18. Helm v4.1.4 and kubectl v1.35.5 are still the latest
  releases (both Go 1.25.9), and AWS hasn't rebuilt the Lambda Python
  base image yet.

Drive-by:
- mcp/resources/{cluster,costs}.py: ruff format pulled them onto the
  PEP 758 un-parenthesized except syntax now that the project pins
  Python >= 3.14.
- tests/test_container_runtime.py: docstring update to match the
  new single-sentinel cache shape.
Comment thread cli/_container_runtime.py Dismissed
Comment thread cli/images.py Dismissed
Jmevorach added 2 commits May 19, 2026 00:34
* pyproject.toml: extend coverage source to ['gco', 'cli', 'mcp'];
  raise fail_under from 85 to 90.
* CI configs: add --cov=mcp to every pytest invocation in
  .github/workflows/unit-tests.yml, .gitlab-ci.yml, .github/CI.md,
  CONTRIBUTING.md, README.md, and tests/README.md.

New test files:
  - tests/test_images_cli_extended.py — 64 cases covering the
    long-tail ImageManager surface (read-only, admin, destructive,
    cleanup/prune/orphans, ECR auth, immutable-tag collision check,
    push-only path, region resolution, factory).
  - tests/test_images_cmd.py — 50 cases driving every gco images
    Click subcommand through CliRunner against a mocked
    ImageManager (init, list, tags, describe, uri, build, push,
    delete-tag, delete-repo, cleanup, prune, orphans, lifecycle
    get/set, replication get/status/sync), with --yes confirmation
    gating, --build-arg parsing, and --no-dry-run on prune.
  - tests/test_mcp_extended_coverage.py — 58 cases for the long
    tail of mcp/ modules — iam.py role assumption, tasks.py
    accessor + coercion, cluster.py / k8s.py kubectl branches,
    iam_policies.py and ci.py read paths, find_docs filtering
    branches, every read-only and administrative tools/images.py
    body, audit middleware buffer reset on exit and on exception,
    and every cli_runner._run_cli error path.
  - tests/test_stacks_extended_coverage.py — 45 cases for the
    cli/stacks.py destroy flow — _read_images_config,
    _build_image_registry_inventory, _image_registry_destroy_preflight,
    CloudFormation describe/delete helpers, analytics toggle
    helpers, _api_gateway_imports_from_analytics, _cleanup_backup_vault,
    _cleanup_eks_security_groups, and _start_eks_sg_watchdog.

Drive-by:
  - tests/test_stacks.py: patch _detect_container_runtime in the
    two TestStackManagerDeployAnalyticsAutoApiGateway cases that
    were missing it (cli/_container_runtime.py runtime cache reset
    is now per-module).
  - tests/README.md: document each new file in the matching
    section. Drops 'requires' for 'enforces' so the floor and
    target wording aligns with the new gate.
  - demo/GCO_PRESENTATION.{pdf,pptx}: refreshed slides.

Coverage results:
  - 4566 passed, 3 skipped on the full suite (excluding heavy CDK
    matrix tests).
  - Overall coverage 92% across gco/, cli/, mcp/.
The CI build of Dockerfile.dev hit a mid-download IncompleteRead while
pulling pydantic_monty (3.1 MB of an expected 7.0 MB) and aborted with
an unrecoverable 'Connection broken' from urllib3. Add four layers of
mitigation around every pip install in the repo:

* --retries 10: raise pip's per-connection retry budget (default 5).
* --timeout 120: widen the socket timeout from the default 15s.
* --resume-retries 5 (pip 25.1+): resume partial downloads instead
  of restarting them from byte 0 — the exact failure mode CI just hit.
* Wrap each pip call in a 3-attempt bash loop as a final belt-and-
  suspenders guard for unrecoverable transport errors.

Applied identically across every Dockerfile in the repo so the dev
image, the four service images, and the helm-installer Lambda all
follow the same pattern.

Verified locally: finch build -f Dockerfile.dev -t gco-dev .
completes; pip install --help inside the resulting image lists
--retries, --timeout, and --resume-retries as recognised
flags.
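
The commit wraps pip in a bash loop inside the Dockerfiles. The same control flow is sketched here in Python purely to show the retry logic. The helper names and the pause between attempts are assumptions; the pip flags are the ones listed above.

```python
import subprocess
import sys
import time

# Flags taken from the commit message above.
PIP_FLAGS = ["--retries", "10", "--timeout", "120", "--resume-retries", "5"]


def pip_install_command(*packages):
    """Build a pip install command line with the hardened flags."""
    return [sys.executable, "-m", "pip", "install", *PIP_FLAGS, *packages]


def run_with_attempts(command, attempts=3, delay_seconds=5.0,
                      runner=subprocess.call, sleep=time.sleep):
    """Run `command` until it exits 0 or the attempts are used up.

    Returns the last exit code. `runner` and `sleep` are injectable so the
    loop can be exercised without touching the network.
    """
    code = 1
    for attempt in range(1, attempts + 1):
        code = runner(command)
        if code == 0:
            return 0
        if attempt < attempts:
            sleep(delay_seconds)  # assumption: a short pause between tries
    return code
```

The outer loop only matters when pip's own retries and resume logic have already given up, which matches the belt-and-suspenders role described in the commit.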