WIP: feat(mcp): expand tool surface, add feature flags, FastMCP transforms #66
Open
Jmevorach wants to merge 5 commits into
The MCP server (mcp/run_mcp.py) gains a substantially larger tool and
resource surface, a documented feature-flag taxonomy, FastMCP Tasks
support for long-running operations, FastMCP Context migration with
audit-log enrichment, examples and docs discovery, live-state
resources, a container image registry, and FastMCP transforms (Resources
As Tools, BM25 / Regex / Code Mode catalog replacement).
Tool surface (90 tools by default, up to 111 with feature flags):
* New 'safe' read-only tools across queue, templates, webhooks, dag,
nodepools, analytics, config, plus stacks inspection (stack_diff /
stack_outputs / stack_synth / valkey_status / aurora_status) and
storage (files_get / files_access_points).
* New 'low-risk' mutating tools (queue_submit, templates_create /
templates_run, webhooks_create, dag_run, FSx / Valkey / Aurora /
Analytics enable/disable, analytics_user_add, nodepools_create_odcr).
* New gated tools — every infrastructure / capacity / model-upload /
destructive / image-publish operation sits behind a per-flag
environment variable, plus the umbrella GCO_ENABLE_ALL_TOOLS.
* Existing 'delete_job' and 'delete_inference' moved under
GCO_ENABLE_DESTRUCTIVE_OPERATIONS (intentional backward-compat break;
documented in mcp/README.md).
Feature flags (mcp/feature_flags.py):
* GCO_ENABLE_ALL_TOOLS — umbrella; overrides per-tool flags
* GCO_ENABLE_CAPACITY_PURCHASE — gates reserve_capacity (existing)
* GCO_ENABLE_MODEL_UPLOAD — gates models_upload
* GCO_ENABLE_IMAGE_PUBLISH — gates images_build / images_push
* GCO_ENABLE_INFRASTRUCTURE_DEPLOY — gates deploy_stack /
deploy_all / bootstrap_cdk
* GCO_ENABLE_INFRASTRUCTURE_DESTROY — gates destroy_stack / destroy_all
* GCO_ENABLE_DESTRUCTIVE_OPERATIONS — gates every destructive tool
(delete_job / delete_inference / delete_template / delete_webhook /
delete_model / delete_nodepool / analytics_user_remove /
cancel_queue_job / images_cleanup / images_prune /
images_delete_tag / images_delete_repo)
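A minimal sketch of how the umbrella flag could override the per-tool flags listed above. The flag names come from this description; the `TOOL_FLAGS` mapping, the accepted truthy spellings, and the function names are assumptions for illustration and are not taken from mcp/feature_flags.py.

```python
import os
from collections.abc import Mapping
from typing import Optional

UMBRELLA_FLAG = "GCO_ENABLE_ALL_TOOLS"

# Assumption: which string values count as "enabled".
_TRUTHY = {"1", "true", "yes", "on"}

# Hypothetical subset of the tool -> flag mapping described above.
TOOL_FLAGS = {
    "reserve_capacity": "GCO_ENABLE_CAPACITY_PURCHASE",
    "models_upload": "GCO_ENABLE_MODEL_UPLOAD",
    "images_build": "GCO_ENABLE_IMAGE_PUBLISH",
    "images_push": "GCO_ENABLE_IMAGE_PUBLISH",
    "deploy_stack": "GCO_ENABLE_INFRASTRUCTURE_DEPLOY",
    "destroy_stack": "GCO_ENABLE_INFRASTRUCTURE_DESTROY",
    "delete_job": "GCO_ENABLE_DESTRUCTIVE_OPERATIONS",
    "delete_inference": "GCO_ENABLE_DESTRUCTIVE_OPERATIONS",
}


def flag_enabled(name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the named environment variable holds a truthy value."""
    env = os.environ if env is None else env
    return env.get(name, "").strip().lower() in _TRUTHY


def tool_enabled(tool: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Ungated tools are always on; gated tools need their flag or the umbrella."""
    flag = TOOL_FLAGS.get(tool)
    if flag is None:
        return True
    return flag_enabled(UMBRELLA_FLAG, env) or flag_enabled(flag, env)
```

Passing `env` explicitly keeps the check testable without mutating `os.environ`.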
FastMCP integration:
* Long-running tools (deploy_stack, deploy_all, bootstrap_cdk,
destroy_stack, destroy_all, images_build, images_push) use FastMCP
Tasks via task=TaskConfig(mode=...) plus a shared async subprocess
runner (mcp/tools/_long_task.py) that streams progress, handles
cancellation with SIGTERM -> 10s grace -> SIGKILL, and surfaces the
partial-CFN-state disclaimer for stack ops.
* Audit decorator (mcp/audit.py) now dispatches on
inspect.iscoroutinefunction so async tools work, captures
request_id / client_id / task_id from FastMCP Context, and pulls
client_messages / elicitations from a new AuditCaptureMiddleware
(mcp/audit_middleware.py).
* Three FastMCP transforms wired in mcp/server.py: Resources As Tools
(always on, exposes synthetic list_resources / read_resource for
tool-only clients like Cursor), and one of BM25 / Regex / Code Mode
selected via GCO_MCP_TOOL_SEARCH (default 'bm25', mutually exclusive
among bm25 / regex / code_mode / off, unknown values fall back to
bm25). Code Mode also exposes GCO_MCP_CODE_MODE_MAX_DURATION_SECS
and GCO_MCP_CODE_MODE_MAX_MEMORY for the MontySandboxProvider.
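The cancellation sequence described for the long-task runner (SIGTERM, a 10s grace period, then SIGKILL) could look roughly like the sketch below. It uses only the standard library; the function names and the `on_line` callback are hypothetical and do not reflect the actual mcp/tools/_long_task.py API.

```python
import asyncio
from collections.abc import Callable, Sequence

GRACE_SECONDS = 10.0  # the grace period named in the description above


async def terminate_with_grace(
    proc: asyncio.subprocess.Process, grace: float = GRACE_SECONDS
) -> int:
    """Send SIGTERM, wait up to `grace` seconds, then escalate to SIGKILL."""
    if proc.returncode is None:
        proc.terminate()  # SIGTERM on POSIX
        try:
            await asyncio.wait_for(proc.wait(), timeout=grace)
        except asyncio.TimeoutError:
            proc.kill()  # SIGKILL on POSIX
            await proc.wait()
    assert proc.returncode is not None
    return proc.returncode


async def run_streaming(
    cmd: Sequence[str],
    on_line: Callable[[str], None],
    grace: float = GRACE_SECONDS,
) -> int:
    """Run `cmd`, forward each output line to `on_line`, clean up on cancel."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        assert proc.stdout is not None
        async for raw in proc.stdout:
            on_line(raw.decode(errors="replace").rstrip())
        return await proc.wait()
    except asyncio.CancelledError:
        # The task was cancelled: stop the child before propagating.
        await terminate_with_grace(proc, grace)
        raise
```

In a FastMCP tool the `on_line` callback would presumably forward to a progress-reporting call on the Context; that wiring is left out here because its exact shape is not shown in this description.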
Discovery and live state:
* Examples discovery: 4 new EXAMPLE_METADATA fields (keywords,
instance_types, use_cases, related), find_examples tool,
docs://gco/examples/by-category/{category} and
docs://gco/examples/by-use-case/{use_case} resource paths.
* Docs discovery: symmetric DOC_METADATA, find_docs tool,
docs://gco/docs/by-topic/{topic} and
docs://gco/docs/by-related/{doc_name} resource paths.
* Six live-state resources: gco://jobs/{job_name},
gco://inference/{endpoint_name},
gco://k8s/{namespace}/{kind}/{name},
gco://cluster/{region}/topology,
costs://gco/summary/{days_window}, tasks://gco/{task_id}.
Container image registry:
* New cli/_container_runtime.py exposing detect_container_runtime
(docker > finch > podman) with CDK_DOCKER override.
* New cli/images.py::ImageManager with full CRUD (build, push, list,
tags, describe, uri, init, lifecycle get/set, replication get/sync/
status, orphans, delete_tag, delete_repo, cleanup, prune).
* New 'gco images' CLI subcommand group.
* GCOGlobalStack provisions the project ECR repo + lifecycle policy +
account-wide replication rule for gco/* across every deployed
region + a lookup-or-create custom resource Lambda
(lambda/image-lookup/handler.py) that adopts existing repos and
honours the gco:retain=true tag.
* Inference deploy rewrites ECR image URIs to the local replica per
region.
* Pre-destroy inventory summary + helpful errors guard against
destroying gco-global with non-empty repos.
* MCP tool surface (mcp/tools/images.py): 11 unconditional tools (7
read-only safe + 4 administrative low-risk), 2 gated by
GCO_ENABLE_IMAGE_PUBLISH (build, push), 4 gated by
GCO_ENABLE_DESTRUCTIVE_OPERATIONS (cleanup, prune, delete_tag,
delete_repo).
* MCP resource surface (mcp/resources/images.py): images://gco/index,
images://gco/{name}/tags, images://gco/{name}/{tag},
images://gco/replication/status.
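As a rough illustration of the per-region URI rewrite mentioned above, the sketch below swaps the region component of a private ECR host and leaves every other image reference untouched. The names `rewrite_image_uri_for_region` and `_ECR_HOST_RE` appear in this PR, but the signature, the regular expression, and the pass-through behaviour for non-ECR images are assumptions.

```python
import re

# Assumption: the standard private ECR host form
# <12-digit-account>.dkr.ecr.<region>.amazonaws.com/<repo>[:tag|@digest]
_ECR_HOST_RE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?P<rest>/.+)$"
)


def rewrite_image_uri_for_region(uri: str, region: str) -> str:
    """Point an ECR image URI at the replica in `region`; pass others through."""
    m = _ECR_HOST_RE.search(uri)
    if m is None:
        return uri  # not a private ECR URI, so there is nothing to rewrite
    return f"{m['account']}.dkr.ecr.{region}.amazonaws.com{m['rest']}"
```

For example, `rewrite_image_uri_for_region("123456789012.dkr.ecr.us-east-1.amazonaws.com/gco/app:v1", "eu-west-1")` would yield the same repository and tag on the `eu-west-1` host.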
Python version + deprecation cleanup:
* requires-python bumped to >=3.14 in pyproject.toml; classifiers for
3.10 / 3.11 / 3.12 / 3.13 dropped.
* README.md, mcp/README.md, CONTRIBUTING.md, QUICKSTART.md,
.github/oidc_provider/README.md, docs/TROUBLESHOOTING.md, and
demo/DEMO_WALKTHROUGH.md updated.
* tests/test_integration.py's single re.match call migrated to
re.search (3.15 soft-deprecates re.match; the explicit ^ anchor
makes the two functions equivalent).
* New guardrail tests/test_no_python_315_deprecation_surface.py walks
the production tree and fails on any 3.15 deprecation surface
(typing.ByteString, glob.glob0/1, NamedTuple keyword-arg syntax,
TypedDict zero-field syntax, bare re.match, etc.).
* New tests/test_mcp_python_version.py confirms the un-parenthesized
except-tuple in mcp/resources/config.py compiles on 3.14+ and
greps every doc for legacy Python version references.
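The re.match to re.search migration above relies on the explicit ^ anchor. A small sketch of why that holds, and of the one case where it would not (re.MULTILINE), follows; the pattern and sample strings are made up for illustration and are not the ones in tests/test_integration.py.

```python
import re

PATTERN = r"^job-\d+$"  # hypothetical pattern with an explicit start anchor


def same_result(pattern: str, text: str, flags: int = 0) -> bool:
    """True when re.match and re.search agree on whether `text` matches."""
    matched = re.match(pattern, text, flags) is not None
    searched = re.search(pattern, text, flags) is not None
    return matched == searched


# Without MULTILINE, ^ only matches at position 0, so the two agree.
for sample in ("job-42", "xjob-42", "job-", ""):
    assert same_result(PATTERN, sample)

# With MULTILINE, ^ also matches after a newline, so search can succeed
# where match (pinned to position 0) does not.
assert not same_result(PATTERN, "header\njob-42", re.MULTILINE)
```

The equivalence therefore depends on the pattern not being compiled with re.MULTILINE.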
Documentation:
* mcp/README.md reorganized: top-of-file capacity-purchase blockquote
replaced with a single-sentence pointer to the new ## Feature Flags
section. Every Available Tools subsection table now has Risk Tier
and Gated By columns. New subsections for Queue Management,
Templates, Webhooks, DAG Pipelines, NodePools, Analytics, Config,
Image Registry, Examples Discovery, Docs Discovery, Live State.
* tests/README.md expanded with detailed entries for every new MCP
test file, the image registry test layer (CLI + global stack), and
the two codebase guardrails.
* README.md project-structure tree, MCP-server one-liner, and tool
count bumped to 90 (default) / up to 111 (with feature flags).
Tests + verification:
* 4349 pytest tests pass, 3 skipped (FastMCP API mismatches on
certain transform-tag-filter surfaces).
* Full ruff format + check, mypy strict (mcp/, cli/, scripts/, app.py
excluding stacks) all clean.
* New guardrail tests/test_no_spec_references.py rejects
requirements.md / design.md / tasks.md / bugfix.md filenames and
prose patterns ('per the spec', 'see the design doc', etc.) in
production code, examples, docs, and project READMEs.
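A guardrail of the kind described above can be approximated with two regular expressions, one for the rejected filenames and one for the prose patterns. This is only a sketch: the function name, the exact expressions, and the phrase list beyond the two quoted examples are assumptions, not the contents of tests/test_no_spec_references.py.

```python
import re

# Filenames named in the description above.
_SPEC_FILENAME_RE = re.compile(r"\b(?:requirements|design|tasks|bugfix)\.md\b")

# Only the two prose patterns quoted above; the real list is longer.
_SPEC_PROSE_RE = re.compile(r"per the spec|see the design doc", re.IGNORECASE)


def find_spec_references(text: str) -> list[str]:
    """Return every spec filename or spec-style phrase found in `text`."""
    hits = [m.group(0) for m in _SPEC_FILENAME_RE.finditer(text)]
    hits += [m.group(0) for m in _SPEC_PROSE_RE.finditer(text)]
    return hits
```

A test built on this would read each production file and assert that `find_spec_references` returns an empty list.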
Diagrams:
* diagrams/code_diagrams/_targets.py extended with new entries
for the branchy MCP modules (audit_logged, _run_long_task,
detect_container_runtime, ImageManager.build / push / cleanup,
image-lookup Lambda).
* Both diagram catalogues (code_diagrams/, infra_diagrams/)
regenerated; demo PDFs (DEMO_WALKTHROUGH, INFERENCE_WALKTHROUGH)
regenerated.
- demo/GCO_PRESENTATION.{pdf,pptx}: pick up the updated tool counts
- mcp/README.md: collapse 4 double-blank lines flagged by markdownlint MD012
- requirements-lock.txt: regenerate via Dockerfile.dev to add pins for
rank-bm25, pydocket and its transitive deps (burner-redis, cloudpickle,
cronsim, numpy, pydantic-monty, python-json-logger, redis, shellingham,
typer) introduced by fastmcp[tasks,code-mode]==3.2.4
CodeQL:
- cli/_container_runtime.py: drop redundant _container_runtime_checked
bool, use a single sentinel-valued cache (_UNCHECKED) instead of a
pair of globals.
- cli/_image_uri.py: NEW leaf module that hosts rewrite_image_uri_for_region
and _ECR_HOST_RE. Both cli/images.py and cli/inference.py now
depend on it instead of on each other, so the module-level dependency
graph is a DAG (breaks the deferred-import cycle CodeQL flagged).
- cli/images.py: drop the local definition; re-export the helper for
back-compat with the test suite. Replace the empty except body that
swallowed RepositoryAlreadyExistsException with a debug-log line
documenting the idempotent intent.
- mcp/audit_middleware.py: replace the unused _PATCHES_INSTALLED
global with a marker attribute on the patched method so re-imports
observe the patch state directly on Context.warning.
- mcp/run_mcp.py: hoist import importlib as _importlib to module
scope and drop the four redundant in-block re-imports the gated
reload pattern was emitting.
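The single-sentinel cache mentioned above for cli/_container_runtime.py can be sketched as follows. A sentinel is useful because None is itself a valid cached answer (no runtime found), so it cannot double as the "not checked yet" marker. The detection order and the `_UNCHECKED` name come from this PR; treating CDK_DOCKER as a direct override value, probing with `shutil.which`, and the reset helper are assumptions.

```python
import os
import shutil
from typing import Optional

_UNCHECKED = object()           # marks "detection has not run yet"
_cached_runtime = _UNCHECKED    # single cache slot instead of value + bool

_CANDIDATES = ("docker", "finch", "podman")  # preference order from the PR


def detect_container_runtime() -> Optional[str]:
    """Return the container runtime to use, or None when none is available."""
    global _cached_runtime
    if _cached_runtime is _UNCHECKED:
        override = os.environ.get("CDK_DOCKER")  # assumption: value is used as-is
        if override:
            _cached_runtime = override
        else:
            _cached_runtime = next(
                (name for name in _CANDIDATES if shutil.which(name)), None
            )
    return _cached_runtime  # type: ignore[return-value]


def reset_runtime_cache() -> None:
    """Hypothetical test helper: forget the cached result."""
    global _cached_runtime
    _cached_runtime = _UNCHECKED
```

Tests that change the environment would call the reset helper between cases so a stale result is not reused.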
cdk-nag:
- gco/stacks/global_stack.py: add a scoped AwsSolutions-IAM5
suppression on ImageLookupFunction's default policy. appliesTo
uses the literal <AWS::AccountId> token cdk-nag emits; the reason
documents that the arn:aws:ecr:*:<account>:repository/gco/*
pattern is the documented IAM way to express the function's
contract (manage every ECR repo under the project prefix).
Trivy:
- dockerfiles/{health,inference,manifest,queue}-monitor + Dockerfile.dev:
add apt-get upgrade -y before the install step. This pulls in the
trixie-security patches for libcap2 (CVE-2026-4878) and
libsystemd0/libudev1 (CVE-2026-29111), which are newer than the
python:3.14.5-slim base image's snapshot. Verified in a fresh
container that the upgrade installs the fixed versions.
- inference-monitor: had no apt-get block at all; added an
upgrade-only RUN step so its base image gets the same patches.
- .trivyignore: bump the 6 dated CVE entries from 2026-06-13 to
2026-07-18. Helm v4.1.4 and kubectl v1.35.5 are still the latest
releases (both Go 1.25.9), and AWS hasn't rebuilt the Lambda Python
base image yet.
Drive-by:
- mcp/resources/{cluster,costs}.py: ruff format pulled them onto the
PEP 758 un-parenthesized except syntax now that the project pins
Python >= 3.14.
- tests/test_container_runtime.py: docstring update to match the
new single-sentinel cache shape.
* pyproject.toml: extend coverage source to ['gco', 'cli', 'mcp'];
raise fail_under from 85 to 90.
* CI configs: add --cov=mcp to every pytest invocation in
.github/workflows/unit-tests.yml, .gitlab-ci.yml, .github/CI.md,
CONTRIBUTING.md, README.md, and tests/README.md.
New test files:
- tests/test_images_cli_extended.py — 64 cases covering the
long-tail ImageManager surface (read-only, admin, destructive,
cleanup/prune/orphans, ECR auth, immutable-tag collision check,
push-only path, region resolution, factory).
- tests/test_images_cmd.py — 50 cases driving every gco images
Click subcommand through CliRunner against a mocked
ImageManager (init, list, tags, describe, uri, build, push,
delete-tag, delete-repo, cleanup, prune, orphans, lifecycle
get/set, replication get/status/sync), with --yes confirmation
gating, --build-arg parsing, and --no-dry-run on prune.
- tests/test_mcp_extended_coverage.py — 58 cases for the long
tail of mcp/ modules — iam.py role assumption, tasks.py
accessor + coercion, cluster.py / k8s.py kubectl branches,
iam_policies.py and ci.py read paths, find_docs filtering
branches, every read-only and administrative tools/images.py
body, audit middleware buffer reset on exit and on exception,
and every cli_runner._run_cli error path.
- tests/test_stacks_extended_coverage.py — 45 cases for the
cli/stacks.py destroy flow — _read_images_config,
_build_image_registry_inventory, _image_registry_destroy_preflight,
CloudFormation describe/delete helpers, analytics toggle
helpers, _api_gateway_imports_from_analytics, _cleanup_backup_vault,
_cleanup_eks_security_groups, and _start_eks_sg_watchdog.
Drive-by:
- tests/test_stacks.py: patch _detect_container_runtime in the
two TestStackManagerDeployAnalyticsAutoApiGateway cases that
were missing it (cli/_container_runtime.py runtime cache reset
is now per-module).
- tests/README.md: document each new file in the matching
section. Drops 'requires' for 'enforces' so the floor and
target wording aligns with the new gate.
- demo/GCO_PRESENTATION.{pdf,pptx}: refreshed slides.
Coverage results:
- 4566 passed, 3 skipped on the full suite (excluding heavy CDK
matrix tests).
- Overall coverage 92% across gco/, cli/, mcp/.
The CI build of Dockerfile.dev hit a mid-download IncompleteRead while
pulling pydantic_monty (3.1 MB of an expected 7.0 MB) and aborted with
an unrecoverable 'Connection broken' from urllib3.
Add three pip flags plus a retry loop around every pip install in the
repo:
* --retries 10: raise pip's per-connection retry budget (default 5).
* --timeout 120: widen the socket timeout from the default 15s.
* --resume-retries 5 (pip 25.1+): resume partial downloads instead of
restarting them from byte 0, which is the exact failure mode CI just
hit.
* Wrap each pip call in a 3-attempt bash loop as a final
belt-and-suspenders guard for unrecoverable transport errors.
Applied identically across every Dockerfile in the repo so the dev
image, the four service images, and the helm-installer Lambda all
follow the same pattern.
Verified locally: finch build -f Dockerfile.dev -t gco-dev . completes;
pip install --help inside the resulting image lists --retries,
--timeout, and --resume-retries as recognised flags.