Conversation
|
was working on same PR last month check if anything is useful and worth stealing from it #3250 so we can close it |
|
The implementation of the protocol adapter and the managed lifecycle looks very solid. That said, the idea of Python serving requests and processing them with Rust feels less efficient than the opposite approach. We should look at the Nginx Unit: a native listener handles the socket and routes traffic to Python as a service. In that configuration, you can easily specify how many Python workers to start for the endpoints. The Nginx Unit architecture usually provides the fastest RPS—outperforming multi-worker setups like Uvicorn, Gunicorn, or Granian. |
22672db to
ed7413c
Compare
|
This is a strong piece of work overall. The Rust A2A runtime direction, protocol normalization work, and sidecar boundary are all promising. I do think there are a few high-priority issues worth addressing before merge:
|
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>
- Remove redundant local `from mcpgateway.config import settings` re-imports in a2a_service.py (W0404/W0621) - Add pylint disable for unused `interaction_type` parameter in build_a2a_jsonrpc_request (W0613) - part of public API - Add missing docstrings for interrogate coverage (8 functions) - Update test patches to target module-level settings binding at mcpgateway.services.a2a_service.settings - Redirect decode_auth/apply_query_param_auth test patches to mcpgateway.services.a2a_protocol where prepare_a2a_invocation now calls them - Update admin test assertion for simplified test_params format Signed-off-by: Jon Spriggs <github@sprig.gs> Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>
- Restore query-param credential redaction in error messages by exposing sensitive_query_param_names on PreparedA2AInvocation - Route Rust sidecar 504 (upstream timeout) through the timeout accounting path (counter, circuit-breaker, ToolTimeoutError) - Disable redirect following in the Rust reqwest client to match the Python path and prevent credential forwarding on redirects - Handle v/V-prefixed protocol version strings (e.g. "v1", "V1.0") so they correctly resolve to A2A v1 semantics - Add full docstrings to a2a_protocol.py and rust_a2a_runtime.py (resolves flake8 DAR101/DAR201/DAR401 warnings) - Add version.py A2A runtime diagnostic docstrings - Fix 6 test mock targets for module-level settings import - Add 100% coverage tests for a2a_protocol.py, rust_a2a_runtime.py, and version.py A2A diagnostics - Update .secrets.baseline for test fixture false positives Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>
…, and caching Production-grade Rust A2A sidecar with full v1.0 method dispatch, SSE streaming, Redis-backed sessions, three-layer caching, and virtual server federation. Implements all 20 items from the A2A implementation plan across 5 phases. Core engine (17 Rust modules, 122 tests): - Circuit breaker with per-endpoint per-tenant isolation - Lock-free per-agent metrics with P95 adaptive timeouts - Bounded job queue with request coalescing - AES-GCM auth decryption matching Python services_auth contract - Trust validation via loopback + HMAC shared secret A2A v1.0 method dispatch: - SendMessage, SendStreamingMessage (SSE with event store + reconnect) - GetTask, ListTasks, CancelTask (proxied to Python RPC) - GetExtendedAgentCard (cached agent cards) - Push notification CRUD (4 methods) + Rust webhook dispatch - All methods accept both v1 (PascalCase) and legacy (slash) names Streaming and sessions: - SSE passthrough from agent through Rust to client - Redis ring buffer event store (Lua atomic scripts) + async PG flush - Client reconnect via Last-Event-ID with Redis/PG replay - Redis-backed session management with auth fingerprinting and TTL - Session fast-path skips Python authenticate on reuse Three-layer caching: - L1 in-process DashMap, L2 Redis, L3 PG via Python RPC - Redis pub/sub invalidation for cross-worker L1 eviction - Graceful degradation when Redis unavailable Infrastructure: - Nginx a2a_upstream with Rust-primary/Python-backup failover - 15 Python /_internal/a2a/* RPC endpoints - Virtual server federation (servers exposed as A2A agents) - Shadow mode dual-dispatch with response comparison logging - Progressive rollout via RUST_A2A_MODE=off|shadow|edge|full Domain models: - A2ATask, ServerTaskMapping, ServerInterface, A2AAgentAuth, A2APushNotificationConfig, A2ATaskEvent (6 models, 3 migrations) - ADR-048 (A2A v1 protocol migration), ADR-049 (multi-protocol servers) Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>
Add unit tests for previously untested modules and integration tests for error handling edge cases. Unit tests: - push.rs: webhook config filtering (enabled, events match, case-insensitive) - config.rs: listen_target parsing, UDS preference, invalid address, defaults Integration tests: - Trust chain: authentication failure (401→403), empty method fallthrough - Streaming: agent error (500), agent timeout (502/504) - Malformed requests: invalid JSON body, empty endpoint URL - Method routing: unknown method falls through to agent invoke Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>
… endpoints Add 75 new Python tests covering previously untested A2A code: - A2AServerService (12 tests): get_server_agent_card, resolve_server_agent, select_downstream_agent, create/resolve_task_mapping - a2a_service new methods (24 tests): cancel_task, push config CRUD, flush/replay events, shadow mode comparison, invalidation publishing - /_internal/a2a/* endpoints (39 tests): untrusted 403 gate for all 16 endpoints, happy-path delegation for tasks, push, events, resolve, card Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>
Add title column to tools, resources, and prompts tables following MCP 2025-11-25 spec precedence (title → annotations.title → name). Implement _resolve_tool_title() helper with comprehensive doctests. SSO enhancements: - Add ADFS integration with tutorial documentation - Expand SSO service test coverage (775+ new test lines) - Add integration tests for ADFS flows Performance improvements: - Optimize gateway listings with eager-loaded tool counts - Add selectinload for tools relationship in admin queries Testing: - Add unit tests for demo_a2a_agent.py script - Expand gateway_service and sso_service test coverage - Add ADFS integration test suite Documentation: - Add comprehensive ADFS SSO tutorial - Update A2A agent documentation - Remove duplicate config.schema.json files Database: - Migration a7f3c9e1b2d4: add title column (idempotent) - Update schemas to include title field validation Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
- Scope agent cache keys by caller's teams to prevent cross-tenant cache poisoning in the Rust A2A runtime - Add visibility scoping to task get/list endpoints so callers only see tasks owned by agents they have access to - Check agent access before building agent card to avoid loading sensitive data for unauthorized callers - Truncate error detail in Rust runtime client logging to avoid leaking auth blobs - Tighten protocol matching in A2AServerService to exact match instead of substring contains - Validate task_id/agent_id/state parameter types in task proxy endpoints - Fix stale migration docstring (Revises: field) - Fix clippy warnings (single_match, too_many_arguments) - Run cargo fmt, isort, and black on PR-touched files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>
…ints Task cancel, push config CRUD (create/get/list/delete), and events replay endpoints were missing scope-aware authorization checks. A caller could operate on tasks and configs belonging to agents outside their team visibility. - cancel_task now checks agent visibility before allowing cancellation - Push config create/get/list/delete endpoints verify the caller can access the owning agent via _check_agent_access_by_id - Events replay checks the task's owning agent visibility before returning event data - Events flush is left unscoped (trusted sidecar write path, not a read path exposing data to callers) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>
…bility query - Add single-flight deduplication to resolve_agent: concurrent cache misses for the same scoped agent key now share one Python resolve call instead of each hitting the backend independently - Wrap ResolvedRequest in Arc in the job queue to avoid deep-cloning headers and JSON body on every request submission - Rewrite _visible_agent_ids to push visibility filtering into SQL instead of loading all enabled agents into Python and filtering in a loop Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Jonathan Springer <jps@s390x.com> Signed-off-by: lucarlig <luca.carlig@ibm.com>
…dation
Security fixes:
- Add authentication to the /invoke endpoint (HMAC trust header gate)
- Add authentication to the A2A proxy catch-all before forwarding
- Require non-empty A2A_RUST_AUTH_SECRET at startup to prevent
predictable trust headers
- Use server-derived owner_email/team_id in agent registration instead
of trusting client-provided values
- Add visibility scoping to the flush_events internal endpoint
- Close visibility bypass on events/replay when task row is absent
- Close visibility bypass on push config get/list when agent_id omitted
- Validate webhook URL through Pydantic schema (SSRF protection) in
internal push config creation endpoint
- Add webhook URL validation via validate_core_url on
A2APushNotificationConfigCreate schema
- Exclude auth_token from A2APushNotificationConfigRead (Field exclude)
- Change _check_agent_access_by_id to fail-closed for deleted agents
- Make encode_auth_context a hard error instead of silent empty fallback
- Sanitize all error responses to external callers — strip internal
backend URLs, Python response bodies, and reqwest error details
- Redact Redis URL credentials in logs
- Use constant-time comparison for session fingerprint validation
- Warn at startup if HTTP listener is non-loopback
Correctness fixes:
- Pass actual JSON-RPC method name through full_authenticate
- Add explicit authz match arms for cancel, delete, create, and stream
- Remove x-forwarded-for from default session fingerprint headers
- Fix cargo fmt conflicts with pragma allowlist comments
- Cap retry backoff at 60 seconds to prevent runaway delays
Error handling improvements:
- Replace expect() in queue worker with graceful error handling for
semaphore closure, panicked JoinSet tasks, and missing results
- Add logger.exception + db.invalidate fallback to all 11 internal
A2A endpoint exception handlers
- Upgrade query param decryption failure log from debug to warning
- Log warning on auth decoding failure during agent update
- Return None from event_store.store_event on serialization failure
- Replace .ok() with logged match on proxy response JSON parsing
- Replace mutex expect("poisoned") with unwrap_or_else recovery
- Upgrade shadow mode payload mismatch log to warning with traceback
- Upgrade Redis cache invalidation failure log to warning
- Bound resolve_inflight DashMap to prevent unbounded memory growth
Comment and code cleanup:
- Fix handle_a2a_proxy doc (does not inject trust headers)
- Fix coalesce_jobs doc (does not match on timeout)
- Fix proxy_task_method doc (also handles cancel, not just reads)
- Clarify trust header as keyed SHA-256, not formal HMAC
- Renumber handle_a2a_invoke steps sequentially (1-2-3-3a-4-5)
- Document _visible_agent_ids admin-bypass divergence
- Remove leftover logger.info debug statements from register_agent
- Remove commented-out dead code from register_agent
- Remove duplicate TOOLS_MANAGE_PLUGINS constant
- Remove constant-comparison test (test_version_endpoint_redis_conditions)
Test coverage:
- Add visibility tests for cancel_task (wrong team, admin, public-only)
- Add deny-path tests for events/replay (inaccessible agent, missing task)
- Add 32 tests for _check_agent_access_by_id, _visible_agent_ids,
get_task, list_tasks, and _check_server_access
- Add tests for auth_secret startup rejection (unit + binary)
- Update integration tests for trust-gated /invoke and authenticated
proxy endpoints
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
… type design Security-critical: - Resolve A2A task lookup ambiguity: Rust runtime pre-resolves agent from URL path and injects agent_id into task-method proxy params; Python service refuses to guess when (task_id, no agent_id) matches multiple rows. Prevents cross-agent task read/cancel via shared task_id. - Encrypt push webhook auth_token at rest via services_auth.encode_auth; new list_push_configs_for_dispatch returns decrypted tokens for the trusted Rust sidecar only. Fixes secret-at-rest leak. - Upsert semantics on push-config re-registration: mutable fields (auth_token, events, enabled) update in place instead of silently keeping stale config. Key rotations now take effect. - Fail-closed on query-param auth decrypt failure (matches header path). - Refuse empty/missing A2A_RUST_AUTH_SECRET at startup; add cross-field RuntimeConfig validation (retry budget, TTL ordering, session TTL). - Feature-flag gate on internal A2A endpoints: reject trusted requests when MCPGATEWAY_A2A_ENABLED=false (defense-in-depth). - events/flush rejects events for unknown task_ids (400) to prevent visibility bypass. Observability: - logger.exception on _authorize_internal_a2a_method errors. - logger.warning on agent-visibility denial across 7 internal endpoints. - warn! on JSON parse failures in 4 proxy sites (server.rs) and malformed Last-Event-ID headers. - warn! on session fingerprint mismatch (security-relevant signal). - MetricsCollector.webhook_retry_exhausted counter + record method. - Log Rust-cache invalidation scheduling failures instead of silent pass. - Rollback-after-failure in task persistence now logs and calls invalidate. Type design / schema: - New A2ATaskState enum with terminal-state validator on A2ATaskUpdate. - Unknown task-state protocol values now warn (passed through for forward-compat). - Alembic: tenant=String(255), icon_url=String(767) match the ORM (prevents Postgres VARCHAR-length drift between fresh and migrated DBs). - Rename two A2A migration files to hash-first convention matching the rest of the project. - trust.rs doc/comment clarifies the header is SHA256, not HMAC. - BTreeMap for query-param merge ensures deterministic ordering. - encode_auth_context uses .expect() — empty header would look anonymous. Test coverage: - Deny-path regressions for internal A2A endpoints: RBAC denied, wrong-team, feature-disabled (20 new cases). - Rust↔Python bridge: ConnectError/ConnectTimeout/ReadTimeout/PoolTimeout now wrap as RustA2ARuntimeError with correct is_timeout flag. - Push config: encryption round-trip, token rotation, upsert semantics, undecryptable-legacy heal path, dispatch-listing SQL visibility scope. - Rust concurrency: session concurrent lookup/invalidate leaves storage consistent; concurrent create produces unique IDs; queue overflow under concurrent producers rejects with Full. - RuntimeConfig cross-field validation (5 scenarios). - Rust push: webhook retry-exhaustion metric increments. - Binary startup refuses without auth_secret. - select_downstream_agent: ORDER BY name + enabled filter assertions. Deferred to follow-up issues: - Typestate wrappers for trust-boundary types (#4233) - SSE client-disconnect integration test (#4234) Signed-off-by: Jonathan Springer <jps@s390x.com>
… type design Security-critical: - Resolve A2A task lookup ambiguity: Rust runtime pre-resolves agent from URL path and injects agent_id into task-method proxy params; Python service refuses to guess when (task_id, no agent_id) matches multiple rows. Prevents cross-agent task read/cancel via shared task_id. - Encrypt push webhook auth_token at rest via services_auth.encode_auth; new list_push_configs_for_dispatch returns decrypted tokens for the trusted Rust sidecar only. Fixes secret-at-rest leak. - Upsert semantics on push-config re-registration: mutable fields (auth_token, events, enabled) update in place instead of silently keeping stale config. Key rotations now take effect. - Fail-closed on query-param auth decrypt failure (matches header path). - Refuse empty/missing A2A_RUST_AUTH_SECRET at startup; add cross-field RuntimeConfig validation (retry budget, TTL ordering, session TTL). - Feature-flag gate on internal A2A endpoints: reject trusted requests when MCPGATEWAY_A2A_ENABLED=false (defense-in-depth). - events/flush rejects events for unknown task_ids (400) to prevent visibility bypass. Observability: - logger.exception on _authorize_internal_a2a_method errors. - logger.warning on agent-visibility denial across 7 internal endpoints. - warn! on JSON parse failures in 4 proxy sites (server.rs) and malformed Last-Event-ID headers. - warn! on session fingerprint mismatch (security-relevant signal). - MetricsCollector.webhook_retry_exhausted counter + record method. - Log Rust-cache invalidation scheduling failures instead of silent pass. - Rollback-after-failure in task persistence now logs and calls invalidate. Type design / schema: - New A2ATaskState enum with terminal-state validator on A2ATaskUpdate. - Unknown task-state protocol values now warn (passed through for forward-compat). - Alembic: tenant=String(255), icon_url=String(767) match the ORM (prevents Postgres VARCHAR-length drift between fresh and migrated DBs). - Rename two A2A migration files to hash-first convention matching the rest of the project. - trust.rs doc/comment clarifies the header is SHA256, not HMAC. - BTreeMap for query-param merge ensures deterministic ordering. - encode_auth_context uses .expect() — empty header would look anonymous. Test coverage: - Deny-path regressions for internal A2A endpoints: RBAC denied, wrong-team, feature-disabled (20 new cases). - Rust↔Python bridge: ConnectError/ConnectTimeout/ReadTimeout/PoolTimeout now wrap as RustA2ARuntimeError with correct is_timeout flag. - Push config: encryption round-trip, token rotation, upsert semantics, undecryptable-legacy heal path, dispatch-listing SQL visibility scope. - Rust concurrency: session concurrent lookup/invalidate leaves storage consistent; concurrent create produces unique IDs; queue overflow under concurrent producers rejects with Full. - RuntimeConfig cross-field validation (5 scenarios). - Rust push: webhook retry-exhaustion metric increments. - Binary startup refuses without auth_secret. - select_downstream_agent: ORDER BY name + enabled filter assertions. Deferred to follow-up issues: - Typestate wrappers for trust-boundary types (#4233) - SSE client-disconnect integration test (#4234) Signed-off-by: Jonathan Springer <jps@s390x.com>
Summary
Details
tools_rust/a2a_runtimewith health and invoke endpoints, plus container and entrypoint wiring forRUST_A2A_MODE=off|shadow|edge|fullRefs #1624