Skip to content

feat: add Rust A2A 1.0 runtime support#3704

Open
crivetimihai wants to merge 19 commits intomainfrom
a2a-rust
Open

feat: add Rust A2A 1.0 runtime support#3704
crivetimihai wants to merge 19 commits intomainfrom
a2a-rust

Conversation

@crivetimihai
Copy link
Copy Markdown
Member

@crivetimihai crivetimihai commented Mar 17, 2026

Summary

  • add an experimental Rust A2A sidecar and gateway delegation path for registered A2A agents
  • normalize A2A v1 and legacy JSON-RPC payloads through a shared outbound protocol adapter
  • move the example and compose testing path to A2A 1.0.0 by default while retaining legacy interop

Details

  • add tools_rust/a2a_runtime with health and invoke endpoints, plus container and entrypoint wiring for RUST_A2A_MODE=off|shadow|edge|full
  • route A2A service and A2A tool invocations through shared request preparation and optional Rust-side delegation
  • rewrite the Go echo agent to serve a v1 card and task surface without the old Go SDK dependency, while still accepting legacy aliases
  • replace the integration fixture with direct v1 coverage and update load and compose flows to exercise the v1 default path through the managed Rust sidecar

Refs #1624

@crivetimihai crivetimihai added experimental Experimental features, test proposed MCP Specification changes rust Rust programming labels Mar 17, 2026
@crivetimihai crivetimihai self-assigned this Mar 17, 2026
@crivetimihai crivetimihai added the COULD P3: Nice-to-have features with minimal impact if left out; included if time permits label Mar 20, 2026
@crivetimihai crivetimihai added this to the Release 1.1.0 milestone Mar 20, 2026
@lucarlig
Copy link
Copy Markdown
Collaborator

lucarlig commented Mar 31, 2026

was working on same PR last month check if anything is useful and worth stealing from it #3250 so we can close it

@jonpspri jonpspri added the a2a Support for A2A protocol label Apr 1, 2026
@dima-zakharov
Copy link
Copy Markdown
Collaborator

The implementation of the protocol adapter and the managed lifecycle looks very solid.

That said, the idea of Python serving requests and processing them with Rust feels less efficient than the opposite approach. We should look at the Nginx Unit: a native listener handles the socket and routes traffic to Python as a service.

In that configuration, you can easily specify how many Python workers to start for the endpoints. The Nginx Unit architecture usually provides the fastest RPS—outperforming multi-worker setups like Uvicorn, Gunicorn, or Granian.
The current sidecar MCP runtime design follows similar approach like Nginx UNit, but we can't really run multiple Python workers effectively on UDS unless we use HTTP with something like Unit or other asgi servers anyway.

@lucarlig
Copy link
Copy Markdown
Collaborator

lucarlig commented Apr 8, 2026

This is a strong piece of work overall. The Rust A2A runtime direction, protocol normalization work, and sidecar boundary are all promising. I do think there are a few high-priority issues worth addressing before merge:

  1. The Rust A2A entrypoint appears to authorize every JSON-RPC method as invoke before branching into task reads, card lookups, and push-config operations. That means the new per-method A2A authz endpoints do not seem to be exercised, so RBAC behavior for read-oriented methods looks weaker than intended.

  2. The new trusted agent-resolve path appears to bypass agent visibility scoping. The existing Python invoke flow checks access before returning an agent, but the internal Rust resolve path looks like it can return any enabled agent by name without consulting forwarded auth context or token teams. That seems likely to expose private or team-scoped agents through the sidecar path.

  3. The task proxy path looks like it forwards the wrong parameter shape. The Rust side passes raw JSON-RPC params through to the internal Python task endpoints, but those endpoints expect gateway-style task identifiers, while the protocol normalization and integration coverage in this PR use spec-shaped task params. As written, valid task requests may fail on the Rust path.

  4. The new queue implementation does not appear to deliver the concurrency it advertises. The batching path acquires a semaphore permit but still awaits each invoke inline before moving to the next request group, so work stays effectively serial per worker thread. That looks like a throughput regression in the new hot path.

  5. The default queue mode appears to be unbounded. With no explicit queue limit configured, submissions can continue accumulating even after the bridge channel is saturated, which makes overload scenarios look vulnerable to unbounded memory growth.

crivetimihai and others added 18 commits April 15, 2026 18:20
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Signed-off-by: lucarlig <luca.carlig@ibm.com>
- Remove redundant local `from mcpgateway.config import settings`
  re-imports in a2a_service.py (W0404/W0621)
- Add pylint disable for unused `interaction_type` parameter in
  build_a2a_jsonrpc_request (W0613) - part of public API
- Add missing docstrings for interrogate coverage (8 functions)
- Update test patches to target module-level settings binding at
  mcpgateway.services.a2a_service.settings
- Redirect decode_auth/apply_query_param_auth test patches to
  mcpgateway.services.a2a_protocol where prepare_a2a_invocation
  now calls them
- Update admin test assertion for simplified test_params format

Signed-off-by: Jon Spriggs <github@sprig.gs>
Signed-off-by: Jonathan Springer <jps@s390x.com>

Signed-off-by: lucarlig <luca.carlig@ibm.com>
- Restore query-param credential redaction in error messages by exposing
  sensitive_query_param_names on PreparedA2AInvocation
- Route Rust sidecar 504 (upstream timeout) through the timeout
  accounting path (counter, circuit-breaker, ToolTimeoutError)
- Disable redirect following in the Rust reqwest client to match the
  Python path and prevent credential forwarding on redirects
- Handle v/V-prefixed protocol version strings (e.g. "v1", "V1.0")
  so they correctly resolve to A2A v1 semantics
- Add full docstrings to a2a_protocol.py and rust_a2a_runtime.py
  (resolves flake8 DAR101/DAR201/DAR401 warnings)
- Add version.py A2A runtime diagnostic docstrings
- Fix 6 test mock targets for module-level settings import
- Add 100% coverage tests for a2a_protocol.py, rust_a2a_runtime.py,
  and version.py A2A diagnostics
- Update .secrets.baseline for test fixture false positives

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Signed-off-by: lucarlig <luca.carlig@ibm.com>
…, and caching

Production-grade Rust A2A sidecar with full v1.0 method dispatch, SSE
streaming, Redis-backed sessions, three-layer caching, and virtual
server federation. Implements all 20 items from the A2A implementation
plan across 5 phases.

Core engine (17 Rust modules, 122 tests):
- Circuit breaker with per-endpoint per-tenant isolation
- Lock-free per-agent metrics with P95 adaptive timeouts
- Bounded job queue with request coalescing
- AES-GCM auth decryption matching Python services_auth contract
- Trust validation via loopback + HMAC shared secret

A2A v1.0 method dispatch:
- SendMessage, SendStreamingMessage (SSE with event store + reconnect)
- GetTask, ListTasks, CancelTask (proxied to Python RPC)
- GetExtendedAgentCard (cached agent cards)
- Push notification CRUD (4 methods) + Rust webhook dispatch
- All methods accept both v1 (PascalCase) and legacy (slash) names

Streaming and sessions:
- SSE passthrough from agent through Rust to client
- Redis ring buffer event store (Lua atomic scripts) + async PG flush
- Client reconnect via Last-Event-ID with Redis/PG replay
- Redis-backed session management with auth fingerprinting and TTL
- Session fast-path skips Python authenticate on reuse

Three-layer caching:
- L1 in-process DashMap, L2 Redis, L3 PG via Python RPC
- Redis pub/sub invalidation for cross-worker L1 eviction
- Graceful degradation when Redis unavailable

Infrastructure:
- Nginx a2a_upstream with Rust-primary/Python-backup failover
- 15 Python /_internal/a2a/* RPC endpoints
- Virtual server federation (servers exposed as A2A agents)
- Shadow mode dual-dispatch with response comparison logging
- Progressive rollout via RUST_A2A_MODE=off|shadow|edge|full

Domain models:
- A2ATask, ServerTaskMapping, ServerInterface, A2AAgentAuth,
  A2APushNotificationConfig, A2ATaskEvent (6 models, 3 migrations)
- ADR-048 (A2A v1 protocol migration), ADR-049 (multi-protocol servers)

Signed-off-by: Jonathan Springer <jps@s390x.com>

Signed-off-by: lucarlig <luca.carlig@ibm.com>
Add unit tests for previously untested modules and integration tests
for error handling edge cases.

Unit tests:
- push.rs: webhook config filtering (enabled, events match, case-insensitive)
- config.rs: listen_target parsing, UDS preference, invalid address, defaults

Integration tests:
- Trust chain: authentication failure (401→403), empty method fallthrough
- Streaming: agent error (500), agent timeout (502/504)
- Malformed requests: invalid JSON body, empty endpoint URL
- Method routing: unknown method falls through to agent invoke

Signed-off-by: Jonathan Springer <jps@s390x.com>

Signed-off-by: lucarlig <luca.carlig@ibm.com>
… endpoints

Add 75 new Python tests covering previously untested A2A code:

- A2AServerService (12 tests): get_server_agent_card, resolve_server_agent,
  select_downstream_agent, create/resolve_task_mapping
- a2a_service new methods (24 tests): cancel_task, push config CRUD,
  flush/replay events, shadow mode comparison, invalidation publishing
- /_internal/a2a/* endpoints (39 tests): untrusted 403 gate for all 16
  endpoints, happy-path delegation for tasks, push, events, resolve, card

Signed-off-by: Jonathan Springer <jps@s390x.com>

Signed-off-by: lucarlig <luca.carlig@ibm.com>
Add title column to tools, resources, and prompts tables following MCP
2025-11-25 spec precedence (title → annotations.title → name). Implement
_resolve_tool_title() helper with comprehensive doctests.

SSO enhancements:
- Add ADFS integration with tutorial documentation
- Expand SSO service test coverage (775+ new test lines)
- Add integration tests for ADFS flows

Performance improvements:
- Optimize gateway listings with eager-loaded tool counts
- Add selectinload for tools relationship in admin queries

Testing:
- Add unit tests for demo_a2a_agent.py script
- Expand gateway_service and sso_service test coverage
- Add ADFS integration test suite

Documentation:
- Add comprehensive ADFS SSO tutorial
- Update A2A agent documentation
- Remove duplicate config.schema.json files

Database:
- Migration a7f3c9e1b2d4: add title column (idempotent)
- Update schemas to include title field validation

Signed-off-by: Jonathan Springer <jps@s390x.com>

Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
- Scope agent cache keys by caller's teams to prevent cross-tenant
  cache poisoning in the Rust A2A runtime
- Add visibility scoping to task get/list endpoints so callers only
  see tasks owned by agents they have access to
- Check agent access before building agent card to avoid loading
  sensitive data for unauthorized callers
- Truncate error detail in Rust runtime client logging to avoid
  leaking auth blobs
- Tighten protocol matching in A2AServerService to exact match
  instead of substring contains
- Validate task_id/agent_id/state parameter types in task proxy
  endpoints
- Fix stale migration docstring (Revises: field)
- Fix clippy warnings (single_match, too_many_arguments)
- Run cargo fmt, isort, and black on PR-touched files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>

Signed-off-by: lucarlig <luca.carlig@ibm.com>
…ints

Task cancel, push config CRUD (create/get/list/delete), and events
replay endpoints were missing scope-aware authorization checks. A
caller could operate on tasks and configs belonging to agents outside
their team visibility.

- cancel_task now checks agent visibility before allowing cancellation
- Push config create/get/list/delete endpoints verify the caller can
  access the owning agent via _check_agent_access_by_id
- Events replay checks the task's owning agent visibility before
  returning event data
- Events flush is left unscoped (trusted sidecar write path, not a
  read path exposing data to callers)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>

Signed-off-by: lucarlig <luca.carlig@ibm.com>
…bility query

- Add single-flight deduplication to resolve_agent: concurrent cache
  misses for the same scoped agent key now share one Python resolve
  call instead of each hitting the backend independently
- Wrap ResolvedRequest in Arc in the job queue to avoid deep-cloning
  headers and JSON body on every request submission
- Rewrite _visible_agent_ids to push visibility filtering into SQL
  instead of loading all enabled agents into Python and filtering
  in a loop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>

Signed-off-by: lucarlig <luca.carlig@ibm.com>
…dation

Security fixes:
- Add authentication to the /invoke endpoint (HMAC trust header gate)
- Add authentication to the A2A proxy catch-all before forwarding
- Require non-empty A2A_RUST_AUTH_SECRET at startup to prevent
  predictable trust headers
- Use server-derived owner_email/team_id in agent registration instead
  of trusting client-provided values
- Add visibility scoping to the flush_events internal endpoint
- Close visibility bypass on events/replay when task row is absent
- Close visibility bypass on push config get/list when agent_id omitted
- Validate webhook URL through Pydantic schema (SSRF protection) in
  internal push config creation endpoint
- Add webhook URL validation via validate_core_url on
  A2APushNotificationConfigCreate schema
- Exclude auth_token from A2APushNotificationConfigRead (Field exclude)
- Change _check_agent_access_by_id to fail-closed for deleted agents
- Make encode_auth_context a hard error instead of silent empty fallback
- Sanitize all error responses to external callers — strip internal
  backend URLs, Python response bodies, and reqwest error details
- Redact Redis URL credentials in logs
- Use constant-time comparison for session fingerprint validation
- Warn at startup if HTTP listener is non-loopback

Correctness fixes:
- Pass actual JSON-RPC method name through full_authenticate
- Add explicit authz match arms for cancel, delete, create, and stream
- Remove x-forwarded-for from default session fingerprint headers
- Fix cargo fmt conflicts with pragma allowlist comments
- Cap retry backoff at 60 seconds to prevent runaway delays

Error handling improvements:
- Replace expect() in queue worker with graceful error handling for
  semaphore closure, panicked JoinSet tasks, and missing results
- Add logger.exception + db.invalidate fallback to all 11 internal
  A2A endpoint exception handlers
- Upgrade query param decryption failure log from debug to warning
- Log warning on auth decoding failure during agent update
- Return None from event_store.store_event on serialization failure
- Replace .ok() with logged match on proxy response JSON parsing
- Replace mutex expect("poisoned") with unwrap_or_else recovery
- Upgrade shadow mode payload mismatch log to warning with traceback
- Upgrade Redis cache invalidation failure log to warning
- Bound resolve_inflight DashMap to prevent unbounded memory growth

Comment and code cleanup:
- Fix handle_a2a_proxy doc (does not inject trust headers)
- Fix coalesce_jobs doc (does not match on timeout)
- Fix proxy_task_method doc (also handles cancel, not just reads)
- Clarify trust header as keyed SHA-256, not formal HMAC
- Renumber handle_a2a_invoke steps sequentially (1-2-3-3a-4-5)
- Document _visible_agent_ids admin-bypass divergence
- Remove leftover logger.info debug statements from register_agent
- Remove commented-out dead code from register_agent
- Remove duplicate TOOLS_MANAGE_PLUGINS constant
- Remove constant-comparison test (test_version_endpoint_redis_conditions)

Test coverage:
- Add visibility tests for cancel_task (wrong team, admin, public-only)
- Add deny-path tests for events/replay (inaccessible agent, missing task)
- Add 32 tests for _check_agent_access_by_id, _visible_agent_ids,
  get_task, list_tasks, and _check_server_access
- Add tests for auth_secret startup rejection (unit + binary)
- Update integration tests for trust-gated /invoke and authenticated
  proxy endpoints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Springer <jps@s390x.com>

Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
Signed-off-by: lucarlig <luca.carlig@ibm.com>
jonpspri added a commit that referenced this pull request Apr 15, 2026
… type design

Security-critical:
- Resolve A2A task lookup ambiguity: Rust runtime pre-resolves agent from
  URL path and injects agent_id into task-method proxy params; Python
  service refuses to guess when (task_id, no agent_id) matches multiple
  rows. Prevents cross-agent task read/cancel via shared task_id.
- Encrypt push webhook auth_token at rest via services_auth.encode_auth;
  new list_push_configs_for_dispatch returns decrypted tokens for the
  trusted Rust sidecar only. Fixes secret-at-rest leak.
- Upsert semantics on push-config re-registration: mutable fields
  (auth_token, events, enabled) update in place instead of silently
  keeping stale config. Key rotations now take effect.
- Fail-closed on query-param auth decrypt failure (matches header path).
- Refuse empty/missing A2A_RUST_AUTH_SECRET at startup; add cross-field
  RuntimeConfig validation (retry budget, TTL ordering, session TTL).
- Feature-flag gate on internal A2A endpoints: reject trusted requests
  when MCPGATEWAY_A2A_ENABLED=false (defense-in-depth).
- events/flush rejects events for unknown task_ids (400) to prevent
  visibility bypass.

Observability:
- logger.exception on _authorize_internal_a2a_method errors.
- logger.warning on agent-visibility denial across 7 internal endpoints.
- warn! on JSON parse failures in 4 proxy sites (server.rs) and
  malformed Last-Event-ID headers.
- warn! on session fingerprint mismatch (security-relevant signal).
- MetricsCollector.webhook_retry_exhausted counter + record method.
- Log Rust-cache invalidation scheduling failures instead of silent pass.
- Rollback-after-failure in task persistence now logs and calls invalidate.

Type design / schema:
- New A2ATaskState enum with terminal-state validator on A2ATaskUpdate.
- Unknown task-state protocol values now warn (passed through for
  forward-compat).
- Alembic: tenant=String(255), icon_url=String(767) match the ORM
  (prevents Postgres VARCHAR-length drift between fresh and migrated DBs).
- Rename two A2A migration files to hash-first convention matching the
  rest of the project.
- trust.rs doc/comment clarifies the header is SHA256, not HMAC.
- BTreeMap for query-param merge ensures deterministic ordering.
- encode_auth_context uses .expect() — empty header would look anonymous.

Test coverage:
- Deny-path regressions for internal A2A endpoints: RBAC denied,
  wrong-team, feature-disabled (20 new cases).
- Rust↔Python bridge: ConnectError/ConnectTimeout/ReadTimeout/PoolTimeout
  now wrap as RustA2ARuntimeError with correct is_timeout flag.
- Push config: encryption round-trip, token rotation, upsert semantics,
  undecryptable-legacy heal path, dispatch-listing SQL visibility scope.
- Rust concurrency: session concurrent lookup/invalidate leaves storage
  consistent; concurrent create produces unique IDs; queue overflow
  under concurrent producers rejects with Full.
- RuntimeConfig cross-field validation (5 scenarios).
- Rust push: webhook retry-exhaustion metric increments.
- Binary startup refuses without auth_secret.
- select_downstream_agent: ORDER BY name + enabled filter assertions.

Deferred to follow-up issues:
- Typestate wrappers for trust-boundary types (#4233)
- SSE client-disconnect integration test (#4234)

Signed-off-by: Jonathan Springer <jps@s390x.com>
… type design

Security-critical:
- Resolve A2A task lookup ambiguity: Rust runtime pre-resolves agent from
  URL path and injects agent_id into task-method proxy params; Python
  service refuses to guess when (task_id, no agent_id) matches multiple
  rows. Prevents cross-agent task read/cancel via shared task_id.
- Encrypt push webhook auth_token at rest via services_auth.encode_auth;
  new list_push_configs_for_dispatch returns decrypted tokens for the
  trusted Rust sidecar only. Fixes secret-at-rest leak.
- Upsert semantics on push-config re-registration: mutable fields
  (auth_token, events, enabled) update in place instead of silently
  keeping stale config. Key rotations now take effect.
- Fail-closed on query-param auth decrypt failure (matches header path).
- Refuse empty/missing A2A_RUST_AUTH_SECRET at startup; add cross-field
  RuntimeConfig validation (retry budget, TTL ordering, session TTL).
- Feature-flag gate on internal A2A endpoints: reject trusted requests
  when MCPGATEWAY_A2A_ENABLED=false (defense-in-depth).
- events/flush rejects events for unknown task_ids (400) to prevent
  visibility bypass.

Observability:
- logger.exception on _authorize_internal_a2a_method errors.
- logger.warning on agent-visibility denial across 7 internal endpoints.
- warn! on JSON parse failures in 4 proxy sites (server.rs) and
  malformed Last-Event-ID headers.
- warn! on session fingerprint mismatch (security-relevant signal).
- MetricsCollector.webhook_retry_exhausted counter + record method.
- Log Rust-cache invalidation scheduling failures instead of silent pass.
- Rollback-after-failure in task persistence now logs and calls invalidate.

Type design / schema:
- New A2ATaskState enum with terminal-state validator on A2ATaskUpdate.
- Unknown task-state protocol values now warn (passed through for
  forward-compat).
- Alembic: tenant=String(255), icon_url=String(767) match the ORM
  (prevents Postgres VARCHAR-length drift between fresh and migrated DBs).
- Rename two A2A migration files to hash-first convention matching the
  rest of the project.
- trust.rs doc/comment clarifies the header is SHA256, not HMAC.
- BTreeMap for query-param merge ensures deterministic ordering.
- encode_auth_context uses .expect() — empty header would look anonymous.

Test coverage:
- Deny-path regressions for internal A2A endpoints: RBAC denied,
  wrong-team, feature-disabled (20 new cases).
- Rust↔Python bridge: ConnectError/ConnectTimeout/ReadTimeout/PoolTimeout
  now wrap as RustA2ARuntimeError with correct is_timeout flag.
- Push config: encryption round-trip, token rotation, upsert semantics,
  undecryptable-legacy heal path, dispatch-listing SQL visibility scope.
- Rust concurrency: session concurrent lookup/invalidate leaves storage
  consistent; concurrent create produces unique IDs; queue overflow
  under concurrent producers rejects with Full.
- RuntimeConfig cross-field validation (5 scenarios).
- Rust push: webhook retry-exhaustion metric increments.
- Binary startup refuses without auth_secret.
- select_downstream_agent: ORDER BY name + enabled filter assertions.

Deferred to follow-up issues:
- Typestate wrappers for trust-boundary types (#4233)
- SSE client-disconnect integration test (#4234)

Signed-off-by: Jonathan Springer <jps@s390x.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

a2a Support for A2A protocol COULD P3: Nice-to-have features with minimal impact if left out; included if time permits experimental Experimental features, test proposed MCP Specification changes rust Rust programming

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants