feat: Durable A2A (Agent2Agent) protocol — client, server, observability#1195

Open
v1r3n wants to merge 7 commits into main from feat/durable-a2a-protocol

v1r3n commented Jun 20, 2026

Collaborator

Summary

Adds Agent2Agent (A2A) protocol support to Conductor in both directions, with durability as the differentiator: a remote agent call survives a server crash/restart and resumes from persisted execution state.

Two directions

Client — call a remote agent from a workflow (conductor.integrations.ai.enabled=true)

  • AGENT (poll / streaming-SSE / push), GET_AGENT_CARD, CANCEL_AGENT.
  • Durable: deterministic messageId (stable across retries/restarts), in-flight state lives in the persisted task output (no held thread in poll mode), absolute-deadline + consecutive-poll-failure liveness guards, push backstop poll.
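
A messageId that is stable across retries and restarts can be derived from identifiers that are already persisted. The following is a minimal sketch, not the derivation used in this PR (which is not shown here): it assumes the workflow id and task reference name are the inputs, and the class name `DeterministicMessageIdSketch` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical helper for illustration; not a class from this PR. */
final class DeterministicMessageIdSketch {
    private DeterministicMessageIdSketch() {}

    /**
     * Derives a name-based (type 3) UUID from persisted identifiers, so the
     * same workflow/task pair always maps to the same messageId.
     * Assumption: workflowId + taskRefName uniquely identify one agent call.
     */
    static String of(String workflowId, String taskRefName) {
        String seed = workflowId + ":" + taskRefName;
        return UUID.nameUUIDFromBytes(seed.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

Because the id is a pure function of persisted state, a retry after a crash sends the same messageId, which lets the remote side deduplicate.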

Server — expose any workflow as an A2A agent (conductor.a2a.server.enabled=true)

  • One agent per workflow at {basePath}/{workflow}: Agent Card + JSON-RPC message/send / tasks/get / tasks/cancel.
  • Idempotent start (messageId → idempotencyKey, RETURN_EXISTING) → server-side effectively-once.
  • Multi-turn resume: a follow-up message/send carrying the task id completes the paused HUMAN/WAIT task and resumes the same execution (no duplicate workflow).
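
The RETURN_EXISTING behaviour can be pictured as a keyed lookup: the first message/send with a given messageId starts an execution, and any repeat gets the original execution back. The sketch below is an in-memory stand-in under that assumption; the class and method names are hypothetical, and the real server presumably delegates to Conductor's persisted idempotency handling rather than a map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical in-memory illustration of RETURN_EXISTING semantics. */
final class IdempotentStartSketch {
    // Assumption: the A2A messageId is used directly as the idempotency key.
    private final Map<String, String> executionByKey = new ConcurrentHashMap<>();

    /**
     * Runs startWorkflow only for the first request with a given key;
     * repeats of the same key return the original execution id.
     */
    String startOrReturnExisting(String idempotencyKey, Supplier<String> startWorkflow) {
        return executionByKey.computeIfAbsent(idempotencyKey, key -> startWorkflow.get());
    }
}
```

A client that retries message/send after a timeout therefore observes one execution, not two.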

Security

SSRF guard on outbound agentUrl (loopback / RFC-1918 / link-local / IPv6 ULA / cloud-metadata — metadata always blocked; redirects disabled); conductor.a2a.client.allow-private-network opt-in for trusted/dev. Optional server api-key (constant-time). Push callbacks use single-use bearer tokens with embedded 24h expiry.
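
The address classes listed above map fairly directly onto `java.net.InetAddress` predicates. The following sketch shows one way such a check could look; it is not the guard implemented in this PR, the class and method names are hypothetical, and it only covers the IPv4 metadata address 169.254.169.254.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical sketch of an outbound-address check; not the PR's code. */
final class SsrfGuardSketch {
    private SsrfGuardSketch() {}

    /** Metadata endpoint used by several cloud providers (IPv4 link-local). */
    static boolean isCloudMetadata(InetAddress addr) {
        return "169.254.169.254".equals(addr.getHostAddress());
    }

    /** IPv6 unique local addresses (ULA) live in fc00::/7. */
    static boolean isIpv6UniqueLocal(InetAddress addr) {
        byte[] raw = addr.getAddress();
        return raw.length == 16 && (raw[0] & 0xfe) == 0xfc;
    }

    static boolean isPrivate(InetAddress addr) {
        return addr.isLoopbackAddress()
                || addr.isAnyLocalAddress()
                || addr.isSiteLocalAddress()   // IPv4: 10/8, 172.16/12, 192.168/16
                || addr.isLinkLocalAddress()   // 169.254/16, fe80::/10
                || isIpv6UniqueLocal(addr);
    }

    /** Metadata is always blocked; other private ranges only without opt-in. */
    static boolean isBlocked(InetAddress addr, boolean allowPrivateNetwork) {
        if (isCloudMetadata(addr)) {
            return true;
        }
        return !allowPrivateNetwork && isPrivate(addr);
    }

    /** Convenience overload for IP literals; fails closed on parse errors. */
    static boolean isBlocked(String ipLiteral, boolean allowPrivateNetwork) {
        try {
            return isBlocked(InetAddress.getByName(ipLiteral), allowPrivateNetwork);
        } catch (UnknownHostException e) {
            return true;
        }
    }
}
```

For hostnames, a guard of this kind needs to check every resolved address and connect to the address it checked, otherwise a second DNS lookup can return a different target.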

Observability

Micrometer counters via the shared Monitors registry (a2a_client_calls, a2a_client_poll_failures, a2a_rpc_errors, a2a_ssrf_blocked, a2a_server_requests, a2a_server_resumes) + MDC correlation keys across A2A code paths.

Tests

  • Unit/wire (MockWebServer), embedded-agent e2e, push callback, server dispatch + multi-turn resume, mapper/worker, observability: 86 A2A tests in the ai module.
  • A2ASdkInteropTest — drives the client against the official a2a-sdk reference agent (subprocess): discovery, send, poll, streaming, message-mode. Self-skips when a2a-sdk is unavailable.
  • A2ADurableEngineEndToEndTest (test-harness) — AGENT through the real decider + AsyncSystemTaskExecutor + Redis, proving crash/restart resume from persistence. Runs in the existing test-harness CI job (Docker/Redis).

Docs / UI / examples

  • docs/devguide/ai/a2a-integration.md (AI Cookbook nav) with worked examples: call/expose, multi-turn resume request/response, push end-to-end.
  • ai/examples/*a2a* (call / get-card / server / streaming / push / multi-turn / cancel) + indexed in the examples README; runnable interop and durable demos under ai/src/test/resources/a2a/.
  • ui-next: registered AGENT / GET_AGENT_CARD / CANCEL_AGENT (typecheck + build green).
  • design/a2a/: protocol study, durability proposal, server design.

⚠️ Follow-up needed (separate, needs workflow token scope)

To make A2ASdkInteropTest run in CI (it self-skips otherwise), add to the build job in .github/workflows/ci.yml after the JDK setup:

      - name: Set up Python (for the A2A interop test)
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install a2a-sdk (lets A2ASdkInteropTest run against the reference agent)
        run: pip install "a2a-sdk>=0.2,<0.3" uvicorn

(Couldn't be pushed here — the PAT lacks the workflow scope.)

Out of scope (documented follow-ups)

Server-side streaming (message/stream) + push-config endpoints; Agent Card JWS signing/verification (+ signatures/extensions passthrough); getAuthenticatedExtendedCard; per-skill OAuth scopes; gRPC / HTTP+JSON transports; full v1.0 ProtoJSON.

Verify locally

./gradlew :conductor-ai:test --tests '*a2a*'
A2A_PYTHON=<venv>/bin/python ./gradlew :conductor-ai:test --tests '*A2ASdkInteropTest'
./gradlew :conductor-test-harness:test --tests '*A2ADurableEngineEndToEndTest'

Note: inbound A2A server auth (api-key) was moved to the enterprise build; the OSS server is open by default (front with a gateway/firewall). Client→remote auth headers, SSRF guard, push-callback tokens, and cross-agent isolation remain in OSS.

v1r3n added 3 commits June 19, 2026 18:24
Add Agent2Agent (A2A) protocol support to the AI module, in both directions.

Client: CALL_AGENT / GET_AGENT_CARD / CANCEL_AGENT_TASK system tasks call remote
A2A agents over JSON-RPC with poll, streaming (SSE), and push modes. Durable by
design — deterministic messageId, state in the execution (not a thread), liveness
guards, and a push backstop. SSRF-guarded (IPv6 ULA + cloud-metadata; redirects off).

Server: expose any workflow as an A2A agent (one agent per workflow), idempotent
message/send -> startWorkflow (RETURN_EXISTING), tasks/get / tasks/cancel, and
multi-turn resume (a follow-up message/send completes the paused HUMAN/WAIT task
instead of starting a duplicate).

Observability: Micrometer counters via the shared Monitors registry and MDC
correlation keys across the A2A code paths.

Gated by conductor.integrations.ai.enabled (client) and conductor.a2a.server.enabled
(server).
- Unit/wire tests (MockWebServer), embedded-agent e2e, push callback, server
  JSON-RPC dispatch + multi-turn resume, mapper/worker, and observability.
- A2ASdkInteropTest: drives the client against the official a2a-sdk reference
  agent launched as a subprocess (discovery, send, poll, streaming, message-mode);
  self-skips when no Python with a2a-sdk is available.
- A2ADurableEngineEndToEndTest (test-harness): CALL_AGENT through the real decider
  + AsyncSystemTaskExecutor + Redis, proving crash/restart resume from persistence.
- docs/devguide/ai/a2a-integration.md (AI Cookbook nav) with worked examples:
  call/expose, multi-turn resume request/response, and push end-to-end.
- ai/examples: call / get-card / server + streaming / push / multi-turn / cancel,
  indexed in the examples README; runnable interop + durable demos under test
  resources.
- ui-next: register CALL_AGENT / GET_AGENT_CARD / CANCEL_AGENT_TASK task types.
- design/a2a: protocol study, durability proposal, and server design notes.

Note: a CI step to install a2a-sdk (so A2ASdkInteropTest runs in the build job)
is left as a follow-up — it needs a token with the `workflow` scope to land.
v1r3n added 4 commits June 19, 2026 23:00
…fault)

The A2A server's optional shared-secret api-key (conductor.a2a.server.api-key)
is removed from OSS: the server is now open by default, matching OSS Conductor
REST. Inbound authentication (API keys, OAuth/OIDC, mTLS, per-skill scopes,
signed Agent Cards) belongs to the enterprise build; front OSS with a
gateway/firewall.

Unchanged and kept in OSS as safe-by-default guards: client SSRF protection,
push-callback token auth, cross-agent execution isolation, and client→remote
per-call auth headers.

Removes authorized()/apiKey + the two api-key tests; docs and design notes
updated.
…-turn + LLM-pick examples

Docs: add an error-handling & retries section (FAILED vs FAILED_WITH_TERMINAL_ERROR
mapping for HTTP/JSON-RPC/SSRF/liveness) + a troubleshooting table; flesh out the
client multi-turn section with a worked SWITCH-on-input-required snippet; add an
"orchestrating multiple agents" subsection.

Examples (validated by ExampleWorkflowValidationTest):
- 27-a2a-multi-agent: FORK_JOIN calling agents in parallel → JOIN.
- 28-a2a-llm-pick-skill: GET_AGENT_CARD → LLM_CHAT_COMPLETE → CALL_AGENT.
- 29-a2a-client-multi-turn: SWITCH on input-required, re-call with same context/taskId.
…NT + add agentType

Generalize the agent task types ahead of multi-runtime support:
- CALL_AGENT → AGENT (class CallAgentTask → AgentTask), CANCEL_AGENT_TASK →
  CANCEL_AGENT. GET_AGENT_CARD kept (discovery/"Agent Card" is A2A-specific).
- New input field `agentType` (default "a2a") on all three tasks — the extension
  point for native runtimes (langgraph, openai, …). Unknown values are rejected
  with a clear error; only "a2a" is implemented today.
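
The `agentType` handling described above (default "a2a", unknown values rejected) could be sketched as follows. This is an illustration only: the class name is hypothetical, and treating a blank value like a missing one is an assumption not stated in this commit.

```java
import java.util.Set;

/** Hypothetical sketch of agentType resolution; not the PR's code. */
final class AgentTypeSketch {
    static final String DEFAULT_AGENT_TYPE = "a2a";
    // Only "a2a" is implemented today, per the commit message.
    static final Set<String> IMPLEMENTED = Set.of("a2a");

    private AgentTypeSketch() {}

    /** Returns the effective agent type, or throws for unsupported values. */
    static String resolve(String agentType) {
        if (agentType == null || agentType.isBlank()) {
            return DEFAULT_AGENT_TYPE; // assumption: blank is treated as unset
        }
        if (!IMPLEMENTED.contains(agentType)) {
            throw new IllegalArgumentException(
                    "Unsupported agentType '" + agentType + "'; supported: " + IMPLEMENTED);
        }
        return agentType;
    }
}
```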

Updated across enum, handlers, request models, mapper, callback, tests,
test-harness, examples, docs, design notes, and the ui-next task registration.

BREAKING CHANGE: workflows using type "CALL_AGENT"/"CANCEL_AGENT_TASK" must switch
to "AGENT"/"CANCEL_AGENT".
…sing)

Fix 'A AGENT'→'An AGENT' grammar (A2AMetrics, design), reword example descriptions and a doc sentence left awkward by the mechanical rename, and update legacy example workflow names (a2a_call_agent_* → a2a_agent_*).