feat: Durable A2A (Agent2Agent) protocol — client, server, observability#1195
Open
v1r3n wants to merge 7 commits into
Add Agent2Agent (A2A) protocol support to the AI module, in both directions.

Client: CALL_AGENT / GET_AGENT_CARD / CANCEL_AGENT_TASK system tasks call remote A2A agents over JSON-RPC with poll, streaming (SSE), and push modes. Durable by design: deterministic messageId, state in the execution (not a thread), liveness guards, and a push backstop. SSRF-guarded (IPv6 ULA + cloud-metadata; redirects off).

Server: expose any workflow as an A2A agent (one agent per workflow), idempotent message/send -> startWorkflow (RETURN_EXISTING), tasks/get / tasks/cancel, and multi-turn resume (a follow-up message/send completes the paused HUMAN/WAIT task instead of starting a duplicate).

Observability: Micrometer counters via the shared Monitors registry and MDC correlation keys across the A2A code paths.

Gated by conductor.integrations.ai.enabled (client) and conductor.a2a.server.enabled (server).

- Unit/wire tests (MockWebServer), embedded-agent e2e, push callback, server JSON-RPC dispatch + multi-turn resume, mapper/worker, and observability.
- A2ASdkInteropTest: drives the client against the official a2a-sdk reference agent launched as a subprocess (discovery, send, poll, streaming, message-mode); self-skips when no Python with a2a-sdk is available.
- A2ADurableEngineEndToEndTest (test-harness): CALL_AGENT through the real decider + AsyncSystemTaskExecutor + Redis, proving crash/restart resume from persistence.

- docs/devguide/ai/a2a-integration.md (AI Cookbook nav) with worked examples: call/expose, multi-turn resume request/response, and push end-to-end.
- ai/examples: call / get-card / server + streaming / push / multi-turn / cancel, indexed in the examples README; runnable interop + durable demos under test resources.
- ui-next: register CALL_AGENT / GET_AGENT_CARD / CANCEL_AGENT_TASK task types.
- design/a2a: protocol study, durability proposal, and server design notes.

Note: a CI step to install a2a-sdk (so A2ASdkInteropTest runs in the build job) is left as a follow-up; it needs a token with the `workflow` scope to land.
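
To illustrate the "deterministic messageId" point in the commit message above, here is a minimal sketch of one way such an id could be derived. It is not the PR's implementation: the class name `DeterministicMessageId`, the method `of`, and the choice of inputs (workflow id, task reference name, turn counter) are all assumptions made for illustration. The idea is only that an id computed from persisted identifiers comes out the same after a retry or a restart, which lets the remote side deduplicate.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical helper for illustration; not a class from this PR. */
final class DeterministicMessageId {

    private DeterministicMessageId() {}

    /**
     * Derives a name-based (version 3) UUID from identifiers that are assumed
     * to be persisted with the execution, so a retried or restarted task
     * computes the same messageId instead of generating a fresh random one.
     */
    static String of(String workflowId, String taskRefName, int turn) {
        // The separator keeps ("a", "bc") distinct from ("ab", "c").
        String seed = workflowId + '\n' + taskRefName + '\n' + turn;
        return UUID.nameUUIDFromBytes(seed.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String first = of("wf-123", "call_agent_ref", 0);
        String afterRestart = of("wf-123", "call_agent_ref", 0);
        // Same persisted inputs give the same id.
        System.out.println(first.equals(afterRestart));
        // A new conversation turn gives a different id.
        System.out.println(first.equals(of("wf-123", "call_agent_ref", 1)));
    }
}
```

A random `UUID.randomUUID()` would not have this property: after a crash the client would resend under a new id and the remote agent could not tell the resend from a new message.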
…fault)

The A2A server's optional shared-secret api-key (conductor.a2a.server.api-key) is removed from OSS: the server is now open by default, matching OSS Conductor REST. Inbound authentication (API keys, OAuth/OIDC, mTLS, per-skill scopes, signed Agent Cards) belongs to the enterprise build; front OSS with a gateway/firewall.

Unchanged and kept in OSS as safe-by-default guards: client SSRF protection, push-callback token auth, cross-agent execution isolation, and client→remote per-call auth headers.

Removes authorized()/apiKey + the two api-key tests; docs and design notes updated.
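
The client SSRF protection kept in OSS could look roughly like the sketch below. This is an illustration under stated assumptions, not the PR's guard: the class `SsrfGuardSketch`, its method names, and the single metadata address `169.254.169.254` are hypothetical choices. It shows the two behaviours the PR describes: private ranges (loopback, RFC-1918, link-local, IPv6 ULA) are refused unless an allow-private-network opt-in is set, and the metadata address is refused regardless.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical sketch of an outbound-address check; not the PR's actual guard. */
final class SsrfGuardSketch {

    /** Assumption: the IPv4 metadata endpoint commonly used by cloud providers. */
    private static final String METADATA_IPV4 = "169.254.169.254";

    private SsrfGuardSketch() {}

    static boolean isBlocked(InetAddress addr, boolean allowPrivateNetwork) {
        // Metadata is refused even when private networks are allowed.
        if (METADATA_IPV4.equals(addr.getHostAddress())) {
            return true;
        }
        if (allowPrivateNetwork) {
            return false;
        }
        return addr.isAnyLocalAddress()
                || addr.isLoopbackAddress()
                || addr.isSiteLocalAddress()   // IPv4: 10/8, 172.16/12, 192.168/16
                || addr.isLinkLocalAddress()   // 169.254/16 and fe80::/10
                || isIpv6UniqueLocal(addr);
    }

    /** fc00::/7: the top seven bits of the first byte are 1111110. */
    private static boolean isIpv6UniqueLocal(InetAddress addr) {
        byte[] raw = addr.getAddress();
        return raw.length == 16 && (raw[0] & 0xFE) == 0xFC;
    }

    /** Convenience wrapper for IP literals; a literal is parsed without a DNS lookup. */
    static boolean isBlockedLiteral(String ipLiteral, boolean allowPrivateNetwork) {
        try {
            return isBlocked(InetAddress.getByName(ipLiteral), allowPrivateNetwork);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP literal: " + ipLiteral, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isBlockedLiteral("10.1.2.3", false));        // private range, no opt-in
        System.out.println(isBlockedLiteral("10.1.2.3", true));         // private range, opt-in set
        System.out.println(isBlockedLiteral("169.254.169.254", true));  // metadata, opt-in set
    }
}
```

A real guard would also need to apply the check to every address a hostname resolves to and keep redirects disabled, as the PR notes, since a redirect or a DNS answer can otherwise point a permitted URL at a blocked address.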
…-turn + LLM-pick examples

Docs: add an error-handling & retries section (FAILED vs FAILED_WITH_TERMINAL_ERROR mapping for HTTP/JSON-RPC/SSRF/liveness) + a troubleshooting table; flesh out the client multi-turn section with a worked SWITCH-on-input-required snippet; add an "orchestrating multiple agents" subsection.

Examples (validated by ExampleWorkflowValidationTest):
- 27-a2a-multi-agent: FORK_JOIN calling agents in parallel → JOIN.
- 28-a2a-llm-pick-skill: GET_AGENT_CARD → LLM_CHAT_COMPLETE → CALL_AGENT.
- 29-a2a-client-multi-turn: SWITCH on input-required, re-call with same context/taskId.
…NT + add agentType

Generalize the agent task types ahead of multi-runtime support:
- CALL_AGENT → AGENT (class CallAgentTask → AgentTask), CANCEL_AGENT_TASK → CANCEL_AGENT. GET_AGENT_CARD kept (discovery/"Agent Card" is A2A-specific).
- New input field `agentType` (default "a2a") on all three tasks: the extension point for native runtimes (langgraph, openai, …). Unknown values are rejected with a clear error; only "a2a" is implemented today.

Updated across enum, handlers, request models, mapper, callback, tests, test-harness, examples, docs, design notes, and the ui-next task registration.

BREAKING CHANGE: workflows using type "CALL_AGENT"/"CANCEL_AGENT_TASK" must switch to "AGENT"/"CANCEL_AGENT".
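
The `agentType` rule described above (default "a2a", unknown values rejected) can be sketched as follows. The class `AgentTypeSketch`, the method `resolve`, and the exact error wording are hypothetical; the PR only states the behaviour, not how the handlers implement it. Matching is exact here, which is also an assumption.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical validation helper for illustration; not a class from this PR. */
final class AgentTypeSketch {

    static final String DEFAULT_AGENT_TYPE = "a2a";

    /** Only "a2a" is implemented today according to the commit message. */
    private static final Set<String> IMPLEMENTED = Set.of("a2a");

    private AgentTypeSketch() {}

    /** Returns the agent type for a task input map, defaulting when the field is absent or blank. */
    static String resolve(Map<String, Object> taskInput) {
        Object raw = taskInput.get("agentType");
        if (raw == null || raw.toString().isBlank()) {
            return DEFAULT_AGENT_TYPE;
        }
        String agentType = raw.toString().trim();
        if (!IMPLEMENTED.contains(agentType)) {
            // Reject instead of silently falling back, so a typo cannot route to the wrong runtime.
            throw new IllegalArgumentException(
                    "Unsupported agentType '" + agentType + "'; supported: " + IMPLEMENTED);
        }
        return agentType;
    }

    public static void main(String[] args) {
        System.out.println(resolve(Map.of()));                   // field omitted
        System.out.println(resolve(Map.of("agentType", "a2a"))); // explicit value
        try {
            resolve(Map.of("agentType", "langgraph"));           // named in the commit, not implemented
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```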
…sing)

Fix 'A AGENT'→'An AGENT' grammar (A2AMetrics, design), reword example descriptions and a doc sentence left awkward by the mechanical rename, and update legacy example workflow names (a2a_call_agent_* → a2a_agent_*).
Summary
Adds Agent2Agent (A2A) protocol support to Conductor in both directions, with durability as the differentiator: a remote agent call survives a server crash/restart and resumes from persisted execution state.
Two directions
Client: call a remote agent from a workflow (`conductor.integrations.ai.enabled=true`)
- `AGENT` (poll / streaming-SSE / push), `GET_AGENT_CARD`, `CANCEL_AGENT`.
- Deterministic `messageId` (stable across retries/restarts), in-flight state lives in the persisted task output (no held thread in poll mode), absolute-deadline + consecutive-poll-failure liveness guards, push backstop poll.

Server: expose any workflow as an A2A agent (`conductor.a2a.server.enabled=true`)
- `{basePath}/{workflow}`: Agent Card + JSON-RPC `message/send` / `tasks/get` / `tasks/cancel`.
- Idempotent start (`messageId` → `idempotencyKey`, `RETURN_EXISTING`) → server-side effectively-once.
- Multi-turn resume: a follow-up `message/send` carrying the task id completes the paused `HUMAN`/`WAIT` task and resumes the same execution (no duplicate workflow).

Security
SSRF guard on outbound `agentUrl` (loopback / RFC-1918 / link-local / IPv6 ULA / cloud-metadata; metadata always blocked; redirects disabled); `conductor.a2a.client.allow-private-network` opt-in for trusted/dev. Optional server `api-key` (constant-time). Push callbacks use single-use bearer tokens with embedded 24h expiry.

Observability
Micrometer counters via the shared `Monitors` registry (`a2a_client_calls`, `a2a_client_poll_failures`, `a2a_rpc_errors`, `a2a_ssrf_blocked`, `a2a_server_requests`, `a2a_server_resumes`) + MDC correlation keys across A2A code paths.

Tests
- `A2ASdkInteropTest`: drives the client against the official `a2a-sdk` reference agent (subprocess): discovery, send, poll, streaming, message-mode. Self-skips when `a2a-sdk` is unavailable.
- `A2ADurableEngineEndToEndTest` (test-harness): `AGENT` through the real decider + AsyncSystemTaskExecutor + Redis, proving crash/restart resume from persistence. Runs in the existing `test-harness` CI job (Docker/Redis).

Docs / UI / examples
- `docs/devguide/ai/a2a-integration.md` (AI Cookbook nav) with worked examples: call/expose, multi-turn resume request/response, push end-to-end.
- `ai/examples/*a2a*` (call / get-card / server / streaming / push / multi-turn / cancel) + indexed in the examples README; runnable interop and durable demos under `ai/src/test/resources/a2a/`.
- `ui-next`: registered `AGENT` / `GET_AGENT_CARD` / `CANCEL_AGENT` (typecheck + build green).
- `design/a2a/`: protocol study, durability proposal, server design.

CI follow-up (`workflow` token scope)
To make `A2ASdkInteropTest` run in CI (it self-skips otherwise), add to the `build` job in `.github/workflows/ci.yml` after the JDK setup:
(Couldn't be pushed here: the PAT lacks the `workflow` scope.)

Out of scope (documented follow-ups)
Server-side streaming (`message/stream`) + push-config endpoints; Agent Card JWS signing/verification (+ `signatures`/`extensions` passthrough); `getAuthenticatedExtendedCard`; per-skill OAuth scopes; gRPC / HTTP+JSON transports; full v1.0 ProtoJSON.

Verify locally