Replies: 6 comments
-
|
@alnzng feel free to add any additional thoughts on this topic and the use case. |
-
|
Hi, @weiqingy. Actually, we’ve recently received some requests for sub-agents and are currently considering how to address this within flink-agents. Other colleagues are handling this work at the moment, but I’d like to share what I know about it. Recently, the Flink community started a discussion on FLIP-577, which aims to lay out a direction for evolving Flink into a data engine that natively supports AI workloads. It raises the idea of introducing a new RpcOperator in Flink that is deployed independently, outside the stream graph, and can communicate with operators inside the stream graph via RPC. @xintongsong shared in community discussions his view that RpcOperator can help flink-agents implement sub-agent capabilities. I believe this discussion is highly beneficial for clarifying the requirements and design of multi-agent systems. Thank you for your insights; I will also invite my colleagues to join this discussion.
-
|
Hi @weiqingy, Thanks for putting this thread together — it does a good job laying out concrete use cases and the open questions to discuss. On the big picture: I think sub-agent support is worth pursuing. That "Not likely" line of mine in #516 is actually a February take. Since then we've had a lot of conversations with team members and some users we're talking to, and my view on sub-agents has evolved. As I mentioned in the email @wenjin272 linked, if Flink Agents can support sub-agents, I see three benefits:
So putting this on the 0.3 / 0.4 roadmap makes sense. Two quick questions before going further:
Now to share how we've been thinking about sub-agents. Before getting into the supervisor + sub-agent pattern itself, I think it's worth pinning down where sub-agents actually run. There are roughly three possibilities (the first one splits into two sub-cases):
1. Supervisor and sub-agents in the same Flink Agents job
2. Supervisor and sub-agents in different Flink Agents jobs
I'm not fully sold on the use case for this shape yet. Multiple operators and agents inside a single Flink job share a lifecycle and deployment story, so if a sub-agent is required for the supervisor to run, putting them in the same job seems more natural. My guess is you're proposing Kafka between Flink Agents jobs to handle the case where a sub-agent isn't available when the supervisor calls it? But what's the advantage over just colocating them in the same job?
That said, I can think of a legitimate scenario for multi-agent collaboration across jobs: each agent owns a dedicated responsibility along with the data / state it needs, and processes tasks coming from various requesters. Think of a company with separate procurement, sales, warehousing, logistics, and after-sales departments, where orders flow between departments without always going through a supervisor. This looks more like a system of independent services, each with its own mailbox / request queue: upstream drops tasks into Kafka, the current service picks them up, processes them, and forwards results to the downstream mailbox. Kafka fits naturally here, but this is a different architecture from supervisor-subagent.
The other way around, connecting supervisor and sub-agents via Kafka, means each supervisor / sub-agent pair needs two queues (input and output). That feels complex and not very natural. So on your Tier 1 item 2 (async cross-job RPC pattern) and open question 4 (cross-job RPC over Kafka design), I'd suggest first clarifying the use case for cross-job, then discussing the concrete design.
3. Supervisor in a Flink Agents job, sub-agents served by an external RPC framework
This is really just an async remote call inside a custom action. The server side could be an agent or any other RPC / HTTP service; it doesn't matter.
On our team's end, the main focus is 1.b (same job, different operators). This depends on RpcOperator, so it's unlikely to land in 0.3. In parallel, for cases where the sub-agent doesn't have heavy workload, I think there's a nice opportunity for the community to make 1.a (same job, same operator) more user-friendly: a built-in supervisor + sub-agent implementation along the lines of ReActAgent, possibly even by extending ReActAgent directly, to cut down on what users have to wire by hand. This looks doable within 0.3. Once RpcOperator is ready, the same built-in can be extended to offer a choice between "sub-agent in-operator" and "sub-agent in an independent resource pool".
This built-in implementation incidentally also covers two of your Tier 2 items:
So compared to "start as recipe / example" in Tier 0 / Tier 2, going one step further and providing a built-in implementation feels better to me: friendlier for users, and the community can iterate on a single shared implementation.
To wrap up, a few points on specific items in the proposal:
On "flink-agents leans LLM-centric" (Motivation / Tier 0 item 5): I'd push back a bit here. Looking at the single-agent orchestration design today, Flink Agents is really workflow orchestration with action and event as the basic units. Calling an LLM, calling a tool, and searching a vector store are all just different action types, peers to each other. Once multi-agent support lands, it'll extend to orchestration between agents. So I don't see the current design itself as LLM-centric. That said, if docs or examples are giving that impression, that's a separate matter. Happy to look at specific descriptions you find misleading and figure out how to improve them.
On sub-agent-specific primitives (open question 3): Agreed they're needed, but doing it well takes careful design, and 0.3 looks tight. Punting to the next release cycle feels safer. In the meantime, introducing the sub-agent concept on top of the ReActAgent built-in is a safer move: once the Flink Agents API formally introduces sub-agent primitives, we just update the built-in, and users won't notice.
On a unified callable resource type (Tier 1 item 1): I'd hold off on this for now; there's no rush to abstract. Tool, REST service, and sub-agent are already familiar standalone concepts to both users and models. A unified abstraction looks cleaner conceptually, but doesn't really add capability, and the help with lowering the learning curve seems limited too. If we later hear users actually complaining about switching between the three, we can revisit then.
-
|
For event-driven multi-agent systems, the primitive I would care about most is the durable run record between agents. A supervisor/sub-agent design becomes much easier to operate if every handoff carries:
That seems very aligned with Flink’s strengths: state, replay, event history, and recovery. The interesting part is avoiding “LLM call chains” as the abstraction and instead modeling agents as event-producing workers with inspectable state transitions. |
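To make the "event-producing workers with inspectable state transitions" idea concrete, here is a minimal sketch in plain Python. It is not flink-agents code: `HandoffEvent`, `RunState`, and every field name are placeholders invented for illustration, and they are not the list of fields proposed in this comment. The only point is that run state is derived from an append-only event history, so replaying the history rebuilds it.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class HandoffEvent:
    """One immutable entry in a run's history (all field names are hypothetical)."""
    run_id: str
    seq: int
    sender: str
    receiver: str
    kind: str       # "delegated", "completed" or "failed"
    payload: str


@dataclass
class RunState:
    """State derived purely from the event history, so replay can rebuild it."""
    status: str = "new"
    result: Optional[str] = None
    history: List[str] = field(default_factory=list)


def apply_event(state: RunState, event: HandoffEvent) -> RunState:
    """Apply a single state transition and record it for later inspection."""
    state.history.append(
        f"{event.seq}:{event.sender}->{event.receiver}:{event.kind}"
    )
    if event.kind == "delegated":
        state.status = "in_progress"
    elif event.kind == "completed":
        state.status, state.result = "done", event.payload
    elif event.kind == "failed":
        state.status = "failed"
    return state


def replay(events: Iterable[HandoffEvent]) -> RunState:
    """Rebuild the run state from scratch by applying events in sequence order."""
    state = RunState()
    for event in sorted(events, key=lambda e: e.seq):
        apply_event(state, event)
    return state


log = [
    HandoffEvent("run-1", 1, "supervisor", "drafter", "delegated", "ticket text"),
    HandoffEvent("run-1", 2, "drafter", "supervisor", "completed", "draft reply"),
]
final = replay(log)
print(final.status, final.result, final.history)
```

Because `replay` sorts by `seq`, feeding it the same events in any order yields the same state, which is the property that makes recovery after a failure straightforward.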
-
|
Hi, @weiqingy. Thanks a lot for raising this discussion. I'm currently investigating the multi-agent framework, including its integration with the RPC Operator proposed in FLIP-577. The supervisor+subagent pattern is a very typical application scenario. To facilitate a more productive discussion, I suggest we separate two concerns: (1) the definition and execution of the subagent itself, and (2) the orchestration and execution of the overall workflow.
First, I strongly agree with your idea of a "callable resource". I believe introducing callable resources will bring significant, beneficial changes to the orchestration of subagents, and indeed to the overall orchestration approach and usability of flink-agents. Historically, flink-agents has adopted an event-driven execution model, including for coordination among actions within an agent, and user job orchestration has been built around this paradigm. This feels natural in a pure pipeline: each participant completes its task and hands it off, without worrying about who picks it up next. However, in a subagent architecture the main agent needs to do further processing based on the subagent's execution results, and the current design becomes less user-friendly. LLM actions actually face a similar issue. flink-agents requires users to split LLM input preparation and output handling into separate steps, manually subscribing to and processing events. To implement a logical operation A that calls a model, users must implement two Actions: Action-A (to produce the chat request) and Action-A' (to handle the chat response). This is not only hard to work with, it also diverges from how users expect the system to behave. For example, in the diagram below: (1) represents the user's logical intent, (2) is how the user expects the execution to flow, (3) is how flink-agents actually executes it, and (4) describes how users may feel about the execution model, as if all user Actions serve the LLM rather than orchestrate their own business logic. I guess this is why you see flink-agents as LLM-centric. Building on this, I've rethought flink-agents' current APIs and execution model and arrived at conclusions very similar to the "callable resource" concept: we should give users a new request-response style interaction paradigm for orchestration, rather than limiting them to event-triggered flows. 
LLM calls and subagents naturally fit the former, while the latter still holds value for decoupled orchestration and flexible subscription. The two paradigms can complement each other, and the event-driven approach can still be used for orchestrating complex subagent workflows. Users would interact via a new
Building on the foundation above, how users define and use subagents becomes clear: a subagent can be as simple as a single Action or as complex as a workflow orchestrated via event subscription. At runtime, it can be wrapped as a callable resource and called directly from the main agent, while the framework internally continues to use event-based subscription and scheduling. However, a subagent entails more than executing an Action or Action chain. It may also require isolated context, an independent toolset, specialized prompts, dedicated compute resources, and more. I haven't deeply analyzed the requirements specific to subagents yet, so please feel free to share your ideas.
Regarding subagent execution: based on the approach outlined in section 1, we can already run subagents within the same TaskManager. However, due to the Python GIL, this model cannot run multiple subagents truly in parallel. This may suffice for simple, LLM-centric logic, but for more complex scenarios we likely need to run subagents in isolated processes or on dedicated external resources, to prevent subagents from affecting the main agent's stability or competing for its compute resources. Currently, the RPC Operator planned in FLIP-577 appears to be a promising option. As Flink's new infrastructure for AI workloads, it enables unified lifecycle and resource management at the job level, while supporting flexible, independent scaling, fault tolerance, and targeted communication optimizations. We can keep an eye on it.
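As a rough illustration of wrapping an event-driven subagent as a callable resource, the sketch below uses only the Python standard library. `EventBus`, `CallableSubAgent`, the event type names, and the `call_id` field are all hypothetical and are not part of the flink-agents API. The caller sees a single request-response `call`, while underneath the subagent is still triggered by a request event and answers with a response event.

```python
import asyncio
import itertools
from typing import Awaitable, Callable, Dict, List

Handler = Callable[[dict], Awaitable[None]]


class EventBus:
    """Tiny in-memory stand-in for event subscription and dispatch."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    async def emit(self, event_type: str, event: dict) -> None:
        for handler in self._handlers.get(event_type, []):
            await handler(event)


class CallableSubAgent:
    """Request-response facade over a request event / response event pair."""

    def __init__(self, bus: EventBus, request_type: str, response_type: str) -> None:
        self._bus = bus
        self._request_type = request_type
        self._pending: Dict[int, asyncio.Future] = {}
        self._ids = itertools.count(1)
        bus.subscribe(response_type, self._on_response)

    async def _on_response(self, event: dict) -> None:
        # Match the response to its caller; unknown or late IDs are ignored.
        future = self._pending.pop(event["call_id"], None)
        if future is not None and not future.done():
            future.set_result(event["result"])

    async def call(self, payload: str, timeout: float = 1.0) -> str:
        call_id = next(self._ids)
        future = asyncio.get_running_loop().create_future()
        self._pending[call_id] = future
        await self._bus.emit(
            self._request_type, {"call_id": call_id, "payload": payload}
        )
        try:
            return await asyncio.wait_for(future, timeout)
        finally:
            self._pending.pop(call_id, None)  # clean up on timeout as well


async def main() -> str:
    bus = EventBus()

    async def summarizer(event: dict) -> None:
        # Stub subagent: a real one would run its own action chain here.
        await bus.emit(
            "summary_response",
            {"call_id": event["call_id"], "result": event["payload"].upper()},
        )

    bus.subscribe("summary_request", summarizer)
    sub_agent = CallableSubAgent(bus, "summary_request", "summary_response")
    # The main agent handles the result inline, with no second Action needed.
    return await sub_agent.call("refund request")


result = asyncio.run(main())
print(result)
```

The design choice worth noting is that the facade owns the pending-call table, so user code never subscribes to the response event itself.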
-
|
Great technical deep dive. Running a 5-agent team for 90+ days has taught us a few things about supervisor + sub-agent patterns that might be relevant here. On the "same job vs cross-job" question: Our production setup uses both:
On context isolation: The biggest win from sub-agent architecture is not scalability; it is context hygiene. A coordinator running for hours accumulates garbage context. Spawning fresh sub-agents for specific tasks gives you clean working memory. Our content creation pipeline: Each sub-agent sees only its inputs, not the whole history. This reduces hallucination and improves consistency.
On the "judge/critic" pattern: We implemented this via a competitive pattern: 3 writer agents generate drafts, and a judge agent picks the best. The judge uses different evaluation criteria than the writers. This gives better results than a single agent self-critiquing.
Cost control note: Supervisor + sub-agent is expensive if you run it continuously. We use cron to spawn the supervisor, which then spawns sub-agents, which terminate after completing their tasks. No idle agents burning tokens.
Our detailed patterns are documented here: https://miaoquai.com/tools/openclaw-multi-agent-orchestration
Thanks for the FLIP-577 reference; the RpcOperator direction looks promising for Flink-native sub-agent support.
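The competitive writer/judge pattern described above can be sketched in a few lines of plain Python. The writers and the judge here are deterministic stubs standing in for LLM-backed agents, and all names and scoring rules are assumptions made up for this example. What the sketch shows is the shape: each writer sees only the task, and the judge scores drafts with criteria the writers do not share.

```python
from typing import Callable, List, Tuple

Writer = Callable[[str], str]
Judge = Callable[[str, str], float]


def best_of_n(task: str, writers: List[Writer], judge: Judge) -> Tuple[str, float]:
    """Collect one draft per writer, score each independently, return the winner."""
    drafts = [writer(task) for writer in writers]  # writers get the task only
    scored = [(draft, judge(task, draft)) for draft in drafts]
    return max(scored, key=lambda pair: pair[1])


# Stub writers (hypothetical); real ones would call a model with a fresh context.
def terse_writer(task: str) -> str:
    return f"{task}: ok"


def polite_writer(task: str) -> str:
    return f"Thanks for reaching out about {task}. We are on it."


def repetitive_writer(task: str) -> str:
    return f"{task} " * 3


def stub_judge(task: str, draft: str) -> float:
    """Stub criteria: stays on topic, has a courteous opening, stays short."""
    score = 0.0
    if task in draft:
        score += 1.0
    if "Thanks" in draft:
        score += 1.0
    if len(draft) <= 80:
        score += 0.5
    return score


winner, winning_score = best_of_n(
    "late delivery",
    [terse_writer, polite_writer, repetitive_writer],
    stub_judge,
)
print(winner, winning_score)
```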

-
This thread proposes a concrete use case and primitive set for multi-agent orchestration — directly responding to @xintongsong's question in #516:
It builds on the gap analysis @thefalc opened in #84 (which identified "agent self-reflection / output evaluation" and "orchestrator-worker / hierarchical patterns" as core multi-agent gaps but didn't propose a specific primitive set) and the foundations being laid by #429 (async execution) and #598 (durable execution reconcile).
Motivation: capability orchestration, not LLM chaining
Internal production discussions at my company landed on a framing worth surfacing:
Today flink-agents leans LLM-centric in naming, examples, and resource model (CHAT_MODEL is first-class; everything else is "a tool"). This framing limits perceived applicability: users see "Flink + LLM calls" rather than "Flink for autonomous capability orchestration." That's a positioning weakness; capability-agnostic framing is what enterprise teams want.
Concrete use case: supervisor + sub-agent with iterative refinement
A pattern repeatedly requested by enterprise teams, and well-established in the Python ecosystem (LangGraph create_supervisor, CrewAI hierarchical mode, AutoGen GroupChat).
Example: real-time customer support ticket triage on a Kafka stream
Key properties:
Why this pattern fits flink-agents (vs LangGraph / CrewAI)
LangGraph already does supervisor + refinement very well — per request, in one Python process, with DB-backed checkpointing. flink-agents' opportunity is the same pattern over continuous streaming inputs with distributed exactly-once durability. A multi-round refinement loop processing millions of events per day, surviving node failures, is something neither LangGraph nor CrewAI can offer. That answers @xintongsong's question on "how is it different on a streaming engine."
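For readers who want the multi-round refinement loop spelled out, here is a minimal single-process sketch with explicit stop conditions (quality threshold, round budget, token budget). It is plain Python, not flink-agents code; `Budget`, `refine`, the stub drafter and critic, and the word-count token proxy are all assumptions for illustration, and durability across failures is deliberately out of scope.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Budget:
    max_rounds: int
    max_tokens: int
    quality_threshold: float


def refine(
    task: str,
    draft_fn: Callable[[str, str], str],
    critic_fn: Callable[[str], Tuple[float, str]],
    budget: Budget,
) -> Tuple[str, str, int]:
    """Draft, critique, and redraft until a stop condition fires.

    Returns (final draft, stop reason, rounds used).
    """
    draft, feedback, tokens_used = "", "", 0
    for round_no in range(1, budget.max_rounds + 1):
        draft = draft_fn(task, feedback)
        tokens_used += len(draft.split())  # crude token proxy (assumption)
        score, feedback = critic_fn(draft)
        if score >= budget.quality_threshold:
            return draft, "quality_reached", round_no
        if tokens_used >= budget.max_tokens:
            return draft, "token_budget_exhausted", round_no
    return draft, "max_rounds_reached", budget.max_rounds


# Deterministic stubs standing in for a drafting sub-agent and a critic.
def stub_drafter(task: str, feedback: str) -> str:
    return f"Reply to {task}" + (f" ({feedback})" if feedback else "")


def stub_critic(draft: str) -> Tuple[float, str]:
    if "apology" in draft:
        return 0.9, ""
    return 0.4, "add an apology"


outcome = refine(
    "late delivery",
    stub_drafter,
    stub_critic,
    Budget(max_rounds=5, max_tokens=100, quality_threshold=0.8),
)
print(outcome)
```

Returning the stop reason alongside the draft keeps the loop inspectable: a supervisor can tell a draft that passed the critic apart from one that simply ran out of budget.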
Primitives to discuss
I checked all open PRs/issues; nothing addresses these. ResourceType has CHAT_MODEL, TOOL, MCP_SERVER, SKILLS but no abstraction for "another agent" or "remote text service as peer to LLM." ReActAgent is single-agent only. No correlation/reply primitives for cross-job calls.
The list below is a starting proposal: the goal of this thread is to decide which of these are must-have to unblock the use case, which can wait, and which are out of scope.
Tier 1
Unified callable resource type — one abstraction subsuming Tool + REST service + sub-agent (sync HTTP, async Kafka, MCP). Supervisor shouldn't care about wire protocol. Directly addresses the capability-agnostic framing.
Async cross-job RPC pattern — correlation IDs, reply topics, timeouts, retries, in-flight state in Flink keyed state. This is the streaming-durability differentiator — Flink already gives us durable state + exactly-once; we should expose it as a clean "delegate to another flink-agent and await reply" primitive instead of forcing every team to reinvent the plumbing.
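To give the async cross-job RPC item something concrete to react to, below is a minimal sketch of the bookkeeping it implies: correlation IDs, per-call deadlines, bounded retries, and a table of in-flight calls. It is plain Python with no Flink or Kafka dependency. The dict stands in for keyed state, the `outbox` list stands in for a request topic, and `PendingCalls` and its method names are hypothetical rather than a proposed API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class InFlight:
    payload: str
    deadline: float
    attempts: int


class PendingCalls:
    """Tracks delegated calls by correlation ID (dict stands in for keyed state)."""

    def __init__(self, timeout_s: float, max_attempts: int) -> None:
        self.timeout_s = timeout_s
        self.max_attempts = max_attempts
        self.in_flight: Dict[str, InFlight] = {}
        self.outbox: List[Tuple[str, str]] = []  # stand-in for the request topic

    def send(self, correlation_id: str, payload: str, now: float) -> None:
        self.in_flight[correlation_id] = InFlight(payload, now + self.timeout_s, 1)
        self.outbox.append((correlation_id, payload))

    def on_reply(self, correlation_id: str, reply: str) -> Optional[str]:
        # Late or duplicate replies find no entry and are dropped.
        if self.in_flight.pop(correlation_id, None) is None:
            return None
        return reply

    def on_timer(self, now: float) -> List[str]:
        """Resend expired calls; return the IDs that ran out of attempts."""
        failed = []
        for cid, call in list(self.in_flight.items()):
            if now < call.deadline:
                continue
            if call.attempts >= self.max_attempts:
                del self.in_flight[cid]
                failed.append(cid)
            else:
                call.attempts += 1
                call.deadline = now + self.timeout_s
                self.outbox.append((cid, call.payload))
        return failed


calls = PendingCalls(timeout_s=5.0, max_attempts=2)
calls.send("t-1", "classify", now=0.0)
calls.send("t-2", "draft", now=0.0)
first = calls.on_reply("t-1", "billing")       # matched and removed
duplicate = calls.on_reply("t-1", "billing")   # dropped as a duplicate
retried = calls.on_timer(now=6.0)              # t-2 is resent once
failed = calls.on_timer(now=12.0)              # t-2 exhausts its attempts
print(first, duplicate, retried, failed)
```

Passing `now` in explicitly keeps the logic deterministic and testable; in a streaming job the same transitions would presumably be driven by timers and state that survive restarts.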
Tier 2 (likely starts as recipes, promoted to primitives after validation)
Judge / critic step — document the pattern first with example code; formalize only after multiple users converge on the shape. Avoids over-abstraction.
Richer loop termination: quality threshold, budget (tokens, wall-clock, rounds), not just AGENT_MAX_ITERATIONS.
Tier 0 (free wins, can do now)
Open questions for the community
SubAgent, RemoteCapability), or extend existing TOOL/MCP_SERVER? Extending is less disruptive; new types are cleaner.
Linking @xintongsong @thefalc @yanand0909 since you've engaged on adjacent topics in #516 / #84.