Self-Healing, MCP-Governed AI Workflows for Apache Kafka.
Companion artifact for the API Days proposal "Agent Mesh on Streaming World: Building Self-Healing, MCP-Governed AI Workflows for Apache Kafka Driven Event and Data Streaming."
This is the drag-and-drop Agent Workflow Canvas from Act 3 of the talk, brought to life as a runnable demo. Every node is an MCP-speaking AI agent, every edge is a Kafka topic governed by an AsyncAPI 3.0 contract, and every infrastructure-mutating tool call is policy-gated through a human-in-the-loop approval gate.
The demo runs in three modes, side-by-side:
| Mode | URL | What it does |
|---|---|---|
| Simulator | `/` | In-memory KRaft simulator. No infrastructure required. Five-minute laptop demo. |
| Real cluster | `/setup` | Provisions Strimzi 0.51.0 on OpenShift, then routes the very same UI through KafkaJS to your live cluster. |
| Builder | `/builder` | Drag-and-drop workflow editor that emits AsyncAPI 3.0 + Strimzi `KafkaTopic` / `KafkaUser` CRs. |

- Live KRaft simulator with full M→R→A→L loop, animated event particles, kill/replay, audit topic, MCP tool registry, AsyncAPI contracts.
- One-click Strimzi installer that streams progress over SSE while it applies 6 manifest groups, waits for `Kafka`/`KafkaTopic`/`KafkaUser` to become `Ready`, and pulls back credentials.
- Pluggable Kafka backend. When real-mode is on, every record an agent produces is also written to the live cluster, and every record arriving on the cluster is mirrored back into the simulator. A terminal `kafka-console-consumer` sees exactly what the canvas shows.
- Real scenario mutations. Each scenario button switches between the in-memory simulator and real `oc set env`/`oc scale`/`oc delete pod` calls, with per-step toasts narrating what just happened on the cluster.
- Drag-and-drop builder at `/builder` with palette → canvas DnD, editable inspector for nodes and topic edges, JSON import/export, and a Deploy preview modal that renders the AsyncAPI spec, KafkaTopic CRs, and KafkaUser CRs the design would produce, all copy-paste-able and downloadable.
- Live cluster status in the navbar. Bootstrap host, brokers ready, controller epoch, mTLS / SCRAM / ACL counts, polled every 5 seconds against the real Strimzi resources.

The proposal calls out a four-layer reusable blueprint. This codebase is a working implementation of all four layers, runnable end-to-end on a laptop without a real Kafka cluster, architected so each layer is adoptable independently.
| Layer | Implementation here |
|---|---|
| Agent contract layer | Each of the 4 agents hosts an MCP server with typed JSON-RPC tool definitions. See src/lib/mcp.ts. |
| Channel contract layer | Every Kafka topic has a full AsyncAPI 3.0 spec, browsable from the right inspector. See src/lib/asyncapi.ts. |
| Execution governance layer | Infrastructure-affecting MCP tool calls route through a policy approval gate rendered as a live card. Pluggable. |
| Audit layer | ops.actions.audit.v1 is a first-class Kafka topic. Every agent decision, approval, tool call, and replay event is published there. |
- Four named agents (Intake, Monitor, Writer, Notification) laid out left-to-right.
- Five Kafka topic edges: `ops.requests.v1`, `ops.incidents.v1`, `ops.actions.audit.v1`, `ops.lessons.v1` (self-loop on Monitor, the LEARN channel), and a virtual metrics feed into Monitor.
- A live M → R → A → L progress strip (top-right) that lights up in real time as the Monitor agent moves between phases.
- Live animated event particles flying along each edge on every publish.
- A policy approval card that pulses in on every infra-mutating tool call.
- Right-side tabbed inspector: agent detail, AsyncAPI 3.0 contracts, MCP tool registry, Lessons learned, Audit stream, Outbound notifications.
Requires Node 20+.

    cd agent-mesh-sre
    npm install
    npm run dev
    # open http://localhost:3000

No Docker, no broker, no AI keys required for the simulator path. To drive the demo against a real Kafka cluster, follow Mode 2 (Real cluster) below.

Three modes. They share the same UI shell and the same agent runtime; only the Kafka backend changes. You can flip between them at any time.
- `npm run dev` and open `http://localhost:3000`.
- Top bar shows `MOCK` with the simulated KRaft cluster: 3 brokers online, controller epoch, mTLS / SASL / ACL badges. Everything is in-process.
- On the left rail, click any of the four scenarios (lag spike, KRaft failover, share-group rebalance, benign rebalance). Watch the canvas:
  - Particles travel along Kafka edges as records publish.
  - The M → R → A → L strip lights up in the top-right of the canvas.
  - For lag-spike + share-group, an approval card pulses on the left rail with the actual MCP `tools/call` JSON-RPC. Click Approve.
- Open the Audit tab on the right: every consume / publish / tool-call / approval / replay event is there.
- The "Replay moment" card on the left rail kills the Writer agent, lets incidents pile up, then drains them via Kafka offset replay when you click Restart & replay. Inspect the Audit tab for `agent-kill` → `replay-start` → `replay-complete`.
- Use Reset (top right of the rail) to clear all state and start over.

Pre-requisites
- An OpenShift cluster you can `oc login` to.
- Strimzi 0.51.0 operator installed from OperatorHub (or in any namespace that watches `agent-mesh-sre`).
- `~/.kube/config` configured for the target cluster. The Next.js process reads it directly via `@kubernetes/client-node`. Whatever permissions your `oc whoami` has are the permissions the demo gets.
- ~8 vCPU / 12 GiB RAM headroom for 3 controllers + 3 brokers + 4 demo workloads + Cruise Control.

Walkthrough
- From the main demo, click Setup in the top bar (or browse to `http://localhost:3000/setup`). This opens the Setup Wizard.
- Step 1: Connect. The wizard reads `~/.kube/config`, shows the cluster you're pointed at, and verifies the Strimzi operator is reachable. Confirm the namespace (`agent-mesh-sre`) and cluster name (`agent-mesh-kafka`).
- Step 2: Install. Click Install. The wizard streams 6 phases over SSE:
  - `00-namespace.yaml` → namespace ready
  - `01-kafka-nodepools.yaml` → controller + broker `KafkaNodePool`s
  - `02-kafka-cluster.yaml` → the `Kafka` CR with mTLS + SCRAM-SHA-512 (waits for `Ready: True`)
  - `03-kafka-topics.yaml` → 6 ops topics + `demo.payments.events`
  - `04-kafka-users.yaml` → 6 KafkaUsers (one per agent + the controller principal + a demo workload)
  - `05-demo-workloads.yaml` → `fast-producer` / `slow-consumer` / `cooperative-consumer` / `share-group-consumer`

  Each phase shows live `Ready` conditions polled from the cluster. Watch this happen in real time; it takes 2-4 minutes.
- Step 3: Wire up. When credentials become available, the wizard auto-flips the runtime to `REAL` mode and opens a `KafkaJS` client against the bootstrap route using the `agent-mesh-controller` SCRAM-SHA-512 user. The top bar updates: bootstrap host, broker count, controller epoch, ACL count are now live.
- Return to `/`. The cluster chip now reads `REAL`. Click any scenario and a toast pops out in the bottom-right showing exactly what was just executed on your cluster:
  - `oc set env deploy/slow-consumer DELAY_MS=200`
  - `oc delete pod <controller-leader>`
  - `oc scale deploy/share-group-consumer --replicas=+1`
  - etc.

  The agents observe and react against the real cluster; particles still animate on the canvas because the real Kafka traffic is being mirrored into the simulator.
- To verify with your own tools, in another terminal:

      oc -n agent-mesh-sre exec -it agent-mesh-kafka-broker-0 -- \
        bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
        --topic ops.actions.audit.v1 --from-beginning

  You will see the same records the canvas is rendering.
- To roll back: from `/setup`, click Uninstall. It deletes the `Kafka` CR, all PVCs (`deleteClaim: true`), all Strimzi custom resources (CRs) in the namespace, and the namespace itself. The Strimzi operator stays installed.

Manual fallback — if you'd rather not drive the wizard:

    ./deploy/scripts/install.sh --wait                 # apply manifests + wait for Ready
    ./deploy/scripts/get-credentials.sh > .env.local   # write a ready-to-source env file
    npm run dev                                        # the app picks up .env.local

See deploy/README.md for the full Strimzi/OpenShift detail.

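
For readers wiring their own client against the credentials file, here is a minimal sketch of turning environment values into a TLS + SCRAM-SHA-512 client configuration. The object shape follows the KafkaJS `KafkaConfig` fields (`clientId`, `brokers`, `ssl`, `sasl`). The environment variable names (`KAFKA_BOOTSTRAP`, `KAFKA_CA_PEM`, `KAFKA_USERNAME`, `KAFKA_PASSWORD`) are assumptions made for illustration; check what `get-credentials.sh` actually writes before relying on them.

```typescript
// Shape mirrors the KafkaJS `KafkaConfig` fields used for TLS + SCRAM-SHA-512.
interface KafkaClientConfig {
  clientId: string;
  brokers: string[];
  ssl: { ca: string[] };
  sasl: { mechanism: "scram-sha-512"; username: string; password: string };
}

// Hypothetical variable names; the real .env.local keys may differ.
type CredentialEnv = Record<string, string | undefined>;

function requireVar(env: CredentialEnv, key: string): string {
  const value = env[key];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required setting: ${key}`);
  }
  return value;
}

function buildKafkaConfig(env: CredentialEnv): KafkaClientConfig {
  return {
    clientId: "agent-mesh-sre",
    // A comma-separated bootstrap list is split into individual host:port entries.
    brokers: requireVar(env, "KAFKA_BOOTSTRAP").split(",").map((b) => b.trim()),
    ssl: { ca: [requireVar(env, "KAFKA_CA_PEM")] },
    sasl: {
      mechanism: "scram-sha-512",
      username: requireVar(env, "KAFKA_USERNAME"),
      password: requireVar(env, "KAFKA_PASSWORD"),
    },
  };
}
```

With KafkaJS installed, the result can be passed straight to the client constructor, for example `new Kafka(buildKafkaConfig(process.env))`. Failing fast on a missing value keeps a half-configured client from silently falling back to an unauthenticated connection attempt.
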
- From the main demo, click Builder in the top bar (or browse to `http://localhost:3000/builder`). The builder loads with a starter mesh that mirrors the live demo topology.
- Add agents. Drag any of the 6 templates from the left palette (Intake / Monitor / Writer / Notification / Policy / Custom) onto the canvas. Each drop creates an agent node with sensible defaults and the right MCP tool set.
- Wire up topics. Hover over the right edge of any node and a handle dot appears. Drag from there to the left edge of another node. A new Kafka topic edge is created with a placeholder name.
- Edit anything. Click a node or edge to open the right-side inspector:
  - Agents: rename, change role/accent, add/remove MCP tools, change description.
  - Topics: edit topic name, partitions (1-64), `cleanup.policy` (`delete` / `compact` / `compact,delete`), `retention.ms`, AsyncAPI contract URI.
- Self-loops. Drag from a node's top-right handle to its top-left handle to create a self-feedback channel (this is how the Monitor agent's `ops.lessons.v1` LEARN loop is modelled).
- Save / load. All edits autosave to `localStorage`. Use Export to download the design as JSON, Import to load one back. Reset restores the starter design.
- Deploy preview. Click the green Deploy button. A modal opens with five tabs:
  - Summary: validation warnings (invalid topic names, duplicate topics, agents with no tools).
  - AsyncAPI: a complete AsyncAPI 3.0 document for the design.
  - KafkaTopic CRs: multi-doc YAML, ready for `oc apply -f`.
  - KafkaUser CRs: one per agent, with scoped SCRAM-SHA-512 + ACLs derived from each agent's consume/produce edges.
  - Design JSON: the raw design, for code review or version control.

  Each tab has Copy and Download buttons.
- Apply to cluster (real mode only). When connected to a real cluster, the deploy modal shows an Apply to cluster button. Click it to:
  - Stream live SSE progress as each `KafkaTopic` and `KafkaUser` CR is applied via server-side apply.
  - Wait for each resource to become `Ready: True` (with live status updates).
  - Auto-refresh the cluster snapshot so the top bar immediately reflects the new topic/user counts.
  - See exactly which manifests were applied and their readiness state in real time.
- Run on canvas (mock mode). In simulator mode, the deploy modal shows a Run on canvas toggle. Enable it to:
  - Start a client-side flow simulator that emits synthetic Kafka records along every edge in your design.
  - See animated particles travel along each topic edge on the canvas.
  - View per-edge counters showing total emitted records and last emit timestamp.
  - Adjust simulation speed (0.5–12 events/sec) via the play/pause controls in the builder top bar.
  - Stop the simulation to drain all in-flight particles and reset counters.

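
The Summary tab's checks (invalid topic names, duplicate topics, agents with no tools) can be sketched as a pure function over the design. This is an illustration, not the builder's actual code: the `Design`, `DesignAgent` and `DesignTopic` shapes are assumptions, and the name rule used is Kafka's general one (ASCII letters, digits, `.`, `_`, `-`, at most 249 characters, and not `.` or `..`).

```typescript
// Hypothetical design shapes; the real BuilderDesign type may differ.
interface DesignAgent { id: string; name: string; tools: string[] }
interface DesignTopic { name: string; partitions: number }
interface Design { agents: DesignAgent[]; topics: DesignTopic[] }

// Kafka topic names: letters, digits, '.', '_' and '-', 1-249 chars, not "." or "..".
function isValidTopicName(name: string): boolean {
  if (name === "." || name === "..") return false;
  return /^[A-Za-z0-9._-]{1,249}$/.test(name);
}

function validateDesign(design: Design): string[] {
  const warnings: string[] = [];
  const seen = new Set<string>();
  for (const topic of design.topics) {
    if (!isValidTopicName(topic.name)) warnings.push(`invalid topic name: ${topic.name}`);
    if (seen.has(topic.name)) warnings.push(`duplicate topic: ${topic.name}`);
    seen.add(topic.name);
    // The inspector limits partitions to 1-64.
    if (!Number.isInteger(topic.partitions) || topic.partitions < 1 || topic.partitions > 64) {
      warnings.push(`partitions out of range for ${topic.name}: ${topic.partitions}`);
    }
  }
  for (const agent of design.agents) {
    if (agent.tools.length === 0) warnings.push(`agent has no tools: ${agent.name}`);
  }
  return warnings;
}
```

Returning a list of warnings rather than throwing lets a UI show every problem at once and still render the generated manifests for review.
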
All routes are JSON over HTTP unless noted. Most are wired to the demo UI; you can hit them directly from curl for scripting.
| Route | Method | Purpose |
|---|---|---|
| `/api/events` | GET (SSE) | Live stream of every `WireEvent` (particle, audit, agent-state, topic-record, approval, lesson, log). Used by the canvas. |
| `/api/scenario` | POST | Trigger a simulator scenario. Body: `{ "id": "lag-spike" \| "controller-failover" \| "share-group" \| "benign-rebalance" }`. |
| `/api/cluster/scenario` | POST | Same scenarios, but executes real `oc` mutations against the cluster. Returns the per-step shell-equivalents that ran. |
| `/api/approvals` | POST | `{ id, decision: "approve" \| "reject", actor }`. |
| `/api/agent` | POST | Kill or restart an agent. `{ agent, action: "kill" \| "restart" }`. |
| `/api/reset` | POST | Wipe simulator state. |
| `/api/asyncapi` | GET | All AsyncAPI 3.0 specs for the demo topics. |
| `/api/mcp` | GET | The MCP tool registry as JSON-RPC envelopes. |
| `/api/cluster/connect` | GET | Reads `~/.kube/config`, returns context + cluster info. |
| `/api/cluster/install` | POST (SSE) | Streams Strimzi install progress. |
| `/api/cluster/uninstall` | POST (SSE) | Streams Strimzi uninstall progress. |
| `/api/cluster/snapshot` | GET | Live snapshot of `Kafka` / `KafkaTopic` / `KafkaUser` resources. |
| `/api/cluster/credentials` | GET | Reads bootstrap + CA + SCRAM password from cluster Secrets. Localhost-only for safety. |
| `/api/cluster/mode` | GET / POST | Read or override the runtime mode (`mock` \| `real`). |
| `/api/cluster/tap` | GET / POST | Toggle the KafkaJS bridge between simulator and real cluster. Body: `{ "action": "enable" \| "disable" }`. |
| `/api/cluster/apply` | POST (SSE) | Apply a Builder design to the cluster as `KafkaTopic` + `KafkaUser` CRs. Body: `{ design: BuilderDesign, namespace?, waitReady? }`. Streams apply progress and readiness checks. |

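
When scripting against the SSE routes without a browser `EventSource`, the stream has to be split into events by hand. The sketch below parses a `text/event-stream` buffer into JSON payloads and returns the unfinished tail for the next chunk. It assumes each event carries a single JSON object in its `data:` field and that the object has a `type` property; neither detail is confirmed by this README.

```typescript
// Assumed payload shape: a JSON object with a `type` discriminator.
interface WireEvent { type: string; [key: string]: unknown }

// Splits a text/event-stream buffer into complete events.
// Returns the parsed events plus the unfinished tail to prepend to the next chunk.
function parseSse(buffer: string): { events: WireEvent[]; rest: string } {
  const normalized = buffer.replace(/\r\n/g, "\n");
  const blocks = normalized.split("\n\n");
  const rest = blocks.pop() ?? ""; // the last block has no terminating blank line yet
  const events: WireEvent[] = [];
  for (const block of blocks) {
    const data = block
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).replace(/^ /, "")) // drop "data:" and one optional space
      .join("\n");
    if (data === "") continue; // comment lines and keep-alives carry no data
    events.push(JSON.parse(data) as WireEvent);
  }
  return { events, rest };
}
```

A caller would append each network chunk to `rest`, call `parseSse` again, and dispatch on `event.type`. Keeping the parser pure makes it easy to test without a running server.
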
The demo is choreographed around four scripted scenarios on the left rail and one marquee replay moment. Total runtime: ~5 minutes.
Top bar shows KRaft mode, controller broker with current epoch, brokers online, mTLS on, SASL/SCRAM on, ACLs active. These are the production-pattern badges from Act 4 of the talk — security and governance are visible, not aspirational.
Click "KRaft controller failover" on the left rail.
- The MRAL strip (top-right of canvas) lights up `Monitor → Reason → Act → Learn`.
- Click the Monitor agent. The Inspector on the right shows `lastReasoning`, the structured JSON the reasoning step emits: `kafkaFeatureCited: "KRaft"`, confidence `0.94`, `requiresApproval: false`.
- Narration: "This is a KRaft leadership change. Static alerting would page someone. The agent ack'd it. Nothing paged, but every step is in the audit topic."
- Open the Audit tab: every consume / publish / tool-call is there.

Click "Consumer lag spike".
- ~24,000 message backlog on `payments-consumer`. Monitor's reasoning cross-references rebalance state (stable) and KRaft epoch (no change): this is real backlog, not benign churn.
- MRAL strip pauses on AWAITING APPROVAL.
- A pulsing policy approval card appears in the left rail. The audience sees the actual MCP `tools/call` JSON-RPC envelope for `kafka.scaleConsumers` with `delta: 2`.
- Narration: "This is the execution-governance layer. No infrastructure mutation happens without this gate."
- Click Approve. The action executes. An incident publishes to `ops.incidents.v1`. The Writer drafts a markdown postmortem. The Notification agent posts to Slack and opens an ITSM ticket, all visible as travelling particles on the canvas edges.

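
To make the approval step concrete, here is a small sketch of an MCP `tools/call` JSON-RPC envelope and a gate that holds infrastructure-mutating calls until a human decides. The envelope fields (`jsonrpc`, `id`, `method`, `params.name`, `params.arguments`) follow the MCP specification. The registry contents, the `gate` function, and the `kafka.describeGroup` tool are hypothetical and stand in for whatever `src/lib/mcp.ts` and `src/lib/mesh.ts` really do.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

type GateState = "executed" | "awaiting-approval" | "rejected";

// Hypothetical registry: which tools mutate infrastructure and need a human decision.
const requiresApproval: Record<string, boolean> = {
  "kafka.scaleConsumers": true,
  "kafka.describeGroup": false, // illustrative read-only tool, not taken from the demo
};

function toolCall(id: number, name: string, args: Record<string, unknown>): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// `decision` stays undefined until someone acts on the approval card.
function gate(req: ToolCallRequest, decision?: "approve" | "reject"): GateState {
  // Tools missing from the registry are treated as gated (fail closed).
  const gated = requiresApproval[req.params.name] ?? true;
  if (!gated) return "executed";
  if (decision === undefined) return "awaiting-approval";
  return decision === "approve" ? "executed" : "rejected";
}

const scaleCall = toolCall(1, "kafka.scaleConsumers", { delta: 2 });
const beforeDecision = gate(scaleCall);           // held for a human
const afterApproval = gate(scaleCall, "approve"); // released
```

Failing closed for unregistered tools is a deliberate choice in this sketch: a tool added without a policy entry waits for approval instead of running unreviewed.
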
After ~5 seconds a lesson appears in the Lessons tab: lag before / lag after / effective / adjustedThreshold. This is the 60-second settling window from the proposal, compressed for the demo.
Trigger another scenario. The Monitor's next lastReasoning.rationale literally cites "Last N lessons informed this prompt: ...". That's the closed loop — past outcomes seed future decisions.
Click "Benign rebalance (suppress page)".
- Same lag rise, but rebalance state is `preparing-rebalance`, a healthy KIP-848 cooperative rebalance.
- No approval prompt. Reasoning returns `recommendedAction: rebalance-wait`.
- Narration: "Static alerts false-positive here. The agent recognises the context and suppresses. This is the value of cross-referencing rebalance state."

This is the moment the proposal is built around — "the pivotal moment of the demo."
On the left rail, in the "The replay moment" card:
- Click Kill next to Writer Agent. Its node turns red and shows `CRASHED`.
- Run two or three scenarios in quick succession (e.g. lag spike, share group, KRaft failover), approving any gates.
- Watch incidents accumulate on the `ops.incidents.v1` edge. Click that edge: you can see the offset counter climbing and the records piling up unread in the inspector.
- Narration: "Point-to-point agent frameworks would lose this data. Because the channel is a Kafka topic, events are durable and replayable."
- Click Restart & replay. Writer goes `REPLAYING`, starts consuming from its last committed offset, drains the backlog, reports each incident to Slack + ITSM, and returns to `ONLINE`.
- Check the Audit tab: it includes `agent-kill`, `agent-restart`, `replay-start`, `replay-complete` entries.

Zero data loss, in under 30 seconds, in front of the audience. That single moment makes the case for Kafka-as-backbone.
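
The mechanism behind the replay moment is the committed offset: the log keeps every record, and a restarted consumer resumes from where its group last committed. The sketch below models that with a single-partition in-memory log. It is a simplification written for this README, not the simulator's `broker.ts`; in particular it commits on every poll, whereas a real consumer would normally commit only after processing.

```typescript
// Minimal single-partition log with per-group committed offsets.
class TopicLog<T> {
  private records: T[] = [];
  private committed = new Map<string, number>();

  publish(record: T): number {
    this.records.push(record);
    return this.records.length - 1; // offset of the new record
  }

  // Returns everything from the group's committed offset onward, then commits.
  poll(group: string): T[] {
    const from = this.committed.get(group) ?? 0;
    const batch = this.records.slice(from);
    this.committed.set(group, this.records.length);
    return batch;
  }

  // Records published but not yet consumed by the group.
  lag(group: string): number {
    return this.records.length - (this.committed.get(group) ?? 0);
  }
}

const incidents = new TopicLog<string>();
incidents.publish("incident-1");
incidents.poll("writer"); // Writer is online and caught up

// Writer is killed: publishing continues, nothing is consumed.
incidents.publish("incident-2");
incidents.publish("incident-3");
const backlog = incidents.lag("writer");

// Restart & replay: the next poll resumes at the committed offset.
const replayed = incidents.poll("writer");
```

Because the producer never waits on the consumer, the Writer's crash only shows up as growing lag; nothing is dropped, and the drained batch arrives in publish order.
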
- Point back at the top-bar badges: mTLS, SASL/SCRAM, ACLs.
- Point at the Outbound tab: Slack + ITSM are the human-facing closure.
- "Everything you just saw is routable to IBM Event Streams on Strimzi, OpenShift, Watson AIOps. The agent contracts don't change."
| Scenario | Kafka feature | Reasoning output | Approval | Outcome |
|---|---|---|---|---|
| Consumer lag spike | KIP-848 (rebalance-aware) | `scale-consumers`, confidence 0.88 | Required | N → N+2 via Strimzi CRD |
| KRaft controller failover | KRaft | `controller-failover-ack`, confidence 0.94 | Not required | Ack + audit only |
| Share group rebalance | KIP-932 | `share-group-rebalance-ack`, confidence 0.86 | Required | Checkpoint share group |
| Benign rebalance | KIP-848 (suppression) | `rebalance-wait`, confidence 0.91 | Not required | False-positive suppressed |

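
The table above can be read as a small decision procedure. The sketch below reproduces its four outcomes from a metrics signal. The actions, confidence values, approval flags, and cited features are copied from the table; the ordering of the checks, the `LAG_THRESHOLD` value, and the field names on `MetricsSignal` are assumptions for illustration and may not match `Mesh.reason()`.

```typescript
type RebalanceState = "stable" | "preparing-rebalance";

// Hypothetical field names for the signal the Monitor phase emits.
interface MetricsSignal {
  lag: number;
  rebalanceState: RebalanceState;
  controllerEpochChanged: boolean;
  shareGroupMarker: boolean;
}

interface ReasoningOutput {
  recommendedAction: string;
  confidence: number;
  requiresApproval: boolean;
  kafkaFeatureCited: string;
}

const LAG_THRESHOLD = 10_000; // assumed threshold, not taken from the repository

// Returns null when nothing in the signal calls for a decision.
function reason(signal: MetricsSignal): ReasoningOutput | null {
  if (signal.controllerEpochChanged) {
    return { recommendedAction: "controller-failover-ack", confidence: 0.94, requiresApproval: false, kafkaFeatureCited: "KRaft" };
  }
  if (signal.shareGroupMarker) {
    return { recommendedAction: "share-group-rebalance-ack", confidence: 0.86, requiresApproval: true, kafkaFeatureCited: "KIP-932" };
  }
  // A rebalance in progress explains a lag rise, so it is checked before the lag rule.
  if (signal.rebalanceState === "preparing-rebalance") {
    return { recommendedAction: "rebalance-wait", confidence: 0.91, requiresApproval: false, kafkaFeatureCited: "KIP-848" };
  }
  if (signal.lag > LAG_THRESHOLD) {
    return { recommendedAction: "scale-consumers", confidence: 0.88, requiresApproval: true, kafkaFeatureCited: "KIP-848" };
  }
  return null;
}
```

The point of the ordering is the same one the demo makes: the same lag number yields `scale-consumers` when the group is stable and `rebalance-wait` when a rebalance is in progress.
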
agent-mesh-sre/
├── deploy/ # Strimzi 0.51.0 manifests + scripts
│ ├── base/00-namespace.yaml
│ ├── base/01-kafka-nodepools.yaml
│ ├── base/02-kafka-cluster.yaml
│ ├── base/03-kafka-topics.yaml
│ ├── base/04-kafka-users.yaml
│ ├── base/05-demo-workloads.yaml
│ ├── scenarios/01-lag-spike.sh
│ ├── scenarios/02-controller-failover.sh
│ ├── scenarios/03-share-group-rebalance.sh
│ ├── scenarios/04-benign-rebalance.sh
│ ├── scenarios/99-reset.sh
│ └── scripts/{install,uninstall,get-credentials}.sh
├── src/
│ ├── app/ # Next.js 14 app router
│ │ ├── page.tsx # Main UI - 3-column simulator/real-mode demo
│ │ ├── setup/page.tsx # Setup Wizard for Strimzi install/uninstall
│ │ ├── builder/page.tsx # Drag-and-drop workflow builder
│ │ ├── layout.tsx
│ │ ├── globals.css # Tailwind v4 + theme tokens
│ │ └── api/
│ │ ├── events/route.ts # SSE stream of every WireEvent
│ │ ├── scenario/route.ts # Trigger a simulator scenario
│ │ ├── approvals/route.ts # Approve/reject a policy gate
│ │ ├── agent/route.ts # Kill / restart an agent
│ │ ├── reset/route.ts # Reset mesh state
│ │ ├── asyncapi/route.ts # AsyncAPI 3.0 specs
│ │ ├── mcp/route.ts # MCP tool registry
│ │ └── cluster/
│ │ ├── connect/route.ts # Verify kubeconfig + Strimzi operator
│ │ ├── install/route.ts # SSE install pipeline
│ │ ├── uninstall/route.ts # SSE teardown pipeline
│ │ ├── snapshot/route.ts # Live KafkaTopic/KafkaUser/Kafka snapshot
│ │ ├── credentials/route.ts # Pulls SCRAM creds + auto-enables tap
│ │ ├── mode/route.ts # Read/write runtime mode
│ │ ├── scenario/route.ts # Real cluster scenario mutations
│ │ └── tap/route.ts # Toggle KafkaJS bridge
│ ├── lib/
│ │ ├── types.ts # Core domain types
│ │ ├── asyncapi.ts # AsyncAPI 3.0 specs for every topic
│ │ ├── mcp.ts # MCP tool definitions with policyTags
│ │ ├── agents-config.ts # Static definitions of the 4 agents
│ │ ├── broker.ts # In-memory KRaft simulator
│ │ ├── mesh.ts # Agent runtime + M-R-A-L loop + crash/replay
│ │ ├── event-bus.ts # In-process pub/sub for SSE
│ │ ├── runtime-mode.ts # mock|real mode + cached credentials
│ │ ├── store.ts # Zustand client store (simulator)
│ │ ├── cluster-store.ts # Zustand client store (real cluster)
│ │ ├── builder-store.ts # Zustand client store (workflow builder)
│ │ ├── toasts.ts # Toaster pub/sub
│ │ ├── sse-client.ts # EventSource hook
│ │ ├── k8s/
│ │ │ ├── client.ts # Wraps @kubernetes/client-node
│ │ │ ├── strimzi.ts # Apply / wait / read CRs
│ │ │ └── holder.ts # Singleton K8s client
│ │ └── kafka/
│ │ ├── backend.ts # KafkaJS adapter (TLS + SCRAM-SHA-512)
│ │ └── tap.ts # Bridge: real cluster <-> simulator
│ └── components/
│ ├── canvas/ # Demo canvas (React Flow + particles + MRAL)
│ ├── panels/ # TopBar, ScenarioPanel, ApprovalQueue, Inspector, EventLog
│ ├── setup/ # Setup Wizard step components
│ ├── builder/ # Builder palette + canvas + node + edge + inspector + deploy modal
│ └── ui/ # Shared UI primitives (Toaster, JsonPretty, etc.)
├── package.json
├── next.config.mjs
└── tsconfig.json
This is the sequential build order the proposal promises. Each bullet links to the file to read.
- Define agent contracts → `src/lib/mcp.ts` (each MCP tool with `inputSchema`, `outputSchema`, `requiresApproval`, `policyTags`).
- Wire Kafka topics as AsyncAPI specs → `src/lib/asyncapi.ts` (one AsyncAPI 3.0 fragment per topic, with `channels`, `operations`, `components.messages`, `components.schemas`).
- Implement MCP tool handlers → `src/lib/mesh.ts` inside `act()` and `runNotification()`.
- Add policy gates → `ApprovalRequest` lifecycle in `src/lib/mesh.ts` + `/api/approvals` route + left-rail UI.
- Wire the audit topic → `audit_()` helper emits to `ops.actions.audit.v1` on every consume / publish / tool-call / approval-decision / agent-kill / agent-restart / replay-start / replay-complete.
- Run the canvas → React Flow plus custom `AgentNode` + `TopicEdge` + `ParticleLayer`.

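
As a rough picture of the audit step in the list above, the sketch below builds a typed audit record and hands it to a publisher. The eight event kinds and the topic name come from this README. The record fields (`key`, `at`, `detail`) and the `makeAudit` factory are assumptions; the real `audit_()` helper in `src/lib/mesh.ts` may look different.

```typescript
type AuditKind =
  | "consume" | "publish" | "tool-call" | "approval-decision"
  | "agent-kill" | "agent-restart" | "replay-start" | "replay-complete";

// Hypothetical record layout for ops.actions.audit.v1.
interface AuditRecord {
  topic: "ops.actions.audit.v1";
  key: string;                     // agent id, so one agent's history shares a key
  kind: AuditKind;
  at: string;                      // ISO-8601 timestamp
  detail: Record<string, unknown>;
}

type Publish = (record: AuditRecord) => void;

// Returns an audit function bound to a publisher and a clock (injectable for tests).
function makeAudit(publish: Publish, now: () => Date = () => new Date()) {
  return (agent: string, kind: AuditKind, detail: Record<string, unknown> = {}): AuditRecord => {
    const record: AuditRecord = {
      topic: "ops.actions.audit.v1",
      key: agent,
      kind,
      at: now().toISOString(),
      detail,
    };
    publish(record);
    return record;
  };
}
```

Restricting `kind` to a union type means a misspelled event kind fails at compile time instead of producing an audit entry nobody can query for.
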
Each phase has a visible, structured output:
| Phase | What it emits | Where to see it |
|---|---|---|
| Monitor | `MetricsSignal` (lag, ISR, rebalance state, controller epoch, share-group marker) on `ops.kafka.metrics.v1` | Click the Monitor edge, open Recent records |
| Reason | `ReasoningOutput` JSON `{rootCause, confidence, recommendedAction, requiresApproval, kafkaFeatureCited, proposedToolCall}` | Click Monitor agent → `lastReasoning` |
| Act | MCP JSON-RPC `tools/call` → `ActionResult` `{approved, approvedBy, outcome, detail, lagBefore, lagAfter}` | Click Monitor agent → `lastAction` |
| Learn | `LessonRecord` published to `ops.lessons.v1` with `{actionTaken, effective, lagBefore, lagAfter, adjustedThreshold, notes}`. Next reasoning prompt is seeded with the last 3. | Right inspector → Lessons tab |

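
The Learn phase closes the loop by turning an action's outcome into a record and feeding recent records into the next prompt. A minimal sketch follows. The `LessonRecord` fields match the table above and the "last 3" window comes from this README. The effectiveness rule (lag at least halved) and the threshold adjustment (10% tighter after an effective action) are invented for illustration.

```typescript
interface LessonRecord {
  actionTaken: string;
  effective: boolean;
  lagBefore: number;
  lagAfter: number;
  adjustedThreshold: number;
  notes: string;
}

// Assumed rule: an action counts as effective when lag falls by at least half.
function learn(actionTaken: string, lagBefore: number, lagAfter: number, threshold: number): LessonRecord {
  const effective = lagAfter <= lagBefore / 2;
  // Assumed adjustment: tighten the threshold by 10% after an effective action, else keep it.
  const adjustedThreshold = effective ? Math.round(threshold * 0.9) : threshold;
  return {
    actionTaken,
    effective,
    lagBefore,
    lagAfter,
    adjustedThreshold,
    notes: effective ? "lag recovered" : "lag did not recover",
  };
}

// Seeds the next reasoning prompt with the most recent lessons (last 3 by default).
function seedPrompt(lessons: LessonRecord[], count = 3): string {
  const recent = lessons.slice(-count);
  const lines = recent.map(
    (l) => `${l.actionTaken}: ${l.lagBefore} -> ${l.lagAfter} (${l.effective ? "effective" : "ineffective"})`
  );
  return `Last ${recent.length} lessons informed this prompt: ${lines.join("; ")}`;
}
```

Keeping the window small bounds prompt size while still letting the most recent outcomes influence the next decision.
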
┌──────────────────────────────────────────────────────────────────────┐
│ Agent Mesh SRE │
│ │
│ ┌──────────┐ ops.requests.v1 ┌──────────┐ │
│ │ Intake │ ─────────────────────▶ │ Monitor │ │
│ │ Agent │ │ Agent │ │
│ └──────────┘ │ │ ◀── ops.lessons.v1 │
│ ▲ │ M→R→A→L │ ↻ │
│ │ MCP tools/call │ │ │
│ │ └────┬─────┘ │
│ │ │
│ ops.incidents.v1 │
│ │ │
│ ▼ │
│ ┌──────────────┐ ops.actions.audit.v1 ┌──────────┐ │
│ │ Notification │ ◀───────────────────── │ Writer │ │
│ │ Agent │ │ Agent │ │
│ └──────┬───────┘ └──────────┘ │
│ │ │
│ ├── Slack / ITSM │
│ │ │
│ └── ops.notifications.v1 │
│ │
│ ─────────────────────────────────────────────────────────────── │
│ Apache Kafka 4.x (KRaft) — mTLS, SASL/SCRAM, scoped ACLs │
│ AsyncAPI 3.0 contracts + schema registry │
│ Audit topic: first-class, durable, replayable │
└──────────────────────────────────────────────────────────────────────┘
| Layer | Choice |
|---|---|
| Frontend | Next.js 14 + React 18 + TypeScript |
| Canvas | @xyflow/react (React Flow v12) + custom nodes/edges |
| Styling | Tailwind CSS v4 + hand-tuned theme tokens |
| State | Zustand |
| Icons | lucide-react |
| Transport | Server-Sent Events for the live stream |
| Kafka | In-memory simulator (swap for kafkajs for a real cluster) |
The agents speak to Kafka through a thin BrokerSim abstraction. In production, point them at:
- Apache Kafka 4.x in KRaft mode, deployed via Strimzi Cluster Operator or IBM Event Streams.
- SASL/SCRAM + mTLS — already advertised on the top bar.
- Scoped ACLs per agent identity — one Kafka principal per agent, least-privilege access.
- AsyncAPI 3.0 specs registered in an Apicurio-style schema registry.
- `ops.actions.audit.v1` with `cleanup.policy=compact,delete` and `retention.ms=31536000000` (one year) for compliance.
- KRaft controller leadership as a first-class signal into the Monitor agent's reasoning.

Runtime is Kubernetes. Strimzi's KafkaUser CRD + NetworkPolicies give you the principal-per-agent model described in Act 4.
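
The principal-per-agent model boils down to deriving each agent's ACLs from the topics it produces to and consumes from. The sketch below does that for a list of topic edges and returns rules shaped like the `acls` entries of a Strimzi `KafkaUser` with `simple` authorization (`resource` plus `operations`). The `TopicEdge` type and the `<agent>-group` consumer group naming are assumptions, not taken from this repository.

```typescript
// Hypothetical edge shape: `from` produces to the topic, `to` consumes from it.
interface TopicEdge { topic: string; from: string; to: string }

interface AclRule {
  resource: { type: "topic" | "group"; name: string; patternType: "literal" };
  operations: string[];
}

function aclsFor(agent: string, edges: TopicEdge[]): AclRule[] {
  const produces = new Set<string>();
  const consumes = new Set<string>();
  for (const edge of edges) {
    if (edge.from === agent) produces.add(edge.topic);
    if (edge.to === agent) consumes.add(edge.topic);
  }
  const topics = Array.from(new Set(Array.from(produces).concat(Array.from(consumes)))).sort();
  const rules: AclRule[] = topics.map((topic) => {
    const operations = ["Describe"];
    if (consumes.has(topic)) operations.push("Read");
    if (produces.has(topic)) operations.push("Write");
    return { resource: { type: "topic", name: topic, patternType: "literal" }, operations };
  });
  if (consumes.size > 0) {
    // Consumers also need Read on their consumer group; this group name is hypothetical.
    rules.push({ resource: { type: "group", name: `${agent}-group`, patternType: "literal" }, operations: ["Read"] });
  }
  return rules;
}
```

Deriving ACLs from the topology rather than writing them by hand keeps least privilege in step with the design: removing an edge removes the permission on the next generation.
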
- Reasoning is deterministic (no LLM call is made). The structured JSON shape matches what a real LLM reasoning step would emit. Drop in a Claude / GPT / watsonx.ai call in `Mesh.reason()` to swap it for a real model.
- Simulator is in-memory. Data does not persist across a server restart. In real mode the audit topic is durable; in simulator mode use Reset to clear state explicitly.
- The Intake agent is stub-simple (publishes on scenario trigger); the sophistication lives in Monitor. That matches the proposal.
- Builder → Deploy generates manifests but does not apply them automatically. Copy the YAML from the modal, review it, then `oc apply -f -` yourself; deploying generated infra without code review would defeat the audit story.
- Real-mode credentials are returned by `/api/cluster/credentials` only to localhost callers. If you expose the dev server on a LAN, set `KAFKA_MODE=mock` in the environment to be safe.

| Symptom | Likely cause | Fix |
|---|---|---|
| `Cannot find module '.next/server/middleware-manifest.json'` on `npm run dev` | A previous `next build` was interrupted and corrupted `.next/`. | `rm -rf .next && npm run dev`. |
| `Maximum update depth exceeded` on `/builder` | A Zustand selector returns a fresh object literal on every call. | Keep selectors primitive: `useBuilder((s) => s.field)`, not `useBuilder((s) => ({...}))`. |
| Setup wizard hangs at `Kafka/agent-mesh-kafka` waiting for `Ready` | Strimzi operator not installed, or the namespace is not in its watch list. | `oc get csv -A \| grep strimzi`, then re-install if missing. |
| Real-mode scenario buttons fail with 503 | The KafkaJS tap isn't connected. | `curl -X POST localhost:3000/api/cluster/tap -d '{"action":"enable"}'`, or re-run the wizard. |
| Top bar still says `MOCK` after install completes | Credential secret rolled out after the wizard finished polling. | Hit `GET /api/cluster/credentials` once; it auto-flips the mode. |
| Builder design didn't persist across reload | Browser is in private mode (no `localStorage`). | Use a normal window, or Export before reload. |

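
The selector advice in the table comes down to reference equality. The sketch below shows, without importing Zustand, why a selector that builds an object literal never compares equal to its previous result. The `BuilderState` shape is hypothetical.

```typescript
// Hypothetical slice of the builder store.
interface BuilderState { nodes: string[]; selectedId: string | null }

const builderState: BuilderState = { nodes: ["intake", "monitor"], selectedId: null };

// Returns an existing reference: stable across calls while the state is unchanged.
const selectNodes = (s: BuilderState) => s.nodes;

// Builds a fresh object literal: a new reference on every call.
const selectBoth = (s: BuilderState) => ({ nodes: s.nodes, selectedId: s.selectedId });

// Object.is is the reference comparison a store hook applies between renders.
const stableSelectorSame = Object.is(selectNodes(builderState), selectNodes(builderState));
const freshSelectorSame = Object.is(selectBoth(builderState), selectBoth(builderState));
```

Here `stableSelectorSame` is `true` and `freshSelectorSame` is `false`: the hook sees a "changed" value on every render and schedules another one. When several fields are needed together, select them one at a time, or use the shallow-comparison helper (`useShallow`) that recent Zustand versions provide.
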
MIT