fix(engine): route external wakeup callbacks through FlowOperation for correct node targeting#197

Closed
derek-miller wants to merge 1 commit into Netflix:main from derek-miller:derek-miller/flow-engine-controller-routing

@derek-miller (Collaborator) commented Mar 4, 2026

Pull Request type

  • Bugfix

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

FlowEngineController was injecting FlowExecutor and calling it directly for all three endpoints (startFlow, single wakeUp, bulk wakeUp). This bypassed RestBasedFlowOperation — the routing layer that looks up group ownership and forwards requests to the correct node.

The consequence: when an external system (e.g. Relay) fires an HTTP callback to the Kubernetes service DNS name, K8s round-robins the request to any pod. If that pod doesn't own the target group, FlowExecutor.wakeUp() silently returns false and the step falls back to its polling reconciliation cycle.
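To make the scale of the problem concrete, here is a minimal sketch that counts how many callbacks land on a non-owner pod, assuming requests are spread evenly across pods and exactly one pod owns the target group. The class name, the pod count of 16, and the request count are all hypothetical; this PR does not state the cluster size.

```java
/** Illustrative only: pod count and request count are made-up parameters. */
final class RoundRobinMissSketch {

  /** Returns how many of {@code requests} land on a pod other than {@code ownerPod}. */
  static int countMisses(int podCount, int ownerPod, int requests) {
    int misses = 0;
    for (int i = 0; i < requests; i++) {
      int receivingPod = i % podCount; // strict round-robin across pods (assumption)
      if (receivingPod != ownerPod) {
        misses++; // a non-owner pod would report false for this wakeup
      }
    }
    return misses;
  }

  public static void main(String[] args) {
    int pods = 16;       // hypothetical cluster size, not taken from the PR
    int requests = 1600; // hypothetical number of callbacks
    int misses = countMisses(pods, 3, requests);
    System.out.println("missed " + misses + " of " + requests + " wakeups");
  }
}
```

With these illustrative numbers, 1500 of 1600 callbacks (93.75%) miss the owner. In general the miss rate under this assumption is 1 - 1/N for N pods, so a drop rate like the ~94% quoted in the commit message would fit a cluster of roughly 16 or 17 pods, though that is an inference and not something the PR states.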

The internal wakeup path (InstanceActionJobEventProcessor / UpdateJobEventProcessor) already correctly goes through FlowOperation → RestBasedFlowOperation. This fix makes the external HTTP callback path consistent with that.

Fix: replace FlowExecutor injection in FlowEngineController with FlowOperation. RestBasedFlowOperation handles the routing:

  1. Looks up the group owner from the DB
  2. If local → executes directly via FlowExecutor
  3. If remote → forwards to the correct pod's address (no infinite loop: on the second hop the group is local)
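The three steps above can be sketched as follows. This is not the real RestBasedFlowOperation: `PodSketch` is a hypothetical stand-in in which an in-memory map replaces the DB ownership lookup and a direct method call replaces the HTTP forward. Returning -1 for a group with no known owner is also an assumption. The sketch returns the hop count to show why forwarding stops after at most two hops.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a pod running the routing layer. */
final class PodSketch {
  private final String address;
  private final Map<Long, String> ownerByGroup; // stands in for the DB ownership table
  private final Map<String, PodSketch> cluster; // stands in for HTTP forwarding by address
  final List<String> locallyExecuted = new ArrayList<>();

  PodSketch(String address, Map<Long, String> ownerByGroup, Map<String, PodSketch> cluster) {
    this.address = address;
    this.ownerByGroup = ownerByGroup;
    this.cluster = cluster;
    cluster.put(address, this);
  }

  /** Returns the number of hops taken, or -1 if the owner is unknown (assumed behavior). */
  int wakeUp(long groupId, String flowRef) {
    String owner = ownerByGroup.get(groupId); // 1. look up the group owner
    if (owner == null) {
      return -1;
    }
    if (owner.equals(address)) { // 2. local: execute directly
      locallyExecuted.add(flowRef);
      return 1;
    }
    PodSketch target = cluster.get(owner); // 3. remote: forward to the owner
    if (target == null) {
      return -1;
    }
    int hops = target.wakeUp(groupId, flowRef); // on the owner, the group is local
    return hops < 0 ? -1 : 1 + hops;
  }
}

final class RoutingSketchDemo {
  public static void main(String[] args) {
    Map<Long, String> owners = new HashMap<>();
    Map<String, PodSketch> cluster = new HashMap<>();
    PodSketch a = new PodSketch("pod-a", owners, cluster);
    PodSketch b = new PodSketch("pod-b", owners, cluster);
    owners.put(42L, "pod-b");

    System.out.println(a.wakeUp(42L, "ref-1")); // 2 (forwarded once, then local)
    System.out.println(b.wakeUp(42L, "ref-2")); // 1 (already on the owner)
    System.out.println(a.locallyExecuted);      // []
    System.out.println(b.locallyExecuted);      // [ref-1, ref-2]
  }
}
```

The non-owner pod never executes the wakeup itself, which is the same property the new tests check for the real implementation.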

The bulk wakeUp endpoint is also simplified from a stream of per-ref FlowExecutor calls to a single flowOperation.wakeUp(groupId, refs, code), doing one DB lookup for the whole batch.
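A rough sketch of why the single batched call matters once requests go through the routing layer: if the controller kept looping per ref, each call would pay its own ownership lookup. `CountingFlowOperationSketch` below is a hypothetical fake, not the real FlowOperation. The parameter and return types are assumptions, and `code` is passed through opaquely because this PR does not describe its meaning.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical routing facade that counts ownership lookups; types are assumptions. */
final class CountingFlowOperationSketch {
  int ownerLookups = 0;
  final Set<String> woken = new LinkedHashSet<>();

  /** Single wakeup: one ownership lookup per call. */
  boolean wakeUp(long groupId, String ref, String code) {
    ownerLookups++;
    woken.add(ref);
    return true;
  }

  /** Bulk wakeup: one ownership lookup for the whole batch. */
  boolean wakeUp(long groupId, Set<String> refs, String code) {
    ownerLookups++;
    woken.addAll(refs);
    return true;
  }
}

final class BulkWakeUpSketch {
  /** Per-ref style: N routed calls, hence N lookups. */
  static boolean perRef(CountingFlowOperationSketch op, long groupId, Set<String> refs, String code) {
    boolean all = true;
    for (String ref : refs) {
      all &= op.wakeUp(groupId, ref, code);
    }
    return all;
  }

  /** Batched style: a single routed call for all refs. */
  static boolean batched(CountingFlowOperationSketch op, long groupId, Set<String> refs, String code) {
    return op.wakeUp(groupId, refs, code);
  }

  public static void main(String[] args) {
    Set<String> refs = new LinkedHashSet<>(List.of("ref-1", "ref-2", "ref-3"));
    CountingFlowOperationSketch loop = new CountingFlowOperationSketch();
    CountingFlowOperationSketch batch = new CountingFlowOperationSketch();
    perRef(loop, 42L, refs, "code");
    batched(batch, 42L, refs, "code");
    System.out.println(loop.ownerLookups);  // 3
    System.out.println(batch.ownerLookups); // 1
  }
}
```

Both styles wake the same refs; only the number of lookups (and, for a remote owner, the number of forwarded requests) differs.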

New tests in RestBasedFlowOperationTest verify that when a group is owned by a remote pod, wakeUp (single and bulk) forwards via RestTemplate and never touches the local FlowExecutor.

…r correct pod targeting

When Relay fires an HTTP callback, Kubernetes DNS round-robins it to any
pod. FlowEngineController was calling FlowExecutor directly, bypassing the
RestBasedFlowOperation routing layer that looks up group ownership and
forwards to the correct pod. This caused ~94% of wakeups to be silently
dropped, falling back to the 30-minute polling reconciliation cycle.

Replace FlowExecutor injection in FlowEngineController with FlowOperation
so all three endpoints (startFlow, single wakeUp, bulk wakeUp) go through
the routing layer. Also add remote-routing tests to RestBasedFlowOperationTest.
@derek-miller force-pushed the derek-miller/flow-engine-controller-routing branch from d04f382 to 5e3e08b on March 4, 2026 18:06
@derek-miller marked this pull request as ready for review on March 4, 2026 18:09
@derek-miller changed the title from "fix(engine): route external wakeup callbacks through FlowOperation for correct pod targeting" to "fix(engine): route external wakeup callbacks through FlowOperation for correct node targeting" on Mar 4, 2026