fix(engine): route external wakeup callbacks through FlowOperation for correct node targeting #198

Merged
derek-miller merged 1 commit into Netflix:main from
derek-miller:derek-miller/flow-engine-controller-routing
Mar 21, 2026
@derek-miller
Collaborator

Pull Request type

  • Bugfix

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

FlowEngineController was injecting FlowExecutor and calling it directly for all three endpoints (startFlow, single wakeUp, bulk wakeUp). This bypassed RestBasedFlowOperation — the routing layer that looks up group ownership and forwards requests to the correct node.

The consequence: when an external system (e.g. Relay) fires an HTTP callback to the Kubernetes service DNS name, K8s round-robins the request to any pod. If that pod doesn't own the target group, FlowExecutor.wakeUp() silently returns false and the step falls back to its polling reconciliation cycle.

The internal wakeup path (InstanceActionJobEventProcessor / UpdateJobEventProcessor) already goes through FlowOperation → RestBasedFlowOperation. This fix makes the external HTTP callback path consistent with it.

Fix: replace FlowExecutor injection in FlowEngineController with FlowOperation. RestBasedFlowOperation handles the routing:

  1. Looks up the group owner from the DB
  2. If local → executes directly via FlowExecutor
  3. If remote → forwards to the correct pod's address (no infinite loop: on the second hop the group is local)
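
The three routing steps above can be sketched in plain Java. This is a minimal illustration, not the actual RestBasedFlowOperation code: `LocalExecutor`, `RemoteForwarder`, `RoutingFlowOperation`, the in-memory ownership map, and the parameter types (`long` group ID, `String` ref, `int` code) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the local executor, which only succeeds for groups this pod owns. */
interface LocalExecutor {
  boolean wakeUp(long groupId, String flowRef, int code);
}

/** Hypothetical stand-in for the HTTP hop to another pod. */
interface RemoteForwarder {
  boolean forwardWakeUp(String address, long groupId, String flowRef, int code);
}

/** Sketch of ownership-aware routing; the map stands in for the DB lookup. */
record RoutingFlowOperation(
    String localAddress,
    Map<Long, String> ownerByGroup,
    LocalExecutor executor,
    RemoteForwarder forwarder) {

  boolean wakeUp(long groupId, String flowRef, int code) {
    // Step 1: look up which pod owns the group.
    String owner = ownerByGroup.get(groupId);
    if (owner == null) {
      return false; // unknown group: nothing to wake up
    }
    // Step 2: the group is local, so execute directly.
    if (owner.equals(localAddress)) {
      return executor.wakeUp(groupId, flowRef, code);
    }
    // Step 3: the group is remote, so forward to the owner.
    // On the receiving pod the same check takes the local branch, so there is no loop.
    return forwarder.forwardWakeUp(owner, groupId, flowRef, code);
  }
}
```

With this shape, a callback that lands on a non-owner pod costs one extra hop instead of being dropped.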

The bulk wakeUp endpoint is also simplified from a stream of per-ref FlowExecutor calls to a single flowOperation.wakeUp(groupId, refs, code), doing one DB lookup for the whole batch.
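
To illustrate why a single batched call matters once requests go through the routing layer, the sketch below counts ownership lookups for a per-ref loop versus one call for the whole batch. `OwnerLookup`, `CountingOwnerLookup`, and `BulkWakeUpSketch` are hypothetical names, and the map again stands in for the DB; the real `flowOperation.wakeUp(groupId, refs, code)` signature is not reproduced here.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical ownership lookup; each call represents one DB round trip. */
interface OwnerLookup {
  String ownerOf(long groupId);
}

/** Counts lookups so the two strategies can be compared. */
record CountingOwnerLookup(Map<Long, String> owners, AtomicInteger calls) implements OwnerLookup {
  @Override
  public String ownerOf(long groupId) {
    calls.incrementAndGet();
    return owners.get(groupId);
  }
}

record BulkWakeUpSketch(OwnerLookup lookup) {

  /** Routing each ref separately: one lookup per ref. Returns the number of refs routed. */
  int perRef(long groupId, List<String> refs) {
    int routed = 0;
    for (String ref : refs) {
      if (lookup.ownerOf(groupId) != null) {
        routed++;
      }
    }
    return routed;
  }

  /** Routing the batch: all refs share a group, so one lookup covers them all. */
  int batched(long groupId, List<String> refs) {
    return lookup.ownerOf(groupId) != null ? refs.size() : 0;
  }
}
```

Both strategies route the same refs; only the number of lookups differs.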

New tests in RestBasedFlowOperationTest verify that when a group is owned by a remote pod, wakeUp (single and bulk) forwards via RestTemplate and never touches the local FlowExecutor.
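
The property those tests check can be shown with hand-written fakes instead of a mocking library. This is only a sketch of the test shape: `FakeableExecutor`, `FakeableHttp`, `RouterUnderTest`, and the URL path are hypothetical, and the actual tests in RestBasedFlowOperationTest use RestTemplate rather than these stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical seam for the local executor. */
interface FakeableExecutor {
  boolean wakeUp(long groupId, String ref, int code);
}

/** Hypothetical seam for the HTTP client used to reach the owning pod. */
interface FakeableHttp {
  boolean post(String url, String body);
}

/** Fake executor that records every local call. */
record RecordingExecutor(List<String> calls) implements FakeableExecutor {
  @Override
  public boolean wakeUp(long groupId, String ref, int code) {
    calls.add(groupId + "/" + ref);
    return true;
  }
}

/** Fake HTTP client that records every forwarded URL. */
record RecordingHttp(List<String> urls) implements FakeableHttp {
  @Override
  public boolean post(String url, String body) {
    urls.add(url);
    return true;
  }
}

/** Minimal router: local owner executes, remote owner forwards. The URL layout is made up. */
record RouterUnderTest(String self, String owner, FakeableExecutor executor, FakeableHttp http) {
  boolean wakeUp(long groupId, String ref, int code) {
    if (owner.equals(self)) {
      return executor.wakeUp(groupId, ref, code);
    }
    return http.post("http://" + owner + "/groups/" + groupId + "/wakeup", ref + ":" + code);
  }
}

final class RemoteRoutingCheck {
  private RemoteRoutingCheck() {}

  /** True when a remotely owned group is forwarded once and the local executor is never called. */
  static boolean remoteOwnerBypassesLocalExecutor() {
    List<String> localCalls = new ArrayList<>();
    List<String> remoteUrls = new ArrayList<>();
    RouterUnderTest router = new RouterUnderTest(
        "pod-a", "pod-b", new RecordingExecutor(localCalls), new RecordingHttp(remoteUrls));
    boolean result = router.wakeUp(7L, "flow-1", 1);
    return result && localCalls.isEmpty() && remoteUrls.size() == 1;
  }
}
```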

…r correct pod targeting

When Relay fires an HTTP callback, Kubernetes DNS round-robins it to any
pod. FlowEngineController was calling FlowExecutor directly, bypassing the
RestBasedFlowOperation routing layer that looks up group ownership and
forwards to the correct pod. This caused ~94% of wakeups to be silently
dropped, falling back to the 30-minute polling reconciliation cycle.

Replace FlowExecutor injection in FlowEngineController with FlowOperation
so all three endpoints (startFlow, single wakeUp, bulk wakeUp) go through
the routing layer. Also add remote-routing tests to RestBasedFlowOperationTest.
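
As a rough sanity check on the ~94% figure: if each group has exactly one owning pod and callbacks are spread uniformly over N pods, the expected fraction that lands on a non-owner is 1 - 1/N. The pod count is not stated in this PR; the sketch below only shows that a fleet of about 16 pods would be consistent with the quoted number. `WakeupMissRate` is a hypothetical helper.

```java
final class WakeupMissRate {
  private WakeupMissRate() {}

  /**
   * Expected fraction of callbacks that reach a non-owner pod,
   * assuming one owner per group and uniform distribution across pods.
   */
  static double missRate(int podCount) {
    if (podCount < 1) {
      throw new IllegalArgumentException("podCount must be >= 1");
    }
    return 1.0 - 1.0 / podCount;
  }
}
```

Under these assumptions, `WakeupMissRate.missRate(16)` evaluates to 0.9375.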
Collaborator

@praneethy91 praneethy91 left a comment

LGTM, thanks for the contribution!

Collaborator

@rdeepak2002 rdeepak2002 left a comment

Thanks for fixing this!

@derek-miller derek-miller merged commit dc786cb into Netflix:main Mar 21, 2026
1 check passed
@derek-miller derek-miller deleted the derek-miller/flow-engine-controller-routing branch March 21, 2026 03:01
3 participants