[FLINK-37766] FlinkSessionJob deletion blocked by finalizer #1101

Open
prshnt wants to merge 1 commit into apache:main from prshnt:fix/FLINK-37766
prshnt commented Apr 23, 2026

What is the purpose of the change

Fix the operator getting stuck in a CANCELLING loop when a job has already reached a terminal state (e.g. FAILED, FINISHED) before the cancel request is processed. cancelJobOrError now returns a boolean indicating whether the cancel is pending (true) or the job was already terminated (false), allowing the reconciler to skip the async re-observe wait and proceed directly to cleanup.

I ran into this issue myself while running Flink jobs on our cluster and, on searching the Flink JIRA, found it already reported: https://issues.apache.org/jira/browse/FLINK-37766

Brief change log
• Changed cancelJobOrError return type from void to boolean to distinguish between "cancel submitted, wait for async completion" vs "job already gone, proceed immediately"
• Extended isJobTerminated to also match Flink's "already reached another terminal state" error message (e.g. HTTP 400 BAD_REQUEST with that text), in addition to the existing HTTP CONFLICT check
• When the job is already missing or terminated during a STATELESS/CANCEL suspend, the operator no longer returns CancelResult.pending() — it falls through to CancelResult.completed()
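The three changes above can be sketched roughly as follows. This is a simplified, self-contained model and not the operator's actual code: `RestError`, the `Runnable` standing in for the REST cancel call, the `CancelResult` enum, and the use of 404 for a missing job are all assumptions made for illustration. Only the "already reached another terminal state" message, the CONFLICT check, and the boolean return contract come from the description.

```java
class CancelSketch {
    /** Hypothetical stand-in for a REST failure carrying an HTTP status and a message. */
    static final class RestError extends RuntimeException {
        final int httpStatus;

        RestError(int httpStatus, String message) {
            super(message);
            this.httpStatus = httpStatus;
        }
    }

    /** Simplified stand-in for CancelResult.pending() / CancelResult.completed(). */
    enum CancelResult { PENDING, COMPLETED }

    static final String TERMINAL_STATE_MSG = "already reached another terminal state";

    /** Existing CONFLICT (409) check, extended to also match the terminal-state message. */
    static boolean isJobTerminated(RestError e) {
        if (e.httpStatus == 409) {
            return true;
        }
        String msg = e.getMessage();
        return msg != null && msg.contains(TERMINAL_STATE_MSG);
    }

    /** Assumption: a missing job surfaces as HTTP 404. */
    static boolean isJobMissing(RestError e) {
        return e.httpStatus == 404;
    }

    /**
     * Returns true if the cancel was submitted and is still pending,
     * false if the job was already terminated (or missing and ignorable).
     */
    static boolean cancelJobOrError(Runnable restCancel, boolean ignoreMissing) {
        try {
            restCancel.run();
            return true;
        } catch (RestError e) {
            if (isJobTerminated(e) || (ignoreMissing && isJobMissing(e))) {
                return false;
            }
            throw e; // any other failure is still an error
        }
    }

    /** Only wait for an async re-observe when a cancel is actually in flight. */
    static CancelResult cancel(Runnable restCancel, boolean stateless) {
        return cancelJobOrError(restCancel, stateless)
                ? CancelResult.PENDING
                : CancelResult.COMPLETED;
    }

    public static void main(String[] args) {
        System.out.println(cancel(() -> { }, true)); // PENDING
        System.out.println(cancel(() -> {
            throw new RestError(400, "Job has " + TERMINAL_STATE_MSG + " (FAILED).");
        }, true)); // COMPLETED
    }
}
```

The point of the boolean is that the caller no longer has to infer from side effects whether a re-observe is needed; unrelated REST failures still propagate as before.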

Verifying this change

This change added tests and can be verified as follows:
• Added cancelErrorHandlingWithTerminalStateMessage unit test: simulates a REST client returning a 400 BAD_REQUEST with the "already reached another terminal state" message during cancel, and asserts that the job status transitions to FINISHED with the job ID cleared (rather than remaining stuck in CANCELLING)
• Updated existing cancelSessionJobTest to assert the job reaches FINISHED state (not CANCELLING) when the job is already gone during a stateless cancel
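The behaviour these tests assert can be illustrated with a small stand-alone model. The sketch below does not use the operator's test harness or its real status classes: `JobStatus`, `State`, `RestError`, and `cancel` are hypothetical names, and the cancel call is simulated with a `Runnable`. It only shows the expected outcome, namely FINISHED with the job ID cleared when the cancel call reports that the job is already terminal, and CANCELLING otherwise.

```java
class CancelStatusSketch {
    enum State { RUNNING, CANCELLING, FINISHED }

    /** Hypothetical, minimal stand-in for the operator's job status object. */
    static final class JobStatus {
        State state = State.RUNNING;
        String jobId;

        JobStatus(String jobId) {
            this.jobId = jobId;
        }
    }

    /** Hypothetical REST failure with an HTTP status code and a message. */
    static final class RestError extends RuntimeException {
        final int httpStatus;

        RestError(int httpStatus, String message) {
            super(message);
            this.httpStatus = httpStatus;
        }
    }

    static boolean alreadyTerminal(RestError e) {
        String msg = e.getMessage();
        return e.httpStatus == 409
                || (msg != null && msg.contains("already reached another terminal state"));
    }

    /** Applies the outcome of a (simulated) cancel call to the status. */
    static void cancel(JobStatus status, Runnable restCancel) {
        try {
            restCancel.run();
            status.state = State.CANCELLING; // cancel accepted, completion is async
        } catch (RestError e) {
            if (!alreadyTerminal(e)) {
                throw e;
            }
            status.state = State.FINISHED; // nothing left to wait for
            status.jobId = null;
        }
    }

    public static void main(String[] args) {
        JobStatus status = new JobStatus("job-1");
        cancel(status, () -> {
            throw new RestError(400, "Job has already reached another terminal state (FAILED).");
        });
        System.out.println(status.state + " " + status.jobId); // FINISHED null
    }
}
```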

Does this pull request potentially affect one of the following parts:

• Dependencies: no
• Public API / CRD changes: no
• Core observer or reconciler logic that is regularly executed: yes — the cancel/suspend path in AbstractFlinkService

Documentation

• Does this pull request introduce a new feature? no (bug fix)

prshnt changed the title from "FLINK-37766 - FlinkSessionJob deletion blocked by finalizer" to "[FLINK-37766 ] FlinkSessionJob deletion blocked by finalizer" on Apr 23, 2026
prshnt changed the title from "[FLINK-37766 ] FlinkSessionJob deletion blocked by finalizer" to "[FLINK-37766] FlinkSessionJob deletion blocked by finalizer" on Apr 23, 2026
cancelJobOrError(clusterClient, status, suspendMode == SuspendMode.STATELESS);
// This is async we need to return and re-observe
return CancelResult.pending();
if (cancelJobOrError(
Contributor

One concern on the scope: this PR only hardens the FlinkSessionJob cancel path (AbstractFlinkService#cancelSessionJob / the session-job finalizer), but AbstractFlinkService#cancelJob, used for the FlinkDeployment (application cluster) path, is left untouched and can hit the exact same failure mode described in FLINK-37766.

Could we:

  1. Extend the new "already terminal / not found" handling to AbstractFlinkService#cancelJob?
  2. Add a unit test mirroring the new session-job test, but against cancelJob, with a mocked JM response returning the terminal-state error (and ideally a 404), asserting the finalizer completes cleanly?
  3. Update the PR description (and ideally the JIRA) to make clear the fix covers both CR types?
