[FLINK-37766] FlinkSessionJob deletion blocked by finalizer#1101
Open
prshnt wants to merge 1 commit intoapache:mainfrom
Open
[FLINK-37766] FlinkSessionJob deletion blocked by finalizer#1101prshnt wants to merge 1 commit intoapache:mainfrom
prshnt wants to merge 1 commit intoapache:mainfrom
Conversation
| cancelJobOrError(clusterClient, status, suspendMode == SuspendMode.STATELESS); | ||
| // This is async we need to return and re-observe | ||
| return CancelResult.pending(); | ||
| if (cancelJobOrError( |
Contributor
There was a problem hiding this comment.
One concern on the scope: this PR only hardens the FlinkSessionJob cancel path (AbstractFlinkService#cancelSessionJob / the session-job finalizer), but AbstractFlinkService#cancelJob, used for the FlinkDeployment (application cluster) path, is left untouched and can hit the exact same failure mode described in FLINK-37766:
Could we:
- Extend the new "already terminal / not found" handling to
AbstractFlinkService#cancelJob? - Add a unit test mirroring the new session-job test, but against
cancelJob, with a mocked JM response returning the terminal-state error (and ideally a 404), asserting the finalizer completes cleanly. - Update the PR description (and ideally the JIRA) to make clear the fix covers both CR types?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What is the purpose of the change
Fix the operator getting stuck in a CANCELLING loop when a job has already reached a terminal state (e.g. FAILED, FINISHED) before the cancel request is processed. cancelJobOrError now returns a boolean indicating whether the cancel is pending (true) or the job was already terminated (false), allowing the reconciler to skip the async re-observe wait and proceed directly to cleanup.
Having encountered the issue myself when running Flink jobs on our cluster upon searching the issues in Flink Confluence encountered: https://issues.apache.org/jira/browse/FLINK-37766
Brief change log
• Changed cancelJobOrError return type from void to boolean to distinguish between "cancel submitted, wait for async completion" vs "job already gone, proceed immediately"
• Extended isJobTerminated to also match Flink's "already reached another terminal state" error message (e.g. HTTP 400 BAD_REQUEST with that text), in addition to the existing HTTP CONFLICT check
• When the job is already missing or terminated during a STATELESS/CANCEL suspend, the operator no longer returns CancelResult.pending() — it falls through to CancelResult.completed()
Verifying this change
This change added tests and can be verified as follows:
• Added cancelErrorHandlingWithTerminalStateMessage unit test: simulates a REST client returning a 400 BAD_REQUEST with the "already reached another terminal state" message during cancel, and asserts that the job status transitions to FINISHED with the job ID cleared (rather than remaining stuck in CANCELLING)
• Updated existing cancelSessionJobTest to assert the job reaches FINISHED state (not CANCELLING) when the job is already gone during a stateless cancel
Does this pull request potentially affect one of the following parts:
• Dependencies: no
• Public API / CRD changes: no
• Core observer or reconciler logic that is regularly executed: yes — the cancel/suspend path in AbstractFlinkService
Documentation
• Does this pull request introduce a new feature? no (bug fix)