Add comment/example for how to use retention for flink stateful failover#7478
Add comment/example for how to use retention for flink stateful failover#7478dahuo98 wants to merge 1 commit into
Conversation
Signed-off-by: dahuo98 <sxdahuo@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the documentation for Flink resource customization by providing a practical example of how to handle stateful failover. By adding a Lua script snippet, it guides users on how to preserve the initialSavepointPath in member clusters, preventing it from being overwritten by the Karmada control plane during failover scenarios. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Pull request overview
Adds inline guidance to Karmada’s third-party FlinkDeployment resource interpreter customization to illustrate how to use the retention hook to preserve spec.job.initialSavepointPath during Flink stateful failover (e.g., when failover injects state via labels from StatePreservation).
Changes:
- Adds a new
retentioncustomization section to the FlinkDeployment interpreter YAML. - Includes a commented Lua example showing how to retain
initialSavepointPathwhen a failover label is present.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| retention: | ||
| luaScript: | | ||
| -- For failing over flink applications, if initialSavepointPath is injected in member clusters, we will need to | ||
| -- retain it to avoid being overwritten by karmada control plane | ||
| -- For example, the following script retains initialSavepointPath if resourcebinding.karmada.io/failover-jobid | ||
| -- label present, this label is set by defining StatePreservation in propagation policy |
There was a problem hiding this comment.
Code Review
This pull request adds a retention Lua script for FlinkDeployment resources to preserve the initialSavepointPath during application failover. The review feedback correctly identifies a violation of the project's style guide (Rule 96) regarding commented-out code in interpreted languages and suggests enabling the function by default while removing an unused variable.
| -- For failing over flink applications, if initialSavepointPath is injected in member clusters, we will need to | ||
| -- retain it to avoid being overwritten by karmada control plane | ||
| -- For example, the following script retains initialSavepointPath if resourcebinding.karmada.io/failover-jobid | ||
| -- label present, this label is set by defining StatePreservation in propagation policy | ||
| ---------------------------- Code Example ---------------------------------------------------------------------- | ||
| -- function Retain(desiredObj, observedObj) | ||
| -- local ocfg = observedObj.spec and observedObj.spec.flinkConfiguration or nil | ||
| -- local labels = observedObj.metadata and observedObj.metadata.labels or {} | ||
| -- -- Retain initialSavepointPath only when failover label present | ||
| -- if labels["resourcebinding.karmada.io/failover-jobid"] ~= nil then | ||
| -- if observedObj.spec and observedObj.spec.job then | ||
| -- if observedObj.spec.job.initialSavepointPath ~= nil then | ||
| -- desiredObj.spec = desiredObj.spec or {} | ||
| -- desiredObj.spec.job = desiredObj.spec.job or {} | ||
| -- desiredObj.spec.job.initialSavepointPath = observedObj.spec.job.initialSavepointPath | ||
| -- end | ||
| -- end | ||
| -- end | ||
| -- end |
There was a problem hiding this comment.
The Retain function is currently provided as a commented-out example. According to the repository style guide (Rule 96), it is prohibited to use comment lines to disable functions in interpreted languages. Since this logic is essential for supporting stateful failover and is safely guarded by a label check (ensuring it only runs when StatePreservation is enabled), it should be active by default. Additionally, the unused variable ocfg should be removed to maintain code quality.
-- For failing over flink applications, if initialSavepointPath is injected in member clusters, we will need to
-- retain it to avoid being overwritten by karmada control plane.
-- The following script retains initialSavepointPath if resourcebinding.karmada.io/failover-jobid
-- label present, this label is set by defining StatePreservation in propagation policy.
function Retain(desiredObj, observedObj)
local labels = observedObj.metadata and observedObj.metadata.labels or {}
-- Retain initialSavepointPath only when failover label present
if labels["resourcebinding.karmada.io/failover-jobid"] ~= nil then
if observedObj.spec and observedObj.spec.job then
if observedObj.spec.job.initialSavepointPath ~= nil then
desiredObj.spec = desiredObj.spec or {}
desiredObj.spec.job = desiredObj.spec.job or {}
desiredObj.spec.job.initialSavepointPath = observedObj.spec.job.initialSavepointPath
end
end
end
endReferences
- It is strictly prohibited to use forms such as comment lines to merely disable the functions in interpreted languages. (link)
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7478 +/- ##
=======================================
Coverage 41.92% 41.93%
=======================================
Files 879 879
Lines 54328 54328
=======================================
+ Hits 22778 22782 +4
+ Misses 29828 29825 -3
+ Partials 1722 1721 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Thanks! I'm wondering if we could document this scenario on the website instead of using code comments. For example, we could add it to a page like this: https://karmada.io/docs/next/userguide/failover/cluster-failover/#example-of-statepreservation-configuration |
|
@zhzhuang-zju Thanks for the comment! Totally agree, I think it's worth documenting under the statepreservation configuration feature, especially in the FlinkDeployment case. I believe we were planning on adding best-practices documentation specifically for the FlinkDeployment use-case, so it would be helpful to include this information there as well. (cc @RainbowMango ) As for this PR, I'm not opposed to having a simple default retain function that just retains the initialSavepointPath for the application in cases of FlinkDeployment - since it is critical to ensuring graceful recovery. As part of the best-practices for FlinkDeployment, we'd simply ask users to specifically use |
Yes! And the issue #6926 is tracking for this. By the way, it would be great to have a case study on cncf.io (Like the RedNote case study). @mszacillo Are you interested in this? You would be the best expert to write this. |
Do you mean the default retention function always retains My concern is that the label is not reserved, so it is not appropriate to be coded here. |
What type of PR is this?
/kind documentation
What this PR does / why we need it:
This PR adds a section of comment in retention section of flink's resource interpreter. The comment shows what needs to be done if one wants to start flink app from latest checkpoint/savepoint during failover event.
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?: