Add comment/example for how to use retention for flink stateful failover #7478

Open
dahuo98 wants to merge 1 commit into karmada-io:master from dahuo98:retention-comment

@dahuo98
Contributor

@dahuo98 dahuo98 commented May 6, 2026

What type of PR is this?
/kind documentation

What this PR does / why we need it:
This PR adds a comment block to the retention section of the Flink resource interpreter customization. The comment shows what is needed to start a Flink application from its latest checkpoint/savepoint during a failover event.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Copilot AI review requested due to automatic review settings May 6, 2026 22:14
@karmada-bot karmada-bot added the kind/documentation Categorizes issue or PR as related to documentation. label May 6, 2026
@karmada-bot karmada-bot requested review from mszacillo and yike21 May 6, 2026 22:14
@karmada-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chaunceyjiang for approval. For more information see the Code Review Process.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the documentation for Flink resource customization by providing a practical example of how to handle stateful failover. By adding a Lua script snippet, it guides users on how to preserve the initialSavepointPath in member clusters, preventing it from being overwritten by the Karmada control plane during failover scenarios.

Highlights

  • Documentation Update: Added a new retention section to the Flink resource interpreter configuration.
  • Lua Script Example: Included a commented-out Lua script example demonstrating how to retain the initialSavepointPath during Flink application failover events.

@karmada-bot karmada-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 6, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds inline guidance to Karmada’s third-party FlinkDeployment resource interpreter customization to illustrate how to use the retention hook to preserve spec.job.initialSavepointPath during Flink stateful failover (e.g., when failover injects state via labels from StatePreservation).

Changes:

  • Adds a new retention customization section to the FlinkDeployment interpreter YAML.
  • Includes a commented Lua example showing how to retain initialSavepointPath when a failover label is present.

Comment on lines +338 to +343
    retention:
      luaScript: |
        -- For failing over flink applications, if initialSavepointPath is injected in member clusters, we will need to
        -- retain it to avoid being overwritten by karmada control plane
        -- For example, the following script retains initialSavepointPath if resourcebinding.karmada.io/failover-jobid
        -- label present, this label is set by defining StatePreservation in propagation policy

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds a retention Lua script for FlinkDeployment resources to preserve the initialSavepointPath during application failover. The review feedback correctly identifies a violation of the project's style guide (Rule 96) regarding commented-out code in interpreted languages and suggests enabling the function by default while removing an unused variable.

Comment on lines +340 to +358
        -- For failing over flink applications, if initialSavepointPath is injected in member clusters, we will need to
        -- retain it to avoid being overwritten by karmada control plane
        -- For example, the following script retains initialSavepointPath if resourcebinding.karmada.io/failover-jobid
        -- label present, this label is set by defining StatePreservation in propagation policy
        ---------------------------- Code Example ----------------------------------------------------------------------
        -- function Retain(desiredObj, observedObj)
        --   local ocfg = observedObj.spec and observedObj.spec.flinkConfiguration or nil
        --   local labels = observedObj.metadata and observedObj.metadata.labels or {}
        --   -- Retain initialSavepointPath only when failover label present
        --   if labels["resourcebinding.karmada.io/failover-jobid"] ~= nil then
        --     if observedObj.spec and observedObj.spec.job then
        --       if observedObj.spec.job.initialSavepointPath ~= nil then
        --         desiredObj.spec = desiredObj.spec or {}
        --         desiredObj.spec.job = desiredObj.spec.job or {}
        --         desiredObj.spec.job.initialSavepointPath = observedObj.spec.job.initialSavepointPath
        --       end
        --     end
        --   end
        -- end
medium

The Retain function is currently provided as a commented-out example. According to the repository style guide (Rule 96), it is prohibited to use comment lines to disable functions in interpreted languages. Since this logic is essential for supporting stateful failover and is safely guarded by a label check (ensuring it only runs when StatePreservation is enabled), it should be active by default. Additionally, the unused variable ocfg should be removed to maintain code quality.

        -- For failing over flink applications, if initialSavepointPath is injected in member clusters, we will need to 
        -- retain it to avoid being overwritten by karmada control plane.
        -- The following script retains initialSavepointPath if resourcebinding.karmada.io/failover-jobid
        -- label present, this label is set by defining StatePreservation in propagation policy. 
        function Retain(desiredObj, observedObj)
          local labels = observedObj.metadata and observedObj.metadata.labels or {}
          -- Retain initialSavepointPath only when failover label present
          if labels["resourcebinding.karmada.io/failover-jobid"] ~= nil then
            if observedObj.spec and observedObj.spec.job then
              if observedObj.spec.job.initialSavepointPath ~= nil then
                desiredObj.spec = desiredObj.spec or {}
                desiredObj.spec.job = desiredObj.spec.job or {}
                desiredObj.spec.job.initialSavepointPath = observedObj.spec.job.initialSavepointPath
              end
            end
          end
          -- Return the object so the interpreter can use it as the retained result.
          return desiredObj
        end
References
  1. It is strictly prohibited to use forms such as comment lines to merely disable the functions in interpreted languages. (link)
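
To see how the label guard in the suggested function behaves, the logic can be exercised outside Karmada with plain Lua tables. This is a minimal sketch, not part of the PR: the sample objects, the job ID and the `s3://hypothetical-bucket/...` path are made up for illustration, and the final `return desiredObj` is an assumption based on the `Retain` examples in the Karmada documentation, which return the retained object.

```lua
-- Retain with the same label guard as the suggestion above.
-- Assumption: the interpreter uses the returned table as the retained object.
function Retain(desiredObj, observedObj)
  local labels = observedObj.metadata and observedObj.metadata.labels or {}
  -- Retain initialSavepointPath only when the failover label is present
  if labels["resourcebinding.karmada.io/failover-jobid"] ~= nil then
    local job = observedObj.spec and observedObj.spec.job
    if job and job.initialSavepointPath ~= nil then
      desiredObj.spec = desiredObj.spec or {}
      desiredObj.spec.job = desiredObj.spec.job or {}
      desiredObj.spec.job.initialSavepointPath = job.initialSavepointPath
    end
  end
  return desiredObj
end

-- Hypothetical objects: "desired" stands for what the control plane would apply,
-- "observed" for what currently exists in the member cluster.
local function desired()
  return { spec = { job = { upgradeMode = "savepoint" } } }
end

local observedWithLabel = {
  metadata = { labels = { ["resourcebinding.karmada.io/failover-jobid"] = "hypothetical-job-id" } },
  spec = { job = { initialSavepointPath = "s3://hypothetical-bucket/checkpoints/chk-42" } },
}

local observedWithoutLabel = {
  metadata = { labels = {} },
  spec = { job = { initialSavepointPath = "s3://hypothetical-bucket/checkpoints/chk-42" } },
}

local retained = Retain(desired(), observedWithLabel)
local untouched = Retain(desired(), observedWithoutLabel)

print(retained.spec.job.initialSavepointPath)   -- the path observed in the member cluster
print(untouched.spec.job.initialSavepointPath)  -- nil: no failover label, nothing retained
```

Other fields of the desired object (here `upgradeMode`) pass through unchanged in both cases; only `initialSavepointPath` is copied back, and only when the label is set.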

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.93%. Comparing base (e669d0d) to head (4c7b9a2).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7478   +/-   ##
=======================================
  Coverage   41.92%   41.93%           
=======================================
  Files         879      879           
  Lines       54328    54328           
=======================================
+ Hits        22778    22782    +4     
+ Misses      29828    29825    -3     
+ Partials     1722     1721    -1     
Flag Coverage Δ
unittests 41.93% <ø> (+<0.01%) ⬆️

@zhzhuang-zju
Contributor

Thanks! I'm wondering if we could document this scenario on the website instead of using code comments. For example, we could add it to a page like this: https://karmada.io/docs/next/userguide/failover/cluster-failover/#example-of-statepreservation-configuration
Also, let's ask @mszacillo to take a look, since he has a strong interest and expertise in using FlinkDeployment.

@mszacillo
Member

@zhzhuang-zju Thanks for the comment! Totally agree, I think it's worth documenting under the StatePreservation configuration feature, especially in the FlinkDeployment case. I believe we were planning on adding best-practices documentation specifically for the FlinkDeployment use case, so it would be helpful to include this information there as well. (cc @RainbowMango)

As for this PR, I'm not opposed to having a simple default retain function that just retains the initialSavepointPath for the application in cases of FlinkDeployment - since it is critical to ensuring graceful recovery. As part of the best-practices for FlinkDeployment, we'd simply ask users to specifically use resourcebinding.karmada.io/failover-jobid as the state preservation label to trigger the retention logic.

@RainbowMango
Member

> I believe we were planning on adding best-practices documentation specifically for the FlinkDeployment use-case, so it would be helpful to include this information there as well.

Yes! Issue #6926 is tracking this.

By the way, it would be great to have a case study on cncf.io (like the RedNote case study). @mszacillo Are you interested in this? You would be the best person to write it.

@RainbowMango
Member

> As for this PR, I'm not opposed to having a simple default retain function that just retains the initialSavepointPath for the application in cases of FlinkDeployment - since it is critical to ensuring graceful recovery. As part of the best-practices for FlinkDeployment, we'd simply ask users to specifically use resourcebinding.karmada.io/failover-jobid as the state preservation label to trigger the retention logic.

Do you mean that the default retention function always retains .spec.job.initialSavepointPath, regardless of whether the resourcebinding.karmada.io/failover-jobid label is present?

My concern is that the label is not reserved, so it is not appropriate to hard-code it here.
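
To make the question concrete, an unconditional variant could look like the sketch below. It is an illustration only, not code from the PR or from Karmada: it drops the label check entirely and copies `initialSavepointPath` back whenever the member cluster has one. The sample path is hypothetical, and the `return desiredObj` follows the `Retain` examples in the Karmada documentation.

```lua
-- Hypothetical unconditional variant: no label is consulted.
function Retain(desiredObj, observedObj)
  local job = observedObj.spec and observedObj.spec.job
  if job and job.initialSavepointPath ~= nil then
    desiredObj.spec = desiredObj.spec or {}
    desiredObj.spec.job = desiredObj.spec.job or {}
    desiredObj.spec.job.initialSavepointPath = job.initialSavepointPath
  end
  return desiredObj
end

-- The observed object carries a path but no labels at all.
local observed = { spec = { job = { initialSavepointPath = "s3://hypothetical-bucket/savepoints/sp-1" } } }
local result = Retain({ spec = { job = {} } }, observed)
print(result.spec.job.initialSavepointPath)  -- the observed path is retained

-- With no job in the observed object, the desired object is returned unchanged.
local bare = Retain({}, { spec = {} })
print(bare.spec)  -- nil
```

In this variant, a value present in the member cluster always wins over the control plane's value for this one field, which is the trade-off against avoiding a hard-coded label.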

Labels

kind/documentation Categorizes issue or PR as related to documentation. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
