H4HIP: Wait with kstatus #374


Merged · 18 commits · Feb 19, 2025
Conversation

AustinAbro321 (Contributor)
Proposal to replace the current wait logic in Helm with kstatus.

Signed-off-by: Austin Abro <[email protected]>
@gjenkins8 (Member) left a comment


Thanks for the HIP! I have been wanting to write this one myself for some time. I agree, kstatus is where the Kubernetes community has put significant effort into thinking about Kubernetes resource "readiness". And Helm would do well to reuse this effort.

I have left some comments. They are mostly centered on what behavioral differences (if any) users would notice relative to the existing mechanism, and how to mitigate/manage those.


<!-- TODO: Decide if we want more than alphabetically, such as - The APIVersion/Kind of the resource will determine its priority for being logged. For example, the first log messages will always describe deployments. All deployments will be logged first. Once all deployments are in ready status, all stateful sets will be logged, and so forth. -->

## Backwards compatibility
Member

Is there any situation where kstatus will not return ready, but existing logic would?

Contributor Author

Besides the two called out here (kstatus waiting to return ready until reconciliation is complete, and waiting for CRDs), I am not thinking of any others, but I am not 100% sure.
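For reference, the CRD case comes down to the `Established` condition: kstatus (like `kubectl wait --for condition=established`) treats a CRD as ready only once that condition reports `True`. Below is a self-contained, simplified sketch of that check over a CRD decoded as generic JSON; it illustrates the idea and is not the actual kstatus code.

```go
package main

import "fmt"

// crdEstablished reports whether a CRD object (decoded as generic JSON)
// has its Established condition set to "True" — the conventional signal
// that the API server is serving the new resource type.
// Simplified illustration, not the real kstatus implementation.
func crdEstablished(obj map[string]interface{}) bool {
	status, ok := obj["status"].(map[string]interface{})
	if !ok {
		return false
	}
	conds, ok := status["conditions"].([]interface{})
	if !ok {
		return false
	}
	for _, c := range conds {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "Established" && cond["status"] == "True" {
			return true
		}
	}
	return false
}

func main() {
	crd := map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []interface{}{
				map[string]interface{}{"type": "Established", "status": "True"},
			},
		},
	}
	fmt.Println(crdEstablished(crd)) // true
}
```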

Member

I suspect, to be on the safe side, we will want/need to support a fallback flag to allow users to fall back to the "legacy" wait/ready code. Ideally we would deprecate that flag in e.g. Helm 5 and remove the old code in a future version. Not sure what we would do if the legacy code provides some behavior that kstatus doesn't (ideally this doesn't happen, of course), but I think we should allow users the fallback when upgrading to Helm 4 just in case.

Contributor Author

kstatus seems to be pretty extensible; we can implement a custom status reader and override the behavior for any resource type. I don't think we will have any long-term issues.
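The extensibility point can be sketched as a per-GroupKind registry of status functions with a default fallback. This is a self-contained stand-in for the concept behind cli-utils' custom status readers, using made-up local types (`StatusFunc`, `StatusEngine`) rather than the real `engine.StatusReader` interface:

```go
package main

import "fmt"

// StatusFunc computes a status string from a resource decoded as generic
// JSON. It mirrors, in much-simplified form, the idea of a kstatus custom
// status reader: a per-GroupKind hook that overrides the default
// readiness computation. These types are illustrative, not cli-utils API.
type StatusFunc func(obj map[string]interface{}) string

type StatusEngine struct {
	readers  map[string]StatusFunc // keyed by "group/Kind"
	fallback StatusFunc
}

// Compute dispatches to a registered custom reader when one exists for
// the GroupKind, otherwise falls back to the default computation.
func (e *StatusEngine) Compute(groupKind string, obj map[string]interface{}) string {
	if fn, ok := e.readers[groupKind]; ok {
		return fn(obj)
	}
	return e.fallback(obj)
}

func main() {
	engine := &StatusEngine{
		readers: map[string]StatusFunc{
			// Custom reader: treat a Job as ready only once it reports completion.
			"batch/Job": func(obj map[string]interface{}) string {
				if done, _ := obj["succeeded"].(bool); done {
					return "Current"
				}
				return "InProgress"
			},
		},
		// Default: assume ready (a real engine would inspect the object).
		fallback: func(obj map[string]interface{}) string { return "Current" },
	}
	fmt.Println(engine.Compute("batch/Job", map[string]interface{}{"succeeded": true})) // Current
}
```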

@mattfarina (Contributor) left a comment

Thanks for the HIP. I like the idea of using something from the Kubernetes community to know the status. When Helm's current code was built, nothing like this was available.

Signed-off-by: Austin Abro <[email protected]>
@AustinAbro321 (Contributor Author)

Thank you both for the feedback. I am aiming to create a draft PR sometime next week so we can get a sense of what it will look like.

@AustinAbro321 (Contributor Author)

@mattfarina @gjenkins8 I created a draft PR with my implementation and updated this proposal with some of the finer details. LMK what feedback / questions you have

Draft PR - helm/helm#13604

hips/hip-0999.md Outdated

`WaitAndGetCompletedPodPhase` is an exported function that is not called anywhere within the Helm repository. It will be removed.

I can't speak for helm maintainers, but it looks like this method is part of their public API, and not deprecated, so I'd be careful with removing it right away.

Contributor Author

I believe since this is targeted at Helm v4, breaking changes in the public API are acceptable.

Member

I made a PR to get rid of this immediately: helm/helm#13665

hips/hip-0999.md Outdated

`WaitAndGetCompletedPodPhase` is an exported function that is not called anywhere within the Helm repository. It will be removed.

`WatchUntilReady` is used only for hooks. It has custom wait logic different from the Helm 3 general logic. Ideally, this could be replaced with a regular `Wait()` call. If there is any historical context as to why this logic is the way it is, please share.

Again similar comment, I'd be careful breaking API. Either a deprecation or just wire the method to invoke the same underlying code.

Contributor Author

Ditto with Helm 4 comment

Member

I'm not sure why hooks have different logic. Hooks are typically jobs/pods, so perhaps the logic is very simple, i.e. check the pod has exited? Would need to go look.

That said, I have heard of folk using hooks for e.g. cluster pre-flight custom resources (@z4ce I think?)

If we could analyze and/or test whether hooks can use kstatus, that would be good. Same "legacy" fallback comments apply.

Worth mentioning too: the decision we make will remain in place for the lifetime of Helm 4 (i.e. we won't change the default once Helm 4 is released).

Contributor Author

Took a look at the hook logic, and it waits for pods and jobs to be completed rather than just ready. I believe this can be done with kstatus, and I will verify later this week. I took a look at how Flux does it, and it seems we could implement the same logic with kstatus; it will just involve custom code. If my assumptions are correct, we will keep the WatchUntilReady function and move it into the Waiter interface.

Flux Job wait implementation - https://github.com/fluxcd/kustomize-controller/blob/main/internal/statusreaders/job.go
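The completed-rather-than-ready distinction comes down to the Job's status conditions: the Flux status reader linked above treats a Job as done once its `Complete` condition is `True` (and as failed once `Failed` is `True`). A self-contained sketch of that decision, illustrative only and not the actual Flux code:

```go
package main

import "fmt"

// jobStatus derives a wait state from a Job's status conditions,
// mirroring (in simplified form) the Flux job status reader linked
// above: "Current" once Complete=True, "Failed" once Failed=True,
// otherwise still "InProgress". Illustrative sketch, not Flux's code.
func jobStatus(conditions []map[string]string) string {
	for _, c := range conditions {
		if c["type"] == "Complete" && c["status"] == "True" {
			return "Current"
		}
		if c["type"] == "Failed" && c["status"] == "True" {
			return "Failed"
		}
	}
	return "InProgress"
}

func main() {
	fmt.Println(jobStatus([]map[string]string{{"type": "Complete", "status": "True"}})) // Current
	fmt.Println(jobStatus(nil))                                                        // InProgress
}
```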

Signed-off-by: Austin Abro <[email protected]>
hips/hip-0999.md Outdated

`WatchUntilReady` is used only for hooks. It has custom wait logic different from the Helm 3 general logic. Ideally, this could be replaced with a regular `Wait()` call. If there is any historical context as to why this logic is the way it is, please share.

The Helm CLI will always use the `statusWaiter` implementation; if this is found to be insufficient during Helm 4 testing, a new CLI flag `wait-strategy` will be introduced with options `status` and `legacy` to allow usage of the `HelmWaiter`. If the `statusWaiter` is found to be sufficient, the `HelmWaiter` will be deleted from the public SDK before Helm 4 is released.
Member

I think we can use the existing --wait flag, e.g. --wait=true|false|none|kstatus|legacy (true and false are placeholders for kstatus and none respectively, to stay compatible with existing users who might have --wait=true, if we decide this is warranted for Helm 4; otherwise we just error for those users).

Up for suggestions on whether we use the name "kstatus" in the flag. Or whether we consider this an "implementation detail". I suspect we will want to refer users to kstatus docs generally for behavioral reference, so it is okay to refer to kstatus directly

Contributor Author

The only hiccup I see with structuring the --wait flag like this is that Helm hooks always wait regardless of whether the wait flag is set. So a user wouldn't be able to skip waiting while still using the old Helm hook logic. Still, based on my other comment, we will end up doing a custom implementation anyway, so we should be able to create relatively similar logic.

The other option would be introducing --wait-strategy and --hook-wait-strategy flags. IMO it's better to avoid the extra user-facing complexity, but I'll leave the call to the maintainers.


What about --wait=false? That should handle not waiting, no?

@gjenkins8 (Member) · Feb 3, 2025

The only hiccup I see with structuring the --wait flag like this is that Helm hooks always wait regardless of whether the wait flag is set. So a user wouldn't be able to skip waiting while still using the old Helm hook logic. Still, based on my other comment, we will end up doing a custom implementation anyway, so we should be able to create relatively similar logic.

Given the simplicity of hook wait logic, perhaps it is possible to directly determine whether kstatus will be suitable?

For the main wait logic, I think a fallback to legacy is needed, as Helm's custom logic is fairly complex. And there is probably some scenario out there in which kstatus and Helm's custom logic are not equivalent.

For the relatively simple hook wait logic, can we be more confident kstatus is a replacement?


Other option would be introducing --wait-strategy and --hook-wait-strategy flags. IMO it's better to avoid the user complexity, but I'll leave the call to the maintainers.

It would be nice to avoid adding more options to the Helm CLI (unless really needed) IMHO. Every CLI option adds to the complexity of the interface Helm presents to users.

@AustinAbro321 (Contributor Author) · Feb 10, 2025

For the relatively simple hook wait logic, can we be more confident kstatus is a replacement?

Now that we've added custom status readers, IMO we can be much more confident that kstatus is a drop-in replacement. I've basically copied the logic over.

Contributor Author

Thinking about --wait vs --wait-strategy a bit more: while more CLI flags do add complexity, since it's likely that the new watcher will work in the vast majority of cases, I think it would be simpler for most users to only have to use --wait, rather than having to figure out which type of waiter they want to use.

Additionally, since --wait is currently a bool flag, a compatibility shim for --wait=true likely wouldn't help in most cases, as users usually run helm install --wait without specifying true/false.

LMK what you think about using --wait and --wait-strategy. When only --wait is specified, the new watcher will be used.


The switch from a plain bool to a string value isn't that hard; we've done something like that transparently to users in kubernetes/kubernetes#87580.

Contributor Author

Fair enough, thanks for the example. I updated the doc to reflect a --wait flag that accepts true|false|none|watcher|legacy
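A flag that accepts both bare `--wait` and string values can be sketched with Go's standard library: a `flag.Value` whose `IsBoolFlag()` returns true may appear without a value (the flag package then calls `Set("true")`) while still accepting `--wait=legacy` etc. Helm itself uses spf13/pflag, where `NoOptDefVal` gives similar behavior; this stdlib sketch only illustrates the idea and is not Helm's implementation. The strategy names come from the discussion above; the mapping of bare `--wait` to the new watcher is my reading of it.

```go
package main

import (
	"flag"
	"fmt"
)

// waitFlag accepts --wait, --wait=true/false, and the string strategies
// discussed above (none|watcher|legacy). Illustrative sketch only.
type waitFlag struct{ value string }

func (w *waitFlag) String() string { return w.value }

func (w *waitFlag) Set(s string) error {
	switch s {
	case "true":
		w.value = "watcher" // bare --wait (or --wait=true) means the new watcher
	case "false":
		w.value = "none"
	case "none", "watcher", "legacy":
		w.value = s
	default:
		return fmt.Errorf("invalid --wait value %q", s)
	}
	return nil
}

// IsBoolFlag tells the flag package that --wait may appear without a
// value, in which case Set is called with "true".
func (w *waitFlag) IsBoolFlag() bool { return true }

func main() {
	fs := flag.NewFlagSet("helm", flag.ContinueOnError)
	w := &waitFlag{value: "none"}
	fs.Var(w, "wait", "wait strategy: true|false|none|watcher|legacy")
	fs.Parse([]string{"--wait=legacy"})
	fmt.Println(w.value) // legacy
}
```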

hips/hip-0999.md Outdated

The Helm CLI will always use the `statusWaiter` implementation; if this is found to be insufficient during Helm 4 testing, a new CLI flag `wait-strategy` will be introduced with options `status` and `legacy` to allow usage of the `HelmWaiter`. If the `statusWaiter` is found to be sufficient, the `HelmWaiter` will be deleted from the public SDK before Helm 4 is released.

The current Helm wait logic does not wait for paused deployments to be ready. This is because if `helm upgrade --wait` is run on a chart with paused deployments they will never be ready, see [#5789](https://github.com/helm/helm/pull/5789).
Member

nit: this seems more like motivation than specification

Contributor Author

I'm not changing this in the PR, but I wanted to give the reference to the PR that explains why we skip waiting for paused deployments. Let me make this line clearer.


Below is the minimal set needed to watch a deployment with the status watcher. This can be verified by following the instructions in this repo: https://github.com/AustinAbro321/kstatus-rbac-test.
```yaml
rules:
```
Member

I don't think this is extensive enough. I suspect kstatus will need watch and list permissions for every resource type in a chart (not just apps).

In particular, Helm only considered a few resource types previously.

This is fine, just want the HIP to be accurate
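The reviewer's point could be sketched as config: rules covering each resource type the chart ships, not only the apps group. The groups and resources below are hypothetical examples that would depend on the chart's contents, not a verified minimal set:

```yaml
rules:
  # kstatus lists and watches every resource type it tracks, so each
  # type in the chart needs list+watch (illustrative example only).
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["services", "pods"]
    verbs: ["list", "watch"]
```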

@gjenkins8 (Member) left a comment

This is great. A few comments/cleanups, but I think the HIP is close to being ready

Signed-off-by: Austin Abro <[email protected]>
@AustinAbro321 (Contributor Author)

@gjenkins8 Finally was able to get back to this and address all the comments. My PR is ready to review as well: helm/helm#13604. I just need to add the cobra logic for --wait; I'll do that tomorrow.

Signed-off-by: Austin Abro <[email protected]>
@robertsirc (Member) left a comment

Thank you for your HIP. I am OK with what is being proposed here. You handle the legacy behavior and provide a path going forward. I believe this HIP is fine for Helm v4.

@mattfarina mattfarina merged commit dcee115 into helm:main Feb 19, 2025
1 check passed
@joejulian left a comment

This makes sense to me.
