H4HIP: CRD updating #379

gjenkins8 · 2024-12-21T04:49:09Z

No description provided.

Signed-off-by: George Jenkins <[email protected]>

kfox1111 · 2024-12-21T13:44:58Z

Its pretty common these days for crds to be in templates so they can be upgraded too. We should probably keep this behavior and align it with the /crds directory too.

Maybe:

Render templates part 1. Render out any crds found, ignoring other templates. Treat the rendered crds as if in /crds, apply those. Render templates part 2. Render everything but the crds as normal

mattfarina

I've not worked through the specification, yet. But, I wanted to add thoughts to the material up front.

hips/hip-XXXX.md

mattfarina · 2025-01-08T21:23:15Z

hips/hip-XXXX.md

+For 1., it is thought that Helm should not treat CRDs specially here.
+Helm will readily operate on many other cluster wide resources today: Cluster roles, priority classes, namespaces, etc.
+That the modification/removal of could easily cause breakage outside of the chart's release.


Helm is a package manager rather than a general purpose management tool for Kubernetes resources. It's a subtle but important difference. Here are some ways to think about it...

In the profiles we cover applications specifically. Cluster operations is specifically out of scope.

The definition (from wikipedia) for a package manager:

A package manager or package-management system is a collection of software tools that automates the process of installing, upgrading, configuring, and removing computer programs for a computer in a consistent manner.

Namespaces were not able to be created when 3.0.0 came out. The philosophy is that applications should be installed in a namespace but its not up to Helm to manage those and those should be created first, possibly part of configuration management. The namespace creation was added to provide backwards compatibility to Helm v2 (which had it) because we got so many issues filed about it.

We have not considered Helm the right tool to manage all resources since it's targeted at applications and Kubernetes resources go beyond that.

I think I disagree here. If a chart contains a namespace resource, Helm will happily template that namespace resource and install it into a cluster. Furthermore, when the user uninstalls the chart, Helm will delete the namespace, and the deletion will cascade to all resources in that namespace.

While, ideally, Helm might position itself as a "package manager", with the intent the users only create resources in the given target namespace. It doesn't do anything to prevent creating cross-namespace resources, or namespace resources themselves.

Similar for cluster-roles and cluster-role binding RBAC rules. Helm will happily install/uninstall these resources. Which could easily cause cross-application breakage, if something came to depend on those namespace rules in the interim.

The overall point is, Helm doesn't restrict other cross-namespace resources, but is very conservative with CRDs for being cross-namespace. Perhaps Helm should not delete any cluster-wide resource? (that is a different HIP :) ).

My aim is to relax Helm's approach to CRDs, so that users (chart operators) still have sufficient protection (ie. a rollback will recover). But less that todays "CRDs are cross-namespace, so Helm can't manage them" approach (since Helm will manage other cross-namespace resources).

mattfarina · 2025-01-08T21:35:03Z

hips/hip-XXXX.md

+In general, Helm as a package manager should not try to prempt unintended functional changes from a chart.
+Validating functional changes is well beyond the scope of a package manager.
+Helm should treat CRDs the same as other (cluster-wide) resources, where a Helm chart upgrade that causes unintended functional effects should be reverted (rolled back) by the user (chart operator).
+And as long as that rollback path exists, this is a suitable path for users to mitigate breaking functional changes.


Two thoughts here....

Chart authors create charts. Often, Application operators are entirely different people and they install or upgrade charts. Application operators often do not have expertise in k8s or the applications they are running. When an application operator has a problem, especially a severe one like data loss, they file issues in the Helm issue queue. We have experience this in the past which is one of the reasons Helm has been conservative. Responding to those issues and managing them is time consuming.

Those who can install/update CRDs are sometimes not the same people installing/upgrading the chart. In the past there has been a separation. We should better understand the current state of this. Being able to extract the CRDs and send them to someone with access is useful. Being able to install/upgrade a chart when you don't have global resource access is helpful. Or has been. We need to understand this landscape change.

I remember meeting with Helm users who had tight controls on their clusters. They had many who could use their namespaces and few who could manage things at the cluster level. This shaped Helm v3 designs. For example, it's the reason Helm uses secrets instead of a CRD. Using a CRD would limit who could use it.

especially a severe one like data loss

I want to call this out directly: This HIP specifically ensures Helm won't cause data loss with CRDs/CRs

For 1., there is a range. A chart author can produce a buggy chart (unrelated to CRDs), and if operator installs it, they may come to Helm and ask "why didn't my application work correctly?" This is unfortunate, as this is a chart problem, and not Helm problem.

But the chart operators recourse is to simply helm rollback. And there is not much (anything) Helm can do to prevent this buggy chart situation (aside from good UX to help the user discover/diagnose the problem).

The goal is to treat CRDs similarly, if a chart author produces a new chart version with a new CRD version which is incompatible with the existing served/stored versions, then we can simply consider the chart to be buggy. And the operator can rollback. The user may come to Helm, and like other issues, they would need to be directed to the chart author (in this case).

mattfarina · 2025-01-08T21:39:02Z

hips/hip-XXXX.md

+For 2., Data loss should be treated more carefully.
+As data loss can be irrevocable or require significant effort to restore.
+And especially, an application/chart operator should not expect a chart upgrade to cause data loss.
+Helm can prevent Data loss can be prevented by ensuring CRDs are "appended" only (with special exceptions in the specification below). An in particular, appending allows a rollback to (effectively) restore existing cluster state.
+(It should also be noted that Helm will today remove other resources which may cause data loss e.g. secrets, config maps, namespace, etc. A special, hard-coded, exception does exist for persistent volumes)


Data loss is different between CRDs and something like secrets. If I remove a secret it removes the one secret. This impacts that single case. If a CRD is deleted all the CRs are deleted. For example, you have user A with application instance of a chart. And you have user B in that same cluster with a separate instance of the same chart. These 2 users do not know about each other. If user A deletes a CRD, even unintentionally, it will remove the CRs for user B and cause data loss for user B. This is a very different surface area from something like deleting a secret.

We also know that some don't have backups and will make changes in production. When things go wrong, Helm devs will get issues to triage and deal with.

I should delete this last sentence, I'm not sure why I added it. The explicit intention of this HIP is to ensure Helm can not cause data loss with CRDs

gjenkins8 · 2025-02-01T23:23:23Z

Its pretty common these days for crds to be in templates so they can be upgraded too. We should probably keep this behavior and align it with the /crds directory too.

Maybe:

Render templates part 1. Render out any crds found, ignoring other templates. Treat the rendered crds as if in /crds, apply those. Render templates part 2. Render everything but the crds as normal

In what circumstances do people actually need CRDs to be dynamic (with templating)? To my knowledge, CRDs tend to end up in the templates/ dir, solely so Helm will update them. Not because the the CRD structure needs to be adjusted. Examples/rationale for otherwise appreciated!

(Generally, the goal here is to improve Helm's CRDs handling. Which doesn't mean all improvements need to be done at once. templating CRDs might be done as a future improvement)

Signed-off-by: George Jenkins <[email protected]>

kfox1111 · 2025-02-10T16:13:40Z

In what circumstances do people actually need CRDs to be dynamic (with templating)? To my knowledge, CRDs tend to end up in the templates/ dir, solely so Helm will update them. Not because the the CRD structure needs to be adjusted. Examples/rationale for otherwise appreciated!

I dont think its common for crds in templates to actually be templates. but, right now there are two choices. as a packager.

You put your crds in crds dir, and they never upgrade making things more complicated for end users.
You put them in templates, and upgrades work.

If 4 will maintain backwards compat with 3, then it will need to support crds in templates/ still.

gjenkins8 · 2025-02-22T04:47:46Z

If 4 will maintain backwards compat with 3, then it will need to support crds in templates/ still.

Certainly, this HIP only pertains to CRDs in the crds/ directory:

https://github.com/helm/community/pull/379/files#diff-e94b50b2c7a2d45c81db342448ae78950464d77488a62b315a2b715549fe1a95R79

hips/hip-XXXX.md

Signed-off-by: George Jenkins <[email protected]>

joejulian · 2025-03-06T19:57:16Z

hips/hip-XXXX.md

+- update existing versions with the exception of the schema field
+    - If in the future, functionality exists that shows a CRD version schema changes are backwards compatible, Helm may allow updating CRD version's schema field
+    - In particular, Helm will expect the update from the chart to correctly set a single version to `storage: true`
+- merge/update `/conversion` field


This won't work of the version of the webhook that gets installed doesn't support the conversion. If the application provider has deprecated v1alpha1, but has an upgrade from v1beta1 to v1beta2, if we keep the v1alpha1 configuration in the CRD, the data won't be lost, but it may not be supported by the conversion webhook nor the controller.

This is a problem with Helm today? That a user can deploy new controller/conversion webhook versions which don't support old stored versions?

Also for this scenario: where there are existing v1alpha1 versions in storage. They are already there. Whether v1alpha1 remains in the CRD versions array, or not. Kubernetes only attempts to convert them to a served version upon load. And the chart operator should have upgraded them already (https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#upgrade-existing-objects-to-a-new-stored-version)(but probably won't have). Helm not removing the v1alpha1 entry from the versions array doesn't really change anything here I don't think.

(I'm going to write a test to check the exact behavior here. Will update shortly with kubernetes exact behavior with regards to versions in storage and loading/converting old versions)

This is a problem with Helm today? That a user can deploy new controller/conversion webhook versions which don't support old stored versions?

Also for this scenario: where there are existing v1alpha1 versions in storage. They are already there. Whether v1alpha1 remains in the CRD versions array, or not. Kubernetes only attempts to convert them to a served version upon load.

In the above scenario, the upgraded application would fail. There would be no v1betaX (most likely) and the operator/webhook would spew errors about the resource not existing. The CRD would need to be replaced and the v1alpha1 resources that are stored would be lost. That's kind-of by design with k8s versioning schemas.

I mostly point it out because it could cause surprising behavior to an end-user. "I have all my resources defined, why isn't my controller seeing them?"

And the chart operator should have upgraded them already (https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#upgrade-existing-objects-to-a-new-stored-version)(but probably won't have). Helm not removing the v1alpha1 entry from the versions array doesn't really change anything here I don't think.

(I'm going to write a test to check the exact behavior here. Will update shortly with kubernetes exact behavior with regards to versions in storage and loading/converting old versions)

👍

I think my preferred behavior would be that instead of merging CRDs, that if the storage version in the installed CRD is no longer in the new CRD version, we just error and fail.

scottrigby · 2025-04-27T19:59:59Z

We've seen a lot of issues like this trying to find workarounds to Helm's current CRD update limitations. Linking this example here because this HIP is now on my radar. Will also try to link future issues here when I run into them naturally. actions/actions-runner-controller#2295

pull-request-size bot added the size/L label Dec 21, 2024

gjenkins8 force-pushed the hip4_crd_appending branch 2 times, most recently from eb8ecef to 2b5d9d7 Compare December 21, 2024 05:00

Initial draft

343cb9f

Signed-off-by: George Jenkins <[email protected]>

gjenkins8 force-pushed the hip4_crd_appending branch from 2b5d9d7 to 343cb9f Compare December 21, 2024 05:01

mattfarina reviewed Jan 8, 2025

View reviewed changes

review

836107b

Signed-off-by: George Jenkins <[email protected]>

gjenkins8 commented Feb 22, 2025

View reviewed changes

hips/hip-XXXX.md Outdated Show resolved Hide resolved

Apply suggestions from code review

19d4229

Signed-off-by: George Jenkins <[email protected]>

joejulian reviewed Mar 6, 2025

View reviewed changes

This was referenced Jun 27, 2025

Create Helm Chart from Static Manifests kubernetes-csi/external-snapshotter#1285

Open

Helm's CRD upgrade behavior frustrates users, is a production time bomb, and discourages good testing practices helm/helm#31027

Open

H4HIP: CRD updating #379

Are you sure you want to change the base?

H4HIP: CRD updating #379

Uh oh!

Conversation

gjenkins8 commented Dec 21, 2024

Uh oh!

kfox1111 commented Dec 21, 2024

Uh oh!

mattfarina left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mattfarina Jan 8, 2025

Choose a reason for hiding this comment

Uh oh!

gjenkins8 Jan 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattfarina Jan 8, 2025

Choose a reason for hiding this comment

Uh oh!

gjenkins8 Jan 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattfarina Jan 8, 2025

Choose a reason for hiding this comment

Uh oh!

gjenkins8 Jan 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gjenkins8 Feb 19, 2025

Choose a reason for hiding this comment

Uh oh!

gjenkins8 commented Feb 1, 2025

Uh oh!

kfox1111 commented Feb 10, 2025

Uh oh!

gjenkins8 commented Feb 22, 2025

Uh oh!

Uh oh!

joejulian Mar 6, 2025

Choose a reason for hiding this comment

Uh oh!

gjenkins8 Mar 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

joejulian Apr 14, 2025

Choose a reason for hiding this comment

Uh oh!

scottrigby commented Apr 27, 2025

Uh oh!

Uh oh!

gjenkins8 Jan 28, 2025 •

edited

Loading

gjenkins8 Jan 28, 2025 •

edited

Loading

gjenkins8 Jan 28, 2025 •

edited

Loading

gjenkins8 Mar 18, 2025 •

edited

Loading