Skip to content
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
203 changes: 203 additions & 0 deletions docs/design/proposals/rgd-ownership-and-deletion.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,203 @@
# ResourceGraphDefinition Ownership and Deletion Protection

## Problem statement

Previously, **CustomResourceDefinitions (CRDs)** generated by **ResourceGraphDefinitions (RGDs)** were only linked to their parent by a label (`metadata.labels["kro.run/resourcegraphdefinition"]`).
There was **no ownership relation** (`ownerReferences`) or **controller reference**, leading to several issues:

* The KRO controller required explicit logic to decide when to delete CRDs, controlled by a Helm flag `allowCRDDeletion`.
* When this flag is set to `false` (the default), deleting an RGD **did not delete** its generated CRD, leaving orphaned definitions.
* Users could still manually delete RGDs without realizing that this orphaned affected CRDs and controller behavior, causing potential **inconsistent API states**.

The lack of Kubernetes-native ownership semantics made RGD–CRD lifecycle management unreliable and opaque, relying on our custom management.

## Proposal

Introduce proper **Kubernetes ownership semantics** between RGDs and their generated CRDs.
Combine this with a **safe-deletion mechanism** for user-initiated RGD removals and **clearer internal deletion control** for KRO.

### Overview

This proposal introduces three coordinated changes:

1. **Ownership introduction** — CRDs now include an `OwnerReference` pointing to the creating RGD.
2. **Deletion protection** — A `ValidatingAdmissionPolicy` prevents accidental user deletion of RGDs.
Copy link
Contributor

@ellistarn ellistarn Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand protecting against the deletion of the Kinds managed by RGDs, but what's the thinking behind stopping a user that says k delete rgd foo from deleting the rgd foo. Delete means delete, right?

It strikes me that this could be something that organizations choose to opt in more broadly (deletion protection for any k8s resource), but odd that KRO would take the opinion for users when other projects don't.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the main reason is that by default kubernetes deletes with background propagation. if we delete a RGD, that would automatically start deletion of the CRD as well during the delete. I only added this after there was an opinion against simply allowing a delete. I personally favor this too but adjusted after community call feedback

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, k8s standard leans towards no protection and delete everything, cascade.
Moreover, once marked for deletion, there is no way back, you must complete deletion before being able to re-use it again (kubectl delete starts by setting the deletionTimestamp that can't be undone)

As there is a clear 1-1 ownership link RGD <-> CRD I would indeed favour the standard behaviour

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally favor this too but adjusted after community call feedback

I think this is the kind of thing where you can easily find voices in both directions. You'll see this debate surface through many AWS APIs, too. Just try deleting an S3 bucket with objects in it.

Given the controversy and the fact that it's broader than just our API, I'd just stick with the existing standard and let a separate solution handle deletion protection holistically.

3. **Controller deletion control** — The Helm flag `allowCRDDeletion` now cleanly governs whether KRO is permitted to delete CRDs as part of reconciliation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the current implementation does not fully honor the flag?

What about making the default to true?

There's also a discussion to have about finalizers that plays a different role IMO (the best place is probably below in the discarded section)

Suggested change
3. **Controller deletion control**The Helm flag `allowCRDDeletion` now cleanly governs whether KRO is permitted to delete CRDs as part of reconciliation.
1. **Ownership introduction**CRDs now include an `OwnerReference` pointing to the creating RGD.
2. **Deletion protection** — A `ValidatingAdmissionPolicy` prevents accidental user deletion of RGDs (blocks deletion of RGDs with instances)
3. **Controller deletion control** — The Helm flag `allowCRDDeletion` now cleanly governs whether KRO is permitted to delete CRDs as part of reconciliation.


Together, these changes align KRO with Kubernetes-native resource management and improve data safety.

---

### Design details

#### 1. Ownership introduction

Each CRD created by an RGD would carry an **OwnerReference** set by the controller:

```go
if err := ctrl.SetControllerReference(rgd, crd, r.Scheme()); err != nil {
mark.KindUnready(err.Error())
return nil, nil, fmt.Errorf("failed to set controller reference of CRD: %w", err)
}
```

This produces CRDs with metadata similar to:

```yaml
metadata:
name: sample.kro.run
ownerReferences:
- apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
name: sample
uid: 12345-abcde
controller: true
blockOwnerDeletion: true
```

**Resulting behavior:**

* When an RGD is deleted, its owned CRD is automatically garbage-collected. This is proper behavior
as we can still delete RGDs with policy "orphan" to preserve CRDs.
* Manual CRD deletion (while RGD still exists) will trigger re-reconciliation, as ownership implies controller responsibility.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

(we can and should think about warning or failing user deletion requests on CRDs managed by KRO)
* The original RGD label remains for traceability, but lifecycle is now managed by native Kubernetes GC.

Integration tests confirm the presence and correctness of `ownerReferences` (including `controller` and `blockOwnerDeletion` flags).

#### 2. Deletion protection for RGDs

Because RGD deletion now cascades to delete its CRD, an **admission policy** prevents accidental data loss.

When `config.allowCRDDeletion` is `false` (default), Helm installs a **ValidatingAdmissionPolicy** and **Binding**:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
name: {{ include "kro.fullname" . }}-crd-protection-policy
spec:
matchConstraints:
resourceRules:
- apiGroups: ["kro.run"]
apiVersions: ["v1alpha1"]
operations: ["DELETE"]
resources: ["resourcegraphdefinitions"]
validations:
- expression: |
has(oldObject.metadata.annotations) &&
oldObject.metadata.annotations['{{ .Values.validation.admission.policy.rgd.annotation.key }}'] == 'true'
reason: Invalid
message: |
Deletion denied. To proceed, set annotation '{{ .Values.validation.admission.policy.rgd.annotation.key }}: "true"'.
Removing an RGD also deletes its CustomResourceDefinition and may cause data loss.
```

Users must explicitly annotate an RGD to confirm deletion:

```yaml
metadata:
annotations:
kro.run/allow-delete: "true"
```

This ensures intentional deletions only.

#### 3. Controller-side deletion control (`allowCRDDeletion`)

Previously, the flag `allowCRDDeletion` determined whether KRO would delete CRDs it had created if the RGD was removed or changed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious to explore the use cases for setting this flag one way or another. If we're certain that all of the instances managed by the RGD are gone, is it really dangerous to remove the Kind? I'm not seeing the upsides of leaking it if we're confident we're not losing data. I can see downsides of not removing the kind, such as attempting to do a fresh installation of a new version where the Kind has changed substantially.

Now, with `OwnerReferences` managing lifecycle automatically, the flag’s meaning is refined:

* When `allowCRDDeletion: false`:
KRO still sets owner references, but **does not itself delete CRDs**.
Deletion is handled by Kubernetes GC only when the RGD is removed.
The ValidatingAdmissionPolicy remains active to prevent accidental RGD deletion.

* When `allowCRDDeletion: true`:
KRO may directly delete CRDs it manages (e.g., during reconciliation or cleanup).
The Helm chart skips installation of the ValidatingAdmissionPolicy, allowing full manual control.
Comment on lines +111 to +118
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure I make sense of the semantics.
It feels weird to me to have allowCRDDeletion: false and that CRD would be deleted anyhow (by k8s).

There's also a case that could lead to inconsistent behaviour, when the RGD Kind, Group or APIVersion changes. In this case, when allowCRDDeletion: false we wouldn't delete the dangling CRD but the CRD would be deleted in case of RGD deletion.

The proposed behaviour and lifecycle sounds more like confirmCRDDeletion or requireManualJudgementForCRDDeletion
This semantics would also bring consistency when Kind, Group or APIVersion changes.
Maybe the validation webhook needs to be on the CRD itself then?


Updated values:

```yaml
config:
allowCRDDeletion: false

validation:
admission:
policy:
rgd:
annotation:
key: kro.run/allow-delete
actions: '[Deny]'
```

#### 4. Documentation and tests

**Docs:**
A new section “Deletion of ResourceGraphDefinitions” explains:

* Ownership semantics between RGDs and CRDs
* The new cascading deletion behavior
* The admission-based deletion protection
* Helm configuration for disabling or customizing behavior

**Integration tests:**

* Assert CRDs are created with correct owner references
* Confirm garbage collection after RGD deletion
* Verify protection policy denies unannotated deletions
* Ensure disabling the policy (via Helm) removes admission resources

---

## Other solutions considered

| Option | Reason Rejected |
| ------------------------------------ |-----------------------------------------------------------------------------------------|
| Continue using label linkage only | No lifecycle tracking or automatic cleanup, against k8s API semantics |
| Finalizer-based blocking | Overcomplicates reconciliation; redundant with OwnerReferences and foreground finalizer |
| Webhook validation | Equivalent logic but adds latency and operational overhead |
| Hard-coded controller deletion rules | Less flexible than declarative GC + admission policy |

---

## Scoping

### In scope

* Add `OwnerReference` and `Controller` relationship between RGD and CRD
* Introduce ValidatingAdmissionPolicy for RGD deletions
* Retain `allowCRDDeletion` flag for controller-side cleanup behavior
* Documentation and Helm value updates

### Not in scope

* Migration logic for preexisting orphaned CRDs
* Protection for other dependent resources (we only want to look at CRDs)
* Multi-owner CRD sharing or advanced garbage-collection rules

## Testing strategy

### Requirements

* Kubernetes 1.30+ with `ValidatingAdmissionPolicy` enabled
* Kind or cluster-based integration test environment

### Test plan

1. Verify CRDs include correct `OwnerReferences` after reconciliation
2. Deleting an RGD triggers automatic CRD deletion via GC
3. Deletion without annotation is rejected with clear error message
4. Deletion with `kro.run/allow-delete: "true"` proceeds successfully
5. Setting `allowCRDDeletion: true` disables admission policy and allows controller-driven and manual cleanup

---

## Discussion and notes

This change brings RGD–CRD management in line with native Kubernetes semantics:

* Ownership ensures consistent lifecycle and prevents orphaned CRDs.
* Admission control ensures user awareness and intentional deletion.
* The `allowCRDDeletion` flag cleanly separates controller behavior from user intent.