Unhealthy ResourceGraphDefinition causes stale state for all other ResourceGraphDefinitions #886

@toweroy

Description

Observed Behavior:
When multiple ResourceGraphDefinitions are Active and reconciling, applying a new, misconfigured ResourceGraphDefinition that causes the kro controller to fail blocks all new changes to every other ResourceGraphDefinition. Only the "old state" of the ResourceGraphDefinitions continues to be applied.

Note: As a collateral issue, deleting the problematic ResourceGraphDefinition alone is not enough; only deleting it and then restarting the kro controller pod returns us to a healthy setup.

Expected Behavior:
One unhealthy/failing ResourceGraphDefinition should not prevent other healthy ResourceGraphDefinitions from being updated and reconciled. I would also expect the failing state to be propagated to the ResourceGraphDefinition's status.

Reproduction Steps (the ResourceGraphDefinition and Instance files used can be found in the Appendix section at the bottom):

  1. Apply the HealthyRgd ResourceGraphDefinition and ClusterRole to a cluster and verify it's Active:
kubectl apply -f clusterrole.yaml
kubectl apply -f healthy-rgd.yaml

kubectl get rgd
NAME                                         APIVERSION   KIND                     STATE    AGE
healthy-rgd-test                             v1           HealthyRgd               Active   30s
  2. Apply the HealthyRgd resource, verify it has reconciled and also that the underlying resource has been created (a Namespace in this example):
kubectl apply -f healthy-rgd-resource.yaml

kubectl get healthyrgd -n test-namespace
NAME                   STATE    READY   AGE
healthy-rgd-resource   ACTIVE   True    18s

kubectl get namespace healthy-rgd-resource-test
NAME                        STATUS   AGE
healthy-rgd-resource-test   Active   30s
  3. Apply the UnhealthyRgd ResourceGraphDefinition; UnhealthyRgd has no state, which I would think is expected (see the status check after the output):
kubectl apply -f unhealthy-rgd.yaml 
resourcegraphdefinition.kro.run/unhealthy-rgd-test created

kubectl get rgd
NAME                                         APIVERSION   KIND                     STATE    AGE
healthy-rgd-test                             v1           HealthyRgd               Active   69m
unhealthy-rgd-test                           v1           UnhealthyRgd                      40s
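
Per the expected behavior above, I would want the failure surfaced on the ResourceGraphDefinition's status. One way to dump its conditions and check (the jsonpath below is just one way to do it):

kubectl get rgd unhealthy-rgd-test -o jsonpath='{.status.conditions}'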
  4. The kro-system pod logs show the issue (expected, since a ClusterRole was not set up for this ResourceGraphDefinition; a sketch of the missing ClusterRole follows the log):
2025-12-05T09:18:39Z    ERROR   dynamic-controller      watch error for lazy informer   {"gvr": "unhealthy.rgd.com/v1, Resource=unhealthyrgds", "gvr": "unhealthy.rgd.com/v1, Resource=unhealthyrgds", "error": "failed to list *v1.PartialObjectMetadata: unhealthyrgds.unhealthy.rgd.com is forbidden: User \"system:serviceaccount:kro-system:kro\" cannot list resource \"unhealthyrgds\" in API group \"unhealthy.rgd.com\" at the cluster scope"}
github.com/kubernetes-sigs/kro/pkg/dynamiccontroller/internal.(*LazyInformer).ensureInformer.func1
        github.com/kubernetes-sigs/kro/pkg/dynamiccontroller/internal/gvr_watch.go:86
k8s.io/client-go/tools/cache.(*sharedIndexInformer).SetWatchErrorHandler.func1
        k8s.io/[email protected]/tools/cache/shared_informer.go:497
k8s.io/client-go/tools/cache.(*Reflector).RunWithContext.func1
        k8s.io/[email protected]/tools/cache/reflector.go:361
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
        k8s.io/[email protected]/pkg/util/wait/backoff.go:233
k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext.func1
        k8s.io/[email protected]/pkg/util/wait/backoff.go:255
k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext
        k8s.io/[email protected]/pkg/util/wait/backoff.go:256
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
        k8s.io/[email protected]/pkg/util/wait/backoff.go:233
k8s.io/client-go/tools/cache.(*Reflector).RunWithContext
        k8s.io/[email protected]/tools/cache/reflector.go:359
k8s.io/client-go/tools/cache.(*controller).RunWithContext.(*Group).StartWithContext.func3
        k8s.io/[email protected]/pkg/util/wait/wait.go:63
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
        k8s.io/[email protected]/pkg/util/wait/wait.go:72
  5. Verify creating a new HealthyRgd resource (named healthy-rgd-resource-2; the file is sketched after the output) - Works
kubectl apply -f healthy-rgd-resource-2.yaml
healthyrgd.healthy.rgd.com/healthy-rgd-resource-2 created

kubectl get healthyrgd -n test-namespace
NAME                     STATE    READY   AGE
healthy-rgd-resource-2   ACTIVE   True    77s

kubectl get namespace healthy-rgd-resource-2-test
NAME                          STATUS   AGE
healthy-rgd-resource-2-test   Active   2m5s
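
healthy-rgd-resource-2.yaml is not included in the Appendix; presumably it is just the Appendix HealthyRgd resource with a different name, along the lines of:

apiVersion: healthy.rgd.com/v1
kind: HealthyRgd
metadata:
  name: healthy-rgd-resource-2
  namespace: test-namespace
spec:
  expectedState: "healthy"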
  6. Modify the HealthyRgd ResourceGraphDefinition to add a new field (e.g. newShinyField) and verify it is updated:
...
spec:
  schema:
    apiVersion: v1
    group: healthy.rgd.com
    kind: HealthyRgd
    spec:
      ...
      newShinyField: string | required=true description="A new shiny field added to test RGD updates"
...
kubectl apply -f healthy-rgd.yaml 
resourcegraphdefinition.kro.run/healthy-rgd-test configured

kubectl get rgd healthy-rgd-test -oyaml
...
(newShinyField is present in the spec)
...
  7. Update the HealthyRgd resource with the new field (newShinyField) - Fails (see the note after the error output)
kubectl apply -f healthy-rgd-resource.yaml
The request is invalid: patch: Invalid value: "...": strict decoding error: unknown field "spec.newShinyField"
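
The strict decoding error suggests the generated CRD was never updated with newShinyField, even though the ResourceGraphDefinition spec shows it (step 6). Assuming kro names the generated CRD healthyrgds.healthy.rgd.com, this could be confirmed by inspecting its schema:

kubectl get crd healthyrgds.healthy.rgd.com -o jsonpath='{.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties}'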
  8. Delete the UnhealthyRgd ResourceGraphDefinition:
kubectl delete rgd unhealthy-rgd-test
resourcegraphdefinition.kro.run "unhealthy-rgd-test" deleted

Note: You can attempt to re-apply the HealthyRgd resource, but it will fail with the same decoding error as before.

  9. Delete the kro-system pod (the only way I have found around this broken state; an equivalent rollout restart is sketched below):
kubectl delete pod <pod-name> -n kro-system
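
Equivalently, assuming kro runs as a Deployment named kro, a rollout restart should have the same effect:

kubectl rollout restart deployment/kro -n kro-system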
  10. Re-apply the HealthyRgd resource with the new field (newShinyField; the updated file is sketched in the Appendix):
kubectl apply -f healthy-rgd-resource.yaml
healthyrgd.healthy.rgd.com/healthy-rgd-resource configured

Everything is back to normal, i.e. reconciliation succeeds again.
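
A quick check that the new field actually landed on the instance (field name per step 6):

kubectl get healthyrgd healthy-rgd-resource -n test-namespace -o jsonpath='{.spec.newShinyField}'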

Versions:

  • kro version: 0.7.0
  • Kubernetes Version (kubectl version): v1.33.5

Involved Controllers:

  • Controller URLs and Versions (if applicable): kro dynamic-controller

Error Logs (if applicable): Shared in the above reproduction steps

Appendix

ClusterRole

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    rbac.kro.run/aggregate-to-controller: 'true'
  name: kro:controller:healthy-rgd-test
rules:
  - apiGroups:
      - healthy.rgd.com
    resources:
      - healthyrgds
      - healthyrgds/status
    verbs:
      - '*'
  - apiGroups:
      - ''
    resources:
      - namespaces
    verbs:
      - '*'

Healthy ResourceGraphDefinition

apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: healthy-rgd-test
spec:
  schema:
    apiVersion: v1
    group: healthy.rgd.com
    kind: HealthyRgd
    spec:
      expectedState: string | required=true description="The expected state of the resource"

  resources:
  - id: ns
    template:
      apiVersion: v1
      kind: Namespace
      metadata:
        name: ${schema.metadata.name}-test

Unhealthy ResourceGraphDefinition

apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: unhealthy-rgd-test
spec:
  schema:
    apiVersion: v1
    group: unhealthy.rgd.com
    kind: UnhealthyRgd
    spec:
      expectedState: string | required=true description="The expected state of the resource"

  resources:
  - id: ns
    template:
      apiVersion: v1
      kind: Namespace
      metadata:
        name: ${schema.metadata.name}-generated

HealthyRgd resource

apiVersion: healthy.rgd.com/v1
kind: HealthyRgd
metadata:
  name: healthy-rgd-resource
  namespace: test-namespace
spec:
  expectedState: "healthy"
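
Updated HealthyRgd resource

The updated healthy-rgd-resource.yaml used in steps 7 and 10 was not included with the original files; it is presumably the same resource with the new field added (the value here is a hypothetical example):

apiVersion: healthy.rgd.com/v1
kind: HealthyRgd
metadata:
  name: healthy-rgd-resource
  namespace: test-namespace
spec:
  expectedState: "healthy"
  newShinyField: "shiny"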
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Which option best describes your issue?

ResourceGraphDefinition (Create, Update, Deletion)


Labels

  • kind/bug - Categorizes issue or PR as related to a bug.
  • priority/critical-urgent - Highest priority. Must be actively worked on as someone's top priority right now.
  • triage/accepted - Indicates an issue or PR is ready to be actively worked on.
