[Proposal] Enhance Resource Health Monitoring within App CR #717
Addresses the issue: carvel-dev/kapp-controller#1412 Signed-off-by: Varsha Prasad Narsing <[email protected]>
We intend to extend the existing App API by adding a new status condition to expose the system's health. To do so, the following needs to be implemented:

1. The controller reconciling the App CR needs to dynamically set up watches for the resources being deployed by the package.
It would be helpful to know more about, and discuss, how we can enable watches in the App reconciler. It looks like the `kapp` command is currently called, and the App status is populated based on its output:
https://github.com/carvel-dev/kapp-controller/blob/df87efdcf0c0c140ff644c8286257cd38a74fd42/pkg/app/app_deploy.go#L25
If we could return the list of resources being created (which is currently present in the command output) and dynamically set up watches for them, it would be easier.
For reference, this is how we do it in Rukpak: https://github.com/operator-framework/rukpak/blob/6a8a84c9aff05efaba7b05992704ad38462a7ee8/internal/controllers/bundledeployment/bundledeployment.go#L389-L402.
This might not be ideal, but you can find all the resources by taking the involved GroupKinds from the ConfigMap associated with the kapp app, then listing/watching/informing using the appropriate app label selector.
Another possible approach would be to have `kapp` write some information about specific resources to a file upon reconciliation and have this information copied over to the status.
This is how we surface information about used group versions and namespaces in the App status, using output from the `--app-metadata-file-output` flag.
This, however, means the information is reported only when the App syncs.
#### Use Case: Monitoring the state of resources

Kapp currently has the `inspect` command which lists the resources deployed and their current statuses. The output of the command is also printed out as a part of App's status if enabled through `rawOptions` while creating the CR.
To confirm: the output of the `inspect` command in the App's status is populated during deploy, and after that it is not dynamically updated when the health of any resource changes? Am I missing anything here?
That is correct, it would only report the health of resources when a reconciliation occurs.
One correction (though it is not very relevant to the context): `inspect` is not a part of `rawOptions`. Enabling it would look more like:

    #.....other spec
    deploy:
    - kapp:
        inspect: {}
    #....
## Open Questions:

1. Can using informers to watch resources increase cache size, potentially impacting the performance?
2. Can the output in the `inspect` status field be combined with that of proposed `healthy` condition?
I guess this is what needs to happen eventually. It doesn't make sense to have two status fields serve a similar purpose. The `healthy` condition can just list all the resources (instead of just the failed/unhealthy ones), or vice versa.
Before refactoring the proposal accordingly, I would like to confirm the use case of `inspect` and make sure of the direction we would like to go.
The `inspect` section initially just aggregated the statuses of all resources after a finished reconciliation. It was essentially the output of the `kapp inspect` command.
We disabled it by default in favour of reducing the number of API calls we make.
Since it is a separate feature altogether, I think we can work towards having a separate section for the additional information we want to surface.
> Since it is a separate feature altogether, I think we can work towards having a separate section for the additional information we want to surface.

If we are watching all the resources in the cluster anyway and triggering a reconcile, I wonder whether calling the `inspect` command on top of that is necessary. If it is, this may also end up loading the API server, since I would expect more reconciliations due to dependent resources.
The second aspect is `inspect` and `healthy` showing conflicting information to the user at some point in time (I haven't looked into the codebase of `inspect` yet, but I assume the controller client and the one used with kapp are different?).
Given:
- `inspect` would show a superset: the health status of all the resources.
- `healthy` would show only the unhealthy resources.

If we decide to support both of them, then we should probably make them mutually exclusive?
Thank you so much for putting this together!
I aggregated some of my thoughts into comments.
So far the two goals that stand out for me are:
- Immediate reporting of failures for certain resources
- Structured per-resource reporting in case of failure
I would be curious to know if I am missing something else we are looking for too 🙏🏼
Let's take the discussion forward in the community meeting 🚀
We intend to extend the existing App API by adding a new status condition to expose the system's health. To do so, the following needs to be implemented:

1. The controller reconciling the App CR needs to dynamically set up watches for the resources being deployed by the package.
2. Introduce a `Healthy` condition in App CR's `status` [field][app_cr_status].
I believed the original suggestion was to introduce conditions per resource as well. Is that not required for your use case?
To illustrate, something like:

    - type: HealthCheck/someapi/someversion/somenamespace/resource
      status: False
      message: 'Failed to meet condition: "some more information"'

This could live in a separate field such as `status.resourceConditions`.
I am not convinced we need to report a condition for every resource. Imagine an `App` that results in hundreds of resources being created on the cluster. The primary need is to signal to a user "this app is degraded". Including a subset of the unhealthy resources would be useful. I'd like to ensure we don't get anywhere close to the 1.5MB size limit for data in etcd, and whenever there's an unbounded data set (the number of resources, in this case), I start to get concerned.
From there, the user can go investigate further.
+1 to what Andy mentioned. A single condition that consists of a consolidated list of unhealthy objects is sufficient. Something like this:

    - lastTransitionTime: "2023-08-02T04:24:27Z"
      Message: 'unhealthy resources: ["apiextensions.k8s.io/v1/CustomResourceDefinition/my.new.crd":"InvalidVersion", "deployments/test-ns/my-deploy":"MinimumReplicasUnavailable", "pods/test-ns/standalone-pod":"ImagePullBackoff"]'
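
As a rough illustration of the two points above (one consolidated message, and a bound on how much of it is stored), here is a minimal Python sketch. It is not kapp-controller code: the function name, the `max_entries` cap, and the "(and N more)" suffix are assumptions made for this example, and the proposal itself does not define a truncation rule.

```python
from typing import Dict


def unhealthy_message(unhealthy: Dict[str, str], max_entries: int = 10) -> str:
    """Build one bounded condition message from {resource ref: reason}.

    `max_entries` is a hypothetical cap that keeps the message small even
    when an App owns hundreds of unhealthy resources.
    """
    # Sort the resource refs lexicographically so the message is stable
    # between reconciles and does not churn the status needlessly.
    refs = sorted(unhealthy)
    shown = [f'"{ref}":"{unhealthy[ref]}"' for ref in refs[:max_entries]]
    message = "unhealthy resources: [" + ", ".join(shown) + "]"

    # Report how many entries were left out instead of listing them all.
    hidden = len(refs) - len(shown)
    if hidden > 0:
        message += f" (and {hidden} more)"
    return message
```

With a cap in place, the size of the condition depends on `max_entries` rather than on the number of resources the App deploys.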
7. All other unspecified resources will be considered healthy.

If any of the watched resources is unhealthy, the `Message` field of the `Healthy` condition will list the statuses of the unhealthy resources, ordered lexicographically.
I think it would be helpful to note cases where the `ReconcileFailed` condition is not present, but the `Healthy` condition is false.
If this is not a possibility, is the problem we are trying to solve that of surfacing more structured information about failed resources?
See my previous comment re not wanting to list every unhealthy resource
Since the resources deployed by the App reconciler have informers created for them, any change in the resource state will trigger a reconcile that in turn will re-evaluate the health of all resources.
There is value in being able to treat some resources as more critical! (carvel-dev/kapp-controller#1279 comes to mind)
Today, we have a mechanism that leads to immediate reconciliation on failure. However, in case of repeated failure, the reconciler backs off exponentially, meaning that it would take longer to reconcile the app again if it has already failed more than 3 times (for example).
It is worth noting that the longest the app waits will always be equal to its `syncPeriod`.
This prevents an app, or a set of apps, that is doomed to fail from hogging the reconciliation queue. Would we want something similar here as well?
Kapp currently has the `inspect` command which lists the resources deployed and their current statuses. The output of the command is also printed out as a part of App's status if enabled through `rawOptions` while creating the CR.

Though this command provides information about the resources created by the respective App CR, it does so by sending API requests during the reconciliation. Instead, using informers provides additional advantages of having real-time updates, efficient resource utilization and reduced load on the API server.
If we work with informers, it might be interesting to see what optimal number of watched resources we would recommend, keeping resource utilisation in mind.
(Just a note, not something this proposal should address)
5. An APIService resource will be healthy if/when:
   - `Available` type condition in status is true.

6. A CustomResourceDefinition resource will be healthy if/when:
   - `StoredVersions` has the expected API version for the CRD.

7. All other unspecified resources will be considered healthy.
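
Rules 5 to 7 can be sketched as a small evaluation function over the resource's JSON form. This is only an illustration in Python, not the proposed implementation: the function names, the `expected_crd_version` parameter, and the reason strings are hypothetical. Rules 1 to 4 are not reproduced in this excerpt, so workload kinds are deliberately left out and would need their own branches.

```python
from typing import Any, Dict, Tuple


def _condition_is_true(resource: Dict[str, Any], cond_type: str) -> bool:
    """Check whether status.conditions has `cond_type` with status "True"."""
    status = resource.get("status") or {}
    conditions = status.get("conditions") or []
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in conditions
    )


def is_healthy(
    resource: Dict[str, Any], expected_crd_version: str = ""
) -> Tuple[bool, str]:
    """Return (healthy, reason) for rules 5-7; reason is "" when healthy."""
    kind = resource.get("kind", "")

    if kind == "APIService":
        # Rule 5: the `Available` condition must be true.
        ok = _condition_is_true(resource, "Available")
        return ok, ("" if ok else "NotAvailable")  # hypothetical reason

    if kind == "CustomResourceDefinition":
        # Rule 6: status.storedVersions must contain the expected version.
        status = resource.get("status") or {}
        stored = status.get("storedVersions") or []
        ok = expected_crd_version in stored
        return ok, ("" if ok else "InvalidVersion")  # hypothetical reason

    # Rule 7: every other kind is considered healthy in this sketch.
    return True, ""
```

Because rule 7 treats unknown kinds as healthy, only the kinds with an explicit rule need to be watched at all.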
Is there any case where 5, 6, or 7 would not lead to a deployment failure? If so, do we really need to report health for these resource types?
## Open Questions:

1. Can using informers to watch resources increase cache size, potentially impacting the performance?
I think this will be the case since the same kapp-controller can be in charge of hundreds of apps, and if we do this for all the apps, we might end up getting informers for every resource in the cluster.
We can have this as an optional feature to start, similar to how `inspect` is right now?
If this proposal is implemented, health would ultimately be determined by evaluating only the following kinds:
- Pod
- ReplicationController
- ReplicaSet
- Deployment
- StatefulSet
- APIService
- CustomResourceDefinition
Which would mean the additional overhead is at most 6 more informers with label selectors limiting the cache contents to just what kapp-controller is managing. We don't have to have informers for the App resources that do not contribute to the health condition, right?
Yes, it'd only need to be a subset of all APIs
Based on the discussion in the community meeting today:
The two use cases for setting up informers to watch resources are:
1. Health monitoring and aggregating status.
2. Triggering a reconcile if any resource is unhealthy.

- From OLM's end, the use case we want to fulfil is (1).
- (2) is something that can cause performance issues, since it means continuously reconciling for any unhealthy resource even if we have informers set up for a limited number of GVKs (especially on clusters where kapp-controller is managing a large number of App CRs).

If (2) is not to be addressed, then to maintain modularity in kapp-controller's functionality, @joaopapereira suggested we explore having a separate controller, which can be optionally enabled, to monitor the health status of individual resources.
Another open question: if kapp-controller starts reacting to all changes in the cluster, what will happen to performance in general? At that point, kapp-controller could become the major consumer of CPU in the whole cluster.