added metrics for vulnerabilities on a workload level #24

Merged
merged 4 commits into kubescape:main on Dec 18, 2024

Conversation

hebestreit (Contributor)

Overview

Currently the Prometheus Exporter only provides metrics at the cluster and namespace level. We find it useful to also have an overview at the workload level, which makes it possible to see exactly which Deployment has the most vulnerabilities, or to define custom alerts.

Following the existing metric name pattern, new workload-level metrics are introduced for vulnerabilities and configuration scans:

  • kubescape_controls_total_workload_<severity>
  • kubescape_vulnerabilities_total_workload_<severity>
  • kubescape_vulnerabilities_relevant_workload_<severity>

Additional Information

Initial discussion started here:
https://cloud-native.slack.com/archives/C04GY6H082K/p1733500846063089

How to Test

Examples/Screenshots

This is how the metrics are exported via the /metrics endpoint. Note that the values are dummy values.

kubescape_controls_total_workload_medium{namespace="monitoring",workload="promtail",workload_kind="serviceaccount"} 1
kubescape_vulnerabilities_total_workload_critical{namespace="monitoring",workload="promtail",workload_kind="daemonset"} 2
kubescape_vulnerabilities_relevant_workload_medium{namespace="monitoring",workload="promtail",workload_kind="daemonset"} 3
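
As a rough illustration of how such a gauge could be defined with prometheus/client_golang, a minimal sketch is shown below. The metric and label names follow the examples above, while the variable name, help text, and registration details are purely illustrative and not taken from the exporter's actual code.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical sketch: one per-severity workload gauge; label names mirror the examples above.
var vulnerabilitiesCriticalWorkload = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kubescape_vulnerabilities_total_workload_critical",
		Help: "Number of critical vulnerabilities per workload",
	},
	[]string{"namespace", "workload", "workload_kind"},
)

func init() {
	// Register the gauge so it appears on the /metrics endpoint.
	prometheus.MustRegister(vulnerabilitiesCriticalWorkload)
}

// Example update for a single workload:
//   vulnerabilitiesCriticalWorkload.WithLabelValues("monitoring", "promtail", "daemonset").Set(2)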

matthyx (Contributor) left a comment:

it probably works for a small cluster, for bigger ones I would use watch and pagers to avoid overloading the storage component

api/api.go Outdated
@@ -43,6 +62,26 @@ func (sc *StorageClientImpl) GetVulnerabilitySummaries() (*v1beta1.Vulnerability

}

func (sc *StorageClientImpl) GetWorkloadConfigurationScanSummaries() (*v1beta1.WorkloadConfigurationScanSummaryList, error) {
workloadConfigurationScanSummaries, err := sc.clientset.SpdxV1beta1().WorkloadConfigurationScanSummaries("").List(context.TODO(), metav1.ListOptions{})

matthyx (Contributor):

I think rather than getting the full list all the time, it would make sense to use a Watch() and update counters as you get added/removed events for the objects in a go routine - and when requested you just provide the counters values
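
To make the suggestion concrete, a rough sketch of such a watch loop could look like the following. It assumes the generated Kubescape storage clientset exposes the usual Watch method, and updateCounters/removeCounters are illustrative helpers, not the exporter's real functions; imports would be context, metav1, k8s.io/apimachinery/pkg/watch, and the v1beta1 types already used in api/api.go.

// Hypothetical sketch of a watch loop that keeps the counters up to date.
func (sc *StorageClientImpl) watchWorkloadConfigurationScanSummaries(ctx context.Context) error {
	w, err := sc.clientset.SpdxV1beta1().WorkloadConfigurationScanSummaries("").Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		summary, ok := event.Object.(*v1beta1.WorkloadConfigurationScanSummary)
		if !ok {
			continue
		}
		switch event.Type {
		case watch.Added, watch.Modified:
			updateCounters(summary) // illustrative helper
		case watch.Deleted:
			removeCounters(summary) // illustrative helper
		}
	}
	return nil
}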

hebestreit (Contributor, Author) commented Dec 12, 2024:

Makes sense.

Do I understand correctly that the Prometheus Exporter gets the full list on startup using a pager, populates all counters, and then starts a goroutine with Watch()? This guarantees that the Prometheus Exporter first synchronizes with the cluster state and doesn't overload the storage component for updates.

Inside this goroutine it gets notified about Kubescape resources (WorkloadConfigurationScanSummary and VulnerabilityManifestSummary) being added/removed and increases/decreases the counter values accordingly.
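
For the initial synchronization, a sketch of a paged list using client-go's pager helper might look like this; the function name, the page size, and the updateCounters helper are illustrative, and the extra imports would be k8s.io/apimachinery/pkg/runtime and k8s.io/client-go/tools/pager.

// Hypothetical sketch: list all summaries in pages on startup, then populate the counters.
func (sc *StorageClientImpl) syncWorkloadConfigurationScanSummaries(ctx context.Context) error {
	p := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {
		return sc.clientset.SpdxV1beta1().WorkloadConfigurationScanSummaries("").List(ctx, opts)
	}))
	p.PageSize = 100 // arbitrary page size for the sketch
	return p.EachListItem(ctx, metav1.ListOptions{}, func(obj runtime.Object) error {
		if summary, ok := obj.(*v1beta1.WorkloadConfigurationScanSummary); ok {
			updateCounters(summary) // illustrative helper
		}
		return nil
	})
}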

matthyx (Contributor) commented Dec 12, 2024:

Watch gives you the existing CRDs by default... hmm let me check

matthyx (Contributor) replied with a link (the comment body is not captured in this excerpt).

hebestreit (Contributor, Author):

Thanks for the link. I implemented the logic and did some quick tests, which seem to work. When an object is received, it simply calls the same function as for the full list to update a single item.

Initial exposed metric during startup:

kubescape_vulnerabilities_total_workload_low{namespace="external-dns",workload="external-dns",workload_kind="deployment"} 1234

After deleting the VulnerabilityManifestSummary resource for external-dns:

kubescape_vulnerabilities_total_workload_low{namespace="external-dns",workload="external-dns",workload_kind="deployment"} 0

Let me know what you think.

matthyx (Contributor):

I think it's good... does it perform the way you want? Can we merge it?

hebestreit (Contributor, Author):

I just noticed that workloads with multiple containers overwrite each other's metric values, because they share the same workload name and kind but are separate VulnerabilityManifestSummary resources.

My idea is to expose the kubescape.io/workload-container-name label as a metric label workload_container_name, which makes it possible to also filter at the container level, but definitely increases the number of exported metrics.
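
For illustration, an exported series with the proposed extra label could then look something like this (the value and container name are hypothetical):

kubescape_vulnerabilities_total_workload_critical{namespace="monitoring",workload="promtail",workload_kind="daemonset",workload_container_name="promtail"} 2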

Later today or tomorrow I'll let you know how it impacts the performance.

hebestreit (Contributor, Author):

The total number of exported metrics increases significantly and depends heavily on the size of the cluster. Being able to deactivate this feature via an environment variable is a good compromise.

I have also added logic to delete the exported metric from the output when the resource has been deleted in the cluster, so the output matches what you would get if the Prometheus Exporter were restarted and the deleted resource were no longer available.
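
A minimal sketch of such a deletion with prometheus/client_golang might look like the following, reusing the GaugeVec from the earlier sketch; the function name is illustrative, not the exporter's actual code.

// Hypothetical sketch: drop a workload's series when its summary resource is deleted.
// DeleteLabelValues removes the series so it no longer appears on /metrics.
func removeWorkloadSeries(namespace, workload, workloadKind string) {
	vulnerabilitiesCriticalWorkload.DeleteLabelValues(namespace, workload, workloadKind)
}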

I think we're good to merge now 👍

…f items

added environment variable to enable metrics on workload level

Signed-off-by: hebestreit <[email protected]>
added logic to delete exported metric when resource has been deleted in cluster

Signed-off-by: hebestreit <[email protected]>
matthyx merged commit a63a525 into kubescape:main on Dec 18, 2024
2 checks passed
matthyx (Contributor) commented Dec 18, 2024:

thanks @hebestreit !
