draft: add `inject-per-series-metadata` #2632

Draft
jacobstr wants to merge 1 commit into main from koobz/per-metric-metadata
Conversation

@jacobstr (Contributor) commented Mar 20, 2025:

See the discussion in #2551. This frees users from having to join on the various label series, which is useful when the infrastructure has agents that can't do recording rules.

Note: this is a draft proposal because repeating this work across all other metric families will get quite tedious. I wanted to walk this implementation out a bit and propose a neatly encapsulated way to inject labels and annotations via a wrapper func - it might not perform as well, but adding all of the slice-wrangling plus kube label and annotation filtering to the business logic of each metric would be a bit messy.

What this PR does / why we need it: See #2551. With systems like Grafana Cloud it's useful to filter / aggregate certain series by segmenting metrics according to a custom label (such as app, environment, etc.). Having to first label_join against the existing kube_<resource>_labels gauges requires an upstream Prometheus with a TSDB and precludes using things like Prometheus in agent mode or Alloy.

By propagating the labels to individual time series we could do more effective filtering / aggregation at smart backends while having "dumb" scrapers and forwarders upstream.

One of the things this would let us do is aggregate thousands of metrics for similar pods running processing-heavy / task-heavy workloads. Many of these are categorized by pod labels. At scale, I don't need the pod dimension, but I do want to keep data for related workloads segmented.

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality) Increases, but the behavior is opt-in.

Which issue(s) this PR fixes: #2551

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) label Mar 20, 2025
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jacobstr
Once this PR has been reviewed and has the lgtm label, please assign dgrisonnet for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) and needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) labels Mar 20, 2025
@k8s-ci-robot (Contributor) commented:

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) label Mar 20, 2025
@@ -40,7 +40,12 @@ var (
podStatusReasons = []string{"Evicted", "NodeAffinity", "NodeLost", "Shutdown", "UnexpectedAdmissionError"}
)

-func podMetricFamilies(allowAnnotationsList, allowLabelsList []string) []generator.FamilyGenerator {
+func podMetricFamilies(injectPerSeriesMetadata bool, allowAnnotationsList []string, allowLabelsList []string) []generator.FamilyGenerator {
mc := &MetricConfig{
@jacobstr (Author) commented Mar 20, 2025:

Doing this so that instead of adding 3 arguments to each generator, I add one. I suppose it makes it less "pure", as the behavior of the function is dictated by a complex configuration object.

LabelKeys: []string{"phase"},
LabelValues: []string{p.n},
@jacobstr (Author) commented:

We were shadowing the outer p (Pod).
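As an aside, here is a minimal Go illustration of the kind of shadowing being fixed; the names are illustrative stand-ins, not the actual kube-state-metrics code.

package main

import "fmt"

type pod struct{ name string }

func main() {
	p := pod{name: "app-5c674df6c4-qx65r"} // outer p: the Pod being processed

	phases := []struct{ n string }{{"Pending"}, {"Running"}}
	for _, p := range phases { // inner p shadows the outer Pod inside the loop
		fmt.Println(p.n) // refers to the loop element, not the Pod
	}

	fmt.Println(p.name) // the outer p is unaffected, but the reuse invites mistakes
}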

@@ -38,6 +39,12 @@ var (
conditionStatuses = []v1.ConditionStatus{v1.ConditionTrue, v1.ConditionFalse, v1.ConditionUnknown}
)

type MetricConfig struct {
@jacobstr (Author) commented:

Could be made private since it's built / used internally to this module.
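For concreteness, a rough sketch of what the bundled config could look like. Aside from InjectPerSeriesMetadata (used in the injectLabelsAndAnnos hunk below), the field names are assumptions based on this discussion, not the PR's exact code.

package store // sketch only; the package name mirrors internal/store, where the diff lives

// MetricConfig bundles the per-generator options so each family generator
// takes one parameter instead of three.
type MetricConfig struct {
	// InjectPerSeriesMetadata enables copying allow-listed labels and
	// annotations onto every individual series (opt-in, default false).
	InjectPerSeriesMetadata bool
	// The two existing allow-lists, carried through unchanged.
	AllowAnnotationsList []string
	AllowLabelsList      []string
}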

@@ -175,6 +182,17 @@ func isPrefixedNativeResource(name v1.ResourceName) bool {
return strings.Contains(string(name), v1.ResourceDefaultNamespacePrefix)
}

// convenience wrapper to inject allow-listed labels and annotations to a metric if per-series injection is enabled.
func injectLabelsAndAnnos(m *metric.Metric, metricConfig *MetricConfig, obj *metav1.ObjectMeta) *metric.Metric {
if !metricConfig.InjectPerSeriesMetadata {
@jacobstr (Author) commented:

I think with the pass-by-reference + early guard clause, this should be "0-cost" for those leaving --inject-per-series-metadata at its default of false.
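To make the guard-clause shape concrete, here is a sketch continuing the MetricConfig sketch above, with simplified stand-in types (the real code operates on metric.Metric and metav1.ObjectMeta); the body is an assumption about the approach, not the PR's actual diff.

// Simplified stand-ins for metric.Metric and metav1.ObjectMeta.
type Metric struct {
	LabelKeys   []string
	LabelValues []string
	Value       float64
}

type ObjectMeta struct {
	Labels      map[string]string
	Annotations map[string]string
}

// injectLabelsAndAnnos appends allow-listed labels and annotations to a single
// series. The early return keeps the default (disabled) path essentially free:
// one pointer dereference and a bool check, no extra allocations.
func injectLabelsAndAnnos(m *Metric, cfg *MetricConfig, obj *ObjectMeta) *Metric {
	if !cfg.InjectPerSeriesMetadata {
		return m
	}
	for _, k := range cfg.AllowLabelsList {
		if v, ok := obj.Labels[k]; ok {
			// The real code would sanitize k into a valid Prometheus
			// label name, as the kube_<resource>_labels gauges do.
			m.LabelKeys = append(m.LabelKeys, "label_"+k)
			m.LabelValues = append(m.LabelValues, v)
		}
	}
	for _, k := range cfg.AllowAnnotationsList {
		if v, ok := obj.Annotations[k]; ok {
			m.LabelKeys = append(m.LabelKeys, "annotation_"+k)
			m.LabelValues = append(m.LabelValues, v)
		}
	}
	return m
}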

@jacobstr force-pushed the koobz/per-metric-metadata branch from 42b2197 to 9538070 on March 21, 2025 08:13
@@ -82,7 +87,7 @@ func podMetricFamilies(allowAnnotationsList, allowLabelsList []string) []generator.FamilyGenerator {
createPodSpecVolumesPersistentVolumeClaimsInfoFamilyGenerator(),
createPodSpecVolumesPersistentVolumeClaimsReadonlyFamilyGenerator(),
createPodStartTimeFamilyGenerator(),
-createPodStatusPhaseFamilyGenerator(),
+createPodStatusPhaseFamilyGenerator(mc),
@jacobstr (Author) commented:

The full implementation would add this param to every metric family generator. This is also the reason I'm bundling up the configuration in a struct and passing it as a single unit.
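Continuing the same sketch, this is roughly how one family-generator constructor might thread the config through; the FamilyGenerator shape here is a simplified assumption, not KSM's real generator API.

// FamilyGenerator is a simplified stand-in for KSM's generator type.
type FamilyGenerator struct {
	Name         string
	GenerateFunc func(meta *ObjectMeta) []*Metric
}

func createPodStatusPhaseFamilyGenerator(mc *MetricConfig) FamilyGenerator {
	return FamilyGenerator{
		Name: "kube_pod_status_phase",
		GenerateFunc: func(meta *ObjectMeta) []*Metric {
			ms := []*Metric{
				{LabelKeys: []string{"phase"}, LabelValues: []string{"Running"}, Value: 1},
			}
			// One extra call per series; a no-op when the flag is off.
			for i := range ms {
				ms[i] = injectLabelsAndAnnos(ms[i], mc, meta)
			}
			return ms
		},
	}
}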

See the discussion here kubernetes#2551
This frees users from having to join on the various label series, which
is useful when infrastructure has agents that can't do recording rules.
@jacobstr force-pushed the koobz/per-metric-metadata branch from 9538070 to a4c4905 on March 21, 2025 08:20
@k8s-ci-robot removed the needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) label Mar 21, 2025
@jacobstr marked this pull request as draft March 21, 2025 08:21
@k8s-ci-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) label Mar 21, 2025
@mrueg (Member) commented Mar 24, 2025:

Thanks for your contribution!
Unfortunately, we are not able to accept it, as it goes against the intended design of kube-state-metrics, and as maintainers we do not want to support this.
https://github.com/kubernetes/kube-state-metrics/blob/main/docs/design/metrics-best-practices.md#avoid-pre-computation

A solution like this has come up multiple times; for more information, please take a look at:
#2428 (comment)
#1758 (comment)
#2129 (comment)

@jacobstr (Contributor, Author) commented Mar 25, 2025:

@mrueg if you can humor this for a moment, the intent here is to reduce the cardinality of metrics when they are forwarded and stored by giving cluster administrators an additional lever by which they can aggregate.

Yes, this can be done with recording rules, but those are (generally) only applied at backends with a TSDB. I've got one technology, albeit a closed-source cloud offering from Grafana (Adaptive Metrics), where having pod labels would let me reduce 10k time series for a given pod metric down to something much more manageable by classifying workloads and aggregating them by a custom classification label.

The circumstances in the wild can get somewhat complex and perhaps arbitrary, so I don't want to over-index on what I'm doing necessarily, other than to suggest that allowing for labels directly removes a constraint on having to do aggregation against the kube_pod_labels time series. A concrete result of that is that I might be able to run Prometheus in agent mode "sooner." In general, I think this gives cluster administrators more flexibility around their metrics topology.


I wanted to talk about cardinality a bit. There's an assertion that this will increase cardinality that I want to push back on.

Here's what I get when I query the kube-state-metrics endpoint directly:

kube_pod_info{namespace="apps",pod="app-5c674df6c4-qx65r",uid="xxx",host_ip="xxx",pod_ip="xxx",node="xxx",created_by_kind="ReplicaSet",created_by_name="app-5c674df6c4",priority_class="",host_network="false"} 1

The metric already includes the pod name. Adding a custom label mapping such as owner="team-xyz" is not going to increase the overall number of kube_pod_info time series under what I would regard as "normal" circumstances‡. There is already a unique time series per pod - the cardinality generally can't get worse than a unique value for every instance of a pod.

‡ I can contrive a scenario where, e.g., the pod updates its labels while it's running. I think that's quite atypical, and in those cases this would indeed be a questionable fit. However, I still think it's manageable in a multitude of ways - such as not enabling the capability for labels that have been made dynamic.

kube_pod_info{namespace="apps",pod="app-5c674df6c4-qx65r",uid="xxx",host_ip="xxx",pod_ip="xxx",node="xxx",created_by_kind="ReplicaSet",created_by_name="app-5c674df6c4",priority_class="",host_network="false",label_owner="footeam"}

Of course, you can't simply drop the pod ID during relabelling as you'll end up with multiple ambiguous time series. So I want some way to segment the metrics so that when I aggregate them I can do slightly better than aggregating all pods by namespace.


Additional memory usage also came up. I think this can be implemented in a largely zero-cost manner unless enabled. If a single label is mapped, I think we're talking about a 5-10% increase per label depending on the key/value lengths, though the Prometheus client libraries might be clever here and use shared string references rather than allocating 10k copies of "footeam".
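As a side note on the "shared string references" point: in Go, copying a string value copies only the small header (pointer + length), not the underlying bytes, so fanning one value out across many series is cheap. Whether the values actually arrive as a single shared string or as separate per-object allocations from API decoding is a different question; the sketch below only demonstrates the header-copy behavior.

package main

import (
	"fmt"
	"unsafe"
)

func main() {
	owner := "footeam"

	// Fan the same label value out to 10k hypothetical series.
	values := make([]string, 10_000)
	for i := range values {
		values[i] = owner // copies the string header only; the bytes are shared
	}

	// Each entry costs one string header regardless of the value's length.
	fmt.Printf("per-entry overhead: %d bytes\n", unsafe.Sizeof(values[0])) // 16 on 64-bit
}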

Again, we're hard-pressed to do worse than the pod label, and we can't simply labeldrop pod - we need to aggregate on "something."
