Fetching kube-scheduler and kube-controller-manager metrics from AWS EKS Control Plane #1219

Open
@RaulFiol93

Description

First of all, I really appreciate the work you are doing with this helm chart. It helps a lot to build powerful observability solutions in a simple way!

I am having trouble fetching kube-scheduler and kube-controller-manager metrics with the Cluster Metrics feature on an EKS cluster. A configuration like the following does not work for me:

clusterMetrics:
  enabled: true
  apiServer:
    enabled: true
  cAdvisor:
    enabled: true
  controlPlane:
    enabled: true
  kubeControllerManager:
    enabled: true
  kubeScheduler:
    enabled: true
  windows-exporter:
    enabled: false
  node-exporter:
    enabled: true
  kube-state-metrics:
    enabled: true

The AWS documentation states:

"For clusters that are Kubernetes version 1.28 and above, Amazon EKS also exposes metrics under the API group metrics.eks.amazonaws.com. These metrics include control plane components such as kube-scheduler and kube-controller-manager"

I added the following extra config to the Alloy metrics collector, converted from part of a sample Prometheus configuration in the AWS documentation. With it, I am now able to scrape kube-scheduler and kube-controller-manager metrics from the metrics.eks.amazonaws.com API group:

alloy-metrics:
  enabled: true
  extraConfig: |
    discovery.kubernetes "kube_scheduler" {
      role = "endpoints"
    }

    discovery.kubernetes "kube_controller_manager" {
      role = "endpoints"
    }

    discovery.relabel "kube_scheduler" {
      targets = discovery.kubernetes.kube_scheduler.targets

      rule {
        source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_service_name", "__meta_kubernetes_endpoint_port_name"]
        regex         = "default;kubernetes;https"
        action        = "keep"
      }
    }

    discovery.relabel "kube_controller_manager" {
      targets = discovery.kubernetes.kube_controller_manager.targets

      rule {
        source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_service_name", "__meta_kubernetes_endpoint_port_name"]
        regex         = "default;kubernetes;https"
        action        = "keep"
      }
    }

    prometheus.scrape "kube_scheduler" {
      targets         = discovery.relabel.kube_scheduler.output
      forward_to      = [prometheus.remote_write.metricstore.receiver]
      job_name        = "kube-scheduler"
      scrape_interval = "30s"
      metrics_path    = "/apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics"
      scheme          = "https"

      authorization {
        type             = "Bearer"
        credentials_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      }

      tls_config {
        ca_file              = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
        insecure_skip_verify = true
      }
    }

    prometheus.scrape "kube_controller_manager" {
      targets         = discovery.relabel.kube_controller_manager.output
      forward_to      = [prometheus.remote_write.metricstore.receiver]
      job_name        = "kube-controller-manager"
      scrape_interval = "30s"
      metrics_path    = "/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics"
      scheme          = "https"

      authorization {
        type             = "Bearer"
        credentials_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      }

      tls_config {
        ca_file              = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
        insecure_skip_verify = true
      }
    }

The problem is that the ClusterRole created by the chart, which the Alloy pod uses to collect metrics, has to be patched with the following permissions before it can access the metrics.eks.amazonaws.com endpoint:

{
  "effect": "allow",
  "apiGroups": [
    "metrics.eks.amazonaws.com"
  ],
  "resources": [
    "kcm/metrics",
    "ksh/metrics"
  ],
  "verbs": [
    "get"
  ]
}

When I upgrade the chart, the patch with the new permissions is lost, and the ClusterRole has to be patched again before the metrics come back. I looked at the rbac.yaml template in the Alloy chart to check whether permissions could be added, but they seem to be hardcoded.
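One stopgap that might avoid re-patching is to keep the extra rule in a separate ClusterRole that the chart does not manage, and bind it to the service account the Alloy metrics pod runs as. The manifest below is only a sketch under assumptions: the ClusterRole and ClusterRoleBinding name, the service account name, and the namespace are all hypothetical and would need to match the actual deployment. Only the API group, resources, and verb come from the rule shown above.

```yaml
# Sketch only: a standalone ClusterRole outside the Helm release,
# so a chart upgrade would not overwrite it.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eks-control-plane-metrics        # hypothetical name
rules:
  - apiGroups: ["metrics.eks.amazonaws.com"]
    resources: ["kcm/metrics", "ksh/metrics"]
    verbs: ["get"]
---
# Grants the rule above to the service account used by the Alloy metrics pod.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: eks-control-plane-metrics        # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eks-control-plane-metrics
subjects:
  - kind: ServiceAccount
    name: alloy-metrics                  # assumption: replace with the real service account name
    namespace: monitoring                # assumption: replace with the release namespace
```

Because Kubernetes RBAC permissions are additive, the service account would get these rules on top of whatever the chart's own ClusterRole grants. This still lives outside the chart, so a supported option in the chart values would be preferable.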

Is there any workaround here that could be provided? Maybe I am missing something. Thanks again!
