Description
First of all, I really appreciate the work you are doing with this Helm chart. It helps a lot in building powerful observability solutions in a simple way!
I am having problems fetching kube-scheduler and kube-controller-manager metrics with the Cluster Metrics feature in an EKS cluster. A configuration like the following does not work for me:
clusterMetrics:
  enabled: true
  apiServer:
    enabled: true
  cAdvisor:
    enabled: true
  controlPlane:
    enabled: true
  kubeControllerManager:
    enabled: true
  kubeScheduler:
    enabled: true
  windows-exporter:
    enabled: false
  node-exporter:
    enabled: true
  kube-state-metrics:
    enabled: true
The AWS documentation mentions:
"For clusters that are Kubernetes version 1.28 and above, Amazon EKS also exposes metrics under the API group metrics.eks.amazonaws.com. These metrics include control plane components such as kube-scheduler and kube-controller-manager"
I converted part of a sample Prometheus configuration from the AWS documentation into the following extra config for the Alloy metrics collector, and with it I am now able to scrape kube-scheduler and kube-controller-manager metrics from the metrics.eks.amazonaws.com API group:
alloy-metrics:
  enabled: true
  extraConfig: |
    discovery.kubernetes "kube_scheduler" {
      role = "endpoints"
    }

    discovery.kubernetes "kube_controller_manager" {
      role = "endpoints"
    }

    // Keep only the default/kubernetes service "https" endpoint, which fronts the EKS control plane.
    discovery.relabel "kube_scheduler" {
      targets = discovery.kubernetes.kube_scheduler.targets

      rule {
        source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_service_name", "__meta_kubernetes_endpoint_port_name"]
        regex         = "default;kubernetes;https"
        action        = "keep"
      }
    }

    discovery.relabel "kube_controller_manager" {
      targets = discovery.kubernetes.kube_controller_manager.targets

      rule {
        source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_service_name", "__meta_kubernetes_endpoint_port_name"]
        regex         = "default;kubernetes;https"
        action        = "keep"
      }
    }

    // Scrape kube-scheduler metrics exposed by EKS under the metrics.eks.amazonaws.com API group.
    prometheus.scrape "kube_scheduler" {
      targets         = discovery.relabel.kube_scheduler.output
      forward_to      = [prometheus.remote_write.metricstore.receiver]
      job_name        = "kube-scheduler"
      scrape_interval = "30s"
      metrics_path    = "/apis/metrics.eks.amazonaws.com/v1/ksh/container/metrics"
      scheme          = "https"

      authorization {
        type             = "Bearer"
        credentials_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      }

      tls_config {
        ca_file              = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
        insecure_skip_verify = true
      }
    }

    // Scrape kube-controller-manager metrics from the same API group.
    prometheus.scrape "kube_controller_manager" {
      targets         = discovery.relabel.kube_controller_manager.output
      forward_to      = [prometheus.remote_write.metricstore.receiver]
      job_name        = "kube-controller-manager"
      scrape_interval = "30s"
      metrics_path    = "/apis/metrics.eks.amazonaws.com/v1/kcm/container/metrics"
      scheme          = "https"

      authorization {
        type             = "Bearer"
        credentials_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      }

      tls_config {
        ca_file              = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
        insecure_skip_verify = true
      }
    }
The problem is that the ClusterRole used by the Alloy metrics pod needs to be patched with the following additional permission in order to access the metrics.eks.amazonaws.com endpoints:
{
  "apiGroups": [
    "metrics.eks.amazonaws.com"
  ],
  "resources": [
    "kcm/metrics",
    "ksh/metrics"
  ],
  "verbs": [
    "get"
  ]
}
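For reference, the same permission expressed as a rule in the ClusterRole manifest (YAML) would look roughly like this:
- apiGroups:
    - metrics.eks.amazonaws.com
  resources:
    - kcm/metrics
    - ksh/metrics
  verbs:
    - get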
When I upgrade the chart, the patched permissions are lost and the ClusterRole has to be patched again before the metrics can be collected. I looked at the rbac.yaml template in the Alloy chart to see whether extra rules could be added, but they appear to be hardcoded.
Is there any workaround here that could be provided? Maybe I am missing something. Thanks again!
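In the meantime, the stopgap I can think of is to manage an extra ClusterRole and ClusterRoleBinding outside of the chart and bind it to the ServiceAccount used by the Alloy metrics pods; since RBAC permissions are additive, a separate object would survive chart upgrades. A rough sketch of what I mean (the ServiceAccount name and namespace below are assumptions based on my release name and would need to be adjusted):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alloy-metrics-eks-control-plane   # name chosen for illustration
rules:
  - apiGroups:
      - metrics.eks.amazonaws.com
    resources:
      - kcm/metrics
      - ksh/metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: alloy-metrics-eks-control-plane   # name chosen for illustration
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: alloy-metrics-eks-control-plane
subjects:
  - kind: ServiceAccount
    name: k8s-monitoring-alloy-metrics    # assumption: <release-name>-alloy-metrics
    namespace: monitoring                 # assumption: namespace where the chart is installed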