enable and scrape CSI driver metrics#1773

Open: matthias-horne wants to merge 1 commit into master from csi-driver-metrics

@matthias-horne
Contributor

How to categorize this PR?

/area monitoring
/kind enhancement
/platform aws

What this PR does / why we need it:

This PR enables metrics for the CSI driver and adds a ServiceMonitor to scrape them.

Which issue(s) this PR fixes:
Fixes #1568

Special notes for your reviewer:

Release note:

Metrics for the CSI driver controller are now enabled

@matthias-horne matthias-horne requested a review from a team as a code owner April 17, 2026 14:26
@gardener-prow gardener-prow Bot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 17, 2026
@github-actions
Contributor

This change adds comprehensive monitoring capabilities to the CSI driver controller by exposing metrics on port 3301 and integrating with Prometheus for observability of AWS EBS CSI operations.

Walkthrough

  • New Feature: Added metrics endpoint exposure on port 3301 to the CSI driver controller for monitoring and observability
  • New Feature: Created Kubernetes Service to expose the metrics port with proper networking annotations for scraping access
  • New Feature: Implemented ServiceMonitor configuration for Prometheus integration, enabling collection of AWS EBS CSI-specific metrics including API request duration, errors, throttles, and EC2 detach pending time
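As a rough illustration of the ServiceMonitor described in the walkthrough, the sketch below builds a manifest as a plain Python dict and restricts scraping to the four metric families named above. The object name, namespace, selector labels, and port name are hypothetical and not taken from this PR; only the `monitoring.coreos.com/v1` field names and the metric names are real. Whether the PR uses a keep rule like this is an assumption.

```python
import json
import re

# Metric families named in the walkthrough above.
METRIC_FAMILIES = [
    "aws_ebs_csi_api_request_duration_seconds",
    "aws_ebs_csi_api_request_errors_total",
    "aws_ebs_csi_api_request_throttles_total",
    "aws_ebs_csi_ec2_detach_pending_seconds",
]

# Histograms are exposed as <name>_bucket, <name>_sum and <name>_count,
# so the keep pattern allows those optional suffixes.
KEEP_REGEX = "(" + "|".join(METRIC_FAMILIES) + ")(_bucket|_sum|_count)?"


def build_service_monitor(name, namespace, match_labels, port_name):
    """Return a ServiceMonitor manifest as a dict (all arguments are hypothetical)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": match_labels},
            "endpoints": [
                {
                    # Refers to the *name* of the Service port, not its number.
                    "port": port_name,
                    "metricRelabelings": [
                        {
                            "sourceLabels": ["__name__"],
                            "action": "keep",
                            "regex": KEEP_REGEX,
                        }
                    ],
                }
            ],
        },
    }


def is_kept(metric_name):
    """Mimic Prometheus relabeling, which anchors the regex at both ends."""
    return re.fullmatch(KEEP_REGEX, metric_name) is not None


if __name__ == "__main__":
    manifest = build_service_monitor(
        name="csi-driver-controller",           # hypothetical
        namespace="shoot--example",              # hypothetical
        match_labels={"app": "csi-driver-controller"},  # hypothetical
        port_name="metrics",                     # hypothetical name for port 3301
    )
    print(json.dumps(manifest, indent=2))
```

A keep rule of this kind drops every series whose name does not match, which is one way to bound what Prometheus ingests from the endpoint.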

Model: claude-sonnet-4-20250514 | Prompt Tokens: 922 | Completion Tokens: 143

@gardener-prow gardener-prow Bot added the cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. label Apr 17, 2026
@federated-github-access federated-github-access Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. and removed ok-to-test Indicates a non-member PR verified by an org member that is safe to test. labels Apr 17, 2026
@hebelsan
Contributor

In the issue, you mentioned that we should consider putting this feature behind a flag so it can be enabled on demand, given that it could scale into an unbounded number of metrics.
What led to the decision to enable it by default now?

@matthias-horne
Contributor Author

> In the issue, you mentioned that we should consider putting this feature behind a flag so it can be enabled on demand, given that it could scale into an unbounded number of metrics. What led to the decision to enable it by default now?

I took a detailed look at the exposed metrics and discussed them with @rickardsjp. The following table shows the number of individual metrics that will be generated.

| Metric | Calculation | Count |
| --- | --- | --- |
| `aws_ebs_csi_api_request_duration_seconds` | 16 request types * 14 metrics/type | 224 |
| `aws_ebs_csi_api_request_errors_total` | 16 request types * 1 counter | max. 16 |
| `aws_ebs_csi_api_request_throttles_total` | 16 request types * 1 counter | max. 16 |
| `aws_ebs_csi_ec2_detach_pending_seconds` | 1 counter * number of volumes pending detachment [1] | max. number of PVCs |

We concluded that the number of additional metrics should be manageable for Prometheus. In the worst case, the retention time might have to be reduced. The low risk, combined with the need to deliver the additional metrics promptly, justifies shipping this feature without an additional flag.
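The arithmetic behind the table can be reproduced with a small sketch. The constants come from the table; the function name and the example volume count are illustrative only, and the detach-pending term is an upper bound rather than a typical value.

```python
REQUEST_TYPES = 16          # from the table
SERIES_PER_HISTOGRAM = 14   # "14 metrics/type" from the table


def estimate_series(max_pending_volumes):
    """Upper bound on the number of additional metrics, per the table."""
    counts = {
        "aws_ebs_csi_api_request_duration_seconds": REQUEST_TYPES * SERIES_PER_HISTOGRAM,
        "aws_ebs_csi_api_request_errors_total": REQUEST_TYPES,
        "aws_ebs_csi_api_request_throttles_total": REQUEST_TYPES,
        # One entry per volume currently pending detachment; bounded by the PVC count.
        "aws_ebs_csi_ec2_detach_pending_seconds": max_pending_volumes,
    }
    counts["total"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    # 100 pending volumes is an arbitrary example value.
    for name, count in estimate_series(100).items():
        print(f"{name}: {count}")
```

With no volumes pending detachment the bound is 224 + 16 + 16 = 256; every pending volume adds one more.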

Footnotes

  1. Only created when the volume is not detached on the first attempt, and deleted as soon as the volume is successfully detached. If both happen between two scrapes, the metric might never show up in Prometheus.
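The scrape-timing caveat in the footnote can be sketched as follows. The model assumes scrapes happen at a fixed interval starting at `first_scrape_at`; the function name and the 30-second interval are illustrative assumptions, not values taken from this PR.

```python
import math


def is_observed(appear_at, disappear_at, scrape_interval, first_scrape_at=0):
    """Return True if at least one scrape falls within [appear_at, disappear_at)."""
    if disappear_at <= appear_at:
        return False
    # Index of the first scrape at or after the moment the metric appears.
    k = max(0, math.ceil((appear_at - first_scrape_at) / scrape_interval))
    next_scrape = first_scrape_at + k * scrape_interval
    return next_scrape < disappear_at


if __name__ == "__main__":
    # Metric exists from t=31s to t=59s; scrapes at 0, 30, 60, ... miss it.
    print(is_observed(31, 59, scrape_interval=30))
    # Metric exists from t=25s to t=35s; the scrape at t=30s sees it.
    print(is_observed(25, 35, scrape_interval=30))
```

Under this model, a detach that stays pending for less than one scrape interval can be missed entirely, while anything pending for at least a full interval is always seen by some scrape.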

Contributor

@hebelsan hebelsan left a comment

/lgtm
/approve

@gardener-prow gardener-prow Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 21, 2026
@gardener-prow

gardener-prow Bot commented Apr 21, 2026

LGTM label has been added.

Details: Git tree hash: a71ed1a188338237de34ef4f3397101d7bb8f6e4

@gardener-prow

gardener-prow Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hebelsan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2026
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/monitoring: Monitoring (including availability monitoring and alerting) related
  • cla: yes: Indicates the PR's author has signed the cla-assistant.io CLA.
  • kind/enhancement: Enhancement, improvement, extension
  • lgtm: Indicates that a PR is ready to be merged.
  • needs-ok-to-test: Indicates a PR that requires an org member to verify it is safe to test.
  • size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Missing metric in Gardener setup

2 participants