Initialize DRA request metrics series at startup #1029
dims wants to merge 1 commit into kubernetes-sigs:main
Pre-create the DRA request metric series before exposing /metrics so the first Prometheus scrape includes zero-valued request counters, histograms, and in-flight gauges even before any request has been processed. Add a regression test that exercises the metrics handler prior to the first DRA request and confirms the initialized series are present in the exposition output. Also clarify the kubelet-plugin error metric descriptions to match their current usage.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims
/hold

/assign @shengnuo
@dims: GitHub didn't allow me to assign the following users: shengnuo. Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
This PR fixes the root cause of the metrics smoke-test failure seen in #1025.
While #1028 relaxed the test, the real issue was that the DRA request metrics were not visible on `/metrics` until the first `prepare` or `unprepare` call created labeled series, so a fresh kubelet-plugin scrape could show `nvidia_dra_prepared_devices` but not `nvidia_dra_requests_total`, as in the Prow log.
This change initializes the request metric series at startup for both kubelet plugins before the HTTP endpoint is exposed, keeps them visible at `0` on the first scrape, adds a regression test for that behavior, and updates the error-metric descriptions to match current usage.