Implement Prometheus metrics for gpu-kubelet-plugin, compute-domain-plugin, and dra-controller#964
Conversation
bd4d808 to
5a94361
Compare
2974ec5 to
2d461d3
Compare
shivamerla
left a comment
There was a problem hiding this comment.
Great work! left some comments. Thanks.
Iterating on user-facing metric names is really important before a first release. Hence, I'd like to ask a few questions (I am sure there are more):
Just thinking out loud! This needs a bit of brainstorm between a number of people that ask critical questions, and we also need to look at different variants side by side. Bad naming choices will either stick around forever, or create quite a lot of migration pain later on. |
What does this really measure and why is this useful? Which labels did you consider adding to this metric, and why? |
|
As a general note, let's make sure to read https://prometheus.io/docs/practices/naming :-) (before trying to construct metric names).
Why would this be useful? Let's think about this end to end: let's construct a query-time (dashboarding/alerting) use case that demonstrates value (there may be one, but let's make it obvious). About "by status": which status do you refer to? About "Expose ComputeDomain objects": what do you mean with that? Should this gauge reflect a current count?
Same here: let's discuss the primary use case.
Same here. When I look at the name I think we maybe should refrain from adding CD daemon/controller-emitted metrics here in this first patch. The kubelet plugin metrics you have listed all seem to be rather sane, but there are so many questions about the currently proposed metrics emitted by CD daemon/controller. I think this patch may land faster if we keep it focused on the basic plugin metrics. |
2d461d3 to
9001190
Compare
d74cd23 to
afebe5b
Compare
afebe5b to
6d5d05e
Compare
|
@shengnuo Could you share a few examples of the metrics being collected, along with any sample dashboards (if available)? |
Signed-off-by: Sheng Lin <shelin@nvidia.com>
6d5d05e to
0dcfe57
Compare
|
@shivamerla Updated the PR description with sample Prometheus metrics |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: shengnuo, shivamerla The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Fixes #352
Metrics prefix
All DRA Prometheus metrics use prefix
nvidia_gpu(exported names arenvidia_gpu_dra_<name>).Kubelet Plugins (GPU & ComputeDomain)
requests_totaldriver,operationrequest_duration_secondsdriver,operation0.05*2^n, 0<=n<=8requests_inflightdriver,operationprepared_devicesnode,driver,device_typenode_prepare_errors_totaldriver,error_typenode_unprepare_errors_totaldriver,error_typeCompute-domain controller
compute_domain_infostatusComputeDomainobjects by status, from informer cacheSample
Sample metrics on the GPU kubelet plugin after preparing and unpreparing a GPU and a MIG device