Is there an existing feature request for this?
Problem or Motivation
In a shared KubeAirunway cluster, there is no way to tell who is burning through GPUs or using particular models. One team can crowd out everyone else without anyone noticing. That pushes people to deploy a separate instance per team just to keep the peace, which wastes compute and adds management overhead. "Airtimebill" would add straightforward usage tracking by user or namespace, so you can see the heaviest users, nudge them with quotas if needed, and share a dashboard for transparency.
Down the road, for larger self-hosted setups, it could also generate reports for showback, or for chargeback if you need to bill teams for their usage.
Proposed Solution
Airtimebill tracks GPU, memory, and inference usage per user/namespace via labeled Prometheus metrics from an auth proxy, and surfaces them through simple dashboards and CSV exports to support fair sharing in multi-team setups.
Components:
- Proxy Layer: Injects x-user and x-ns headers from OIDC claims; logs requests.
- Metrics Endpoint: Backend exposes /metrics with user and namespace labels (the CRD controller annotates pods).
- Storage/Query: Prometheus (user-provided), or an in-memory store for small setups; query with sum(gpu_time) by (user, ns).
- UI: Embedded Grafana panel or static charts; export via PromQL-to-CSV.
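To make the in-memory option and the CSV export more concrete, here is a rough Python sketch of what the aggregation could look like. It is only an illustration: `UsageSample`, `InMemoryUsageStore`, and the `gpu_seconds` field are hypothetical names, not existing KubeAirunway code, and the grouping simply mirrors the sum(gpu_time) by (user, ns) query from the list above.

```python
import csv
import io
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSample:
    """One proxied inference request (hypothetical record shape)."""
    user: str
    ns: str
    gpu_seconds: float


class InMemoryUsageStore:
    """Small stand-in for Prometheus, intended for small setups only."""

    def __init__(self):
        self._gpu_seconds = defaultdict(float)
        self._requests = defaultdict(int)

    def record(self, sample):
        # Group by (user, ns), like `sum(gpu_time) by (user, ns)`.
        key = (sample.user, sample.ns)
        self._gpu_seconds[key] += sample.gpu_seconds
        self._requests[key] += 1

    def totals(self):
        """Return (user, ns, gpu_seconds, requests) rows, heaviest GPU users first."""
        rows = [
            (user, ns, self._gpu_seconds[(user, ns)], self._requests[(user, ns)])
            for (user, ns) in self._gpu_seconds
        ]
        return sorted(rows, key=lambda row: row[2], reverse=True)

    def to_csv(self):
        """Render the totals as CSV text for showback-style exports."""
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(["user", "ns", "gpu_seconds", "requests"])
        writer.writerows(self.totals())
        return buf.getvalue()
```

The proxy layer would call `record()` once per request with the user and namespace it resolved, and the UI or an export endpoint would call `to_csv()`. With a real Prometheus backend the same grouping would come from the query instead of this store.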
Alternatives Considered
No response
Feature Area
Metrics / Monitoring
How important is this feature to you?
Nice to have
Mockups or Examples
No response
Additional Context
No response