docs: rewrite README with results-first layout, D2 diagram, recovery chart
- Reorder sections: results (scenario, measurements, key observations,
why-the-numbers-hold) first, then architecture, quickstart, project
structure, then the rest.
- Replace the mermaid block with a D2 source at
docs/diagrams/architecture.d2 and a rendered SVG. Add a 'make
diagrams' target so the SVG is regenerable.
- Add docs/media/restart-recovery.svg: a bar chart comparing
pre-restart (0.688) and post-restart (0.996) accuracy against the
steady-state baseline (0.96), with the recovery delta called out.
- Add standard badges (Go, Kubernetes, Kubebuilder, Helm, Tests, E2E).
- Drop the License section; the LICENSE file remains and GitHub
surfaces it in the About sidebar.
- Tighter headings and prose throughout; no em-dashes.
A namespaced Kubernetes operator that tracks cumulative GPU-hour
4
-
consumption for pods matching a label selector against a rolling-window
5
-
quota, and enforces that quota via eviction, pause annotations, or
6
-
alert-only mode.
3
+
Kubernetes operator that meters cumulative GPU-hours against a rolling-window quota and enforces it via eviction, pause annotations, or alert-only mode. The accounting engine is stateless: every reconcile recomputes from the live API-server view, so a restarted operator converges back to within 0.4% of perfect without replaying any persisted state.
*Live Grafana panels driven by the operator's Prometheus metrics.
11
-
The gauge climbs past the quota line as 8 pods run against a quota
12
-
deliberately crossable in seconds. Midway through, the operator pod is
13
-
deleted; the pods-tracked stat holds at 8, showing state is rebuilt
14
-
from the API-server view on restart rather than restored from a cache.
15
-
Regenerate with `make demo`.*
16
-
17
-
## Why this matters
18
-
19
-
Cumulative GPU-hour budgets are the quota primitive AI-cloud platforms
20
-
use to keep shared fleets fair and predictable: teams get an allowance
21
-
per window, workloads stay fungible across nodes, and billing stays
22
-
out of the hot path. This operator is a self-contained implementation
23
-
of that control plane: rolling-window accounting, grace-period
24
-
enforcement, stateless restart recovery, all driven by a single CRD
25
-
(`GPUWorkloadBudget`, group `budget.zxuhan.dev`, version `v1alpha1`).
26
-
Accounting is derived from the API-server view on every reconcile, so
27
-
a restart recovers from cluster state rather than from a cache;
28
-
[docs/accounting-model.md](docs/accounting-model.md) explains the
29
-
bounded-error guarantees.
30
-
31
-
When `nvidia.com/gpu` is absent (e.g. a kind cluster) the accounting
32
-
engine falls back to scaled CPU-second counting: set
33
-
`spec.gpuResourceName: cpu` and the same control loop drives against
34
-
`resource.Quantity` CPU requests. The e2e and bench suites both rely
35
-
on this path.
14
+
*Grafana panels driven by the operator's Prometheus metrics. The gauge climbs past the quota line as 8 pods run against a quota crossable in seconds. Midway through, the operator pod is deleted. The tracked-pods stat holds at 8: state is rebuilt from the API-server view on restart, not from a cache. Regenerate with `make demo`.*
15
+
16
+
## Results
17
+
18
+
Numbers below were measured on a kind cluster (M-series laptop, 2026-04-20) and are checked in under [`bench-results/2026-04-20/`](bench-results/2026-04-20/) and [`chaos-results/2026-04-20/`](chaos-results/2026-04-20/). The harness owns the numbers; the README quotes them. Regenerate with `make bench` and `make chaos`.
19
+
20
+
### Scenario
21
+
22
+
| Parameter | Steady-state bench | Chaos run |
23
+
|---|---|---|
24
+
| pods | 50 | 50 |
25
+
| arrival rate | 10 pods/s | 10 pods/s |
26
+
| per-pod runtime | 30s | 60s |
27
+
| resource per pod | 0.1 CPU (simulated GPU) | 0.1 CPU |
28
+
| snapshots | t = 45s | t = 15s, t = 120s |
29
+
| event | none | operator pod deleted between snapshots |
- **State survives restart without persistence.** `tracked_pods` stays at 50/50 across the operator kill. The replacement pod's informer rebuilds from the API-server view and re-observes every pod. Nothing is replayed from a cache because nothing was ever cached.
45
+
- **Post-restart accuracy matches steady-state.** 0.996 after recovery is at or above the 0.96 bench number from a clean run. The restart cost no measurable accuracy.
46
+
- **The pre-restart 0.688 is a transient, not a regression.** It is a 15-second snapshot of a freshly launched workload, taken one reconcile after pod creation. Two reconcile cadences later, accuracy converges. See [`docs/benchmark-methodology.md`](docs/benchmark-methodology.md) for the convergence model.
47
+
- **Sub-second delta is kubelet, not the operator.** The -6 pod-second steady-state delta is the kubelet start-up lag between pod create and `state.running.startedAt`. The engine counts from `startedAt`, deliberately excluding image-pull and scheduling slop from quota.
48
+
49
+
### Why the numbers hold
50
+
51
+
The design choices that make those numbers cheap, in order of how much they matter:
52
+
53
+
1. **Accounting is derived, not stored.** `internal/accounting/` is a pure-Go function: given a pod set with `(Start, End, GPUs)`, compute consumed GPU-hours. There is no in-memory ledger to lose on a restart and no rolling counter to drift over weeks.
54
+
2. **`.status.consumedGpuHours` is overwritten, not accumulated.** Every reconcile writes the freshly computed value. A bug in one reconcile self-heals on the next.
55
+
3. **The reconciler does no math.** It translates pods to accounting input (see [`internal/controller/pod_conversion.go`](internal/controller/pod_conversion.go)) and writes status. All numeric logic lives in `internal/accounting/`, unit-tested to nanosecond precision against scripted timelines.
56
+
4. **GPU-less clusters use the same code path.** When `nvidia.com/gpu` is absent, set `spec.gpuResourceName: cpu` and the engine treats fractional CPU as fractional GPU. The e2e and bench suites both rely on this. See [`docs/accounting-model.md`](docs/accounting-model.md) for the bounded-error guarantee that applies when the kubelet garbage-collects a pod the operator never saw.
57
+
58
+
## Architecture
59
+
60
+

61
+
62
+
*Source: [`docs/diagrams/architecture.d2`](docs/diagrams/architecture.d2). Regenerate with `make diagrams`.*
63
+
64
+
Three packages, each independently testable:
65
+
66
+
- **`internal/accounting/`** is pure Go, k8s-free. Given a pod set with `(Start, End, GPUs)`, returns consumed GPU-hours, clamped remaining, and an over-quota flag. Unit-tested to nanosecond precision.
67
+
- **`internal/controller/`** is the reconciler. Translates `Pod` objects to accounting input (`pod_conversion.go` handles the `earliestContainerStart` / `latestContainerFinish` rules), patches `.status`, and toggles `Ready` / `QuotaExceeded` / `Degraded` conditions.
68
+
- **`internal/enforcement/`** dispatches one of three actions per `spec.enforcement.action`. `Evict` submits `policy/v1.Eviction`, `Pause` writes an annotation, `AlertOnly` records a Kubernetes Event. Grace periods are wall-clock.
69
+
70
+
The validating webhook (`internal/webhook/v1alpha1/`) rejects empty selectors, non-positive quotas, and unknown enforcement actions at admission time, so the controller never sees a malformed CR. Validating-only on purpose; see [`docs/limitations.md`](docs/limitations.md) for why.
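
The three admission checks named above can be illustrated with a standalone sketch. `BudgetSpec` and `Validate` are hypothetical and do not mirror the real `GPUWorkloadBudget` types or the webhook's handler signature; only the action names `Evict`, `Pause`, and `AlertOnly` come from this README.

```go
package main

import (
	"errors"
	"fmt"
)

// BudgetSpec is a hypothetical, flattened view of the fields the webhook
// checks. The real CRD layout may differ.
type BudgetSpec struct {
	Selector      map[string]string // label selector for tracked pods
	QuotaGPUHours float64           // rolling-window quota
	Action        string            // enforcement action
}

// Validate returns nil for a well-formed spec, or a joined error that lists
// every violated rule so the user can fix them all in one pass.
func Validate(s BudgetSpec) error {
	var errs []error
	if len(s.Selector) == 0 {
		errs = append(errs, errors.New("selector must not be empty"))
	}
	if s.QuotaGPUHours <= 0 {
		errs = append(errs, errors.New("quota must be positive"))
	}
	switch s.Action {
	case "Evict", "Pause", "AlertOnly":
		// known action
	default:
		errs = append(errs, fmt.Errorf("unknown enforcement action %q", s.Action))
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	bad := BudgetSpec{QuotaGPUHours: -1, Action: "Delete"}
	fmt.Println(Validate(bad))

	good := BudgetSpec{
		Selector:      map[string]string{"team": "ml"},
		QuotaGPUHours: 100,
		Action:        "Evict",
	}
	fmt.Println(Validate(good) == nil)
}
```

Rejecting these cases at admission means the reconciler can assume a non-empty selector and a positive quota without re-checking them.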
36
71
37
72
## Quickstart
38
73
39
-
### Install with Helm
74
+
Prerequisites: a Kubernetes cluster, `helm`, and `cert-manager` (the webhook needs TLS).
40
75
41
76
```sh
42
-
# cert-manager is a prerequisite for the validating webhook.
|`gwb_enforcement_actions_total{action, namespace, name}`| counter incremented per action fired |
125
-
|`gwb_tracked_pods{namespace, name}`| pods matched by selector at last reconcile |
126
-
|`gwb_accounting_accuracy_ratio{namespace, name}`| registered, currently always 0. The operator doesn't know ground truth; the bench harness computes this externally and writes it to `bench-results/…/results.json`|
141
+
|`gwb_remaining_gpu_hours{namespace, name}`|`quota - consumed`, clamped at zero|
142
+
|`gwb_enforcement_actions_total{action, namespace, name}`| counter, incremented per action fired |
143
+
|`gwb_tracked_pods{namespace, name}`| pods matched by the selector at last reconcile |
144
+
|`gwb_accounting_accuracy_ratio{namespace, name}`| registered but currently always zero. The operator does not know ground truth; the bench harness computes the ratio externally and writes it to `bench-results/.../results.json`|
The headline is `tracked_pods = 50` on both sides of the restart: when
174
-
the new operator pod comes up, controller-runtime's informer rebuilds
175
-
from the API-server view and every pod is re-observed. No state was
176
-
persisted and none was lost. The pre-restart 0.69 is reconcile cadence
177
-
against a fresh workload (first snapshot lands one reconcile after
178
-
workload launch). Once the operator has had a few ticks to re-sum
179
-
everyone's elapsed runtime, the post-restart reading converges to
180
-
0.996, essentially the same accuracy as `make bench`, which says the
181
-
restart cost nothing.
155
+
The accuracy formula and the reason benches run on kind rather than a cloud cluster live in [`docs/benchmark-methodology.md`](docs/benchmark-methodology.md). Scenario knobs (count, rate, runtime, gpus, observe-window) are documented in `hack/bench.sh`.
182
156
183
157
## Development
184
158
185
-
Prerequisites: Go 1.23+, Docker, kubectl, kind.
159
+
Prerequisites: Go 1.23+, Docker, `kubectl`, `kind`.
186
160
187
161
```sh
188
-
make test # unit + envtest suites
162
+
make test # unit and envtest suites
189
163
make test-e2e # Ginkgo against a fresh kind cluster
190
164
make lint # golangci-lint v2
191
-
make manifests generate # regen CRD + deepcopy after API changes
165
+
make manifests generate # regen CRD and deepcopy after API changes
192
166
make helm-lint # lint the Helm chart (requires helm)
167
+
make diagrams # re-render docs/media/architecture.svg from D2
193
168
make demo # regenerate docs/media/demo.gif (requires helm, node, ffmpeg)
194
169
```
195
170
196
-
The `config/` directory holds the kustomize sources; `make
197
-
build-installer` emits a single-file `dist/install.yaml` that's
198
-
functionally equivalent to the Helm chart for air-gapped clusters.
199
-
200
-
## Run on Azure (AKS)
201
-
202
-
The `deploy/aks/` directory ships a Bicep template + `parameters.example.json`
203
-
that provision an AKS cluster (1.31, 2× B2s, Azure CNI overlay) and a
204
-
Basic ACR with `adminUserEnabled: false` and an AcrPull role assignment
205
-
to the cluster's kubelet identity. `.github/workflows/aks-deploy.yml`
206
-
then builds the operator image on every push, pushes it to ACR, and
207
-
runs `helm upgrade --install` against the cluster via
208
-
`azure/setup-helm`. Intended for a student subscription. The
209
-
trade-offs (no GPU node pool, no monitoring addon, public API server)
210
-
are documented in [`deploy/aks/README.md`](deploy/aks/README.md).
171
+
`config/` holds the kustomize sources; `make build-installer` emits `dist/install.yaml`.
docs/media/ demo.gif (regenerable via `make demo`)
233
-
```
173
+
## Deploy on Azure (AKS)
234
174
235
-
## Status and limitations
175
+
[`deploy/aks/`](deploy/aks/) ships a Bicep template with `parameters.example.json` that provisions an AKS cluster (1.31, 2x B2s, Azure CNI overlay) and a Basic ACR with `adminUserEnabled: false` and an AcrPull role assignment to the cluster's kubelet identity. The workflow at [`.github/workflows/aks-deploy.yml`](.github/workflows/aks-deploy.yml) builds the operator image on every push, pushes it to ACR, and runs `helm upgrade --install` against the cluster via `azure/setup-helm`. Intended for a student subscription. The trade-offs (no GPU node pool, no monitoring addon, public API server) are spelled out in [`deploy/aks/README.md`](deploy/aks/README.md).
236
176
237
-
Alpha. Full list of known issues and scope boundaries:
238
-
[docs/limitations.md](docs/limitations.md). The TL;DR: