Commit d3c741b

docs: rewrite README with results-first layout, D2 diagram, recovery chart
- Reorder sections: results (scenario, measurements, key observations, why-the-numbers-hold) first, then architecture, quickstart, project structure, then the rest.
- Replace the mermaid block with a D2 source at docs/diagrams/architecture.d2 and a rendered SVG. Add a 'make diagrams' target so the SVG is regenerable.
- Add docs/media/restart-recovery.svg: a bar chart comparing pre-restart (0.688) and post-restart (0.996) accuracy against the steady-state baseline (0.96), with the recovery delta called out.
- Standard badges (Go, Kubernetes, Kubebuilder, Helm, Tests, E2E).
- Drop the License section; the LICENSE file remains and GitHub surfaces it in the About sidebar.
- Tighter headings and prose throughout; no em-dashes.
1 parent 6155194 commit d3c741b

5 files changed

Lines changed: 360 additions & 174 deletions

Makefile

Lines changed: 5 additions & 0 deletions
@@ -146,6 +146,11 @@ demo-record: ## Re-record demo.gif only. Assumes 'make demo' already brought a c
146146
rm -f hack/demo/demo.webm hack/demo/.palette.png
147147
rm -rf hack/demo/.record
148148

149+
.PHONY: diagrams
150+
diagrams: ## Render D2 sources under docs/diagrams/ into docs/media/. Requires d2.
151+
@command -v d2 >/dev/null 2>&1 || { echo "d2 not installed (brew install d2)"; exit 1; }
152+
d2 --layout=dagre --pad=20 docs/diagrams/architecture.d2 docs/media/architecture.svg
153+
149154
##@ Build
150155

151156
.PHONY: build

README.md

Lines changed: 111 additions & 174 deletions
@@ -1,45 +1,79 @@
11
# gpu-k8s-operator
22

3-
A namespaced Kubernetes operator that tracks cumulative GPU-hour
4-
consumption for pods matching a label selector against a rolling-window
5-
quota, and enforces that quota via eviction, pause annotations, or
6-
alert-only mode.
3+
Kubernetes operator that meters cumulative GPU-hours against a rolling-window quota and enforces it via eviction, pause annotations, or alert-only mode. The accounting engine is stateless: every reconcile recomputes from the live API-server view, so a restarted operator converges back to within 0.4% of perfect without replaying any persisted state.
4+
5+
![Go](https://img.shields.io/badge/Go-1.23%2B-00ADD8?logo=go&logoColor=white)
6+
![Kubernetes](https://img.shields.io/badge/Kubernetes-1.31%2B-326ce5?logo=kubernetes&logoColor=white)
7+
![Kubebuilder](https://img.shields.io/badge/Kubebuilder-v4-326ce5)
8+
![Helm](https://img.shields.io/badge/Helm-chart-0F1689?logo=helm&logoColor=white)
9+
[![Tests](https://github.com/zxuhan/gpu-k8s-operator/actions/workflows/test.yml/badge.svg)](https://github.com/zxuhan/gpu-k8s-operator/actions/workflows/test.yml)
10+
[![E2E](https://github.com/zxuhan/gpu-k8s-operator/actions/workflows/test-e2e.yml/badge.svg)](https://github.com/zxuhan/gpu-k8s-operator/actions/workflows/test-e2e.yml)
711

812
![demo](docs/media/demo.gif)
913

10-
*Live Grafana panels driven by the operator's Prometheus metrics.
11-
The gauge climbs past the quota line as 8 pods run against a quota
12-
deliberately crossable in seconds. Midway through, the operator pod is
13-
deleted; the pods-tracked stat holds at 8, showing state is rebuilt
14-
from the API-server view on restart rather than restored from a cache.
15-
Regenerate with `make demo`.*
16-
17-
## Why this matters
18-
19-
Cumulative GPU-hour budgets are the quota primitive AI-cloud platforms
20-
use to keep shared fleets fair and predictable: teams get an allowance
21-
per window, workloads stay fungible across nodes, and billing stays
22-
out of the hot path. This operator is a self-contained implementation
23-
of that control plane: rolling-window accounting, grace-period
24-
enforcement, stateless restart recovery, all driven by a single CRD
25-
(`GPUWorkloadBudget`, group `budget.zxuhan.dev`, version `v1alpha1`).
26-
Accounting is derived from the API-server view on every reconcile, so
27-
a restart recovers from cluster state rather than from a cache;
28-
[docs/accounting-model.md](docs/accounting-model.md) explains the
29-
bounded-error guarantees.
30-
31-
When `nvidia.com/gpu` is absent (e.g. a kind cluster) the accounting
32-
engine falls back to scaled CPU-second counting: set
33-
`spec.gpuResourceName: cpu` and the same control loop drives against
34-
`resource.Quantity` CPU requests. The e2e and bench suites both rely
35-
on this path.
14+
*Grafana panels driven by the operator's Prometheus metrics. The gauge climbs past the quota line as 8 pods run against a quota crossable in seconds. Midway through, the operator pod is deleted. The tracked-pods stat holds at 8: state is rebuilt from the API-server view on restart, not from a cache. Regenerate with `make demo`.*
15+
16+
## Results
17+
18+
Numbers below were measured on a kind cluster (M-series laptop, 2026-04-20) and are checked in under [`bench-results/2026-04-20/`](bench-results/2026-04-20/) and [`chaos-results/2026-04-20/`](chaos-results/2026-04-20/). The harness owns the numbers; the README quotes them. Regenerate with `make bench` and `make chaos`.
19+
20+
### Scenario
21+
22+
| Parameter | Steady-state bench | Chaos run |
23+
|---|---|---|
24+
| pods | 50 | 50 |
25+
| arrival rate | 10 pods/s | 10 pods/s |
26+
| per-pod runtime | 30s | 60s |
27+
| resource per pod | 0.1 CPU (simulated GPU) | 0.1 CPU |
28+
| snapshots | t = 45s | t = 15s, t = 120s |
29+
| event | none | operator pod deleted between snapshots |
30+
| cluster | kind on Docker | kind on Docker |
31+
32+
### Measurements
33+
34+
| Scenario | Tracked pods | Reported GPU-hours | Expected | Accuracy | Delta |
35+
|---|---|---|---|---|---|
36+
| Steady-state | 50 / 50 | 0.04000 | 0.04167 | **0.960** | -6 pod-seconds |
37+
| Pre-restart | 50 / 50 | 0.01200 | 0.01743 | 0.688 | -19 pod-seconds |
38+
| Post-restart | 50 / 50 | 0.08300 | 0.08333 | **0.996** | -1 pod-second |
39+
40+
![Restart recovery](docs/media/restart-recovery.svg)
41+
42+
### Key observations
43+
44+
- **State survives restart without persistence.** `tracked_pods` stays at 50/50 across the operator kill. The replacement pod's informer rebuilds from the API-server view and re-observes every pod. Nothing is replayed from a cache because nothing was ever cached.
45+
- **Post-restart accuracy matches steady-state.** 0.996 after recovery is at least as good as the 0.960 bench number from a clean run. The restart cost no measurable accuracy.
46+
- **The pre-restart 0.688 is a transient, not a regression.** It is a 15-second snapshot of a freshly launched workload, taken one reconcile after pod creation. Two reconcile cadences later, accuracy converges. See [`docs/benchmark-methodology.md`](docs/benchmark-methodology.md) for the convergence model.
47+
- **Sub-second delta is kubelet, not the operator.** The -6 pod-second steady-state delta is the kubelet start-up lag between pod create and `state.running.startedAt`. The engine counts from `startedAt`, deliberately excluding image-pull and scheduling slop from quota.
48+
49+
### Why the numbers hold
50+
51+
The design choices that make those numbers cheap, in order of how much they matter:
52+
53+
1. **Accounting is derived, not stored.** `internal/accounting/` is a pure-Go function: given a pod set with `(Start, End, GPUs)`, compute consumed GPU-hours. There is no in-memory ledger to lose on a restart and no rolling counter to drift over weeks.
54+
2. **`.status.consumedGpuHours` is overwritten, not accumulated.** Every reconcile writes the freshly computed value. A bug in one reconcile self-heals on the next.
55+
3. **The reconciler does no math.** It translates pods to accounting input (see [`internal/controller/pod_conversion.go`](internal/controller/pod_conversion.go)) and writes status. All numeric logic lives in `internal/accounting/`, unit-tested to nanosecond precision against scripted timelines.
56+
4. **GPU-less clusters use the same code path.** When `nvidia.com/gpu` is absent, set `spec.gpuResourceName: cpu` and the engine treats fractional CPU as fractional GPU. The e2e and bench suites both rely on this. See [`docs/accounting-model.md`](docs/accounting-model.md) for the bounded-error guarantee on what happens when the kubelet GC'es a pod the operator never saw.
57+
58+
## Architecture
59+
60+
![Architecture](docs/media/architecture.svg)
61+
62+
*Source: [`docs/diagrams/architecture.d2`](docs/diagrams/architecture.d2). Regenerate with `make diagrams`.*
63+
64+
Three packages, each independently testable:
65+
66+
- **`internal/accounting/`** is pure Go, k8s-free. Given a pod set with `(Start, End, GPUs)`, returns consumed GPU-hours, clamped remaining, and an over-quota flag. Unit-tested to nanosecond precision.
67+
- **`internal/controller/`** is the reconciler. Translates `Pod` objects to accounting input (`pod_conversion.go` handles the `earliestContainerStart` / `latestContainerFinish` rules), patches `.status`, and toggles `Ready` / `QuotaExceeded` / `Degraded` conditions.
68+
- **`internal/enforcement/`** dispatches one of three actions per `spec.enforcement.action`. `Evict` submits `policy/v1.Eviction`, `Pause` writes an annotation, `AlertOnly` records a Kubernetes Event. Grace periods are wall-clock.
69+
70+
The validating webhook (`internal/webhook/v1alpha1/`) rejects empty selectors, non-positive quotas, and unknown enforcement actions at admission time, so the controller never sees a malformed CR. Validating-only on purpose; see [`docs/limitations.md`](docs/limitations.md) for why.
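
The three admission checks can be sketched as a plain function. This is an illustration under assumed names, not the webhook in `internal/webhook/v1alpha1/`: `BudgetSpec` and its fields are hypothetical stand-ins for the real `GPUWorkloadBudget` spec, and only the action names `Evict`, `Pause`, and `AlertOnly` come from the README.

```go
package main

import (
	"errors"
	"fmt"
)

// BudgetSpec is a hypothetical, flattened stand-in for the CRD spec.
type BudgetSpec struct {
	SelectorLabels map[string]string
	QuotaGPUHours  float64
	Action         string
}

// validate collects every violation rather than stopping at the first,
// so a user sees all problems in one admission response.
func validate(s BudgetSpec) error {
	var errs []error
	if len(s.SelectorLabels) == 0 {
		errs = append(errs, errors.New("selector must not be empty"))
	}
	if s.QuotaGPUHours <= 0 {
		errs = append(errs, fmt.Errorf("quota must be positive, got %v", s.QuotaGPUHours))
	}
	switch s.Action {
	case "Evict", "Pause", "AlertOnly":
		// known enforcement action
	default:
		errs = append(errs, fmt.Errorf("unknown enforcement action %q", s.Action))
	}
	// errors.Join returns nil when no errors were collected.
	return errors.Join(errs...)
}

func main() {
	good := BudgetSpec{
		SelectorLabels: map[string]string{"team": "a"},
		QuotaGPUHours:  100,
		Action:         "Evict",
	}
	fmt.Println("valid spec:", validate(good))
	fmt.Println("empty spec:", validate(BudgetSpec{}))
}
```

Rejecting these cases at admission keeps the reconciler free of defensive checks for states that can never be stored.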
3671

3772
## Quickstart
3873

39-
### Install with Helm
74+
Prerequisites: a Kubernetes cluster, `helm`, and `cert-manager` (the webhook needs TLS).
4075

4176
```sh
42-
# cert-manager is a prerequisite for the validating webhook.
4377
helm upgrade --install cert-manager jetstack/cert-manager \
4478
--namespace cert-manager --create-namespace \
4579
--set crds.enabled=true
@@ -48,7 +82,7 @@ helm upgrade --install gwb-operator ./deploy/helm/gwb-operator \
4882
--namespace gpu-k8s-operator-system --create-namespace
4983
```
5084

51-
### Create a budget
85+
Create a budget:
5286

5387
```yaml
5488
apiVersion: budget.zxuhan.dev/v1alpha1
@@ -68,182 +102,85 @@ spec:
68102
gracePeriodSeconds: 60
69103
```
70104
71-
Then watch:
105+
Watch it move:
72106
73107
```sh
74108
kubectl get gwb team-a -w
75109
```
76110

77-
## How it works
78-
79-
```mermaid
80-
flowchart LR
81-
P[Pod events<br/>API server] --> R[Reconciler<br/>internal/controller]
82-
R --> A[Accounting engine<br/>internal/accounting<br/>pure Go, k8s-free]
83-
A --> R
84-
R --> S[.status write<br/>Ready / QuotaExceeded /<br/>Degraded conditions]
85-
R --> D[Enforcement dispatcher<br/>internal/enforcement]
86-
D --> E1[Evict<br/>policy/v1.Eviction]
87-
D --> E2[Pause<br/>annotation stamp]
88-
D --> E3[AlertOnly<br/>Event only]
89-
R --> M[Prometheus metrics<br/>consumed / remaining /<br/>tracked_pods]
90-
```
91-
92-
Three packages, separated so each is independently testable:
93-
94-
- **`internal/accounting/`**: pure Go, k8s-free. Given a set of pods
95-
with `(Start, End, GPUs)`, computes `consumedGpuHours`, clamps
96-
`remainingGpuHours` at zero, and flags `over`. Unit-tested to ~ns
97-
precision.
111+
For an air-gapped deployment, `make build-installer` emits a single-file `dist/install.yaml` functionally equivalent to the Helm chart.
98112

99-
- **`internal/controller/`**: the reconciler. Translates
100-
k8s `Pod` objects to accounting input (see `pod_conversion.go` for
101-
the `earliestContainerStart` / `latestContainerFinish` rules),
102-
writes `.status`, and patches `Ready`/`QuotaExceeded`/`Degraded`
103-
conditions.
113+
## Project structure
104114

105-
- **`internal/enforcement/`**: one implementation per
106-
`spec.enforcement.action`: `Evict` submits `policy/v1.Eviction`,
107-
`Pause` writes an annotation, `AlertOnly` emits an event and records
108-
the action in `lastEnforcementAt`. Grace periods are wall-clock.
109-
110-
The validating webhook rejects empty selectors, zero quotas, and
111-
unknown enforcement actions at admission time so the controller never
112-
sees malformed state. It is validating only. See
113-
[docs/limitations.md](docs/limitations.md#webhook) for why.
115+
```
116+
api/v1alpha1/ GPUWorkloadBudget types and validation markers
117+
cmd/main.go manager entry point
118+
config/ generated CRD, RBAC, webhook, manager manifests
119+
internal/accounting/ pure-Go budget math
120+
internal/controller/ reconciler and pod-status conversion
121+
internal/enforcement/ Evict / Pause / AlertOnly handlers
122+
internal/webhook/ validating webhook
123+
test/e2e/ Ginkgo end-to-end suite
124+
test/bench/ accuracy harness and gwb-bench CLI
125+
test/workload-generator/ gwb-workload CLI
126+
hack/ bench.sh, chaos.sh, demo/, helm-lint.sh, bench-stack/
127+
deploy/helm/gwb-operator/ Helm chart
128+
deploy/aks/ Bicep template and parameters for AKS
129+
docs/ accounting model, benchmark methodology, limitations
130+
docs/diagrams/ D2 sources (rendered into docs/media/ via `make diagrams`)
131+
docs/media/ rendered SVGs and demo.gif
132+
```
114133

115134
## Metrics
116135

117-
Exposed on an HTTPS endpoint guarded by Kubernetes TokenReview
118-
(enable the Helm ServiceMonitor to scrape from kube-prometheus-stack):
136+
Exposed on an HTTPS endpoint guarded by Kubernetes TokenReview. Enable the Helm ServiceMonitor to scrape from `kube-prometheus-stack`.
119137

120138
| Metric | Meaning |
121139
|---|---|
122140
| `gwb_consumed_gpu_hours{namespace, name}` | current `.status.consumedGpuHours` |
123-
| `gwb_remaining_gpu_hours{namespace, name}` | `quota consumed`, clamped |
124-
| `gwb_enforcement_actions_total{action, namespace, name}` | counter incremented per action fired |
125-
| `gwb_tracked_pods{namespace, name}` | pods matched by selector at last reconcile |
126-
| `gwb_accounting_accuracy_ratio{namespace, name}` | registered, currently always 0. The operator doesn't know ground truth; the bench harness computes this externally and writes it to `bench-results//results.json` |
141+
| `gwb_remaining_gpu_hours{namespace, name}` | `quota - consumed`, clamped at zero |
142+
| `gwb_enforcement_actions_total{action, namespace, name}` | counter, incremented per action fired |
143+
| `gwb_tracked_pods{namespace, name}` | pods matched by the selector at last reconcile |
144+
| `gwb_accounting_accuracy_ratio{namespace, name}` | registered but currently always zero. The operator does not know ground truth; the bench harness computes the ratio externally and writes it to `bench-results/.../results.json` |
127145

128-
Controller-runtime's default metrics (reconcile latency, workqueue
129-
depth, etc.) are served alongside.
146+
Controller-runtime's default metrics (reconcile latency, workqueue depth, retries) are served alongside.
130147

131148
## Benchmarks
132149

133150
```sh
134-
make bench # one scenario bench-results/YYYY-MM-DD/SUMMARY.md
135-
make chaos # restart-correctness scenario → chaos-results/YYYY-MM-DD/SUMMARY.md
151+
make bench # one accuracy scenario into bench-results/YYYY-MM-DD/SUMMARY.md
152+
make chaos # restart-correctness run into chaos-results/YYYY-MM-DD/SUMMARY.md
136153
```
137154

138-
The accuracy formula and the reason benches run on kind are in
139-
[docs/benchmark-methodology.md](docs/benchmark-methodology.md).
140-
141-
### Measured numbers (kind, M-series laptop, 2026-04-20)
142-
143-
Recorded in-repo under [`bench-results/2026-04-20/`](bench-results/2026-04-20/)
144-
and [`chaos-results/2026-04-20/`](chaos-results/2026-04-20/). Regenerate
145-
any time with `make bench` / `make chaos`. The harness owns the
146-
numbers, the README just quotes them.
147-
148-
**Steady-state accuracy.** 50 busybox pods at 10/s, 30s runtime each,
149-
0.1 CPU "GPU" per pod, snapshot at t=45s (all pods terminated):
150-
151-
| Metric | Value |
152-
|---|---|
153-
| reported GPU-hours | 0.04000 |
154-
| expected GPU-hours | 0.04167 |
155-
| **accuracy ratio** | **0.96** |
156-
| delta | −6 pod-seconds |
157-
| tracked pods | 50 |
158-
159-
The −6-pod-second delta is the kubelet start-up lag: the accounting
160-
engine counts from `state.running.startedAt`, which kubelet stamps a
161-
fraction of a second after pod-create. See
162-
[docs/accounting-model.md](docs/accounting-model.md) for the formula.
163-
164-
**Restart correctness.** Same workload, runtime bumped to 60s so pods
165-
are still Running when we snapshot. Operator pod deleted at t=15s;
166-
post snapshot at t=120s (`CHAOS_POST_SECONDS=120`):
167-
168-
| Phase | Elapsed | Tracked pods | Reported | Expected | Accuracy |
169-
|---|---|---|---|---|---|
170-
| pre-restart | 15s | **50 / 50** | 0.0120 | 0.0174 | 0.69 |
171-
| post-restart | 120s | **50 / 50** | 0.0830 | 0.0833 | **0.996** |
172-
173-
The headline is `tracked_pods = 50` on both sides of the restart: when
174-
the new operator pod comes up, controller-runtime's informer rebuilds
175-
from the API-server view and every pod is re-observed. No state was
176-
persisted and none was lost. The pre-restart 0.69 is reconcile cadence
177-
against a fresh workload (first snapshot lands one reconcile after
178-
workload launch). Once the operator has had a few ticks to re-sum
179-
everyone's elapsed runtime, the post-restart reading converges to
180-
0.996, essentially the same accuracy as `make bench`, which says the
181-
restart cost nothing.
155+
The accuracy formula and the reason benches run on kind rather than a cloud cluster live in [`docs/benchmark-methodology.md`](docs/benchmark-methodology.md). Scenario knobs (count, rate, runtime, gpus, observe-window) are documented in `hack/bench.sh`.
182156

183157
## Development
184158

185-
Prerequisites: Go 1.23+, Docker, kubectl, kind.
159+
Prerequisites: Go 1.23+, Docker, `kubectl`, `kind`.
186160

187161
```sh
188-
make test # unit + envtest suites
162+
make test # unit and envtest suites
189163
make test-e2e # Ginkgo against a fresh kind cluster
190164
make lint # golangci-lint v2
191-
make manifests generate # regen CRD + deepcopy after API changes
165+
make manifests generate # regen CRD and deepcopy after API changes
192166
make helm-lint # lint the Helm chart (requires helm)
167+
make diagrams # re-render docs/media/architecture.svg from D2
193168
make demo # regenerate docs/media/demo.gif (requires helm, node, ffmpeg)
194169
```
195170

196-
The `config/` directory holds the kustomize sources; `make
197-
build-installer` emits a single-file `dist/install.yaml` that's
198-
functionally equivalent to the Helm chart for air-gapped clusters.
199-
200-
## Run on Azure (AKS)
201-
202-
The `deploy/aks/` directory ships a Bicep template + `parameters.example.json`
203-
that provision an AKS cluster (1.31, 2× B2s, Azure CNI overlay) and a
204-
Basic ACR with `adminUserEnabled: false` and an AcrPull role assignment
205-
to the cluster's kubelet identity. `.github/workflows/aks-deploy.yml`
206-
then builds the operator image on every push, pushes it to ACR, and
207-
runs `helm upgrade --install` against the cluster via
208-
`azure/setup-helm`. Intended for a student subscription. The
209-
trade-offs (no GPU node pool, no monitoring addon, public API server)
210-
are documented in [`deploy/aks/README.md`](deploy/aks/README.md).
171+
`config/` holds the kustomize sources; `make build-installer` emits `dist/install.yaml`.
211172

212-
Required repo secrets: `AZURE_CREDENTIALS`, `AZURE_RESOURCE_GROUP`,
213-
`AKS_CLUSTER_NAME`, `ACR_NAME`.
214-
215-
## Repository layout
216-
217-
```
218-
api/v1alpha1/ GPUWorkloadBudget types + validation markers
219-
cmd/main.go Manager entry point
220-
config/ Generated CRD, RBAC, webhook, manager manifests
221-
internal/accounting/ Pure-Go budget math
222-
internal/controller/ Reconciler + pod-status conversion
223-
internal/enforcement/ Evict / Pause / AlertOnly handlers
224-
internal/webhook/ Validating webhook
225-
test/e2e/ Ginkgo e2e suite
226-
test/bench/ Accuracy harness + gwb-bench CLI
227-
test/workload-generator/ gwb-workload CLI
228-
hack/ bench.sh, chaos.sh, demo/, helm-lint.sh, bench-stack/
229-
deploy/helm/gwb-operator/ Helm chart
230-
deploy/aks/ Bicep + parameters for AKS
231-
docs/ accounting-model, benchmark-methodology, limitations
232-
docs/media/ demo.gif (regenerable via `make demo`)
233-
```
173+
## Deploy on Azure (AKS)
234174

235-
## Status and limitations
175+
[`deploy/aks/`](deploy/aks/) ships a Bicep template with `parameters.example.json` that provisions an AKS cluster (1.31, 2x B2s, Azure CNI overlay) and a Basic ACR with `adminUserEnabled: false` and an AcrPull role assignment to the cluster's kubelet identity. The workflow at [`.github/workflows/aks-deploy.yml`](.github/workflows/aks-deploy.yml) builds the operator image on every push, pushes it to ACR, and runs `helm upgrade --install` against the cluster via `azure/setup-helm`. Intended for a student subscription. The trade-offs (no GPU node pool, no monitoring addon, public API server) are spelled out in [`deploy/aks/README.md`](deploy/aks/README.md).
236176

237-
Alpha. Full list of known issues and scope boundaries:
238-
[docs/limitations.md](docs/limitations.md). The TL;DR:
177+
Required repository secrets: `AZURE_CREDENTIALS`, `AZURE_RESOURCE_GROUP`, `AKS_CLUSTER_NAME`, `ACR_NAME`.
239178

240-
- Single-budget bench only; overlapping selectors work but aren't
241-
measured.
242-
- Enforcement respects PDBs: a protected workload can stay
243-
over-quota until the PDB changes.
244-
- Benches run on kind with simulated CPU-as-GPU; real NVIDIA
245-
device-plugin behaviour is not exercised.
179+
## Limitations
246180

247-
## License
181+
Alpha. Full list at [`docs/limitations.md`](docs/limitations.md). The short version:
248182

249-
Apache 2.0. See [`LICENSE`](LICENSE).
183+
- Single-budget bench only. Overlapping selectors work in code but are not measured.
184+
- Enforcement respects PodDisruptionBudgets. A protected workload can stay over-quota until the PDB changes.
185+
- Benches run on kind with simulated CPU-as-GPU. Real NVIDIA device-plugin behaviour is not exercised.
186+
- No long-running cluster proof. Tens of minutes under bench and chaos, not weeks under production load.
