zxuhan
diff --git a/Makefile b/Makefile
Lines changed: 5 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 111 additions & 174 deletions
@@ -146,6 +146,11 @@ demo-record: ## Re-record demo.gif only. Assumes 'make demo' already brought a c
 	rm -f hack/demo/demo.webm hack/demo/.palette.png
 	rm -rf hack/demo/.record
 
+.PHONY: diagrams
+diagrams: ## Render D2 sources under docs/diagrams/ into docs/media/. Requires d2.
+	@command -v d2 >/dev/null 2>&1 || { echo "d2 not installed (brew install d2)"; exit 1; }
+	d2 --layout=dagre --pad=20 docs/diagrams/architecture.d2 docs/media/architecture.svg
+
 ##@ Build
 
 .PHONY: build
 
@@ -1,45 +1,79 @@
 # gpu-k8s-operator
 
-A namespaced Kubernetes operator that tracks cumulative GPU-hour
-consumption for pods matching a label selector against a rolling-window
-quota, and enforces that quota via eviction, pause annotations, or
-alert-only mode.
+Kubernetes operator that meters cumulative GPU-hours against a rolling-window quota and enforces it via eviction, pause annotations, or alert-only mode. The accounting engine is stateless: every reconcile recomputes from the live API-server view, so a restarted operator converges back to an accuracy ratio of 0.996 (within 0.4% of the expected GPU-hours) without replaying any persisted state.
+
+![Go](https://img.shields.io/badge/Go-1.23%2B-00ADD8?logo=go&logoColor=white)
+![Kubernetes](https://img.shields.io/badge/Kubernetes-1.31%2B-326ce5?logo=kubernetes&logoColor=white)
+![Kubebuilder](https://img.shields.io/badge/Kubebuilder-v4-326ce5)
+![Helm](https://img.shields.io/badge/Helm-chart-0F1689?logo=helm&logoColor=white)
+[![Tests](https://github.com/zxuhan/gpu-k8s-operator/actions/workflows/test.yml/badge.svg)](https://github.com/zxuhan/gpu-k8s-operator/actions/workflows/test.yml)
+[![E2E](https://github.com/zxuhan/gpu-k8s-operator/actions/workflows/test-e2e.yml/badge.svg)](https://github.com/zxuhan/gpu-k8s-operator/actions/workflows/test-e2e.yml)
 
 ![demo](docs/media/demo.gif)
 
-*Live Grafana panels driven by the operator's Prometheus metrics.
-The gauge climbs past the quota line as 8 pods run against a quota
-deliberately crossable in seconds. Midway through, the operator pod is
-deleted; the pods-tracked stat holds at 8, showing state is rebuilt
-from the API-server view on restart rather than restored from a cache.
-Regenerate with `make demo`.*
-
-## Why this matters
-
-Cumulative GPU-hour budgets are the quota primitive AI-cloud platforms
-use to keep shared fleets fair and predictable: teams get an allowance
-per window, workloads stay fungible across nodes, and billing stays
-out of the hot path. This operator is a self-contained implementation
-of that control plane: rolling-window accounting, grace-period
-enforcement, stateless restart recovery, all driven by a single CRD
-(`GPUWorkloadBudget`, group `budget.zxuhan.dev`, version `v1alpha1`).
-Accounting is derived from the API-server view on every reconcile, so
-a restart recovers from cluster state rather than from a cache;
-[docs/accounting-model.md](docs/accounting-model.md) explains the
-bounded-error guarantees.
-
-When `nvidia.com/gpu` is absent (e.g. a kind cluster) the accounting
-engine falls back to scaled CPU-second counting: set
-`spec.gpuResourceName: cpu` and the same control loop drives against
-`resource.Quantity` CPU requests. The e2e and bench suites both rely
-on this path.
+*Grafana panels driven by the operator's Prometheus metrics. The gauge climbs past the quota line as 8 pods run against a quota crossable in seconds. Midway through, the operator pod is deleted. The tracked-pods stat holds at 8: state is rebuilt from the API-server view on restart, not from a cache. Regenerate with `make demo`.*
+
+## Results
+
+Numbers below were measured on a kind cluster (M-series laptop, 2026-04-20) and are checked in under [`bench-results/2026-04-20/`](bench-results/2026-04-20/) and [`chaos-results/2026-04-20/`](chaos-results/2026-04-20/). The harness owns the numbers; the README quotes them. Regenerate with `make bench` and `make chaos`.
+
+### Scenario
+
+| Parameter | Steady-state bench | Chaos run |
+|---|---|---|
+| pods | 50 | 50 |
+| arrival rate | 10 pods/s | 10 pods/s |
+| per-pod runtime | 30s | 60s |
+| resource per pod | 0.1 CPU (simulated GPU) | 0.1 CPU |
+| snapshots | t = 45s | t = 15s, t = 120s |
+| event | none | operator pod deleted between snapshots |
+| cluster | kind on Docker | kind on Docker |
+
+### Measurements
+
+| Scenario | Tracked pods | Reported GPU-hours | Expected | Accuracy | Delta |
+|---|---|---|---|---|---|
+| Steady-state | 50 / 50 | 0.04000 | 0.04167 | **0.960** | -6 pod-seconds |
+| Pre-restart  | 50 / 50 | 0.01200 | 0.01743 | 0.688 | -19 pod-seconds |
+| Post-restart | 50 / 50 | 0.08300 | 0.08333 | **0.996** | -1 pod-second |
+
+![Restart recovery](docs/media/restart-recovery.svg)
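
The Expected column above can be reproduced from the scenario table. The following is a minimal sketch in Go, assuming that pod `i` starts at `i / rate` seconds, accrues runtime until it finishes or the snapshot is taken, and that expected GPU-hours is the summed pod-seconds times the per-pod resource, divided by 3600. `expectedGPUHours` and `approxEqual` are hypothetical names that do not come from the repository; the authoritative formula is the one in `docs/benchmark-methodology.md`.

```go
package main

import "fmt"

// expectedGPUHours sums each pod's elapsed runtime at snapshot time and
// converts the total to GPU-hours. Assumption: pod i starts at i/rate seconds
// and runs for at most runtime seconds. This mirrors the scenario table, not
// the bench harness code.
func expectedGPUHours(pods int, rate, runtime, snapshot, gpusPerPod float64) float64 {
	var podSeconds float64
	for i := 0; i < pods; i++ {
		start := float64(i) / rate
		elapsed := snapshot - start
		if elapsed < 0 {
			elapsed = 0 // pod has not started yet
		}
		if elapsed > runtime {
			elapsed = runtime // pod already finished
		}
		podSeconds += elapsed
	}
	return podSeconds * gpusPerPod / 3600
}

// approxEqual reports whether a and b differ by at most tol.
func approxEqual(a, b, tol float64) bool {
	d := a - b
	if d < 0 {
		d = -d
	}
	return d <= tol
}

func main() {
	steady := expectedGPUHours(50, 10, 30, 45, 0.1)
	pre := expectedGPUHours(50, 10, 60, 15, 0.1)
	post := expectedGPUHours(50, 10, 60, 120, 0.1)

	// Accuracy is reported / expected, using the Reported column of the table.
	fmt.Printf("steady-state expected=%.5f accuracy=%.3f\n", steady, 0.04000/steady)
	fmt.Printf("pre-restart  expected=%.5f accuracy=%.3f\n", pre, 0.01200/pre)
	fmt.Printf("post-restart expected=%.5f accuracy=%.3f\n", post, 0.08300/post)
}
```

Under these assumptions the three scenarios come out at roughly 0.04167, 0.01743, and 0.08333 GPU-hours, and dividing the Reported column by them gives the 0.960, 0.688, and 0.996 ratios in the table.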
+
+### Key observations
+
+- **State survives restart without persistence.** `tracked_pods` stays at 50/50 across the operator kill. The replacement pod's informer rebuilds from the API-server view and re-observes every pod. Nothing is replayed from a cache because nothing was ever cached.
+- **Post-restart accuracy matches steady-state.** 0.996 after recovery is at least as good as the 0.960 steady-state number from a clean run, so the restart cost no measurable accuracy.
+- **The pre-restart 0.688 is a transient, not a regression.** It is a 15-second snapshot of a freshly launched workload, taken one reconcile after pod creation. Two reconcile cadences later, accuracy converges. See [`docs/benchmark-methodology.md`](docs/benchmark-methodology.md) for the convergence model.
+- **The steady-state delta is kubelet lag, not operator error.** The -6 pod-second steady-state delta is the kubelet start-up lag, a fraction of a second per pod, between pod create and `state.running.startedAt`. The engine counts from `startedAt`, deliberately excluding image-pull and scheduling slop from quota.
+
+### Why the numbers hold
+
+The design choices that make those numbers cheap, in order of how much they matter:
+
+1. **Accounting is derived, not stored.** `internal/accounting/` is a pure-Go function: given a pod set with `(Start, End, GPUs)`, compute consumed GPU-hours. There is no in-memory ledger to lose on a restart and no rolling counter to drift over weeks.
+2. **`.status.consumedGpuHours` is overwritten, not accumulated.** Every reconcile writes the freshly computed value. A bug in one reconcile self-heals on the next.
+3. **The reconciler does no math.** It translates pods to accounting input (see [`internal/controller/pod_conversion.go`](internal/controller/pod_conversion.go)) and writes status. All numeric logic lives in `internal/accounting/`, unit-tested to nanosecond precision against scripted timelines.
+4. **GPU-less clusters use the same code path.** When `nvidia.com/gpu` is absent, set `spec.gpuResourceName: cpu` and the engine treats fractional CPU as fractional GPU. The e2e and bench suites both rely on this. See [`docs/accounting-model.md`](docs/accounting-model.md) for the bounded-error guarantee that applies when the kubelet garbage-collects a pod the operator never saw.
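
To make "derived, not stored" concrete, here is a minimal sketch of what a pure accounting function over `(Start, End, GPUs)` could look like. It is not the code in `internal/accounting/`. The type and function names (`PodUsage`, `Result`, `Compute`, `demoResult`), the treatment of a zero `End` as "still running", the clipping of intervals to the rolling window, and the strict `consumed > quota` comparison are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// PodUsage is a hypothetical input record for one pod.
type PodUsage struct {
	Start time.Time
	End   time.Time // zero value is taken to mean "still running"
	GPUs  float64
}

// Result is a hypothetical output: consumed, clamped remaining, over-quota flag.
type Result struct {
	ConsumedGPUHours  float64
	RemainingGPUHours float64
	Over              bool
}

// Compute derives usage from scratch on every call. It keeps no state between
// calls, so there is nothing to lose on a restart.
func Compute(pods []PodUsage, now time.Time, window time.Duration, quota float64) Result {
	windowStart := now.Add(-window)
	var consumed float64
	for _, p := range pods {
		start, end := p.Start, p.End
		if end.IsZero() || end.After(now) {
			end = now // running pods accrue up to the current reconcile
		}
		if start.Before(windowStart) {
			start = windowStart // ignore runtime older than the rolling window
		}
		if !end.After(start) {
			continue // pod lies entirely outside the window
		}
		consumed += end.Sub(start).Hours() * p.GPUs
	}
	remaining := quota - consumed
	if remaining < 0 {
		remaining = 0 // remaining is clamped at zero
	}
	return Result{ConsumedGPUHours: consumed, RemainingGPUHours: remaining, Over: consumed > quota}
}

// demoResult builds a small scripted timeline against a 24h window and a
// quota of 2.5 GPU-hours.
func demoResult() Result {
	now := time.Date(2026, 4, 20, 12, 0, 0, 0, time.UTC)
	pods := []PodUsage{
		{Start: now.Add(-25 * time.Hour), End: now.Add(-23 * time.Hour), GPUs: 1},     // 1h inside the window
		{Start: now.Add(-45 * time.Minute), End: now.Add(-15 * time.Minute), GPUs: 2}, // 30 min, finished
		{Start: now.Add(-15 * time.Minute), GPUs: 4},                                  // 15 min, still running
	}
	return Compute(pods, now, 24*time.Hour, 2.5)
}

func main() {
	r := demoResult()
	fmt.Printf("consumed=%.2f remaining=%.2f over=%v\n", r.ConsumedGPUHours, r.RemainingGPUHours, r.Over)
}
```

Because the result depends only on the inputs, calling `Compute` again after a restart with the same pod set yields the same value, which is the property the list above relies on.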
+
+## Architecture
+
+![Architecture](docs/media/architecture.svg)
+
+*Source: [`docs/diagrams/architecture.d2`](docs/diagrams/architecture.d2). Regenerate with `make diagrams`.*
+
+Three packages, each independently testable:
+
+- **`internal/accounting/`** is pure Go, k8s-free. Given a pod set with `(Start, End, GPUs)`, returns consumed GPU-hours, clamped remaining, and an over-quota flag. Unit-tested to nanosecond precision.
+- **`internal/controller/`** is the reconciler. Translates `Pod` objects to accounting input (`pod_conversion.go` handles the `earliestContainerStart` / `latestContainerFinish` rules), patches `.status`, and toggles `Ready` / `QuotaExceeded` / `Degraded` conditions.
+- **`internal/enforcement/`** dispatches one of three actions per `spec.enforcement.action`. `Evict` submits `policy/v1.Eviction`, `Pause` writes an annotation, `AlertOnly` records a Kubernetes Event. Grace periods are wall-clock.
+
+The validating webhook (`internal/webhook/v1alpha1/`) rejects empty selectors, non-positive quotas, and unknown enforcement actions at admission time, so the controller never sees a malformed CR. Validating-only on purpose; see [`docs/limitations.md`](docs/limitations.md) for why.
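
The three admission checks described above can be sketched as a plain validation function. This is a hedged illustration only: `BudgetSpec` is a hypothetical, flattened stand-in for the real `GPUWorkloadBudget` spec, its field names are assumptions, and the actual webhook in `internal/webhook/v1alpha1/` may structure its checks and error messages differently.

```go
package main

import (
	"errors"
	"fmt"
)

// BudgetSpec is a hypothetical stand-in for the fields the webhook inspects.
type BudgetSpec struct {
	SelectorLabels map[string]string
	QuotaGPUHours  float64
	Action         string
}

// allowedActions lists the three enforcement actions named in the README.
var allowedActions = map[string]bool{"Evict": true, "Pause": true, "AlertOnly": true}

// Validate collects every violation so one rejection reports all of them.
// errors.Join returns nil when no errors were collected.
func Validate(s BudgetSpec) error {
	var errs []error
	if len(s.SelectorLabels) == 0 {
		errs = append(errs, errors.New("selector must not be empty"))
	}
	if s.QuotaGPUHours <= 0 {
		errs = append(errs, errors.New("quota must be positive"))
	}
	if !allowedActions[s.Action] {
		errs = append(errs, fmt.Errorf("unknown enforcement action %q", s.Action))
	}
	return errors.Join(errs...)
}

func main() {
	ok := BudgetSpec{SelectorLabels: map[string]string{"team": "a"}, QuotaGPUHours: 10, Action: "Evict"}
	bad := BudgetSpec{QuotaGPUHours: -1, Action: "Delete"}
	fmt.Println("valid spec error:", Validate(ok))
	fmt.Println("invalid spec error:", Validate(bad))
}
```

Rejecting these cases at admission means the reconciler can assume a well-formed spec instead of re-checking it on every pass.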
 
 ## Quickstart
 
-### Install with Helm
+Prerequisites: a Kubernetes cluster, `helm`, and `cert-manager` (the webhook needs TLS).
 
 ```sh
-# cert-manager is a prerequisite for the validating webhook.
 helm upgrade --install cert-manager jetstack/cert-manager \
   --namespace cert-manager --create-namespace \
   --set crds.enabled=true
@@ -48,7 +82,7 @@ helm upgrade --install gwb-operator ./deploy/helm/gwb-operator \
   --namespace gpu-k8s-operator-system --create-namespace
 ```
 
-### Create a budget
+Create a budget:
 
 ```yaml
 apiVersion: budget.zxuhan.dev/v1alpha1
@@ -68,182 +102,85 @@ spec:
     gracePeriodSeconds: 60
 ```
 
-Then watch:
+Watch it move:
 
 ```sh
 kubectl get gwb team-a -w
 ```
 
-## How it works
-
-```mermaid
-flowchart LR
-    P[Pod events<br/>API server] --> R[Reconciler<br/>internal/controller]
-    R --> A[Accounting engine<br/>internal/accounting<br/>pure Go, k8s-free]
-    A --> R
-    R --> S[.status write<br/>Ready / QuotaExceeded /<br/>Degraded conditions]
-    R --> D[Enforcement dispatcher<br/>internal/enforcement]
-    D --> E1[Evict<br/>policy/v1.Eviction]
-    D --> E2[Pause<br/>annotation stamp]
-    D --> E3[AlertOnly<br/>Event only]
-    R --> M[Prometheus metrics<br/>consumed / remaining /<br/>tracked_pods]
-```
-
-Three packages, separated so each is independently testable:
-
-- **`internal/accounting/`**: pure Go, k8s-free. Given a set of pods
-  with `(Start, End, GPUs)`, computes `consumedGpuHours`, clamps
-  `remainingGpuHours` at zero, and flags `over`. Unit-tested to ~ns
-  precision.
+For an air-gapped deployment, `make build-installer` emits a single-file `dist/install.yaml` functionally equivalent to the Helm chart.
 
-- **`internal/controller/`**: the reconciler. Translates
-  k8s `Pod` objects to accounting input (see `pod_conversion.go` for
-  the `earliestContainerStart` / `latestContainerFinish` rules),
-  writes `.status`, and patches `Ready`/`QuotaExceeded`/`Degraded`
-  conditions.
+## Project structure
 
-- **`internal/enforcement/`**: one implementation per
-  `spec.enforcement.action`: `Evict` submits `policy/v1.Eviction`,
-  `Pause` writes an annotation, `AlertOnly` emits an event and records
-  the action in `lastEnforcementAt`. Grace periods are wall-clock.
-
-The validating webhook rejects empty selectors, zero quotas, and
-unknown enforcement actions at admission time so the controller never
-sees malformed state. It is validating only. See
-[docs/limitations.md](docs/limitations.md#webhook) for why.
+```
+api/v1alpha1/             GPUWorkloadBudget types and validation markers
+cmd/main.go               manager entry point
+config/                   generated CRD, RBAC, webhook, manager manifests
+internal/accounting/      pure-Go budget math
+internal/controller/      reconciler and pod-status conversion
+internal/enforcement/     Evict / Pause / AlertOnly handlers
+internal/webhook/         validating webhook
+test/e2e/                 Ginkgo end-to-end suite
+test/bench/               accuracy harness and gwb-bench CLI
+test/workload-generator/  gwb-workload CLI
+hack/                     bench.sh, chaos.sh, demo/, helm-lint.sh, bench-stack/
+deploy/helm/gwb-operator/ Helm chart
+deploy/aks/               Bicep template and parameters for AKS
+docs/                     accounting model, benchmark methodology, limitations
+docs/diagrams/            D2 sources (rendered into docs/media/ via `make diagrams`)
+docs/media/               rendered SVGs and demo.gif
+```
 
 ## Metrics
 
-Exposed on an HTTPS endpoint guarded by Kubernetes TokenReview
-(enable the Helm ServiceMonitor to scrape from kube-prometheus-stack):
+Exposed on an HTTPS endpoint guarded by Kubernetes TokenReview. Enable the Helm ServiceMonitor to scrape from `kube-prometheus-stack`.
 
 | Metric | Meaning |
 |---|---|
 | `gwb_consumed_gpu_hours{namespace, name}` | current `.status.consumedGpuHours` |
-| `gwb_remaining_gpu_hours{namespace, name}` | `quota − consumed`, clamped |
-| `gwb_enforcement_actions_total{action, namespace, name}` | counter incremented per action fired |
-| `gwb_tracked_pods{namespace, name}` | pods matched by selector at last reconcile |
-| `gwb_accounting_accuracy_ratio{namespace, name}` | registered, currently always 0. The operator doesn't know ground truth; the bench harness computes this externally and writes it to `bench-results/…/results.json` |
+| `gwb_remaining_gpu_hours{namespace, name}` | `quota - consumed`, clamped at zero |
+| `gwb_enforcement_actions_total{action, namespace, name}` | counter, incremented per action fired |
+| `gwb_tracked_pods{namespace, name}` | pods matched by the selector at last reconcile |
+| `gwb_accounting_accuracy_ratio{namespace, name}` | registered but currently always zero. The operator does not know ground truth; the bench harness computes the ratio externally and writes it to `bench-results/.../results.json` |
 
-Controller-runtime's default metrics (reconcile latency, workqueue
-depth, etc.) are served alongside.
+Controller-runtime's default metrics (reconcile latency, workqueue depth, retries) are served alongside.
 
 ## Benchmarks
 
 ```sh
-make bench    # one scenario → bench-results/YYYY-MM-DD/SUMMARY.md
-make chaos    # restart-correctness scenario → chaos-results/YYYY-MM-DD/SUMMARY.md
+make bench    # one accuracy scenario into bench-results/YYYY-MM-DD/SUMMARY.md
+make chaos    # restart-correctness run into chaos-results/YYYY-MM-DD/SUMMARY.md
 ```
 
-The accuracy formula and the reason benches run on kind are in
-[docs/benchmark-methodology.md](docs/benchmark-methodology.md).
-
-### Measured numbers (kind, M-series laptop, 2026-04-20)
-
-Recorded in-repo under [`bench-results/2026-04-20/`](bench-results/2026-04-20/)
-and [`chaos-results/2026-04-20/`](chaos-results/2026-04-20/). Regenerate
-any time with `make bench` / `make chaos`. The harness owns the
-numbers, the README just quotes them.
-
-**Steady-state accuracy.** 50 busybox pods at 10/s, 30s runtime each,
-0.1 CPU "GPU" per pod, snapshot at t=45s (all pods terminated):
-
-| Metric | Value |
-|---|---|
-| reported GPU-hours | 0.04000 |
-| expected GPU-hours | 0.04167 |
-| **accuracy ratio** | **0.96** |
-| delta | −6 pod-seconds |
-| tracked pods | 50 |
-
-The −6-pod-second delta is the kubelet start-up lag: the accounting
-engine counts from `state.running.startedAt`, which kubelet stamps a
-fraction of a second after pod-create. See
-[docs/accounting-model.md](docs/accounting-model.md) for the formula.
-
-**Restart correctness.** Same workload, runtime bumped to 60s so pods
-are still Running when we snapshot. Operator pod deleted at t=15s;
-post snapshot at t=120s (`CHAOS_POST_SECONDS=120`):
-
-| Phase | Elapsed | Tracked pods | Reported | Expected | Accuracy |
-|---|---|---|---|---|---|
-| pre-restart  | 15s  | **50 / 50** | 0.0120 | 0.0174 | 0.69 |
-| post-restart | 120s | **50 / 50** | 0.0830 | 0.0833 | **0.996** |
-
-The headline is `tracked_pods = 50` on both sides of the restart: when
-the new operator pod comes up, controller-runtime's informer rebuilds
-from the API-server view and every pod is re-observed. No state was
-persisted and none was lost. The pre-restart 0.69 is reconcile cadence
-against a fresh workload (first snapshot lands one reconcile after
-workload launch). Once the operator has had a few ticks to re-sum
-everyone's elapsed runtime, the post-restart reading converges to
-0.996, essentially the same accuracy as `make bench`, which says the
-restart cost nothing.
+The accuracy formula and the reason benches run on kind rather than a cloud cluster live in [`docs/benchmark-methodology.md`](docs/benchmark-methodology.md). Scenario knobs (count, rate, runtime, gpus, observe-window) are documented in `hack/bench.sh`.
 
 ## Development
 
-Prerequisites: Go 1.23+, Docker, kubectl, kind.
+Prerequisites: Go 1.23+, Docker, `kubectl`, `kind`.
 
 ```sh
-make test                # unit + envtest suites
+make test                # unit and envtest suites
 make test-e2e            # Ginkgo against a fresh kind cluster
 make lint                # golangci-lint v2
-make manifests generate  # regen CRD + deepcopy after API changes
+make manifests generate  # regen CRD and deepcopy after API changes
 make helm-lint           # lint the Helm chart (requires helm)
+make diagrams            # re-render docs/media/architecture.svg from D2
 make demo                # regenerate docs/media/demo.gif (requires helm, node, ffmpeg)
 ```
 
-The `config/` directory holds the kustomize sources; `make
-build-installer` emits a single-file `dist/install.yaml` that's
-functionally equivalent to the Helm chart for air-gapped clusters.
-
-## Run on Azure (AKS)
-
-The `deploy/aks/` directory ships a Bicep template + `parameters.example.json`
-that provision an AKS cluster (1.31, 2× B2s, Azure CNI overlay) and a
-Basic ACR with `adminUserEnabled: false` and an AcrPull role assignment
-to the cluster's kubelet identity. `.github/workflows/aks-deploy.yml`
-then builds the operator image on every push, pushes it to ACR, and
-runs `helm upgrade --install` against the cluster via
-`azure/setup-helm`. Intended for a student subscription. The
-trade-offs (no GPU node pool, no monitoring addon, public API server)
-are documented in [`deploy/aks/README.md`](deploy/aks/README.md).
+`config/` holds the kustomize sources; `make build-installer` emits `dist/install.yaml`.
 
-Required repo secrets: `AZURE_CREDENTIALS`, `AZURE_RESOURCE_GROUP`,
-`AKS_CLUSTER_NAME`, `ACR_NAME`.
-
-## Repository layout
-
-```
-api/v1alpha1/             GPUWorkloadBudget types + validation markers
-cmd/main.go               Manager entry point
-config/                   Generated CRD, RBAC, webhook, manager manifests
-internal/accounting/      Pure-Go budget math
-internal/controller/      Reconciler + pod-status conversion
-internal/enforcement/     Evict / Pause / AlertOnly handlers
-internal/webhook/         Validating webhook
-test/e2e/                 Ginkgo e2e suite
-test/bench/               Accuracy harness + gwb-bench CLI
-test/workload-generator/  gwb-workload CLI
-hack/                     bench.sh, chaos.sh, demo/, helm-lint.sh, bench-stack/
-deploy/helm/gwb-operator/ Helm chart
-deploy/aks/               Bicep + parameters for AKS
-docs/                     accounting-model, benchmark-methodology, limitations
-docs/media/               demo.gif (regenerable via `make demo`)
-```
+## Deploy on Azure (AKS)
 
-## Status and limitations
+[`deploy/aks/`](deploy/aks/) ships a Bicep template with `parameters.example.json` that provisions an AKS cluster (1.31, 2x B2s, Azure CNI overlay) and a Basic ACR with `adminUserEnabled: false` and an AcrPull role assignment to the cluster's kubelet identity. The workflow at [`.github/workflows/aks-deploy.yml`](.github/workflows/aks-deploy.yml) builds the operator image on every push, pushes it to ACR, and runs `helm upgrade --install` against the cluster via `azure/setup-helm`. Intended for a student subscription. The trade-offs (no GPU node pool, no monitoring addon, public API server) are spelled out in [`deploy/aks/README.md`](deploy/aks/README.md).
 
-Alpha. Full list of known issues and scope boundaries:
-[docs/limitations.md](docs/limitations.md). The TL;DR:
+Required repository secrets: `AZURE_CREDENTIALS`, `AZURE_RESOURCE_GROUP`, `AKS_CLUSTER_NAME`, `ACR_NAME`.
 
-- Single-budget bench only; overlapping selectors work but aren't
-  measured.
-- Enforcement respects PDBs: a protected workload can stay
-  over-quota until the PDB changes.
-- Benches run on kind with simulated CPU-as-GPU; real NVIDIA
-  device-plugin behaviour is not exercised.
+## Limitations
 
-## License
+Alpha. Full list at [`docs/limitations.md`](docs/limitations.md). The short version:
 
-Apache 2.0. See [`LICENSE`](LICENSE).
+- Single-budget bench only. Overlapping selectors work in code but are not measured.
+- Enforcement respects PodDisruptionBudgets. A protected workload can stay over-quota until the PDB changes.
+- Benches run on kind with simulated CPU-as-GPU. Real NVIDIA device-plugin behaviour is not exercised.
+- No long-running cluster proof. The operator has run for tens of minutes under bench and chaos, not weeks under production load.