
Commit 05c1811

Updates AGENTS.md and styleguide (#79)

* Updates AGENTS.md and styleguide
* Update AGENTS.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

1 parent f9b4e79 commit 05c1811

File tree

2 files changed: +84 -37 lines changed

.gemini/styleguide.md

Lines changed: 34 additions & 10 deletions
```diff
@@ -10,9 +10,6 @@ When performing code reviews on pull requests, you must strictly adhere to the f

 5. **Respect Existing Repo Patterns**: Before suggesting review comments (like asking users to add boilerplate or specific patterns), actively check for existing design patterns across the repository. Do not suggest adding useless code or structures that contradict or fall outside the established Keras repo coding style.

-
-
-
 # Keras Remote API design guidelines

 These guidelines are meant to help focus design discussions and help us create delightful developer experiences for remote execution.
```
```diff
@@ -135,22 +132,49 @@ This prevents confusing situations where a user sets an env var that works in on

 ---

-## CLI commands must be idempotent and follow the reconciliation pattern.
+## CLI commands must be idempotent and use the centralized state module.
+
+All Pulumi stack operations go through `cli/infra/state.py` — **no command file should call `stack.up()`, `stack.destroy()`, `stack.refresh()`, or import `create_program`/`get_stack` directly.**
+
+The three entry points are:
+
+- `load_state(project, zone, cluster_name)` → `StackState` — loads ALL state dimensions (refresh, node pools, etc.)
+- `apply_update(config)` — runs `stack.up()` with a complete `InfraConfig`
+- `apply_destroy(config)` — runs `stack.destroy()`
+
+Every mutating CLI command follows this pattern:

-Every mutating CLI command (`up`, `pool add`, `pool remove`, etc.) must follow the refresh-read-merge-apply pattern:
+1. `load_state()` — refresh stack and read all current state dimensions into `StackState`
+2. Build `InfraConfig` — merge existing state with desired changes
+3. `apply_update(config)` or `apply_destroy(config)` — apply the diff

-1. `stack.refresh()` — sync local state with cloud reality
-2. `get_current_node_pools()` — read current pools from stack exports
-3. Build `InfraConfig` — merge existing state with desired changes
-4. `stack.up()` — apply only the diff
+When adding a **new state dimension** (e.g. namespaces), add it to `StackState` and `load_state()` — every command inherits it automatically, preventing accidental omissions that would cause Pulumi to delete resources.
+
+All CLI commands must use the `common_options` decorator from `cli/options.py` for `--project`/`--zone`/`--cluster` flags — never define these inline.

 This ensures:

 - Re-running after partial failure is always safe
 - Existing resources are never accidentally recreated (Pulumi tracks by URN)
 - External drift is detected and corrected
+- New state dimensions cannot be accidentally omitted by individual commands
+
+---
+
+## All infrastructure resources must be cluster-scoped.
+
+Every resource managed by the CLI must include the cluster name in its identifier so that multiple clusters within the same GCP project are fully independent. The naming convention is `{project}-kr-{cluster_name}-{purpose}` for buckets and `kr-{cluster_name}` for Artifact Registry repos.
+
+| Resource      | Name pattern                         |
+| ------------- | ------------------------------------ |
+| Pulumi stack  | `{project}-{cluster_name}`           |
+| Jobs bucket   | `{project}-kr-{cluster_name}-jobs`   |
+| Builds bucket | `{project}-kr-{cluster_name}-builds` |
+| AR repository | `kr-{cluster_name}`                  |
+
+The only exception is project-wide GCP API enablement, which is intentionally shared across clusters (`disable_on_destroy=False`).

-When adding a new CLI command that modifies infrastructure, follow this pattern rather than directly creating or deleting resources.
+When adding a new infrastructure resource, always scope it to the `(project, cluster_name)` pair. Runtime code (`JobContext`, `container_builder`) resolves the cluster name from the `KERAS_REMOTE_CLUSTER` env var or default.

 ---
```

AGENTS.md

Lines changed: 50 additions & 27 deletions
````diff
@@ -16,9 +16,10 @@ keras_remote/
 ├── utils/           # Serialization (packager) and Cloud Storage helpers
 ├── cli/             # CLI for infrastructure provisioning (Pulumi-based)
 │   ├── commands/    # up, down, status, config, pool (add/remove/list)
-│   └── infra/       # Pulumi programs, stack management, post-deploy steps
+│   ├── infra/       # Pulumi programs, stack management, state module, post-deploy steps
+│   └── options.py   # Shared --project/--zone/--cluster Click options (common_options decorator)
 ├── credentials.py   # Credential verification & auto-setup (shared by core & CLI)
-└── constants.py     # Zone/region utilities
+└── constants.py     # Zone/region utilities, get_default_cluster_name()
 ```

 ## Execution Pipeline
````
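The tree above lists `cli/options.py`, whose `common_options` decorator centralizes the shared flags. A framework-agnostic sketch of the idea follows; the real module uses Click, and the default values here are invented for illustration.

```python
import functools

# Assumed defaults, purely illustrative; the real CLI resolves these
# from gcloud config / environment.
ASSUMED_DEFAULTS = {
    "project": "my-gcp-project",
    "zone": "us-central1-a",
    "cluster": "keras-remote",
}


def common_options(command):
    # One decorator fills the shared flags for every command, so no
    # command file defines --project/--zone/--cluster inline.
    @functools.wraps(command)
    def wrapper(**kwargs):
        for flag, default in ASSUMED_DEFAULTS.items():
            kwargs.setdefault(flag, default)
        return command(**kwargs)
    return wrapper


@common_options
def up(project, zone, cluster):
    # A command body sees fully resolved flags.
    return f"cluster {cluster} in {project}/{zone}"
```

The payoff is that adding or renaming a shared flag touches one module instead of every command file.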
```diff
@@ -38,31 +39,34 @@ keras_remote/

 ## Key Modules

-| Module                       | Responsibility                                                                   |
-| ---------------------------- | -------------------------------------------------------------------------------- |
-| `core/core.py`               | `@run()` decorator, backend routing, env var capture                             |
-| `core/accelerators.py`       | Accelerator registry (`GPUS`, `TPUS`), parser (`parse_accelerator`)              |
-| `credentials.py`             | Credential verification & auto-setup (gcloud, ADC, kubeconfig)                   |
-| `backend/execution.py`       | `JobContext` dataclass, `BaseK8sBackend` base class, `execute_remote()` pipeline |
-| `backend/gke_client.py`      | K8s Job creation, status polling, pod log retrieval                              |
-| `backend/pathways_client.py` | LeaderWorkerSet creation for multi-host TPUs                                     |
-| `infra/container_builder.py` | Content-hashed Docker image building via Cloud Build                             |
-| `data/data.py`               | `Data` class, content hashing, data ref serialization                            |
-| `utils/packager.py`          | `save_payload()` (cloudpickle), `zip_working_dir()`, Data ref extraction         |
-| `utils/storage.py`           | GCS upload/download/cleanup for job artifacts and Data cache                     |
-| `runner/remote_runner.py`    | Runs inside container: resolve Data refs/volumes, execute, upload result         |
-| `cli/commands/pool.py`       | Node pool add/remove/list commands                                               |
-| `cli/infra/post_deploy.py`   | kubectl, LWS CRD, GPU driver setup after stack.up()                              |
-| `cli/constants.py`           | CLI defaults, paths, API list                                                    |
-| `cli/main.py`                | CLI entry point (`keras-remote` command)                                         |
+| Module                       | Responsibility                                                                                            |
+| ---------------------------- | --------------------------------------------------------------------------------------------------------- |
+| `core/core.py`               | `@run()` decorator, backend routing, env var capture                                                      |
+| `core/accelerators.py`       | Accelerator registry (`GPUS`, `TPUS`), parser (`parse_accelerator`)                                       |
+| `credentials.py`             | Credential verification & auto-setup (gcloud, ADC, kubeconfig)                                            |
+| `backend/execution.py`       | `JobContext` dataclass (carries `cluster_name`), `BaseK8sBackend` base class, `execute_remote()` pipeline |
+| `backend/gke_client.py`      | K8s Job creation, status polling, pod log retrieval                                                       |
+| `backend/pathways_client.py` | LeaderWorkerSet creation for multi-host TPUs                                                              |
+| `infra/container_builder.py` | Content-hashed Docker image building via Cloud Build                                                      |
+| `data/data.py`               | `Data` class, content hashing, data ref serialization                                                     |
+| `utils/packager.py`          | `save_payload()` (cloudpickle), `zip_working_dir()`, Data ref extraction                                  |
+| `utils/storage.py`           | GCS upload/download/cleanup for job artifacts and Data cache                                              |
+| `runner/remote_runner.py`    | Runs inside container: resolve Data refs/volumes, execute, upload result                                  |
+| `cli/infra/state.py`         | Centralized Pulumi state: `load_state()`, `apply_update()`, `apply_destroy()`                             |
+| `cli/options.py`             | Shared `common_options` Click decorator (`--project`/`--zone`/`--cluster`)                                |
+| `cli/commands/pool.py`       | Node pool add/remove/list commands                                                                        |
+| `cli/infra/post_deploy.py`   | kubectl, LWS CRD, GPU driver setup after stack.up()                                                       |
+| `cli/constants.py`           | CLI defaults, paths, API list                                                                             |
+| `cli/main.py`                | CLI entry point (`keras-remote` command)                                                                  |

 ## Key Abstractions

-- **`JobContext`** (`backend/execution.py`): Mutable dataclass carrying all job state through the pipeline — inputs, generated IDs, artifact paths, image URI.
+- **`JobContext`** (`backend/execution.py`): Mutable dataclass carrying all job state through the pipeline — inputs, generated IDs, artifact paths, image URI, `cluster_name` (for cluster-scoped bucket/repo resolution).
 - **`BaseK8sBackend`** (`backend/execution.py`): Base class with `submit_job`, `wait_for_job`, `cleanup_job`. Subclassed by `GKEBackend` and `PathwaysBackend`.
 - **`GpuConfig` / `TpuConfig`** (`core/accelerators.py`): Frozen dataclasses for accelerator metadata. Single source of truth used by runtime, container builder, and CLI.
 - **`Data`** (`data/data.py`): Wraps a local path or GCS URI. Passed as a function argument or via the `volumes` decorator parameter. Resolved to a plain filesystem path on the remote pod. Content-hashed for upload caching.
 - **`InfraConfig` / `NodePoolConfig`** (`cli/config.py`): CLI provisioning configuration. `InfraConfig` holds project, zone, cluster name, and a list of `NodePoolConfig` entries. `NodePoolConfig` pairs a unique pool name (e.g., `gpu-l4-a3f2`) with a `GpuConfig` or `TpuConfig`.
+- **`StackState`** (`cli/infra/state.py`): Dataclass bundling all state dimensions loaded from a Pulumi stack (project, zone, cluster_name, node_pools, stack handle). Returned by `load_state()` and consumed by commands.

 ## Data API
```
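The `JobContext` change in this commit (carrying `cluster_name` for cluster-scoped bucket/repo resolution) can be sketched as a dataclass. The field set, the env-var fallback value, and the `jobs_bucket` property are assumptions for illustration, not the repo's actual definitions.

```python
import os
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobContext:
    # Sketch: mutable dataclass carrying job state through the pipeline.
    # Fields beyond cluster_name/image URI are illustrative.
    project: str
    cluster_name: str = field(
        # Documented behavior: KERAS_REMOTE_CLUSTER env var wins, else a
        # default ("keras-remote" here is an assumed value).
        default_factory=lambda: os.environ.get("KERAS_REMOTE_CLUSTER", "keras-remote")
    )
    job_id: Optional[str] = None
    image_uri: Optional[str] = None

    @property
    def jobs_bucket(self) -> str:
        # Cluster-scoped bucket resolution per the naming convention.
        return f"{self.project}-kr-{self.cluster_name}-jobs"
```

Resolving the bucket from `cluster_name` at runtime is what keeps jobs on different clusters in the same project from sharing artifacts.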

```diff
@@ -141,15 +145,34 @@ Additional CLI-only env vars:

 ### CLI State Management

-The CLI manages three layers of state: in-memory config (`InfraConfig`), Pulumi local state files (`~/.keras-remote/pulumi/`), and GCP cloud resources. Each GCP project gets its own Pulumi stack (stack name = project ID).
+The CLI manages three layers of state: in-memory config (`InfraConfig`), Pulumi local state files (`~/.keras-remote/pulumi/`), and GCP cloud resources. Each `(project, cluster_name)` pair gets its own Pulumi stack (stack name = `{project}-{cluster_name}`), so multiple clusters in the same GCP project are fully independent.

-Every mutating command (`up`, `pool add`, `pool remove`, etc.) follows this reconciliation pattern:
+**Centralized state module (`cli/infra/state.py`)** — All Pulumi stack operations go through three functions:

-1. `stack.refresh()` — pull cloud reality into local state
-2. `get_current_node_pools()` — read current pools from stack exports
-3. Build new `InfraConfig` — merge existing pools with desired changes
-4. `create_program(config)` — generate Pulumi program from desired state
-5. `stack.up()` — diff desired vs current, apply only changes
+| Function          | Purpose                                                                                 | Used by                         |
+| ----------------- | --------------------------------------------------------------------------------------- | ------------------------------- |
+| `load_state()`    | Load ALL state dimensions (prerequisites, defaults, refresh, node pools) → `StackState` | `up`, `pool`, `status`          |
+| `apply_update()`  | Run `stack.up()` with a complete `InfraConfig`                                           | `up`, `pool add`, `pool remove` |
+| `apply_destroy()` | Run `stack.destroy()`                                                                    | `down`                          |
+
+**Safety invariants:**
+
+- `stack.up()`, `stack.destroy()`, `stack.refresh()` appear **only** in `state.py`
+- No command file imports `create_program` or `get_stack` directly
+- No command file defines inline `--project`/`--zone`/`--cluster` options (use `common_options` from `cli/options.py`)
+- When a new state dimension is added (e.g. namespaces), it is added to `StackState` and `load_state()` — every command gets it automatically
+
+**Cluster-scoped resource naming:**
+
+| Resource      | Name pattern                         |
+| ------------- | ------------------------------------ |
+| Pulumi stack  | `{project}-{cluster_name}`           |
+| Jobs bucket   | `{project}-kr-{cluster_name}-jobs`   |
+| Builds bucket | `{project}-kr-{cluster_name}-builds` |
+| AR repository | `kr-{cluster_name}`                  |
+| GKE cluster   | `{cluster_name}`                     |
+
+*Note: GCP APIs are enabled project-wide, shared across clusters, and are not disabled when a cluster is destroyed (`disable_on_destroy=False`).*

 Key behaviors:
```
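The cluster-scoped naming convention in this commit can be collected into a single helper. The function name `resource_names` is hypothetical; the patterns themselves come from the documented naming table.

```python
def resource_names(project: str, cluster_name: str) -> dict:
    # Applies the documented cluster-scoped naming convention (sketch).
    return {
        "pulumi_stack": f"{project}-{cluster_name}",
        "jobs_bucket": f"{project}-kr-{cluster_name}-jobs",
        "builds_bucket": f"{project}-kr-{cluster_name}-builds",
        "ar_repository": f"kr-{cluster_name}",
        "gke_cluster": cluster_name,
    }
```

Deriving every identifier from the `(project, cluster_name)` pair in one place is what guarantees two clusters in the same project never collide on buckets or repos.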
