.gemini/styleguide.md
5. **Respect Existing Repo Patterns**: Before suggesting review comments (such as asking users to add boilerplate or specific patterns), actively check for existing design patterns across the repository. Do not suggest adding unnecessary code or structures that contradict or fall outside the established Keras repo coding style.
# Keras Remote API design guidelines
These guidelines are meant to focus design discussions and help us create delightful developer experiences for remote execution.
---
## CLI commands must be idempotent and use the centralized state module.
All Pulumi stack operations go through `cli/infra/state.py` — **no command file should call `stack.up()`, `stack.destroy()`, `stack.refresh()`, or import `create_program`/`get_stack` directly.**
The three entry points are:
- `load_state(project, zone, cluster_name)` → `StackState` — refreshes the stack and loads ALL state dimensions (node pools, etc.)
- `apply_update(config)` — runs `stack.up()` with a complete `InfraConfig`
- `apply_destroy(config)` — runs `stack.destroy()`
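Under the guide's description, the module surface might look like this minimal sketch. The function names mirror the guide; the dataclass fields and stubbed bodies are illustrative assumptions, not the real `cli/infra/state.py`:

```python
from dataclasses import dataclass, field

@dataclass
class StackState:
    project: str
    zone: str
    cluster_name: str
    node_pools: dict = field(default_factory=dict)  # one of several state dimensions

@dataclass
class InfraConfig:
    project: str
    zone: str
    cluster_name: str
    node_pools: dict

def load_state(project: str, zone: str, cluster_name: str) -> StackState:
    # Real version: refresh the Pulumi stack, then read every state
    # dimension out of the stack exports. Stubbed here for illustration.
    return StackState(project, zone, cluster_name)

def apply_update(config: InfraConfig) -> None:
    # Real version: stack.up() with the complete desired InfraConfig.
    raise NotImplementedError

def apply_destroy(config: InfraConfig) -> None:
    # Real version: stack.destroy().
    raise NotImplementedError
```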
Every mutating CLI command (`up`, `pool add`, `pool remove`, etc.) follows this pattern:
1. `load_state()` — refresh stack and read all current state dimensions into `StackState`
2. Build `InfraConfig` — merge existing state with desired changes
3. `apply_update(config)` or `apply_destroy(config)` — apply the diff
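A hypothetical `pool add` command body following these three steps. The merge in step 2 is the crucial part: starting from an empty config instead of the loaded state would make Pulumi delete the other pools. The dict-shaped state and the injected `load_state`/`apply_update` callables are stand-ins for the real types:

```python
def pool_add(load_state, apply_update, project, zone, cluster, name, pool_cfg):
    state = load_state(project, zone, cluster)        # 1. refresh + read all state
    merged = {**state["node_pools"], name: pool_cfg}  # 2. merge, never replace
    config = {**state, "node_pools": merged}
    apply_update(config)                              # 3. apply the diff
    return config
```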
When adding a **new state dimension** (e.g. namespaces), add it to `StackState` and `load_state()` — every command inherits it automatically, preventing accidental omissions that would cause Pulumi to delete resources.
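A sketch of what adding such a dimension looks like, using a hypothetical `namespaces` field. Because commands build their desired config from the full `StackState`, the new field flows through every command without touching them individually (names and shapes here are illustrative):

```python
from dataclasses import dataclass, field, asdict

@dataclass
class StackState:
    project: str
    zone: str
    cluster_name: str
    node_pools: dict = field(default_factory=dict)
    namespaces: list = field(default_factory=list)  # new dimension, added in one place

def desired_config(state: StackState, **changes) -> dict:
    # Start from everything load_state() read, then overlay the changes;
    # dimensions the command does not touch (like namespaces) are
    # preserved rather than silently dropped.
    return {**asdict(state), **changes}
```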
All CLI commands must use the `common_options` decorator from `cli/options.py` for `--project`/`--zone`/`--cluster` flags — never define these inline.
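A framework-neutral sketch of what such a decorator does. The real `cli/options.py` presumably wires these flags into the CLI framework; the argparse plumbing below is an illustrative stand-in, and only the flag names come from the guide:

```python
import argparse
import functools

def common_options(command):
    """Define --project/--zone/--cluster once, shared by every command."""
    @functools.wraps(command)
    def wrapper(argv):
        parser = argparse.ArgumentParser(prog=command.__name__)
        parser.add_argument("--project", required=True)
        parser.add_argument("--zone", required=True)
        parser.add_argument("--cluster", dest="cluster_name", default=None)
        args = parser.parse_args(argv)
        return command(args.project, args.zone, args.cluster_name)
    return wrapper

@common_options
def pool_list(project, zone, cluster_name):
    # A command body never re-declares the shared flags.
    return (project, zone, cluster_name)
```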
This ensures:
- Re-running after partial failure is always safe
- Existing resources are never accidentally recreated (Pulumi tracks by URN)
- External drift is detected and corrected
- New state dimensions cannot be accidentally omitted by individual commands
---
## All infrastructure resources must be cluster-scoped.
Every resource managed by the CLI must include the cluster name in its identifier so that multiple clusters within the same GCP project are fully independent. The naming convention is `{project}-kr-{cluster_name}-{purpose}` for buckets and `kr-{cluster_name}` for Artifact Registry repos.
The only exception is project-wide GCP API enablement, which is intentionally shared across clusters (`disable_on_destroy=False`).
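The convention can be captured in two tiny helpers (the helper names are hypothetical; the real CLI may build these strings elsewhere):

```python
def bucket_name(project: str, cluster_name: str, purpose: str) -> str:
    # Bucket convention: {project}-kr-{cluster_name}-{purpose}
    return f"{project}-kr-{cluster_name}-{purpose}"

def registry_repo(cluster_name: str) -> str:
    # Artifact Registry convention: kr-{cluster_name}
    return f"kr-{cluster_name}"
```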
When adding a new infrastructure resource, always scope it to the `(project, cluster_name)` pair. Runtime code (`JobContext`, `container_builder`) resolves the cluster name from the `KERAS_REMOTE_CLUSTER` env var or the default.
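The runtime lookup reduces to a one-line resolution, sketched here with an assumed fallback value (the actual default presumably lives in `cli/constants.py`):

```python
import os

DEFAULT_CLUSTER = "default"  # assumed fallback, illustrative only

def resolve_cluster_name() -> str:
    # Mirrors the runtime lookup described above: env var wins, else default.
    return os.environ.get("KERAS_REMOTE_CLUSTER", DEFAULT_CLUSTER)
```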
| `cli/commands/pool.py` | Node pool add/remove/list commands |
| `cli/infra/post_deploy.py` | kubectl, LWS CRD, GPU driver setup after `stack.up()` |
| `cli/constants.py` | CLI defaults, paths, API list |
| `cli/main.py` | CLI entry point (`keras-remote` command) |
## Key Abstractions
- **`JobContext`** (`backend/execution.py`): Mutable dataclass carrying all job state through the pipeline — inputs, generated IDs, artifact paths, image URI, `cluster_name` (for cluster-scoped bucket/repo resolution).
- **`BaseK8sBackend`** (`backend/execution.py`): Base class with `submit_job`, `wait_for_job`, `cleanup_job`. Subclassed by `GKEBackend` and `PathwaysBackend`.
- **`GpuConfig` / `TpuConfig`** (`core/accelerators.py`): Frozen dataclasses for accelerator metadata. Single source of truth used by runtime, container builder, and CLI.
- **`Data`** (`data/data.py`): Wraps a local path or GCS URI. Passed as a function argument or via the `volumes` decorator parameter. Resolved to a plain filesystem path on the remote pod. Content-hashed for upload caching.
- **`InfraConfig` / `NodePoolConfig`** (`cli/config.py`): CLI provisioning configuration. `InfraConfig` holds project, zone, cluster name, and a list of `NodePoolConfig` entries. `NodePoolConfig` pairs a unique pool name (e.g., `gpu-l4-a3f2`) with a `GpuConfig` or `TpuConfig`.
- **`StackState`** (`cli/infra/state.py`): Dataclass bundling all state dimensions loaded from a Pulumi stack (project, zone, cluster_name, node_pools, stack handle). Returned by `load_state()` and consumed by commands.
The CLI manages three layers of state: in-memory config (`InfraConfig`), Pulumi local state files (`~/.keras-remote/pulumi/`), and GCP cloud resources. Each `(project, cluster_name)` pair gets its own Pulumi stack (stack name = `{project}-{cluster_name}`), so multiple clusters in the same GCP project are fully independent.
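The stack naming and local state location reduce to a small sketch (the helper name is hypothetical):

```python
from pathlib import Path

# Pulumi local state files live under ~/.keras-remote/pulumi/
PULUMI_STATE_DIR = Path.home() / ".keras-remote" / "pulumi"

def stack_name(project: str, cluster_name: str) -> str:
    # One Pulumi stack per (project, cluster_name) pair, so clusters
    # in the same GCP project stay fully independent.
    return f"{project}-{cluster_name}"
```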
**Centralized state module (`cli/infra/state.py`)** — All Pulumi stack operations go through three functions: