Merged
`.gemini/styleguide.md` (44 changes: 34 additions, 10 deletions)

@@ -10,9 +10,6 @@ When performing code reviews on pull requests, you must strictly adhere to the f

5. **Respect Existing Repo Patterns**: Before suggesting review comments (like asking users to add boilerplate or specific patterns), actively check for existing design patterns across the repository. Do not suggest adding useless code or structures that contradict or fall outside the established Keras repo coding style.




# Keras Remote API design guidelines

These guidelines are meant to help focus design discussions and help us create delightful developer experiences for remote execution.
@@ -135,22 +132,49 @@ This prevents confusing situations where a user sets an env var that works in on

---

## CLI commands must be idempotent and use the centralized state module.

All Pulumi stack operations go through `cli/infra/state.py` — **no command file should call `stack.up()`, `stack.destroy()`, `stack.refresh()`, or import `create_program`/`get_stack` directly.**

The three entry points are:

- `load_state(project, zone, cluster_name)` → `StackState` — loads ALL state dimensions (refresh, node pools, etc.)
- `apply_update(config)` — runs `stack.up()` with a complete `InfraConfig`
- `apply_destroy(config)` — runs `stack.destroy()`

Every mutating CLI command (`up`, `pool add`, `pool remove`, etc.) must follow the refresh-read-merge-apply pattern:
1. `load_state()` — refresh stack and read all current state dimensions into `StackState`
2. Build `InfraConfig` — merge existing state with desired changes
3. `apply_update(config)` or `apply_destroy(config)` — apply the diff
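
As a hedged illustration of step 2 (the merge), here is a minimal sketch. The stand-in dataclasses and the `merge_pool_add` helper are hypothetical, with field sets inferred from this guide rather than copied from `cli/infra/state.py`; a real command would bracket this with `load_state()` and `apply_update()`:

```python
from dataclasses import dataclass, field

# Minimal stand-ins for the real StackState / InfraConfig types.
@dataclass
class StackState:
    project: str
    zone: str
    cluster_name: str
    node_pools: list = field(default_factory=list)

@dataclass
class InfraConfig:
    project: str
    zone: str
    cluster_name: str
    node_pools: list = field(default_factory=list)

def merge_pool_add(state: StackState, new_pool: str) -> InfraConfig:
    """Step 2 of the pattern: merge existing state with the desired change.

    Existing pools are always carried over, so the subsequent apply never
    deletes resources that were provisioned by an earlier command.
    """
    return InfraConfig(
        project=state.project,
        zone=state.zone,
        cluster_name=state.cluster_name,
        node_pools=state.node_pools + [new_pool],
    )
```

Because the merge starts from the freshly loaded `StackState`, re-running the command after a partial failure simply re-derives the same desired state.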

When adding a **new state dimension** (e.g. namespaces), add it to `StackState` and `load_state()` — every command inherits it automatically, preventing accidental omissions that would cause Pulumi to delete resources.

All CLI commands must use the `common_options` decorator from `cli/options.py` for `--project`/`--zone`/`--cluster` flags — never define these inline.
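
To show the decorator-stacking pattern in a dependency-free form: the real `common_options` in `cli/options.py` presumably composes `click.option` calls, and the `option` stand-in below only records each flag so the composition is visible without a Click dependency. Everything here is a sketch, not the actual module:

```python
# Stand-in for click.option: records each flag on the decorated function.
def option(name, **attrs):
    def wrap(f):
        f._options = getattr(f, "_options", []) + [(name, attrs)]
        return f
    return wrap

def common_options(f):
    """Attach the shared flags in one place so no command defines them inline."""
    f = option("--cluster", default=None)(f)
    f = option("--zone", required=True)(f)
    f = option("--project", required=True)(f)
    return f

@common_options
def up(**kwargs):
    """A command that now accepts --project/--zone/--cluster."""
```

Centralizing the flags means a change to their defaults or help text propagates to every command at once.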

This ensures:

- Re-running after partial failure is always safe
- Existing resources are never accidentally recreated (Pulumi tracks by URN)
- External drift is detected and corrected
- New state dimensions cannot be accidentally omitted by individual commands

---

## All infrastructure resources must be cluster-scoped.

Every resource managed by the CLI must include the cluster name in its identifier so that multiple clusters within the same GCP project are fully independent. The naming convention is `{project}-kr-{cluster_name}-{purpose}` for buckets and `kr-{cluster_name}` for Artifact Registry repos.

| Resource | Name pattern |
| ------------- | ------------------------------------ |
| Pulumi stack | `{project}-{cluster_name}` |
| Jobs bucket | `{project}-kr-{cluster_name}-jobs` |
| Builds bucket | `{project}-kr-{cluster_name}-builds` |
| AR repository | `kr-{cluster_name}` |

The only exception is project-wide GCP API enablement, which is intentionally shared across clusters (`disable_on_destroy=False`).
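
The convention can be captured in small helpers. The function names here are hypothetical; the name patterns themselves come from the table above:

```python
def stack_name(project: str, cluster_name: str) -> str:
    # Pulumi stack: one per (project, cluster) pair
    return f"{project}-{cluster_name}"

def jobs_bucket(project: str, cluster_name: str) -> str:
    return f"{project}-kr-{cluster_name}-jobs"

def builds_bucket(project: str, cluster_name: str) -> str:
    return f"{project}-kr-{cluster_name}-builds"

def ar_repo(cluster_name: str) -> str:
    # Artifact Registry repo names cannot embed the project; it is implicit
    return f"kr-{cluster_name}"
```
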

When adding a new infrastructure resource, always scope it to the `(project, cluster_name)` pair. Runtime code (`JobContext`, `container_builder`) resolves the cluster name from the `KERAS_REMOTE_CLUSTER` env var, falling back to a default.
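
A sketch of that resolution. Only the `KERAS_REMOTE_CLUSTER` variable name comes from this guide; the helper name and the fallback value shown here are assumptions:

```python
import os

def resolve_cluster_name(default: str = "keras-remote") -> str:
    """Resolve the active cluster for runtime code (JobContext, container_builder)."""
    return os.environ.get("KERAS_REMOTE_CLUSTER", default)
```
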

---

`AGENTS.md` (77 changes: 50 additions, 27 deletions)

@@ -16,9 +16,10 @@

```
keras_remote/
├── utils/           # Serialization (packager) and Cloud Storage helpers
├── cli/             # CLI for infrastructure provisioning (Pulumi-based)
│   ├── commands/    # up, down, status, config, pool (add/remove/list)
│   ├── infra/       # Pulumi programs, stack management, state module, post-deploy steps
│   └── options.py   # Shared --project/--zone/--cluster Click options (common_options decorator)
├── credentials.py   # Credential verification & auto-setup (shared by core & CLI)
└── constants.py     # Zone/region utilities, get_default_cluster_name()
```

## Execution Pipeline
@@ -38,31 +39,34 @@

## Key Modules

| Module | Responsibility |
| ---------------------------- | --------------------------------------------------------------------------------------------------------- |
| `core/core.py` | `@run()` decorator, backend routing, env var capture |
| `core/accelerators.py` | Accelerator registry (`GPUS`, `TPUS`), parser (`parse_accelerator`) |
| `credentials.py` | Credential verification & auto-setup (gcloud, ADC, kubeconfig) |
| `backend/execution.py` | `JobContext` dataclass (carries `cluster_name`), `BaseK8sBackend` base class, `execute_remote()` pipeline |
| `backend/gke_client.py` | K8s Job creation, status polling, pod log retrieval |
| `backend/pathways_client.py` | LeaderWorkerSet creation for multi-host TPUs |
| `infra/container_builder.py` | Content-hashed Docker image building via Cloud Build |
| `data/data.py` | `Data` class, content hashing, data ref serialization |
| `utils/packager.py` | `save_payload()` (cloudpickle), `zip_working_dir()`, Data ref extraction |
| `utils/storage.py` | GCS upload/download/cleanup for job artifacts and Data cache |
| `runner/remote_runner.py` | Runs inside container: resolve Data refs/volumes, execute, upload result |
| `cli/infra/state.py` | Centralized Pulumi state: `load_state()`, `apply_update()`, `apply_destroy()` |
| `cli/options.py` | Shared `common_options` Click decorator (`--project`/`--zone`/`--cluster`) |
| `cli/commands/pool.py` | Node pool add/remove/list commands |
| `cli/infra/post_deploy.py` | kubectl, LWS CRD, GPU driver setup after stack.up() |
| `cli/constants.py` | CLI defaults, paths, API list |
| `cli/main.py` | CLI entry point (`keras-remote` command) |

## Key Abstractions

- **`JobContext`** (`backend/execution.py`): Mutable dataclass carrying all job state through the pipeline — inputs, generated IDs, artifact paths, image URI, `cluster_name` (for cluster-scoped bucket/repo resolution).
- **`BaseK8sBackend`** (`backend/execution.py`): Base class with `submit_job`, `wait_for_job`, `cleanup_job`. Subclassed by `GKEBackend` and `PathwaysBackend`.
- **`GpuConfig` / `TpuConfig`** (`core/accelerators.py`): Frozen dataclasses for accelerator metadata. Single source of truth used by runtime, container builder, and CLI.
- **`Data`** (`data/data.py`): Wraps a local path or GCS URI. Passed as a function argument or via the `volumes` decorator parameter. Resolved to a plain filesystem path on the remote pod. Content-hashed for upload caching.
- **`InfraConfig` / `NodePoolConfig`** (`cli/config.py`): CLI provisioning configuration. `InfraConfig` holds project, zone, cluster name, and a list of `NodePoolConfig` entries. `NodePoolConfig` pairs a unique pool name (e.g., `gpu-l4-a3f2`) with a `GpuConfig` or `TpuConfig`.
- **`StackState`** (`cli/infra/state.py`): Dataclass bundling all state dimensions loaded from a Pulumi stack (project, zone, cluster_name, node_pools, stack handle). Returned by `load_state()` and consumed by commands.
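
To make the CLI config shapes concrete, a rough sketch follows. Field names are inferred from this document; the real definitions live in `cli/config.py` and may differ, and the `accelerator` field here is a plain-string stand-in for a `GpuConfig`/`TpuConfig` instance:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class NodePoolConfig:
    name: str         # unique pool name, e.g. "gpu-l4-a3f2"
    accelerator: str  # stands in for a GpuConfig / TpuConfig instance

@dataclass
class InfraConfig:
    project: str
    zone: str
    cluster_name: str
    node_pools: list = field(default_factory=list)
```

Keeping `NodePoolConfig` frozen mirrors the frozen accelerator dataclasses: pool definitions are values to be replaced wholesale, never mutated in place.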

## Data API

@@ -141,15 +145,34 @@ Additional CLI-only env vars:

### CLI State Management

The CLI manages three layers of state: in-memory config (`InfraConfig`), Pulumi local state files (`~/.keras-remote/pulumi/`), and GCP cloud resources. Each `(project, cluster_name)` pair gets its own Pulumi stack (stack name = `{project}-{cluster_name}`), so multiple clusters in the same GCP project are fully independent.

**Centralized state module (`cli/infra/state.py`)** — All Pulumi stack operations go through three functions:

| Function | Purpose | Used by |
| ----------------- | --------------------------------------------------------------------------------------- | ------------------------------- |
| `load_state()` | Load ALL state dimensions (prerequisites, defaults, refresh, node pools) → `StackState` | `up`, `pool`, `status` |
| `apply_update()` | Run `stack.up()` with a complete `InfraConfig` | `up`, `pool add`, `pool remove` |
| `apply_destroy()` | Run `stack.destroy()` | `down` |

**Safety invariants:**

- `stack.up()`, `stack.destroy()`, `stack.refresh()` appear **only** in `state.py`
- No command file imports `create_program` or `get_stack` directly
- No command file defines inline `--project`/`--zone`/`--cluster` options (use `common_options` from `cli/options.py`)
- When a new state dimension is added (e.g. namespaces), it is added to `StackState` and `load_state()` — every command gets it automatically

**Cluster-scoped resource naming:**

| Resource | Name pattern |
| ------------- | ------------------------------------------------- |
| Pulumi stack | `{project}-{cluster_name}` |
| Jobs bucket | `{project}-kr-{cluster_name}-jobs` |
| Builds bucket | `{project}-kr-{cluster_name}-builds` |
| AR repository | `kr-{cluster_name}` |
| GKE cluster | `{cluster_name}` |

*Note: GCP APIs are enabled project-wide, shared across clusters, and are not disabled when a cluster is destroyed (`disable_on_destroy=False`).*

Key behaviors:
