Merged
`.gemini/styleguide.md` (44 changes: 34 additions, 10 deletions)

@@ -10,9 +10,6 @@ When performing code reviews on pull requests, you must strictly adhere to the f

5. **Respect Existing Repo Patterns**: Before suggesting review comments (like asking users to add boilerplate or specific patterns), actively check for existing design patterns across the repository. Do not suggest adding useless code or structures that contradict or fall outside the established Keras repo coding style.




# Keras Remote API design guidelines

These guidelines are meant to help focus design discussions and help us create delightful developer experiences for remote execution.
@@ -135,22 +132,49 @@ This prevents confusing situations where a user sets an env var that works in on

---

## CLI commands must be idempotent and use the centralized state module.

All Pulumi stack operations go through `cli/infra/state.py` — **no command file should call `stack.up()`, `stack.destroy()`, `stack.refresh()`, or import `create_program`/`get_stack` directly.**

The three entry points are:

- `load_state(project, zone, cluster_name)` → `StackState` — loads ALL state dimensions (refresh, node pools, etc.)
- `apply_update(config)` — runs `stack.up()` with a complete `InfraConfig`
- `apply_destroy(config)` — runs `stack.destroy()`

Every mutating CLI command (`up`, `pool add`, `pool remove`, etc.) must follow the refresh-read-merge-apply pattern:
1. `load_state()` — refresh stack and read all current state dimensions into `StackState`
2. Build `InfraConfig` — merge existing state with desired changes
3. `apply_update(config)` or `apply_destroy(config)` — apply the diff
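
As a hedged illustration of step 2 (the merge), here is a minimal sketch. The stand-in dataclasses and the `merge_pool_add` helper are hypothetical, with field sets inferred from this guide rather than copied from `cli/infra/state.py`; a real command would bracket this with `load_state()` and `apply_update()`:

```python
from dataclasses import dataclass, field

# Minimal stand-ins for the real StackState / InfraConfig types.
@dataclass
class StackState:
    project: str
    zone: str
    cluster_name: str
    node_pools: list = field(default_factory=list)

@dataclass
class InfraConfig:
    project: str
    zone: str
    cluster_name: str
    node_pools: list = field(default_factory=list)

def merge_pool_add(state: StackState, new_pool: str) -> InfraConfig:
    """Step 2 of the pattern: merge existing state with the desired change.

    Existing pools are always carried over, so the subsequent apply never
    deletes resources that were provisioned by an earlier command.
    """
    return InfraConfig(
        project=state.project,
        zone=state.zone,
        cluster_name=state.cluster_name,
        node_pools=state.node_pools + [new_pool],
    )
```

Because the merge starts from the freshly loaded `StackState`, re-running the command after a partial failure simply re-derives the same desired state.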

When adding a **new state dimension** (e.g. namespaces), add it to `StackState` and `load_state()` — every command inherits it automatically, preventing accidental omissions that would cause Pulumi to delete resources.

All CLI commands must use the `common_options` decorator from `cli/options.py` for `--project`/`--zone`/`--cluster` flags — never define these inline.
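
To show the decorator-stacking pattern in a dependency-free form: the real `common_options` in `cli/options.py` presumably composes `click.option` calls, and the `option` stand-in below only records each flag so the composition is visible without a Click dependency. Everything here is a sketch, not the actual module:

```python
# Stand-in for click.option: records each flag on the decorated function.
def option(name, **attrs):
    def wrap(f):
        f._options = getattr(f, "_options", []) + [(name, attrs)]
        return f
    return wrap

def common_options(f):
    """Attach the shared flags in one place so no command defines them inline."""
    f = option("--cluster", default=None)(f)
    f = option("--zone", required=True)(f)
    f = option("--project", required=True)(f)
    return f

@common_options
def up(**kwargs):
    """A command that now accepts --project/--zone/--cluster."""
```

Centralizing the flags means a change to their defaults or help text propagates to every command at once.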

This ensures:

- Re-running after partial failure is always safe
- Existing resources are never accidentally recreated (Pulumi tracks by URN)
- External drift is detected and corrected
- New state dimensions cannot be accidentally omitted by individual commands

---

## All infrastructure resources must be cluster-scoped.

Every resource managed by the CLI must include the cluster name in its identifier so that multiple clusters within the same GCP project are fully independent. The naming convention is `{project}-kr-{cluster_name}-{purpose}` for buckets and `kr-{cluster_name}` for Artifact Registry repos.

| Resource | Name pattern |
| ------------- | ------------------------------------ |
| Pulumi stack | `{project}-{cluster_name}` |
| Jobs bucket | `{project}-kr-{cluster_name}-jobs` |
| Builds bucket | `{project}-kr-{cluster_name}-builds` |
| AR repository | `kr-{cluster_name}` |

The only exception is project-wide GCP API enablement, which is intentionally shared across clusters (`disable_on_destroy=False`).
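
The convention can be captured in small helpers. The function names here are hypothetical; the name patterns themselves come from the table above:

```python
def stack_name(project: str, cluster_name: str) -> str:
    # Pulumi stack: one per (project, cluster) pair
    return f"{project}-{cluster_name}"

def jobs_bucket(project: str, cluster_name: str) -> str:
    return f"{project}-kr-{cluster_name}-jobs"

def builds_bucket(project: str, cluster_name: str) -> str:
    return f"{project}-kr-{cluster_name}-builds"

def ar_repo(cluster_name: str) -> str:
    # Artifact Registry repo names cannot embed the project; it is implicit
    return f"kr-{cluster_name}"
```
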

When adding a new infrastructure resource, always scope it to the `(project, cluster_name)` pair. Runtime code (`JobContext`, `container_builder`) resolves the cluster name from the `KERAS_REMOTE_CLUSTER` env var, falling back to a default.
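
A sketch of that resolution. Only the `KERAS_REMOTE_CLUSTER` variable name comes from this guide; the helper name and the fallback value shown here are assumptions:

```python
import os

def resolve_cluster_name(default: str = "keras-remote") -> str:
    """Resolve the active cluster for runtime code (JobContext, container_builder)."""
    return os.environ.get("KERAS_REMOTE_CLUSTER", default)
```
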

---

`AGENTS.md` (77 changes: 50 additions, 27 deletions)

@@ -16,9 +16,10 @@

```
keras_remote/
├── utils/           # Serialization (packager) and Cloud Storage helpers
├── cli/             # CLI for infrastructure provisioning (Pulumi-based)
│   ├── commands/    # up, down, status, config, pool (add/remove/list)
│   ├── infra/       # Pulumi programs, stack management, state module, post-deploy steps
│   └── options.py   # Shared --project/--zone/--cluster Click options (common_options decorator)
├── credentials.py   # Credential verification & auto-setup (shared by core & CLI)
└── constants.py     # Zone/region utilities, get_default_cluster_name()
```

## Execution Pipeline
@@ -38,31 +39,34 @@

## Key Modules

| Module | Responsibility |
| ---------------------------- | --------------------------------------------------------------------------------------------------------- |
| `core/core.py` | `@run()` decorator, backend routing, env var capture |
| `core/accelerators.py` | Accelerator registry (`GPUS`, `TPUS`), parser (`parse_accelerator`) |
| `credentials.py` | Credential verification & auto-setup (gcloud, ADC, kubeconfig) |
| `backend/execution.py` | `JobContext` dataclass (carries `cluster_name`), `BaseK8sBackend` base class, `execute_remote()` pipeline |
| `backend/gke_client.py` | K8s Job creation, status polling, pod log retrieval |
| `backend/pathways_client.py` | LeaderWorkerSet creation for multi-host TPUs |
| `infra/container_builder.py` | Content-hashed Docker image building via Cloud Build |
| `data/data.py` | `Data` class, content hashing, data ref serialization |
| `utils/packager.py` | `save_payload()` (cloudpickle), `zip_working_dir()`, Data ref extraction |
| `utils/storage.py` | GCS upload/download/cleanup for job artifacts and Data cache |
| `runner/remote_runner.py` | Runs inside container: resolve Data refs/volumes, execute, upload result |
| `cli/infra/state.py` | Centralized Pulumi state: `load_state()`, `apply_update()`, `apply_destroy()` |
| `cli/options.py` | Shared `common_options` Click decorator (`--project`/`--zone`/`--cluster`) |
| `cli/commands/pool.py` | Node pool add/remove/list commands |
| `cli/infra/post_deploy.py` | kubectl, LWS CRD, GPU driver setup after stack.up() |
| `cli/constants.py` | CLI defaults, paths, API list |
| `cli/main.py` | CLI entry point (`keras-remote` command) |

## Key Abstractions

- **`JobContext`** (`backend/execution.py`): Mutable dataclass carrying all job state through the pipeline — inputs, generated IDs, artifact paths, image URI, `cluster_name` (for cluster-scoped bucket/repo resolution).
- **`BaseK8sBackend`** (`backend/execution.py`): Base class with `submit_job`, `wait_for_job`, `cleanup_job`. Subclassed by `GKEBackend` and `PathwaysBackend`.
- **`GpuConfig` / `TpuConfig`** (`core/accelerators.py`): Frozen dataclasses for accelerator metadata. Single source of truth used by runtime, container builder, and CLI.
- **`Data`** (`data/data.py`): Wraps a local path or GCS URI. Passed as a function argument or via the `volumes` decorator parameter. Resolved to a plain filesystem path on the remote pod. Content-hashed for upload caching.
- **`InfraConfig` / `NodePoolConfig`** (`cli/config.py`): CLI provisioning configuration. `InfraConfig` holds project, zone, cluster name, and a list of `NodePoolConfig` entries. `NodePoolConfig` pairs a unique pool name (e.g., `gpu-l4-a3f2`) with a `GpuConfig` or `TpuConfig`.
- **`StackState`** (`cli/infra/state.py`): Dataclass bundling all state dimensions loaded from a Pulumi stack (project, zone, cluster_name, node_pools, stack handle). Returned by `load_state()` and consumed by commands.
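
To make the CLI config shapes concrete, a rough sketch follows. Field names are inferred from this document; the real definitions live in `cli/config.py` and may differ, and the `accelerator` field here is a plain-string stand-in for a `GpuConfig`/`TpuConfig` instance:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class NodePoolConfig:
    name: str         # unique pool name, e.g. "gpu-l4-a3f2"
    accelerator: str  # stands in for a GpuConfig / TpuConfig instance

@dataclass
class InfraConfig:
    project: str
    zone: str
    cluster_name: str
    node_pools: list = field(default_factory=list)
```

Keeping `NodePoolConfig` frozen mirrors the frozen accelerator dataclasses: pool definitions are values to be replaced wholesale, never mutated in place.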

## Data API

@@ -141,15 +145,34 @@ Additional CLI-only env vars:

### CLI State Management

The CLI manages three layers of state: in-memory config (`InfraConfig`), Pulumi local state files (`~/.keras-remote/pulumi/`), and GCP cloud resources. Each `(project, cluster_name)` pair gets its own Pulumi stack (stack name = `{project}-{cluster_name}`), so multiple clusters in the same GCP project are fully independent.

**Centralized state module (`cli/infra/state.py`)** — All Pulumi stack operations go through three functions:

| Function | Purpose | Used by |
| ----------------- | --------------------------------------------------------------------------------------- | ------------------------------- |
| `load_state()` | Load ALL state dimensions (prerequisites, defaults, refresh, node pools) → `StackState` | `up`, `pool`, `status` |
| `apply_update()` | Run `stack.up()` with a complete `InfraConfig` | `up`, `pool add`, `pool remove` |
| `apply_destroy()` | Run `stack.destroy()` | `down` |

**Safety invariants:**

- `stack.up()`, `stack.destroy()`, `stack.refresh()` appear **only** in `state.py`
- No command file imports `create_program` or `get_stack` directly
- No command file defines inline `--project`/`--zone`/`--cluster` options (use `common_options` from `cli/options.py`)
- When a new state dimension is added (e.g. namespaces), it is added to `StackState` and `load_state()` — every command gets it automatically

**Cluster-scoped resource naming:**

| Resource | Name pattern |
| ------------- | ------------------------------------------------- |
| Pulumi stack | `{project}-{cluster_name}` |
| Jobs bucket | `{project}-kr-{cluster_name}-jobs` |
| Builds bucket | `{project}-kr-{cluster_name}-builds` |
| AR repository | `kr-{cluster_name}` |
| GKE cluster | `{cluster_name}` |

*Note: GCP APIs are enabled project-wide, shared across clusters, and are not disabled when a cluster is destroyed (`disable_on_destroy=False`).*

Key behaviors:
