
Commit 97f01a9

Adds CLI for Infrastructure Management (#14)
* Adds CLI to orchestrate infra setup/teardown
* Fix exceptions and error messages
* Fix `down` logic and improve maintainability
* Fixes issue when APIs are not enabled by skipping cleanup steps instead of hanging
* Address reviews; move GCS handling to Pulumi
* Use the Python SDK instead of gcloud for container builds
* Update CLI to use new accelerator registry types: replace dict-based accelerator configs with typed GpuConfig/TpuConfig from the core registry. Introduces an InfraConfig dataclass and removes the now-redundant accelerator_configs.py lookup tables.
* Disable allow_create for command
* Remove dead code
* Rebasing fixes
* Adds check for kubectl and docker in the prerequisites check
* Use a released NVIDIA driver instead of master
* Adds an error helper
* Rename prerequisites to prerequisites_check
* Log error when checking for image existence
* Improve conditional logic
* Adds label-based filtering for resource cleanup
1 parent 66fe26e commit 97f01a9

29 files changed (+1209, -727 lines)

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -206,4 +206,7 @@ marimo/_static/
 marimo/_lsp/
 __marimo__/
 
-.claude/
+.claude/
+
+# Pulumi
+.pulumi/

README.md

Lines changed: 51 additions & 12 deletions
@@ -45,38 +45,74 @@ final_loss = train_model()
 
 ## Installation
 
-### From Source
+### Library Only
+
+Install the core package to use the `@keras_remote.run()` decorator in your code:
 
 ```bash
 git clone https://github.com/keras-team/keras-remote.git
 cd keras-remote
 pip install -e .
 ```
 
+This is sufficient if your infrastructure (GKE cluster, Artifact Registry, etc.) is already provisioned.
+
+### Library + CLI
+
+Install with the `cli` extra to also get the `keras-remote` command for managing infrastructure:
+
+```bash
+git clone https://github.com/keras-team/keras-remote.git
+cd keras-remote
+pip install -e ".[cli]"
+```
+
+This adds the `keras-remote up`, `keras-remote down`, `keras-remote status`, and `keras-remote config` commands for provisioning and tearing down cloud resources.
+
 ### Requirements
 
 - Python 3.11+
 - Google Cloud SDK (`gcloud`)
   - Run `gcloud auth login` and `gcloud auth application-default login`
+- [Pulumi CLI](https://www.pulumi.com/docs/install/) (required for `[cli]` install only)
 - A Google Cloud project with billing enabled
 
 ## Quick Start
 
 ### 1. Configure Google Cloud
 
-Run the automated setup script:
+Run the CLI setup command:
 
 ```bash
-./setup.sh
+keras-remote up
 ```
 
-The script will:
+This will interactively:
 
 - Prompt for your GCP project ID
+- Let you choose an accelerator type (CPU, GPU, or TPU)
 - Enable required APIs (Cloud Build, Artifact Registry, Cloud Storage, GKE)
 - Create the Artifact Registry repository
-- Configure Docker authentication
-- Verify the setup
+- Provision a GKE cluster with optional accelerator node pools
+- Configure Docker authentication and kubectl access
+
+You can also run non-interactively:
+
+```bash
+keras-remote up --project=my-project --accelerator=t4 --yes
+```
+
+To view current infrastructure state:
+
+```bash
+keras-remote status
+```
+
+To view configuration:
+
+```bash
+keras-remote config
+```
 
 ### 2. Set Environment Variables
 
@@ -124,7 +160,7 @@ def train():
 - Support for GPU accelerators (T4, L4, A100, V100, H100)
 - Lower overhead for iterative development
 
-**Setup:** Run `./setup.sh` and select GKE.
+**Setup:** Run `keras-remote up` and select a GPU accelerator.
 
 ### TPU VM
 
@@ -348,15 +384,18 @@ gcloud artifacts repositories describe keras-remote \
 Remove all Keras Remote resources to avoid charges:
 
 ```bash
-./cleanup.sh
+keras-remote down
 ```
 
 This removes:
 
-- Cloud Storage buckets
-- Artifact Registry repositories
-- GKE clusters (if created by setup)
-- TPU VMs
+- GKE cluster and accelerator node pools (via Pulumi)
+- Artifact Registry repository and container images
+- Cloud Storage buckets (jobs and builds)
+- TPU VMs and orphaned Compute Engine VMs
+
+Use `--yes` to skip the confirmation prompt, or `--pulumi-only` to only
+destroy Pulumi-managed resources.
 
 ## Contributing
 

cleanup.sh

Lines changed: 0 additions & 177 deletions
This file was deleted.

keras_remote/__init__.py

Lines changed: 7 additions & 0 deletions
@@ -1 +1,8 @@
+import os
+
+# Suppress noisy gRPC fork/logging messages before any gRPC imports
+os.environ.setdefault("GRPC_VERBOSITY", "NONE")
+os.environ.setdefault("GLOG_minloglevel", "3")
+os.environ.setdefault("GRPC_ENABLE_FORK_SUPPORT", "0")
+
 from keras_remote.core.core import run

keras_remote/backend/execution.py

Lines changed: 3 additions & 7 deletions
@@ -13,6 +13,7 @@
 
 import cloudpickle
 
+from keras_remote.constants import get_default_zone, zone_to_region
 from keras_remote.infra import container_builder
 from keras_remote.backend import gke_client
 from keras_remote.infra import infra
@@ -54,11 +55,7 @@ class JobContext:
 
     def __post_init__(self):
         self.bucket_name = f"{self.project}-keras-remote-jobs"
-        self.region = (
-            self.zone.rsplit("-", 1)[0]
-            if self.zone and "-" in self.zone
-            else "us-central1"
-        )
+        self.region = zone_to_region(self.zone)
         self.display_name = f"keras-remote-{self.func.__name__}-{self.job_id}"
 
     @classmethod
@@ -75,7 +72,7 @@ def from_params(
     ) -> "JobContext":
         """Factory method with default resolution for zone/project."""
         if not zone:
-            zone = os.environ.get("KERAS_REMOTE_ZONE", "us-central1-a")
+            zone = get_default_zone()
         if not project:
             project = os.environ.get("KERAS_REMOTE_PROJECT")
         if not project:
@@ -212,7 +209,6 @@ def _upload_artifacts(ctx: JobContext) -> None:
         job_id=ctx.job_id,
         payload_path=ctx.payload_path,
         context_path=ctx.context_path,
-        location=ctx.region,
         project=ctx.project,
     )
 

keras_remote/backend/gke_client.py

Lines changed: 2 additions & 3 deletions
@@ -6,7 +6,6 @@
 from kubernetes import client, config
 from kubernetes.client.rest import ApiException
 
-from keras_remote.core.accelerators import GpuConfig
 from keras_remote.core.accelerators import TpuConfig
 from keras_remote.core import accelerators
 from keras_remote.infra import infra
@@ -172,7 +171,7 @@ def cleanup_job(job_name, namespace="default"):
 
 def _parse_accelerator(accelerator):
     """Convert accelerator string to GKE pod spec fields."""
-    parsed = parse_accelerator(accelerator)
+    parsed = accelerators.parse_accelerator(accelerator)
 
     if parsed is None:
         return {
@@ -334,7 +333,7 @@ def _print_pod_logs(core_v1, job_name, namespace):
     pods = core_v1.list_namespaced_pod(
         namespace, label_selector=f"job-name={job_name}"
    )
-
+
 for pod in pods.items:
     with suppress(ApiException):
         logs = core_v1.read_namespaced_pod_log(
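The `_parse_accelerator` fix above shows a dispatch where `None` means a CPU-only pod spec. A hedged sketch of that pattern — the registry contents, function names, and return shape here are assumptions, not keras-remote's actual API:

```python
# Hypothetical string-to-accelerator dispatch mirroring the None-means-CPU
# branch visible in the diff; everything beyond that branch is assumed.
GPU_TYPES = {"t4": "nvidia-tesla-t4", "l4": "nvidia-l4", "a100": "nvidia-tesla-a100"}

def parse_accelerator(accelerator):
    """Return a GKE accelerator type string, or None for CPU-only jobs."""
    if not accelerator or accelerator.lower() == "cpu":
        return None
    key = accelerator.lower()
    if key in GPU_TYPES:
        return GPU_TYPES[key]
    raise ValueError(f"unknown accelerator: {accelerator!r}")

def pod_resources(accelerator):
    """Build the accelerator-related pieces of a pod spec."""
    parsed = parse_accelerator(accelerator)
    if parsed is None:
        return {}  # CPU-only: no GPU limits or node selector needed
    return {
        "limits": {"nvidia.com/gpu": 1},
        # cloud.google.com/gke-accelerator is the GKE node label for GPU type
        "node_selector": {"cloud.google.com/gke-accelerator": parsed},
    }
```

Raising on unknown names (rather than silently falling back to CPU) matches the commit's emphasis on clearer exceptions and error messages.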

keras_remote/cli/__init__.py

Whitespace-only changes.

keras_remote/cli/commands/__init__.py

Whitespace-only changes.
