Commit 00a8e2d

Updates README to ensure complete documentation of features (#85)
1 parent 8f3495a commit 00a8e2d

2 files changed, +68 -29 lines changed

README.md

Lines changed: 67 additions & 28 deletions
@@ -8,14 +8,14 @@ Run Keras and JAX workloads on cloud TPUs and GPUs with a simple decorator. No i
 ```python
 import keras_remote

-@keras_remote.run(accelerator="v3-8")
+@keras_remote.run(accelerator="v6e-8")
 def train_model():
     import keras
     model = keras.Sequential([...])
     model.fit(x_train, y_train)
     return model.history.history["loss"][-1]

-# Executes on TPU v3-8, returns the result
+# Executes on TPU v6e-8, returns the result
 final_loss = train_model()
 ```

@@ -25,10 +25,12 @@ final_loss = train_model()
 - [Installation](#installation)
 - [Quick Start](#quick-start)
 - [Usage Examples](#usage-examples)
+- [Handling Data](#handling-data)
 - [Configuration](#configuration)
 - [Supported Accelerators](#supported-accelerators)
 - [Monitoring](#monitoring)
 - [Troubleshooting](#troubleshooting)
+- [Resource Cleanup](#resource-cleanup)
 - [Contributing](#contributing)
 - [License](#license)

@@ -40,6 +42,7 @@ final_loss = train_model()
 - **Container caching** — Subsequent runs start in 2-4 minutes after initial build
 - **Built-in monitoring** — View job status and logs in Google Cloud Console
 - **Automatic cleanup** — Resources are released when jobs complete
+- **Transparent errors** — Remote exceptions are re-raised locally with the original traceback

 ## Installation

@@ -65,7 +68,7 @@ cd keras-remote
 pip install -e ".[cli]"
 ```

-This adds the `keras-remote up`, `keras-remote down`, `keras-remote status`, and `keras-remote config` commands for provisioning and tearing down cloud resources.
+This adds the `keras-remote up`, `keras-remote down`, `keras-remote status`, `keras-remote config`, and `keras-remote pool` commands for provisioning and managing cloud resources.

 ### Requirements

@@ -113,6 +116,19 @@ To view configuration:
 keras-remote config
 ```

+To manage accelerator node pools after initial setup:
+
+```bash
+# Add a node pool for a specific accelerator
+keras-remote pool add --accelerator=v6e-8
+
+# List current node pools
+keras-remote pool list
+
+# Remove a node pool by name
+keras-remote pool remove <pool-name>
+```
+
 ### 2. Set Environment Variables

 Add to your shell profile (`~/.bashrc`, `~/.zshrc`, etc.):
@@ -127,7 +143,7 @@ export KERAS_REMOTE_ZONE="us-central1-a" # Optional
 ```python
 import keras_remote

-@keras_remote.run(accelerator="v3-8")
+@keras_remote.run(accelerator="v6e-8")
 def hello_tpu():
     import jax
     return f"Running on {jax.devices()}"
@@ -143,7 +159,7 @@ print(result)
 ```python
 import keras_remote

-@keras_remote.run(accelerator="v3-8")
+@keras_remote.run(accelerator="v6e-8")
 def compute(x, y):
     return x + y

@@ -156,7 +172,7 @@ print(f"Result: {result}") # Output: Result: 12
 ```python
 import keras_remote

-@keras_remote.run(accelerator="v3-8")
+@keras_remote.run(accelerator="v6e-8")
 def train_model():
     import keras
     import numpy as np
@@ -189,20 +205,22 @@ scikit-learn

 Keras Remote automatically detects and installs dependencies on the remote worker.

+> **Note:** JAX packages (`jax`, `jaxlib`, `libtpu`, `libtpu-nightly`) are automatically filtered from your `requirements.txt` to prevent overriding the accelerator-specific JAX installation. To keep a JAX line, append `# kr:keep` to it.
+
 ### Prebuilt Container Images

 Skip container build time by using prebuilt images:

 ```python
 @keras_remote.run(
-    accelerator="v3-8",
+    accelerator="v6e-8",
     container_image="us-docker.pkg.dev/my-project/keras-remote/prebuilt:v1.0"
 )
 def train():
     ...
 ```

-See [examples/Dockerfile.prebuilt](examples/Dockerfile.prebuilt) for a template.
+Build your own prebuilt image using the project's Dockerfile template as a starting point.
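
The note added in this hunk says JAX packages are filtered out of `requirements.txt` unless a line carries `# kr:keep`. The sketch below illustrates that rule only; it is not the project's implementation. The function name `filter_requirements` and the name-parsing regex are assumptions made for illustration.

```python
import re

# Package names the README note lists as filtered.
JAX_PACKAGES = {"jax", "jaxlib", "libtpu", "libtpu-nightly"}
# Marker the README note says opts a line out of filtering.
KEEP_MARKER = "# kr:keep"


def filter_requirements(text: str) -> str:
    """Hypothetical helper: drop JAX requirement lines unless marked to keep."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Marked lines, blank lines, and pure comments pass through unchanged.
        if KEEP_MARKER in line or not stripped or stripped.startswith("#"):
            kept.append(line)
            continue
        # The distribution name is everything before extras, specifiers, or markers.
        name = re.split(r"[\[<>=!~;@\s]", stripped, maxsplit=1)[0]
        if name.lower().replace("_", "-") not in JAX_PACKAGES:
            kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "jax==0.4.30\nnumpy>=1.26\njax[tpu]  # kr:keep\nlibtpu-nightly"
    print(filter_requirements(sample))
```

Under these assumptions, a plain `jax` or `libtpu-nightly` line is dropped, while `numpy` and any line ending in `# kr:keep` survive.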

 ## Handling Data

@@ -295,22 +313,26 @@ train("gs://my-bucket/arrayrecords/")

 ### Environment Variables

-| Variable | Required | Default | Description |
-| ---------------------- | -------- | --------------- | ----------------------- |
-| `KERAS_REMOTE_PROJECT` | Yes | | Google Cloud project ID |
-| `KERAS_REMOTE_ZONE` | No | `us-central1-a` | Default compute zone |
-| `KERAS_REMOTE_CLUSTER` | No | | GKE cluster name |
+| Variable | Required | Default | Description |
+| ---------------------------- | -------- | ---------------------- | ----------------------- |
+| `KERAS_REMOTE_PROJECT` | Yes | | Google Cloud project ID |
+| `KERAS_REMOTE_ZONE` | No | `us-central1-a` | Default compute zone |
+| `KERAS_REMOTE_CLUSTER` | No | `keras-remote-cluster` | GKE cluster name |
+| `KERAS_REMOTE_GKE_NAMESPACE` | No | `default` | Kubernetes namespace |
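
To show how the defaults in the updated table combine with the environment, here is a minimal sketch. The variable names and default values come from the table; the helper `resolve_settings` and its error message are hypothetical and not part of `keras_remote`.

```python
import os

# Defaults from the updated table; KERAS_REMOTE_PROJECT has no default and is required.
DEFAULTS = {
    "KERAS_REMOTE_ZONE": "us-central1-a",
    "KERAS_REMOTE_CLUSTER": "keras-remote-cluster",
    "KERAS_REMOTE_GKE_NAMESPACE": "default",
}


def resolve_settings(environ=None):
    """Hypothetical helper: merge environment variables with the documented defaults."""
    environ = os.environ if environ is None else environ
    project = environ.get("KERAS_REMOTE_PROJECT")
    if not project:
        raise ValueError("KERAS_REMOTE_PROJECT must be set")
    settings = {"KERAS_REMOTE_PROJECT": project}
    for name, default in DEFAULTS.items():
        settings[name] = environ.get(name, default)
    return settings


if __name__ == "__main__":
    print(resolve_settings({"KERAS_REMOTE_PROJECT": "my-project"}))
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to try without touching the real shell environment.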

 ### Decorator Parameters

 ```python
 @keras_remote.run(
-    accelerator="v3-8", # Required: TPU/GPU type
+    accelerator="v6e-8", # TPU/GPU type (default: "v6e-8")
     container_image=None, # Custom container URI
     zone=None, # Override default zone
     project=None, # Override default project
+    capture_env_vars=None, # Env var names/patterns to forward (supports * wildcard)
     cluster=None, # GKE cluster name
-    namespace="default" # Kubernetes namespace
+    backend=None, # "gke", "pathways", or None (auto-detect)
+    namespace="default", # Kubernetes namespace
+    volumes=None, # Dict mapping absolute paths to Data objects
 )
 ```

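The new `capture_env_vars` parameter is documented as accepting names or patterns with a `*` wildcard. One plausible way to do such matching is sketched below with the standard-library `fnmatch` module. The helper `select_env_vars` is hypothetical, and `fnmatch` also accepts `?` and `[...]` patterns, which the README does not mention, so the real matching rules may differ.

```python
import fnmatch


def select_env_vars(patterns, environ):
    """Hypothetical helper: pick env vars whose names match any pattern."""
    selected = {}
    for name, value in environ.items():
        # fnmatchcase keeps the comparison case-sensitive on every platform.
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns):
            selected[name] = value
    return selected


if __name__ == "__main__":
    env = {"WANDB_API_KEY": "k", "WANDB_PROJECT": "p", "HOME": "/home/me"}
    print(select_env_vars(["WANDB_*"], env))
```
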
@@ -323,24 +345,39 @@ Note: each accelerator and topology requires

 | Type | Configurations |
 | -------------- | ------------------------------------------- |
-| TPU v2 | `v2-8`, `v2-32` |
-| TPU v3 | `v3-8`, `v3-32` |
+| TPU v2 | `v2-4`, `v2-16`, `v2-32` |
+| TPU v3 | `v3-4`, `v3-16`, `v3-32` |
 | TPU v5 Litepod | `v5litepod-1`, `v5litepod-4`, `v5litepod-8` |
 | TPU v5p | `v5p-8`, `v5p-16` |
 | TPU v6e | `v6e-8`, `v6e-16` |

 ### GPUs

-| Type | Aliases |
-| ----------- | --------------------------- |
-| NVIDIA T4 | `t4`, `nvidia-tesla-t4` |
-| NVIDIA L4 | `l4`, `nvidia-l4` |
-| NVIDIA V100 | `v100`, `nvidia-tesla-v100` |
-| NVIDIA A100 | `a100`, `nvidia-tesla-a100` |
-| NVIDIA H100 | `h100`, `nvidia-h100-80gb` |
+| Type | Aliases | Multi-GPU Counts |
+| ---------------- | ------------------------------- | ---------------- |
+| NVIDIA T4 | `t4`, `nvidia-tesla-t4` | 1, 2, 4 |
+| NVIDIA L4 | `l4`, `nvidia-l4` | 1, 2, 4 |
+| NVIDIA V100 | `v100`, `nvidia-tesla-v100` | 1, 2, 4, 8 |
+| NVIDIA A100 | `a100`, `nvidia-tesla-a100` | 1, 2, 4, 8 |
+| NVIDIA A100 80GB | `a100-80gb`, `nvidia-a100-80gb` | 1, 2, 4, 8 |
+| NVIDIA H100 | `h100`, `nvidia-h100-80gb` | 1, 2, 4, 8 |

 For multi-GPU configurations on GKE, append the count: `a100x4`, `l4x2`, etc.
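
The `<alias>x<count>` convention and the new "Multi-GPU Counts" column can be illustrated with a small parser. This is a sketch under assumptions: `parse_gpu_spec` is a hypothetical name, only the short aliases are covered, and the library's own parsing and validation may work differently.

```python
import re

# Counts from the "Multi-GPU Counts" column, keyed by short alias only.
GPU_COUNTS = {
    "t4": (1, 2, 4),
    "l4": (1, 2, 4),
    "v100": (1, 2, 4, 8),
    "a100": (1, 2, 4, 8),
    "a100-80gb": (1, 2, 4, 8),
    "h100": (1, 2, 4, 8),
}


def parse_gpu_spec(spec):
    """Hypothetical helper: split 'a100x4' into ('a100', 4); a bare alias means one GPU."""
    # Lazy name match so a trailing 'x<digits>' is read as the count suffix.
    match = re.fullmatch(r"(?P<name>[a-z0-9-]+?)(?:x(?P<count>\d+))?", spec.lower())
    if match is None or match["name"] not in GPU_COUNTS:
        raise ValueError(f"unknown GPU type: {spec!r}")
    count = int(match["count"] or 1)
    if count not in GPU_COUNTS[match["name"]]:
        raise ValueError(f"unsupported GPU count {count} for {match['name']}")
    return match["name"], count


if __name__ == "__main__":
    print(parse_gpu_spec("a100x4"))
    print(parse_gpu_spec("l4"))
```

Checking the count against the table up front gives a clear local error for a request such as `t4x8` before any remote work starts.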

+### CPU
+
+Use `accelerator="cpu"` to run on a CPU-only node (no accelerator attached).
+
+### Multi-Host TPU (Pathways)
+
+Multi-host TPU configurations (those requiring more than one node, such as `v2-16`, `v3-32`, or `v5p-16`) automatically use the [Pathways](https://cloud.google.com/tpu/docs/pathways-overview) backend. You can also set the backend explicitly:
+
+```python
+@keras_remote.run(accelerator="v3-32", backend="pathways")
+def distributed_train():
+    ...
+```
+
 ## Monitoring

 ### Google Cloud Console
@@ -370,9 +407,10 @@ export KERAS_REMOTE_PROJECT="your-project-id"
 Enable required APIs and create the Artifact Registry repository:

 ```bash
-gcloud services enable cloudbuild.googleapis.com \
-  artifactregistry.googleapis.com storage.googleapis.com \
-  container.googleapis.com --project=$KERAS_REMOTE_PROJECT
+gcloud services enable compute.googleapis.com \
+  cloudbuild.googleapis.com artifactregistry.googleapis.com \
+  storage.googleapis.com container.googleapis.com \
+  --project=$KERAS_REMOTE_PROJECT

 gcloud artifacts repositories create keras-remote \
   --repository-format=docker \
@@ -436,7 +474,8 @@ This removes:
 - GKE cluster and accelerator node pools
 - Artifact Registry repository and container images
 - Cloud Storage buckets (jobs and builds)
-Use `--yes` to skip the confirmation prompt.
+
+Use `--yes` to skip the confirmation prompt.

 ## Contributing

keras_remote/core/core.py

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@


 def run(
-    accelerator="v3-8",
+    accelerator="v6e-8",
     container_image=None,
     zone=None,
     project=None,
