Commit 44757a6

Browse files
Updates AGENTS.md with Data API
1 parent a41579a commit 44757a6

File tree: 1 file changed, +51 -4 lines

AGENTS.md

Lines changed: 51 additions & 4 deletions
@@ -10,6 +10,7 @@ Keras Remote lets users execute Keras/JAX workloads on cloud TPUs and GPUs via a
 keras_remote/
 ├── core/     # @run decorator, accelerator registry & parser
 ├── backend/  # Job execution backends (GKE, Pathways)
+├── data/     # Data class for declaring data dependencies
 ├── infra/    # Docker container building & caching
 ├── runner/   # Remote worker entrypoint (runs inside container)
 ├── utils/    # Serialization (packager) and Cloud Storage helpers
@@ -26,7 +27,7 @@ keras_remote/
 @keras_remote.run() called
 → JobContext.from_params()   # Resolve config from args/env vars
 → ensure_credentials()       # Verify/auto-configure gcloud, ADC, kubeconfig
-→ _prepare_artifacts()       # Serialize function (cloudpickle), zip working dir
+→ _prepare_artifacts()       # Upload Data, serialize function, zip working dir
 → _build_container()         # Build or retrieve cached Docker image
 → _upload_artifacts()        # Upload payload.pkl, context.zip to GCS
 → backend.submit_job()       # Create K8s Job (GKE) or LeaderWorkerSet (Pathways)
@@ -46,9 +47,10 @@ keras_remote/
 | `backend/gke_client.py` | K8s Job creation, status polling, pod log retrieval |
 | `backend/pathways_client.py` | LeaderWorkerSet creation for multi-host TPUs |
 | `infra/container_builder.py` | Content-hashed Docker image building via Cloud Build |
-| `utils/packager.py` | `save_payload()` (cloudpickle), `zip_working_dir()` |
-| `utils/storage.py` | GCS upload/download/cleanup for job artifacts |
-| `runner/remote_runner.py` | Runs inside container: deserialize, execute, upload result |
+| `data/data.py` | `Data` class, content hashing, data ref serialization |
+| `utils/packager.py` | `save_payload()` (cloudpickle), `zip_working_dir()`, Data ref extraction |
+| `utils/storage.py` | GCS upload/download/cleanup for job artifacts and Data cache |
+| `runner/remote_runner.py` | Runs inside container: resolve Data refs/volumes, execute, upload result |
 | `cli/commands/pool.py` | Node pool add/remove/list commands |
 | `cli/infra/post_deploy.py` | kubectl, LWS CRD, GPU driver setup after stack.up() |
 | `cli/constants.py` | CLI defaults, paths, API list |
@@ -59,8 +61,53 @@ keras_remote/
 - **`JobContext`** (`backend/execution.py`): Mutable dataclass carrying all job state through the pipeline — inputs, generated IDs, artifact paths, image URI.
 - **`BaseK8sBackend`** (`backend/execution.py`): Base class with `submit_job`, `wait_for_job`, `cleanup_job`. Subclassed by `GKEBackend` and `PathwaysBackend`.
 - **`GpuConfig` / `TpuConfig`** (`core/accelerators.py`): Frozen dataclasses for accelerator metadata. Single source of truth used by runtime, container builder, and CLI.
+- **`Data`** (`data/data.py`): Wraps a local path or GCS URI. Passed as a function argument or via the `volumes` decorator parameter. Resolved to a plain filesystem path on the remote pod. Content-hashed for upload caching.
 - **`InfraConfig` / `NodePoolConfig`** (`cli/config.py`): CLI provisioning configuration. `InfraConfig` holds project, zone, cluster name, and a list of `NodePoolConfig` entries. `NodePoolConfig` pairs a unique pool name (e.g., `gpu-l4-a3f2`) with a `GpuConfig` or `TpuConfig`.
 
+
+## Data API
+
+The `Data` class (`keras_remote.Data`) declares data dependencies for remote functions. It accepts local file/directory paths or GCS URIs (`gs://...`).
+
+### Two usage patterns
+
+**Function arguments:** `Data` objects passed as args/kwargs are uploaded to GCS, serialized as data ref dicts in the payload, and resolved to local paths on the pod:
+
+```python
+@keras_remote.run(accelerator="v3-8")
+def train(data_dir, config_path):
+    ...  # data_dir and config_path are plain strings
+
+train(Data("./dataset/"), Data("./config.json"))
+```
+
+**Volumes:** `Data` objects in the `volumes=` decorator parameter are downloaded to fixed mount paths before execution:
+
+```python
+@keras_remote.run(accelerator="v3-8", volumes={"/data": Data("./dataset/")})
+def train():
+    files = os.listdir("/data")  # available at mount path
+```
+
+Both patterns can be combined. `Data` objects can also be nested inside lists, dicts, and other containers — they are recursively discovered and resolved.
+
+### Content-addressed caching
+
+Local `Data` objects are content-hashed (SHA-256 over sorted file contents). Uploads go to `gs://{bucket}/{namespace}/data-cache/{hash}/`. A `.cache_marker` sentinel enables O(1) cache-hit checks. Identical data is uploaded only once.
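The hashing scheme can be sketched roughly as follows. The helper names are assumptions, not the library's API; only the hash-then-address flow comes from the text above:

```python
import hashlib
from pathlib import Path

def content_hash(root: str) -> str:
    """SHA-256 over file contents, walked in sorted order so the digest is
    deterministic for identical content (a sketch, not the real code)."""
    h = hashlib.sha256()
    root_path = Path(root)
    if root_path.is_dir():
        for f in sorted(p for p in root_path.rglob("*") if p.is_file()):
            h.update(f.relative_to(root_path).as_posix().encode())  # stable relative name
            h.update(f.read_bytes())
    else:
        h.update(root_path.read_bytes())
    return h.hexdigest()

def cache_uri(bucket: str, namespace: str, digest: str) -> str:
    # Matches the documented layout: gs://{bucket}/{namespace}/data-cache/{hash}/
    return f"gs://{bucket}/{namespace}/data-cache/{digest}/"
```

Hashing relative names alongside contents means a rename changes the digest, while two directories with identical trees hash the same, which is what makes the single-upload guarantee work.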
+
+### Pipeline integration
+
+During `_prepare_artifacts()`:
+
+1. Upload `Data` from `volumes` and function args via `storage.upload_data()` (content-addressed)
+2. Replace `Data` objects in args/kwargs with serializable `__data_ref__` dicts
+3. Local `Data` paths inside the caller directory are auto-excluded from `context.zip`
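The substitution step can be sketched as a recursive rewrite. Only the `__data_ref__` key comes from the payload format above; the `gcs_uri` field, the `uploaded` mapping, and the stand-in `Data` class are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Data:
    """Stand-in for keras_remote.Data, for illustration only."""
    path: str

def to_data_refs(obj, uploaded):
    """Replace Data objects anywhere in args/kwargs with serializable dicts.

    `uploaded` maps a Data's local path to the GCS URI produced by the
    upload step; ref fields beyond `__data_ref__` are assumed here."""
    if isinstance(obj, Data):
        return {"__data_ref__": True, "gcs_uri": uploaded[obj.path]}
    if isinstance(obj, dict):
        return {k: to_data_refs(v, uploaded) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_data_refs(v, uploaded) for v in obj)
    return obj  # everything else passes through to cloudpickle unchanged

uploaded = {"./dataset/": "gs://bkt/ns/data-cache/abc123/"}
refs = to_data_refs({"data": Data("./dataset/"), "epochs": 3}, uploaded)
```

After this pass the payload contains only plain serializable values, so cloudpickle never has to handle `Data` itself.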
+
+On the remote pod (`remote_runner.py`):
+
+1. `resolve_volumes()` — download volume data to mount paths
+2. `resolve_data_refs()` — recursively resolve `__data_ref__` dicts in args/kwargs to local paths
+3. Single-file `Data` resolves to the file path; directory `Data` resolves to the directory path
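The pod-side walk mirrors the substitution done in `_prepare_artifacts()`. A minimal sketch, where the `gcs_uri` field and the `download` callback are illustrative assumptions rather than the library's real signatures:

```python
def resolve_data_refs(obj, download):
    """Recursively turn __data_ref__ dicts back into local filesystem paths.

    `download` fetches a GCS URI to local disk and returns the path: a file
    ref resolves to the file path, a directory ref to the directory path."""
    if isinstance(obj, dict):
        if obj.get("__data_ref__"):
            return download(obj["gcs_uri"])  # plain string path, as the function expects
        return {k: resolve_data_refs(v, download) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(resolve_data_refs(v, download) for v in obj)
    return obj

# Fake downloader for illustration: map each URI under a local cache dir.
fake_download = lambda uri: "/tmp/data-cache/" + uri.rstrip("/").rsplit("/", 1)[-1]
args = resolve_data_refs(
    [{"__data_ref__": True, "gcs_uri": "gs://bkt/ns/data-cache/abc123/"}, "lr=0.1"],
    fake_download,
)
```

Because the rewrite preserves container shape, the user function receives exactly the structure it was called with, just with paths in place of `Data`.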
+
 ## Conventions
 
 ### Code Style
