Commit 09dc8aa

Merge remote-tracking branch 'origin/main' into weaver/marin-users-directory-for-output
# Conflicts:
#   infra/probes/deploy/deploy.py
2 parents 0843200 + 5c17dee commit 09dc8aa

65 files changed

Lines changed: 4028 additions & 602 deletions

.agents/skills/profile-training/SKILL.md

Lines changed: 31 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -61,13 +61,44 @@ uv run ... \
6161
Keep the profiler window short when enabling HLO protobuf collection — it
6262
enlarges artifacts and can increase profile upload/finalization time.
6363

64+
Known-good TensorBoard scope recipe from CoreWeave Grug MoE profiling:
65+
`trainer.profiler.enabled=true`, `trainer.profiler.start_step=3`,
66+
`trainer.profiler.num_steps=2`, `trainer.profiler.perfetto_link=false`,
67+
`trainer.profiler.profile_options.host_tracer_level=1`,
68+
`trainer.profiler.profile_options.python_tracer_level=0`, and
69+
`trainer.profiler.profile_options.enable_hlo_proto=true` preserved useful
70+
`jax.named_scope` / `named_call` regions in TensorBoard for
71+
`GM2560-MAY-120S4096-W2048-B8-R1-E8M1-FA4PROFILE-S3B-N1-cw-20260617-2353`.
72+
Leave `device_tracer_level` unset unless device timelines are specifically
73+
needed; this profile still had useful hierarchical host/XLA metadata.
74+
75+
On GPU, command buffers can collapse or suppress the visible name stack in
76+
TensorBoard/Perfetto. For profile-readability runs, disable command buffers:
77+
78+
```bash
79+
export XLA_FLAGS="${XLA_FLAGS:-} --xla_gpu_enable_command_buffer=''"
80+
```
81+
82+
This hurts performance, so use it only when the goal is semantic trace
83+
attribution; leave it out of throughput comparisons unless command-buffer
84+
behavior is the axis being tested.
85+
86+
For GPU throughput runs, keep profile-readability flags separate from XLA code
87+
generation and scheduling flags. Start from JAX's GPU performance guide,
88+
especially the code generation flags section:
89+
<https://docs.jax.dev/en/latest/gpu_performance_tips.html#code-generation-flags>.
90+
The exact set of useful XLA flags is `jaxlib`-version dependent, so record the
91+
full `XLA_FLAGS` value with each profile or W&B run.
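
A minimal sketch of capturing the flags for run metadata, using only the standard library. The helper name `xla_flags_record` and the record keys are hypothetical; how the record is attached to the profile directory or the W&B run config is left to the caller.

```python
import os


def xla_flags_record(environ=None):
    """Return the raw XLA_FLAGS string and its whitespace-split flags.

    `environ` defaults to os.environ; pass a dict to test without
    touching the real environment.
    """
    env = os.environ if environ is None else environ
    raw = env.get("XLA_FLAGS", "")
    # Keep the raw string verbatim so the run can be reproduced exactly;
    # the split list is only a convenience for diffing two runs.
    return {"xla_flags": raw, "xla_flags_list": raw.split()}
```

Storing the raw string as well as the split list keeps quoting such as `=''` intact, which matters when comparing a profile-readability run against a throughput run.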
92+
6493
For better profile readability, use `haliax.jax_utils.named_call` and
6594
`jax.named_scope` liberally in model code; these names flow into trace
6695
annotations and make region-level summaries far more actionable.
6796

6897
Reference:
6998
- `lib/levanter/docs/Performance-Guide.md`
7099
- `.agents/skills/add-pallas-kernel/`
100+
- JAX GPU performance tips:
101+
<https://docs.jax.dev/en/latest/gpu_performance_tips.html>
71102

72103
## Ingest to Structured Summary
73104
Pick a download location for pulled profile artifacts: `/tmp` for
Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
1+
---
2+
name: reserve-gpu
3+
description: Reserve an Iris-backed CoreWeave H100 pod for fast debugging with dev_gpu.py.
4+
---
5+
6+
# Skill: Dev GPU
7+
8+
Use this skill for the standard fast H100 debugging loop without wiring a full training job each time. It is the GPU counterpart to `reserve-tpu`.
9+
10+
`scripts/iris/dev_gpu.py` reserves a CoreWeave H100 pod through Iris, waits for the backing Kubernetes pod to come up, and `kubectl exec -it`s you into it. Marin's H100s are CoreWeave Kubernetes pods, not GCE VMs, so access is `kubectl`, not SSH — there is no `ssh`/`scp` transport and no `~/.ssh/config` alias.
11+
12+
This is a lean tool: `allocate`, `connect`, `status`, `release`. It does not sync files or run remote env setup (no `execute`/`watch`/`setup_env`). The CoreWeave task image is self-contained; the loop is "reserve a node, shell in." Do any file sync or environment setup yourself once connected.
13+
14+
## Cost rule
15+
16+
A holder pod sits on an expensive 8×H100 node for the session's lifetime. Release as soon as you are done — `Ctrl-C` the `allocate` terminal, or run `release` from another shell.
17+
18+
## Commands
19+
20+
- `allocate`: submit a holder job, resolve the assigned pod, persist session state, block until release
21+
- `status`: show the active local session metadata
22+
- `connect`: open an interactive shell (`kubectl exec -it … -- bash -l`) into the reserved pod
23+
- `release`: terminate the holder job and remove the local session file
24+
25+
## Prerequisites
26+
27+
1. Place the cluster kubeconfig at the path the config expects. The tool passes `--kubeconfig <platform.coreweave.kubeconfig_path>` to `kubectl` verbatim and fails fast if the file is absent. For the production H100 fleet (`cw-us-east-02a`, the `marin-gpu` cluster) that path is `~/.kube/coreweave-iris-gpu`, per `lib/iris/docs/coreweave.md`.
28+
29+
2. Ensure the Iris controller is running for the cluster. On the shared CoreWeave cluster this is usually already true; only start it yourself for a fresh cluster.
30+
31+
3. Use a cluster config whose platform is CoreWeave/Kubernetes. The tool gates on this and rejects GCP/TPU configs with a pointer back to `dev_tpu.py`.
32+
33+
## Command pattern
34+
35+
All invocations share this shape; only the subcommand and its flags change:
36+
37+
```bash
38+
uv run scripts/iris/dev_gpu.py \
39+
--config lib/iris/config/cw-us-east-02a.yaml \
40+
--name "$USER-gpu" \
41+
<subcommand> [flags]
42+
```
43+
44+
Subcommands and distinctive flags:
45+
46+
- `allocate` — reserves a whole `h100-8x` node (`--gpu-count` defaults to `8`) and holds it until `Ctrl-C`. Add `--timeout` (default `900`) to bound the wait for the task to reach `RUNNING`, and `--pod-timeout` (default `120`) to bound the wait for the backing pod. Only `--gpu-count 8` is validated; a sub-node value schedules as a fractional share (`nvidia-smi -L` then shows fewer GPUs) but fragments the 8-GPU InfiniBand gang pool, so prefer the whole node.
47+
- `status` — show the active session (job id, config, GPU count, resolved pod).
48+
- `connect` — interactive shell into the pod. It first checks job liveness with the controller (failing fast if the job is gone), then `kubectl exec -it`s into container `task`.
49+
- `release` — terminate the holder job and clear the session file. Pass `--force` to drop local state even when the terminate call fails (then confirm the job is gone with `iris job list`).
50+
51+
## GPU JAX inside the pod
52+
53+
The `iris-task` image ships a CPU-only `uv` environment at `/app`, so bare `python` has no JAX and `uv run python` falls back to a CPU device. To get GPU JAX (`jax[cuda13]`):
54+
55+
```bash
56+
cd /app && uv sync --all-packages --extra=gpu
57+
```
58+
59+
`--all-packages` is required: the `gpu` extra is defined on the sub-packages (`marin-levanter` / `marin-core`), not the root project. This is the GPU analog of `dev_tpu.py`'s `--extra=tpu`. Verify the hardware with `nvidia-smi -L` (expect 8×H100 80GB on a whole node).
60+
61+
## Observability
62+
63+
Use normal Iris tooling to inspect the backing cluster and holder job:
64+
65+
```bash
66+
uv run iris --config=lib/iris/config/cw-us-east-02a.yaml job list --prefix /$USER/dev-gpu
67+
uv run iris --config=lib/iris/config/cw-us-east-02a.yaml job logs /$USER/dev-gpu-<name>
68+
```
69+
70+
Inspect the pod directly with the same kubeconfig the tool uses:
71+
72+
```bash
73+
kubectl --kubeconfig ~/.kube/coreweave-iris-gpu --namespace iris get pods -l iris.task_id=<sanitized-task-id>
74+
```
75+
76+
## Session behavior
77+
78+
- Local session state lives under `~/.cache/marin/dev_gpu_iris/`.
79+
- If the `allocate` terminal dies unexpectedly, run `release` to terminate the holder job and clear the stale state file.
80+
- A failed `allocate` cleans up after itself: the holder job is terminated and the local state file is removed only once the job is confirmed gone, so a failed terminate never orphans an expensive pod with no local record of its job id.
81+
- `connect` execs into the pod resolved at allocation time. If Iris rescheduled the task onto a new pod while the job stayed active, `connect` fails — re-allocate.
82+
83+
## Agent Usage
84+
85+
Always pass `--name` to avoid collisions with other agents:
86+
87+
```bash
88+
export GPU_NAME="${USER}-$(git rev-parse --abbrev-ref HEAD | tr '/' '-')"
89+
uv run scripts/iris/dev_gpu.py --config lib/iris/config/cw-us-east-02a.yaml --name "$GPU_NAME" allocate
90+
```
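
For agents that build the command from Python instead of a shell, the same naming rule can be sketched as below. The function name `gpu_session_name` is hypothetical; it mirrors only the `tr '/' '-'` step of the shell snippet and takes the branch as an argument instead of calling `git` itself.

```python
def gpu_session_name(user: str, branch: str) -> str:
    """Build "<user>-<branch>" with every '/' in the branch replaced by '-'."""
    return f"{user}-{branch.replace('/', '-')}"
```

Note that `git rev-parse --abbrev-ref HEAD` prints `HEAD` on a detached checkout, so two agents on detached checkouts under the same user would get the same name; pass an explicit branch or another unique suffix in that case.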
91+
92+
## Cleanup
93+
94+
Normal cleanup is `Ctrl-C` in the `allocate` terminal. To clean up from another shell, run the `release` subcommand (add `--force` only if the job is already dead and `release` keeps erroring).

.github/workflows/iris-smoke-coreweave.yaml

Lines changed: 2 additions & 2 deletions
@@ -109,7 +109,7 @@ jobs:
109109
JAX_TRACEBACK_FILTERING: off
110110
# When set, the marin-on-iris test uploads fixtures and writes
111111
# intermediate data to S3 (R2) so remote Zephyr pods can access them.
112-
MARIN_CI_S3_PREFIX: s3://marin-na/temp/ci
112+
MARIN_CI_S3_PREFIX: s3://marin-na/tmp/ttl=3d/ci
113113
AWS_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
114114
AWS_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
115115
AWS_ENDPOINT_URL: https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com
@@ -127,7 +127,7 @@ jobs:
127127
WANDB_MODE: disabled
128128
WANDB_API_KEY: ""
129129
JAX_TRACEBACK_FILTERING: off
130-
MARIN_CI_S3_PREFIX: s3://marin-na/temp/ci
130+
MARIN_CI_S3_PREFIX: s3://marin-na/tmp/ttl=3d/ci
131131
AWS_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
132132
AWS_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
133133
AWS_ENDPOINT_URL: https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com

infra/probes/README.md

Lines changed: 16 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ Health checks (emit a `probe_up` 1/0 sample; the runner adds `probe_latency_ms`)
1111
- `finelog-write` — write a nonce and read it back (60s).
1212
- `iris-job-submit/<zone>` — submit a tiny job per zone, wait for SUCCEEDED (300s).
1313

14-
Gauge:
14+
Gauges:
1515

1616
- `provisioning` — accelerator provisioning stats over a trailing 3h window,
1717
recomputed every 15 min. The controller's autoscaler emits one structured row
@@ -21,6 +21,17 @@ Gauge:
2121
`provision_*` count/latency/success-ratio gauges. See
2222
`iris.cluster.controller.autoscaler.provisioning` for the outcome vocabulary
2323
and `src/provisioning.py` for the emitted metrics.
24+
- `workers` — worker-fleet snapshot from `list_workers()` (60s). Rolls the
25+
healthy workers into fleet resource totals (`worker_healthy`,
26+
`worker_cpu_millicores`, `worker_memory_bytes`, `worker_tpu_chips`, all
27+
labelled `scope=fleet`) plus a per-region healthy head count
28+
(`worker_healthy{region=…}`).
29+
- `jobs` — root-job-state breakdown from one raw-SQL `GROUP BY` (120s). Splits
30+
into a live in-flight snapshot (`job_inflight{state=…}`) and a trailing-24h
31+
terminal window (`job_terminal_24h{state=…}`), each with a `scope=fleet` total.
32+
Runs the controller's `ExecuteRawQuery` RPC over a dedicated connect client
33+
(the same call the `iris query` CLI makes). See `src/cluster.py` for the
34+
emitted metrics.
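
As an illustration of the `workers` roll-up described above, here is a minimal sketch. The metric and label names come from this README; the `Worker` dataclass and the `rollup` function are hypothetical stand-ins, and the real `list_workers()` rows and `src/cluster.py` code may be shaped differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:  # hypothetical row shape, not the real list_workers() type
    region: str
    healthy: bool
    cpu_millicores: int
    memory_bytes: int
    tpu_chips: int


def rollup(workers):
    """Return (metric, labels, value) samples computed from healthy workers only."""
    healthy = [w for w in workers if w.healthy]
    fleet = {"scope": "fleet"}
    samples = [
        ("worker_healthy", fleet, len(healthy)),
        ("worker_cpu_millicores", fleet, sum(w.cpu_millicores for w in healthy)),
        ("worker_memory_bytes", fleet, sum(w.memory_bytes for w in healthy)),
        ("worker_tpu_chips", fleet, sum(w.tpu_chips for w in healthy)),
    ]
    # Per-region head count of healthy workers, in a stable (sorted) order.
    per_region = Counter(w.region for w in healthy)
    samples += [("worker_healthy", {"region": r}, n) for r, n in sorted(per_region.items())]
    return samples
```

Unhealthy workers contribute nothing to either the fleet totals or the per-region counts in this sketch, matching the "rolls the healthy workers into fleet resource totals" wording.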
2435

2536
Each sample is logged to stdout (`probe <name>: ok|fail [<ms>ms] start=<utc>`),
2637
written to the `infra.canary.metrics` finelog namespace (query it with
@@ -29,8 +40,9 @@ labels with DuckDB `json_extract`), and appended to a daily JSONL that rolls up
2940
to `gs://<us-central1 data bucket>/infra/probes/dt=<date>/` at UTC rollover.
3041

3142
Standalone package (own `pyproject.toml`/`uv.lock`): pulls `marin-iris`,
32-
`marin-finelog`, `marin-rigging` from the rolling GitHub releases via
33-
`find-links`. Bump to today's nightly with `uv lock -U` inside `infra/probes/`.
43+
`marin-finelog`, `marin-rigging` from PyPI as `0.2.x.dev` nightlies
44+
(`prerelease = "if-necessary"`). Bump to today's nightly with `uv lock -U`
45+
inside `infra/probes/`.
3446

3547
## Run
3648

@@ -48,7 +60,7 @@ Single COS VM `infra-probes` (us-central1-b), one container, `restart=always`.
4860
```bash
4961
cd infra/probes
5062
uv run deploy/deploy.py build # build + push :sha and :latest
51-
uv run deploy/deploy.py apply # roll the VM to :latest
63+
uv run deploy/deploy.py apply # roll the VM to this HEAD's :sha image
5264
uv run deploy/deploy.py status # VM state + recent logs
5365
```
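
The tag scheme behind `build` and `apply` can be sketched as follows. The helper `image_refs` is hypothetical (the real logic lives in `deploy/deploy.py`), and the short sha is passed in instead of being read from `git`.

```python
def image_refs(registry: str, sha: str) -> dict[str, str]:
    """Return the two refs `build` pushes; `apply` should deploy only refs["sha"]."""
    if not sha:
        raise ValueError("empty git sha: refusing to build an image ref")
    return {
        "sha": f"{registry}:{sha}",        # immutable, one per commit
        "latest": f"{registry}:latest",    # mutable convenience tag
    }
```

Because `apply` resolves the sha from the current `HEAD`, run `build` at the same commit first; otherwise the VM is pointed at a tag that was never pushed.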
5466

infra/probes/deploy/Dockerfile

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -3,9 +3,9 @@
33
# Build context is THIS directory's parent (infra/probes/):
44
# docker build -f deploy/Dockerfile -t probes:dev infra/probes
55
#
6-
# marin-iris, marin-finelog, marin-rigging come from the per-package rolling
7-
# GitHub releases (see pyproject.toml [tool.uv] find-links). No marin source
8-
# is required in the build context.
6+
# marin-iris, marin-finelog, marin-rigging are installed from PyPI per the
7+
# lockfile (their nightly dev wheels). No marin source is required in the
8+
# build context.
99

1010
FROM python:3.12-slim AS base
1111

infra/probes/deploy/deploy.py

Lines changed: 36 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -28,9 +28,13 @@
2828
_MARIN_CONFIG = load_cluster_config("marin")
2929

3030
IMAGE_NAME = "infra-probes"
31-
# The probes daemon writes its JSONL roll-ups here; the SA needs object-create on
32-
# this bucket and the canary's GCS prefix lives under it (see infra_probes.py).
31+
# The probes daemon writes its JSONL roll-ups under this bucket+prefix (see
32+
# infra_probes.py). Rolling a day up overwrites a deterministic per-day object
33+
# when a stranded local file is re-uploaded after a restart, so the SA needs
34+
# create+get+delete — granted via objectUser, scoped by IAM condition to the
35+
# prefix so the canary can't touch the rest of this shared data bucket.
3336
RESULTS_BUCKET = _MARIN_CONFIG.region_buckets["us-central1"]
37+
RESULTS_GCS_PREFIX = "infra/probes"
3438
RESULTS_HOST_PATH = "/var/lib/probes"
3539
# Build context / git repo root for `build`: this script lives in deploy/.
3640
PROBES_DIR = Path(__file__).resolve().parent.parent
@@ -66,14 +70,18 @@ def cli(ctx: click.Context, project: str, region: str, zone: str, vm_name: str,
6670
}
6771

6872

73+
def _git_sha() -> str:
74+
return _run(
75+
["git", "-C", str(PROBES_DIR), "rev-parse", "--short", "HEAD"],
76+
capture_output=True,
77+
).stdout.strip()
78+
79+
6980
@cli.command()
7081
@click.pass_obj
7182
def build(cfg: dict[str, str]) -> None:
7283
"""Build the image, tag with git sha and 'latest', push to Artifact Registry."""
73-
sha = _run(
74-
["git", "-C", str(PROBES_DIR), "rev-parse", "--short", "HEAD"],
75-
capture_output=True,
76-
).stdout.strip()
84+
sha = _git_sha()
7785
image_sha = f"{cfg['registry']}:{sha}"
7886
image_latest = f"{cfg['registry']}:latest"
7987

@@ -100,9 +108,15 @@ def build(cfg: dict[str, str]) -> None:
100108
@cli.command()
101109
@click.pass_obj
102110
def apply(cfg: dict[str, str]) -> None:
103-
"""Roll the prod VM to the 'latest' image."""
104-
image_latest = f"{cfg['registry']}:latest"
105-
logger.info("Rolling VM %s (%s) to %s", cfg["vm_name"], cfg["zone"], image_latest)
111+
"""Roll the prod VM to the current git sha's image.
112+
113+
Deploys the immutable ``:<sha>`` tag, not ``:latest``: konlet keeps running a
114+
locally-cached ``:latest`` when update-container is handed the same mutable
115+
ref, so a same-tag roll silently runs the old image. A distinct ``:<sha>``
116+
ref forces the pull. Build the matching image first (``build`` at this HEAD).
117+
"""
118+
image_sha = f"{cfg['registry']}:{_git_sha()}"
119+
logger.info("Rolling VM %s (%s) to %s", cfg["vm_name"], cfg["zone"], image_sha)
106120
_run(
107121
[
108122
"gcloud",
@@ -112,7 +126,7 @@ def apply(cfg: dict[str, str]) -> None:
112126
cfg["vm_name"],
113127
f"--project={cfg['project']}",
114128
f"--zone={cfg['zone']}",
115-
f"--container-image={image_latest}",
129+
f"--container-image={image_sha}",
116130
]
117131
)
118132

@@ -172,7 +186,7 @@ def create(cfg: dict[str, str], iris_endpoint: str, machine_type: str) -> None:
172186
logger.info("Creating service account %s", sa)
173187
_run(["gcloud", "iam", "service-accounts", "create", IMAGE_NAME, f"--project={project}"])
174188

175-
# SA needs: pull image, ship stdout to Cloud Logging, write GCS roll-ups.
189+
# SA needs: pull image, ship stdout to Cloud Logging, manage GCS roll-ups.
176190
logger.info("Granting IAM roles to %s", sa)
177191
_run(
178192
[
@@ -198,6 +212,15 @@ def create(cfg: dict[str, str], iris_endpoint: str, machine_type: str) -> None:
198212
"--condition=None",
199213
]
200214
)
215+
# objectUser (create/get/delete) restricted to the roll-up prefix. The
216+
# bucket-scoped objects.list it implies is intentionally not covered by the
217+
# object-name condition; gcsfs only uses list to sniff bucket type and falls
218+
# back gracefully, so the upload still succeeds.
219+
prefix_condition = (
220+
f'expression=resource.name.startsWith("projects/_/buckets/{RESULTS_BUCKET}'
221+
f'/objects/{RESULTS_GCS_PREFIX}/"),title=infra-probes-prefix,'
222+
"description=Limit infra-probes SA object access to its rollup prefix"
223+
)
201224
_run(
202225
[
203226
"gcloud",
@@ -206,7 +229,8 @@ def create(cfg: dict[str, str], iris_endpoint: str, machine_type: str) -> None:
206229
"add-iam-policy-binding",
207230
f"gs://{RESULTS_BUCKET}",
208231
f"--member={member}",
209-
"--role=roles/storage.objectCreator",
232+
"--role=roles/storage.objectUser",
233+
f"--condition={prefix_condition}",
210234
]
211235
)
212236

infra/probes/pyproject.toml

Lines changed: 15 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -7,14 +7,19 @@ name = "marin-infra-probes"
77
version = "0.1.0"
88
description = "Synthetic infra monitoring: continuously exercises Iris and Finelog, records latency and error samples."
99
requires-python = ">=3.12,<3.14"
10-
# marin-* libs are not published to PyPI past a one-time bootstrap; the real
11-
# distribution channel is the per-package rolling GH release. Pinning the
12-
# floor at 0.99.dev0 + prerelease=allow + find-links picks up today's nightly.
10+
# marin-* libs ship to PyPI nightly (marin-release-libs-wheels.yaml for
11+
# iris/rigging, finelog-release-wheels.yaml for finelog), only as
12+
# `0.2.x.devYYYYMMDDhhmm` prereleases; `uv lock -U` picks up the latest. (The
13+
# old GH `*-latest` rolling releases are abandoned/frozen.)
1314
dependencies = [
14-
"marin-iris >= 0.99.dev0",
15-
"marin-finelog >= 0.99.dev0",
16-
"marin-rigging >= 0.99.dev0",
15+
"marin-iris >= 0.2.0",
16+
"marin-finelog >= 0.2.0",
17+
"marin-rigging >= 0.2.0",
1718
"click >= 8.0",
19+
# marin-iris pulls s3fs -> aiobotocore, which breaks at import under the
20+
# httpx 1.0 prereleases (no httpx.TimeoutException). Pin to the 0.28 stable
21+
# line; marin-iris only needs httpx >= 0.28.1.
22+
"httpx >= 0.28.1, < 1",
1823
]
1924

2025
[project.scripts]
@@ -24,12 +29,10 @@ probes = "infra_probes:main"
2429
dev = ["pytest >= 8.4"]
2530

2631
[tool.uv]
27-
prerelease = "allow"
28-
find-links = [
29-
"https://github.com/marin-community/marin/releases/expanded_assets/marin-iris-latest",
30-
"https://github.com/marin-community/marin/releases/expanded_assets/marin-finelog-latest",
31-
"https://github.com/marin-community/marin/releases/expanded_assets/marin-rigging-latest",
32-
]
32+
# if-necessary (not allow): take prereleases only for packages with no stable
33+
# release — the marin-* dev wheels — while pinning httpx/pydantic/etc to stable.
34+
# Global "allow" dragged in httpx 1.0.dev3, which broke aiobotocore at import.
35+
prerelease = "if-necessary"
3336

3437
[tool.hatch.build.targets.wheel]
3538
# Ship the whole package; an explicit per-file list silently drops new modules
