diff --git a/.agents/skills/profile-training/SKILL.md b/.agents/skills/profile-training/SKILL.md
Lines changed: 31 additions & 0 deletions
diff --git a/.agents/skills/reserve-gpu/SKILL.md b/.agents/skills/reserve-gpu/SKILL.md
Lines changed: 94 additions & 0 deletions
diff --git a/.github/workflows/iris-smoke-coreweave.yaml b/.github/workflows/iris-smoke-coreweave.yaml
Lines changed: 2 additions & 2 deletions
diff --git a/infra/probes/README.md b/infra/probes/README.md
Lines changed: 16 additions & 4 deletions
diff --git a/infra/probes/deploy/Dockerfile b/infra/probes/deploy/Dockerfile
Lines changed: 3 additions & 3 deletions
diff --git a/infra/probes/deploy/deploy.py b/infra/probes/deploy/deploy.py
Lines changed: 36 additions & 12 deletions
diff --git a/infra/probes/pyproject.toml b/infra/probes/pyproject.toml
Lines changed: 15 additions & 12 deletions
@@ -61,13 +61,44 @@ uv run ... \
 Keep the profiler window short when enabling HLO protobuf collection — it
 enlarges artifacts and can increase profile upload/finalization time.
 
+Known-good TensorBoard scope recipe from CoreWeave Grug MoE profiling:
+`trainer.profiler.enabled=true`, `trainer.profiler.start_step=3`,
+`trainer.profiler.num_steps=2`, `trainer.profiler.perfetto_link=false`,
+`trainer.profiler.profile_options.host_tracer_level=1`,
+`trainer.profiler.profile_options.python_tracer_level=0`, and
+`trainer.profiler.profile_options.enable_hlo_proto=true` preserved useful
+`jax.named_scope` / `named_call` regions in TensorBoard for
+`GM2560-MAY-120S4096-W2048-B8-R1-E8M1-FA4PROFILE-S3B-N1-cw-20260617-2353`.
+Leave `device_tracer_level` unset unless device timelines are specifically
+needed; this profile still had useful hierarchical host/XLA metadata.
+
+On GPU, command buffers can collapse or suppress the visible name stack in
+TensorBoard/Perfetto. For profile-readability runs, disable command buffers:
+
+```bash
+export XLA_FLAGS="${XLA_FLAGS:-} --xla_gpu_enable_command_buffer=''"
+```
+
+This hurts performance, so use it only when the goal is semantic trace
+attribution; leave it out of throughput comparisons unless command-buffer
+behavior is the axis being tested.
+
+For GPU throughput runs, keep profile-readability flags separate from XLA code
+generation and scheduling flags. Start from JAX's GPU performance guide,
+especially the code generation flags section:
+<https://docs.jax.dev/en/latest/gpu_performance_tips.html#code-generation-flags>.
+The exact set of useful XLA flags is `jaxlib`-version dependent, so record the
+full `XLA_FLAGS` value with each profile or W&B run.
+
 For better profile readability, use `haliax.jax_utils.named_call` and
 `jax.named_scope` liberally in model code; these names flow into trace
 annotations and make region-level summaries far more actionable.
 
 Reference:
 - `lib/levanter/docs/Performance-Guide.md`
 - `.agents/skills/add-pallas-kernel/`
+- JAX GPU performance tips:
+  <https://docs.jax.dev/en/latest/gpu_performance_tips.html>
 
 ## Ingest to Structured Summary
 Pick a download location for pulled profile artifacts: `/tmp` for
 
@@ -0,0 +1,94 @@
+---
+name: reserve-gpu
+description: Reserve an Iris-backed CoreWeave H100 pod for fast debugging with dev_gpu.py.
+---
+
+# Skill: Dev GPU
+
+Use this skill for the standard fast H100 debugging loop without wiring a full training job each time. It is the GPU counterpart to `reserve-tpu`.
+
+`scripts/iris/dev_gpu.py` reserves a CoreWeave H100 pod through Iris, waits for the backing Kubernetes pod to come up, and `kubectl exec -it`s you into it. Marin's H100s are CoreWeave Kubernetes pods, not GCE VMs, so access is `kubectl`, not SSH — there is no `ssh`/`scp` transport and no `~/.ssh/config` alias.
+
+This is a lean tool: `allocate`, `connect`, `status`, `release`. It does not sync files or run remote env setup (no `execute`/`watch`/`setup_env`). The CoreWeave task image is self-contained; the loop is "reserve a node, shell in." Run any file sync or environment setup yourself once connected.
+
+## Cost rule
+
+A holder pod sits on an expensive 8×H100 node for the session's lifetime. Release as soon as you are done — `Ctrl-C` the `allocate` terminal, or run `release` from another shell.
+
+## Commands
+
+- `allocate`: submit a holder job, resolve the assigned pod, persist session state, block until release
+- `status`: show the active local session metadata
+- `connect`: open an interactive shell (`kubectl exec -it … -- bash -l`) into the reserved pod
+- `release`: terminate the holder job and remove the local session file
+
+## Prerequisites
+
+1. Place the cluster kubeconfig at the path the config expects. The tool passes `--kubeconfig <platform.coreweave.kubeconfig_path>` to `kubectl` verbatim and fails fast if the file is absent. For the production H100 fleet (`cw-us-east-02a`, the `marin-gpu` cluster) that path is `~/.kube/coreweave-iris-gpu`, per `lib/iris/docs/coreweave.md`.
+
+2. Ensure the Iris controller is running for the cluster. On the shared CoreWeave cluster this is usually already true; only start it yourself for a fresh cluster.
+
+3. Use a cluster config whose platform is CoreWeave/Kubernetes. The tool gates on this and rejects GCP/TPU configs with a pointer back to `dev_tpu.py`.
+
+## Command pattern
+
+All invocations share this shape; only the subcommand and its flags change:
+
+```bash
+uv run scripts/iris/dev_gpu.py \
+  --config lib/iris/config/cw-us-east-02a.yaml \
+  --name "$USER-gpu" \
+  <subcommand> [flags]
+```
+
+Subcommands and distinctive flags:
+
+- `allocate` — reserves a whole `h100-8x` node (`--gpu-count` defaults to `8`) and holds it until `Ctrl-C`. Add `--timeout` (default `900`) to bound the wait for the task to reach `RUNNING`, and `--pod-timeout` (default `120`) to bound the wait for the backing pod. Only `--gpu-count 8` is validated; a sub-node value schedules as a fractional share (`nvidia-smi -L` then shows fewer GPUs) but fragments the 8-GPU InfiniBand gang pool, so prefer the whole node.
+- `status` — show the active session (job id, config, GPU count, resolved pod).
+- `connect` — interactive shell into the pod. It first checks job liveness with the controller (failing fast if the job is gone), then `kubectl exec -it`s into container `task`.
+- `release` — terminate the holder job and clear the session file. Pass `--force` to drop local state even when the terminate call fails (then confirm the job is gone with `iris job list`).
+
+## GPU JAX inside the pod
+
+The `iris-task` image ships a CPU-only `uv` environment at `/app`, so bare `python` has no JAX and `uv run python` falls back to a CPU device. To get GPU JAX (`jax[cuda13]`):
+
+```bash
+cd /app && uv sync --all-packages --extra=gpu
+```
+
+`--all-packages` is required: the `gpu` extra is defined on the sub-packages (`marin-levanter` / `marin-core`), not the root project. This is the GPU analog of `dev_tpu.py`'s `--extra=tpu`. Verify the hardware with `nvidia-smi -L` (expect 8×H100 80GB on a whole node).
+
+## Observability
+
+Use normal Iris tooling to inspect the backing cluster and holder job:
+
+```bash
+uv run iris --config=lib/iris/config/cw-us-east-02a.yaml job list --prefix /$USER/dev-gpu
+uv run iris --config=lib/iris/config/cw-us-east-02a.yaml job logs /$USER/dev-gpu-<name>
+```
+
+Inspect the pod directly with the same kubeconfig the tool uses:
+
+```bash
+kubectl --kubeconfig ~/.kube/coreweave-iris-gpu --namespace iris get pods -l iris.task_id=<sanitized-task-id>
+```
+
+## Session behavior
+
+- Local session state lives under `~/.cache/marin/dev_gpu_iris/`.
+- If the `allocate` terminal dies unexpectedly, run `release` to terminate the holder job and clear the stale state file.
+- A failed `allocate` cleans up after itself: the holder job is terminated and the local state file is removed only once the job is confirmed gone, so a failed terminate never orphans an expensive pod with no local record of its job id.
+- `connect` execs into the pod resolved at allocation time. If Iris rescheduled the task onto a new pod while the job stayed active, `connect` fails — re-allocate.
+
+## Agent Usage
+
+Always pass `--name` to avoid collisions with other agents:
+
+```bash
+export GPU_NAME="${USER}-$(git rev-parse --abbrev-ref HEAD | tr '/' '-')"
+uv run scripts/iris/dev_gpu.py --config lib/iris/config/cw-us-east-02a.yaml --name "$GPU_NAME" allocate
+```
+
+## Cleanup
+
+Normal cleanup is `Ctrl-C` in the `allocate` terminal. To clean up from another shell, run the `release` subcommand (add `--force` only if the job is already dead and `release` keeps erroring).
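
Review note on the Agent Usage section above: a minimal Python sketch of the same session-name derivation and command shape, for agents that drive `dev_gpu.py` from Python rather than a shell. `gpu_session_name` and `dev_gpu_argv` are hypothetical helpers, not part of `dev_gpu.py`. Only flags and paths quoted in the skill are used.

```python
CONFIG = "lib/iris/config/cw-us-east-02a.yaml"  # config path quoted in the skill


def gpu_session_name(user: str, branch: str) -> str:
    """Mirror `${USER}-$(git rev-parse --abbrev-ref HEAD | tr '/' '-')`."""
    return f"{user}-{branch.replace('/', '-')}"


def dev_gpu_argv(name: str, subcommand: str, *flags: str) -> list[str]:
    """Build the shared invocation; only the subcommand and its flags vary."""
    return [
        "uv", "run", "scripts/iris/dev_gpu.py",
        "--config", CONFIG,
        "--name", name,
        subcommand, *flags,
    ]
```

For example, `dev_gpu_argv(gpu_session_name("alice", "fix/gpu-loop"), "release", "--force")` yields the argument list for a forced release of the session named `alice-fix-gpu-loop`; the user and branch here are made up.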
@@ -109,7 +109,7 @@ jobs:
           JAX_TRACEBACK_FILTERING: off
           # When set, the marin-on-iris test uploads fixtures and writes
           # intermediate data to S3 (R2) so remote Zephyr pods can access them.
-          MARIN_CI_S3_PREFIX: s3://marin-na/temp/ci
+          MARIN_CI_S3_PREFIX: s3://marin-na/tmp/ttl=3d/ci
           AWS_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
           AWS_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
           AWS_ENDPOINT_URL: https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com
@@ -127,7 +127,7 @@ jobs:
           WANDB_MODE: disabled
           WANDB_API_KEY: ""
           JAX_TRACEBACK_FILTERING: off
-          MARIN_CI_S3_PREFIX: s3://marin-na/temp/ci
+          MARIN_CI_S3_PREFIX: s3://marin-na/tmp/ttl=3d/ci
           AWS_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
           AWS_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
           AWS_ENDPOINT_URL: https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com
 
@@ -11,7 +11,7 @@ Health checks (emit a `probe_up` 1/0 sample; the runner adds `probe_latency_ms`)
 - `finelog-write` — write a nonce and read it back (60s).
 - `iris-job-submit/<zone>` — submit a tiny job per zone, wait for SUCCEEDED (300s).
 
-Gauge:
+Gauges:
 
 - `provisioning` — accelerator provisioning stats over a trailing 3h window,
   recomputed every 15 min. The controller's autoscaler emits one structured row
@@ -21,6 +21,17 @@ Gauge:
   `provision_*` count/latency/success-ratio gauges. See
   `iris.cluster.controller.autoscaler.provisioning` for the outcome vocabulary
   and `src/provisioning.py` for the emitted metrics.
+- `workers` — worker-fleet snapshot from `list_workers()` (60s). Rolls the
+  healthy workers into fleet resource totals (`worker_healthy`,
+  `worker_cpu_millicores`, `worker_memory_bytes`, `worker_tpu_chips`, all
+  labelled `scope=fleet`) plus a per-region healthy head count
+  (`worker_healthy{region=…}`).
+- `jobs` — root-job-state breakdown from one raw-SQL `GROUP BY` (120s). Splits
+  into a live in-flight snapshot (`job_inflight{state=…}`) and a trailing-24h
+  terminal window (`job_terminal_24h{state=…}`), each with a `scope=fleet` total.
+  Runs the controller's `ExecuteRawQuery` RPC over a dedicated connect client
+  (the same call the `iris query` CLI makes). See `src/cluster.py` for the
+  emitted metrics.
 
 Each sample is logged to stdout (`probe <name>: ok|fail [<ms>ms] start=<utc>`),
 written to the `infra.canary.metrics` finelog namespace (query it with
@@ -29,8 +40,9 @@ labels with DuckDB `json_extract`), and appended to a daily JSONL that rolls up
 to `gs://<us-central1 data bucket>/infra/probes/dt=<date>/` at UTC rollover.
 
 Standalone package (own `pyproject.toml`/`uv.lock`): pulls `marin-iris`,
-`marin-finelog`, `marin-rigging` from the rolling GitHub releases via
-`find-links`. Bump to today's nightly with `uv lock -U` inside `infra/probes/`.
+`marin-finelog`, `marin-rigging` from PyPI as `0.2.x.dev` nightlies
+(`prerelease = "if-necessary"`). Bump to today's nightly with `uv lock -U`
+inside `infra/probes/`.
 
 ## Run
 
@@ -48,7 +60,7 @@ Single COS VM `infra-probes` (us-central1-b), one container, `restart=always`.
 ```bash
 cd infra/probes
 uv run deploy/deploy.py build    # build + push :sha and :latest
-uv run deploy/deploy.py apply    # roll the VM to :latest
+uv run deploy/deploy.py apply    # roll the VM to this HEAD's :sha image
 uv run deploy/deploy.py status   # VM state + recent logs
 ```
 
 
@@ -3,9 +3,9 @@
 # Build context is THIS directory's parent (infra/probes/):
 #   docker build -f deploy/Dockerfile -t probes:dev infra/probes
 #
-# marin-iris, marin-finelog, marin-rigging come from the per-package rolling
-# GitHub releases (see pyproject.toml [tool.uv] find-links). No marin source
-# is required in the build context.
+# marin-iris, marin-finelog, marin-rigging are installed from PyPI per the
+# lockfile (their nightly dev wheels). No marin source is required in the
+# build context.
 
 FROM python:3.12-slim AS base
 
 
@@ -28,9 +28,13 @@
 _MARIN_CONFIG = load_cluster_config("marin")
 
 IMAGE_NAME = "infra-probes"
-# The probes daemon writes its JSONL roll-ups here; the SA needs object-create on
-# this bucket and the canary's GCS prefix lives under it (see infra_probes.py).
+# The probes daemon writes its JSONL roll-ups under this bucket+prefix (see
+# infra_probes.py). Rolling a day up overwrites a deterministic per-day object
+# when a stranded local file is re-uploaded after a restart, so the SA needs
+# create+get+delete — granted via objectUser, scoped by IAM condition to the
+# prefix so the canary can't touch the rest of this shared data bucket.
 RESULTS_BUCKET = _MARIN_CONFIG.region_buckets["us-central1"]
+RESULTS_GCS_PREFIX = "infra/probes"
 RESULTS_HOST_PATH = "/var/lib/probes"
 # Build context / git repo root for `build`: this script lives in deploy/.
 PROBES_DIR = Path(__file__).resolve().parent.parent
@@ -66,14 +70,18 @@ def cli(ctx: click.Context, project: str, region: str, zone: str, vm_name: str,
     }
 
 
+def _git_sha() -> str:
+    return _run(
+        ["git", "-C", str(PROBES_DIR), "rev-parse", "--short", "HEAD"],
+        capture_output=True,
+    ).stdout.strip()
+
+
 @cli.command()
 @click.pass_obj
 def build(cfg: dict[str, str]) -> None:
     """Build the image, tag with git sha and 'latest', push to Artifact Registry."""
-    sha = _run(
-        ["git", "-C", str(PROBES_DIR), "rev-parse", "--short", "HEAD"],
-        capture_output=True,
-    ).stdout.strip()
+    sha = _git_sha()
     image_sha = f"{cfg['registry']}:{sha}"
     image_latest = f"{cfg['registry']}:latest"
 
@@ -100,9 +108,15 @@ def build(cfg: dict[str, str]) -> None:
 @cli.command()
 @click.pass_obj
 def apply(cfg: dict[str, str]) -> None:
-    """Roll the prod VM to the 'latest' image."""
-    image_latest = f"{cfg['registry']}:latest"
-    logger.info("Rolling VM %s (%s) to %s", cfg["vm_name"], cfg["zone"], image_latest)
+    """Roll the prod VM to the current git sha's image.
+
+    Deploys the immutable ``:<sha>`` tag, not ``:latest``: konlet keeps running a
+    locally-cached ``:latest`` when update-container is handed the same mutable
+    ref, so a same-tag roll silently runs the old image. A distinct ``:<sha>``
+    ref forces the pull. Build the matching image first (``build`` at this HEAD).
+    """
+    image_sha = f"{cfg['registry']}:{_git_sha()}"
+    logger.info("Rolling VM %s (%s) to %s", cfg["vm_name"], cfg["zone"], image_sha)
     _run(
         [
             "gcloud",
@@ -112,7 +126,7 @@ def apply(cfg: dict[str, str]) -> None:
             cfg["vm_name"],
             f"--project={cfg['project']}",
             f"--zone={cfg['zone']}",
-            f"--container-image={image_latest}",
+            f"--container-image={image_sha}",
         ]
     )
 
@@ -172,7 +186,7 @@ def create(cfg: dict[str, str], iris_endpoint: str, machine_type: str) -> None:
     logger.info("Creating service account %s", sa)
     _run(["gcloud", "iam", "service-accounts", "create", IMAGE_NAME, f"--project={project}"])
 
-    # SA needs: pull image, ship stdout to Cloud Logging, write GCS roll-ups.
+    # SA needs: pull image, ship stdout to Cloud Logging, manage GCS roll-ups.
     logger.info("Granting IAM roles to %s", sa)
     _run(
         [
@@ -198,6 +212,15 @@ def create(cfg: dict[str, str], iris_endpoint: str, machine_type: str) -> None:
             "--condition=None",
         ]
     )
+    # objectUser (create/get/delete) restricted to the roll-up prefix. The
+    # bucket-scoped objects.list it implies is intentionally not covered by the
+    # object-name condition; gcsfs only uses list to sniff bucket type and falls
+    # back gracefully, so the upload still succeeds.
+    prefix_condition = (
+        f'expression=resource.name.startsWith("projects/_/buckets/{RESULTS_BUCKET}'
+        f'/objects/{RESULTS_GCS_PREFIX}/"),title=infra-probes-prefix,'
+        "description=Limit infra-probes SA object access to its rollup prefix"
+    )
     _run(
         [
             "gcloud",
@@ -206,7 +229,8 @@ def create(cfg: dict[str, str], iris_endpoint: str, machine_type: str) -> None:
             "add-iam-policy-binding",
             f"gs://{RESULTS_BUCKET}",
             f"--member={member}",
-            "--role=roles/storage.objectCreator",
+            "--role=roles/storage.objectUser",
+            f"--condition={prefix_condition}",
         ]
     )
 
 
@@ -7,14 +7,19 @@ name = "marin-infra-probes"
 version = "0.1.0"
 description = "Synthetic infra monitoring: continuously exercises Iris and Finelog, records latency and error samples."
 requires-python = ">=3.12,<3.14"
-# marin-* libs are not published to PyPI past a one-time bootstrap; the real
-# distribution channel is the per-package rolling GH release. Pinning the
-# floor at 0.99.dev0 + prerelease=allow + find-links picks up today's nightly.
+# marin-* libs ship to PyPI nightly (marin-release-libs-wheels.yaml for
+# iris/rigging, finelog-release-wheels.yaml for finelog), only as
+# `0.2.x.devYYYYMMDDhhmm` prereleases; `uv lock -U` picks up the latest. (The
+# old GH `*-latest` rolling releases are abandoned/frozen.)
 dependencies = [
-    "marin-iris    >= 0.99.dev0",
-    "marin-finelog >= 0.99.dev0",
-    "marin-rigging >= 0.99.dev0",
+    "marin-iris    >= 0.2.0",
+    "marin-finelog >= 0.2.0",
+    "marin-rigging >= 0.2.0",
     "click >= 8.0",
+    # marin-iris pulls s3fs -> aiobotocore, which breaks at import under the
+    # httpx 1.0 prereleases (no httpx.TimeoutException). Cap below 1.0 to stay on
+    # the stable 0.x line; marin-iris only needs httpx >= 0.28.1.
+    "httpx >= 0.28.1, < 1",
 ]
 
 [project.scripts]
@@ -24,12 +29,10 @@ probes = "infra_probes:main"
 dev = ["pytest >= 8.4"]
 
 [tool.uv]
-prerelease = "allow"
-find-links = [
-    "https://github.com/marin-community/marin/releases/expanded_assets/marin-iris-latest",
-    "https://github.com/marin-community/marin/releases/expanded_assets/marin-finelog-latest",
-    "https://github.com/marin-community/marin/releases/expanded_assets/marin-rigging-latest",
-]
+# if-necessary (not allow): take prereleases only for packages with no stable
+# release — the marin-* dev wheels — while pinning httpx/pydantic/etc to stable.
+# Global "allow" dragged in httpx 1.0.dev3, which broke aiobotocore at import.
+prerelease = "if-necessary"
 
 [tool.hatch.build.targets.wheel]
 # Ship the whole package; an explicit per-file list silently drops new modules