Commit 618b566

Merge branch 'main' into fix_common_strategy
2 parents c4d963b + bde162a commit 618b566

292 files changed

Lines changed: 22692 additions & 2655 deletions

.agents/README.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+# `.agents/` — agent-agnostic source of truth
+
+This directory is the canonical location for assets shared by AI coding agents
+working in this repository (Claude Code, Codex, Cursor, …).
+
+## Layout
+
+```text
+.agents/
+├── skills/                 # SKILL.md files (canonical)
+│   └── <skill-name>/SKILL.md
+├── scripts/                # shared helper scripts (sync-upstream-skills.sh, …)
+└── clusters.yaml.example   # remote-cluster config template
+```
+
+## Why this exists
+
+Different agents look for skills/config in vendor-specific directories. Rather
+than maintaining N copies that drift out of sync, **`.agents/` is the single
+source of truth** — each agent's guidance or install mechanism points here
+directly.
+
+## How each agent finds these
+
+Each agent points at `.agents/` through whatever mechanism it supports — never
+a copy:
+
+- **Claude Code** only auto-discovers skills under `.claude/skills/`, so
+  `.claude/` holds relative in-repo symlinks back into `.agents/`:
+  `.claude/skills → ../.agents/skills`, `.claude/scripts → ../.agents/scripts`,
+  and `.claude/clusters.yaml.example → ../.agents/clusters.yaml.example`. These
+  follow the same committed-symlink pattern already used elsewhere in this repo
+  (e.g. `CLAUDE.md`, `tools/launcher/modules/Model-Optimizer`).
+- **Future agents** (Codex, Cursor, …) add their own symlink or config pointing
+  at `.agents/`.
+
+## Editing rules
+
+- **Always edit files under `.agents/`**.
+- Vendored-verbatim skills (`launching-evals`, `accessing-mlflow`) are managed
+  by `.agents/scripts/sync-upstream-skills.sh` — do not modify by hand.
+- New skills go in `.agents/skills/<skill-name>/SKILL.md` following the
+  conventions of existing skills (e.g. `.agents/skills/monitor/SKILL.md`).
+
+## Project-level cluster config
+
+The remote-execution skills look for a `clusters.yaml` at, in order:
+
+1. `~/.config/modelopt/clusters.yaml` (user-level, recommended)
+2. `<repo-root>/.agents/clusters.yaml` (project-level, canonical)
+3. `<repo-root>/.claude/clusters.yaml` (project-level, back-compat)
+
+See `clusters.yaml.example` for the schema.
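
Reviewer note: the relative symlinks described under "How each agent finds these" could be recreated with something like the sketch below. `link_agent_assets` is a hypothetical helper name and is not part of this commit; the three link names and their `../.agents/...` targets are taken from the README text above. The sketch assumes the agent directory does not already contain real (non-symlink) directories with those names.

```bash
# Hypothetical helper (not in the repository): point an agent-specific
# directory such as .claude/ at the canonical .agents/ assets.
link_agent_assets() {
  local repo_root="$1" agent_dir="$2"   # e.g. /path/to/repo and .claude
  local name

  mkdir -p "$repo_root/$agent_dir"
  for name in skills scripts clusters.yaml.example; do
    # -s: symbolic link, -f: replace an existing link,
    # -n: treat an existing symlink to a directory as a file to replace.
    # The target is relative to the link's own directory, so the link
    # stays valid wherever the repository is cloned.
    ln -sfn "../.agents/$name" "$repo_root/$agent_dir/$name"
  done
}
```

Under those assumptions, `link_agent_assets "$(git rev-parse --show-toplevel)" .claude` would produce the three links the README lists, and `readlink .claude/skills` should then print `../.agents/skills`.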

.agents/clusters.yaml.example

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# ModelOpt Remote Cluster Configuration
+# Copy to ~/.config/modelopt/clusters.yaml (user-level, recommended)
+# or .agents/clusters.yaml (project-level, can be committed).
+# .claude/clusters.yaml is also accepted for back-compat.
+
+clusters:
+  # GPU workstation or SLURM login node
+  my-cluster:
+    login_node: cluster-login.example.com
+    user: myusername
+    ssh_key: ~/.ssh/id_rsa
+    # ssh_proxy: "socat - PROXY:localhost:%h:%p,proxyport=3128"  # optional
+    workspace: /path/to/remote/workdir
+    gpu_type: H100  # used for quantization format recommendation
+    # slurm:
+    #   default_account: my_account
+    #   default_partition: batch_short
+
+default_cluster: my-cluster
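
Reviewer note: for a quick check of which cluster a config would select by default, a one-line `awk` lookup is probably enough. The sketch below is an illustration only: `default_cluster_name` is a hypothetical helper, it assumes the flat `key: value` layout of the template above with an unquoted value, and it is not a general YAML parser. How `remote_exec.sh` itself reads the file is not shown in this diff.

```bash
# Hypothetical helper (not in the repository): print the value of the
# top-level `default_cluster:` key from a clusters.yaml-style file.
default_cluster_name() {
  # Split on the first ": " and strip a trailing "  # comment" if present.
  awk -F': *' '/^default_cluster:/ { sub(/[ \t]+#.*$/, "", $2); print $2; exit }' "$1"
}
```

With the template above saved as `~/.config/modelopt/clusters.yaml`, `default_cluster_name ~/.config/modelopt/clusters.yaml` should print `my-cluster`.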
Lines changed: 8 additions & 5 deletions
@@ -21,15 +21,18 @@
 # NOT managed by this script — update it manually when pulling upstream changes.
 #
 # Usage:
-#   .claude/scripts/sync-upstream-skills.sh                      # re-vendor at the pinned SHA
-#   UPSTREAM_SHA=<sha> .claude/scripts/sync-upstream-skills.sh   # bump to a new SHA
+#   .agents/scripts/sync-upstream-skills.sh                      # re-vendor at the pinned SHA
+#   UPSTREAM_SHA=<sha> .agents/scripts/sync-upstream-skills.sh   # bump to a new SHA
 #
 # Requires: gh, base64, awk. Run from the repo root.
 #
-# The script overwrites .claude/skills/<skill>/ with upstream contents and
+# The script overwrites .agents/skills/<skill>/ with upstream contents and
 # re-applies our provenance lines into each SKILL.md frontmatter. If you have
 # local changes to a vendored skill, they will be lost — that is expected,
 # since vendored-verbatim skills should not be modified locally.
+#
+# Note: .claude/skills/ (and other agent-specific skill dirs) are symlinks to
+# .agents/skills/ — see .agents/README.md.

 set -euo pipefail

@@ -40,7 +43,7 @@ SHORT_SHA="${SHA:0:7}"

 UPSTREAM_REPO="NVIDIA-NeMo/Evaluator"
 UPSTREAM_BASE="packages/nemo-evaluator-launcher/.claude/skills"
-DEST_BASE=".claude/skills"
+DEST_BASE=".agents/skills"

 if [[ ! -d "$DEST_BASE" ]]; then
     echo "error: run from the repo root (expected $DEST_BASE/ to exist)" >&2
@@ -116,7 +119,7 @@ inject_provenance() {
     print "license: Apache-2.0"
     print "# Vendored verbatim from NVIDIA NeMo Evaluator (commit " short ")"
     print "# https://github.com/NVIDIA-NeMo/Evaluator/tree/" sha "/packages/nemo-evaluator-launcher/.claude/skills/" skill
-    print "# To re-sync: .claude/scripts/sync-upstream-skills.sh"
+    print "# To re-sync: .agents/scripts/sync-upstream-skills.sh"
     if (extra != "") {
         n = split(extra, lines, "\\|")
         for (i = 1; i <= n; i++) print "# " lines[i]

.claude/skills/common/environment-setup.md renamed to .agents/skills/common/environment-setup.md

Lines changed: 2 additions & 2 deletions
@@ -24,7 +24,7 @@ If previous runs left patches in `modelopt/` (from 4C unlisted model work), chec
 2. **User doesn't specify** → check for cluster config:

 ```bash
-cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>/dev/null
+cat ~/.config/modelopt/clusters.yaml 2>/dev/null || cat .agents/clusters.yaml 2>/dev/null || cat .claude/clusters.yaml 2>/dev/null
 ```

 If a cluster config exists with content → **use the remote cluster** (do not fall back to local even if local GPUs are available — the cluster config indicates the user's preferred execution environment). Otherwise → **local execution**.
@@ -34,7 +34,7 @@ If the cluster config contains multiple clusters and the user did not name the t
 For remote, connect:

 ```bash
-source .claude/skills/common/remote_exec.sh
+source .agents/skills/common/remote_exec.sh
 remote_load_cluster <cluster_name>
 remote_check_ssh
 remote_detect_env # sets REMOTE_ENV_TYPE = slurm / docker / bare
Lines changed: 7 additions & 6 deletions
@@ -9,8 +9,9 @@ Read this when Claude Code runs on a different machine than the target GPU clust
 Config locations (checked in order, first found wins):

 1. `~/.config/modelopt/clusters.yaml` — user-level (not committed, recommended)
-2. `.claude/clusters.yaml` — project-level (can be committed for shared defaults)
-3. Interactive input — if neither file exists, ask the user (see SKILL.md Step 0) and write `~/.config/modelopt/clusters.yaml` before proceeding
+2. `.agents/clusters.yaml` — project-level, canonical (can be committed for shared defaults)
+3. `.claude/clusters.yaml` — project-level, back-compat
+4. Interactive input — if no file exists, ask the user (see SKILL.md Step 0) and write `~/.config/modelopt/clusters.yaml` before proceeding

 ```yaml
 clusters:
@@ -38,14 +39,14 @@ rsync -av /path/to/local/checkpoint <cluster-login>:<cluster-workspace>/<session

 Use the `workspace` path from your cluster config as the destination root, and keep staged checkpoints under the session/model directory. Compute nodes on a given cluster share the same storage as its login node, so once staged, the path works everywhere on that cluster.

-See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.
+See `.agents/clusters.yaml.example` for a fully annotated example with multiple cluster types.

 ---

 ## 2. Connect and Establish Persistent Session

 ```bash
-source .claude/skills/common/remote_exec.sh
+source .agents/skills/common/remote_exec.sh
 remote_load_cluster <cluster_name> # or omit name to use default_cluster
 remote_check_ssh # validates connectivity + starts persistent session
 ```
@@ -153,5 +154,5 @@ remote_sync_from <remote_output_subdir> /local/output/
 ## Reference Files

 - **`skills/common/remote_exec.sh`** — Full utility library (session, run, sync, SLURM, Docker helpers)
-- **`.claude/clusters.yaml`** — Active cluster configuration
-- **`.claude/clusters.yaml.example`** — Annotated example config
+- **`.agents/clusters.yaml`** — Active cluster configuration (canonical; `.claude/clusters.yaml` also accepted for back-compat)
+- **`.agents/clusters.yaml.example`** — Annotated example config
Lines changed: 9 additions & 4 deletions
@@ -17,7 +17,7 @@
 # remote_exec.sh — Remote execution utility for ModelOpt agent skills
 #
 # Usage:
-#   source .claude/skills/common/remote_exec.sh
+#   source .agents/skills/common/remote_exec.sh
 #   remote_load_cluster <cluster_name> # or: remote_load_cluster (uses default)
 #   remote_check_ssh
 #   remote_detect_env # detect SLURM vs Docker vs bare metal
@@ -41,12 +41,17 @@
 # ── Helpers ──────────────────────────────────────────────────────────────────

 _remote_config_file() {
-    # Find clusters.yaml: user-level > project-level
+    # Find clusters.yaml: user-level > project-level.
+    # Project-level is checked at .agents/clusters.yaml (canonical) and then
+    # .claude/clusters.yaml (back-compat).
     local user_config="${HOME}/.config/modelopt/clusters.yaml"
     local project_config
-    # Walk up from pwd looking for .claude/clusters.yaml
     local dir="$PWD"
     while [[ "$dir" != "/" ]]; do
+        if [[ -f "$dir/.agents/clusters.yaml" ]]; then
+            project_config="$dir/.agents/clusters.yaml"
+            break
+        fi
         if [[ -f "$dir/.claude/clusters.yaml" ]]; then
             project_config="$dir/.claude/clusters.yaml"
             break
@@ -196,7 +201,7 @@ remote_load_cluster() {
     if [[ -z "$config_file" ]]; then
         echo "ERROR: No clusters.yaml found. Provide cluster info interactively or create one." >&2
         echo " User config: ~/.config/modelopt/clusters.yaml" >&2
-        echo " Project config: .claude/clusters.yaml" >&2
+        echo " Project config: .agents/clusters.yaml (or .claude/clusters.yaml)" >&2
         return 1
     fi


Lines changed: 1 addition & 1 deletion
@@ -215,7 +215,7 @@ which docker 2>/dev/null && echo "RUNTIME=docker"

 | Runtime | Typical clusters | SLURM integration |
 | --- | --- | --- |
-| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
+| **enroot/pyxis** | HPC clusters with container runtime (e.g. DGX Cloud and similar Slurm + container setups) | `srun --container-image` |
 | **Docker** | Bare-metal / on-prem with GPU | `docker run` inside job script |

 ### Step 2: Check credentials for the image's registry
File renamed without changes.
