Commit c956a40

Autoscale Workflow Submissions based on workflow parameters in Agent Skill (#631)
* Update Skill to Autoscale Workflow Submissions based on workflow parameters and README contents
* Parameterize torchrun workflow
* Remove cookbook.md, just fetch from cookbook README
* Revert changes
1 parent c535301 commit c956a40

5 files changed: +41 -70 lines changed


cookbook/reinforcement_learning/multi_gpu/train_policy.yaml

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ workflow:

  set -euxo pipefail

- _isaac_sim/python.sh -m torch.distributed.run --nnodes=1 --nproc_per_node=2 \
+ _isaac_sim/python.sh -m torch.distributed.run --nnodes=1 --nproc_per_node={{num_gpu}} \
  --rdzv_endpoint=localhost:5555 \
  scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 \
  --headless --distributed

cookbook/reinforcement_learning/multi_node/train_policy.yaml

Lines changed: 6 additions & 3 deletions
@@ -35,7 +35,7 @@ workflow:

  set -euxo pipefail

- _isaac_sim/python.sh -m torch.distributed.run --nproc_per_node={{num_gpu}} --nnodes=2 --node_rank=0 \
+ _isaac_sim/python.sh -m torch.distributed.run --nproc_per_node={{num_gpu}} --nnodes={{num_nodes}} --node_rank=0 \
  --rdzv_id=123 --rdzv_backend=c10d --rdzv_endpoint=localhost:5555 \
  scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 --headless \
  --distributed
@@ -46,7 +46,8 @@ workflow:
  outputs:
  - dataset:
  name: robot-policy-dataset
- - name: worker
+ {% for i in range(1, num_nodes) %}
+ - name: worker-{{i}}
  command: ["bash"]
  args: ["/tmp/entry.sh"]
  image: nvcr.io/nvidia/isaac-lab:2.2.0
@@ -59,14 +60,15 @@ workflow:

  set -euxo pipefail

- _isaac_sim/python.sh -m torch.distributed.run --nproc_per_node={{num_gpu}} --nnodes=2 --node_rank=1 \
+ _isaac_sim/python.sh -m torch.distributed.run --nproc_per_node={{num_gpu}} --nnodes={{num_nodes}} --node_rank={{i}} \
  --rdzv_backend=c10d --rdzv_endpoint={{host:master}}:5555 \
  --rdzv_id=123 scripts/reinforcement_learning/rsl_rl/train.py --task=Isaac-Cartpole-v0 \
  --headless --distributed

  mv logs/ {{output}}/

  path: /tmp/entry.sh
+ {% endfor %}
  name: train-robot-multi-node
  resources:
  default:
@@ -77,3 +79,4 @@ workflow:

  default-values:
  num_gpu: 2
+ num_nodes: 2
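
The `{% for i in range(1, num_nodes) %}` loop added above emits one worker task per non-master node, with `--node_rank={{i}}`; the master task keeps `--node_rank=0`. Jinja's `range` follows the same half-open convention as Python's, so the expansion can be sketched in plain Python. The helper name `expand_worker_tasks` and the dictionary layout are hypothetical and only illustrate the loop; they are not part of OSMO.

```python
def expand_worker_tasks(num_nodes: int) -> list[dict]:
    """Mimic `{% for i in range(1, num_nodes) %}` from the multi-node template.

    Hypothetical helper for illustration only: returns the task name and
    torchrun node rank the loop would produce for each worker.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1 (the master node)")
    # Rank 0 belongs to the master task, so workers start at rank 1.
    return [{"name": f"worker-{i}", "node_rank": i} for i in range(1, num_nodes)]


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, expand_worker_tasks(n))
```

With the default `num_nodes: 2` the loop yields a single worker, as before the change, but the task is now named `worker-1` instead of `worker`. With `num_nodes=1` no worker tasks are emitted.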

cookbook/synthetic_data_generation/isaac_sim/README.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ SPDX-License-Identifier: Apache-2.0
  ## Overview

  This workflow uses Isaac Sim, a robotics simulator, to generate synthetic data that can be used to train deep neural
- networks. The workflow consists of one main task that launches Isaac Sim.
+ networks. The workflow consists of one main task that launches Isaac Sim, and generates 60 images.

  ## Prerequisites

skills/osmo-agent/SKILL.md

Lines changed: 33 additions & 12 deletions
@@ -31,7 +31,6 @@ The `agents/` directory contains instructions for specialized subagents. Read th

  The `references/` directory has additional documentation:

- - `references/cookbook.md` — Real-world workflow examples to use as starting points
  - `references/workflow-patterns.md` — Multi-task, parallel execution, data dependencies, Jinja templating
  - `references/advanced-patterns.md` — Checkpointing, retry/exit behavior, node exclusion
@@ -144,19 +143,37 @@ If the user also wants monitoring, debugging, or reporting results, use the
  what they want to run. Write the spec to `workflow.yaml` in the current directory.

  **When generating a workflow spec:**
- - Consult `references/cookbook.md` for the closest real-world example and fetch its
- YAML via WebFetch as a starting point. Adapt it rather than generating from scratch.
- Fetch the README as well, substituting the YAML file name with README. Summarize the
- README, and add it as a comment in the generated workflow spec.
- - **Use cookbook metadata to decide submission count.** The cookbook table in
- `references/cookbook.md` annotates entries with throughput and constraint metadata
- (e.g. "60 images, 1 GPU ONLY"). Before deciding whether to submit one or multiple
+ - Fetch the cookbook README via WebFetch to browse available examples:
+ `https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/README.md`
+ Pick the closest match to the user's request. The cookbook README links to each
+ workflow's per-workflow README. To fetch the workflow YAML:
+ 1. Fetch the per-workflow README at the linked path (e.g.
+ `https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/<path>/README.md`).
+ 2. Read that README to find the workflow YAML filename (do not assume it is
+ `workflow.yaml` — look for the actual filename referenced in the README).
+ 3. Construct the workflow YAML URL as `<per-workflow README directory URL>/<filename>`
+ and fetch it.
+ Use the YAML as a starting point — adapt it rather than generating from scratch.
+ Summarize the per-workflow README and add it as a comment in the generated workflow spec.
+ - **Preserve Jinja template variables.** If the cookbook YAML uses `{{variable}}`
+ placeholders (e.g. `{{num_gpu}}`), do NOT replace or hardcode them in the YAML.
+ Keep the template variables as-is and pass the user's values via `--set` at submit
+ time. Multiple variables are space-separated after a single `--set`:
+ ```
+ osmo workflow submit workflow.yaml --pool <pool_name> --set num_gpu=4 other_var=value
+ ```
+ Do not manually scale `resources` values to match the user's requested GPU count —
+ the template handles this.
+ - **Use workflow README and YAML to decide submission count.** After fetching those
+ two files, find the throughput and constraint metadata
+ (e.g. "60 images"). Before deciding whether to submit one or multiple
  workflows, read those annotations:
  - If a throughput figure is present and the user has a target quantity + time
  budget, calculate: `num_submissions = ceil(target / (throughput_per_run * time_budget))`
  and submit the same YAML that many times.
- - If a constraint is present (e.g. "1 GPU ONLY"), respect it — do not scale by
- requesting more GPUs per workflow; scale by submitting more workflows instead.
+ - For scaling workflows, if a workflow's resource spec uses variables, then you can pass
+ a new value in the submit call. If a resource spec uses constants, scale by submitting
+ more workflows instead of requesting more GPUs, CPUs, etc. for a workflow.
  - If no metadata is present, submit a single workflow unless the user says otherwise.
  - If the workflow involves **multiple tasks, parallel execution, data dependencies
  between tasks, or Jinja templating**, read `references/workflow-patterns.md` for
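
The submission-count rule quoted in this hunk, `num_submissions = ceil(target / (throughput_per_run * time_budget))`, can be sketched in Python as follows. The function name and the input validation are assumptions added for illustration. The hunk does not define the units, so this sketch assumes `time_budget` is expressed as the number of runs a single submission slot can complete within the budget.

```python
import math


def num_submissions(target: float, throughput_per_run: float, time_budget: float) -> int:
    """Apply the SKILL.md rule: ceil(target / (throughput_per_run * time_budget)).

    Illustrative sketch only. Assumes `time_budget` counts how many runs fit
    in the user's time budget, so the product is the output of one slot.
    """
    if target <= 0:
        return 0  # nothing requested, nothing to submit
    if throughput_per_run <= 0 or time_budget <= 0:
        raise ValueError("throughput_per_run and time_budget must be positive")
    return math.ceil(target / (throughput_per_run * time_budget))


if __name__ == "__main__":
    # The Isaac Sim README in this commit states one run generates 60 images.
    print(num_submissions(target=600, throughput_per_run=60, time_budget=1))
```

Rounding up matters here: a target of 61 images at 60 images per run needs two submissions, not one.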
@@ -202,8 +219,10 @@ If the user also wants monitoring, debugging, or reporting results, use the
  `Would you like me to submit this workflow to this pool?`
  Then execute the command yourself — do not tell the user to run it. Once confirmed, run:
  ```
- osmo workflow submit workflow.yaml --pool <pool_name>
+ osmo workflow submit workflow.yaml --pool <pool_name> --set key=value other_key=value
  ```
+ Include `--set` only when the workflow has Jinja template variables to override
+ (e.g. `--set num_gpu=4`). Omit it if the YAML has no template variables.
  If the user wants to run the same workflow multiple times (e.g. "submit 2 of these"),
  submit the same YAML file multiple times — do not create duplicate YAML files.
  Report each workflow ID returned by the CLI so the user can track them.
@@ -218,7 +237,9 @@ If the user also wants monitoring, debugging, or reporting results, use the

  **Validation errors:** If submission fails with a validation error indicating that
  resources failed assertions, read the node capacity values from the error table and
- adjust the `resources` section of `workflow.yaml` using these rules, then resubmit:
+ adjust the hard-coded values in the `resources` section of `workflow.yaml` using these
+ rules, then resubmit. (Do not touch Jinja template variables like `{{num_gpu}}`
+ those are resolved at runtime via `--set`.)

  - **Storage / Memory:** use `floor(capacity * 0.9)` if capacity ≥ 50, otherwise `capacity - 2`
  - **CPU:** use `floor(capacity * 0.9)` if capacity ≥ 30, otherwise `capacity - 2`

skills/osmo-agent/references/cookbook.md

Lines changed: 0 additions & 53 deletions
This file was deleted.
