Merged
2 changes: 1 addition & 1 deletion lib/iris/OPS.md
@@ -81,7 +81,7 @@ iris rpc controller get-provider-status # scheduling events, cluster cap
iris cluster vm status # scale groups with slice counts
```

-Priority bands: `PRIORITY_BAND_INTERACTIVE` (default), `PRIORITY_BAND_PRODUCTION` (can preempt interactive).
+Priority bands: `PRIORITY_BAND_INTERACTIVE` (default), `PRIORITY_BAND_PRODUCTION` (can preempt interactive), `PRIORITY_BAND_BATCH` (preemptible). See [`docs/priority-bands.md`](docs/priority-bands.md) for the user-facing guide on when to pick each band.

## SQL Queries

53 changes: 53 additions & 0 deletions lib/iris/docs/priority-bands.md
@@ -0,0 +1,53 @@
# Priority Bands

Iris ranks pending tasks by **priority band** before per-user fairness. Three
bands exist (defined in
[`job.proto`](../src/iris/rpc/job.proto)): `PRODUCTION`, `INTERACTIVE`, and
`BATCH`. Choose the right band for what you are running — picking the wrong one
either delays your work or disrupts other people's.

| Band | Selected via | Behavior |
|---|---|---|
| `PRODUCTION` | `--priority production` | Always scheduled before lower bands. Can preempt INTERACTIVE/BATCH. Never downgraded by the budget system. |
| `INTERACTIVE` | default (or `--priority interactive`) | Normal work. Yields to PRODUCTION; preempts BATCH. |
| `BATCH` | `--priority batch` | Opportunistic. Yields to anything else. Safe to launch in bulk. |
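
The ordering and preemption rules in the table can be sketched in a few lines of Python. This is an illustration only, not the Iris scheduler: the `Band` enum values, the `PendingTask` type, and the `user_usage` fairness key are hypothetical stand-ins, and the real definitions live in `job.proto` and the controller.

```python
from enum import IntEnum
from typing import NamedTuple


class Band(IntEnum):
    # Lower value = scheduled earlier. These numbers are illustrative
    # and are NOT the enum values defined in job.proto.
    PRODUCTION = 1
    INTERACTIVE = 2
    BATCH = 3


class PendingTask(NamedTuple):
    band: Band
    user_usage: float  # hypothetical fairness key: lower usage goes first
    task_id: str


def can_preempt(incoming: Band, running: Band) -> bool:
    """A task may preempt only tasks in a strictly lower-priority band."""
    return incoming < running


def schedule_order(pending: list[PendingTask]) -> list[PendingTask]:
    """Sort by band first, then by the per-user fairness key within a band."""
    return sorted(pending, key=lambda t: (t.band, t.user_usage))


pending = [
    PendingTask(Band.BATCH, 0.0, "sweep-1"),
    PendingTask(Band.INTERACTIVE, 5.0, "train-a"),
    PendingTask(Band.PRODUCTION, 9.0, "release"),
    PendingTask(Band.INTERACTIVE, 1.0, "eval-b"),
]
ordered_ids = [t.task_id for t in schedule_order(pending)]
```

Under these assumptions the PRODUCTION task sorts first even though its owner has the highest usage, because the band is compared before the fairness key.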

## When to use each band

### PRODUCTION

Use **only** for work that has been discussed at a weekly meeting or directly
with the PI (Percy) as high priority for the whole org and blocked on compute.
For Stanford folks: equivalent to `sphinx` queues on the NLP cluster.

Submitting to PRODUCTION without a prior conversation is antisocial — you are
preempting other researchers' running jobs.

### INTERACTIVE

The default band. Use for everyday research: training runs, ad-hoc evaluation,
debugging, single-shot experiments. Most jobs belong here.

### BATCH

Use for work you are happy to have preempted by anyone else. Equivalent to
`sc-loprio` on the NLP cluster. Good candidates:

- Hyperparameter sweeps
- Batch inference / offline evaluation
- Large fan-out experiments where any individual run can be retried
- Anything you want to run *a lot* of without crowding out the cluster

BATCH is the polite choice when you don't strictly need a result soon.

## How band selection interacts with budgets

Per-user budget tracking lives in
[`controller/budget.py`](../src/iris/cluster/controller/budget.py). When a user
exceeds their budget, INTERACTIVE submissions are silently downgraded to BATCH.
PRODUCTION is exempt — another reason to reserve it for vetted work.
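
The downgrade rule above amounts to a small decision function. The sketch below is a hypothetical illustration and does not reproduce the code in `controller/budget.py`: the function name, the string band names, and the `spent`/`budget` parameters are assumptions made for the example.

```python
def effective_band(requested: str, spent: float, budget: float) -> str:
    """Return the band a submission would actually run in.

    Assumed rule, taken from the prose above: only INTERACTIVE is
    downgraded when the user is over budget; PRODUCTION is exempt and
    BATCH is already the lowest band.
    """
    over_budget = spent > budget
    if requested == "interactive" and over_budget:
        return "batch"
    return requested


# Hypothetical numbers: a user who has spent 120 units of a 100-unit budget.
downgraded = effective_band("interactive", spent=120.0, budget=100.0)
exempt = effective_band("production", spent=120.0, budget=100.0)
```

Because the downgrade is silent, a job submitted as INTERACTIVE may behave like a BATCH job (including being preempted) without any change to the submission command.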

## See also

- [`task-states.md`](task-states.md) — how preemption surfaces in task state
- [`OPS.md`](../OPS.md) — operator-side scheduler inspection