Collaborator

@lebrice lebrice commented Feb 11, 2025

This is a proof-of-concept.

Context: (from Slack chat)

Say you want to run 100 jobs on a SLURM cluster.
Then you find out there is a bug, and the whole sweep has to be redone.
That's a lot of wasted time, energy, and compute.
Assume you have a simple sanity check (one epoch of training, an automated test suite run with pytest, or similar) that you'd usually run yourself in an interactive job before launching a sweep, to make sure everything works correctly.
What if a first job ran this sanity check / test suite, and all the jobs in the sweep only ran if that first job succeeded?
(This would use the --dependency feature of sbatch.)
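
A minimal sketch of that submission pattern with plain sbatch, which is not necessarily how the launcher in this PR does it. The script names (`sanity_check.sh`, `sweep_job.sh`), the helper functions, and the use of a job array for the sweep are assumptions made for illustration; `--parsable`, `--dependency=afterok:<job_id>`, `--kill-on-invalid-dep` and `--array` are standard sbatch options.

```python
from __future__ import annotations

import subprocess


def sanity_check_command(script: str) -> list[str]:
    """Build the sbatch command for the job that gates the sweep."""
    # --parsable makes sbatch print only "<job_id>" or "<job_id>;<cluster>".
    return ["sbatch", "--parsable", script]


def parse_job_id(sbatch_output: str) -> str:
    """Extract the job id from `sbatch --parsable` output."""
    return sbatch_output.strip().split(";")[0]


def sweep_command(script: str, gate_job_id: str, num_jobs: int) -> list[str]:
    """Build the sbatch command for a job array that waits on the gate job."""
    return [
        "sbatch",
        "--parsable",
        # afterok: start only if the gate job finished with exit code 0.
        f"--dependency=afterok:{gate_job_id}",
        # Cancel the sweep rather than leaving it pending if the gate job fails.
        "--kill-on-invalid-dep=yes",
        f"--array=0-{num_jobs - 1}",
        script,
    ]


def submit_gated_sweep(check_script: str, sweep_script: str, num_jobs: int) -> str:
    """Submit both jobs back to back (requires sbatch on the PATH)."""
    out = subprocess.run(
        sanity_check_command(check_script),
        check=True, capture_output=True, text=True,
    ).stdout
    gate_job_id = parse_job_id(out)
    out = subprocess.run(
        sweep_command(sweep_script, gate_job_id, num_jobs),
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_job_id(out)
```

On a login node, something like `submit_gated_sweep("sanity_check.sh", "sweep_job.sh", 100)` would queue everything in one go; the sweep stays pending until the sanity check exits successfully.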
On DRAC this would be super useful IMO, because you'd get to submit all the jobs at once, instead of waiting through a round trip of multiple days between submitting the first job and submitting the rest.
Actually, I suspect people don't do much debugging of jobs on DRAC; they mostly launch batches of jobs and hope that everything goes well. At least that's my feeling.

This example uses pytest -x -v as the pre-sweep command.
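
For a dependency like this to gate anything, the pre-sweep job has to exit with a non-zero code when the check fails. pytest already behaves that way (it exits with 0 only when all collected tests pass, and -x stops at the first failure), so the job only needs to pass that return code through. A rough sketch, where `run_pre_sweep_check` and `PRE_SWEEP_COMMAND` are hypothetical names and not part of this PR:

```python
from __future__ import annotations

import subprocess
import sys
from typing import Sequence

# The pre-sweep command from this example, run with the current interpreter.
PRE_SWEEP_COMMAND = [sys.executable, "-m", "pytest", "-x", "-v"]


def run_pre_sweep_check(command: Sequence[str]) -> int:
    """Run the pre-sweep command and return its exit code unchanged."""
    completed = subprocess.run(list(command))
    return completed.returncode
```

Ending the job script with `sys.exit(run_pre_sweep_check(PRE_SWEEP_COMMAND))` would make the job's exit code match pytest's, so a failing test leaves the job in a failed state and the dependent sweep jobs never start.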

Runnable like so:

python project/main.py --multirun experiment=example name=job_deps resources=gpu hydra/launcher=submitit_test_dep

Signed-off-by: Fabrice Normandin <[email protected]>