Description
There was a short conversation on Slack today in which Olexa noticed two errors in a student's sbatch script. The first error is irrelevant here, but the second error came from that student not being aware that H100s are not found in the long partition.
There is no description of that in the documentation.
Olexa told me that it's mentioned in an announcement here: https://mila-umontreal.slack.com/archives/CFAS8455H/p1727895577891089 .
He also had to go digging through /etc/slurm/slurm.conf to confirm that the H100s are not in long, which isn't something that we should expect students to do.
Partitions for the Mila cluster should be described in more detail in our documentation, so that someone reading it can better understand their roles without having to interact with the Slurm cluster. This is particularly important in the context of LLMs ingesting our documentation but not having access to Slurm (yet?).
We don't need too much detail, but we should have at least the equivalent of an index card: a quick summary of how to decide which partition to use and which GPUs are available in each.