missing information about H100 not being in long partition #286

@gyom

There was a short conversation on Slack today in which Olexa noticed two errors in a student's sbatch script. The first error is irrelevant here, but the second error came from that student not being aware that H100s are not found in the long partition.

The documentation does not mention this restriction anywhere.

Olexa told me that it's mentioned in this announcement: https://mila-umontreal.slack.com/archives/CFAS8455H/p1727895577891089

He also went fishing into /etc/slurm/slurm.conf to confirm that the H100s are not in long, which isn't something that we should expect students to do.

Partitions for the Mila cluster should be described in more detail in our documentation, so that someone reading it can better understand their roles without having to interact with the Slurm cluster. This is particularly important in the context of LLMs ingesting our documentation but not having access to Slurm (yet?).

We don't need too much detail, but we need at least the equivalent of an index card with a quick summary of how to decide which partition to use and which GPUs are available in each.
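As a rough idea of how the "which GPUs are in which partition" part of that index card could be generated rather than maintained by hand, here is a minimal sketch. It assumes the input is the text produced by `sinfo -h -o "%P|%G"` (partition name and GRES, separated by `|`). The function name `gpus_by_partition` and the sample data are hypothetical and only illustrate the shape of the output; they are not the actual Mila configuration.

```python
import re
from collections import defaultdict


def gpus_by_partition(sinfo_text):
    """Map each partition name to the sorted GPU types found in its GRES column.

    Expects lines of the form "<partition>|<gres>", as printed by
    `sinfo -h -o "%P|%G"` (assumed input format).
    """
    summary = defaultdict(set)
    for line in sinfo_text.strip().splitlines():
        partition, _, gres = line.partition("|")
        # sinfo marks the default partition with a trailing '*'
        partition = partition.strip().rstrip("*")
        summary[partition]  # make sure partitions without GPUs still appear
        # Typed GPU entries look like "gpu:<type>:<count>", possibly with a suffix
        for match in re.finditer(r"gpu:([A-Za-z0-9_.-]+):\d+", gres):
            summary[partition].add(match.group(1))
    return {name: sorted(types) for name, types in summary.items()}


# Hypothetical sample, NOT the real cluster layout
SAMPLE = """\
main*|gpu:h100:8
main*|gpu:a100:4(S:0-1)
long|gpu:a100:4(S:0-1)
long|gpu:rtx8000:8
cpu|(null)
"""

if __name__ == "__main__":
    for name, gpu_types in gpus_by_partition(SAMPLE).items():
        print(f"{name}: {', '.join(gpu_types) or 'no GPUs'}")
```

The output of something like this could be pasted into the documentation as a small table, which would also make it readable by anyone (or any tool) without access to Slurm.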
