-
Notifications
You must be signed in to change notification settings - Fork 19
Add support for autodetection of gres resources #181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have some concerns
README.md
Outdated
- `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string. | ||
- `file`: A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example. | ||
- `file`: Omit if `gres_autodetect` is set, A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `file`: Omit if `gres_autodetect` is set, A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example. | |
- `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example. |
or move the addition to the end of the item 🤷 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I've left it at the beginning, as I felt it was the most important bit of information.
Ready for review but merge #183 first (this PR targets that branch to avoid noise in diff) |
3608f48
to
e3f58ad
Compare
e3f58ad
to
1ca4a4e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments, but looks pretty good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the initial PR comment you say:
Adds support for setting the AutoDetection property on gres resources. This prevents the need to manually specify File in the gres dictionary. You can only use one auto-detection mechanism per node, otherwise slurm will complain (hence why it is a per partition option and not a per gres option).
Can you change that to
... (hence why it is a per nodegroup option and not a per gres option).
And add something like:
NB: autodetection requires rebuild of the OpenHPC packages - this is not provided by this role
@@ -59,9 +59,10 @@ unique set of homogenous nodes: | |||
`free --mebi` total * `openhpc_ram_multiplier`. | |||
* `ram_multiplier`: Optional. An override for the top-level definition | |||
`openhpc_ram_multiplier`. Has no effect if `ram_mb` is set. | |||
* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict must define: | |||
* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below) but you only need the define the `conf` key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below) but you only need the define the `conf` key. | |
* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. NB: The `gres` dictionary below is still required but only requires the `conf` key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also mention the requirement for recompliation here? Rather than just in the example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh or maybe given you've got a whole section on it, maybe just mention that section ("see section below") or something?
- `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string. | ||
- `file`: A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example. | ||
- `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example. | ||
Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This Note [GresTypes] ...
bit is in the wrong place I think isn't it - it applies to the entire gres
dict, not to the file
key? If you look at the rendered page?
{% else %} | ||
{% for gres in gres_list %} | ||
{% set gres_name, gres_type, _ = gres.conf.split(':') %} | ||
{% for hostlist in (inventory_group_hosts | hostlist_expression) %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to do something similar to slurm.conf now and provide a comma-sep string, and skip the loops?
{% set hostlist_string = inventory_group_hosts | hostlist_expression | join(',') %}
...
NodeName={{ hostlist_string }}
instead of the (2x) loops for this?
{% set gres_autodetect = nodegroup.gres_autodetect | default('off') %} | ||
{% set inventory_group_name = openhpc_cluster_name ~ '_' ~ nodegroup.name %} | ||
{% set inventory_group_hosts = groups.get(inventory_group_name, []) %} | ||
{% if gres_autodetect | default('off') != 'off' %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the default here? Its in l4, or am I missing something?
Adds support for setting the AutoDetection property on gres resources. This prevents the need to manually specify File in the gres dictionary. You can only use one auto-detection mechanism per node, otherwise slurm will complain (hence why it is a per partition option and not a per gres option).
Example: