Add support for autodetection of gres resources #181

Open: jovial wants to merge 3 commits into master

@jovial jovial commented Apr 23, 2025

Adds support for setting the AutoDetection property on gres resources. This prevents the need to manually specify File in the gres dictionary. You can only use one auto-detection mechanism per node, otherwise slurm will complain (hence why it is a per partition option and not a per gres option).

Example:

# group_vars/all/openhpc.yml

openhpc_nodegroups:
    - name: cpu
    - name: gpu
      gres_autodetect: nvml
      gres:
        - conf: "gpu:nvidia_h100_80gb_hbm3:2"
        - conf: "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2"
        - conf: "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6"
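
To make the `conf` format concrete, here is a minimal Python sketch (not part of the role) that splits the strings above into name, type and count. The helper name `parse_gres_conf` is hypothetical; the role itself does the split inside a Jinja template, and whether it validates the field count is not shown in this thread.

```python
def parse_gres_conf(conf: str) -> tuple[str, str, int]:
    """Split '<name>:<type>:<number>' into its three fields."""
    parts = conf.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected <name>:<type>:<number>, got {conf!r}")
    name, gres_type, number = parts
    # The type is an arbitrary string, so dots (as in MIG profiles) are fine.
    return name, gres_type, int(number)


if __name__ == "__main__":
    for conf in (
        "gpu:nvidia_h100_80gb_hbm3:2",
        "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2",
        "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6",
    ):
        print(parse_gres_conf(conf))
```

Note that a type containing a colon would break this format, which is one reason to keep the three-field requirement explicit.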

@jovial jovial requested a review from a team as a code owner April 23, 2025 17:07
@jovial jovial marked this pull request as draft April 23, 2025 20:20
@jovial jovial marked this pull request as ready for review April 24, 2025 09:04
Collaborator

@sjpb sjpb left a comment

Have some concerns

README.md Outdated
- `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
- `file`: A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
- `file`: Omit if `gres_autodetect` is set, A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
Collaborator

Suggested change
- `file`: Omit if `gres_autodetect` is set, A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
- `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.

or move the addition to the end of the item 🤷 ?

Contributor Author

Done. I've left it at the beginning, as I felt it was the most important bit of information.

@jovial jovial requested a review from sjpb April 28, 2025 08:39
@jovial jovial marked this pull request as draft May 8, 2025 14:18
@jovial jovial changed the base branch from master to feat/nodegroups May 8, 2025 14:27
@jovial jovial marked this pull request as ready for review May 8, 2025 15:59
Contributor Author

jovial commented May 8, 2025

Ready for review but merge #183 first (this PR targets that branch to avoid noise in diff)

Base automatically changed from feat/nodegroups to master May 13, 2025 08:19
@jovial jovial changed the base branch from master to feat/nodegroups-v2 May 16, 2025 12:48
@jovial jovial force-pushed the feature/gres-autodetect branch from 3608f48 to e3f58ad Compare May 16, 2025 12:54
@jovial jovial force-pushed the feature/gres-autodetect branch from e3f58ad to 1ca4a4e Compare May 16, 2025 13:24
Collaborator

@sjpb sjpb left a comment

Few comments, but looks pretty good to me.

Collaborator

In the initial PR comment you say:

Adds support for setting the AutoDetection property on gres resources. This prevents the need to manually specify File in the gres dictionary. You can only use one auto-detection mechanism per node, otherwise slurm will complain (hence why it is a per partition option and not a per gres option).

Can you change that to

... (hence why it is a per nodegroup option and not a per gres option).

And add something like:

NB: autodetection requires rebuild of the OpenHPC packages - this is not provided by this role

@@ -59,9 +59,10 @@ unique set of homogenous nodes:
`free --mebi` total * `openhpc_ram_multiplier`.
* `ram_multiplier`: Optional. An override for the top-level definition
`openhpc_ram_multiplier`. Has no effect if `ram_mb` is set.
* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict must define:
* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below) but you only need the define the `conf` key.
Collaborator

Suggested change
* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below) but you only need the define the `conf` key.
* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. NB: The `gres` dictionary below is still required but only requires the `conf` key.

Collaborator

Should we also mention the requirement for recompilation here, rather than just in the example?

Collaborator

Oh, or given you've got a whole section on it, maybe just point to that section ("see section below") or something?

- `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
- `file`: A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
- `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used.
Collaborator

I think this Note [GresTypes] ... bit is in the wrong place, isn't it? It applies to the entire gres dict, not to the file key, as you can see on the rendered page.
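
For reference, the README rules under discussion (every gres dict needs `conf`; `file` is omitted when `gres_autodetect` is set and required otherwise) can be sketched as a small check. This is only an illustration: `check_gres` is a hypothetical helper, nothing in this thread shows the role performing such validation, and treating `off` the same as unset is an assumption based on the template's `default('off')`.

```python
def check_gres(nodegroup: dict) -> list[str]:
    """Return a list of problems found in a nodegroup's gres definition."""
    problems = []
    # Assumption: 'off' means the same as leaving gres_autodetect unset.
    autodetect = nodegroup.get("gres_autodetect", "off") != "off"
    for index, gres in enumerate(nodegroup.get("gres", [])):
        if "conf" not in gres:
            problems.append(f"gres[{index}]: 'conf' is required")
        if autodetect and "file" in gres:
            problems.append(f"gres[{index}]: omit 'file' when gres_autodetect is set")
        if not autodetect and "file" not in gres:
            problems.append(f"gres[{index}]: 'file' is required without gres_autodetect")
    return problems


if __name__ == "__main__":
    gpu = {
        "name": "gpu",
        "gres_autodetect": "nvml",
        "gres": [{"conf": "gpu:nvidia_h100_80gb_hbm3:2"}],
    }
    print(check_gres(gpu))  # no problems expected for the PR's example
```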

{% else %}
{% for gres in gres_list %}
{% set gres_name, gres_type, _ = gres.conf.split(':') %}
{% for hostlist in (inventory_group_hosts | hostlist_expression) %}
Collaborator

Do we want to do something similar to slurm.conf now: provide a comma-separated string and skip the loops?

{% set hostlist_string = inventory_group_hosts | hostlist_expression | join(',') %}
...
NodeName={{ hostlist_string }}

instead of the (2x) loops for this?
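
To illustrate the difference, here is a rough Python sketch of the two rendering strategies. It is not the role's template: the function names are hypothetical, the input list stands in for whatever `hostlist_expression` returns, and the `NodeName=... Name=... Type=... File=...` line shape is an assumption based on gres.conf syntax rather than on the template body, which is not shown in full here.

```python
def render_looped(hostlists, gres_entries):
    """Emit one line per (gres, hostlist) pair, like a nested template loop."""
    lines = []
    for name, gres_type, device_file in gres_entries:
        for hostlist in hostlists:
            lines.append(
                f"NodeName={hostlist} Name={name} Type={gres_type} File={device_file}"
            )
    return lines


def render_joined(hostlists, gres_entries):
    """Emit one line per gres, with all hostlists joined by commas."""
    node_names = ",".join(hostlists)
    return [
        f"NodeName={node_names} Name={name} Type={gres_type} File={device_file}"
        for name, gres_type, device_file in gres_entries
    ]


if __name__ == "__main__":
    # Hypothetical hostlist expressions for one nodegroup.
    hostlists = ["gpu-[0-1]", "gpu-5"]
    gres_entries = [("gpu", "A100", "/dev/nvidia[0-1]")]
    print("\n".join(render_looped(hostlists, gres_entries)))
    print("\n".join(render_joined(hostlists, gres_entries)))
```

The joined form produces one line per gres instead of one per gres per hostlist, which keeps the generated file shorter when a nodegroup's hosts do not collapse into a single expression.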

{% set gres_autodetect = nodegroup.gres_autodetect | default('off') %}
{% set inventory_group_name = openhpc_cluster_name ~ '_' ~ nodegroup.name %}
{% set inventory_group_hosts = groups.get(inventory_group_name, []) %}
{% if gres_autodetect | default('off') != 'off' %}
Collaborator

Why do we need the default here? It's in l4, or am I missing something?

Base automatically changed from feat/nodegroups-v2 to master May 16, 2025 14:01