
[SARC-395] Adjust the gpu->rgu conversion function to support different versions over time. #155


Merged
merged 3 commits into from
Jun 3, 2025


notoraptor
Contributor

  • Read GPU billing from database, not from config anymore
  • Add dependency iguane to get GPU->RGU values
  • Add new client function get_rgus()
  • Move series function into client: update_job_series_rgu()
  • update_job_series_rgu(): take into account the evolution of GPU billing across time and the type of GPU billing (billing_is_gpu) on each cluster
  • load_job_series(): make sure user columns are included only if the job user column is included in the data frame.
  • tests: allow to create entries for all testing clusters: read cluster names from sarc-test.json
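
As a rough illustration of what "evolution of GPU billing across time" could involve, here is a minimal sketch of a time-versioned lookup. It is not the code from this PR: the `GPU_BILLING_HISTORY` list, its dates and weights, and the `gpu_billing_at()` helper are all hypothetical names and values chosen for the example.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical history of GPU billing weights for one cluster, sorted by the
# date each weight took effect. Dates and values are illustrative only.
GPU_BILLING_HISTORY = [
    (datetime(2023, 1, 1), 1000.0),
    (datetime(2024, 4, 1), 2200.0),
]


def gpu_billing_at(history, start_time):
    """Return the weight in effect at start_time, or None if none applies yet."""
    dates = [since for since, _ in history]
    # Index of the last entry whose effective date is <= start_time.
    idx = bisect_right(dates, start_time) - 1
    if idx < 0:
        return None
    return history[idx][1]


print(gpu_billing_at(GPU_BILLING_HISTORY, datetime(2024, 5, 23)))
```

The sketch uses naive datetimes for brevity; real job records carry a UTC offset, so both sides of the comparison would need to be timezone-aware.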

@notoraptor
Contributor Author

@bouthilx Here is a PR to finish RGU handling!

I did make one small change relative to the reference document, however. In the reference document ("GPU vs RGU"), the following calculation was planned for computing RGUs on DRAC:

DRAC

allocated.gres_gpu: Scaled number of RGUs

nb_rgu = allocated.gres_gpu / config[grappe].scaling_rgu(allocated.start_time)

allocated.gres_rgu = nb_rgu

allocated.gres_gpu = nb_rgu / IGUANE[allocated.gpu_type]

However, looking at real jobs, it seems to me that allocated.gres_gpu / config[grappe].scaling_rgu(allocated.start_time) directly returns the number of GPUs, not the number of RGUs.

For example, I have jobs like this one:

 {
  "cluster_name": "beluga",
  "job_id": 47622739,
  "job_state": "CANCELLED",
  "exit_code": 0,
  "partition": "gpubase_bynode_b1",
  "nodes": [
   "bg12106",
   "bg12107",
   "bg12108",
   "bg12113"
  ],
  "submit_time": "2024-05-23 19:15:55-04:00",
  "start_time": "2024-05-23 19:15:57-04:00",
  "end_time": "2024-05-23 19:55:28-04:00",
  "elapsed_time": 2371,
  "requested": {
   "cpu": 160,
   "mem": 737280,
   "node": 4,
   "billing": 35555,
   "gres_gpu": 16,
   "gpu_type": null
  },
  "allocated": {
   "cpu": 160,
   "mem": 737280,
   "node": 4,
   "billing": 35555,
   "gres_gpu": 16,
   "gpu_type": "Tesla V100-SXM2-16GB"
  }
 },

The billing here is 35555, the GPU type is "Tesla V100-SXM2-16GB", and the partition is gpubase_bynode_b1. If I look at the partition's properties in beluga's Slurm config file, I find this:

PartitionName=gpubase_bynode_b1 MaxTime=3:00:00 Default=no MinNodes=1 AllowGroups=ALL 
PriorityJobFactor=11 DisableRootJobs=YES RootOnly=NO Hidden=NO OverSubscribe=NO GraceTime=0 PreemptMode=OFF
PriorityTier=10 ReqResv=NO DefMemPerCPU=256 AllowAccounts=ALL AllowQos=ALL 
Nodes=bg[11201-11214,11301-11313,11401-11414,11501-11513,11601-11614,11701-11713,11801-11814,11901-11913,12001-12014,12101-12113,12201-12214,12301-12313,12401-12410] 
TRESBillingWeights=CPU=222.22,Mem=47.62G,GRES/gpu=2200.0 DefaultTime=1:00:00 ExclusiveUser=NO

This gives me GRES/gpu=2200.0. So if I compute 35555 / 2200.0, I get 16.161363636363635, i.e. about 16, which matches the "gres_gpu": 16 in the job's allocated section.

PS: As a reminder, allocated.gres_gpu takes the billing value in the job series.
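
The division above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the billing value and the TRESBillingWeights string quoted in this comment; the `parse_tres_billing_weights()` helper is a hypothetical name, not a function from the project or from Slurm.

```python
def parse_tres_billing_weights(spec):
    """Split 'CPU=222.22,Mem=47.62G,GRES/gpu=2200.0' into a dict of raw strings."""
    return dict(item.split("=", 1) for item in spec.split(","))


# Values copied from the job and partition shown above.
weights = parse_tres_billing_weights("CPU=222.22,Mem=47.62G,GRES/gpu=2200.0")
gpu_weight = float(weights["GRES/gpu"])
billing = 35555

# Billing divided by the per-GPU weight gives an approximate GPU count.
estimated_gpus = billing / gpu_weight
print(estimated_gpus, round(estimated_gpus))
```

The result is slightly above 16 rather than exactly 16, so any real implementation would have to decide whether and how to round.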

So I replaced:

DRAC

allocated.gres_gpu: Scaled number of RGUs

nb_rgu = allocated.gres_gpu / config[grappe].scaling_rgu(allocated.start_time)

allocated.gres_rgu = nb_rgu

allocated.gres_gpu = nb_rgu / IGUANE[allocated.gpu_type]

With:

DRAC

allocated.gres_gpu: Scaled number of RGUs

nb_gpu = allocated.gres_gpu / config[grappe].scaling_rgu(allocated.start_time)

allocated.gres_rgu = nb_gpu * IGUANE[allocated.gpu_type]

allocated.gres_gpu = nb_gpu
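
The revised calculation can be sketched as a small Python function. This is an illustration of the pseudocode above, not the code merged in this PR: `billing_to_gpu_and_rgu()` and its parameters are hypothetical names, and the RGU-per-GPU ratio in the usage lines is a placeholder rather than a real IGUANE value.

```python
def billing_to_gpu_and_rgu(billing, gpu_billing_weight, gpu_type, rgu_per_gpu):
    """Convert a billing value into (nb_gpu, nb_rgu).

    billing            -- the billing value stored in allocated.gres_gpu
    gpu_billing_weight -- GPU billing weight in effect at the job's start time
    gpu_type           -- the job's allocated.gpu_type
    rgu_per_gpu        -- mapping from GPU type to its RGU ratio (stand-in for IGUANE)
    """
    # Dividing by the billing weight yields a GPU count, not an RGU count.
    nb_gpu = billing / gpu_billing_weight
    # RGUs are then obtained by multiplying by the per-GPU ratio.
    nb_rgu = nb_gpu * rgu_per_gpu[gpu_type]
    return nb_gpu, nb_rgu


# Placeholder ratio for illustration only; not taken from IGUANE.
placeholder_ratios = {"Tesla V100-SXM2-16GB": 2.0}
nb_gpu, nb_rgu = billing_to_gpu_and_rgu(
    35555, 2200.0, "Tesla V100-SXM2-16GB", placeholder_ratios
)
print(nb_gpu, nb_rgu)
```

The only difference from the reference document's version is the order of operations: the GPU count comes first, and the RGU count is derived from it by multiplication instead of the GPU count being derived from the RGU count by division.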

@notoraptor notoraptor force-pushed the sarc-395-update-job-series-rgu branch from 23a7777 to ab9ad53 Compare April 4, 2025 13:14
@notoraptor
Contributor Author

PS: This PR modifies sacct.py, as does PR #162. If the latter is merged first, this branch will need to be rebased and the code here adjusted.

@abergeron
Collaborator

I've rebased this onto master to make the change from poetry to uv.

@abergeron abergeron force-pushed the sarc-395-update-job-series-rgu branch from fdf4659 to fa2a17a Compare April 24, 2025 19:01
@soline-b soline-b self-requested a review May 16, 2025 13:30
Collaborator

@soline-b soline-b left a comment

I approve the code changes themselves, but since this is an old PR, there are now some conflicts.

…different versions over time.

- Read GPU billing from database, not from config anymore
- Add dependency `iguane` to get GPU->RGU values
- Add new client function get_rgus()
- Move series function into client: update_job_series_rgu()
- update_job_series_rgu(): take into account the evolution of GPU billing across time and the type of GPU billing (billing_is_gpu) on each cluster
- load_job_series(): make sure user columns are included only if the job `user` column is included in the data frame.
- tests: allow to create entries for all testing clusters: read cluster names from sarc-test.json
- harmonize names of billed GPUs
- get GPU nodes as a list instead of a string, as some nodes may have many GPUs (e.g. MIG GPUs)

Improve RGU function to handle harmonized names of MIG GPUs.

Improve update_allocated_gpu_type():
- check default allocated.gpu_type if a single gpu_type cannot be inferred from nodes
- harmonize GPU name using __DEFAULTS__ if available even if job does not have nodes
@notoraptor notoraptor force-pushed the sarc-395-update-job-series-rgu branch from fa2a17a to e626825 Compare June 2, 2025 20:06
- Remove now unused cluster config fields `rgu_start_date` and `gpu_to_rgu_billing` from YAML files
- Remove unused imports.
- Update uv files
- Fix missing changes after rebase
- Update unit tests
@notoraptor notoraptor force-pushed the sarc-395-update-job-series-rgu branch from 24400ca to b255fba Compare June 2, 2025 20:45
@nurbal nurbal merged commit d979aec into master Jun 3, 2025
6 checks passed
@nurbal nurbal deleted the sarc-395-update-job-series-rgu branch June 3, 2025 21:54
4 participants