Commit d5ae483

feat(nvidia-smi): add GPU count success criteria
This change adds a success criterion to the `nvidia-smi` modifier to verify the number of GPUs against a variable. A new modifier variable `gpus_per_node` is introduced, which defaults to 8. A new Figure of Merit (FOM) named "GPU Count" is added to count the number of GPUs by counting the lines in the `nvidia-smi` output. A new success criterion named "gpu_count_check" is added to check if the "GPU Count" FOM matches the `gpus_per_node` variable.
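The check described above can be sketched in plain Python. This is only an illustration of the regex and the comparison, not Ramble's actual evaluation code: the helper names `extract_gpu_count` and `gpu_count_check` are hypothetical, and the sample log rows are made-up placeholders rather than real `nvidia-smi` output.

```python
import re

# Same pattern as the fom_regex used by the "GPU Count" figure of merit.
GPU_COUNT_RE = re.compile(r"GPU Count: (?P<gpu_count>\d+)")


def extract_gpu_count(log_text):
    """Return the last 'GPU Count' value found in log_text, or None if absent."""
    matches = GPU_COUNT_RE.findall(log_text)
    return int(matches[-1]) if matches else None


def gpu_count_check(log_text, gpus_per_node=8):
    """Hypothetical stand-in for the '{value} == {gpus_per_node}' formula."""
    value = extract_gpu_count(log_text)
    return value is not None and value == gpus_per_node


if __name__ == "__main__":
    # Placeholder rows: one line per GPU, followed by the appended count line.
    rows = [f"{i}, placeholder-gpu" for i in range(8)]
    sample_log = "\n".join(rows) + "\nGPU Count: 8\n"
    print(extract_gpu_count(sample_log))                 # 8
    print(gpu_count_check(sample_log))                   # True
    print(gpu_count_check(sample_log, gpus_per_node=4))  # False
```

With the default of 8, a node that reports any other number of GPUs fails the criterion, so experiments on nodes with a different GPU count would need to override `gpus_per_node`.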
1 parent 91fa393 commit d5ae483

1 file changed, +23 -1 lines changed
var/ramble/repos/builtin/modifiers/nvidia-smi/modifier.py

Lines changed: 23 additions & 1 deletion
@@ -24,6 +24,12 @@ class NvidiaSmi(BasicModifier):

     maintainers("samskillman")

+    variable(
+        "gpus_per_node",
+        default=8,
+        description="The number of GPUs per node.",
+    )
+
     mode("standard", description="Standard execution mode for nvidia-smi")
     default_mode("standard")

@@ -148,9 +154,25 @@ class NvidiaSmi(BasicModifier):
         log_file="{nvidia_smi_log}",
     )

+    figure_of_merit(
+        "GPU Count",
+        fom_regex=r"GPU Count: (?P<gpu_count>\d+)",
+        group_name="gpu_count",
+        units="",
+        log_file="{nvidia_smi_log}",
+    )
+
+    success_criteria(
+        "gpu_count_check",
+        mode="fom_comparison",
+        fom_name="GPU Count",
+        formula="{value} == {gpus_per_node}",
+    )
+
     register_builtin("nvidia_smi_exec")

     def nvidia_smi_exec(self):
         return [
-            "nvidia-smi --query-gpu=index,name,driver_version,pstate,pci.bus_id,serial,uuid,power.draw,power.limit,clocks.gr,clocks.mem --format=csv,noheader,nounits >> {nvidia_smi_log}"
+            "nvidia-smi --query-gpu=index,name,driver_version,pstate,pci.bus_id,serial,uuid,power.draw,power.limit,clocks.gr,clocks.mem --format=csv,noheader,nounits > {nvidia_smi_log}",
+            'echo "GPU Count: $(wc -l < {nvidia_smi_log})" >> {nvidia_smi_log}',
         ]
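
The diff also switches the `nvidia-smi` redirect from `>>` to `>`. The commit message does not explain why, but one plausible reason is that `wc -l` counts every line in the log, so the count only equals the number of GPUs if the log holds a single run's output. The sketch below models the two shell commands in Python under that assumption. The helpers `run_builtin` and `simulate` are hypothetical, and the rows are placeholders, not real `nvidia-smi` output.

```python
import os
import tempfile


def run_builtin(log_path, gpu_rows, truncate):
    """Mimic the two shell commands; truncate=True models '>', False models '>>'."""
    mode = "w" if truncate else "a"
    with open(log_path, mode) as log:
        for row in gpu_rows:
            log.write(row + "\n")
    # Rough equivalent of `wc -l < log`: count the lines in the whole file.
    with open(log_path) as log:
        line_count = sum(1 for _ in log)
    # Equivalent of the `echo "GPU Count: ..." >> log` command.
    with open(log_path, "a") as log:
        log.write(f"GPU Count: {line_count}\n")
    return line_count


def simulate(truncate, runs=2, gpus=8):
    """Run the builtin several times against one log and return each reported count."""
    rows = [f"{i}, placeholder-gpu" for i in range(gpus)]
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "nvidia_smi.log")
        return [run_builtin(log_path, rows, truncate) for _ in range(runs)]


if __name__ == "__main__":
    print(simulate(truncate=True))   # [8, 8]
    print(simulate(truncate=False))  # [8, 17]
```

In this model, appending makes the second run count the first run's rows plus its own `GPU Count` line, which would fail the `gpu_count_check` criterion even on a healthy node. The trade-off of truncating is that earlier `nvidia-smi` output in the same log is discarded on each run.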
