Feat/mig: Multi Instance GPU (MIG)#61
Conversation
Signed-off-by: Venkat <venkata22a@gmail.com>
Signed-off-by: Venkat <venkata22a@gmail.com>
Signed-off-by: Venkat <venkata22a@gmail.com>
pmady
left a comment
There was a problem hiding this comment.
nice work on the MIG support, solid feature. couple things to address before merge:
-
MIGUUIDs()is copy-pasted identically in bothpkg/slurm/slurm.goandpkg/flux/flux.go— same loop, samestrings.HasPrefix(p, "MIG-")check. can you pull that into a shared helper? maybe something likegpu.ParseMIGUUIDs(csv string) []stringor a util inpkg/env. -
the new MIG fields on
Metricsstruct (IsMIGInstance,ParentIndex,MigProfile) don't have json tags. the rest of the struct uses explicit json tags — these should too for consistency and to match the csv output field names. something like:
IsMIGInstance bool `json:"is_mig_instance"`
ParentIndex int `json:"parent_index"`
MigProfile string `json:"mig_profile,omitempty"`-
CollectByUUIDhardcodesinstanceIdx: 0when the real index isn't known from a UUID lookup. worth adding a comment explaining this, or see ifGetIndex()works on MIG device handles to get the actual instance index. -
this PR has a merge commit from main that includes the cross-env code from #59. should we merge #59 first and rebase this on top? cleaner git history.
overall looks good — the physical metrics sharing approach is the right call and the test coverage on aggregation across MIG instances is solid.
What type of PR is this?
Description
Support for MIG Metrics across the environments.
Related Issue
#26
Checklist
make testpassesmake lintpassesgit commit -s)