Feat/mig: Multi Instance GPU (MIG) #61

Open: venkata22a wants to merge 4 commits into pmady:main from venkata22a:feat/mig
Conversation

venkata22a (Contributor) commented Jun 18, 2026

What type of PR is this?

  • Bug fix
  • New feature
  • Documentation
  • Refactoring
  • CI/Build

Description

Adds support for MIG (Multi-Instance GPU) metrics across the environments.

Related Issue

#26

Checklist

  • Tests added/updated
  • Documentation updated (if applicable)
  • make test passes
  • make lint passes
  • Commits are signed off (git commit -s)

Signed-off-by: Venkat <venkata22a@gmail.com>
Signed-off-by: Venkat <venkata22a@gmail.com>
@venkata22a venkata22a requested a review from pmady as a code owner June 18, 2026 23:09
@venkata22a venkata22a changed the title Feat/mig Feat/mig: Multi Instance GPU (MIG) Jun 18, 2026
Signed-off-by: Venkat <venkata22a@gmail.com>

pmady (Owner) left a comment

nice work on the MIG support, solid feature. couple things to address before merge:

  1. MIGUUIDs() is copy-pasted identically in both pkg/slurm/slurm.go and pkg/flux/flux.go — same loop, same strings.HasPrefix(p, "MIG-") check. can you pull that into a shared helper? maybe something like gpu.ParseMIGUUIDs(csv string) []string or a util in pkg/env.

  2. the new MIG fields on Metrics struct (IsMIGInstance, ParentIndex, MigProfile) don't have json tags. the rest of the struct uses explicit json tags — these should too for consistency and to match the csv output field names. something like:

         IsMIGInstance bool   `json:"is_mig_instance"`
         ParentIndex   int    `json:"parent_index"`
         MigProfile    string `json:"mig_profile,omitempty"`

  3. CollectByUUID hardcodes instanceIdx: 0 when the real index isn't known from a UUID lookup. worth adding a comment explaining this, or see if GetIndex() works on MIG device handles to get the actual instance index.

  4. this PR has a merge commit from main that includes the cross-env code from #59. should we merge #59 first and rebase this on top? cleaner git history.

overall looks good — the physical metrics sharing approach is the right call and the test coverage on aggregation across MIG instances is solid.
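
for reference, a small sketch of what the suggested json tags would produce when marshalled. the migFields type is a stand-in with only the three MIG fields (the real Metrics struct has more), and the profile value "1g.10gb" is just sample data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// migFields is a hypothetical stand-in for the MIG part of the Metrics
// struct, using the tags proposed in the review.
type migFields struct {
	IsMIGInstance bool   `json:"is_mig_instance"`
	ParentIndex   int    `json:"parent_index"`
	MigProfile    string `json:"mig_profile,omitempty"`
}

// encode marshals the fields to a JSON string, returning "" on error.
func encode(m migFields) string {
	b, err := json.Marshal(m)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// A MIG instance: all three fields appear under snake_case keys.
	fmt.Println(encode(migFields{IsMIGInstance: true, ParentIndex: 1, MigProfile: "1g.10gb"}))
	// A zero value: omitempty drops only mig_profile.
	fmt.Println(encode(migFields{}))
}
```

note that with these tags only mig_profile is omitted when empty; is_mig_instance and parent_index are always emitted, so a non-MIG device would still show parent_index 0.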
