Skip to content

feat: Allow different NVLink partitions for Instance create and update#225

Open
hwadekar-nv wants to merge 2 commits intomainfrom
feat/nvlink-partition-count-instance
Open

feat: Allow different NVLink partitions for Instance create and update#225
hwadekar-nv wants to merge 2 commits intomainfrom
feat/nvlink-partition-count-instance

Conversation

@hwadekar-nv
Copy link
Contributor

@hwadekar-nv hwadekar-nv commented Mar 10, 2026

Description

  • Subset of GPUs allowed — NVLink Interfaces no longer need to cover all GPUs; fewer than the total count is now valid
  • Different partitions per interface — Removed the same-NVLinkLogicalPartitionID requirement across all interfaces in create, batch-create, and update requests
  • Stricter DeviceInstance validation — Each GPU index must be unique and within the machine's actual GPU count range
  • Update handler duplicate check — Changed from partition-level to partition + deviceInstance composite key to detect true duplicates

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in Github workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • RLA - RLA service updated
  • Powershelf Manager - Powershelf Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@hwadekar-nv hwadekar-nv self-assigned this Mar 10, 2026
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@hwadekar-nv hwadekar-nv changed the title Allow different NVLink partitions for Instance create/update feat: Allow different NVLink partitions for Instance create/update Mar 10, 2026
@hwadekar-nv hwadekar-nv changed the title feat: Allow different NVLink partitions for Instance create/update feat: Allow different NVLink partitions for Instance create and update Mar 10, 2026
@hwadekar-nv hwadekar-nv force-pushed the feat/nvlink-partition-count-instance branch from 491e129 to d6e3f16 Compare March 10, 2026 18:57
@hwadekar-nv hwadekar-nv force-pushed the feat/nvlink-partition-count-instance branch from d6e3f16 to e02c47a Compare March 10, 2026 18:58
@github-actions
Copy link

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-10 18:59:11 UTC | Commit: e02c47a

@github-actions
Copy link

🛡️ Vulnerability Scan

🚨 Found 64 vulnerability(ies)
📊 vs main: 64 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 64
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-10 18:59:20 UTC | Commit: e02c47a

@github-actions
Copy link

github-actions bot commented Mar 10, 2026

Test Results

7 588 tests  +1   7 588 ✅ +1   8m 3s ⏱️ +6s
  129 suites ±0       0 💤 ±0 
   13 files   ±0       0 ❌ ±0 

Results for commit eda4a1b. ± Comparison against base commit 6b7415d.

This pull request removes 3 and adds 4 tests. Note that renamed tests count towards both.
github.com/nvidia/bare-metal-manager-rest/api/pkg/api/handler ‑ TestUpdateInstanceHandler_Handle/test_Instance_update_API_endpoint_failure_with_update_existing_NVLink_interfaces_with_same_NVLink_Logical_Partition_ID
github.com/nvidia/bare-metal-manager-rest/api/pkg/api/handler ‑ TestUpdateInstanceHandler_Handle/test_Instance_update_API_endpoint_success_with_NVLink_interface_update_
github.com/nvidia/bare-metal-manager-rest/api/pkg/api/model ‑ TestAPIInstanceCreateRequest_Validate/test_invalid_Instance_create_request,_NVLink_Interfaces_specified_but_all_the_NVLink_Interfaces_does_not_have_same_NVLink_Logical_Partition
github.com/nvidia/bare-metal-manager-rest/api/pkg/api/handler ‑ TestUpdateInstanceHandler_Handle/test_Instance_update_API_endpoint_failure_with_update_existing_NVLink_interfaces_with_same_NVLink_Logical_Partition_ID_and_Device_Instance
github.com/nvidia/bare-metal-manager-rest/api/pkg/api/handler ‑ TestUpdateInstanceHandler_Handle/test_Instance_update_API_endpoint_success_with_NVLink_interface_update_with_different_NVLink_Logical_Partition_IDs
github.com/nvidia/bare-metal-manager-rest/api/pkg/api/model ‑ TestAPIInstanceCreateRequest_Validate/test_invalid_Instance_create_request,_NVLink_Interfaces_specified_with_device_instance_out_of_range
github.com/nvidia/bare-metal-manager-rest/api/pkg/api/model ‑ TestAPIInstanceUpdateRequest_Validate/test_invalid_Instance_update_request,_NVLink_Interfaces_specified_with_device_instance_out_of_range

♻️ This comment has been updated with latest results.

logger.Warn().Msg(fmt.Sprintf("NVLink Logical Partition: %v specified in request does not match with Instance Site", nvllp.ID))
return cerr.NewAPIErrorResponse(c, http.StatusBadRequest, fmt.Sprintf("NVLink Logical Partition: %v specified in request does not match with Instance Site", nvllp.ID), nil)
}
// Validate each NVLink Logical Partition independently for site, tenant, and status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hwadekar-nv The goal wasn't to alter the behavior when default NVLink Logical Partition is present rather when there is no default. Based on the issue, we should resolve only the restrictions that produce the following error messages:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants