@rollandf rollandf commented Oct 21, 2024

When an RDMA device in exclusive mode was in use by a Pod, the device plugin (DP) did not report it as a resource after a DP restart.

The following changes are introduced in RdmaSpec:

  • isRdma: if no RDMA resources are found, check whether the netlink "enable_rdma" parameter is available.
  • GetRdmaDeviceSpec: the device specs are now retrieved dynamically rather than at the discovery stage as before.
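
The isRdma fallback described in the first bullet could look roughly like the sketch below. It is not the plugin's actual code: the `rdmaProvider` interface, its two methods, and `fakeProvider` are hypothetical names introduced only to keep the example self-contained, and the real lookups (RDMA resources on the host, the netlink "enable_rdma" parameter) are replaced by an in-memory fake.

```go
package main

// rdmaProvider abstracts the two host lookups mentioned in the description.
// The interface and both method names are hypothetical, not the plugin's API.
type rdmaProvider interface {
	// RdmaResources lists the RDMA resources currently visible on the host
	// for the given PCI address.
	RdmaResources(pciAddr string) []string
	// HasEnableRdmaParam reports whether the netlink "enable_rdma"
	// parameter is available for the device.
	HasEnableRdmaParam(pciAddr string) (bool, error)
}

// isRdma treats a device as RDMA-capable when its resources are visible or,
// failing that, when the "enable_rdma" parameter is available.
func isRdma(p rdmaProvider, pciAddr string) bool {
	if len(p.RdmaResources(pciAddr)) > 0 {
		return true
	}
	// Fallback for exclusive mode: a device allocated to a Pod has its
	// RDMA resources hidden from the host, so check the parameter instead.
	ok, err := p.HasEnableRdmaParam(pciAddr)
	return err == nil && ok
}

// fakeProvider is an in-memory stand-in for the host, for illustration only.
type fakeProvider struct {
	resources map[string][]string
	params    map[string]bool
}

func (f fakeProvider) RdmaResources(pciAddr string) []string {
	return f.resources[pciAddr]
}

func (f fakeProvider) HasEnableRdmaParam(pciAddr string) (bool, error) {
	return f.params[pciAddr], nil
}
```

With this shape, a device whose resources are hidden (because a Pod holds it) is still reported as RDMA-capable as long as the parameter lookup succeeds, and a lookup error is treated the same as "not RDMA".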

Computing the RDMA specs dynamically, rather than at discovery, solves the following scenario in exclusive mode:

  • Discover RDMA device
  • Allocate to Pod (resources are hidden on host)
  • Restart DP pod
  • Discovery
  • Deallocate
  • Reallocate
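
The scenario above can be simulated with a small sketch that contrasts a spec cached at discovery with one computed on demand. Everything here is an assumption made for illustration: the `host`, `cachedSpec`, and `dynamicSpec` types are hypothetical, device specs are reduced to plain path strings, and the device path is only an example value. Only the method name GetRdmaDeviceSpec comes from the description.

```go
package main

// host models which RDMA device paths are visible on the host,
// keyed by PCI address (hypothetical type for illustration).
type host struct {
	visible map[string][]string
}

func (h *host) rdmaPaths(pciAddr string) []string {
	return h.visible[pciAddr]
}

// cachedSpec snapshots the device paths once, at discovery time.
type cachedSpec struct {
	paths []string
}

func newCachedSpec(h *host, pciAddr string) *cachedSpec {
	return &cachedSpec{paths: h.rdmaPaths(pciAddr)}
}

// GetRdmaDeviceSpec returns whatever was visible at discovery.
func (c *cachedSpec) GetRdmaDeviceSpec() []string {
	return c.paths
}

// dynamicSpec looks the device paths up on every call.
type dynamicSpec struct {
	h       *host
	pciAddr string
}

// GetRdmaDeviceSpec returns what is visible on the host right now.
func (d *dynamicSpec) GetRdmaDeviceSpec() []string {
	return d.h.rdmaPaths(d.pciAddr)
}
```

If discovery runs while the device is still held by a Pod, the cached variant stays empty even after the device is deallocated, whereas the dynamic variant picks up the paths at the next allocation.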

Fixes #565

coveralls commented Oct 21, 2024

Pull Request Test Coverage Report for Build 11577952041

Details

  • 28 of 58 (48.28%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.6%) to 74.628%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/utils/utils.go | 0 | 6 | 0.0% |
| pkg/devices/rdma.go | 27 | 36 | 75.0% |
| pkg/utils/netlink_provider.go | 0 | 15 | 0.0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 11458515878: | -0.6% |
| Covered Lines: | 2109 |
| Relevant Lines: | 2826 |

💛 - Coveralls

rollandf commented Oct 22, 2024

@SchSeba @zeeke PTAL

Note that exclusive mode will be exposed in SRIOV-Network-Operator with this PR:
k8snetworkplumbingwg/sriov-network-operator#666

default:
return false
}
// Checking for netlink param for exclusive RDMA use case

Can you add more information here on why we need to check the netlink param in this case? (Requested by Sebastian.)

Done.
@SchSeba PTAL

When an RDMA device in exclusive mode was in use
by a Pod, the DP did not report it as a resource
after a DP restart.

The following changes are introduced in RdmaSpec:

- isRdma: if no RDMA resources are found,
  check whether netlink "enable_rdma" is available.
- GetRdmaDeviceSpec: the device specs are retrieved
  dynamically rather than at the discovery stage as before.

Computing the RDMA specs dynamically, rather than at
discovery, solves the following scenario in exclusive mode:
- Discover RDMA device
- Allocate to Pod (resources are hidden on host)
- Restart DP pod
- Deallocate
- Reallocate

Fixes k8snetworkplumbingwg#565

Signed-off-by: Fred Rolland <[email protected]>
@rollandf

@SchSeba can you PTAL?

@SchSeba SchSeba left a comment

I tested this and also added a functional test covering it in the operator: https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/799/files#diff-909069834ea269a01a51f28b8830efebec16799766767ebcd01b58f966ddc5c5R226 (a real mlx device is needed for the test to run).

@SchSeba SchSeba merged commit a380ca5 into k8snetworkplumbingwg:master Nov 3, 2024
10 of 11 checks passed
Development

Successfully merging this pull request may close these issues.

Capacity and Allocatable number shows wrong if sriov-network-device-plugin restarts

7 participants