Contact Details
No response
What happened?
The E2E test on OpenShift failed for #420 because the assumption identified below was not actually true at that time and place.
The E2E test suite currently assumes (in the "Same-Node Port Collision Creates New Launcher" test case) that it can use 2 GPUs on the node where the server-requesting Pod is initially assigned. That is simply not always true.
It would be great if the test case could detect that situation and react in some sensible manner. It would be good if the debug dumping steps at the end of the job included one that exposed this situation.
This is really just a special case of the larger problem that GPU availability is dynamic in the shared cluster. Every test step that requires a GPU to be allocated is making an assumption of GPU availability that might not be true.
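One possible shape for the detection (and for an extra debug-dump step) is sketched below. It is only an illustration under stated assumptions, not the test suite's actual code: it assumes GPUs are exposed under the resource name `nvidia.com/gpu`, that GPU use is fully reflected in Pod resource requests, and that the inputs are the parsed outputs of `kubectl get nodes -o json` and `kubectl get pods --all-namespaces -o json`. The function names are hypothetical, and init containers are ignored for simplicity.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # assumption: resource name used by the cluster


def free_gpus_per_node(nodes, pods, resource=GPU_RESOURCE):
    """Return {node name: allocatable GPUs minus GPUs requested by live Pods}.

    `nodes` and `pods` are dicts parsed from `kubectl get ... -o json` output.
    """
    free = {}
    for node in nodes["items"]:
        allocatable = node["status"].get("allocatable", {})
        free[node["metadata"]["name"]] = int(allocatable.get(resource, 0))

    for pod in pods["items"]:
        node_name = pod["spec"].get("nodeName")
        phase = pod.get("status", {}).get("phase")
        # Unscheduled Pods and finished Pods hold no GPUs.
        if node_name not in free or phase in ("Succeeded", "Failed"):
            continue
        for container in pod["spec"].get("containers", []):
            requests = container.get("resources", {}).get("requests", {})
            free[node_name] -= int(requests.get(resource, 0))
    return free


def node_has_gpus(free, node_name, needed=2):
    """True if `node_name` currently has at least `needed` unrequested GPUs."""
    return free.get(node_name, 0) >= needed
```

A test case could call `node_has_gpus` for the node where the server-requesting Pod landed and skip or fail with a clear message when it returns false. Printing the whole `free` mapping at the end of the job would expose the situation in the debug output. Because availability is dynamic, such a check can only narrow the race window, not close it.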
Version
main (please specify commit below)
Branch name
No response
Commit SHA
No response
Relevant log output