[Bug]: E2E test should be smarter about GPU saturation #422

@MikeSpreitzer

Description

Contact Details

No response

What happened?

The E2E test on OpenShift failed for #420 because the assumption identified below did not hold at that time and place.

The E2E test suite currently assumes (in the "Same-Node Port Collision Creates New Launcher" test case) that it can use 2 GPUs on the node where the server-requesting Pod is initially assigned. That is simply not always true.

It would be great if the test case could detect that situation and react in some sensible manner. It would be good if the debug dumping steps at the end of the job included one that exposed this situation.

This is really just a special case of the larger problem that GPU availability is dynamic in the shared cluster. Every test step that requires a GPU to be allocated is making an assumption of GPU availability that might not be true.
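As a rough illustration of what "detect that situation" could look like, here is a minimal sketch in Python. It is not the project's actual test code. It assumes that GPU usage is visible as the Kubernetes extended resource `nvidia.com/gpu` in Pod resource limits/requests and in Node `status.allocatable`, which may not match how this cluster or this project accounts for GPUs. The function names `free_gpus_per_node` and `gpu_precondition` are hypothetical; the inputs are the `items` lists from parsed `kubectl get nodes -o json` and `kubectl get pods -A -o json` output.

```python
from collections import defaultdict

# Assumption: GPUs are exposed under this extended resource name.
GPU_RESOURCE = "nvidia.com/gpu"


def free_gpus_per_node(nodes, pods, resource=GPU_RESOURCE):
    """Return {node name: allocatable GPUs minus GPUs claimed by active Pods}.

    `nodes` and `pods` are lists of dicts shaped like Kubernetes API objects.
    Only regular containers are counted; init containers are ignored here.
    """
    used = defaultdict(int)
    for pod in pods:
        spec = pod.get("spec", {})
        node = spec.get("nodeName")
        phase = pod.get("status", {}).get("phase")
        if not node or phase in ("Succeeded", "Failed"):
            continue  # unscheduled or finished Pods hold no GPUs
        for container in spec.get("containers", []):
            res = container.get("resources", {})
            # Extended resources are normally set in limits; fall back to requests.
            amount = res.get("limits", {}).get(resource)
            if amount is None:
                amount = res.get("requests", {}).get(resource, 0)
            used[node] += int(amount)

    free = {}
    for node in nodes:
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        free[name] = int(allocatable.get(resource, 0)) - used[name]
    return free


def gpu_precondition(free, node, needed):
    """Return (ok, message) so a test can skip or wait, and a debug step can log."""
    available = free.get(node, 0)
    ok = available >= needed
    verdict = "ok" if ok else "insufficient"
    message = f"node {node}: {available} GPU(s) free, {needed} needed ({verdict})"
    return ok, message
```

A test case that needs 2 GPUs on one node could call `gpu_precondition(free, node, 2)` before proceeding and skip, wait, or fail with the returned message. Printing the whole `free` mapping in the end-of-job debug dump would expose the saturation directly. Note that any such check is only a snapshot: another tenant of the shared cluster can still claim a GPU between the check and the allocation.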

Version

main (please specify commit below)

Branch name

No response

Commit SHA

No response

Relevant log output

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
