Open
Description
This is a tracking issue used to document the current set of features we would like to integrate into gantry.
This thread should also be used to discuss new directions for the project.
- Dynamic allocation #73
- Cost measurement and analysis #75
- Improved retries in Spack CI #74
- Resource fuzzing to assess job performance variation #76
- Optimizing scheduling of nodes and pods #77
Plan
- In the pilot phase, we will only implement predictions for requests, and ensure that they can only increase relative to current allocations.
- If the pilot succeeds, we'll implement functionality that retries jobs with higher memory allocations when they have failed due to OOM kills.
- Then, we will "drop the floor" and allow the predictor to allocate less memory than the package currently receives. At this step, requests will be fully implemented.
- Limits for CPU and memory will be implemented.
- Next, we want to introduce some experimentation in the system and perform a scaling study.
- Design a scheduler that decides which instance type a job should be placed on, based on cost and on expected resource usage and runtime.
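
The first three steps of the plan could be sketched roughly as below. This is only an illustration under stated assumptions, not gantry's implementation: the function names `pilot_request` and `next_allocation_after_oom`, the 1.5x growth factor, and the 64,000 MB cap are all hypothetical choices made for the example.

```python
from typing import Optional


def pilot_request(predicted_mb: float, current_mb: float,
                  allow_decrease: bool = False) -> float:
    """Pick the memory request (in MB) for a job.

    With allow_decrease=False (pilot phase), the prediction may only raise
    the request above the current allocation. Setting allow_decrease=True
    corresponds to "dropping the floor".
    """
    if allow_decrease:
        return predicted_mb
    return max(predicted_mb, current_mb)


def next_allocation_after_oom(failed_mb: float, growth: float = 1.5,
                              cap_mb: float = 64_000) -> Optional[float]:
    """Return a larger allocation to retry with after an OOM kill.

    Returns None once the (hypothetical) cap has been reached, meaning the
    job should not be retried again.
    """
    if failed_mb >= cap_mb:
        return None
    return min(failed_mb * growth, cap_mb)
```

Keeping the floor behind a single flag means the pilot and the later "drop the floor" step share one code path, so enabling lower allocations would be a configuration change rather than a rewrite.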
Evaluation
The success of this framework can be evaluated against a number of factors:
- Has the cost per job changed?
- Are jobs being killed due to resource contention?
- What is the error distribution of our predictions?
- How much waste is there per build type?
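
As a starting point for the last two questions, the metrics might be computed along these lines. This is a sketch under assumptions: the record fields (`build_type`, `predicted_mb`, `allocated_mb`, `used_mb`) are hypothetical, and "waste" is taken here to mean the fraction of allocated memory a job never used, which is one possible definition among several.

```python
from collections import defaultdict
from statistics import mean


def prediction_errors(records):
    """Relative error per job: (predicted - used) / used.

    Positive values are over-predictions, negative values under-predictions.
    """
    return [(r["predicted_mb"] - r["used_mb"]) / r["used_mb"] for r in records]


def waste_by_build_type(records):
    """Mean fraction of allocated memory left unused, grouped by build type."""
    grouped = defaultdict(list)
    for r in records:
        unused = max(r["allocated_mb"] - r["used_mb"], 0)
        grouped[r["build_type"]].append(unused / r["allocated_mb"])
    return {build_type: mean(fractions) for build_type, fractions in grouped.items()}
```

The list returned by `prediction_errors` can be fed into a histogram or summarized with quantiles to describe the error distribution.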
alecbcs