Description
The current AWS implementation supports on-demand and spot instances.
Our workload consists of many independent, heterogeneous jobs that should be executed on homogeneous hardware (same instance type, same EBS storage). A job consists of training a model on a given dataset split with a given AutoML framework under a given time budget. A full run that evaluates all frameworks on all datasets would have roughly 10,000 jobs (multiplied by the number of time budgets).
The jobs are not real-time, but we do want them to run within a reasonable time frame (e.g. 2 weeks for a full benchmark).
The jobs do not produce intermediate results, only final ones, which are stored in an S3 bucket.
Some improvements we would like to have include:
- automatically determine the cheapest region(s) in which to rent spot instances
- automatically identify spot instance failure types to determine whether or not a job should be rerun
- distribute work across multiple regions (currently this is "achieved" by running the script multiple times in parallel)
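
To make the three items above a bit more concrete, here is a rough sketch of how they could look in Python. It is not the current script's code. All function names are hypothetical, the set of state-reason codes treated as "rerun" is an assumption that would need checking against the failures we actually observe, and ranking regions by mean recent spot price is just one possible definition of "cheapest". The only AWS call used is boto3's `describe_spot_price_history`; the other helpers are plain Python and work without AWS access.

```python
from collections import defaultdict
from statistics import mean

# Assumption: EC2 state-reason codes that indicate an infrastructure-side
# failure, i.e. the job itself is not at fault and can safely be rerun.
RETRYABLE_CODES = {
    "Server.SpotInstanceTermination",
    "Server.SpotInstanceShutdown",
    "Server.InsufficientInstanceCapacity",
}


def fetch_spot_prices(regions, instance_type):
    """Return {region: [price, ...]} taken from the EC2 spot price history."""
    import boto3  # imported lazily so the helpers below work without boto3

    prices = defaultdict(list)
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        response = ec2.describe_spot_price_history(
            InstanceTypes=[instance_type],
            ProductDescriptions=["Linux/UNIX"],
            MaxResults=100,
        )
        for entry in response["SpotPriceHistory"]:
            prices[region].append(float(entry["SpotPrice"]))
    return dict(prices)


def cheapest_regions(prices, n=1):
    """Rank regions by mean spot price and return the n cheapest."""
    # Regions without any price data are skipped.
    averages = {r: mean(values) for r, values in prices.items() if values}
    return sorted(averages, key=averages.get)[:n]


def should_rerun(state_reason_code):
    """Decide whether a failed job should be rerun, given the instance's state-reason code."""
    return state_reason_code in RETRYABLE_CODES


def distribute(jobs, regions):
    """Assign jobs to regions round-robin."""
    assignment = {region: [] for region in regions}
    for i, job in enumerate(jobs):
        assignment[regions[i % len(regions)]].append(job)
    return assignment
```

Usage would be something like `distribute(jobs, cheapest_regions(fetch_spot_prices(regions, instance_type), n=3))`. Round-robin is the simplest option here; weighting by price or by available capacity per region would probably be better in practice.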
Changes to the current script might be enough, but SageMaker might provide a more robust and/or maintainable alternative.