Improvements to AWS infrastructure

The current AWS implementation supports On-Demand and Spot instances.
Our workload contains many independent, heterogeneous jobs which should be executed on homogeneous hardware (same instance type, EBS storage). A job consists of training a model on a given dataset split for a given AutoML framework with a certain time budget. A full run that evaluates all frameworks on all datasets would have roughly 10,000 jobs (multiplied by the number of time budgets).
The jobs are not real-time, but we do want them to run within a reasonable time frame (e.g. 2 weeks for a full benchmark).
The jobs do not have intermediate results, only final ones, which are stored in an S3 bucket.
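
To make the job count concrete, here is a minimal sketch of how the job list could be enumerated. All framework names, dataset names, fold counts and time budgets below are hypothetical placeholders, not the benchmark's actual contents.

```python
from itertools import product

# Hypothetical placeholders; the real benchmark defines its own lists.
frameworks = ["framework_a", "framework_b"]
datasets = ["dataset_1", "dataset_2", "dataset_3"]
folds = range(2)                 # dataset splits per dataset
time_budgets_s = [3600, 14400]   # time budgets in seconds

# One job per (framework, dataset, split, time budget) combination.
jobs = [
    {"framework": fw, "dataset": ds, "fold": fold, "budget_s": budget}
    for fw, ds, fold, budget in product(frameworks, datasets, folds, time_budgets_s)
]

# The job count is the product of the list sizes.
print(len(jobs))
```

Because every job is independent and only writes a final result, this list can be split freely across instances or regions.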

Some improvements we would like to have include:
 - automatically determine the cheapest region(s) in which to rent Spot instances
 - automatically identify Spot instance failure types to determine whether or not a job should be rerun
 - distribute work across multiple regions (currently this is "achieved" by running the script multiple times in parallel)
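
A rough sketch of how the three items above might look if done inside the current script, assuming boto3 is used. The function names, the round-robin strategy and the set of retryable status codes are illustrative assumptions, not the existing implementation; the status codes are Spot request status codes and the set likely needs extending.

```python
from collections import defaultdict
from datetime import datetime, timezone
from itertools import cycle

# Assumption: these Spot request status codes mean AWS reclaimed or could not
# provide capacity, so the job itself is not at fault and can be rerun.
RETRYABLE_SPOT_STATUS = {
    "instance-terminated-by-price",
    "instance-terminated-no-capacity",
    "instance-terminated-capacity-oversubscribed",
    "capacity-not-available",
    "price-too-low",
}


def fetch_spot_prices(instance_type, regions):
    """Return {region: lowest current Spot price over its availability zones}."""
    import boto3  # imported lazily so the pure helpers below work without AWS access

    prices = {}
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        response = ec2.describe_spot_price_history(
            InstanceTypes=[instance_type],
            ProductDescriptions=["Linux/UNIX"],
            StartTime=datetime.now(timezone.utc),  # prices in effect right now
        )
        zone_prices = [float(p["SpotPrice"]) for p in response["SpotPriceHistory"]]
        if zone_prices:  # skip regions that do not offer this instance type
            prices[region] = min(zone_prices)
    return prices


def cheapest_regions(prices, n=1):
    """Return the n region names with the lowest price, cheapest first."""
    return sorted(prices, key=prices.get)[:n]


def should_rerun(spot_status_code):
    """Rerun only when the instance was lost for reasons outside the job."""
    return spot_status_code in RETRYABLE_SPOT_STATUS


def distribute(jobs, regions):
    """Assign jobs to regions round-robin; returns {region: [jobs]}."""
    assignment = defaultdict(list)
    for job, region in zip(jobs, cycle(regions)):
        assignment[region].append(job)
    return dict(assignment)
```

For example, `distribute(jobs, cheapest_regions(fetch_spot_prices("m5.2xlarge", regions), n=3))` would spread the jobs over the three cheapest regions; the instance type here is only an example. Round-robin ignores price differences between the selected regions, so a price-weighted split may be preferable.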
 
Changes to the current script might be enough, but SageMaker might provide a more robust and/or maintainable alternative.

Improvements to AWS infrastructure #243
