Improvements to AWS infrastructure #243

Open
@PGijsbers

The current AWS implementation supports on-demand and spot instances.
Our workload consists of many independent, heterogeneous jobs that should be executed on homogeneous hardware (same instance type, same EBS storage). A job consists of training a model on a given dataset split for a given AutoML framework with a certain time budget. A full run that evaluates all frameworks on all datasets would have roughly 10,000 jobs (multiplied by the number of time budgets).
The jobs are not real-time, but we do want them to run within a reasonable time frame (e.g. 2 weeks for a full benchmark).
The jobs do not have intermediate results, only final ones, which are stored in an S3 bucket.
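
To make the job count concrete, here is a minimal sketch of how the job list multiplies out. The `Job` tuple, the `enumerate_jobs` helper, and all framework/dataset names and budgets below are hypothetical placeholders for illustration, not the benchmark's actual code or configuration.

```python
from itertools import product
from typing import NamedTuple


class Job(NamedTuple):
    """One independent unit of work (hypothetical structure)."""
    framework: str
    dataset: str
    fold: int
    time_budget_s: int


def enumerate_jobs(frameworks, datasets, n_folds, time_budgets_s):
    """Return one Job per (framework, dataset, fold, time budget) combination."""
    return [
        Job(framework, dataset, fold, budget)
        for framework, dataset, fold, budget in product(
            frameworks, datasets, range(n_folds), time_budgets_s
        )
    ]


# Placeholder values, chosen only to show how the job count multiplies.
jobs = enumerate_jobs(
    frameworks=["framework_a", "framework_b"],
    datasets=["dataset_1", "dataset_2", "dataset_3"],
    n_folds=10,
    time_budgets_s=[3600, 14400],
)
print(len(jobs))  # 2 * 3 * 10 * 2 = 120
```

Because every job is independent and only writes a final result, a flat list like this can be split across instances or regions in any order.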

Some improvements we would like to have include:

  • automatically determine the cheapest region(s) in which to rent Spot instances
  • automatically identify Spot instance failure types to determine whether or not a job should be rerun
  • distribute work across multiple regions (currently this is "achieved" by running the script multiple times in parallel)
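
A rough sketch of how these three pieces could fit together with boto3 follows. It is not the current script's implementation: the function names are hypothetical, the set of retryable state reason codes reflects my reading of the EC2 documentation and should be verified, and the prices in the usage example are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone
from itertools import cycle


def fetch_spot_prices(regions, instance_type):
    """Return {region: lowest current Linux Spot price across its AZs}.

    Needs AWS credentials; not called in the example below.
    """
    import boto3  # imported lazily so the pure helpers work without it

    prices = {}
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        response = ec2.describe_spot_price_history(
            InstanceTypes=[instance_type],
            ProductDescriptions=["Linux/UNIX"],
            StartTime=datetime.now(timezone.utc),  # current prices only
        )
        offers = [float(p["SpotPrice"]) for p in response["SpotPriceHistory"]]
        if offers:  # the instance type may not be offered in this region
            prices[region] = min(offers)
    return prices


def cheapest_regions(prices, n):
    """Return the n regions with the lowest price, cheapest first."""
    return sorted(prices, key=prices.get)[:n]


# Assumption: EC2 StateReason codes that indicate an AWS-side failure,
# so rerunning the job is worthwhile. Verify against the EC2 docs.
RETRYABLE_STATE_REASONS = {
    "Server.SpotInstanceTermination",
    "Server.SpotInstanceShutdown",
    "Server.InsufficientInstanceCapacity",
    "Server.InternalError",
}


def should_rerun(state_reason_code):
    """Rerun only when the instance was lost for a reason outside the job."""
    return state_reason_code in RETRYABLE_STATE_REASONS


def assign_jobs(jobs, regions):
    """Distribute jobs over regions round-robin."""
    assignment = defaultdict(list)
    for job, region in zip(jobs, cycle(regions)):
        assignment[region].append(job)
    return dict(assignment)


# Made-up prices, for illustration only.
example_prices = {"us-east-1": 0.12, "eu-west-1": 0.10, "us-west-2": 0.15}
regions = cheapest_regions(example_prices, n=2)
plan = assign_jobs(["job-0", "job-1", "job-2"], regions)
print(regions)  # ['eu-west-1', 'us-east-1']
print(plan)     # {'eu-west-1': ['job-0', 'job-2'], 'us-east-1': ['job-1']}
```

Round-robin assignment is the simplest choice here; since jobs are heterogeneous, a shared queue that each region pulls from would likely balance the load better.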

Changes to the current script might be enough, but SageMaker might provide a more robust and/or maintainable alternative.

Labels: aws (AWS support), help wanted (Extra attention is needed)
