
Training foundation models on AWS

1. Building a distributed data loader
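
At its simplest, a distributed data loader shards the dataset so that every accelerator sees a different slice of each epoch. Below is a minimal PyTorch sketch assuming a torch.distributed process group is already initialized (for example by torchrun or SageMaker); the dataset, batch size, and worker count are placeholders to adapt.

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def build_distributed_loader(dataset, batch_size=32, num_workers=4):
    # Each rank sees a disjoint shard of the dataset via DistributedSampler.
    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
        shuffle=True,
    )
    return DataLoader(
        dataset,
        batch_size=batch_size,   # per-device batch size
        sampler=sampler,         # the sampler replaces shuffle=True
        num_workers=num_workers,
        pin_memory=True,
        drop_last=True,
    )

# Call loader.sampler.set_epoch(epoch) at the start of every epoch so each
# epoch gets a different shuffle across ranks.
```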

2. Picking the right hyperparameters
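
One recurring question is how to adjust the learning rate as the global batch size grows across more accelerators. A common heuristic, not a guarantee, is the linear scaling rule; the sketch below illustrates it with made-up numbers, and in practice you would pair it with warmup and validate empirically.

```python
def scaled_learning_rate(base_lr, base_batch_size, per_device_batch_size, num_devices):
    """Linear scaling rule (a common heuristic, not a guarantee):
    if the global batch grows k-fold, grow the learning rate k-fold,
    and pair it with a warmup schedule."""
    global_batch_size = per_device_batch_size * num_devices
    return base_lr * global_batch_size / base_batch_size

# e.g., a recipe tuned at batch size 256 with lr 3e-4, scaled to 8 devices x 64:
# scaled_learning_rate(3e-4, 256, 64, 8) -> 6e-4
```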

3. Large-scale training on SageMaker
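
For orientation, here is a minimal sketch of launching a multi-node PyTorch job with the SageMaker Python SDK. The entry point, IAM role, instance type and count, hyperparameters, and S3 paths are all assumptions to replace with your own; the distribution block shown launches the training script with torchrun on every node.

```python
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()

estimator = PyTorch(
    entry_point="train.py",            # your training script (assumed name)
    source_dir="src",                  # directory with the script and requirements
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # replace with your role
    framework_version="2.0.1",         # pick an available PyTorch container version
    py_version="py310",
    instance_count=4,                  # number of nodes
    instance_type="ml.p4d.24xlarge",   # 8 x A100 per node
    distribution={"torch_distributed": {"enabled": True}},  # launch with torchrun
    hyperparameters={"epochs": 1, "per_device_batch_size": 8},
    sagemaker_session=session,
)

estimator.fit({"train": "s3://my-bucket/datasets/train/"})  # S3 URI is a placeholder
```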

4. Compiling your model

  • Detailed walk-through from Chaim Rand on upgrading to PyTorch 2.0 on AWS and SageMaker (a minimal torch.compile sketch follows this list).
  • Example notebook showing how to change batch size and learning rate when compiling your model, using SageMaker Training Compiler.
  • Guidance from the Neuron SDK on supported models and on compiling your model for Trainium, AWS's custom machine learning accelerator optimized for training.
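
To make the PyTorch 2.0 walk-through concrete, here is a minimal torch.compile sketch. The model, data, and optimizer settings are placeholders just to exercise a training step; the only PyTorch-2.0-specific line is the torch.compile call.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# A placeholder model; swap in your real model and dataloader.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# torch.compile (new in PyTorch 2.0) captures the model into a graph and emits
# fused kernels; the first iterations are slower while compilation happens,
# steady-state iterations are usually faster.
compiled_model = torch.compile(model)

for step in range(10):
    x = torch.randn(8, 512, device=device)
    loss = compiled_model(x).pow(2).mean()   # dummy loss just to drive fwd/bwd
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```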

5. Measuring throughput

  • I'll add an example here of computing TFLOPS per accelerator; until then, a back-of-the-envelope sketch follows this list.
  • Some guidance on monitoring training jobs with Amazon CloudWatch and SageMaker.
  • Using SageMaker Debugger
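
Until that example lands, here is a back-of-the-envelope sketch of TFLOPS per accelerator. It relies on the common approximation that one training step of a decoder-only transformer costs roughly 6 FLOPs per parameter per token (forward plus backward), and it ignores attention and activation-recomputation terms; the numbers in the usage comment are made up.

```python
def training_tflops_per_accelerator(num_parameters, tokens_per_second, num_accelerators):
    """Approximate achieved TFLOPS per device.

    Uses the rule of thumb that one training step costs roughly
    6 FLOPs per parameter per token (forward + backward).
    """
    total_flops_per_second = 6 * num_parameters * tokens_per_second
    return total_flops_per_second / num_accelerators / 1e12

# e.g., a 7B-parameter model at 100k tokens/s across 64 accelerators:
# training_tflops_per_accelerator(7e9, 1e5, 64) -> ~65.6 TFLOPS per device.
# Divide by the hardware's peak TFLOPS to estimate utilization.
```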

6. Fine-tuning your model

  • Fine-tuning BLOOM with LoRA using Hugging Face's parameter-efficient fine-tuning (PEFT) library on SageMaker (a minimal sketch follows this list).
  • Fine-tuning Donut for state-of-the-art document parsing with Hugging Face on SageMaker.
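
As a minimal sketch of the LoRA approach with Hugging Face's peft library: the checkpoint below is a small BLOOM variant chosen for illustration, and the LoRA rank, alpha, and dropout are assumptions rather than recommended values.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "bigscience/bloom-560m"       # small BLOOM checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # BLOOM fuses q/k/v into one projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the adapter weights are trainable

# From here, train as usual (e.g., with transformers.Trainer); the base
# weights stay frozen and only the small LoRA adapters are updated.
```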

7. Evaluating your model

  • This is too broad to cover with a single example; that would be like trying to compute the wave angles of every ocean continuously :). However, for some concrete examples in language, take a look at the repository accompanying Hugging Face's book Natural Language Processing with Transformers right here.
  • Alternatively, you can jump straight to the Hugging Face evaluate library, which has many examples and tutorials (a minimal sketch follows this list).
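
A minimal sketch of the evaluate library, assuming a simple classification metric; the metric name and the prediction/reference values are placeholders.

```python
import evaluate

# Load a metric by name from the Hugging Face Hub; "accuracy" is just an example.
accuracy = evaluate.load("accuracy")

predictions = [0, 1, 1, 0]   # model outputs (placeholder values)
references  = [0, 1, 0, 0]   # ground-truth labels (placeholder values)

results = accuracy.compute(predictions=predictions, references=references)
print(results)               # {'accuracy': 0.75}
```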