
Training foundation models on AWS

1. Building a distributed data loader
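
At its simplest, a distributed data loader shards the dataset so that every accelerator sees a different slice of each epoch. Below is a minimal PyTorch sketch assuming a torch.distributed process group is already initialized (for example by torchrun or SageMaker); the dataset, batch size, and worker count are placeholders to adapt.

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def build_distributed_loader(dataset, batch_size=32, num_workers=4):
    # Each rank sees a disjoint shard of the dataset via DistributedSampler.
    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
        shuffle=True,
    )
    return DataLoader(
        dataset,
        batch_size=batch_size,   # per-device batch size
        sampler=sampler,         # the sampler replaces shuffle=True
        num_workers=num_workers,
        pin_memory=True,
        drop_last=True,
    )

# Call loader.sampler.set_epoch(epoch) at the start of every epoch so each
# epoch gets a different shuffle across ranks.
```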

2. Picking the right hyperparameters
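
One recurring question is how to adjust the learning rate as the global batch size grows across more accelerators. A common heuristic, not a guarantee, is the linear scaling rule; the sketch below illustrates it with made-up numbers, and in practice you would pair it with warmup and validate empirically.

```python
def scaled_learning_rate(base_lr, base_batch_size, per_device_batch_size, num_devices):
    """Linear scaling rule (a common heuristic, not a guarantee):
    if the global batch grows k-fold, grow the learning rate k-fold,
    and pair it with a warmup schedule."""
    global_batch_size = per_device_batch_size * num_devices
    return base_lr * global_batch_size / base_batch_size

# e.g., a recipe tuned at batch size 256 with lr 3e-4, scaled to 8 devices x 64:
# scaled_learning_rate(3e-4, 256, 64, 8) -> 6e-4
```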

3. Large-scale training on SageMaker
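
For orientation, here is a minimal sketch of launching a multi-node PyTorch job with the SageMaker Python SDK. The entry point, IAM role, instance type and count, hyperparameters, and S3 paths are all assumptions to replace with your own; the distribution block shown launches the training script with torchrun on every node.

```python
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()

estimator = PyTorch(
    entry_point="train.py",            # your training script (assumed name)
    source_dir="src",                  # directory with the script and requirements
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # replace with your role
    framework_version="2.0.1",         # pick an available PyTorch container version
    py_version="py310",
    instance_count=4,                  # number of nodes
    instance_type="ml.p4d.24xlarge",   # 8 x A100 per node
    distribution={"torch_distributed": {"enabled": True}},  # launch with torchrun
    hyperparameters={"epochs": 1, "per_device_batch_size": 8},
    sagemaker_session=session,
)

estimator.fit({"train": "s3://my-bucket/datasets/train/"})  # S3 URI is a placeholder
```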

4. Compiling your model

  • Detailed walk-through from Chaim Rand on upgrading to PyTorch 2.0 on AWS and SageMaker (a minimal torch.compile sketch follows this list).
  • Example notebook showing how to change batch size and learning rate when compiling your model, using SageMaker Training Compiler.
  • Guidance from the Neuron SDK on supported models and on compiling your model for Trainium, AWS's custom machine learning accelerator optimized for training.
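
To make the PyTorch 2.0 walk-through concrete, here is a minimal torch.compile sketch. The model, data, and optimizer settings are placeholders just to exercise a training step; the only PyTorch-2.0-specific line is the torch.compile call.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# A placeholder model; swap in your real model and dataloader.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# torch.compile (new in PyTorch 2.0) captures the model into a graph and emits
# fused kernels; the first iterations are slower while compilation happens,
# steady-state iterations are usually faster.
compiled_model = torch.compile(model)

for step in range(10):
    x = torch.randn(8, 512, device=device)
    loss = compiled_model(x).pow(2).mean()   # dummy loss just to drive fwd/bwd
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```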

5. Measuring throughput

  • I'll add an example here of computing TFLOPS per accelerator; until then, a back-of-the-envelope sketch follows this list.
  • Some guidance on monitoring training jobs with Amazon CloudWatch and SageMaker.
  • Using SageMaker Debugger
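
Until that example lands, here is a back-of-the-envelope sketch of TFLOPS per accelerator. It relies on the common approximation that one training step of a decoder-only transformer costs roughly 6 FLOPs per parameter per token (forward plus backward), and it ignores attention and activation-recomputation terms; the numbers in the usage comment are made up.

```python
def training_tflops_per_accelerator(num_parameters, tokens_per_second, num_accelerators):
    """Approximate achieved TFLOPS per device.

    Uses the rule of thumb that one training step costs roughly
    6 FLOPs per parameter per token (forward + backward).
    """
    total_flops_per_second = 6 * num_parameters * tokens_per_second
    return total_flops_per_second / num_accelerators / 1e12

# e.g., a 7B-parameter model at 100k tokens/s across 64 accelerators:
# training_tflops_per_accelerator(7e9, 1e5, 64) -> ~65.6 TFLOPS per device.
# Divide by the hardware's peak TFLOPS to estimate utilization.
```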

6. Fine-tuning your model

  • Fine-tuning BLOOM with LoRA using Hugging Face's parameter-efficient fine-tuning (PEFT) library on SageMaker (a minimal sketch follows this list).
  • Fine-tuning Donut for state-of-the-art document parsing with Hugging Face on SageMaker.
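
As a minimal sketch of the LoRA approach with Hugging Face's peft library: the checkpoint below is a small BLOOM variant chosen for illustration, and the LoRA rank, alpha, and dropout are assumptions rather than recommended values.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "bigscience/bloom-560m"       # small BLOOM checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # BLOOM fuses q/k/v into one projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the adapter weights are trainable

# From here, train as usual (e.g., with transformers.Trainer); the base
# weights stay frozen and only the small LoRA adapters are updated.
```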

7. Evaluating your model

  • This is too broad to cover with a single example; that would be like trying to compute the wave angles of every ocean continuously :). However, for some concrete examples in language, take a look at the repository accompanying Hugging Face's book Natural Language Processing with Transformers right here.
  • Alternatively, you can jump straight to the Hugging Face evaluate library, which has many examples and tutorials (a minimal sketch follows this list).
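
A minimal sketch of the evaluate library, assuming a simple classification metric; the metric name and the prediction/reference values are placeholders.

```python
import evaluate

# Load a metric by name from the Hugging Face Hub; "accuracy" is just an example.
accuracy = evaluate.load("accuracy")

predictions = [0, 1, 1, 0]   # model outputs (placeholder values)
references  = [0, 1, 0, 0]   # ground-truth labels (placeholder values)

results = accuracy.compute(predictions=predictions, references=references)
print(results)               # {'accuracy': 0.75}
```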