soumyadip1995/BabyGPT

BabyGPT 🐤

Building on the intuition of Karpathy's ng-video-lectures and minGPT, BabyGPT provides a working GPT on a much smaller scale (16 and 256 out channels, 5-layer GPT, fine-tuned). BabyGPT grew out of toyGPT, which was written to understand transformers from scratch, and it has been scaled down, as you will see below. Visit the notebooks: we work our way up from simple language models to attention mechanisms, transformers, and finally BabyGPT. toyGPT separates all the layers of a transformer individually, while in BabyGPT the attention mechanism is implemented manually. The purpose of building smaller GPTs is to understand how transformers function at a much more granular level. We have written a paper based on this work; you can find it in the paper folder.
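
To make "implemented manually" concrete, here is a minimal sketch of single-head causal self-attention in plain Python. It is not the code from model.py; the function names and the toy input are assumptions chosen for illustration, and a real model would use learned query, key and value projections on tensors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats) of equal dimension.
    Position t may only attend to positions 0..t.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Scores against the current and earlier positions only (causal mask).
        scores = [sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(len(v[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_self_attention(x, x, x)  # toy case: q = k = v = x
```

Because of the causal mask, the first output row is just the first value vector: position 0 has nothing earlier to attend to.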

The models

To train small models we use TinyStories. You can download the weights from Hugging Face. We set max_iters to 5000 on a Tesla T4. The original (OG) model uses 256 out channels.

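| model | context length | n_layers | n_head | n_embd | train loss | val loss | parameters | data |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 15M | 16 | 4 | 4 | 16 | 2.4633 | 2.4558 | 13k | stories15M.bin |
| 42M | 32 | 8 | 8 | 32 | 2.3772 | 2.3821 | 1.01M | stories42M.bin |
| BabyGPT Original | 64 | 8 | 8 | 256 | 1.3954 | 1.5959 | 6.37M | data |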

Note: The 110M model is omitted for now. The RAM blew up!
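
As a rough sanity check on the parameter column, the weights inside the transformer blocks can be approximated as 12 * n_layers * n_embd^2. This is a sketch that assumes a standard GPT block with a 4x wide MLP, which may not match the repo's layers exactly. The function name is made up for illustration, and embeddings, biases and norm weights are ignored.

```python
def approx_block_params(n_layer, n_embd):
    """Rough count of transformer-block weights: about 12 * n_layer * n_embd^2.

    Per layer: 4 * n_embd^2 for attention (q, k, v, output projection)
    plus 8 * n_embd^2 for an MLP with a 4x hidden width.
    Embeddings, biases and norm weights are ignored.
    """
    return 12 * n_layer * n_embd ** 2

babygpt_original = approx_block_params(n_layer=8, n_embd=256)
print(f"{babygpt_original / 1e6:.2f}M")  # 6.29M
```

For the BabyGPT Original row this lands close to the reported 6.37M; the remainder would plausibly come from the embeddings and other small tensors.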

The Roadmap 🚀

If you wish to understand the nitty-gritty of how transformers work from scratch, this roadmap will guide you. We start by implementing simple language models (bigram and ngram), and from there work our way up to building transformers, a GPT from scratch, and finally BabyGPT.
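
For a feel of the first step on the roadmap, here is a minimal count-based character bigram model in plain Python. It is a sketch only: bigram_lm.py in this repo may be built differently (for example as a small neural model), and the function names below are assumptions.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length, seed=0):
    """Sample characters one at a time from the bigram counts."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:  # no observed successor: stop early
            break
        chars, weights = zip(*followers.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

counts = train_bigram("hello world")
sample = generate(counts, start="h", length=10)
```

Each generated character depends only on the one before it, which is exactly the limitation that attention and transformers later remove.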

We also add a low-rank approximation and lit-llama to BabyGPT. Finally, we train the models and generate tokens.

Low rank approximation

Low-rank approximation improves parameter efficiency (it is a compression technique). A LoRa_model.py has been added (on 256 out channels). We get a parameter reduction of about a factor of 2. All we need to do is choose a rank parameter and compute the attention accordingly. In the LoRa notebook, FLOPs have been estimated following the Chinchilla paper.
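
The arithmetic behind the saving can be sketched as follows. Replacing a full d_in x d_out weight matrix with two factors of rank r stores r * (d_in + d_out) weights instead of d_in * d_out. The rank of 64 below is a hypothetical value picked for illustration, not necessarily the one used in LoRa_model.py.

```python
def dense_params(d_in, d_out):
    """Weights in a full d_in x d_out matrix."""
    return d_in * d_out

def low_rank_params(d_in, d_out, rank):
    """Weights in the factorisation A (d_in x rank) @ B (rank x d_out)."""
    return rank * (d_in + d_out)

d = 256      # out channels, as in the README
rank = 64    # hypothetical rank, chosen for illustration
full = dense_params(d, d)               # 65536
factored = low_rank_params(d, d, rank)  # 32768
print(full / factored)                  # 2.0
```

With these assumed values, a single 256 x 256 projection shrinks by a factor of 2; a smaller rank would compress more at the cost of expressiveness.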

Quantization on low rank approximation

Quantization has also been performed on the LoRa model, and it has been added to the LoRa notebook along with a calculation of FLOPs. For the BabyGPT model with 256 out channels we get 0.407 PFLOPs. In terms of size, we currently get a reduction by a factor of 1.3.
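
A common way to estimate training compute, used in the Chinchilla-style analysis, is C ≈ 6 * N * D, where N is the parameter count and D the number of training tokens. The sketch below applies it with a hypothetical token count. It is not meant to reproduce the 0.407 PFLOPs figure, whose exact inputs are in the notebook.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: C ~ 6 * N * D (forward + backward)."""
    return 6 * n_params * n_tokens

n_params = 6_370_000      # 6.37M, the BabyGPT Original parameter count
n_tokens = 10_000_000     # hypothetical token count, for illustration only
flops = training_flops(n_params, n_tokens)
print(f"{flops / 1e15:.3f} PFLOPs")  # 0.382 PFLOPs
```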

We support LLaMa for BabyGPT ⚡ ⚡

1. LLaMa version -1 🦙

An implementation of the lit-llama model has been ported to BabyGPT (based on LLaMA version 1). You can find the notebook here -> llama_implementation. To run the model, run python llama_model_v1.py from the llama folder; MFU has also been added. Training and token generation are described below.

Note: We have ported build_rope_cache(), apply_rope() and RMSNorm() from version 1. We are not using the version 1 weights or checkpoints (these are for much larger models: 7B, 13B, 65B, etc.). You can download the weights and port llama to your own version.
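
To show what one of the ported pieces does, here is a minimal plain-Python sketch of RMS normalization. It illustrates the idea behind RMSNorm() rather than the ported code itself; the function name, list-based inputs and eps value are assumptions.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

y = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
```

Unlike LayerNorm, there is no mean subtraction and no bias: the vector is only rescaled so that its mean square is close to 1.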

2. LLaMa2 🦄

We have ported Llama 2 by Meta into BabyGPT. You can find the implementation in llama2.py; run python llama2.py from the llama folder. We have also provided a calculation of FLOPs along with the model.

The FLOPs needed to compute the k, v cache per token are 2 * 2 * num_layers * (embedded_dim)^2. Find more information on how to compute memory and compute in kipply's blog. MFU has been added to llama2.
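
The formula above can be written out directly. The sketch below also adds the matching per-token cache memory estimate, 2 * num_layers * embedded_dim values. The layer count and width are illustrative values borrowed from the BabyGPT Original row, not the llama2.py configuration, and the function names are assumptions.

```python
def kv_flops_per_token(num_layers, embed_dim):
    """FLOPs to compute k and v for one token: 2 * 2 * num_layers * embed_dim^2.

    One factor of 2 is for the two projections (k and v), the other for the
    multiply and add in each matrix-vector product.
    """
    return 2 * 2 * num_layers * embed_dim ** 2

def kv_cache_bytes_per_token(num_layers, embed_dim, bytes_per_value=2):
    """Memory to cache k and v for one token (2 bytes per value assumes fp16)."""
    return 2 * num_layers * embed_dim * bytes_per_value

print(kv_flops_per_token(num_layers=8, embed_dim=256))        # 2097152
print(kv_cache_bytes_per_token(num_layers=8, embed_dim=256))  # 8192
```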

Note:- We are not using original llama weights by meta. We are also using arbitrary values for 70B. You can port it to your own model using your own weights.

Tokenization

Tokenization is done using sentencepiece. Unlike the tokenizer in the quant folder, this one exports a tokenizer.bin. Run python tokenizer.py from the llama folder (meta-pieces added). The .bin file can be used for further inference.

LLaMA with Model FLOP Utilization(MFU) ⚡

LLMs need to use memory and compute efficiently. A common metric for hardware accelerators is hardware FLOPs utilization (HFU), an estimate of the ratio of FLOPs observed on a given device to its theoretical peak FLOPs. Model FLOPs utilization (MFU) is the ratio of the observed throughput (tokens per second) to the theoretical maximum throughput of a system operating at peak FLOPs. The theoretical peak matmul throughput of a Tesla T4 is around 8.1 TFLOPS, and we use it to calculate the MFU of the LLaMA trainer model. See the trainer notebook under LLaMA-trainer. We get an MFU of 0.0527723427% on 3.22M parameters. MFU generally increases as the number of parameters increases: for a 530B parameter model, MFU is around 30% on A100 GPUs. We use Section B from the PaLM paper for reference.
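
The calculation can be sketched as below, using the common approximation of 6 * N FLOPs per token for the weights plus 12 * n_layer * n_head * head_dim * seq_len for attention. The throughput and architecture values are hypothetical placeholders, and the function name is an assumption; only the parameter count and the T4 peak come from the text above.

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len,
                 tokens_per_second, peak_flops):
    """Model FLOPs utilization: achieved FLOPs per second over peak FLOPs.

    FLOPs per token are approximated as 6 * N for the weights plus
    12 * n_layer * n_head * head_dim * seq_len for attention.
    """
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    return flops_per_token * tokens_per_second / peak_flops

mfu = estimate_mfu(
    n_params=3_220_000,        # 3.22M, as reported above
    n_layer=8, n_head=8,       # hypothetical architecture values
    head_dim=32, seq_len=64,   # hypothetical
    tokens_per_second=20_000,  # hypothetical measured throughput
    peak_flops=8.1e12,         # Tesla T4 peak quoted above
)
print(f"MFU = {mfu:.2%}")
```

In practice tokens_per_second would be measured over a few training iterations, so the result depends heavily on batch size and data loading.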

Quantization

LLMs require many GPUs to run, so we need ways to reduce these requirements while preserving the model's performance. Various techniques have been developed to shrink model size; you may have heard of quantization and distillation. Instead of the 4-byte FP32 precision, we can get an almost identical inference outcome with 2-byte BF16/FP16 half precision, which halves the model size.

To shrink the model further, 8-bit quantization was introduced. This method uses quarter precision, thus needing only 1/4th of the model size. But it's not done by just dropping another half of the bits. There's a lot more to this topic. Look at Hugging Face quantization.
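
To see why this is more than dropping bits, here is a minimal sketch of absmax quantization, one of the simplest 8-bit schemes: values are rescaled so the largest magnitude maps to 127, rounded to integers, and later multiplied back by the scale. It is an illustration under that assumption, not the method used in quantization.py, and the function names are made up.

```python
def quantize_absmax(values):
    """Map floats to int8 using a single scale: the largest magnitude maps to 127."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 codes."""
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
```

Each restored value is within half a quantization step of the original, which is why inference results can stay close to the full-precision model.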

See quant.md for how to perform llama quantization, and the quantization Notebook for a beginner's introduction to quantization. Several benchmarks have been run. For example, on a GPU, the 7B parameter model in bfloat16 takes about 15GB, while BabyGPT takes only a few kilobytes. quantization.py has been taken from the lit-llama repo. A tokenizer using sentencepiece has been added as well. Different kinds of weight operations can be performed from the tokenizer.model.

Our result

For post training quantization, we are able to reduce the model size by a factor of almost 4.

import torch

# BabyGPTmodel and config are defined in the repo
model_fp = BabyGPTmodel(config)
model_fp.eval()
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)

# Output:
# number of parameters: 3222637
# 12.9688 MB  (FP32 model)
# 3.4603 MB   (INT8 model)

Note: For quantization only, we are using a bigger model with about 3.22M parameters.
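
The reported sizes can be cross-checked with simple arithmetic: 4 bytes per FP32 weight and 1 byte per INT8 weight. This is a sketch that counts weights only and uses megabytes of 10^6 bytes; the helper name is an assumption.

```python
def model_size_mb(n_params, bytes_per_param):
    """Storage for the weights alone, in megabytes (10^6 bytes)."""
    return n_params * bytes_per_param / 1e6

n_params = 3_222_637                     # parameter count reported above
fp32_mb = model_size_mb(n_params, 4)     # about 12.89 MB
int8_mb = model_size_mb(n_params, 1)     # about 3.22 MB
reported_ratio = 12.9688 / 3.4603        # about 3.75, from the sizes above
```

The measured files are slightly larger than these estimates, presumably because dynamic quantization only converts the Linear layers and the saved file carries some overhead.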

Performance Benchmark

Performance benchmarking has been done on the BabyGPTmodel and the quantized model. The results have been added to the quantization Notebook in the quant folder.

Files

BabyGPT
├── bigram_lm.py
├── ngram_lm.py
├── model.py
├── Lora_model.py
├── Llama_model.py
├── Attention
│   ├── dot product attention.py
│   ├── multi headed attention.py
│   ├── cross attention.py
│   ├── spatial attention.py
├── Notebook
│   ├── Dot product attention
│   ├── multiheaded attention
│   ├── gpt from scratch
│   ├── spatial transformer
│   ├── babyGPT
│   ├── LoRa
│   ├── llama_implementation
│   ├── mixed precision
│   ├── mechanistic Interpretability
│   ├── Additional BabyGPT
│   ├── BabyGPT CoT
├── Train
│   ├── babygpt_trainer.py
│   ├── llama_trainer.py
├── transformers
│   ├── transformer_model.py
│   ├── babyGPT.py
├── Quant
│   ├── quantization.py
│   ├── quantization notebook
│   ├── tokenizer.model
│   ├── tokenizer.vocab
│   ├── tokenizer.py
│   ├── model.pth
│   ├── quant.md
├── llama
│   ├── llama2.py
│   ├── llama_model_v1.py
│   ├── tokenizer.py
│   ├── tokenizer.vocab
│   ├── tokenizer.bin
│   ├── tokenizer.model
├── text.txt
├── trainer.ipynb
├── requirements.txt

Run 🏃

Clone the Repo and run the following:

! git clone https://github.com/soumyadip1995/BabyGPT.git

To run the bigram and ngram language models, use python bigram_lm.py and python ngram_lm.py.

To run BabyGPT, run python babygpt.py from the transformers folder.

To run a simple transformer model, use python transformer_model.py.

To run a low-rank approximation model, use python LoRa_model.py.

To run the llama model, run python llama_model_v1.py from the llama folder.

To run llama2, run python llama2.py from the llama folder.

The different attention mechanisms can be run from the Attention folder.

Auto Mixed Precision

A very preliminary automatic mixed precision (FP16/FP32) setup has been added. It requires a CUDA-enabled GPU and uses a combination of PyTorch's autocast() and GradScaler. See more in the PyTorch tutorial. Unfortunately the GPU blew up during training, and the CPU for now only supports bfloat16, which takes a very long time to train. If anyone can improve upon it, that would be awesome. Check the Mixed Precision Notebook.

Train and Generate 🏃

If you want to get started with BabyGPT and llama without going through all the details of transformer models, you can simply run the code from the train folder.

To train and generate text with both the BabyGPT model and the LLaMA model, run python babygpt_trainer.py and python llama_trainer.py from the train folder.

Both have been trained on a Tesla T4 GPU. You can increase or decrease max_iters as you wish. Training takes a few minutes.

Results 📋

You can see the result from both the models in the trainer notebook.

Result from the llama trainer.

Number of params = 3.22 M

``` number of parameters: 3222381 ```

step 0: train loss 4.6894, val loss 4.6895
step 500: train loss 2.1731, val loss 2.1832
step 1000: train loss 1.7580, val loss 1.8032
step 1500: train loss 1.5790, val loss 1.6645
step 2000: train loss 1.4482, val loss 1.5992
step 2500: train loss 1.3538, val loss 1.5874
step 3000: train loss 1.2574, val loss 1.5971
.
.
.
step 9000: train loss 0.5236, val loss 2.4614
step 9500: train loss 0.4916, val loss 2.5494
step 10000: train loss 0.4680, val loss 2.6631
step 10500: train loss 0.4448, val loss 2.6970
step 10999: train loss 0.4341, val loss 2.7462

Detroit, revior myself 'til I confused to get the big clead Mastles
Slaughterhouse on the blue, that's when he pine I'm hop with the cowprinton
robaly I want to a lox on my tempt

But now we can't never find a gift killed broke
Big before anyone could ever hear the first as I was cooped chill
But i this o for a big star
I said get chased up!
(Hello darkness, my old friend)[Eminem:]
If my legacy I acged buving in the tub (might what?)
I would know one [*Barrns, worried :]
Yeah, so kon bitch, it's

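The validation loss bottoms out around step 2500 (1.5874) and then climbs while the training loss keeps falling, so the model overfits well before the end of training. Stopping earlier or adding regularization may help. Spitting some Eminem yo..:smile: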
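
One simple way to act on the log above is to pick the checkpoint with the lowest validation loss. The sketch below parses a subset of the printed lines and finds that step; the helper name and the regular expression are assumptions based on the log format shown.

```python
import re

LOG = """\
step 0: train loss 4.6894, val loss 4.6895
step 500: train loss 2.1731, val loss 2.1832
step 1000: train loss 1.7580, val loss 1.8032
step 1500: train loss 1.5790, val loss 1.6645
step 2000: train loss 1.4482, val loss 1.5992
step 2500: train loss 1.3538, val loss 1.5874
step 3000: train loss 1.2574, val loss 1.5971
step 9000: train loss 0.5236, val loss 2.4614
step 10999: train loss 0.4341, val loss 2.7462
"""

PATTERN = re.compile(r"step (\d+): train loss ([\d.]+), val loss ([\d.]+)")

def best_step(log_text):
    """Return (step, val_loss) for the lowest validation loss in the log."""
    rows = [(int(s), float(v)) for s, _, v in PATTERN.findall(log_text)]
    return min(rows, key=lambda row: row[1])

print(best_step(LOG))  # (2500, 1.5874)
```

Saving a checkpoint whenever the validation loss improves would keep the step 2500 weights instead of the overfitted final ones.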

Data

The data folder contains the text document which has the lyrics to all of Eminem's songs.

Running the notebooks

| Notebook | Description |
| --- | --- |
| Dot product attention | colab |
| Multi headed attention | colab |
| GPT from scratch | colab (approx. 860k parameters) |
| Spatial Transformers | colab |
| BabyGPT | colab (16, 256 out channels) |
| LoRa | colab (256 out channels) |
| lit-llama for BabyGPT | colab (16 out channels for lit-llama) |
| trainer for BabyGPT and llama | colab (16 out channels for BabyGPT, 256 out channels for llama) |
| Mechanistic Interpretability Code | colab |
| Additional BabyGPT | colab |
| BabyGPT_CoT | colab |

text.txt is based on Eminem's Stan.

Mechanistic Interpretability

Features are one of the most fundamental properties of neural networks. A direction vector in the space of neuron activations in a given layer can correspond to a direction in feature space. Individual neurons can be studied on their own, but features are also connected by weights, forming circuits.

Circuits can be thought of as computational graphs that consist of a set of features and the weighted edges between them in the original network. The question is: are these vector spaces of neurons understandable? Can individual neurons be studied? The answer lies in mechanistically interpreting these circuits at a deeper level.

Using the vector space of neurons, and in particular their weights, one can place the weights in the broader context of the network (C. Olah et al., 2018). The challenge of contextualization is a recurring one in understanding neural networks: we can easily observe every activation, every weight, and every gradient; the challenge lies in determining what those values represent. In the Mechanistic Interpretability Notebook, we load a pre-trained VGG16 and select a convolutional layer and a channel inside it. A random input image of a dog is used. We optimize the pixels (via gradient ascent) to maximize the activation of that channel, then display the resulting synthetic image. The result is usually a patterned texture (e.g., stripes, curves, eyes), showing what kind of input strongly excites that channel. BabyGPT has also been analysed mechanistically in BabyGPT_CoT.

Reverse Engineering

Mechanistic interpretability treats a neural network like a compiled program or electronic circuit. Instead of treating it as a black box (input → output correlations), we try to open the hood and explain the exact algorithms being implemented by the weights, neurons, and attention heads. This is reverse engineering in the same sense as analyzing a compiled binary. To demonstrate toy mechanistic interpretability, we can train BabyGPT to learn modular addition, (a + b) mod N. In the Additional BabyGPT Notebook we have implemented modular addition to show which attention heads and neurons implement addition and how to trace the "circuit". This is done with two tools: attention inspection, to visualize which input tokens the last layer attends to, and a linear probe, to check whether hidden states linearly encode (a + b) mod N. This has been adapted from the works of Nanda et al., Progress on Induction Heads, 2022, on toy mechanistic models.
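
The modular addition task itself is easy to set up. Below is a minimal sketch that builds every (a, b, (a + b) mod N) example and splits it into train and test sets. The function name, the 80/20 split and N = 7 are assumptions for illustration, not the notebook's settings.

```python
import random

def modular_addition_dataset(n, train_fraction=0.8, seed=0):
    """Build every (a, b, (a + b) mod n) triple and split into train and test."""
    examples = [(a, b, (a + b) % n) for a in range(n) for b in range(n)]
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]

train, test = modular_addition_dataset(n=7)
```

Holding out some pairs matters here: a model that only memorizes the training pairs will fail on the test set, while one that has learned the underlying circuit will not.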

Chain Of Thought

A step-by-step CoT (addition operation) has been implemented with BabyGPT; see BabyGPT_CoT. We have also computed FLOPs per reasoning token with and without RoPE, and found an increase of around 2% in FLOP utilization using RoPE. BabyGPT has also been implemented with memory.
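
To illustrate what a step-by-step trace for addition can look like, the sketch below spells out column-wise addition with carries for non-negative integers. The step format and function name are hypothetical; the BabyGPT_CoT notebook may use a different format.

```python
def addition_chain_of_thought(a, b):
    """Spell out column-by-column addition with carries as a list of steps."""
    steps, carry, digits = [], 0, []
    xs, ys = str(a)[::-1], str(b)[::-1]  # least significant digit first
    for i in range(max(len(xs), len(ys))):
        da = int(xs[i]) if i < len(xs) else 0
        db = int(ys[i]) if i < len(ys) else 0
        total = da + db + carry
        steps.append(f"{da} + {db} + carry {carry} = {total}")
        digits.append(total % 10)
        carry = total // 10
    if carry:
        digits.append(carry)
    answer = int("".join(str(d) for d in reversed(digits)))
    steps.append(f"answer = {answer}")
    return steps

for line in addition_chain_of_thought(57, 68):
    print(line)
```

Each intermediate line is a reasoning token sequence the model can be trained to emit before the final answer.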

Acknowledgements

  1. Pytorch GPT tutorial
  2. Karpathy's youtube videos and tutorials
  3. Karpathy's Mingpt
  4. O' reilly notes for NLP
  5. lit-llama repository.
  6. llama from facebook
  7. chinchilla paper
  8. karpathy's nn zero to hero.
  9. Lightning AI for Low rank Approximation.
  10. IST das for gptq

LICENSES

Licenses have been updated to include facebookresearch/llama, lit-llama and IST-DASLab. You can use it under GNU, Apache and MIT.

TO DO

  1. look into triton
  2. Fix the readme.md
  3. Inference using libtorch or each separately
  4. look into sentencepiece and tokenizer
  5. look into tinystories

About

Something in the middle of Karpathy's mingpt model and video lectures, BabyGPT is an easy to use model on a much smaller scale (16 and 256 out channels , 5 heads, fine tuned).
