BabyGPT 🐤

The models
The Roadmap 🚀
Low Rank Approximation
- Quantization on low rank approximation
We support LLaMa for BabyGPT ⚡ ⚡
Quantization
Our result
Performance Benchmark
Files
Run 🏃
Auto Mixed Precision
Train and Generate 🏃
Results 📋
Data
Running the notebooks
Mechanistic Interpretability
Chain Of Thought
Acknowledgements
Licenses

Building on the intuition of Karpathy's ng-video-lectures and minGPT, BabyGPT provides a working model of a GPT on a much smaller scale (256 as well as 16 out channels, 5 layer GPT, fine-tuned). BabyGPT has been built from a toyGPT, which was made to understand transformers from scratch, and has been scaled down, as you will see below. Visit the notebooks: we scale up to transformers from simple language models and attention mechanisms, and finally to BabyGPT. toyGPT has been built by separating all the layers of a transformer individually, while in BabyGPT the attention mechanism is implemented manually. The purpose of building smaller GPTs is to understand transformer functions at a much more granular level. Based on the above work, we have written a paper; you can find it in the paper folder above.

The models

To train the small models we use TinyStories. You can download the weights from Hugging Face. We set max_iters to 5000 on a Tesla T4. For the original (OG) model, we use 256 out channels.

model	context length	n_layers	n_head	n_embd	train loss	val loss	parameters	data
15M	16	4	4	16	2.4633	2.4558	13k	stories15M.bin
42M	32	8	8	32	2.3772	2.3821	1.01M	stories42M.bin
BabyGPT Original	64	8	8	256	1.3954	1.5959	6.37M	data

Note: The 110M model is omitted for now because the RAM blew up.

The Roadmap 🚀

If you wish to understand the nitty-gritty of how transformers work from scratch, this roadmap will guide you. We start by implementing simple language models (bigram and ngram), and from there work our way up to building transformers, a GPT from scratch, and finally BabyGPT.

We also implement a low rank approximation and port lit-llama to BabyGPT. Finally, we train the models and generate tokens.

Low rank approximation

Low rank approximation is a compression technique that improves parameter efficiency. A LoRa_model.py has been added (on 256 out channels). We get a parameter reduction of about a factor of 2. All we need to do is set a rank parameter and compute the attention accordingly. In the LoRa notebook, an estimation of FLOPs has been done according to the Chinchilla paper.
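To make the parameter saving concrete, here is a minimal sketch that counts the weights of one square projection before and after it is factored into two thin matrices. The function names and the rank of 64 are assumptions chosen for illustration; they are not taken from LoRa_model.py, and the reduction for a whole model depends on which layers are factored.

```python
def dense_params(d_in, d_out):
    """Weights in a full d_in x d_out projection (bias ignored)."""
    return d_in * d_out


def low_rank_params(d_in, d_out, rank):
    """Weights when the projection is factored as (d_in x rank) @ (rank x d_out)."""
    return d_in * rank + rank * d_out


n_embd = 256   # out channels, as in the 256-channel BabyGPT
rank = 64      # hypothetical rank, chosen here only for illustration

full = dense_params(n_embd, n_embd)
factored = low_rank_params(n_embd, n_embd, rank)
print(f"full: {full}, factored: {factored}, reduction: {full / factored:.1f}x")
```

A smaller rank gives a larger reduction, at the cost of a less expressive projection.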

Quantization on low rank approximation

Quantization has also been performed on the LoRa model, and a calculation of FLOPs has been added as well. For the BabyGPT model with 256 out channels we get 0.407 PetaFLOPs. The quantization has been added to the LoRa notebook. In terms of size reduction, we currently get a reduction by a factor of 1.3.
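A common approximation in the scaling-law literature is that training compute is about 6 FLOPs per parameter per token. The sketch below applies it under stated assumptions: the batch size of 32 is hypothetical, and the notebook may use different inputs, so the printed value is not expected to match the figure above exactly.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute with C ~= 6 * N * D (forward + backward)."""
    return 6.0 * n_params * n_tokens


n_params = 6.37e6   # BabyGPT Original, from the models table
# Hypothetical token budget: iterations * batch size * context length.
max_iters, batch_size, context_length = 5000, 32, 64
n_tokens = max_iters * batch_size * context_length

flops = training_flops(n_params, n_tokens)
print(f"{flops / 1e15:.3f} PFLOPs")
```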

We support LLaMa for BabyGPT ⚡ ⚡

1. LLaMa version -1 🦙

An implementation of the lit-llama model has been ported to BabyGPT (based on LLaMA version 1). You can find the notebook here -> llama_implementation. Run the model with llama\python llama_model_v1.py; MFU has also been added. Training and token generation are covered below.

Note: We have ported build_rope_cache(), apply_rope() and RMSNorm() from version 1. We are not using the version 1 weights or checkpoints (these are for much larger models: 7B, 13B, 65B etc.). You can download the weights and port llama to your own version.
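For readers new to these two building blocks, here is a plain-Python sketch of RMS normalization and rotary position embeddings on a single vector. The helper names rms_norm, rope_angles and apply_rope_1d are hypothetical and are not the ported functions; the adjacent-pair channel layout and the base of 10000 are assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rope_angles(position, head_dim, base=10000.0):
    """One rotation angle per channel pair: position * base^(-2i / head_dim)."""
    return [position * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope_1d(x, position, base=10000.0):
    """Rotate adjacent channel pairs (x[2i], x[2i+1]) by the angle for this position."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(x), base)):
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * math.cos(theta) - b * math.sin(theta),
                a * math.sin(theta) + b * math.cos(theta)]
    return out


q = [1.0, 2.0, 3.0, 4.0]                 # made-up query vector
print(rms_norm(q, weight=[1.0] * 4))
print(apply_rope_1d(q, position=0))      # position 0 leaves the vector unchanged
print(apply_rope_1d(q, position=3))
```

Because each pair is only rotated, the length of the vector is preserved, and the position shows up purely as a change of angle.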

2. LLaMa2 🦄

We have ported llama2 by Meta into BabyGPT. You can find the implementation at llama\python llama2.py. We have also provided a calculation of FLOPs along with the model.

The FLOPs needed to compute the k, v cache are 2 * 2 * num_layers * (embedded_dim)^2 per token. You can find more information on how to compute memory and compute in kipply's blog. MFU has been added to llama2.
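The formula can be written as a small helper. This is only a sketch: the function name is hypothetical, and the example plugs in the layer count and embedding size of BabyGPT Original from the models table rather than any llama2 configuration.

```python
def kv_flops_per_token(num_layers, embedded_dim):
    """FLOPs to compute the keys and values for one token across all layers.

    One factor of 2 covers the two projections (k and v); the other covers the
    multiply and the add in each multiply-accumulate of a matmul.
    """
    return 2 * 2 * num_layers * embedded_dim ** 2


# BabyGPT Original from the models table: 8 layers, 256 embedding channels.
print(kv_flops_per_token(num_layers=8, embedded_dim=256))
```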

Note: We are not using the original llama weights by Meta. We are also using arbitrary values for 70B. You can port it to your own model using your own weights.

Tokenization

Tokenization using sentencepiece has been done. We export a tokenizer.bin, unlike the tokenizer in the quant folder. Run it with llama\python tokenizer.py (meta-pieces added). We can use the .bin file for further inference.

LLaMA with Model FLOP Utilization(MFU) ⚡

We need efficient memory usage for LLMs. Hardware accelerators are commonly evaluated with Hardware FLOPs Utilization (HFU), which is typically an estimate of the ratio of FLOPs observed on a given device to its theoretical peak FLOPs. Model FLOPs Utilization (MFU) is the ratio of the observed throughput (tokens per second) to the theoretical maximum throughput of a system operating at peak FLOPs. The theoretical peak matmul throughput of a Tesla T4 is around 8.1 TFLOPS, and we calculate the MFU of the LLaMA trainer model against it. See the trainer notebook under LLaMA-trainer. We get an MFU of 0.0527723427% on 3.22M parameters. We would expect this to increase as the number of parameters increases. For a 530B parameter model, MFU is around 30% on A100 GPUs. We use Section B from the PaLM paper for reference.
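As a rough sketch of the calculation, MFU can be computed from a per-token FLOP estimate and a measured throughput. The per-token estimate below uses the widely used form 6N plus an attention term of 12 * layers * heads * head_dim * seq_len. The model shape and the throughput are made-up values, so the printed number is not the result reported above.

```python
def flops_per_token(n_params, n_layers, n_heads, head_dim, seq_len):
    """Training FLOPs per token: 6N for the weights plus 12*L*H*Q*T for attention."""
    return 6 * n_params + 12 * n_layers * n_heads * head_dim * seq_len


def model_flops_utilization(tokens_per_second, per_token_flops, peak_flops):
    """Observed FLOPs per second divided by the device's theoretical peak."""
    return tokens_per_second * per_token_flops / peak_flops


T4_PEAK_FLOPS = 8.1e12   # peak figure quoted above for the Tesla T4

# Hypothetical model shape and throughput, for illustration only.
per_token = flops_per_token(n_params=3.22e6, n_layers=4, n_heads=4,
                            head_dim=64, seq_len=64)
mfu = model_flops_utilization(tokens_per_second=2000.0,
                              per_token_flops=per_token,
                              peak_flops=T4_PEAK_FLOPS)
print(f"MFU: {mfu * 100:.4f}%")
```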

Quantization

LLMs require many GPUs to run, so we need to find ways to reduce these requirements while preserving the model's performance. Various techniques have been developed to shrink the model size, such as quantization and distillation. It has been found that instead of using 4-byte FP32 precision, we can get an almost identical inference outcome with 2-byte BF16/FP16 half precision, which halves the model size.

To shrink the model further, 8-bit quantization was introduced. This method uses quarter precision, thus needing only 1/4th of the model size! But it's not done by just dropping another half of the bits. There's a lot more to this topic. Look at Hugging Face quantization.
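To illustrate why 8-bit quantization is more than dropping bits, here is a minimal absmax sketch: every value is rescaled so that the largest magnitude maps to 127, rounded to an integer code, and later rescaled back. This is one simple scheme chosen for illustration, not necessarily what the notebooks use, and the weights are made up. It assumes at least one non-zero value.

```python
def absmax_quantize(values):
    """Map floats to int8 codes in [-127, 127] using one scale for the whole list."""
    scale = 127.0 / max(abs(v) for v in values)
    codes = [int(round(v * scale)) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the int8 codes."""
    return [c / scale for c in codes]


weights = [0.12, -0.98, 0.45, 0.003, -0.31]   # made-up weights
codes, scale = absmax_quantize(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"largest round-trip error: {max_error:.5f}")
```

The round-trip error is bounded by half a quantization step, which is why a single outlier weight (a large maximum) makes every other value less precise.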

You can see quant.md on how to perform llama quantization, and the quantization notebook for a beginner's introduction to quantization. Different benchmarks have been run. For example, on a GPU the 7B parameter model in bfloat16 takes about 15GB, whereas BabyGPT takes only a few kilobytes. quantization.py has been obtained from the lit-llama repo. A tokenizer using sentencepiece has been added as well. Different kinds of weight operations can be performed from the tokenizer.model.
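The memory figure for the 7B model follows from simple arithmetic: number of parameters times bytes per parameter. The sketch below counts the weights alone, so it comes out slightly under the quoted figure; the helper name and the dtype table are assumptions made for illustration.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_size_gb(n_params, dtype):
    """Size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp32", "bf16", "int8"):
    print(f"7B parameters in {dtype}: {weight_size_gb(7e9, dtype):.1f} GB")
```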

Our result

For post training quantization, we are able to reduce the model size by a factor of almost 4.

import torch

# BabyGPTmodel and config are assumed to be defined already (see the quantization notebook)
model_fp = BabyGPTmodel(config)
model_fp.eval()
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp,  # the original model
    {torch.nn.Linear},  # a set of layers to dynamically quantize
    dtype=torch.qint8)

Output:
number of parameters: 3222637
12.9688 MB
3.4603 MB

Note: Just for quantization we are using a bigger model with about 3.22M parameters.

Performance Benchmark

Performance benchmarking has been done on the BabyGPTmodel and the quantized model. The results have been added to the quantization notebook in the quant folder.

Files

BabyGPT
├── bigram_lm.py
├── ngram_lm.py
├── model.py
├── Lora_model.py
├── Llama_model.py
├── Attention
│   ├── dot product attention.py
│   ├── multi headed attention.py
│   ├── cross attention.py
│   ├── spatial attention.py
├── Notebook
│   ├── Dot product attention
│   ├── multiheaded attention
│   ├── gpt from scratch
│   ├── spatial transformer
│   ├── babyGPT
│   ├── LoRa
│   ├── llama_implementation
│   ├── mixed precision
│   ├── mechanistic Interpretability
│   ├── Additional BabyGPT
│   ├── BabyGPT CoT
├── Train
│   ├── babygpt_trainer.py
│   ├── llama_trainer.py
├── transformers
│   ├── transformer_model.py
│   ├── babyGPT.py
├── Quant
│   ├── quantization.py
│   ├── quantization notebook
│   ├── tokenizer.model
│   ├── tokenizer.vocab
│   ├── tokenizer.py
│   ├── model.pth
│   ├── quant.md
├── llama
│   ├── llama2.py
│   ├── llama_model_v1.py
│   ├── tokenizer.py
│   ├── tokenizer.vocab
│   ├── tokenizer.bin
│   ├── tokenizer.model
├── text.txt
├── trainer.ipynb
├── requirements.txt

Run 🏃

Clone the Repo and run the following:

! git clone https://github.com/soumyadip1995/BabyGPT.git

To run the bigram and ngram language models, use python bigram_lm.py and python ngram_lm.py.

To run BabyGPT, use transformers\python babygpt.py from the transformers folder.

To run a simple transformer model, use python transformer_model.py.

To run a low rank approximation model, use python LoRa_model.py.

To run the llama model, use llama\python llama_model_v1.py.

To run llama2, use llama\python llama2.py.

Run the different attention mechanisms from the Attention folder.

Auto Mixed Precision

A very preliminary auto mixed precision (FP16/FP32) has been added. It can be achieved with a CUDA-enabled GPU. A combination of PyTorch's autocast() and GradScaler() is used for mixed precision. See more in the PyTorch tutorial. Unfortunately the GPU blew up during training, and the CPU for now only supports bfloat16, which takes a very long time to train. If anyone can improve upon it, that would be awesome. Check the Mixed Precision notebook.
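The reason a gradient scaler is paired with FP16 can be shown without a GPU. The sketch below uses Python's struct module to round values to half precision: a very small gradient underflows to zero, but survives if it is multiplied by a scale factor first and divided by it afterwards in full precision. The gradient value and the scale factor are made up for illustration, and this is a demonstration of the idea rather than of PyTorch's implementation.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_grad = 1e-8        # made-up gradient value, too small for FP16
loss_scale = 65536.0    # hypothetical scale factor (2**16)

unscaled = to_fp16(tiny_grad)               # underflows to zero in FP16
scaled = to_fp16(tiny_grad * loss_scale)    # survives the FP16 round trip
recovered = scaled / loss_scale             # unscale in full precision

print(unscaled, recovered)
```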

Train and Generate 🏃

If you wish to get started on BabyGPT and llama, but don't want to go through the hassle of learning all about transformer models, you can simply start by running the code from the train folder.

To train and generate text from both the BabyGPT model and the LLaMA model, run train\python babygpt_trainer.py and train\python llama_trainer.py from the train folder.

Both have been trained on a Tesla T4 GPU. You can increase or decrease the value of max_iters as you wish. Training takes a few minutes.

Results 📋

You can see the result from both the models in the trainer notebook.

Result from the llama trainer.

Number of params = 3.22 M

``` number of parameters: 3222381 ```

step 0: train loss 4.6894, val loss 4.6895
step 500: train loss 2.1731, val loss 2.1832
step 1000: train loss 1.7580, val loss 1.8032
step 1500: train loss 1.5790, val loss 1.6645
step 2000: train loss 1.4482, val loss 1.5992
step 2500: train loss 1.3538, val loss 1.5874
step 3000: train loss 1.2574, val loss 1.5971
.
.
.
step 9000: train loss 0.5236, val loss 2.4614
step 9500: train loss 0.4916, val loss 2.5494
step 10000: train loss 0.4680, val loss 2.6631
step 10500: train loss 0.4448, val loss 2.6970
step 10999: train loss 0.4341, val loss 2.7462

Detroit, revior myself 'til I confused to get the big clead Mastles
Slaughterhouse on the blue, that's when he pine I'm hop with the cowprinton
robaly I want to a lox on my tempt

But now we can't never find a gift killed broke
Big before anyone could ever hear the first as I was cooped chill
But i this o for a big star
I said get chased up!
(Hello darkness, my old friend)[Eminem:]
If my legacy I acged buving in the tub (might what?)
I would know one [*Barrns, worried :]
Yeah, so kon bitch, it's

The validation loss is lowest around step 2500 (1.5874) and rises afterwards while the train loss keeps falling, so the model starts to overfit well before the end. That will need more modification. Spitting some Eminem yo..:smile:
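One way to read the log above is to pick the evaluation step with the lowest validation loss, which is where an early-stopping rule would keep its checkpoint. The sketch below does this for the rows shown in the log only; the steps elided between 3000 and 9000 are not included, and the helper name is hypothetical.

```python
# (step, train loss, val loss) rows copied from the llama trainer log.
log = [
    (0, 4.6894, 4.6895), (500, 2.1731, 2.1832), (1000, 1.7580, 1.8032),
    (1500, 1.5790, 1.6645), (2000, 1.4482, 1.5992), (2500, 1.3538, 1.5874),
    (3000, 1.2574, 1.5971), (9000, 0.5236, 2.4614), (9500, 0.4916, 2.5494),
    (10000, 0.4680, 2.6631), (10500, 0.4448, 2.6970), (10999, 0.4341, 2.7462),
]


def best_checkpoint(rows):
    """Return the row with the lowest validation loss."""
    return min(rows, key=lambda row: row[2])


step, train_loss, val_loss = best_checkpoint(log)
print(f"lowest val loss {val_loss} at step {step}")
```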

Data

The data folder contains the text document which has the lyrics to all of Eminem's songs.

Running the notebooks

Notebook	Description
Dot product attention	colab
Multi headed attention	colab
GPT from scratch	colab (approx. 860k parameters)
Spatial Transformers	colab
BabyGPT	colab (16, 256 out channels)
LoRa	colab (256 out channels)
lit-llama for BabyGPT	colab (16 out channels for lit-llama)
trainer for BabyGPT and llama	colab (16 out channels for BabyGPT, 256 out channels for llama)
Mechanistic Interpretability Code	colab
Additional BabyGPT	colab
BabyGPT_CoT	colab

text.txt is based on Eminem's Stan.

Mechanistic Interpretability

Features are one of the most fundamental properties of neural networks. If you think of the activations of the neurons in a given layer as a vector space, a direction vector in that space can correspond to a feature. Even though individual neurons can be studied, these features can be connected by weights, forming circuits.

The circuits can be thought of as computational graphs that consist of a set of features and the weighted edges that go between them in the original network. The question here is: are these vector spaces of neurons understandable? Can these individual neurons be studied? The answer lies in mechanistically interpreting these circuits at a deeper level.

Making use of the vector space of neurons, in particular their weights, one can contextualize the weights in the broader context of the network (C. Olah et al., 2018). The challenge of contextualization is a recurring one in understanding neural networks: we can easily observe every activation, every weight, and every gradient; the challenge lies in determining what those values represent. In the Mechanistic Interpretability notebook, we use an example where we load a pre-trained VGG16 and select a convolutional layer and a channel inside it. A random input image of a dog is used. We optimize the pixels (via gradient ascent) to maximize the activation of that channel. Then the resulting synthetic image is displayed. The result is usually a patterned texture (e.g., stripes, curves, eyes), showing what kind of input strongly excites that channel. BabyGPT has also been analysed mechanistically in BabyGPT_CoT.
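The gradient-ascent idea can be shown on a toy neuron without any image model. The sketch below is a stand-in for the VGG16 example, not a copy of it: the neuron is a single tanh unit with made-up weights, the input is kept on the unit sphere so it cannot grow without bound, and all function names are hypothetical.

```python
import math


def neuron(w, x):
    """Toy neuron: tanh of the dot product between weights and input."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def grad_wrt_input(w, x):
    """d neuron / d x = (1 - tanh(w.x)^2) * w."""
    slope = 1.0 - neuron(w, x) ** 2
    return [slope * wi for wi in w]


def normalize(x):
    """Rescale the input to unit length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def maximize_activation(w, x0, lr=0.5, steps=200):
    """Gradient ascent on the input, with the weights held fixed."""
    x = normalize(x0)
    for _ in range(steps):
        g = grad_wrt_input(w, x)
        x = normalize([xi + lr * gi for xi, gi in zip(x, g)])
    return x


w = [1.0, -2.0, 0.5]   # made-up weights for one neuron
x_star = maximize_activation(w, x0=[1.0, 1.0, 1.0])
print([round(v, 3) for v in x_star], round(neuron(w, x_star), 3))
```

The optimized input ends up pointing along the weight vector, which is the toy analogue of the textures that emerge when a convolutional channel is maximized.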

Reverse Engineering

Mechanistic interpretability treats a neural network like a compiled program or electronic circuit. Instead of treating it as a black box (input → output correlations), we try to open the hood and explain the exact algorithms being implemented by the weights, neurons, and attention heads. This is reverse engineering in the same sense as analyzing a compiled binary. To demonstrate toy mechanistic interpretability, we can train BabyGPT to learn modular addition, (a + b) mod N. In the Additional BabyGPT notebook we have implemented modular addition to show which attention heads/neurons implement addition and how to trace the “circuit.” This has been done using two tools: attention inspection, to visualize which input tokens the last layer attends to, and a linear probe, to check whether the hidden states linearly encode (a + b) mod N. This has been adapted from the work of Nanda et al., Progress on Induction Heads, 2022, on toy mechanistic models.
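The modular addition task has a small, fully enumerable dataset, which is part of what makes it convenient for this kind of analysis. A minimal sketch of building it is shown here; the function name and the modulus of 7 are assumptions for illustration and may differ from the notebook.

```python
def modular_addition_dataset(modulus):
    """All (a, b) pairs with the target (a + b) mod modulus."""
    return [((a, b), (a + b) % modulus)
            for a in range(modulus) for b in range(modulus)]


N = 7   # hypothetical modulus, kept small for illustration
dataset = modular_addition_dataset(N)
print(len(dataset), dataset[:3])
```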

Chain Of Thought

A step-by-step CoT (addition operation) has been implemented with BabyGPT; see BabyGPT_CoT. We have also computed FLOPs per reasoning token with and without RoPE. We found around a 2% increase in FLOP utilization using RoPE. BabyGPT has also been implemented along with memory.
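As an illustration of what a step-by-step addition trace can look like, the sketch below writes out column-wise addition with carries, one line per digit. The trace format and the function name are hypothetical and are not taken from the BabyGPT_CoT notebook.

```python
def addition_chain_of_thought(a, b):
    """Spell out column-wise addition with carries, one reasoning step per digit."""
    steps, digits = [], []
    carry, place = 0, 0
    while a > 0 or b > 0 or carry:
        da, db = a % 10, b % 10
        total = da + db + carry
        steps.append(f"place 10^{place}: {da} + {db} + carry {carry} = {total}")
        digits.append(total % 10)
        carry = total // 10
        a, b, place = a // 10, b // 10, place + 1
    answer = int("".join(str(d) for d in reversed(digits))) if digits else 0
    steps.append(f"answer: {answer}")
    return steps


for line in addition_chain_of_thought(47, 85):
    print(line)
```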

Acknowledgements

PyTorch GPT tutorial
Karpathy's YouTube videos and tutorials
Karpathy's minGPT
O'Reilly notes for NLP
lit-llama repository
llama from Facebook
Chinchilla paper
Karpathy's nn zero to hero
Lightning AI for low rank approximation
IST-DASLab for GPTQ

LICENSES

Licenses have been updated to include facebookresearch/llama, lit-llama and IST-DASLab. You can use it under GNU, Apache and MIT.

TO DO

look into triton
Fix the readme.md
Inference using libtorch or each separately
look into sentencepiece and tokenizer
look into tinystories
