The main goal of this small project is to teach myself how things are built from scratch, and I hope to convince at least one person that they, too, could build anything from scratch. Andrej Karpathy's llm.c and micrograd were the projects that motivated me to build this.
- Multi-dimensional arrays and tensors are just flat 1-dimensional arrays plus strides, which let us index rows and columns however we like.
- Learned a lot about the C language, including memory management, parallel processing, and memory access patterns. This is just the second thing I built in C, the first one being a basic password manager.
- Derived the backpropagation for layers like LayerNorm and the attention mechanism, which improved my mathematical ability a lot.
- Learned how to memory-map files and use them as a kind of virtual memory, since the activations and parameters (humongous, something like ~20GB) were too large to keep in RAM.
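To give a flavor of the derivations involved, here is one standard form of the LayerNorm backward pass (notation is mine and may differ from the code's):

```latex
\text{Forward, over a vector } x \in \mathbb{R}^D:\quad
\mu = \tfrac{1}{D}\textstyle\sum_i x_i,\quad
\sigma^2 = \tfrac{1}{D}\textstyle\sum_i (x_i - \mu)^2,\quad
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}},\quad
y_i = \gamma_i \hat{x}_i + \beta_i

\text{Backward, with } g_i = \frac{\partial L}{\partial y_i}:\quad
\frac{\partial L}{\partial \gamma_i} = g_i \hat{x}_i,\qquad
\frac{\partial L}{\partial \beta_i} = g_i,

\frac{\partial L}{\partial x_i}
= \frac{1}{\sqrt{\sigma^2 + \epsilon}}
  \left( \gamma_i g_i
       - \frac{1}{D}\sum_j \gamma_j g_j
       - \frac{\hat{x}_i}{D}\sum_j \gamma_j g_j \hat{x}_j \right)
```

The two subtracted sums come from differentiating through the mean and the variance, which is what makes normalization layers trickier than plain element-wise ops.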
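The strides idea above can be sketched in a few lines of C. The helper name `get2d` is mine, not the project's; it just shows how one flat buffer plus two strides covers both a matrix and its transpose:

```c
#include <assert.h>
#include <stddef.h>

/* A 2-D "tensor" is a flat buffer plus strides: element (i, j) lives at
 * data[i * row_stride + j * col_stride]. For a row-major R x C matrix,
 * row_stride = C and col_stride = 1; swapping the two strides yields the
 * transpose with zero data movement. */
static inline float get2d(const float *data,
                          size_t row_stride, size_t col_stride,
                          size_t i, size_t j) {
    return data[i * row_stride + j * col_stride];
}
```

Higher-rank tensors work the same way, with one stride per dimension.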
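The mmap trick above can be sketched like this. The function name `map_floats` and the file layout are illustrative, not the project's actual code; the point is that the OS pages the file in and out on demand instead of the process holding ~20GB resident:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file of floats into the address space read-only. The kernel
 * treats the file itself as backing store, so only the pages actually
 * touched need to be in physical RAM at any moment. */
float *map_floats(const char *path, size_t *count) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    float *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after closing the fd */
    if (p == MAP_FAILED) return NULL;
    *count = (size_t)st.st_size / sizeof(float);
    return p;
}
```

For writable state (e.g. activations), the same idea works with `PROT_READ | PROT_WRITE` and `MAP_SHARED`, letting the kernel flush dirty pages back to disk.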
It was fun building something like this.
tokenize the training data (needs tiktoken):

```shell
pip install tiktoken
python3 prep_data.py
```
compile and run:

```shell
# macOS (uses Apple Accelerate for fast matmul)
gcc -O3 -DACCELERATE_NEW_LAPACK -o train ai.c -lm -framework Accelerate

# Linux with OpenMP
gcc -O3 -march=native -funroll-loops -fopenmp -o train ai.c -lm

./train
```
decode generated tokens:

```shell
python3 decode.py "464,1182,286,..."
```
- `-O3`: aggressive optimizations
- `-march=native`: CPU-specific optimizations
- `-funroll-loops`: loop unrolling for potential speed improvements
- `-fopenmp`: OpenMP support for parallel processing
- `-framework Accelerate`: Apple's BLAS for fast matrix multiplication (macOS only)
- `-DACCELERATE_NEW_LAPACK`: use the updated CBLAS interface on macOS
This implementation is far from optimal; there is plenty of room for improvement.
- Improve the matrix multiplication.
- Improve the attention mechanism and its backprop, as it consumes a large share of the training time.
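As a starting point for the matmul item, one of the cheapest wins on the naive O(N^3) loop is reordering it to i-k-j so the inner loop walks both B and C contiguously. This is a hedged sketch (square matrices, no blocking or SIMD), not the project's implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C = A * B for n x n row-major matrices. The i-k-j order keeps the
 * innermost loop streaming over contiguous rows of B and C, which is
 * far more cache-friendly than the textbook i-j-k order. */
void matmul(const float *A, const float *B, float *C, int n) {
    memset(C, 0, (size_t)n * n * sizeof(float));
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            float a = A[i * n + k];  /* hoisted: constant in inner loop */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}
```

Further steps in the usual progression are blocking/tiling to fit the cache, OpenMP over the outer loop, and ultimately deferring to a tuned BLAS as the Accelerate build already does.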