Demonstration code to create a GPT-2-style, decoder-only, generative small LAnguage Model that can be built on personal computing hardware.
This is not for production. You can use this code to learn about generative language models, preprocessing, training, and model hyperparameters.
git clone git@github.com:bioteam/sLAM.git
cd sLAM
pip3 install .
Complete the installation:
>python3
>>> import nltk
>>> nltk.download('punkt_tab')
nltk is used for sentence tokenization.
> python3 sLAM/make-slam.py -h
usage: make-slam.py [-h] [-t TEXT_PERCENTAGE] [--context_size CONTEXT_SIZE] [-n NAME] [--temperature TEMPERATURE]
[--epochs EPOCHS] [--d_model D_MODEL] [-d DOWNLOAD] [--num_rows NUM_ROWS] [--use_mlflow]
-p PROMPT [-v]
options:
-h, --help show this help message and exit
-t TEXT_PERCENTAGE, --text_percentage TEXT_PERCENTAGE
Percentage of download used to make dataset
-m MIN_SENTENCE_LEN, --min_sentence_len MIN_SENTENCE_LEN
Minimum sentence length used to make the dataset
-n NAME, --name NAME Name used to save files, default is timestamp of start time
--temperature TEMPERATURE
Temperature used for generation
--epochs EPOCHS Number of epochs
--d_model D_MODEL     Dimensionality of embeddings and internal representations
--context_size CONTEXT_SIZE
Context size
-d DOWNLOAD, --download DOWNLOAD
Dataset to download. Default is cc_news.
--num_rows NUM_ROWS Number of rows to download from cc_news
--use_mlflow          Use MLFlow for model tracking
-p PROMPT, --prompt PROMPT
Prompt
-v, --verbose         Verbose
The code uses cc_news (the default) or wikitext-2-v1 from Hugging Face as training text.
Download and clean training data from cc_news, tokenize it into large chunks, create a model, train the model using context-window-sized slices for 3 epochs, be verbose, and try the given prompt:
python3 sLAM/make-slam.py --num_rows 500 -v --epochs 3 -p "This is a test"
This creates a Keras model (built from ~1M input tokens), a saved (serialized) tokenizer with the same name, and a histogram of sentence lengths. For example:
-rw-r--r-- 332M Apr 1 05:09 04-01-2025-05-09-04.keras
-rw-r--r-- 58K Apr 1 05:09 04-01-2025-05-09-04.pickle
-rw-r--r-- 19K Mar 31 16:04 sentence_length_distribution.png
One epoch takes about an hour on a Mac M1 laptop (32 GB RAM) with the command above. However, considerably more text than that is needed before the model generates syntactically and semantically correct English.
Supply the name of the model and the serialized tokenizer, and a prompt:
python3 sLAM/generate.py -n 04-01-2025-05-09-04 -p "This is a test"
This is a test if your favorite software is the news service for the bottom of the
increasing equipment market is actually plans for their concerns and the narrative
of the same time i think it was the course of the technology is that the 5th us and
i think what we are the most youre doing it we do to do that you want what to avoid
the first amendment and other candidates are not just as the most.
The sLAM code implements a small GPT-2-style transformer language model using TensorFlow/Keras. Here's a detailed explanation of the key components and how they work together:
The model is a decoder-only transformer - a type of neural network architecture that excels at understanding and generating sequential text. Unlike encoder-decoder models used for translation, this architecture focuses purely on text generation by predicting the next token in a sequence.
Token Embeddings: Convert each word token into a dense vector representation (e.g., 256-dimensional). These embeddings learn to capture semantic meaning - similar words end up with similar vector representations.
Positional Embeddings: Since transformers process all tokens simultaneously (unlike RNNs), they need explicit position information. Positional embeddings encode where each token appears in the sequence, allowing the model to understand word order and syntax.
The final input representation is: Token Embedding + Positional Embedding
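As an illustration, the two embedding lookups can be combined in a single Keras layer. This is a minimal sketch; the class and variable names are illustrative, not taken from the sLAM source:

```python
import tensorflow as tf
from tensorflow import keras

class TokenAndPositionEmbedding(keras.layers.Layer):
    """Sum of a learned token embedding and a learned positional embedding."""
    def __init__(self, vocab_size, context_size, d_model):
        super().__init__()
        self.token_emb = keras.layers.Embedding(vocab_size, d_model)
        self.pos_emb = keras.layers.Embedding(context_size, d_model)

    def call(self, token_ids):
        # token_ids: (batch, seq_len) integer IDs from the tokenizer
        seq_len = tf.shape(token_ids)[-1]
        positions = tf.range(seq_len)          # 0, 1, ..., seq_len - 1
        return self.token_emb(token_ids) + self.pos_emb(positions)
```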
The core innovation of transformers is self-attention, which allows each token to "attend to" or focus on other relevant tokens in the sequence. Multi-head attention runs several attention operations in parallel:
- Query (Q), Key (K), Value (V) matrices: Each token is projected into these three representations
- Attention computation: Attention(Q, K, V) = softmax(QK^T / √d_k) V
- Multiple heads: Different attention heads can focus on different types of relationships (syntax, semantics, etc.)
- Causal masking: Ensures the model can only attend to previous tokens during generation, not future ones
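The attention formula above, with the causal mask applied, can be written directly in TensorFlow. The following is a simplified single-head sketch for illustration; in Keras you would typically reach for keras.layers.MultiHeadAttention with use_causal_mask=True instead:

```python
import tensorflow as tf

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal (lower-triangular) mask.
    q, k, v: tensors of shape (batch, seq_len, d_k)."""
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)   # (batch, seq, seq)
    seq_len = tf.shape(scores)[-1]
    # Mask attention to future positions so token i sees only tokens <= i
    causal_mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    scores += (1.0 - causal_mask) * -1e9
    weights = tf.nn.softmax(scores, axis=-1)                    # attention weights
    return tf.matmul(weights, v)
```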
Each transformer block contains:
- Multi-head attention layer
- Feed-forward network (2 dense layers with GELU activation)
- Residual connections: Help with gradient flow during training
- Layer normalization: Stabilizes training and improves performance
The model stacks multiple transformer blocks (default: 4 layers) to build increasingly complex representations.
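A hypothetical sketch of one such block in Keras, using the stock MultiHeadAttention layer (the exact layer ordering and names in sLAM may differ; this is a post-norm variant for illustration):

```python
from tensorflow import keras

class TransformerBlock(keras.layers.Layer):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.attn = keras.layers.MultiHeadAttention(
            num_heads=n_heads, key_dim=d_model // n_heads)
        self.ffn = keras.Sequential([
            keras.layers.Dense(d_ff, activation="gelu"),
            keras.layers.Dense(d_model),
        ])
        self.norm1 = keras.layers.LayerNormalization()
        self.norm2 = keras.layers.LayerNormalization()
        self.drop = keras.layers.Dropout(dropout)

    def call(self, x, training=False):
        # Causal multi-head self-attention + residual connection + layer norm
        attn_out = self.attn(x, x, use_causal_mask=True)
        x = self.norm1(x + self.drop(attn_out, training=training))
        # Feed-forward network + residual connection + layer norm
        ffn_out = self.ffn(x)
        return self.norm2(x + self.drop(ffn_out, training=training))
```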
- vocab_size (50,000): Number of unique tokens the model can understand/generate
- context_size (32): Maximum sequence length the model can process at once (the "memory window")
- d_model (256): Dimensionality of embeddings and internal representations
- n_heads (4): Number of parallel attention heads
- n_layers (4): Number of transformer blocks stacked together
- d_ff (1024): Hidden layer size in the feed-forward networks
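Continuing the sketches above, these hyperparameters plug together roughly as follows (build_slam_model is a hypothetical name, not the actual sLAM function):

```python
from tensorflow import keras

def build_slam_model(vocab_size=50_000, context_size=32, d_model=256,
                     n_heads=4, n_layers=4, d_ff=1024):
    """Assemble embeddings, stacked transformer blocks, and an output head."""
    inputs = keras.Input(shape=(context_size,), dtype="int32")
    x = TokenAndPositionEmbedding(vocab_size, context_size, d_model)(inputs)
    for _ in range(n_layers):
        x = TransformerBlock(d_model, n_heads, d_ff)(x)
    logits = keras.layers.Dense(vocab_size)(x)   # next-token logits per position
    return keras.Model(inputs, logits)
```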
- Text Cleaning: Selects high-quality text from the downloaded dataset (cc_news or wikitext)
- Tokenization: Converts text to integer token IDs using Keras TextVectorization
- Sequence Creation: A sliding-window approach creates input/target pairs for next-token prediction (see the sketch below):
  - Input: [token1, token2, token3, token4]
  - Target: [token2, token3, token4, token5]
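A minimal sketch of the sliding-window step (simplified; the real preprocessing operates on the tokenized cc_news text):

```python
import numpy as np

def make_sequences(token_ids, context_size=32):
    """Build (input, target) pairs where the target is the input shifted by one."""
    inputs, targets = [], []
    for i in range(len(token_ids) - context_size):
        inputs.append(token_ids[i : i + context_size])
        targets.append(token_ids[i + 1 : i + context_size + 1])
    return np.array(inputs), np.array(targets)

# make_sequences([1, 2, 3, 4, 5], context_size=4)
# inputs  -> [[1, 2, 3, 4]]
# targets -> [[2, 3, 4, 5]]
```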
- Loss Function: Sparse Categorical Crossentropy (predicts next token from vocabulary)
- Optimization: Adam optimizer with polynomial learning rate decay
- Mixed Precision: Uses float16 for faster training while maintaining float32 for stability
- Regularization: Dropout (10% by default) prevents overfitting
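Put together, the training configuration looks roughly like the following sketch (the learning-rate and decay values are placeholders, not the script's actual defaults):

```python
from tensorflow import keras

# Mixed precision: compute mostly in float16, keep variables in float32
keras.mixed_precision.set_global_policy("mixed_float16")

# Adam with a polynomial learning-rate decay schedule
lr_schedule = keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1e-3, decay_steps=10_000, end_learning_rate=1e-5)

model = build_slam_model()   # from the sketch above
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=lr_schedule),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```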
The code includes several custom callbacks for monitoring training stability:
- Numerical Stability Callback: Detects NaN/infinite values and extreme weights
- Validation Callback: Tracks performance on held-out data
- MLFlow Integration: Logs metrics, parameters, and model artifacts for experiment tracking
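For example, a numerical-stability callback takes only a few lines of Keras; the version below is a simplified stand-in for the one in sLAM (names and thresholds are illustrative):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

class NumericalStabilityCallback(keras.callbacks.Callback):
    """Stop training when the loss goes NaN/inf; warn on extreme weights."""
    def on_train_batch_end(self, batch, logs=None):
        loss = (logs or {}).get("loss")
        if loss is not None and not np.isfinite(loss):
            print(f"Non-finite loss at batch {batch}; stopping training.")
            self.model.stop_training = True

    def on_epoch_end(self, epoch, logs=None):
        for w in self.model.weights:
            if tf.reduce_max(tf.abs(w)) > 1e4:   # arbitrary "extreme" threshold
                print(f"Extreme values in weight {w.name}; check the learning rate.")
```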
During generation, the model:
- Encodes the input prompt into token IDs
- Predicts probability distribution over all possible next tokens
- Applies temperature scaling: Controls randomness (lower = more deterministic, higher = more creative)
- Samples next token from the probability distribution
- Updates context window by sliding tokens left and adding the new token
- Repeats until desired length or end token is reached
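A simplified sketch of that loop, assuming token ID 0 is padding (function names are illustrative; see sLAM/generate.py for the real implementation):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0):
    """Temperature-scaled sampling from the next-token logits."""
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

def generate(model, prompt_ids, context_size=32, max_new_tokens=50, temperature=0.8):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_size:]
        if len(window) < context_size:            # left-pad short prompts
            window = [0] * (context_size - len(window)) + window
        logits = model.predict(np.array([window]), verbose=0)[0, -1]
        ids.append(sample_next_token(logits, temperature))
    return ids
```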
- GPU Memory Growth: Prevents TensorFlow from allocating all GPU memory at once
- Asynchronous Memory Allocation: Uses CUDA malloc for better GPU memory management
- Data Pipeline Optimization: Uses tf.data with prefetching and shuffling for efficient data loading
- Checkpointing: Saves model state during training to prevent loss of progress
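A sketch of the data-pipeline and checkpointing pieces (batch size, buffer size, and paths are placeholders):

```python
import tensorflow as tf

# Let GPU memory grow on demand instead of reserving it all at startup
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

# inputs/targets are the arrays from the sliding-window sketch above
dataset = (
    tf.data.Dataset.from_tensor_slices((inputs, targets))
    .shuffle(10_000)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

# Save weights periodically so an interrupted run can resume
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "checkpoints/slam-{epoch:02d}.weights.h5", save_weights_only=True)
```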
This sLAM model is significantly smaller than production models like GPT-3 in several key dimensions:
Parameter Count:
- sLAM: ~1-5 million parameters (depending on vocab_size and d_model settings)
- GPT-3: 175 billion parameters (35,000x larger)
- GPT-2 small: 117 million parameters (still 100x larger than sLAM)
Model Dimensions:
- sLAM: 256 embedding dimensions, 4 layers, 4 attention heads
- GPT-3: 12,288 embedding dimensions, 96 layers, 96 attention heads
- GPT-2 small: 768 embedding dimensions, 12 layers, 12 attention heads
Context Window:
- sLAM: 32 tokens (very limited memory)
- GPT-3: 2,048 tokens (64x larger context)
- Modern models: up to 1M+ tokens
Training Data:
- sLAM: Thousands of text samples (megabytes)
- GPT-3: ~45TB of internet text data
Compute Requirements:
- sLAM: Trainable on consumer hardware (few GB RAM, optional GPU)
- GPT-3: Required thousands of high-end GPUs and months of training
Despite being much smaller, this architecture demonstrates the core principles behind modern large language models and provides a learning platform for understanding transformer-based text generation without requiring massive computational resources.
One of the challenges in writing and running Deep Learning code is how many components there are, and how quickly new versions replace old versions. To get all your component versions aligned, start with your hardware, which may include a GPU. For example, if it's an NVIDIA GPU, what is the recommended version of CUDA? From that CUDA version, find the recommended version of TensorFlow or PyTorch, and from that package version, the required version of Python. An example set of versions working with an older NVIDIA GPU:
RTX 5000 + CUDA 11.8 + Tensorflow 2.17 + Python 3.8
Then all the other Python dependencies (e.g. pandas, numpy) will follow from the Python version.
Getting these versions aligned is critical, because if the versions are out of alignment you may get errors of various kinds that do not reference versions and are difficult to debug, like out-of-memory or data shape errors.
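A quick way to confirm what you actually have installed is to ask TensorFlow itself (the cuda_version field is only present in GPU builds):

```python
import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("Built against CUDA:", tf.sysconfig.get_build_info().get("cuda_version"))
print("GPUs visible:", tf.config.list_physical_devices("GPU"))
```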
Containers may be available that package all the right versions of CUDA, Python, and TensorFlow. In this example we're computing at the Texas Advanced Computing Center (TACC) and pulling a TensorFlow GPU container:
srun -N 1 -n 10 -p rtx-dev -t 60:00 --pty bash
module load tacc-apptainer
apptainer pull docker://tensorflow/tensorflow:2.17.0-gpu
Or just:
docker pull tensorflow/tensorflow:2.17.0-gpu
Then you can run your script with singularity:
singularity exec --nv tensorflow_2.17.0-gpu.sif python3 scripts/mnist_convnet.py