This project provides a framework for training, running inference with, and fine-tuning Large Language Models (LLMs) on custom datasets. It uses PyTorch and the Hugging Face Transformers library to implement a decoder-only transformer model.
- Decoder-Only Transformer Model: Implementation of a decoder-only transformer architecture suitable for language modeling tasks.
- Customizable Hyperparameters: Easily configurable model and training parameters, including embedding dimension, number of attention heads, and number of transformer blocks.
- Efficient Data Handling: Utilizes iterable datasets and data loaders for memory-efficient processing of large text datasets.
- Training and Validation: Includes training and validation loops with loss and perplexity monitoring.
- Text Generation: Provides a text generation function for sampling from the trained model.
- Model Saving and Loading: Functionality to save and load trained models and tokenizers.
- ✅ Decoder-Only Transformer Model
- ✅ Customizable Hyperparameters
- ✅ Efficient Data Handling
- ✅ Training and Validation
- ✅ Text Generation
- ✅ Model Saving and Loading
- ✅ Streaming data from storage to main memory
- ✅ Training script
- ✅ Inference script
- ✅ Data downloading
- KV Cache Optimization
- Dynamic Batching
- Prefill
- Speculative Decoding
- Python 3.7+
- PyTorch
- Hugging Face Transformers
- tqdm
- regex
- urllib
You can install the necessary dependencies using pip:
    pip install -r requirements.txt
A requirements.txt file listing all required dependencies is provided in this repository.
- Clone the repository:
    git clone <repository_url>
    cd ai4india
- Install the dependencies:
    pip install -r requirements.txt
- Build the Docker image:
    docker build -t ai4india .
- Run the Docker container:
    docker run -it ai4india
This will build a Docker image named ai4india and run it. The container will execute the train.py script by default.
- The project expects a text dataset in a .tar.gz archive containing train.txt and test.txt files. Each line in these files should represent a sentence.
- You can specify the data URL in the train.py script.
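The expected archive layout can be produced with the standard library alone. The sketch below is illustrative and not part of the repository: the helper names `build_archive` and `stream_sentences` are hypothetical, and it assumes train.txt and test.txt sit at the root of the archive, which the project's loader may or may not require.

```python
import io
import tarfile
from pathlib import Path
from typing import Iterator, List


def build_archive(train_lines: List[str], test_lines: List[str], out_path: Path) -> None:
    """Pack two one-sentence-per-line text files into a .tar.gz archive.

    Assumption: both files are stored at the archive root.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        for name, lines in (("train.txt", train_lines), ("test.txt", test_lines)):
            data = ("\n".join(lines) + "\n").encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # tarfile needs the byte length up front
            tar.addfile(info, io.BytesIO(data))


def stream_sentences(archive_path: Path, member: str) -> Iterator[str]:
    """Yield one sentence at a time instead of reading the whole file."""
    with tarfile.open(archive_path, "r:gz") as tar:
        handle = tar.extractfile(member)
        if handle is None:
            raise FileNotFoundError(member)
        for raw in handle:
            sentence = raw.decode("utf-8").strip()
            if sentence:  # skip blank lines
                yield sentence
```

Calling `build_archive(["first sentence", "second sentence"], ["held-out sentence"], Path("my_dataset.tar.gz"))` writes an archive in this layout, and `stream_sentences` can be used to check its contents before pointing the training script at it.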
To train the model, run the train.py script:
    python train.py
- Hyperparameters: The training script uses hyperparameters defined in the get_hyperparameters function within utils/model_utils.py. You can modify these values to customize the training process. Key hyperparameters include:
  - emb_dim: Embedding dimension
  - num_heads: Number of attention heads
  - num_blocks: Number of transformer blocks
  - batch_size: Batch size
  - learning_rate: Learning rate
  - num_epochs: Number of epochs
  - context_size: Maximum sequence length
- Training Process: The script downloads the dataset (if not already present), preprocesses it, and trains the decoder language model. It also performs periodic validation and saves the trained model and tokenizer to the models directory.
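For readers unfamiliar with the perplexity figure reported during validation, the sketch below shows the standard definition: the exponential of the mean token-level cross-entropy, using natural logarithms. It is a dependency-free illustration, not the project's code; the function names are hypothetical, and the project's own computation (for example, how padding tokens are excluded) may differ.

```python
import math
from typing import Sequence


def mean_cross_entropy(target_probs: Sequence[float]) -> float:
    """Average negative log-likelihood.

    `target_probs` holds the probability the model assigned to each
    correct next token (hypothetical input format for this sketch).
    """
    if not target_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def perplexity(mean_loss: float) -> float:
    """Perplexity is exp(mean cross-entropy) when the loss uses natural log."""
    return math.exp(mean_loss)
```

A model that spreads its probability evenly over four candidate tokens at every step has a perplexity of about 4, which is why the metric is often read as an effective branching factor. Lower values indicate a better fit to the validation text.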
To test the trained model, run the test.py script:
    python test.py
- The script loads the trained model and tokenizer from the models directory and generates text based on predefined prompts.
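Text generation with a decoder-only model repeatedly picks a next token from the model's output scores and appends it to the prompt. The sketch below shows one common way to make that pick, temperature-scaled sampling, in plain Python. It is an assumption about the general technique, not a description of how test.py or generate_text chooses tokens, and the names `softmax` and `sample_next_token` are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(
    logits: Sequence[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Draw one token index in proportion to its softmax probability."""
    if rng is None:
        rng = random.Random()
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

In a full generation loop, the chosen index would be appended to the input sequence and the model run again until a length limit or an end-of-sequence token is reached.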
- transformer.py: This file is currently empty. It could be used to define custom transformer components or configurations.
- train.py: Contains the main training loop, data loading, model initialization, and saving logic.
- test.py: Contains the model testing and text generation logic.
- utils/data_utils.py: Implements data loading, preprocessing, and batching utilities.
- utils/model_utils.py: Implements model definition, weight initialization, training utilities (loss computation, perplexity), and saving/loading functions.
The core of this project is the DecoderLanguageModel class, defined in utils/model_utils.py. It consists of the following components:
- Embedding Layer: Maps input tokens to high-dimensional embeddings.
- Decoder Blocks: Stacked transformer decoder blocks, each containing:
  - RMSNorm: Root Mean Square Layer Normalization for stable training.
  - MultiHeadAttention: Multi-head self-attention mechanism.
  - MLP: A multi-layer perceptron (feed-forward network).
- Output Layer: Projects the final hidden states to the vocabulary space.
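As a concrete illustration of the RMSNorm component listed above, the sketch below applies the standard formulation to a single vector: divide each element by the root mean square of the vector, then multiply by a learned per-element gain. It is written in plain Python to stay dependency-free. The project's version in utils/model_utils.py is presumably a PyTorch module operating on the last tensor dimension, and the `eps` value here is an assumption.

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], gain: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Scale `x` to roughly unit root-mean-square, then apply `gain`.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    if len(x) != len(gain):
        raise ValueError("x and gain must have the same length")
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)  # eps guards against division by zero
    return [v * scale * g for v, g in zip(x, gain)]
```

With a gain of all ones, the output keeps the direction of the input and has a root mean square close to 1, which keeps activation magnitudes stable across stacked blocks.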
- download_and_prepare_data(url, batch_size, tokenizer, max_length) (in utils/data_utils.py): Downloads the dataset, extracts the training and testing files, and creates data loaders.
- DecoderLanguageModel(vocab_size, emb_dim, num_heads, num_blocks, pad_idx) (in utils/model_utils.py): Initializes the decoder-only transformer model.
- generate_text(model, start_string, tokenizer, device, max_length) (in utils/model_utils.py): Generates text from a given starting string using the trained model.
- save_model(model, tokenizer, model_name) (in utils/model_utils.py): Saves the trained model and tokenizer.
- load_model(model_name, device=None) (in utils/model_utils.py): Loads a pre-trained model and tokenizer.
- Dataset: To use your own dataset, modify the download_and_prepare_data function in utils/data_utils.py to load and preprocess your data.
- Model Architecture: You can modify the DecoderLanguageModel class in utils/model_utils.py to experiment with different model architectures.
- Training Parameters: Adjust the hyperparameters in the get_hyperparameters function in utils/model_utils.py to optimize training for your specific dataset and task.
This project is licensed under the MIT License - see the LICENSE file for details.