- Install all dependencies from requirements.txt:

```shell
pip install -r requirements.txt
```

- A thirst to explore
A small language model is a compact AI language model with far fewer parameters (typically millions to a few billion) designed to perform targeted NLP tasks efficiently on limited hardware, trading broad capability for lower latency, cost, and easier deployment.
This repository contains code for experimenting with different libraries and techniques to build a high-performing small language model.
- Experiment 1 contains code using the Unsloth and LLMCompressor libraries, along with vLLM for faster inference. `Qwen.py` is written without any optimizations to set the baseline. All files use the Alpaca-cleaned dataset.
- Experiment 2 contains code for LoRA fine-tuning of the gpt-oss 20B model. A math dataset from https://github.com/ziye2chen/DEMI-MathAnalysis is used for evaluation, with Gemini 2.5 Pro serving as an LLM-as-a-judge.
This directory contains modularized components for Small Language Model (SLM) training, inference, and evaluation. The architecture follows a clean separation of concerns with abstract base classes and specialized implementations.
```
Modules/
├── core/        # Base classes and interfaces
├── data/        # Data processing utilities
├── evaluation/  # Model evaluation tools
├── inference/   # Inference engines
├── training/    # Training implementations
└── utils/       # Utility functions
```
Contains abstract base classes that define common interfaces for all training and inference implementations. Provides the foundation that other modules inherit from to ensure consistency across different approaches.
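As an illustration, the shared interface could look something like the minimal sketch below. The class and method names here are hypothetical, not the repository's actual API:

```python
from abc import ABC, abstractmethod

class BaseTrainer(ABC):
    """Common interface that every trainer implementation inherits from."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def prepare_model(self) -> str:
        """Load and configure the model and tokenizer."""

    @abstractmethod
    def train(self, dataset: list) -> dict:
        """Run the training loop and return summary metrics."""

class LoraTrainer(BaseTrainer):
    """One concrete implementation; Unsloth/LLMCompressor trainers follow the same shape."""

    def prepare_model(self) -> str:
        return f"prepared {self.config['model_name']} with LoRA adapters"

    def train(self, dataset: list) -> dict:
        return {"examples_seen": len(dataset), "seed": self.config.get("seed")}
```

Because every trainer and inference engine conforms to the same base class, pipeline scripts can swap implementations without changing their orchestration code.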
Handles data loading, preprocessing, and formatting for different dataset types. Supports multiple formats including Alpaca instruction datasets, mathematical problem datasets, and generic CSV data. Includes tokenization utilities for model training.
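For instance, rendering an Alpaca-style record into a training prompt could be sketched as follows. The template wording is the commonly used Alpaca prompt; the function name is illustrative:

```python
ALPACA_PROMPT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def format_alpaca(example: dict) -> str:
    """Render one Alpaca-cleaned record as a single training string."""
    return ALPACA_PROMPT.format(
        instruction=example["instruction"],
        input=example.get("input", ""),  # many records have no input field
        output=example["output"],
    )
```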
Provides model evaluation capabilities using LLM-as-a-Judge methodology with Gemini API integration. Compares model outputs and provides structured scoring for mathematical problem-solving tasks.
Gemini LLM Judge Key Features:
- Synchronous Evaluation: Standard evaluation using `LLMGeminiJudgeEvaluator`
- Asynchronous Evaluation: High-performance async evaluation using `LLMGeminiJudgeEvaluatorAsync`
- Batch Processing: Optimized batch processing for both CPU-bound (model generation) and I/O-bound (API calls) operations
- Memory Management: Intelligent memory cleanup between evaluation phases
- Rate Limiting: Built-in rate limiting to handle API constraints
- Intermediate Saving: Automatic saving of evaluation results with timestamps
- Performance Optimization: Configurable worker pools and batch sizes for optimal throughput
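The async pattern behind these features (a concurrency cap on in-flight API calls) can be sketched with a semaphore. `judge_one` below is a stand-in for the real Gemini request, not the library's API:

```python
import asyncio

async def judge_one(sem: asyncio.Semaphore, answer: str) -> dict:
    """Stand-in for a single judge request, gated by the semaphore."""
    async with sem:
        await asyncio.sleep(0.01)  # simulate API latency
        return {"answer": answer, "score": 1.0}

async def judge_all(answers: list, io_workers: int = 10) -> list:
    """Cap concurrent requests at io_workers to respect API rate limits."""
    sem = asyncio.Semaphore(io_workers)
    return await asyncio.gather(*(judge_one(sem, a) for a in answers))

results = asyncio.run(judge_all(["proof A", "proof B", "proof C"]))
```

With `asyncio.gather`, results come back in the same order as the inputs, which keeps scoring aligned with the original problems.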
Multiple inference implementations optimized for different use cases:
- Standard: Compatible transformers-based inference with PEFT support
- Unsloth: Optimized for 2x faster inference and memory efficiency
- vLLM: High-throughput serving with GPU optimization and LoRA support
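Choosing between these engines can be configuration-driven. A hypothetical registry-based factory (all names illustrative) might look like:

```python
# Each builder would construct a real engine; strings stand in for brevity.
def standard_engine(config): return "transformers + PEFT"
def unsloth_engine(config): return "unsloth fast inference"
def vllm_engine(config): return f"vLLM at {config['gpu_memory_utilization']:.0%} GPU memory"

ENGINES = {
    "standard": standard_engine,
    "unsloth": unsloth_engine,
    "vllm": vllm_engine,
}

def make_engine(backend: str, config: dict):
    """Look up and build the requested engine, failing loudly on typos."""
    try:
        return ENGINES[backend](config)
    except KeyError:
        raise ValueError(f"unknown backend {backend!r}; choose from {sorted(ENGINES)}")
```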
Specialized trainers for different optimization approaches:
- Unsloth: Fastest training with lowest memory usage
- LoRA: Standard PEFT implementation with broad compatibility
- LLMCompressor: Quantized training with model compression support
Common utility functions for memory management, CUDA operations, and model diagnostics. Includes tools for monitoring GPU usage, clearing cache, and getting model statistics.
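A defensive cleanup helper in this spirit could look like the sketch below, guarded so it also runs on CPU-only machines (the function name is illustrative):

```python
import gc

def clear_memory() -> None:
    """Release Python garbage, then free cached GPU memory if CUDA is present."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached allocator blocks
            torch.cuda.synchronize()  # wait for pending kernels to finish
    except ImportError:
        pass  # torch not installed: nothing GPU-side to clean
```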
End-to-end automation scripts for complete training and evaluation workflows.
script1.py - Complete Pipeline Test Script:
- Purpose: Simplified end-to-end testing of the entire SLM pipeline
- Features:
- Single-parameter configuration for quick testing
- Complete workflow: Training → Standard Inference → vLLM Inference → Async Evaluation
- Memory optimization with cleanup between phases
- Comprehensive logging and error handling
- JSON output with detailed results
- Async evaluation for all inference methods
Pipeline Stages:
- Model Training: LoRA fine-tuning with configurable parameters
- Standard Inference: Traditional transformers-based inference
- vLLM Inference: High-performance inference engine
- Async Evaluation: Parallel evaluation using Gemini API with optimized batching
- Results Compilation: Structured JSON output with comparison metrics
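Chained together, the stages amount to a simple sequential pipeline. The stub functions below are placeholders for the real stages, shown only to illustrate the flow of artifacts:

```python
def run_pipeline(config: dict) -> dict:
    """Run each stage in order, passing artifacts forward."""
    results = {}
    adapter = train_lora(config)                   # 1. LoRA fine-tuning
    results["standard"] = infer_standard(adapter)  # 2. transformers inference
    results["vllm"] = infer_vllm(adapter)          # 3. vLLM inference
    results["scores"] = evaluate_async(results)    # 4. batched judge evaluation
    return results

# Placeholder stages so the sketch runs end to end.
def train_lora(config): return f"adapter({config['model_name']})"
def infer_standard(adapter): return [f"{adapter}: answer"]
def infer_vllm(adapter): return [f"{adapter}: answer"]
def evaluate_async(outputs): return {k: 1.0 for k in outputs}
```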
Usage:

```shell
python Scripts/script1.py
```

- Modularity: Each component has a single responsibility
- Extensibility: Easy to add new trainers/inference engines
- Consistency: Common interfaces across implementations
- Performance: Optimized implementations for different use cases
- Flexibility: Configuration-driven behavior
- Memory Limited: Use Unsloth components for training and inference
- Speed Critical: Use vLLM for inference, Unsloth for training
- Model Compression: Use LLMCompressor trainer for quantized models
- Standard Compatibility: Use standard implementations for broad compatibility
- Large-Scale Evaluation: Use async evaluation components for efficient batch processing
```python
config = {
    "model_name": "Qwen/Qwen2-0.5B",
    "lora_r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "max_new_tokens": 128,
    "device_map": "auto",
    "seed": 108
}
```

```python
config = {
    "base_model": "Qwen/Qwen2-0.5B",
    "dtype": None,
    "device_map": "auto",
    "max_new_tokens": 128,
    "temperature": 0.5,
    "gpu_memory_utilization": 0.3,
    "do_sample": False
}
```

```python
async_config = {
    "cpu_workers": 4,     # CPU-bound operations (model generation)
    "io_workers": 10,     # I/O-bound operations (API calls)
    "cpu_batch_size": 2,  # Batch size for GPU operations
    "io_batch_size": 5    # Batch size for API operations
}
```

This modular architecture enables flexible experimentation with different training techniques and inference optimizations while maintaining code reusability and clean interfaces.
- Unsloth - Fast and memory-efficient fine-tuning
- LLMCompressor - Model compression for faster inference
- vLLM - High-performance LLM serving framework
- My Notebook - Learnings and experiment results
Feel free to raise an issue or submit a pull request if you find any mistakes or have suggestions for improvement. Your contributions are welcome and appreciated!
Happy Coding!