
Small Language Model

Prerequisites

  • Install all dependencies from requirements.txt:

```
pip install -r requirements.txt
```

  • Thirst to explore 😊

Getting Started

A small language model is a compact AI language model with far fewer parameters (typically millions to a few billion) designed to perform targeted NLP tasks efficiently on limited hardware, trading broad capability for lower latency, cost, and easier deployment.

This repository contains code for experimenting with different libraries and techniques to build a high-performing small language model.

Experiments

  • Experiment 1 contains code using the Unsloth and LLMCompressor libraries. It also uses vLLM for faster inference. Qwen.py is written without any optimizations to set the baseline. All files use the Alpaca-cleaned dataset.

  • Experiment 2 contains code for LoRA fine-tuning of the gpt-oss 20B model. The math dataset used for evaluation comes from https://github.com/ziye2chen/DEMI-MathAnalysis. Gemini 2.5 Pro is used as an LLM-as-a-judge.

Modules Documentation

This directory contains modularized components for Small Language Model (SLM) training, inference, and evaluation. The architecture follows a clean separation of concerns with abstract base classes and specialized implementations.

πŸ“ Directory Structure

```
Modules/
├── core/               # Base classes and interfaces
├── data/               # Data processing utilities
├── evaluation/         # Model evaluation tools
├── inference/          # Inference engines
├── training/           # Training implementations
└── utils/              # Utility functions
```

🔧 Core Components

core/

Contains abstract base classes that define common interfaces for all training and inference implementations. Provides the foundation that other modules inherit from to ensure consistency across different approaches.
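The pattern could look like the following minimal sketch. The class and method names here (`BaseInferenceEngine`, `load_model`, `generate`) are assumptions for illustration only; the actual interfaces in `core/` may differ.

```python
from abc import ABC, abstractmethod


class BaseInferenceEngine(ABC):
    """Hypothetical common interface that concrete engines inherit from.

    Illustrates the abstract-base-class pattern described above; the
    repository's real class names and signatures may differ.
    """

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def load_model(self) -> None:
        """Load model weights and tokenizer."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Generate a completion for a single prompt."""


class EchoEngine(BaseInferenceEngine):
    """Trivial implementation used only to demonstrate the contract."""

    def load_model(self) -> None:
        self.loaded = True

    def generate(self, prompt: str) -> str:
        return prompt.upper()


engine = EchoEngine({"max_new_tokens": 128})
engine.load_model()
print(engine.generate("hello"))  # HELLO
```

Because every trainer and engine implements the same abstract methods, callers can swap implementations without changing the surrounding pipeline code.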

📊 Data Processing

data/

Handles data loading, preprocessing, and formatting for different dataset types. Supports multiple formats including Alpaca instruction datasets, mathematical problem datasets, and generic CSV data. Includes tokenization utilities for model training.
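As an illustration of the formatting step, the sketch below renders one Alpaca-style record into a single prompt string. The field names follow the public Alpaca dataset schema; the exact template used by this repository may differ.

```python
def format_alpaca(example: dict) -> str:
    """Render an Alpaca-style record (instruction/input/output) into one
    training prompt. The optional `input` section is dropped when empty."""
    if example.get("input"):
        return (
            "### Instruction:\n{instruction}\n\n"
            "### Input:\n{input}\n\n"
            "### Response:\n{output}".format(**example)
        )
    return (
        "### Instruction:\n{instruction}\n\n"
        "### Response:\n{output}".format(**example)
    )


record = {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}
print(format_alpaca(record))
```

Formatted strings like this are then passed to the tokenizer to produce the input IDs used in training.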

🧪 Evaluation Tools

evaluation/

Provides model evaluation capabilities using LLM-as-a-Judge methodology with Gemini API integration. Compares model outputs and provides structured scoring for mathematical problem-solving tasks.

Gemini LLM Judge Key Features:

  • Synchronous Evaluation: Standard evaluation using LLMGeminiJudgeEvaluator
  • Asynchronous Evaluation: High-performance async evaluation using LLMGeminiJudgeEvaluatorAsync
  • Batch Processing: Optimized batch processing for both CPU-bound (model generation) and I/O-bound (API calls) operations
  • Memory Management: Intelligent memory cleanup between evaluation phases
  • Rate Limiting: Built-in rate limiting to handle API constraints
  • Intermediate Saving: Automatic saving of evaluation results with timestamps
  • Performance Optimization: Configurable worker pools and batch sizes for optimal throughput
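The async evaluation with rate limiting can be sketched with a semaphore that caps in-flight API calls. The function names here are illustrative, and the `judge_one` stub stands in for a real Gemini API call:

```python
import asyncio


async def judge_one(answer: str, semaphore: asyncio.Semaphore) -> int:
    """Score one answer. A real evaluator would call the Gemini API
    here; this stub returns a length-based placeholder score."""
    async with semaphore:          # cap concurrent API calls
        await asyncio.sleep(0)     # stand-in for the network round trip
        return min(10, len(answer))


async def judge_batch(answers: list[str], max_concurrent: int = 10) -> list[int]:
    """Evaluate answers concurrently while respecting a rate limit,
    mirroring the async, rate-limited evaluation described above."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [judge_one(a, semaphore) for a in answers]
    return await asyncio.gather(*tasks)


scores = asyncio.run(judge_batch(["short", "a longer answer"]))
print(scores)
```

Because API calls are I/O-bound, this concurrency model lets many judgments overlap while the semaphore keeps the request rate within the provider's limits.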

🚀 Inference Engines

inference/

Multiple inference implementations optimized for different use cases:

  • Standard: Compatible transformers-based inference with PEFT support
  • Unsloth: Optimized for 2x faster inference and memory efficiency
  • vLLM: High-throughput serving with GPU optimization and LoRA support
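One way to keep these engines interchangeable is a small config-driven factory. Everything below (the registry, key names, and class names) is a hypothetical sketch, not the repository's actual code:

```python
# Hypothetical registry mapping a config key to an engine class name.
ENGINE_REGISTRY = {
    "standard": "StandardInference",   # transformers + PEFT
    "unsloth": "UnslothInference",     # memory-efficient fast path
    "vllm": "VLLMInference",           # high-throughput serving
}


def pick_engine(config: dict) -> str:
    """Resolve the engine name from a config dict, defaulting to 'standard'."""
    backend = config.get("backend", "standard")
    try:
        return ENGINE_REGISTRY[backend]
    except KeyError:
        raise ValueError(f"Unknown backend: {backend!r}") from None


print(pick_engine({"backend": "vllm"}))  # VLLMInference
```

Selecting the backend through configuration rather than imports keeps experiment scripts identical across the three engines.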

🎓 Training Implementations

training/

Specialized trainers for different optimization approaches:

  • Unsloth: Fastest training with lowest memory usage
  • LoRA: Standard PEFT implementation with broad compatibility
  • LLMCompressor: Quantized training with model compression support

πŸ› οΈ Utilities

utils/

Common utility functions for memory management, CUDA operations, and model diagnostics. Includes tools for monitoring GPU usage, clearing cache, and getting model statistics.
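A minimal sketch of such a cleanup helper is shown below. The function name and return shape are assumptions; the torch import is guarded so the sketch also runs on CPU-only machines.

```python
import gc


def clear_memory() -> dict:
    """Free Python garbage and, when CUDA is available, the GPU cache.

    Illustrative only: the real utility names in this repository may
    differ. Returns simple statistics about what was cleaned up.
    """
    collected = gc.collect()
    stats = {"objects_collected": collected, "cuda_cache_cleared": False}
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()          # release cached GPU blocks
            stats["cuda_cache_cleared"] = True
            stats["allocated_bytes"] = torch.cuda.memory_allocated()
    except ImportError:
        pass  # torch not installed; CPU-only cleanup is still useful
    return stats


print(clear_memory())
```

Calling a helper like this between training, inference, and evaluation phases prevents one phase's cached tensors from starving the next.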

📜 Scripts

Scripts/

End-to-end automation scripts for complete training and evaluation workflows.

script1.py - Complete Pipeline Test Script:

  • Purpose: Simplified end-to-end testing of the entire SLM pipeline
  • Features:
    • Single-parameter configuration for quick testing
    • Complete workflow: Training → Standard Inference → vLLM Inference → Async Evaluation
    • Memory optimization with cleanup between phases
    • Comprehensive logging and error handling
    • JSON output with detailed results
    • Async evaluation for all inference methods

Pipeline Stages:

  1. Model Training: LoRA fine-tuning with configurable parameters
  2. Standard Inference: Traditional transformers-based inference
  3. vLLM Inference: High-performance inference engine
  4. Async Evaluation: Parallel evaluation using Gemini API with optimized batching
  5. Results Compilation: Structured JSON output with comparison metrics

Usage:

```
python Scripts/script1.py
```

🎯 Design Principles

  1. Modularity: Each component has a single responsibility
  2. Extensibility: Easy to add new trainers/inference engines
  3. Consistency: Common interfaces across implementations
  4. Performance: Optimized implementations for different use cases
  5. Flexibility: Configuration-driven behavior

📋 Usage Guidelines

  • Memory Limited: Use Unsloth components for training and inference
  • Speed Critical: Use vLLM for inference, Unsloth for training
  • Model Compression: Use LLMCompressor trainer for quantized models
  • Standard Compatibility: Use standard implementations for broad compatibility
  • Large-Scale Evaluation: Use async evaluation components for efficient batch processing

βš™οΈ Example Configurations

Training Configuration

```python
config = {
    "model_name": "Qwen/Qwen2-0.5B",
    "lora_r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "max_new_tokens": 128,
    "device_map": "auto",
    "seed": 108
}
```

Inference Configuration

```python
config = {
    "base_model": "Qwen/Qwen2-0.5B",
    "dtype": None,
    "device_map": "auto",
    "max_new_tokens": 128,
    "temperature": 0.5,
    "gpu_memory_utilization": 0.3,
    "do_sample": False
}
```

Async Evaluation Configuration

```python
async_config = {
    "cpu_workers": 4,        # CPU-bound operations (model generation)
    "io_workers": 10,        # I/O-bound operations (API calls)
    "cpu_batch_size": 2,     # Batch size for GPU operations
    "io_batch_size": 5       # Batch size for API operations
}
```
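To show how batch-size settings like these might be consumed, the sketch below splits a list of prompts into fixed-size GPU batches. The `chunked` helper is hypothetical, not a function from this repository:

```python
def chunked(items: list, batch_size: int) -> list[list]:
    """Split work into fixed-size batches, as the cpu_batch_size /
    io_batch_size settings suggest. Illustrative helper only."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async_config = {"cpu_workers": 4, "io_workers": 10,
                "cpu_batch_size": 2, "io_batch_size": 5}

prompts = [f"problem-{i}" for i in range(7)]
gpu_batches = chunked(prompts, async_config["cpu_batch_size"])
print(len(gpu_batches))  # 4 batches: 2 + 2 + 2 + 1
```

Smaller GPU batches bound peak memory during generation, while larger I/O batches keep the evaluation API saturated.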

This modular architecture enables flexible experimentation with different training techniques and inference optimizations while maintaining code reusability and clean interfaces.

References

  • Unsloth - Fast and memory-efficient fine-tuning
  • LLMCompressor - Model compression for faster inference
  • vLLM - High-performance LLM serving framework
  • My Notebook - Learnings and experiment results

Contributing

Feel free to raise an issue or submit a pull request if you find any mistakes or have suggestions for improvement. Your contributions are welcome and appreciated!


Happy Coding!
