Production-grade pipeline for fine-tuning Large Language Models using LoRA/QLoRA with automated deployment, evaluation, and experiment tracking.
This project implements an end-to-end ML pipeline for fine-tuning open-source LLMs (Llama, Mistral) using parameter-efficient fine-tuning (PEFT) techniques. The pipeline includes:
- Efficient Training: QLoRA (4-bit quantization) for training on consumer GPUs
- Experiment Tracking: MLflow and Weights & Biases integration
- Comprehensive Evaluation: Task-specific metrics and benchmarking
- Production Serving: vLLM-powered high-performance inference API
- Cost Analysis: Detailed ROI calculations vs. commercial APIs
- Parameter-Efficient Fine-tuning with LoRA/QLoRA
- 4-bit Quantization using bitsandbytes for memory efficiency
- Automated Experiment Tracking with MLflow
- Custom Evaluation Metrics for task-specific performance
- Fast Inference Serving with vLLM (50+ tokens/sec)
- Docker Support for reproducible training and deployment
- Cloud Deployment Ready (AWS, GCP compatible)
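To see why LoRA is so memory-efficient, it helps to count trainable parameters. The sketch below assumes Llama-2-7B-style dimensions (hidden size 4096, 32 layers) and the r=16, q/k/v/o target-module setup from the example config in this README; counts for other models or target modules will differ.

```python
# Rough LoRA trainable-parameter count, assuming hidden size 4096, 32 layers,
# and LoRA (r=16) applied only to the four square attention projections.
def lora_trainable_params(hidden_size: int, num_layers: int,
                          num_target_matrices: int, rank: int) -> int:
    # Each adapted (d_out x d_in) weight gains two low-rank factors:
    # A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) parameters.
    per_matrix = rank * (hidden_size + hidden_size)
    return per_matrix * num_target_matrices * num_layers

trainable = lora_trainable_params(4096, 32, 4, 16)
print(f"{trainable:,} trainable params")          # 16,777,216
print(f"{trainable / 6.7e9:.2%} of a 7B model")   # ~0.25%
```

Only these ~17M adapter weights need gradients and optimizer state; the 4-bit base model stays frozen, which is what makes consumer-GPU training feasible.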
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Data Pipeline  │────▶│  LoRA Training  │────▶│   Evaluation    │────▶│   Deployment    │
│   HuggingFace   │     │   QLoRA/PEFT    │     │ Custom Metrics  │     │   vLLM Server   │
└─────────────────┘     └─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │                       │
         ▼                       ▼                       ▼                       ▼
    Validation              MLflow/W&B             Benchmarking            FastAPI/REST
    Augmentation            Experiment             Comparison              Load Balancing
                            Tracking               Cost Analysis
- Python 3.10+
- CUDA-capable GPU (8GB+ VRAM for training, 16GB+ recommended)
- HuggingFace account (for model access)
# Clone the repository
git clone https://github.com/ashwani65/llm-finetuner.git
cd llm-finetuner
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env with your API keys and configuration

# Prepare your dataset
python scripts/prepare_data.py \
--input data/raw/dataset.json \
--output data/processed/
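`prepare_data.py` consumes a JSON dataset; the exact schema is not documented here, so the sketch below uses hypothetical `question`/`sql` field names purely to illustrate converting raw records into instruction-style training examples:

```python
import json

# Hypothetical raw-record shape and the instruction-style record a trainer
# typically consumes; the actual schema used by scripts/prepare_data.py may differ.
def to_instruction_record(raw: dict) -> dict:
    return {
        "instruction": "Generate SQL for the following request.",
        "input": raw["question"],
        "output": raw["sql"],
    }

raw = {
    "question": "Find all users who signed up in 2024",
    "sql": "SELECT * FROM users WHERE YEAR(signup_date) = 2024",
}
record = to_instruction_record(raw)
print(json.dumps(record, indent=2))
```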
# Start training
python scripts/train.py \
--config configs/llama_sql_config.yaml \
--data data/processed/train.json \
--output models/llama-sql-v1
# Monitor training
mlflow ui  # Visit http://localhost:5000

# Evaluate the model
python scripts/evaluate.py \
--model models/llama-sql-v1 \
--test_data data/processed/test.json \
--output evaluation_results.json

# Start vLLM inference server
python -m src.serving.vllm_server \
--model models/llama-sql-v1 \
--port 8000
# Test the API
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "Generate SQL: Find all users who signed up in 2024",
"max_tokens": 128,
"temperature": 0.1
}'

Start the complete stack with frontend and backend:
# Terminal 1: Start Production API
./start_api.sh
# API: http://localhost:8000
# Docs: http://localhost:8000/docs
# Terminal 2: Start Frontend
cd frontend
npm run dev
# UI: http://localhost:3000

Web Interface Features:
- 📊 Real-time training monitoring
- 📁 Dataset upload and management
- ⚙️ Interactive training configuration
- 📈 Evaluation results visualization
- 🚀 Model deployment controls
- 💻 GPU monitoring
llm-finetuner/
├── src/
│ ├── data/ # Data preprocessing & validation
│ │ ├── dataset_builder.py
│ │ ├── preprocessing.py
│ │ └── validation.py
│ ├── training/ # Training pipeline
│ │ ├── trainer.py
│ │ ├── config.py
│ │ └── callbacks.py
│ ├── evaluation/ # Metrics & benchmarking
│ │ ├── metrics.py
│ │ ├── benchmarks.py
│ │ └── comparison.py
│ ├── serving/ # Inference & API server
│ │ ├── vllm_server.py
│ │ ├── production_api.py # Production FastAPI backend
│ │ └── api.py
│ ├── monitoring/ # Experiment tracking
│ │ ├── mlflow_tracking.py
│ │ └── wandb_logging.py
│ └── utils/ # Utilities
│ ├── gpu_utils.py
│ ├── model_utils.py
│ ├── cost_calculator.py
│ ├── database.py # SQLite ORM models
│ └── job_manager.py # Async training jobs
├── configs/ # Configuration files
│ ├── base_config.yaml
│ ├── llama_sql_config.yaml
│ └── mistral_config.yaml
├── scripts/ # Executable scripts
│ ├── prepare_data.py
│ ├── train.py
│ ├── evaluate.py
│ └── deploy.py
├── notebooks/ # Jupyter notebooks
│ ├── 01_data_exploration.ipynb
│ ├── 02_training_experiments.ipynb
│ └── 03_evaluation_analysis.ipynb
├── frontend/ # React web interface
│ ├── src/
│ │ ├── components/ # UI components
│ │ ├── services/ # API integration
│ │ └── App.jsx
│ └── package.json
├── tests/ # Unit tests
├── docker/ # Docker configurations
│ ├── Dockerfile.training
│ ├── Dockerfile.serving
│ └── docker-compose.yml
├── start_api.sh # API startup script
└── requirements.txt
This pipeline supports various fine-tuning tasks:
Fine-tune models to generate SQL queries from natural language.
Input: "Find total revenue by product category in 2024"
Output: "SELECT category, SUM(revenue) FROM sales WHERE year = 2024 GROUP BY category"

Train models to provide code review comments and suggestions.
Create specialized chatbots for specific domains (legal, medical, financial).
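For tasks like text-to-SQL, the prompt template matters as much as the weights. The exact template the training scripts use is not shown in this README, so the field layout below is an assumption for illustration:

```python
# A minimal instruction-style prompt template for the text-to-SQL task;
# the template actually used in training may differ.
SQL_PROMPT = (
    "### Instruction:\nTranslate the request into SQL.\n\n"
    "### Input:\n{question}\n\n"
    "### Response:\n"
)

def build_prompt(question: str) -> str:
    return SQL_PROMPT.format(question=question)

prompt = build_prompt("Find total revenue by product category in 2024")
print(prompt)
```

Whatever template you choose, use the same one at inference time as during training; a mismatch silently degrades output quality.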
lora:
  r: 16                  # LoRA rank
  lora_alpha: 32         # LoRA scaling
  lora_dropout: 0.05
  target_modules:        # Attention modules to apply LoRA
    - "q_proj"
    - "v_proj"
    - "k_proj"
    - "o_proj"

quantization:
  load_in_4bit: true
  bnb_4bit_compute_dtype: "float16"
  bnb_4bit_quant_type: "nf4"
  bnb_4bit_use_double_quant: true

training:
  num_train_epochs: 3
  batch_size: 4
  gradient_accumulation_steps: 4   # effective batch size = 4 × 4 = 16
  learning_rate: 2.0e-4
  warmup_steps: 100
  weight_decay: 0.01

| Model | GPU | VRAM Usage | Training Time | Cost |
|---|---|---|---|---|
| Llama 2 7B (QLoRA) | A100 40GB | ~12GB | 8 hours | $20 |
| Mistral 7B (QLoRA) | A100 40GB | ~11GB | 6 hours | $15 |
| Llama 2 7B (Full) | A100 80GB | ~45GB | 24 hours | $60 |
| Serving Method | Tokens/sec | Latency (p50) | Throughput |
|---|---|---|---|
| vLLM | 55 | 120ms | 100 req/sec |
| HF Transformers | 12 | 850ms | 20 req/sec |
| Model | Exact Match | BLEU | Cost/1K Queries |
|---|---|---|---|
| Fine-tuned Llama 2 7B | 0.73 | 0.68 | $0.05 |
| GPT-4 Turbo | 0.89 | 0.82 | $10.00 |
| GPT-3.5 Turbo | 0.65 | 0.61 | $0.50 |
- GPU: A100 40GB @ $2.50/hour × 8 hours = $20
- Storage: S3/GCS = $2/month
- Total Training: ~$22
- Self-hosted (vLLM on T4): $0.35/hour × 24 × 30 = $252/month
- GPT-4 API: $10 per 1K queries × 5M queries = $50,000/month
- Savings: 99.5%
Break-even point vs. GPT-4: ~2,200 queries ($22 training cost ÷ $9.95 saved per 1K queries)
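The monthly-savings figure above can be reproduced in a few lines. The numbers are this README's examples; actual GPU rates vary by provider and region:

```python
# Monthly cost of an always-on self-hosted GPU vs. a commercial API bill,
# using the README's example figures (T4 at $0.35/hr, $50K/month API spend).
def monthly_savings(gpu_hourly: float, api_monthly: float) -> tuple[float, float]:
    self_hosted = gpu_hourly * 24 * 30      # always-on GPU for one month
    pct = 1 - self_hosted / api_monthly     # fraction saved vs. the API
    return self_hosted, pct

cost, pct = monthly_savings(0.35, 50_000)
print(f"Self-hosted: ${cost:.0f}/month, savings {pct:.1%}")  # $252/month, 99.5%
```

Note the sensitivity to utilization: the API bill scales with query volume, while the self-hosted cost here is fixed, so savings grow with traffic (and shrink if the GPU sits idle).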
# Automatic tracking in training
mlflow.log_params({
"model_name": "llama-2-7b",
"lora_r": 16,
"learning_rate": 2e-4
})
mlflow.log_metrics({
"train_loss": 0.234,
"eval_loss": 0.198
})

# Enable W&B in config
tracking:
  use_wandb: true
  wandb_project: "llm-finetuner"

# Build the training image
docker build -f docker/Dockerfile.training -t llm-finetuner-train .
docker run --gpus all \
-v $(pwd)/data:/app/data \
-v $(pwd)/models:/app/models \
llm-finetuner-train \
python scripts/train.py --config configs/llama_sql_config.yaml

# Build the serving image
docker build -f docker/Dockerfile.serving -t llm-finetuner-serve .
docker run --gpus all -p 8000:8000 \
-v $(pwd)/models:/app/models \
llm-finetuner-serve

# Or run the full stack with docker-compose
docker-compose up -d

from src.evaluation.metrics import SQLEvaluator
evaluator = SQLEvaluator()
results = evaluator.evaluate(predictions, references)
# Returns: exact_match, component_match, BLEU, ROUGE

from src.evaluation.comparison import ModelComparison
comparison = ModelComparison(evaluator)
comparison.add_model("fine-tuned", predictions1, references)
comparison.add_model("gpt-4", predictions2, references)
comparison.plot_comparison(save_path="comparison.png")

from src.utils.cost_calculator import CostCalculator
calculator = CostCalculator()
results = calculator.compare_total_cost(
training_cost=20,
inference_queries=10000,
gpu_type="A100-40GB"
)
print(f"Self-hosted: ${results['self_hosted']['total']:.2f}")
print(f"API cost: ${results['api']['total']:.2f}")
print(f"Savings: ${results['savings']:.2f}")

- Support for Llama 3.2 and Mistral 0.3
- Multi-GPU distributed training
- Automatic hyperparameter tuning
- Integration with LangChain/LlamaIndex
- Support for RLHF (Reinforcement Learning from Human Feedback)
- Kubernetes deployment templates
- Prompt engineering utilities
- Dataset versioning with DVC
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Core ML: PyTorch, Transformers, PEFT, bitsandbytes, Accelerate, TRL
Serving: vLLM, FastAPI, Uvicorn
Experiment Tracking: MLflow, Weights & Biases
Data: Datasets, Pandas, NumPy, scikit-learn
Evaluation: BLEU, ROUGE, Custom Metrics
Deployment: Docker, Kubernetes (optional)
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this project in your research or production, please cite:
@misc{llm-finetuner,
author = {Ashwani Singh},
title = {LLM Fine-tuner: Production-Ready LoRA Fine-tuning Pipeline},
year = {2024},
publisher = {GitHub},
url = {https://github.com/ashwani65/llm-finetuner}
}

- HuggingFace for the Transformers and PEFT libraries
- vLLM team for the high-performance inference engine
- Meta AI for Llama 2
- Mistral AI for Mistral 7B
Ashwani Singh
- GitHub: @ashwani65
- LinkedIn: Ashwani Singh
- Email: [email protected]