A production-ready pipeline for fine-tuning large language models using Unsloth with YAML-based configuration, advanced training features, and web-based model serving.
Want to understand how this pipeline works step-by-step?
Check out my detailed guide on Medium:
Fine-Tuning LLM with Unsloth: A Practical Guide to Training Models like Qwen 2.5
This pipeline provides a streamlined approach to fine-tuning language models.
Key features include:
- Single YAML configuration file for all training parameters
- Smart early stopping with multiple stop conditions
- Memory-optimized training with 4-bit quantization
- LoRA (Low-Rank Adaptation) fine-tuning support
- ChatML conversation format support
- Graceful interruption handling
- Built-in model testing capabilities
- Web-based model serving with Gradio interface
Tested setup:

- Model: unsloth/Qwen3-8B-unsloth-bnb-4bit
- Hardware: 12GB RTX 5070 GPU
- Dataset Format: ChatML (tested and validated)
- Training: Successfully completed with provided configuration
- Serving: Gradio web interface for interactive model testing
Install dependencies using the provided requirements file:

pip install -r requirements.txt

Or install the core packages manually:

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install torch torchvision torchaudio
pip install transformers datasets trl peft accelerate bitsandbytes
pip install PyYAML gradio

Quick start:

- Prepare your dataset in ChatML format (JSONL file)
- Configure training parameters in config.yaml
- Start training:

python finetune.py --config config.yaml

After training, serve your model with a web interface:

python launch_model.py

This will start a Gradio interface at http://localhost:7860 where you can interact with your fine-tuned model.
The pipeline expects JSONL files in ChatML conversation format (pretty-printed here for readability; in the actual file each record occupies a single line):

{
  "messages": [
    {"role": "user", "content": "Your question here"},
    {"role": "assistant", "content": "Model response here"}
  ]
}

The config.yaml file is divided into several sections, each controlling a different aspect of the training process.
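For reference, a minimal sketch of loading this file with PyYAML (variable names are illustrative, not taken from finetune.py):

import yaml

# Load the whole training configuration into a nested dict
with open("config.yaml") as f:
    config = yaml.safe_load(f)

print(config["model"]["name"])  # "unsloth/Qwen3-8B-unsloth-bnb-4bit"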
model:
  name: "unsloth/Qwen3-8B-unsloth-bnb-4bit"
  max_seq_length: 4096
  dtype: null
  load_in_4bit: true
  trust_remote_code: true
  use_cache: false

Parameters:

- `name`: Hugging Face model identifier
- `max_seq_length`: Maximum sequence length for training
- `dtype`: Data type (null for auto-detection)
- `load_in_4bit`: Enable 4-bit quantization for memory efficiency
- `trust_remote_code`: Allow custom model code execution
- `use_cache`: Enable/disable model caching
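These values correspond to Unsloth's model loader. A sketch of the mapping (the exact wiring in finetune.py may differ):

from unsloth import FastLanguageModel

# Load the 4-bit base model as described by the model section above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B-unsloth-bnb-4bit",
    max_seq_length=4096,
    dtype=None,          # null in YAML -> auto-detect
    load_in_4bit=True,
)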
lora:
  r: 8
  alpha: 8
  dropout: 0.0
  bias: "none"
  target_modules: [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj"
  ]
  gradient_checkpointing: "unsloth"
  random_state: 42
  use_rslora: true
  loftq_config: null

Parameters:

- `r`: LoRA rank (dimensionality of the adaptation)
- `alpha`: LoRA alpha parameter for scaling
- `dropout`: Dropout rate for LoRA layers
- `bias`: Bias handling ("none", "all", or "lora_only")
- `target_modules`: List of modules to apply LoRA adaptation to
- `gradient_checkpointing`: Checkpointing strategy
- `random_state`: Random seed for reproducibility
- `use_rslora`: Enable rank-stabilized LoRA
- `loftq_config`: LoftQ configuration (null to disable)
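These map onto Unsloth's PEFT wrapper; a sketch, assuming the `model` loaded above:

# Attach LoRA adapters using the lora section above
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=8,
    lora_dropout=0.0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=True,
    loftq_config=None,
)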
training:
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 8
  num_train_epochs: 1
  learning_rate: 0.0001
  max_steps: -1
  gradient_checkpointing: true
  optimizer: "adamw_8bit"
  bf16: true
  fp16: false
  warmup_steps: 1000
  warmup_ratio: 0.03
  lr_scheduler_type: "cosine"
  logging_steps: 100
  save_strategy: "steps"
  save_steps: 1000
  save_total_limit: 10
  report_to: "none"
  packing: false
  seed: 42

Core Training Parameters:

- `per_device_train_batch_size`: Batch size per GPU device
- `gradient_accumulation_steps`: Steps over which to accumulate gradients
- `num_train_epochs`: Number of training epochs
- `learning_rate`: Initial learning rate
- `max_steps`: Maximum training steps (-1 for epoch-based training)

Optimization Settings:

- `gradient_checkpointing`: Enable gradient checkpointing for memory efficiency
- `optimizer`: Optimizer type ("adamw_8bit" for memory efficiency)
- `bf16`/`fp16`: Mixed-precision training settings

Scheduling:

- `warmup_steps`: Number of warmup steps
- `warmup_ratio`: Warmup ratio (alternative to `warmup_steps`)
- `lr_scheduler_type`: Learning rate scheduler type

Logging and Saving:

- `logging_steps`: Logging frequency in steps
- `save_strategy`: When to save checkpoints ("steps" or "epoch")
- `save_steps`: Save frequency in steps
- `save_total_limit`: Maximum number of checkpoints to keep
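A sketch of how these settings might be wired into a trl SFTTrainer. This follows the classic trl API where `packing` is a trainer argument (newer trl versions move it into SFTConfig); finetune.py's internals may differ:

from transformers import TrainingArguments
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,   # prepared Hugging Face Dataset (see the dataset section)
    packing=False,
    args=TrainingArguments(
        output_dir="./qwen-kubernetes-0.0.8",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=1e-4,
        optim="adamw_8bit",
        bf16=True,
        warmup_steps=1000,
        lr_scheduler_type="cosine",
        logging_steps=100,
        save_strategy="steps",
        save_steps=1000,
        save_total_limit=10,
        report_to="none",
        seed=42,
    ),
)
trainer.train()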
dataset:
  path: "comprehensive_dataset.jsonl"
  format: "jsonl"
  conversation_format: "chatml"
  max_length: 4096
  subset_size: 0

Parameters:

- `path`: Path to the training dataset
- `format`: Dataset file format ("jsonl", "json", "csv")
- `conversation_format`: Format of the conversation data ("chatml", "alpaca")
- `max_length`: Maximum sequence length used for filtering
- `subset_size`: Use a subset of the data (0 for the full dataset)
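A sketch of how this section might drive dataset loading (variable names are illustrative):

from datasets import load_dataset

# Load the JSONL file as a Hugging Face Dataset
dataset = load_dataset("json", data_files="comprehensive_dataset.jsonl", split="train")

# subset_size of 0 means train on the full dataset
subset_size = 0
if subset_size > 0:
    dataset = dataset.select(range(subset_size))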
hardware:
  device_map: "auto"
  environment_variables:
    PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:256"
    TOKENIZERS_PARALLELISM: "false"
    CUDA_VISIBLE_DEVICES: "0"

Parameters:

- `device_map`: Device mapping strategy ("auto" for automatic placement)
- `environment_variables`: CUDA and training environment variables
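Note that these variables only take effect if they are set before CUDA is initialized, so they should be applied before importing torch; a sketch:

import os

# Must happen before the first CUDA allocation to take effect
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"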
smart_training:
  enable_loss_early_stopping: true
  early_stop_patience: 30
  early_stop_min_delta: 0.005
  early_stop_min_steps: 300
  early_stop_check_interval: 25
  target_loss: 0.3
  max_time_minutes: 3000000
  max_steps: null
  dataset_num_proc: 4

Early Stopping Parameters:

- `enable_loss_early_stopping`: Enable loss-plateau detection
- `early_stop_patience`: Steps to wait without improvement before stopping
- `early_stop_min_delta`: Minimum improvement threshold
- `early_stop_min_steps`: Minimum steps before early stopping can trigger
- `early_stop_check_interval`: Frequency of early-stopping checks

Training Limits:

- `target_loss`: Stop training once this loss is reached
- `max_time_minutes`: Maximum training time in minutes
- `max_steps`: Maximum training steps override
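For intuition, here is a sketch of a loss-plateau callback written against the transformers callback API; the pipeline's actual implementation may differ:

from transformers import TrainerCallback

class LossPlateauCallback(TrainerCallback):
    """Illustrative plateau/target-loss stopper (not the pipeline's exact code)."""

    def __init__(self, patience=30, min_delta=0.005, min_steps=300, target_loss=0.3):
        self.patience = patience      # logging events to tolerate without improvement
        self.min_delta = min_delta    # improvements smaller than this do not count
        self.min_steps = min_steps    # never stop before this many steps
        self.target_loss = target_loss
        self.best_loss = float("inf")
        self.stale = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is None or state.global_step < self.min_steps:
            return
        if loss <= self.target_loss:
            control.should_training_stop = True      # target loss reached
        elif loss < self.best_loss - self.min_delta:
            self.best_loss, self.stale = loss, 0     # real improvement
        else:
            self.stale += 1
            if self.stale >= self.patience:
                control.should_training_stop = True  # plateau detected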
output:
  directory: "./qwen-kubernetes-0.0.8"
  save_method: "merged_16bit"
  test_prompts:
    - "How do I create a Kubernetes deployment with 3 replicas and resource limits?"
    - "What's the difference between a Service and an Ingress in Kubernetes?"
    - "How can I debug a pod that's stuck in Pending state?"
    - "Explain Kubernetes ConfigMaps and Secrets with examples"
    - "How do I set up horizontal pod autoscaling based on CPU usage?"

Parameters:

- `directory`: Output directory for the trained model
- `save_method`: Model saving format ("merged_16bit", "merged_4bit", "lora")
- `test_prompts`: List of prompts for post-training validation
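The `save_method` values correspond to Unsloth's merged-save helper, roughly:

# Save the model in the format requested by the output section
model.save_pretrained_merged(
    "./qwen-kubernetes-0.0.8",
    tokenizer,
    save_method="merged_16bit",   # or "merged_4bit" / "lora"
)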
After training your model, you can serve it with an interactive web interface:
python launch_model.py

Features:

- Interactive Chat Interface: Real-time conversation with your fine-tuned model
- Kubernetes Expertise: Pre-configured for Kubernetes-related queries
- Example Prompts: Built-in examples to get started quickly
- Responsive Design: Works on desktop and mobile devices
- Shareable Interface: Option to create public links for sharing
To use the serving script with your own model, update the model path in launch_model.py:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="path/to/your/trained/model",  # Update this path
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=True,
)

The Gradio interface supports several configuration options:
- Temperature: Controls response randomness (0.1-1.0)
- Max Tokens: Maximum response length
- Repetition Penalty: Prevents repetitive responses
- System Prompt: Customizable system instructions
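For orientation, a minimal sketch of such an interface using `gr.ChatInterface` (the `generate_reply` function is hypothetical; launch_model.py's actual wiring may differ):

import gradio as gr

def generate_reply(message, history, temperature, max_tokens):
    # Run inference with the fine-tuned model here (omitted in this sketch)
    return "model response"

demo = gr.ChatInterface(
    fn=generate_reply,
    additional_inputs=[
        gr.Slider(0.1, 1.0, value=0.7, label="Temperature"),
        gr.Slider(64, 2048, value=512, step=64, label="Max Tokens"),
    ],
)
demo.launch(server_port=7860)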
The pipeline includes multiple early stopping mechanisms:
- Loss Plateau Detection: Monitors training loss and stops when improvement plateaus
- Target Loss Achievement: Stops when a specific loss threshold is reached
- Time-based Stopping: Limits training to a maximum time duration
- Step-based Stopping: Limits training to a maximum number of steps
- 4-bit Quantization: Reduces memory usage significantly
- Gradient Checkpointing: Trades computation for memory
- 8-bit Optimizers: Memory-efficient optimization algorithms
- Automatic Memory Management: Built-in garbage collection and cache clearing
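The last point boils down to periodically running something like this sketch:

import gc
import torch

gc.collect()               # drop unreferenced Python objects
torch.cuda.empty_cache()   # return cached CUDA blocks to the allocator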
Training can be safely interrupted (Ctrl+C) while preserving the current model state:
^C
Graceful shutdown initiated...
Saving model...
Model saved to ./output-directory
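One common way to implement this is a try/except around the training loop; a sketch (finetune.py's actual handler may differ):

try:
    trainer.train()
except KeyboardInterrupt:
    print("Graceful shutdown initiated...")
finally:
    print("Saving model...")
    trainer.save_model("./output-directory")  # persist the current state either way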
python finetune.py --config config.yaml [OPTIONS]

Available Options:

- `--config, -c`: Path to the YAML configuration file (required)
- `--output-dir, -o`: Override the output directory from the config
- `--dry-run`: Validate the configuration without starting training
python launch_model.py

The script will automatically start the Gradio interface on http://localhost:7860.
Minimum (training):

- GPU: 12GB VRAM (tested on RTX 5070)
- RAM: 16GB system RAM
- Storage: 20GB free space

Recommended (training):

- GPU: 16GB+ VRAM
- RAM: 32GB+ system RAM
- Storage: SSD with 50GB+ free space

Serving (inference):

- GPU: 8GB+ VRAM (for inference)
- RAM: 8GB+ system RAM
- Network: Stable internet connection for web interface
Reduce memory usage by adjusting these parameters:

training:
  per_device_train_batch_size: 1   # Reduce batch size
  gradient_accumulation_steps: 16  # Increase accumulation steps

model:
  max_seq_length: 2048             # Reduce sequence length

Ensure your JSONL file follows the correct format:
# Validate dataset structure (--json-lines parses one JSON object per line)
head -n 3 your_dataset.jsonl | python -m json.tool --json-lines

If the Gradio interface fails to start:
- Check model path: Ensure the model path in launch_model.py is correct
- GPU memory: Make sure you have enough VRAM for inference
- Port conflicts: Try a different port if 7860 is occupied (see the snippet below)
- Dependencies: Ensure Gradio is installed: pip install "gradio>=4.0.0"
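For the port-conflict case above, Gradio's launch() accepts an explicit port, e.g.:

demo.launch(server_port=7861)  # use any free port instead of the default 7860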
unsloth-pipeline/
├── config.yaml # Main configuration file
├── finetune.py # Training script
├── launch_model.py # Model serving script (NEW!)
├── requirements.txt # Python dependencies (updated)
├── devops-42k/ # Dataset processing example
│ └── devops-42k.ipynb # Data cleaning notebook
└── README.md # Documentation (updated)
# 1. Install dependencies
pip install -r requirements.txt
# 2. Prepare your dataset (see devops-42k/ for example)
python process_dataset.py
# 3. Train the model
python finetune.py --config config.yaml
# 4. Serve the trained model
python launch_model.py

# Test configuration without training
python finetune.py --config config.yaml --dry-run
# Train with custom output directory
python finetune.py --config config.yaml --output-dir ./my-model
# Serve with specific model
python launch_model.py # Edit model path in script first