Last Updated: 2024-10-26
Version: 0.1.0-alpha
This document is an honest account of known limitations, edge cases, and areas for improvement. We believe in transparency over marketing hype.
- Maturity & Production Readiness
- Memory & Resource Issues
- Training Quality Issues
- Model Compatibility
- Distributed Training Limitations
- Test Coverage & CI/CD
- Roadmap
What works well:
- ✅ Core LoRA fine-tuning (tested on 1B-7B models)
- ✅ GGUF conversion for Ollama
- ✅ Interactive wizard for ease of use
- ✅ CPU and GPU automatic detection
- ✅ Distributed training with SOLLOL (basic)
What needs improvement:
- ⚠️ Limited testing: Only tested by 1-2 developers on specific hardware
- ⚠️ No automated tests: Manual testing only
- ⚠️ No CI/CD: No continuous integration or quality gates
- ⚠️ Documentation gaps: Some edge cases not documented
- ⚠️ No production deployments: Unproven at scale
Risks:
- Early adopter risk: You're using experimental software
- Memory issues: OOM errors possible during merge step
- Dataset quality: Small datasets produce broken models
- Limited support: Community-only (no SLA)
Problem: Training completes successfully, but the process is killed (exit code 137) during the LoRA adapter merge.
Root Cause: Merging LoRA adapters into the base model requires loading both the adapters and the full base model in memory simultaneously.
Memory Requirements:
| Model Size | Training RAM | Merge RAM | Total RAM Needed |
|---|---|---|---|
| 1-3B | 6-8 GB | 4-6 GB | 12-14 GB |
| 7B | 12-16 GB | 8-10 GB | 20-24 GB |
| 13B | 24-32 GB | 16-20 GB | 40-48 GB |
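To compare this table against your own machine, check total and available RAM with standard Linux tooling:

```bash
free -h    # total, used, and available RAM (and swap)
```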
Your System:
- RAM: 15.3 GB total
- Training 7B model: ~12-16 GB used
- Merge step: Needs additional 8-10 GB
- Result: OOM kill (exit code 137)
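To confirm that the merge step was OOM-killed rather than failing for another reason, check the kernel log (standard Linux tools; exit code 137 = 128 + SIGKILL):

```bash
dmesg | grep -i "out of memory"      # look for the kernel OOM killer
journalctl -k | grep -i oom          # same check on systemd-based systems
```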
Workarounds:
Option 1: Skip GGUF conversion
```bash
python llamaforge.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --data examples/datasets/alpaca_1k.jsonl \
  --no-gguf   # Saves in HuggingFace format only
```

Option 2: Add swap space
```bash
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```

Option 3: Use smaller models
- TinyLlama-1.1B: Works reliably on 16GB systems
- Qwen2.5-3B: Works on 16GB with `--no-gguf`
- Llama-7B: Requires 24GB+ RAM
Option 4: Reduce sequence length
```bash
--max-length 256   # Instead of 512
```

Planned fix: v0.2.0 will add streaming merge to reduce peak memory usage.
Problem: Training hangs or crashes with CUDA out-of-memory errors.
Root Cause: Model + gradients + optimizer states exceed GPU VRAM.
VRAM Requirements:
| Model Size | LoRA (r=8) | LoRA (r=16) | Full Fine-Tuning |
|---|---|---|---|
| 1-3B | 6-8 GB | 8-10 GB | 12-16 GB |
| 7B | 12-16 GB | 16-20 GB | 24-32 GB |
| 13B | 24-28 GB | 28-32 GB | 40-48 GB |
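To compare against your GPU, `nvidia-smi` (shipped with the NVIDIA driver) reports total and used VRAM:

```bash
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```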
Workarounds:
Option 1: Reduce batch size
```bash
--batch-size 1
--gradient-accumulation 8   # Effective batch size = 8
```

Option 2: Reduce LoRA rank

```bash
--lora-r 4   # Instead of 8 or 16
```

Option 3: Reduce sequence length

```bash
--max-length 256   # Instead of 512
```

Option 4: Use gradient checkpointing (automatic)
- LlamaForge enables this by default
- Trades compute for memory (25% slower, 40% less VRAM)
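These options can be combined. A hedged sketch of one low-VRAM invocation, using only flags that appear elsewhere in this document (verify exact flag names with `python llamaforge.py --help`):

```bash
python llamaforge.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --data examples/datasets/alpaca_1k.jsonl \
  --batch-size 1 \
  --gradient-accumulation 8 \
  --lora-r 4 \
  --max-length 256
```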
Problem: Models trained on <100 samples output incoherent garbage.
Example:
Input: "Hello, what are you?"
Output: "戥系统戥 Cosby戥 Cosby戥 Cosby..."
Root Cause: Severe overfitting to tiny dataset, loss of general language understanding.
Minimum Dataset Sizes:
| Samples | Result | Use Case |
|---|---|---|
| <100 | ❌ Broken garbage | Do not use |
| 100-500 | ⚠️ Unreliable | Emergency only |
| 500-1000 | ✅ Basic fine-tuning | Quick experiments |
| 1K-5K | ✅ Good quality | Standard use |
| 5K-10K | ✅✅ High quality | Production |
| 10K+ | ✅✅✅ Best quality | Professional |
Solution: Use provided datasets:
- `examples/datasets/alpaca_1k.jsonl` - 1,000 samples (minimum recommended)
- `examples/datasets/alpaca_full.jsonl` - 52,000 samples
- `examples/datasets/code_alpaca_full.jsonl` - 20,000 coding samples
Detection: If model outputs gibberish after training, you used too few samples.
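A quick pre-flight check is to count samples before training; a minimal sketch, assuming the dataset is standard JSONL (one JSON object per line):

```bash
wc -l examples/datasets/alpaca_1k.jsonl   # aim for 1000+ lines per the table above
```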
Problem: Loss stays flat or increases during training.
Common Causes:
- Learning rate too high
  - Default: 2e-4
  - Try: 1e-4 or 5e-5
  - Symptoms: Loss increases or oscillates wildly
- Learning rate too low
  - Try: 3e-4 or 5e-4
  - Symptoms: Loss decreases very slowly or not at all
- Dataset formatting issues
  - Check that `instruction`/`prompt` and `output`/`completion` fields exist (a quick check follows this list)
  - Verify samples are not empty or malformed
- Model already optimal
  - Base model may already be well-suited to your task
  - Further training may not help
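For the dataset formatting cause above, a hedged sketch of a field check, assuming JSONL input, `jq` installed, and the field names listed above:

```bash
# Count samples missing an instruction/prompt field or an output/completion field; expect 0
jq -c 'select(((.instruction // .prompt) == null) or ((.output // .completion) == null))' \
  examples/datasets/alpaca_1k.jsonl | wc -l
```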
Debugging:
```bash
# Enable loss logging
python llamaforge.py ... --logging-steps 10
# Watch for loss trends
```

Tested and Working:
- ✅ Llama 2 (7B, 13B)
- ✅ Llama 3 / 3.1 (8B)
- ✅ Mistral (7B)
- ✅ CodeLlama (7B, 13B)
- ✅ Qwen 2.5 (1.5B, 3B, 7B)
- ✅ TinyLlama (1.1B)
Untested / May Not Work:
- ⚠️ Falcon models
- ⚠️ GPT-NeoX
- ⚠️ BLOOM
- ⚠️ Gemma (may require special handling)
- ⚠️ Mixtral (MoE architecture)
Known Incompatible:
- ❌ GPT-2 / GPT-J (different architecture)
- ❌ BERT / RoBERTa (encoder-only models)
- ❌ T5 (encoder-decoder)
If you encounter errors:
- Check model architecture is causal LM (decoder-only)
- Verify model is available on HuggingFace
- Try a known-working model first to verify setup
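One hedged way to verify a model is a decoder-only causal LM before training, assuming its `config.json` is public on HuggingFace (substitute your model ID; requires `curl` and `jq`):

```bash
# Causal LMs report an architecture name ending in "ForCausalLM"
curl -s https://huggingface.co/Qwen/Qwen2.5-3B/resolve/main/config.json | jq '.architectures'
```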
Problem: You cannot train on Q4-quantized Ollama models.
Why: Q4 quantization is lossy; the precision discarded during quantization cannot be recovered, so quantized weights are unsuitable for further training.
What LlamaForge Does:
- Detects your Ollama model (e.g., `qwen2.5:3b`)
- Maps it to the base HuggingFace model (`Qwen/Qwen2.5-3B`)
- Downloads FP16 weights if not cached
- Trains on the FP16 weights
- Exports to GGUF for Ollama
First-time Download:
- 3B model: ~6-8 GB download
- 7B model: ~14-16 GB download
- Subsequent runs use cached version (instant, no download)
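To see whether the FP16 weights are already cached (and how much disk they use), a minimal sketch assuming the default HuggingFace cache location:

```bash
du -sh ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-3B 2>/dev/null || echo "not cached yet"
```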
See TECHNICAL_REALITY.md for a detailed explanation.
Current State:
- ❌ Workers must be set up manually
- ❌ No automatic deployment to remote nodes
- ❌ Must copy dataset to each node
- ❌ Must SSH into each node to launch training
Requirements:
- ✅ Ollama running on all nodes (for discovery)
- ✅ PyTorch installed on all nodes
- ✅ SSH access to remote nodes
- ✅ Dataset accessible on all nodes
- ✅ Port 29500 open for PyTorch DDP
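A hedged per-node sanity check before launching, based on the requirements above (substitute the coordinator's address for MASTER_ADDR; `nc` is netcat):

```bash
ollama list                                          # Ollama reachable (used for discovery)
python -c "import torch; print(torch.__version__)"   # PyTorch installed
nc -zv MASTER_ADDR 29500                             # DDP rendezvous port reachable
ls examples/datasets/alpaca_1k.jsonl                 # dataset present on this node
```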
Planned: v0.2.0 will add automatic worker deployment.
Problem: If one node fails mid-training, entire job fails.
Current Behavior:
- Node crashes → Training stops
- Network partition → Training hangs
- No automatic recovery
Workarounds:
- Run training in `tmux` or `screen` sessions (a minimal `tmux` sketch follows below)
- Use a `systemd` service for persistence (see SYSTEMD_SERVICE_SETUP.md)
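A minimal `tmux` sketch, reusing the example command from earlier in this document:

```bash
# Start training in a detached session so a dropped SSH connection doesn't kill it
tmux new-session -d -s llamaforge \
  'python llamaforge.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --data examples/datasets/alpaca_1k.jsonl'
tmux attach -t llamaforge    # reattach later to check progress
```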
Planned: v0.2.0 will add checkpoint recovery.
Current State:
- ❌ No unit tests
- ❌ No integration tests
- ❌ No CI/CD pipeline
- ❌ Manual testing only
Risks:
- Changes may break existing functionality
- No quality gates before merge
- No automated regression testing
Test Directories Exist But:
- `test_8bit/` - Manual test runs (not automated)
- `test_work/` - Manual test output (not a test suite)
Planned: v0.2.0 will add basic test suite and GitHub Actions CI.
Current State:
- ❌ No coverage reports
- ❌ Unknown which code paths are tested
- ❌ No coverage trends over time
Planned: v0.2.0 will add pytest-cov and codecov integration.
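For reference, a hedged sketch of what that would look like locally, assuming the code is importable as a `llamaforge` package and pytest plus pytest-cov are installed:

```bash
pytest --cov=llamaforge --cov-report=term-missing
```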
Focus: Stability & Testing
- Add pytest test suite (target: 60% coverage)
- Add GitHub Actions CI/CD
- Streaming merge to reduce memory usage
- Checkpoint recovery for distributed training
- Better error messages and validation
- Automatic worker deployment
Focus: Features & Scale
- Support for larger models (13B+)
- QLoRA support (4-bit training)
- Multi-GPU training on single node
- Web UI for training monitoring
- Prometheus metrics export
- Better dataset validation
Focus: Production Readiness
- Fault tolerance and recovery
- Production-grade distributed training
- Model registry integration
- Experiment tracking (W&B, MLflow)
- 90%+ test coverage
- Performance benchmarks
Found an issue not listed here? Please report it: https://github.com/B-A-M-N/LlamaForge/issues
Want to help fix these issues?
- Check issues labeled `good-first-issue`
- Read CONTRIBUTING.md (coming soon)
Is LlamaForge production-ready?
For hobbyists / researchers: Yes, with caveats. Great for learning LoRA and distributed training.
For small teams (<5 people): Maybe. Test thoroughly first. Have 16GB+ RAM per training node.
For enterprises / production: Not yet. Wait for v0.3.0+ or contribute fixes.
Why use it anyway?
- ✅ Learning tool: Excellent for understanding LoRA, distributed training, GGUF
- ✅ Ollama integration: Seamless workflow from training to deployment
- ✅ SOLLOL distributed: Unique integration for multi-node training
- ✅ Honest documentation: Clear about what works and what doesn't
Why wait?
- ⚠️ Memory issues: OOM possible on <24GB RAM systems
- ⚠️ No tests: Limited quality assurance
- ⚠️ Manual setup: Distributed training requires manual node configuration
- ⚠️ Alpha software: Expect rough edges
Bottom line: LlamaForge is a powerful learning tool for distributed LoRA fine-tuning. It works well if you understand the limitations and have adequate hardware. Not production-ready, but valuable for research and experimentation.