Your training corpus combines:
- 6.6M-example base corpus (`FINAL_CORPUS_7M.jsonl`)
- 307k gap-spanning examples (reasoning, code, factual, dialog, DPO)
- 1.2M specialized examples (structured output, short-context, tool/API, CoT compression, preferences)
- 640k dark-themed real-world examples (psychology, philosophy, adversarial, narrative, humor)
After deduplication: 7,053,867 unique examples (34.26% dedup rate)
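Exact-duplicate removal of this kind can be sketched with a content-hash pass over the merged examples; `merge_all_final_datasets.py` is the project's actual implementation, so treat this as illustrative only:

```python
import hashlib
import json

def dedup_examples(examples):
    """Drop exact duplicates by hashing each example's canonical JSON form."""
    seen = set()
    unique = []
    for ex in examples:
        # Sort keys so dict key order doesn't affect the hash.
        digest = hashlib.sha256(
            json.dumps(ex, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ex)
    return unique
```

Hashing the canonical JSON keeps memory bounded to one digest per unique example, which matters at the 8M-example scale.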
```bash
python3 merge_all_final_datasets.py
```

Created:
- `data/FINAL_CORPUS_8M.jsonl` - 7,053,867 unique examples
- `data/FINAL_MANIFEST_8M.json` - Statistics and metadata
```bash
python3 split_train_val.py
```

Created:
- `data/train.jsonl` (6,701,173 examples, 95%)
- `data/val.jsonl` (352,694 examples, 5%)
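A deterministic 95/5 split like the one `split_train_val.py` produces can be sketched as follows (the seed and function name are illustrative, not taken from the script):

```python
import random

def split_train_val(examples, val_fraction=0.05, seed=42):
    """Shuffle deterministically, then carve off the validation tail."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    # Tail of the shuffle becomes validation; the rest is training data.
    return shuffled[n_val:], shuffled[:n_val]
```

Fixing the seed makes the split reproducible across runs, so train and validation never leak into each other.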
10% Sample (recommended for thorough validation):
- `data/samples/train_10pct.jsonl` (670,000 examples)
- `data/samples/val_10pct.jsonl` (35,000 examples)
1% Sample (for ultra-fast validation):
- `data/train_sample.jsonl` (~67k examples)
- `data/val_sample.jsonl` (~3.5k examples)
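Per-line random sampling is one simple way to produce such fractions; a minimal sketch (not the project's `create_test_sample.py`):

```python
import random

def sample_fraction(lines, fraction=0.01, seed=42):
    """Keep roughly `fraction` of lines, chosen independently per line."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]
```

Because each line is kept independently, the sample size is only approximately `fraction * n`, which is fine for a sanity-check subset.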
Runtime: 3-4 hours
Purpose: Thorough validation before full training
Dataset: 670k train + 35k val examples

```bash
./scripts/launch_sample.sh
```

What to verify:
- Loss decreases steadily over first 5k-10k steps
- GPU memory stays ~14-15GB (not exceeding 16GB)
- Throughput stays in the typical ~25-40 tokens/sec range
- Validation loss plateaus by roughly 80% of the way through the epoch
- Checkpoints save correctly every 500 steps
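The "loss decreases steadily" check can be automated with a simple moving-average comparison; a sketch (the window size and threshold are illustrative, not values from the training config):

```python
def loss_is_decreasing(losses, window=100, min_drop=0.01):
    """Compare the mean of the first and last `window` recorded losses."""
    if len(losses) < 2 * window:
        return False  # not enough data to judge a trend yet
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return head - tail >= min_drop
```

Averaging over a window smooths out per-step noise, so a single spiky batch won't trigger a false alarm.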
Runtime: 30-45 minutes
Purpose: Quick sanity check for CUDA/setup issues
Dataset: 67k train + 3.5k val examples

```bash
./scripts/launch_test.sh
```

What to monitor:
- No CUDA OOM errors
- Data loads correctly
- LoRA adapters initialize and save
Runtime: 5-7 days on A4000
Purpose: Final production training
Dataset: 6.7M train + 352k val examples

```bash
./scripts/launch_train.sh
```

Requirements:
- GPU: RTX A4000 (16GB VRAM)
- CUDA: 12.1+
- Storage: ~50GB for dataset + checkpoints
QLoRA Settings:
- 4-bit quantization (NF4)
- LoRA rank: 64
- LoRA alpha: 32
- Dropout: 0.05
Optimization:
- Batch size: 1 (with gradient accumulation x16)
- Learning rate: 2e-5 (cosine schedule)
- Warmup: 3% of steps
- FP16 mixed precision
Memory Usage: ~14-15GB VRAM
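The optimization settings above give an effective batch size of 16 (batch size 1 with 16-step gradient accumulation), and the warmup-plus-cosine learning-rate schedule can be sketched in plain Python (function name and defaults are illustrative; the trainer's own scheduler is authoritative):

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.03):
    """Linear warmup over the first warmup_frac of steps, then cosine decay to ~0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from peak_lr/warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Warmup avoids large early updates while the LoRA adapters are still random; cosine decay then anneals the rate smoothly toward zero.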
```bash
python3 llamaforge.py merge \
  --base Qwen/Qwen2.5-Coder-7B-Instruct \
  --lora outputs/leviathan_local_run \
  --output outputs/Leviathan-v1.1
```

Chat with the merged model:

```bash
python3 llamaforge.py chat --model outputs/Leviathan-v1.1
```

```bash
# GGUF for llama.cpp
python3 llamaforge.py quantize \
  --model outputs/Leviathan-v1.1 \
  --format gguf \
  --quant Q4_K_M
```
```bash
# GPTQ for vLLM
python3 llamaforge.py quantize \
  --model outputs/Leviathan-v1.1 \
  --format gptq \
  --bits 4
```

After fine-tuning Qwen2.5-Coder-7B on this corpus, Leviathan will have:
- **Claude 3.5-Level Reasoning** (15.1% CoT data)
  - Multi-step logic, complex inference
  - Philosophical arguments
  - Mathematical proofs
- **ChatGPT-4o Efficiency** (29.8% instructions)
  - Fast, concise responses
  - Strong instruction following
  - Minimal verbosity
- **Deep Applications** (16.6% code+tools+SQL)
  - Enterprise tool use
  - API integration
  - Database operations
- **Multilang Coding** (6.5% code)
  - Python, Java, C++, JS, Go, Rust
  - Debugging & code review
  - SQL mastery
- **Esoteric Knowledge** (12.5%)
  - Occult, mysticism, symbolism
  - Tarot, astrology, alchemy
- **Dark Psychology** (2.9%)
  - Manipulation awareness
  - Defense mechanisms
  - Trauma-informed responses
- **Dark Philosophy** (5.5%)
  - Existentialism, nihilism
  - Moral relativism
  - Ethical edge cases
- **Dark Protector Archetype** (6.7%)
  - Boundary enforcement
  - Harm prevention
  - Protective responses
- **Unfiltered Nature** (17.7% dark domains)
  - Direct, honest responses
  - Comfortable with taboo subjects
  - Safety without pearl-clutching
```bash
tensorboard --logdir outputs/leviathan_local_run --port 6006
```

Navigate to: http://localhost:6006
- Train Loss: Should steadily decrease
- Val Loss: Should track train loss without diverging
- Learning Rate: Follows cosine decay
- Gradient Norm: Should stabilize (not explode)
Saved every 1000 steps in:
outputs/leviathan_local_run/
├── checkpoint-1000/
├── checkpoint-2000/
├── checkpoint-3000/
└── ...
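To resume or evaluate from the most recent save, the highest-numbered `checkpoint-NNNN` directory can be located with a small helper (a sketch; the trainer's own resume logic is authoritative):

```python
import re

def latest_checkpoint(dirnames):
    """Return the highest-numbered checkpoint-NNNN name, or None if absent."""
    best, best_step = None, -1
    for name in dirnames:
        m = re.fullmatch(r"checkpoint-(\d+)", name)
        if m and int(m.group(1)) > best_step:
            best_step = int(m.group(1))
            best = name
    return best
```

Parsing the step number rather than sorting lexically avoids `checkpoint-9000` outranking `checkpoint-10000`.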
- Reduce `cutoff_len` from 4096 to 2048
- Reduce `gradient_accumulation_steps` from 16 to 8
- Enable more aggressive gradient checkpointing
- Check GPU utilization with `nvidia-smi`
- Verify data loading: check logs for I/O bottlenecks
- Reduce `num_workers` if CPU usage is high
- Verify learning rate schedule
- Check for NaN gradients in logs
- Ensure data preprocessing is correct
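The NaN-gradient check can be automated with a log scan; this sketch assumes log lines of the form `step 120 loss 1.83 grad_norm nan`, which is an illustrative format, not necessarily the trainer's actual one:

```python
import math
import re

def find_nan_gradients(log_lines):
    """Return 1-based line numbers whose grad_norm is NaN, infinite, or unparseable."""
    bad = []
    for i, line in enumerate(log_lines, start=1):
        m = re.search(r"grad_norm\s+(\S+)", line)
        if not m:
            continue  # line doesn't report a gradient norm
        try:
            value = float(m.group(1))
        except ValueError:
            bad.append(i)
            continue
        if math.isnan(value) or math.isinf(value):
            bad.append(i)
    return bad
```

Catching `inf` as well as `nan` matters: exploding gradients usually show up as `inf` a few steps before the loss itself goes NaN.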
- **Evaluate on benchmarks**
  - HumanEval (code)
  - MMLU (general knowledge)
  - GSM8K (math reasoning)
- **Test dark-domain capabilities**
  - Philosophical reasoning
  - Psychological depth
  - Esoteric knowledge
- **Production deployment**
  - Quantize to GGUF or GPTQ
  - Deploy with vLLM or llama.cpp
  - Set up an API endpoint
- **Fine-tune further (optional)**
  - DPO for alignment
  - Additional domain-specific data
  - Longer context (up to 32k tokens)
LlamaForge/
│
├── data/
│ ├── FINAL_CORPUS_8M.jsonl # Full merged corpus
│ ├── FINAL_MANIFEST_8M.json # Statistics
│ ├── train.jsonl # 95% training split
│ ├── val.jsonl # 5% validation split
│ ├── train_sample.jsonl # 1% test sample
│ └── val_sample.jsonl # 1% test sample
│
├── configs/
│ ├── config_leviathan_local.yaml # Full training config
│ └── config_leviathan_test.yaml # Test config (1% sample)
│
├── scripts/
│ ├── launch_train.sh # Full training launcher
│ └── launch_test.sh # Test training launcher
│
├── outputs/
│ ├── leviathan_test_run/ # Test checkpoints
│ └── leviathan_local_run/ # Full checkpoints
│
├── logs/ # Training logs
│
├── merge_all_final_datasets.py # Merge script
├── split_train_val.py # Train/val split
├── create_test_sample.py # 1% sampling
└── llamaforge.py # Main training script
Check the logs in `logs/` or reach out to the community.
Good luck with training! 🚀