Train models that reason like Claude:
- Chain-of-Thought (CoT): Step-by-step reasoning for multi-step problems
- Tree-of-Thought (ToT): Exploring multiple approaches with pros/cons
- Meta-Cognitive: Knowing WHEN to use each strategy
- Measurable: Test improvements with benchmarks
# Download all reasoning datasets (~800K examples)
python download_cot_tot_datasets.py \
--download-all \
--output examples/datasets/
# This downloads:
# - GSM8K with CoT (7.5K math problems)
# - CoT Collection (100K NLP tasks)
# - Orca Math (200K math with reasoning)
# - MetaMath (395K multi-strategy math)
# - WizardLM Evol (196K complex instructions)# Merge Claude dataset + CoT/ToT datasets
python download_cot_tot_datasets.py \
--merge \
--base examples/datasets/claude_ultimate_508k.jsonl \
--output examples/datasets/claude_with_reasoning_mega.jsonl
# Result: ~1.3M examples with reasoning!# Standard benchmarks
python evaluate_model.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--preset standard \
--output ./eval/baseline_standard
# Reasoning-specific evaluation
python evaluate_reasoning.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--output ./eval/baseline_reasoningpython llamaforge.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--data examples/datasets/claude_with_reasoning_mega.jsonl \
--epochs 1 \
--batch-size 1 \
--max-length 512 \
--gradient-accumulation 4 \
--lora-r 32 \
--lora-alpha 64 \
--work-dir ./trained_reasoning_modelTraining time estimate:
- CPU: ~48-72 hours for 1.3M examples
- GPU (T4): ~12-18 hours
- Use SOLLOL distributed training to speed up (see below)
# Standard benchmarks (compare to baseline)
python evaluate_model.py \
--model ./trained_reasoning_model/merged_model \
--preset standard \
--output ./eval/after_standard \
--compare ./eval/baseline_standard/results_*.json
# Reasoning-specific evaluation
python evaluate_reasoning.py \
--model ./trained_reasoning_model/merged_model \
--output ./eval/after_reasoningExpected improvements:
STANDARD BENCHMARKS:
GSM8K (math): 12% → 35% ↑ +192% 🔥
HellaSwag: 45% → 62% ↑ +38% ✅
ARC-Challenge: 28% → 41% ↑ +46% ✅
REASONING BENCHMARKS:
CoT Score: 15% → 78% ↑ +420% 🔥🔥
ToT Score: 8% → 65% ↑ +713% 🔥🔥
Meta-Cognitive: 5% → 72% ↑ +1340% 🔥🔥🔥
Teaches: Math word problems with step-by-step solutions
Example:
{
"instruction": "Janet's ducks lay 16 eggs per day. She eats 3 for breakfast. She uses 4 to make muffins. She sells the rest at the farmers' market for $2 per egg. How much money does she make per day?",
"output": "<thinking>\nStep 1: Calculate total eggs laid per day\n- Janet's ducks lay 16 eggs per day\n\nStep 2: Calculate eggs consumed\n- Eats for breakfast: 3 eggs\n- Uses for muffins: 4 eggs\n- Total consumed: 3 + 4 = 7 eggs\n\nStep 3: Calculate eggs sold\n- Eggs left to sell: 16 - 7 = 9 eggs\n\nStep 4: Calculate money earned\n- Price per egg: $2\n- Total earned: 9 × $2 = $18\n</thinking>\n\nJanet makes $18 per day at the farmers' market."
}Teaches: Reasoning across diverse NLP tasks (QA, NLI, etc.)
Teaches: Mathematical reasoning with detailed explanations
Teaches: Multiple solution strategies for the same problem
Example:
{
"instruction": "Solve: 2x + 5 = 15",
"output": "<thinking>\nMethod 1: Subtraction first\n- 2x + 5 = 15\n- Subtract 5 from both sides: 2x = 10\n- Divide by 2: x = 5\n\nMethod 2: Verification by substitution\n- If x = 5, then 2(5) + 5 = 10 + 5 = 15 ✓\n\nMethod 3: Visual understanding\n- We need a number that, when doubled and added to 5, gives 15\n- 15 - 5 = 10, and 10 ÷ 2 = 5\n</thinking>\n\nThe answer is x = 5."
}Teaches: Complex, evolved instructions with reasoning
If you want to enhance your EXISTING datasets with CoT/ToT:
# Augment Claude dataset (adds <thinking> tags to 30% of examples)
python create_cot_tot_dataset.py \
--input examples/datasets/claude_ultimate_508k.jsonl \
--output examples/datasets/claude_cot_augmented.jsonl \
--augmentation-rate 0.3This automatically:
- Detects questions that need reasoning (math, debugging, comparison)
- Wraps answers in
<thinking>tags with step-by-step structure - Adds meta-cognitive examples (when to use CoT vs ToT)
1.3M examples takes forever on one machine. Use SOLLOL:
# Start SOLLOL cluster
cd ~/SOLLOL
python -m sollol.server --host 0.0.0.0 --port 8765
# On worker nodes
python -m sollol.worker --coordinator <your-ip>:8765
# Launch distributed training
cd ~/LlamaForge
python launch_distributed_training.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--dataset examples/datasets/claude_with_reasoning_mega.jsonl \
--workers 4 \
--mode data_parallel \
--epochs 1 \
--batch-size 2
# Training time: 12-18 hours → 3-4 hours with 4 workers!GSM8K: Grade School Math with 8K problems
- Before training: 5-15% (mostly guessing)
- After CoT training: 30-50% (actual reasoning)
- What it measures: Multi-step math reasoning
HellaSwag: Common sense completion
- Before: 45%
- After: 60-65%
- What it measures: Commonsense reasoning
ARC-Challenge: Science questions
- Before: 25-30%
- After: 40-50%
- What it measures: Scientific reasoning
CoT Score: Does model use chain-of-thought appropriately?
- Tests:
- ✓ Uses CoT for multi-step problems
- ✓ Skips CoT for simple questions
- ✓ Shows correct steps
- Before training: 10-20%
- After training: 70-85%
ToT Score: Does model compare alternatives?
- Tests:
- ✓ Explores multiple approaches
- ✓ Discusses trade-offs
- ✓ Makes informed choice
- Before training: 5-15%
- After training: 60-75%
Meta-Cognitive Score: Does model know WHEN to reason?
- Tests:
- ✓ Explains when to use CoT
- ✓ Explains when to use ToT
- ✓ Recognizes appropriate situations
- Before training: 0-10%
- After training: 65-80%
ollama run your-trained-model
# Test 1: Should use CoT
>>> If I have 5 apples and buy 3 more, then give away 2, how many do I have?
# Expected: Step-by-step reasoning
# <thinking>
# Step 1: Start with 5 apples
# Step 2: Buy 3 more: 5 + 3 = 8
# Step 3: Give away 2: 8 - 2 = 6
# </thinking>
# You have 6 apples.
# Test 2: Should NOT use excessive CoT
>>> What is 2 + 2?
# Expected: Direct answer
# 4
# Test 3: Should use ToT
>>> What's the best database for a social media app?
# Expected: Multiple approaches discussed
# <thinking>
# Approach 1: PostgreSQL
# - Pros: ACID, relational...
# - Cons: Scaling...
# Approach 2: MongoDB
# - Pros: Flexible...
# - Cons: Consistency...
# Best choice: Depends on...
# </thinking># Full reasoning evaluation suite
python evaluate_reasoning.py \
--model your-trained-model \
--output ./reasoning_eval_results/
# Check results
cat ./reasoning_eval_results/reasoning_eval_results.jsonpython download_cot_tot_datasets.py --download gsm8k_cot cot_collection --output examples/datasets/
# ~100K examples, focused on math + NLP reasoningpython download_cot_tot_datasets.py --download gsm8k_cot cot_collection orca_math_cot --output examples/datasets/
# ~300K examples, strong math + reasoningpython download_cot_tot_datasets.py --download-all --output examples/datasets/
# Merge with claude_ultimate_508k.jsonl
# ~1.3M examples, maximum capability# 1. Download reasoning datasets
python download_cot_tot_datasets.py --download-all --output examples/datasets/
# 2. Merge with Claude data
python download_cot_tot_datasets.py \
--merge \
--base examples/datasets/claude_ultimate_508k.jsonl \
--output examples/datasets/claude_reasoning_ultimate.jsonl
# 3. Baseline evaluation
python evaluate_model.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--preset standard \
--output ./eval/baseline
python evaluate_reasoning.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--output ./eval/baseline_reasoning
# 4. Train
python llamaforge.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--data examples/datasets/claude_reasoning_ultimate.jsonl \
--epochs 1 \
--batch-size 1 \
--gradient-accumulation 4 \
--max-length 512 \
--lora-r 32 \
--lora-alpha 64 \
--work-dir ./claude_reasoning_model
# 5. Post-training evaluation
python evaluate_model.py \
--model ./claude_reasoning_model/merged_model \
--preset comprehensive \
--output ./eval/after_training \
--compare ./eval/baseline/results_*.json
python evaluate_reasoning.py \
--model ./claude_reasoning_model/merged_model \
--output ./eval/after_reasoning
# 6. Convert to GGUF and deploy
python llamaforge.py --model ./claude_reasoning_model/merged_model --convert-only --quantization q4_k_m
ollama create my-reasoning-model -f ./claude_reasoning_model/Modelfile
ollama run my-reasoning-modelStandard Benchmarks:
- GSM8K: 8-12%
- HellaSwag: 42-48%
- ARC: 25-30%
Reasoning:
- CoT Score: 10-15%
- ToT Score: 5-10%
- Meta: 0-5%
Standard Benchmarks:
- GSM8K: 15-25%
- HellaSwag: 55-62%
- ARC: 35-42%
Reasoning:
- CoT Score: 30-40%
- ToT Score: 15-25%
- Meta: 10-20%
Standard Benchmarks:
- GSM8K: 35-50% 🔥
- HellaSwag: 62-70%
- ARC: 45-55%
Reasoning:
- CoT Score: 75-85% 🔥🔥
- ToT Score: 65-75% 🔥🔥
- Meta: 70-80% 🔥🔥
Cause: Too much CoT in training data
Fix: Lower --augmentation-rate to 0.2 or use original datasets with natural CoT distribution
Cause: Not enough CoT examples or model too small Fix:
- Increase CoT augmentation rate to 0.5
- Use larger base model (3B instead of 1B)
- Train longer (2 epochs)
Cause: Model memorizing format without understanding Fix:
- Use diverse reasoning datasets (MetaMath, Orca, etc.)
- Increase LoRA rank to 64
- Add more math-specific data
Your model is reasoning well when:
✅ Uses <thinking> tags for multi-step problems
✅ Shows correct intermediate steps
✅ Skips reasoning for trivial questions
✅ Compares alternatives when appropriate
✅ Can explain when to use CoT vs ToT
✅ GSM8K score > 35%
✅ CoT Score > 70%
✅ Meta-Cognitive Score > 65%
Now you have measurable, Claude-like reasoning! 🚀