Your RTX A4000 has significantly different specs than the A100 these configs were designed for:
| Specification | A100 80GB | RTX A4000 16GB | Difference |
|---|---|---|---|
| VRAM | 80 GB | 16 GB | 5x less |
| Compute | ~312 TFLOPS | ~19 TFLOPS | 16x slower |
| Memory BW | 2000 GB/s | 448 GB/s | 4.5x slower |
| Cost | $15-20k | $2-4k | Workstation vs datacenter |
Bottom Line: The A4000 is a great workstation GPU, but it's in a completely different class than datacenter A100s.
| GPU | Time | Cost (cloud) |
|---|---|---|
| A100 80GB | 4-6 hours | ~$10 |
| A4000 16GB | 24-36 hours | (your hardware) |
Your A4000: 1-1.5 days for 10% test
| GPU | Time | Cost (cloud) |
|---|---|---|
| A100 80GB | 48-72 hours | ~$80-120 |
| A4000 16GB | 10-15 DAYS | (your hardware) |
Your A4000: 240-360 hours = 10-15 days continuous running
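A quick back-of-envelope check of that range, scaling the A100 baseline by the roughly 5x wall-clock gap the two tables above imply (the factor is an assumption drawn from those tables, not a benchmark):

```bash
# Scale the A100 estimate (48-72 hrs) by an assumed ~5x A4000 slowdown.
a100_low=48; a100_high=72; slowdown=5
echo "A4000 estimate: $((a100_low * slowdown))-$((a100_high * slowdown)) hours"
echo "              ~= $((a100_low * slowdown / 24))-$((a100_high * slowdown / 24)) days"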
Best balance of cost and practicality:
# 1. Run 10% test on your A4000 (24-36 hrs, FREE)
./scripts/train_10pct_a4000.sh
# 2. Verify everything works
python3 scripts/test_inference.py \
--adapter work/training/leviathan_10pct_a4000/checkpoint-final
# 3. If test passes, rent cloud A100 for full training
# RunPod/Lambda: ~$1.50/hr × 60 hrs = $90 total

Pros:
- ✅ Validate setup on your hardware (free)
- ✅ Full training completes in 2-3 days (vs 2 weeks)
- ✅ Total cost: ~$90 (very reasonable)
- ✅ No babysitting a 2-week training run
Cons:
- Need to set up cloud instance
- Transfer 13GB dataset
For those with:
- Unlimited time/patience
- Reliable power
- Existing A4000 setup
- No cloud budget
# Expect 10-15 DAYS continuous running
./scripts/train_10pct_a4000.sh # First test this!
# If test works, then:
# (Need to create A4000 full training config)

Pros:
- ✅ Free (your hardware)
- ✅ Full control
Cons:
- ❌ 10-15 days continuous running
- ❌ Power costs (~$30-50 for 2 weeks @ 200W)
- ❌ Risk of interruption (power outage, crash)
- ❌ Extremely slow (0.3-0.5 steps/sec)
- ❌ Ties up your GPU for weeks
Rent A100 for both test and full training:
# On cloud A100 instance
./scripts/train_10pct_test.sh # 4-6 hrs
./scripts/train_full.sh # 48-72 hrs

Cost: ~$100-120 total
Pros:
- ✅ Fastest (3-4 days total)
- ✅ No local setup needed
- ✅ Reliable datacenter power
- ✅ Can use multiple GPUs
Cons:
- Higher upfront cost
- Need cloud account
| Provider | GPU | Price/hr | 10% Test | Full Training | Total |
|---|---|---|---|---|---|
| RunPod | A100 80GB | $1.50 | $8 | $90 | $98 |
| Lambda Labs | A100 80GB | $1.10 | $6 | $70 | $76 |
| Vast.ai | A100 80GB | $1.20 | $7 | $80 | $87 |
| AWS (on-demand) | A100 80GB | $4.00 | $20 | $240 | $260 |
| Azure | A100 80GB | $3.60 | $18 | $216 | $234 |
Best bang for buck: Lambda Labs (~$76 total)
Easiest setup: RunPod (~$98 total, better UX)
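The totals above are just hourly rate × GPU-hours. A rough sketch of that arithmetic, assuming ~65 A100 GPU-hours for the 10% test plus the full run and treating the listed rates as illustrative rather than current quotes:

```bash
# Rough total per provider: hourly rate x assumed GPU-hours (edit hours to taste).
hours=65   # assumed: 10% test + full training on one A100
for entry in "RunPod:1.50" "Lambda:1.10" "Vast.ai:1.20" "AWS:4.00" "Azure:3.60"; do
  provider=${entry%%:*}; rate=${entry##*:}
  awk -v p="$provider" -v r="$rate" -v h="$hours" 'BEGIN { printf "%-8s ~$%.0f\n", p, r*h }'
done
```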
I've created an A4000-specific config with aggressive memory optimizations:
# A100 Config → A4000 Config
sequence_len: 4096 → 2048 # HALVED to fit memory
micro_batch_size: 2 → 1 # MINIMUM
gradient_accumulation: 8 → 32 # COMPENSATE for smaller batch (~65K tokens per optimizer step, same as the A100 config)
lora_r: 64 → 32 # SMALLER adapter
lora_target_modules: 7 → 5 # FEWER modules
save_steps: 1000 → 2000 # LESS FREQUENT (save time)

| Phase | A100 Config | A4000 Config |
|---|---|---|
| Model Loading | 45 GB | 12 GB |
| Training Peak | 68 GB | 14-15 GB |
| Checkpointing | 72 GB | 15 GB |
Your A4000 should just barely fit with the optimized config.
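Before committing to a 24-36 hour run, it is worth confirming that headroom yourself. A minimal check using standard nvidia-smi query flags:

```bash
# Confirm the card, total VRAM, and how much other processes are already using.
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv
```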
Before starting 10% test on A4000:
- GPU has 16GB VRAM (check: `nvidia-smi`)
- PyTorch installed with CUDA support
- Axolotl installed: `pip install axolotl[flash-attn]`
- 100GB+ free disk space
- 10% sample dataset exists (1.27GB)
- Stable power for 24-36 hours
- Screen/tmux for persistent session
- Temperature monitoring (`nvidia-smi dmon -s p` reports power and temperature)
- Plan for interruptions (can resume from checkpoint)
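A minimal pre-flight sketch covering a few of the items above (the dataset filename is an assumption, adjust it to wherever your 10% sample actually lives):

```bash
# CUDA visibility, free disk, and the 10% sample (filename is assumed, adjust it).
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available())"
df -h .    # want 100GB+ free
ls -lh examples/datasets/*10pct*.jsonl 2>/dev/null || echo "10% sample not found (adjust path)"
```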
# Use persistent terminal session
screen -S leviathan-a4000
cd /home/joker/LlamaForge
# Launch A4000-optimized test
./scripts/train_10pct_a4000.sh
# Expected:
# - Duration: 24-36 hours
# - Speed: ~0.3-0.5 steps/sec (SLOW!)
# - Memory: ~14-15GB peak
# - Loss: Should decrease to <1.5
# Detach: Ctrl+A, D
# Reattach: screen -r leviathan-a4000

# Watch logs
tail -f work/training/leviathan_10pct_a4000/logs/training_*.log
# Monitor GPU (separate terminal)
watch -n 1 nvidia-smi
# Check training speed
grep "steps/sec" work/training/leviathan_10pct_a4000/logs/training_*.log | tailHour Steps Loss GPU Mem Status
----------------------------------------
0 0 2.8 14GB ✓ Starting
4 ~500 2.3 15GB ✓ Decreasing
12 ~1500 1.9 15GB ✓ On track
24 ~3000 1.5 15GB ✓ Near complete
36 DONE 1.3 - ✓ Success
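To compare your run against this table, you can pull recent loss values out of the logs the same way the steps/sec check above does. The exact log field depends on your Axolotl/Transformers version, so treat the pattern as a starting point:

```bash
# Pull the most recently reported loss values out of the training logs.
grep -oE "'loss': [0-9.]+" work/training/leviathan_10pct_a4000/logs/training_*.log | tail -n 5
```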
- Training completes without OOM
- Final loss < 1.5
- No error messages in logs
- Checkpoints save successfully
- GPU temp stays < 85°C (a small watchdog sketch follows this list)
- Inference tests pass (75%+)
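A small watchdog sketch for the temperature criterion (an illustration, not part of the repo's scripts):

```bash
# Warn once a minute if GPU 0 crosses 85°C; Ctrl+C to stop.
while sleep 60; do
  t=$(nvidia-smi -i 0 --query-gpu=temperature.gpu --format=csv,noheader,nounits)
  [ "$t" -ge 85 ] && echo "$(date '+%F %T') WARNING: GPU at ${t}°C"
done
```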
# In configs/leviathan_10pct_a4000.yaml, reduce:
sequence_len: 1024 # was 2048
lora_r: 16 # was 32
lora_target_modules: # Remove more modules
- q_proj
- v_proj
  - gate_proj

This is expected on A4000. Options:
- Accept it: 24-36 hours is just the reality
- Reduce dataset: Test on 5% instead of 10% (a quick subsetting sketch follows this list)
- Switch to cloud: Much faster
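If you go the smaller-dataset route, one hedged way to cut the 10% JSONL sample down further; this assumes one example per line and an illustrative filename, and the repo may already provide a proper sampling script:

```bash
# Randomly sample half of the 10% JSONL file (~5% of the corpus) into a new file,
# then point the dataset path in your config at it.
src=examples/datasets/leviathan_10pct_sample.jsonl   # assumed filename -- adjust
shuf -n $(( $(wc -l < "$src") / 2 )) "$src" > examples/datasets/leviathan_5pct_sample.jsonl
wc -l examples/datasets/leviathan_5pct_sample.jsonl
```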
# Check temperature
nvidia-smi dmon -s p   # 'p' reports power draw and temperature
# If >85°C:
# - Improve case cooling
# - Reduce power limit: sudo nvidia-smi -pl 150
# - Undervolt GPU (advanced)

# Resume from checkpoint
./scripts/train_10pct_a4000.sh
# Axolotl will auto-resume from latest checkpoint

# 1. Sign up at runpod.io
# 2. Add credits (~$100)
# 3. Deploy "PyTorch" template with A100 80GB
# 4. SSH into instance
# 5. Upload dataset & configs
# 6. Run training scripts

# 1. Sign up at lambdalabs.com
# 2. Request A100 instance
# 3. Launch instance
# 4. SSH and upload files
# 5. Run training

# From your machine
scp examples/datasets/FINAL_CORPUS_7M_PLUS_ESOTERIC.jsonl \
user@cloud-gpu:/workspace/
# Or use rsync for resume capability
rsync -avz --progress \
examples/datasets/FINAL_CORPUS_7M_PLUS_ESOTERIC.jsonl \
user@cloud-gpu:/workspace/

# Run 10% test on your A4000
./scripts/train_10pct_a4000.sh
# Verify it works
python3 scripts/test_inference.py \
--adapter work/training/leviathan_10pct_a4000/checkpoint-final

Outcome: Confirm corpus quality, config works, no bugs
# Upload to cloud A100
scp -r configs examples/datasets scripts user@cloud-gpu:/workspace/
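# Optional sanity check (not in the original scripts): verify the 13GB corpus
# survived the copy before burning GPU-hours. Adjust the remote path to wherever
# the file actually landed on the instance.
sha256sum examples/datasets/FINAL_CORPUS_7M_PLUS_ESOTERIC.jsonl
ssh user@cloud-gpu 'sha256sum /workspace/datasets/FINAL_CORPUS_7M_PLUS_ESOTERIC.jsonl'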
# SSH to cloud
ssh user@cloud-gpu
# Launch full training (A100 config, not A4000!)
cd /workspace
./scripts/train_full.sh

Outcome: Production model in 2-3 days
# Download merged model
scp -r user@cloud-gpu:/workspace/models/leviathan-8b-v1-merged ./
# Deploy locally
python -m vllm.entrypoints.openai.api_server \
--model models/leviathan-8b-v1-merged

Total Cost: ~$80-100 cloud + your time
Total Time: ~3-4 days (vs 2+ weeks on A4000)
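Once the vLLM server from the deploy step is running, a quick smoke test against its OpenAI-compatible API (assuming the default port 8000; the model name matches the --model path):

```bash
# Ask the served model for a short completion; expects the server on localhost:8000.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "models/leviathan-8b-v1-merged", "prompt": "Hello, Leviathan.", "max_tokens": 32}'
```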
| Factor | A4000 Local | Cloud A100 |
|---|---|---|
| Time | ❌ 10-15 days | ✅ 2-3 days |
| Cost | ✅ Free (power ~$40) | ❌ ~$80-100 |
| Reliability | ❌ Interruption risk | ✅ Datacenter stable |
| Speed | ❌ 0.3 steps/sec | ✅ 5 steps/sec |
| Convenience | ✅ Your hardware | ❌ Setup + data transfer |
| GPU Availability | ✅ Always | ❌ Must rent per run |
🏆 BEST APPROACH:
- Run 10% test on A4000 (validate setup)
- Rent cloud A100 for full training (fast + reliable)
- Total: ~$80-100, 3-4 days
If budget is absolutely zero:
- Run full training on A4000
- Expect 10-15 days
- Ensure stable power
- Use screen/tmux
- Monitor closely
./scripts/train_10pct_a4000.sh
./scripts/train_full.sh # Use standard config, not A4000

The A4000 is a capable GPU for inference and small-scale training, but fine-tuning an 8B model on 5.5M examples is pushing its limits.
For $80-100, you can rent an A100 and finish in 2-3 days vs 2+ weeks on A4000. That's often the better choice unless you have unlimited time and patience.
Whatever you choose: Start with the 10% test first. It validates everything before committing to the full run.
Good luck! 🚀