Commit 2614a9b

B-A-M-N and claude committed
feat: Complete Leviathan training pipeline with 7M dataset
## Dataset Infrastructure (7,053,867 examples)
- Merged corpus combining base + gap-spanning + specialized + dark-themed datasets
- 34.26% deduplication rate (removed 2.26M duplicates)
- Train/val split (95%/5%): 6.7M train, 352k val
- Test samples: 10% (670k) and 1% (67k) for validation runs

## Training Setup
- QLoRA configs for A4000 16GB VRAM (4-bit quantization, LoRA rank 64)
- Three training options:
  - 10% sample test (3-4 hours) - recommended first validation
  - 1% quick test (30-45 min) - ultra-fast sanity check
  - Full training (5-7 days) - production training on 7M examples
- Launch scripts for all three options
- TensorBoard monitoring support

## Dataset Categories
- Instruction: 23.2% (1.64M) - ChatGPT-4o efficiency
- Esoteric: 14.2% (1.0M) - occult, mysticism, symbolism
- Reasoning: 11.4% (804k) - Claude 3.5-level deep reasoning
- Dialog: 8.9% (626k) - conversational depth
- Code: 7.3% (518k) - multilang programming
- Factual: 7.4% (522k) - knowledge grounding
- Red Team: 6.1% (432k) - adversarial robustness
- Tool/API: 3.8% (265k) - enterprise integration
- Dark domains: 17.7% total (psychology, philosophy, protector archetype)

## Download Scripts
- Gap-spanning datasets (reasoning, code, factual, dialog, DPO)
- Specialized datasets (structured-output, tool/API, CoT compression)
- Dark-themed datasets (psychology, philosophy, adversarial, narrative)
- Merge and deduplication pipeline

## Documentation
- TRAINING_GUIDE.md - comprehensive training documentation
- Multiple expansion strategy docs
- Dataset composition and statistics

## Expected Capabilities
Target: Claude 3.5-level reasoning + ChatGPT-4o efficiency + dark domain expertise
- Deep multi-step reasoning and philosophical arguments
- Efficient, concise responses without over-verbosity
- Enterprise tool use and API integration
- Esoteric knowledge (tarot, astrology, alchemy, occult)
- Dark psychology (manipulation awareness, trauma-informed)
- Dark philosophy (existentialism, nihilism, moral relativism)
- Dark protector archetype (boundary enforcement, harm prevention)
- Unfiltered nature (direct, honest responses on controversial topics)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
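
The merge, deduplication, 95%/5% split, and 10%/1% sampling steps named in the commit message could look roughly like the sketch below. This is an illustration under stated assumptions, not the repository's actual pipeline: the function names (`dedupe`, `split_train_val`, `sample_fraction`), the exact-match criterion on normalized text, and the fixed seed are all hypothetical.

```python
import hashlib
import random


def dedupe(examples):
    """Drop duplicates, keeping the first occurrence.

    Assumption: two examples count as duplicates when they match after
    lowercasing and collapsing whitespace. The real pipeline may differ.
    """
    seen = set()
    unique = []
    for text in examples:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def split_train_val(examples, val_fraction=0.05, seed=0):
    """Shuffle a copy of the data and return (train, val)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def sample_fraction(examples, fraction, seed=0):
    """Draw a random subset, e.g. 0.10 or 0.01 of the training set."""
    k = int(len(examples) * fraction)
    return random.Random(seed).sample(examples, k)


if __name__ == "__main__":
    # Toy corpus: 100 rows, 80 of them distinct.
    raw = [f"example {i % 80}" for i in range(100)]
    unique = dedupe(raw)
    train, val = split_train_val(unique)
    quick = sample_fraction(train, 0.10)
    print(len(raw), len(unique), len(train), len(val), len(quick))
```

Hashing the normalized text keeps memory use bounded by the number of distinct examples rather than their total size, which matters at the multi-million-row scale the commit describes. Near-duplicate detection (for example MinHash) would be a different technique and is not shown here.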
1 parent 28d9e46 commit 2614a9b

77 files changed

Lines changed: 21679 additions & 2 deletions


.gitignore

Lines changed: 13 additions & 1 deletion
@@ -49,12 +49,24 @@ test_work/
 work_dir/
 *.log
 
-# Datasets (keep examples)
+# Datasets (exclude large files - use download scripts instead)
 testtraindata/*.jsonl
 testtraindata/*.csv
 testtraindata/*.txt
 !testtraindata/practical_coding.jsonl
 
+# Training data directories (exclude all large datasets)
+data/
+examples/datasets/*.jsonl
+examples/datasets/*/
+*.jsonl
+
+# Download logs
+*_download.log
+gap_download.log
+specialized_download.log
+dark_themed_download.log
+
 # Cache
 .cache/
 *.cache
