1 | | -# microgpt (optimized) |
| 1 | +# microgpt (optimized + CUDA) |
2 | 2 |
3 | 3 |  |
4 | 4 |
5 | | -Optimized version of [Karpathy's microgpt](https://karpathy.ai/microgpt.html), the most atomic way to train and inference a GPT in pure, dependency-free Python. |
| 5 | +A minimal GPT project with two aligned implementations: |
6 | 6 |
7 | | -**293 lines, 0 dependencies.** All optimizations preserve the original simplicity. |
| 7 | +- `microgpt.py`: pure Python, dependency-free reference implementation. |
| 8 | +- `microgpt_cuda.cu`: CUDA/C++ implementation for Windows (MSVC + CUDA), optimized for speed while keeping the same model/training logic. |
8 | 9 |
9 | | -## What's Changed |
| 10 | +## What this repo focuses on |
10 | 11 |
11 | | -| Optimization | Lines | Impact | |
12 | | -|---|---|---| |
13 | | -| Direct `__truediv__` implementation | +8 | ~20-30% fewer computation graph nodes per step | |
14 | | -| Fused `cross_entropy` (log-softmax + NLL) | +5 | Fewer nodes + better numerical stability | |
15 | | -| Iterative `backward()` topological sort | 0 | Eliminates recursion depth limit | |
16 | | -| `sum(losses[1:], losses[0])` | 0 | Removes phantom `Value(0)` node | |
17 | | -| Adam running product | +2 | Numerically stable bias correction at large step counts | |
18 | | -| `with open()` file handle | +1 | Proper resource cleanup | |
19 | | -| **Weight tying** (wte = lm_head) | -1 | Standard GPT-2 practice, fewer params | |
20 | | -| **Cosine LR schedule** | 0 | Smoother decay than linear | |
21 | | -| **Train/val split** (90/10) | +3 | Basic ML hygiene, detect overfitting | |
22 | | -| **Periodic validation** (every 100 steps) | +10 | Pure-float NLL eval on held-out docs | |
23 | | -| **Gradient clipping** (global norm) | +4 | Prevents exploding gradients, stabilizes training | |
24 | | -| **AdamW weight decay** | +1 | Decoupled regularization | |
25 | | -| **Top-k sampling** (k=5) | +4 | Higher quality inference, avoids garbage tokens | |
26 | | -| **Per-step timing** | +3 | Performance observability in ms/step | |
| 12 | +- Keep the project small and readable. |
| 13 | +- Preserve algorithmic parity between Python and CUDA paths. |
| 14 | +- Push performance through GPU residency and kernel fusion where it matters. |
27 | 15 |
28 | | -**Total: +50 lines** (243 -> 293), no new dependencies. |
| 16 | +Core model/training recipe (both paths): |
29 | 17 |
30 | | -## Files |
| 18 | +- Character tokenizer with `<BOS>`. |
| 19 | +- GPT-style block with RMSNorm, causal multi-head attention, and ReLU^2 MLP. |
| 20 | +- Weight tying (`wte` reused as LM head). |
| 21 | +- AdamW + cosine LR + global grad clipping. |
| 22 | +- Train/val split, periodic validation, top-k sampling inference. |
31 | 23 |
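The optimizer part of the recipe above (cosine LR plus global gradient clipping) can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from `microgpt.py`: the function names `cosine_lr` and `clip_global_norm`, the decay-to-zero floor, and the flat list of float gradients are all assumptions made for the example.

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr at step 0 down to 0 at total_steps (assumed floor)."""
    progress = min(step, total_steps) / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def clip_global_norm(grads, max_norm):
    """Rescale grads in place so their joint L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        for i in range(len(grads)):
            grads[i] *= scale
    return total
```

Because the norm is computed over all gradients at once, clipping preserves the direction of the update and only shrinks its length, which is what distinguishes global-norm clipping from per-value clipping.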
32 | | -- **`microgpt.py`** - Complete optimized Python algorithm (runnable) |
33 | | -- **`microgpt_cuda.cu`** - CUDA/C++ port with full train/val/inference loop |
34 | | -- **`microgpt_optimized.html`** - Syntax-highlighted 3-column view with change annotations |
35 | | -- **`CMakeLists.txt`** - CMake entrypoint for CUDA build |
| 24 | +## Repository layout |
36 | 25 |
37 | | -## Quick Start |
| 26 | +- `microgpt.py`: full Python algorithm (train + val + inference). |
| 27 | +- `microgpt_cuda.cu`: full CUDA/C++ algorithm (train + val + inference). |
| 28 | +- `microgpt_optimized.html`: side-by-side Python/CUDA code converter view. |
| 29 | +- `CMakeLists.txt`: CUDA build entry. |
| 30 | +- `input.txt`: corpus (auto-downloaded if missing on first run). |
| 31 | + |
| 32 | +## Quick start (Python) |
38 | 33 |
39 | 34 | ```bash |
40 | 35 | python microgpt.py |
41 | 36 | ``` |
42 | 37 |
43 | | -It auto-downloads `input.txt` on first run, trains for 500 steps with periodic validation, then generates samples via top-k sampling. |
| 38 | +If `input.txt` is missing, the script downloads the default names dataset automatically. |
| 39 | + |
| 40 | +## Quick start (CUDA / Windows) |
| 41 | + |
| 42 | +Prerequisites: |
44 | 43 |
45 | | -## CUDA Build |
| 44 | +- NVIDIA GPU + compatible driver |
| 45 | +- CUDA Toolkit (for example, CUDA 13.1)
| 46 | +- Visual Studio 2022 (MSVC, x64 toolchain) |
| 47 | +- CMake 3.24+ |
| 48 | + |
| 49 | +Build: |
46 | 50 |
47 | 51 | ```bash |
48 | | -cmake -S . -B build -G "Visual Studio 17 2022" -A x64 |
| 52 | +cmake -S . -B build -G "Visual Studio 17 2022" -A x64 -DCMAKE_CUDA_ARCHITECTURES=86 |
49 | 53 | cmake --build build --config Release |
| 54 | +``` |
| 55 | + |
| 56 | +Run: |
| 57 | + |
| 58 | +```bash |
| 59 | +.\build\Release\microgpt_cuda.exe --help |
50 | 60 | .\build\Release\microgpt_cuda.exe |
51 | 61 | ``` |
52 | 62 |
53 | | -For quick smoke tests: |
| 63 | +Smoke test: |
54 | 64 |
55 | 65 | ```bash |
56 | 66 | .\build\Release\microgpt_cuda.exe --steps 5 --samples 3 |
57 | 67 | ``` |
58 | 68 |
| 69 | +## CUDA CLI options |
| 70 | + |
| 71 | +- `--steps <int>`: training steps (default `500`) |
| 72 | +- `--val-every <int>`: validation interval (default `100`) |
| 73 | +- `--val-docs <int>`: max validation docs per eval (default `20`) |
| 74 | +- `--samples <int>`: generated samples after training (default `20`) |
| 75 | +- `--top-k <int>`: top-k for sampling (default `5`) |
| 76 | +- `--temperature <float>`: sampling temperature (default `0.6`) |
| 77 | +- `--seed <int>`: RNG seed (default `42`) |
| 78 | + |
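To make the `--top-k` and `--temperature` options concrete, here is a small Python sketch of temperature-scaled top-k sampling over a list of logits. It is an illustration under assumptions, not the sampler from either implementation: the name `sample_top_k` and its signature are hypothetical, and only the defaults (`k=5`, `temperature=0.6`) are taken from the option list above.

```python
import math
import random

def sample_top_k(logits, k=5, temperature=0.6, rng=random):
    """Sample one token index from the k highest-scoring logits."""
    # Indices of the k largest logits; everything else gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

With `k=1` this reduces to greedy decoding. Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, in the same spirit as the `--seed` option.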
| 79 | +## Important implementation notes |
| 80 | + |
| 81 | +- CUDA path keeps parameters, gradients, and optimizer states on GPU. |
| 82 | +- Training step is fused into one kernel launch (forward + backward + grad clip + AdamW update). |
| 83 | +- Current fused implementation is specialized to `n_layer = 1` (same as current Python config). |
| 84 | +- `kMaxVocab = 256` in `microgpt_cuda.cu`; if your dataset's vocabulary (unique characters plus `<BOS>`) exceeds this, increase it and rebuild.
| 85 | +- Default `CMAKE_CUDA_ARCHITECTURES` is `86` (compute capability 8.6); set it to match your GPU's compute capability if it differs.
| 86 | + |
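Before swapping in a custom `input.txt`, it can help to check whether its vocabulary fits under the `kMaxVocab` limit noted above. The sketch below assumes the tokenizer's vocabulary is the set of unique characters across all documents plus one `<BOS>` token; the helper names `vocab_size` and `fits_cuda_vocab` are hypothetical and do not exist in the repository.

```python
def vocab_size(docs, n_special=1):
    """Count unique characters across docs, plus special tokens (assumed: one <BOS>)."""
    chars = set()
    for doc in docs:
        chars.update(doc)
    return len(chars) + n_special

def fits_cuda_vocab(docs, k_max_vocab=256):
    """True if the vocabulary fits within the CUDA build's kMaxVocab limit."""
    return vocab_size(docs) <= k_max_vocab
```

To apply it to a real corpus, read `input.txt` with `encoding="utf-8"`, split it into non-empty stripped lines, and pass that list to `fits_cuda_vocab`. The one-document-per-line layout is also an assumption here.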
| 87 | +## Code converter page |
| 88 | + |
| 89 | +Open `microgpt_optimized.html` in a browser to switch between: |
| 90 | + |
| 91 | +- Python view |
| 92 | +- CUDA view |
| 93 | +- Bilingual side-by-side comparison |
| 94 | + |
| 95 | +This is useful for checking the one-to-one conceptual mapping between the two implementations.
| 96 | + |
59 | 97 | ## Credits |
60 | 98 |
61 | | -Original by [@karpathy](https://github.com/karpathy) - [microgpt](https://karpathy.ai/microgpt.html) | [Gist](https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95) |
| 99 | +Original microgpt idea and baseline by [@karpathy](https://github.com/karpathy): |
| 100 | + |
| 101 | +- https://karpathy.ai/microgpt.html |
| 102 | +- https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95 |