Corpus: nanoGPT, minGPT, micrograd (3 repos) + 5 research papers on attention/transformers + 4 images
Files: 29 Python files + 14 docs/READMEs + 5 PDFs + 4 images (total 52 files)
Words: ~92,616 · Tokens (naive full-context): ~123,488
Date: 2026-04-04
Extraction: AST (tree-sitter, deterministic) for code + Claude semantic for docs/papers/images
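For a sense of what the deterministic pass looks like, here is a minimal sketch of AST extraction with tree-sitter. It assumes the `tree_sitter` and `tree_sitter_python` packages; the real extractor's node/edge schema is elided.

```python
# Minimal sketch: pull class/function definitions out of a Python file
# with tree-sitter. Node-type names come from tree-sitter's Python grammar.
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY = Language(tspython.language())
parser = Parser()
parser.language = PY  # tree_sitter >= 0.22; older versions use set_language()

def extract_defs(path: str) -> list[tuple[str, str]]:
    """Return (kind, name) for every class/function definition in the file."""
    with open(path, "rb") as f:
        tree = parser.parse(f.read())
    defs, stack = [], [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type in ("class_definition", "function_definition"):
            name = node.child_by_field_name("name")
            defs.append((node.type, name.text.decode()))
        stack.extend(node.children)
    return defs
```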
Code-only run (3 repos, 29 Python files):

| Metric | Value |
|---|---|
| Corpus tokens (29 code files) | ~16,997 |
| Average query cost (BFS subgraph) | ~1,929 tokens |
| Reduction ratio | 8.8x |

Full mixed corpus (52 files, naive full-context baseline):

| Metric | Value |
|---|---|
| Corpus tokens (52 files, naive full-context) | ~123,488 |
| Average query cost (BFS subgraph) | ~1,726 tokens |
| Reduction ratio | 71.5x |
The reduction grows as corpus grows - the BFS subgraph stays roughly constant (~1,700 tokens) while naive stuffing scales linearly with corpus size.
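That constant is the whole trick: the query expands a BFS frontier from retrieved seed nodes until a token budget is exhausted, so cost is bounded by the budget rather than the corpus. A minimal sketch, assuming a networkx graph whose nodes carry a `tokens` attribute (the attribute name and `seeds` input are hypothetical):

```python
from collections import deque
import networkx as nx

def bfs_subgraph(g: nx.Graph, seeds: list[str], budget: int = 2000) -> nx.Graph:
    """Expand outward from seed nodes, stopping once the token budget is hit.
    The budget, not the corpus size, bounds the returned subgraph."""
    seen, spent = set(), 0
    queue = deque(seeds)
    while queue and spent < budget:
        node = queue.popleft()
        if node in seen:
            continue
        cost = g.nodes[node].get("tokens", 0)
        if spent + cost > budget:
            break
        seen.add(node)
        spent += cost
        queue.extend(n for n in g.neighbors(node) if n not in seen)
    return g.subgraph(seen)
```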
| Reduction | Question |
|---|---|
| 126.7x | what connects micrograd to nanoGPT |
| 100.8x | how does FlashAttention improve memory efficiency |
| 68.6x | what are the core abstractions |
| 68.6x | how are errors handled |
| 43.5x | how does the attention mechanism work |
The "attention mechanism" question returns a larger subgraph (2,836 tokens) because FlashAttention, CausalSelfAttention (nanoGPT), CausalSelfAttention (minGPT), and the AttnRes paper all connect to it. Still 43.5x cheaper than naive.
| Metric | Value |
|---|---|
| Nodes | 285 (163 AST + 112 semantic) |
| Edges | 340 after pruning (281 AST + 97 semantic extracted) |
| Communities | 53 (17 major + 36 isolates) |
| Community | Nodes | Name | What it found |
|---|---|---|---|
| 0 | 30 | nanoGPT Model Architecture | Block, forward(), dataclasses - transformer architecture |
| 1 | 24 | minGPT Training + Datasets | batch_end_callback, eval_split, get_config, CharDataset, chargpt |
| 2 | 23 | nanoGPT Training Pipeline | get_batch, bench.py, config files - data + training loop |
| 3 | 22 | nanoGPT Config + Data Prep | configurator, config scripts, data/openwebtext/prepare.py |
| 4 | 21 | micrograd NN Layer | Layer, __call__, __init__, MLP |
| 5 | 21 | FlashAttention Paper | IO-awareness, HBM/SRAM, recomputation, BERT/GPT-2 benchmarks |
| 6 | 17 | BPE Tokenizer | BPETokenizer, decode, bytes_to_unicode, full tokenization logic |
| 7 | 16 | micrograd Autograd Engine | Value, backward, __add__, __mul__ - the autograd core |
| 8 | 14 | Stdlib + Config Utilities | ast, json, CfgNode - supporting infrastructure |
| 9 | 13 | Addition Dataset | AdditionDataset, get_block_size, get_vocab_size |
| 10 | 12 | micrograd README + Backprop | README concepts, backprop explanation, computation graph |
| 11 | 7 | Attention Residuals Paper | Kimi model, pre-norm dilution, MMLU scaling |
| 12 | 6 | Continual LoRA Paper | CoLoR, catastrophic forgetting, ViT fine-tuning |
| 13 | 6 | minGPT Trainer Class | add_callback, run, set_callback |
| 14 | 5 | NeuralWalker Paper | SSM, graph expressivity, Pascal VOC results |
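These clusters come from Leiden community detection. A minimal sketch with `python-igraph` + `leidenalg` (the partition type and seed are illustrative, not necessarily what the pipeline uses):

```python
import igraph as ig
import leidenalg as la

def leiden_communities(edges: list[tuple[str, str]]) -> dict[str, int]:
    """Map each node name to a Leiden community id. Nodes with no edges
    never enter the graph here; in the pipeline they become isolates."""
    g = ig.Graph.TupleList(edges, directed=False)
    part = la.find_partition(g, la.ModularityVertexPartition, seed=42)
    return {v["name"]: part.membership[v.index] for v in g.vs}
```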
| Node | Edges | Why central |
|---|---|---|
| Value (micrograd) | 15 | The autograd primitive - everything math-related connects through it |
| Training Script (nanoGPT) | 11 | Orchestrates model + data + optimizer |
| GPT (nanoGPT) | 9 | Main model class - Block, attention, config all flow through here |
| Layer (micrograd nn) | 8 | The neural net abstraction - connects engine to high-level API |
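The hub ranking is plain degree counting; with networkx it is a one-liner (graph construction elided):

```python
import networkx as nx

def top_hubs(g: nx.Graph, k: int = 4) -> list[tuple[str, int]]:
    """Nodes ranked by edge count - the table above is exactly this."""
    return sorted(g.degree, key=lambda nd: nd[1], reverse=True)[:k]
```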
- micrograd split correctly into two communities - engine (Value + autograd) and nn (Layer + MLP) are separate communities, matching the intended architecture split in the repo.
- nanoGPT model vs training separation - communities 0 and 2 correctly separate model definition from training loop. Different concerns in different files; Leiden found the boundary.
- BPETokenizer isolated - `bpe.py` forms its own cluster, correctly identified as standalone rather than merged with model or trainer.
- Cross-repo connections found - the graph found that nanoGPT `Block` and minGPT `Block` share structural similarity (same class name, similar methods), creating a cross-repo INFERRED edge. This is genuine: both implement the same GPT block pattern.
- Paper → code connections - the FlashAttention paper cluster (Community 5) connects to `CausalSelfAttention` in both nanoGPT and minGPT. The NeuralWalker paper connects to graph structural concepts in micrograd.
- Images correctly identified - `gpt2_124M_loss.png` extracted as "val_loss=2.905 at step 399"; `gout.svg` recognized as a micrograd computation graph; `moon_mlp.png` as an MLP decision boundary. (A sketch of the semantic image pass follows this list.)
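The image findings come from the semantic pass. A minimal sketch of that call with the `anthropic` SDK - the model id and prompt here are placeholders, not what the pipeline actually pins:

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def describe_image(path: str, media_type: str = "image/png") -> str:
    """Ask Claude for a one-line semantic summary of a plot or diagram."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode()
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=200,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {"type": "base64",
                                         "media_type": media_type,
                                         "data": data}},
            {"type": "text",
             "text": "Summarize this image in one line for a knowledge graph node."},
        ]}],
    )
    return msg.content[0].text
```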
- Stdlib imports create 94 validation warnings - `setuptools`, `os`, `math`, `sys` emit "target does not match any node" warnings. The AST extractor emits import edges to stdlib names before the validator can prune them. These are discarded but inflate the edge count before pruning. (A pruning sketch follows this list.)
- Config-only files become isolates - `eval_gpt2.py`, `eval_gpt2_large.py`, etc. are config scripts with no functions; they land as single-node communities. Expected, but it adds ~36 trivial communities.
- 53 communities from 285 nodes - the isolate problem means ~36 of 53 communities are single nodes. The "17 major communities" number from the code-only run was cleaner. The isolate handling is correct but visually noisy.
- Papers not deep-linked to implementation - the FlashAttention paper cluster knows about the "3x GPT-2 speedup" claim, but the graph doesn't directly link it to the specific `CausalSelfAttention` implementation that would benefit. That would require `--mode deep` on the paper extraction pass.
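A sketch of the pruning fix for the first item, assuming edges are dicts with a `target` field (the field name is hypothetical). `sys.stdlib_module_names` needs Python >= 3.10:

```python
import sys

# Drop import edges whose target is a stdlib module before validation.
# setuptools is third-party, so a small extra denylist covers it.
STDLIB = set(sys.stdlib_module_names) | {"setuptools"}

def prune_stdlib_edges(edges: list[dict]) -> list[dict]:
    return [e for e in edges if e["target"].split(".")[0] not in STDLIB]
```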
Three connections stood out:

- `micrograd/engine.py::Value.backward()` → `minGPT/mingpt/trainer.py::Trainer.run()` - both implement the foundational forward/backward pattern at different scales. The graph surfaces this cross-repo connection without being asked.
- The FlashAttention paper (Community 5) bridges into `CausalSelfAttention` nodes in both nanoGPT and minGPT, creating the only paper→code cross-community edges in the graph.
- `nanoGPT/train.py` and `minGPT/mingpt/trainer.py` land in the same community (Community 2) despite living in different repos and never importing each other. Leiden found the structural similarity through shared vocabulary (optimizer, scheduler, gradient clipping).
71.5x token reduction on a 92k-word mixed corpus. The reduction grows with corpus size - on a 500k-word research library the same BFS subgraph would stay ~2k tokens while naive stuffing would hit ~670k tokens.
Graph quality: high for code structure, strong for paper-to-concept connections (semantic extraction found the FlashAttention → CausalSelfAttention bridge), weaker on direct paper-to-implementation links (those need `--mode deep` with explicit cross-file context).

The main cost is honest bookkeeping: 53 communities when 17 are real and 36 are isolates. This is correct behavior (isolates shouldn't be merged), but the visualization is noisy. A future `--min-community-size` flag would clean this up; a sketch of what it would do follows.
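A sketch of what `--min-community-size` would do, assuming a node → community-id map like the Leiden sketch above returns:

```python
from collections import Counter

def filter_small_communities(membership: dict[str, int],
                             min_size: int = 2) -> dict[str, int]:
    """Hide communities below min_size (the ~36 isolates) from the
    visualization without merging them into larger neighbors."""
    sizes = Counter(membership.values())
    return {n: c for n, c in membership.items() if sizes[c] >= min_size}
```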