This runs Karpathy's stories260K TinyStories transformer (260K params: dim 64,
5 layers, 8 heads / 4 KV heads, vocab 512) as a generated .sb3. It produces
coherent little stories:
Once upon a time, there was a little girl named Lily. She loved to play outside in the park. One day, she saw a big, red ball...
Open out/llm_generate.sb3 in TurboWarp and press the
green flag — the story variable fills in as it generates. (Use TurboWarp, not
vanilla Scratch: it compiles the project to JavaScript, which is what makes the
matrix math fast.)
The model isn't unrolled into millions of blocks. Instead the codegen emits a tiny interpreter:
- WF: float weights (embeddings/norms/RoPE).
- WQ/WS: the matmul weights as per-row int8 + scales (y[i] = scale[i]·Σ int8·x), which cuts the .sb3 ~3-5× with no change to the predicted token.
- M: working memory with a fixed region per activation vector (X, Q, K, V…).
- KC/VC: the KV cache, so each new token is O(context), not O(context²).
- ~10 generic run-without-screen-refresh blocks (mm, rms, rope, attn, swiglu, classifier, argmax, sample) do the work in runtime loops.
- A generated driver calls them with constant offsets, layer by layer.
So the block count is ~1.5k; the weights are the data, not the program.
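To make the "weights are data" idea concrete, here is a minimal pure-Python sketch of a generic mm routine that reads per-row int8 weights and scales from flat lists and writes into a flat working memory at fixed offsets. The list names mirror the ones above, but the function signatures, the offsets, and the toy matrix are assumptions for illustration, not the generator's actual code.

```python
# Sketch only: WQ/WS/M mirror the README's list names; offsets and values are made up.

def quantize_rows(w, rows, cols):
    """Per-row symmetric int8 quantization of a flat row-major matrix."""
    wq, ws = [], []
    for i in range(rows):
        row = w[i * cols:(i + 1) * cols]
        peak = max(abs(v) for v in row)
        scale = peak / 127.0 if peak > 0 else 1.0
        ws.append(scale)
        wq.extend(round(v / scale) for v in row)
    return wq, ws

def mm(M, WQ, WS, out_off, x_off, w_off, s_off, rows, cols):
    """y[i] = scale[i] * sum_j int8[i][j] * x[j], all on flat lists."""
    for i in range(rows):
        base = w_off + i * cols  # row base computed once per row
        acc = 0.0
        for j in range(cols):
            acc += WQ[base + j] * M[x_off + j]
        M[out_off + i] = WS[s_off + i] * acc

# Toy 2x3 weight matrix, flattened row by row.
w = [0.5, -1.0, 0.25,
     2.0, 0.0, -0.5]
WQ, WS = quantize_rows(w, rows=2, cols=3)

# Working memory: X lives at offset 0 (3 values), Y at offset 3 (2 values).
M = [1.0, 2.0, 3.0, 0.0, 0.0]
mm(M, WQ, WS, out_off=3, x_off=0, w_off=0, s_off=0, rows=2, cols=3)

# Float reference for comparison.
exact = [0.5 * 1 - 1.0 * 2 + 0.25 * 3, 2.0 * 1 + 0.0 * 2 - 0.5 * 3]
```

The quantized outputs in M[3:5] differ slightly from exact, but the larger of the two entries is the same, which is the property the int8 build relies on.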
python llm/gen_scratch_llm.py chat builds out/llm_chat.sb3 — a chat loop: an
ask-and-wait text box takes your line, a tokenizer encoder (built in Scratch)
turns it into tokens, they're primed onto the KV cache, and the reply streams into a
chat list (one line per turn — reads like a chat log).
- Context: each message gets a fresh context (your line, re-read with BOS, within the 256-token window). These tiny models (≤5M params) degenerate into gibberish if fed accumulated chat history, so a per-message reset keeps replies coherent and on the current input, at the cost of cross-message memory. (A bigger model could hold real multi-turn context; the KV cache already supports it.)
- Sampling: temperature (temp, default 0.4). Higher values (0.8+) make these small models ramble; 0.4 stays coherent with a little variety.
- The encoder does BPE merges numerically (on token-id pairs), so they're exact. Only the initial char→token step uses Scratch's string =, which is case-insensitive, so capitalization is approximate. The model still continues coherently (verified). Perfect tokenization would need a way to read character codes, e.g. a TurboWarp extension.
- With stories260K it "continues" your text (story model). The fine-tuned chat260K (below) turns the same box into a real chatbot.
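The numeric merge step can be sketched roughly as follows. The tiny character and merge tables are invented for illustration (the real ones come from tok512.bin), and the highest-score-first merge rule is an assumption based on how llama2.c-style tokenizers usually work. The lowercase lookup imitates the case-insensitive comparison described above.

```python
# Toy tables, invented for illustration only.
CHAR_TO_ID = {"h": 0, "i": 1, " ": 2}
# (left_id, right_id) -> (score, merged_id)
MERGES = {
    (0, 1): (1.0, 3),  # "h" + "i"  -> "hi"
    (2, 3): (2.0, 4),  # " " + "hi" -> " hi"
}

def encode(text):
    # Case-insensitive lookup: "H" and "h" map to the same id,
    # which is why capitalization is only approximate.
    ids = [CHAR_TO_ID[ch.lower()] for ch in text]
    while True:
        best = None  # (score, position, merged_id)
        for pos in range(len(ids) - 1):
            hit = MERGES.get((ids[pos], ids[pos + 1]))
            if hit and (best is None or hit[0] > best[0]):
                best = (hit[0], pos, hit[1])
        if best is None:
            return ids  # no mergeable pair left
        _, pos, merged = best
        ids[pos:pos + 2] = [merged]  # replace the pair with the merged id

tokens = encode(" Hi")
```

Because every merge is a lookup on a pair of integers, no string comparison is involved after the first step, so the merges themselves cannot be affected by case.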
train_chat.py fine-tunes stories260K into a chatbot:
python -m llm.train_chat [n_convos] [steps] # -> llm/chat260K.bin
It builds a PyTorch Llama, verifies it reproduces the NumPy reference (gate),
trains on a subset of TinyChat ([INST]..[/INST].. turns, the same dataset
CraftGPT used), and exports back to the .bin layout. python llm/gen_scratch_llm.py chat auto-picks the best model present (chathuge > chatbig > chat260K) and
wraps your input in the [INST] template.
Three trained models (all int8-quantized in the .sb3):
| build | params | loss | .sb3 | quality |
|---|---|---|---|---|
| train_chat | 260K (fine-tune) | ~1.2 | 0.7 MB | short, simple |
| train_chat big | ~1.2M (dim 128) | ~0.92 | 2.1 MB | good small-talk |
| train_chat huge | ~4.6M (dim 256) | ~0.79 | 7 MB | best: "how are you today" → "I am feeling very happy today, thank you for asking" |
These are tiny models trained only on TinyChat, so they chat about everyday things (feelings, weather, activities) but won't recall facts like your name.
Sampling: temperature (temp, default 0.4). Higher temps (0.8+) make these
small models ramble — 0.4 keeps them coherent with a little variety.
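For reference, temperature sampling over a logit vector can be sketched like this. It is a generic softmax sampler, not a transcription of the Scratch sample block; the function name, the rng parameter, and the treatment of temp <= 0 as greedy argmax are assumptions.

```python
import math
import random

def sample(logits, temp=0.4, rng=random):
    """Draw a token index from softmax(logits / temp)."""
    if temp <= 0:
        # Assumption: a non-positive temperature falls back to greedy argmax.
        return max(range(len(logits)), key=logits.__getitem__)
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp((v - peak) / temp) for v in logits]
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(logits) - 1  # guard against float round-off

token = sample([1.0, 2.0, 0.5], temp=0.4, rng=random.Random(0))
```

Dividing by a small temp sharpens the distribution toward the top logit, which is why 0.4 stays close to the greedy choice while 0.8+ spreads probability onto weaker candidates.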
These run in TurboWarp, not on scratch.mit.edu. Two hard reasons, both fundamental:
- Upload size. The weights live in a list, so the uncompressed project.json is 7-23 MB; scratch.mit.edu caps uploads around 5 MB and rejects it (HTTP 413).
- The 200,000-item list cap. scratch-vm refuses to build a list past 200,000 entries at runtime, and the models have 226K-4.4M weights. So you can't store the weights compressed and unpack them on load: the unpacked list just truncates. The weights must be the list's loaded-from-project.json initial value (loading isn't capped), which is exactly what blows past the upload limit. Catch-22.
So there's no way onto scratch.mit.edu for a real model. To publish: use the TurboWarp Packager to export a standalone HTML and host it (e.g. GitHub Pages) — no size limit, full speed.
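The truncation problem can be illustrated with a toy simulation. CappedList is a hypothetical stand-in for a Scratch list whose appends are silently dropped past the cap; it is not scratch-vm code, and the weight count is the smallest figure quoted above.

```python
LIST_CAP = 200_000  # runtime list limit cited above

class CappedList:
    """Toy stand-in for a Scratch list: appends past the cap are dropped."""

    def __init__(self, cap=LIST_CAP):
        self.cap = cap
        self.items = []

    def append(self, value):
        if len(self.items) < self.cap:
            self.items.append(value)

def unpack(n_weights):
    """Pretend to decompress n_weights values into a list at runtime."""
    out = CappedList()
    for _ in range(n_weights):
        out.append(0)
    return len(out.items)

smallest = 226_000      # smallest weight count quoted above
kept = unpack(smallest)
lost = smallest - kept  # weights that never make it into the list
```

Even the smallest model loses 26,000 weights this way, so unpack-on-load is not an option at any of the sizes listed.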
- download_model.py: pull stories260K.bin + tok512.bin from HuggingFace
- llama_ref.py: pure-NumPy reference forward pass (the spec) + greedy gen
- tokenizer_ref.py: reference BPE encoder + numeric merge/char tables
- gen_scratch_llm.py: codegen: python llm/gen_scratch_llm.py [verify|chat|chat_verify]
- verify_llm.js / compare.py: Scratch logits vs NumPy reference (argmax-exact)
- verify_enc.js / compare_enc.py: Scratch encoder == reference tokens
- run_generate.js / run_chat.js: stream a story / chat reply headless
gen_scratch_llm.py verify emits a one-forward-pass build; verify_llm.js runs
it in scratch-vm and compare.py checks it against the NumPy reference:
reference argmax: 403 -> 'Once'
scratch argmax: 403 -> 'Once'
argmax match: True
The full forward pass (matmul, RMSNorm, RoPE, grouped-query attention, SwiGLU, classifier) is correct: in pure float it matches the reference to ~1e-14, i.e. to float noise. With int8 weights the logits differ slightly but the predicted token is identical, which is what generation needs.
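The kind of check described here can be sketched in a few lines. This is not the repository's compare.py; the function names and the toy logits are assumptions, and the sketch only shows the two quantities that matter: whether the argmax agrees and how far the logits drift.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def compare_logits(ref, got):
    """Summarize agreement between reference and candidate logits."""
    if len(ref) != len(got):
        raise ValueError("logit vectors differ in length")
    return {
        "ref_argmax": argmax(ref),
        "got_argmax": argmax(got),
        "match": argmax(ref) == argmax(got),
        "max_abs_diff": max(abs(a - b) for a, b in zip(ref, got)),
    }

# Toy logits: the second vector imitates small quantization drift.
report = compare_logits([0.1, 2.5, -1.0], [0.12, 2.46, -0.97])
```

A float build would be expected to show a max_abs_diff near machine precision, while an int8 build only needs match to stay true.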
In the headless scratch-vm interpreter it runs at ~0.6 s/token. TurboWarp
compiles to JS, so it's far faster there (the hot loop is ~260K multiply-adds
per token). Levers already applied: KV cache, flat-list weights, warp blocks,
hoisted matmul row base, and int8-quantized weights (smaller load). Further
options: op fusion, a smaller-vocab/byte-level model, or a 1-bit model
(sign-bit weights → adds instead of multiplies).
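As a rough illustration of the 1-bit idea, the inner loop of a dot product with sign-only weights needs just additions and subtractions, plus one multiply per row for the scale. The binarization rule below (sign of the weight, with the mean absolute value as the row scale) is one common choice and an assumption here; nothing like it exists in the project yet.

```python
def binarize_row(row):
    """Reduce a weight row to sign bits plus one float scale."""
    signs = [w >= 0 for w in row]               # True means +1, False means -1
    scale = sum(abs(w) for w in row) / len(row)
    return signs, scale

def sign_dot(signs, x, scale):
    """Dot product with +/-1 weights: adds and subtracts only in the loop."""
    acc = 0.0
    for s, v in zip(signs, x):
        acc = acc + v if s else acc - v
    return scale * acc                          # single multiply per row

row = [0.5, -0.5, 0.5]
x = [1.0, 2.0, 3.0]
signs, scale = binarize_row(row)
y = sign_dot(signs, x, scale)
```

In this toy row every weight has the same magnitude, so the result equals the exact dot product; real weights would lose precision, and a model would need to be trained for 1-bit weights to stay usable.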