A real LLM running in Scratch

This runs Karpathy's stories260K TinyStories transformer (260K params: dim 64, 5 layers, 8 heads / 4 KV heads, vocab 512) as a generated .sb3. It produces coherent little stories:

Once upon a time, there was a little girl named Lily. She loved to play outside in the park. One day, she saw a big, red ball...

Open out/llm_generate.sb3 in TurboWarp and press the green flag — the story variable fills in as it generates. (Use TurboWarp, not vanilla Scratch: it compiles the project to JavaScript, which is what makes the matrix math fast.)

How it works

The model isn't unrolled into millions of blocks. Instead the codegen emits a tiny interpreter:

  • WF — float weights (embeddings/norms/RoPE). WQ/WS — the matmul weights as per-row int8 + scales (y[i] = scale[i]·Σ int8·x), which cuts the .sb3 ~3-5× with no change to the predicted token.
  • M — working memory with a fixed region per activation vector (X, Q, K, V…).
  • KC/VC — the KV cache, so each new token is O(context), not O(context²).
  • ~10 generic run-without-screen-refresh blocks (mm, rms, rope, attn, swiglu, classifier, argmax, sample) do the work in runtime loops.
  • A generated driver calls them with constant offsets, layer by layer.

So the block count is ~1.5k; the weights are the data, not the program.
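The per-row int8 scheme described above (y[i] = scale[i]·Σ int8·x) can be sketched in plain Python over flat lists, which is roughly how a Scratch list would hold the weights. This is a minimal illustration, not the codegen's actual output: the function names `quantize_rows` and `mm_int8`, and the symmetric max-abs scaling, are assumptions.

```python
def quantize_rows(w, rows, cols):
    """Per-row symmetric int8 quantization of a flat, row-major weight list.

    Returns (wq, ws): int8 values in a flat list and one float scale per row.
    The max-abs / 127 scaling rule is an assumption for this sketch.
    """
    wq, ws = [], []
    for i in range(rows):
        base = i * cols
        row = w[base:base + cols]
        peak = max(abs(v) for v in row)
        scale = peak / 127.0 if peak > 0 else 1.0
        ws.append(scale)
        wq.extend(round(v / scale) for v in row)
    return wq, ws


def mm_int8(wq, ws, x, rows, cols):
    """y[i] = ws[i] * sum_j wq[i*cols + j] * x[j]."""
    y = []
    for i in range(rows):
        base = i * cols  # row base computed once per row, not per element
        acc = 0.0
        for j in range(cols):
            acc += wq[base + j] * x[j]
        y.append(ws[i] * acc)
    return y


# Toy 2x3 matrix, compared against the plain float product.
w = [0.5, -1.0, 0.25, 2.0, 0.1, -0.3]
x = [1.0, 2.0, 3.0]
wq, ws = quantize_rows(w, 2, 3)
y_q = mm_int8(wq, ws, x, 2, 3)
y_f = [sum(w[i * 3 + j] * x[j] for j in range(3)) for i in range(2)]
print(y_q, y_f)
```

The quantized result differs slightly from the float one, which matches the text: the logits move a little, but the ranking of the largest entry is what matters for generation.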

Chat interface

python llm/gen_scratch_llm.py chat builds out/llm_chat.sb3 — a chat loop: an ask-and-wait text box takes your line, a tokenizer encoder (built in Scratch) turns it into tokens, they're primed onto the KV cache, and the reply streams into a chat list (one line per turn — reads like a chat log).

  • Context: each message gets a fresh context (your line, re-read with BOS, within the 256-token window). These tiny models (≤5M) degenerate into gibberish if fed accumulated chat history, so per-message reset keeps replies coherent and on the current input — at the cost of cross-message memory. (A bigger model could hold real multi-turn context; the KV cache already supports it.)

  • Sampling: temperature (temp, default 0.4). Higher (0.8+) makes these small models ramble; 0.4 stays coherent with a little variety.

  • Tokenizer: the encoder does BPE merges numerically (on token-id pairs), so they're exact. Only the initial char→token step uses Scratch's string =, which is case-insensitive, so capitalization is approximate. The model still continues coherently (verified). Exact tokenization would need a case-sensitive character-code lookup, e.g. from a TurboWarp extension.

  • With stories260K it "continues" your text (story model). The fine-tuned chat260K (below) turns the same box into a real chatbot.
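The encoder bullet above can be illustrated with a small sketch: a case-insensitive char→token lookup followed by merges on token-id pairs. The tables here are toy, hypothetical values (not the real tok512 vocabulary), and the "lowest priority number merges first" rule is an assumption; the real encoder may rank merges differently.

```python
# Hypothetical toy tables, for illustration only.
CHAR_TO_ID = {"a": 0, "b": 1, "c": 2}
# (left_id, right_id) -> (priority, merged_id); lower priority merges first.
MERGES = {(0, 1): (0, 3), (3, 2): (1, 4)}


def encode(text):
    """Encode text to token ids using numeric pair merges."""
    # Lower-casing mimics a case-insensitive string comparison:
    # "A" and "a" land on the same initial token.
    ids = [CHAR_TO_ID[ch.lower()] for ch in text]
    while True:
        best = None  # (priority, position, merged_id)
        for i in range(len(ids) - 1):
            hit = MERGES.get((ids[i], ids[i + 1]))
            if hit is not None and (best is None or hit[0] < best[0]):
                best = (hit[0], i, hit[1])
        if best is None:
            return ids  # no mergeable pair left
        _, i, merged = best
        ids[i:i + 2] = [merged]  # replace the pair with the merged token


print(encode("abc"), encode("ABC"), encode("cab"))
```

Once the initial ids exist, every later step is integer comparison only, which is why the merges are exact even though the first step loses case.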

Training the chat model

train_chat.py fine-tunes stories260K into a chatbot:

python -m llm.train_chat [n_convos] [steps]   # -> llm/chat260K.bin

It builds a PyTorch Llama, verifies that it reproduces the NumPy reference (a gate that must pass before training), trains on a subset of TinyChat ([INST]..[/INST].. turns, the same dataset CraftGPT used), and exports back to the .bin layout. python llm/gen_scratch_llm.py chat auto-picks the best model present (chathuge > chatbig > chat260K) and wraps your input in the [INST] template.
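The model auto-pick and prompt wrapping could look roughly like the sketch below. Only the preference order comes from the text; the function names, the bare model names used as keys, and the exact spacing inside the template are assumptions, not the generator's actual code.

```python
# Preference order from the text: chathuge > chatbig > chat260K.
PREFERENCE = ["chathuge", "chatbig", "chat260K"]


def pick_model(available):
    """Return the most preferred model name present in `available`, or None."""
    present = set(available)
    for name in PREFERENCE:
        if name in present:
            return name
    return None


def wrap_prompt(user_text):
    """Wrap one user line in an [INST] template (spacing is an assumption)."""
    return f"[INST] {user_text.strip()} [/INST]"


print(pick_model(["chat260K", "chatbig"]))
print(wrap_prompt("  how are you today "))
```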

Three trained models (all int8-quantized in the .sb3):

| build | params | loss | .sb3 | quality |
|---|---|---|---|---|
| train_chat | 260K (fine-tune) | ~1.2 | 0.7 MB | short, simple |
| train_chat big | ~1.2M (dim 128) | ~0.92 | 2.1 MB | good small-talk |
| train_chat huge | ~4.6M (dim 256) | ~0.79 | 7 MB | best: "how are you today" → "I am feeling very happy today, thank you for asking" |

These are tiny models trained only on TinyChat, so they chat about everyday things (feelings, weather, activities) but won't recall facts like your name.

Sampling: temperature (temp, default 0.4). Higher temps (0.8+) make these small models ramble — 0.4 keeps them coherent with a little variety.
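Temperature sampling as described above can be sketched as a softmax over logits divided by temp, followed by a cumulative draw. This is a generic illustration rather than the project's sample block; treating temp ≤ 0 as greedy argmax is an assumption.

```python
import math
import random


def sample(logits, temp, rng=random):
    """Draw a token index from softmax(logits / temp)."""
    if temp <= 0:
        # Assumed convention: non-positive temperature means greedy argmax.
        return max(range(len(logits)), key=logits.__getitem__)
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp((l - peak) / temp) for l in logits]
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(logits) - 1  # guard against rounding at the upper edge


logits = [2.0, 1.0, 0.0]
print([sample(logits, 0.4) for _ in range(10)])
print([sample(logits, 2.0) for _ in range(10)])
```

At temp 0.4 the top logit dominates most draws. At higher temps the distribution flattens and weaker tokens appear more often, which is the rambling the text warns about.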

Size & where it runs (read this before trying to publish)

These run in TurboWarp, not on scratch.mit.edu. Two hard reasons, both fundamental:

  1. Upload size. The weights live in a list, so the uncompressed project.json is 7-23 MB; scratch.mit.edu caps uploads around 5 MB and rejects it (HTTP 413).
  2. The 200,000-item list cap. scratch-vm refuses to build a list past 200,000 entries at runtime — and the models have 226K-4.4M weights. So you can't store the weights compressed and unpack them on load: the unpacked list just truncates. The weights must be the list's loaded-from-project.json initial value (loading isn't capped), which is exactly what blows past the upload limit. Catch-22.

So there's no way onto scratch.mit.edu for a real model. To publish: use the TurboWarp Packager to export a standalone HTML and host it (e.g. GitHub Pages) — no size limit, full speed.
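The list-cap half of the Catch-22 is simple arithmetic, sketched below. The cap and the weight counts are the figures quoted above; the helper name `unpack_at_runtime` is hypothetical and only models "appends past the cap are lost".

```python
LIST_CAP = 200_000  # runtime list limit quoted in the text


def unpack_at_runtime(n_weights, cap=LIST_CAP):
    """Return (kept, lost) if n_weights items were appended to a capped list."""
    kept = min(n_weights, cap)
    return kept, n_weights - kept


# Smallest and largest weight counts quoted in the text.
for n in (226_000, 4_400_000):
    kept, lost = unpack_at_runtime(n)
    print(f"{n:,} weights: {kept:,} kept, {lost:,} lost")
```

Even the smallest model loses weights under the cap, so unpacking at runtime is not an option for any of them.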

Pipeline (train outside, infer in Scratch)

download_model.py     pull stories260K.bin + tok512.bin from HuggingFace
llama_ref.py          pure-NumPy reference forward pass (the spec) + greedy gen
tokenizer_ref.py      reference BPE encoder + numeric merge/char tables
gen_scratch_llm.py    codegen:  python llm/gen_scratch_llm.py [verify|chat|chat_verify]
verify_llm.js  / compare.py        Scratch logits  vs NumPy reference (argmax-exact)
verify_enc.js  / compare_enc.py    Scratch encoder == reference tokens
run_generate.js / run_chat.js      stream a story / chat reply headless

Verified

gen_scratch_llm.py verify emits a one-forward-pass build; verify_llm.js runs it in scratch-vm and compare.py checks it against the NumPy reference:

reference argmax: 403  -> 'Once'
scratch   argmax: 403  -> 'Once'
argmax match: True

The full forward pass (matmul, RMSNorm, RoPE, grouped-query attention, SwiGLU, classifier) is correct: in pure float it matched the NumPy reference to ~1e-14, i.e. float noise. With int8 weights the logits differ slightly but the predicted token is identical, which is what generation needs.
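An argmax-exact comparison of this kind can be sketched as follows. This is not the actual compare.py; the function names and the returned fields are assumptions chosen to mirror the output shown above.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def compare(ref_logits, scratch_logits):
    """Compare two logit vectors by predicted token and by largest deviation."""
    if len(ref_logits) != len(scratch_logits):
        raise ValueError("logit vectors differ in length")
    ref_top = argmax(ref_logits)
    scratch_top = argmax(scratch_logits)
    max_abs_diff = max(abs(a - b) for a, b in zip(ref_logits, scratch_logits))
    return {
        "ref_argmax": ref_top,
        "scratch_argmax": scratch_top,
        "match": ref_top == scratch_top,
        "max_abs_diff": max_abs_diff,
    }


# Toy logits: small numeric drift, same winning token.
report = compare([0.1, 2.5, -1.0], [0.12, 2.47, -0.98])
print(report)
```

Reporting the largest absolute difference alongside the argmax match separates the two cases in the text: float noise in the pure-float build, and a small but harmless drift with int8 weights.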

Speed

In the headless scratch-vm interpreter it runs at ~0.6 s/token. TurboWarp compiles to JS, so it's far faster there (the hot loop is ~260K multiply-adds per token). Levers already applied: KV cache, flat-list weights, warp blocks, hoisted matmul row base, int8-quantized weights (smaller load). Further options: op fusion, a smaller-vocab/byte-level model, or a 1-bit model (sign-bit weights → adds instead of multiplies).
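The 1-bit idea mentioned above, where sign-bit weights turn multiplies into adds, can be sketched like this. It is a speculative illustration of an option the project has not implemented; the per-row scale and the 1 = +1 / 0 = -1 bit encoding are assumptions.

```python
def mm_1bit(sign_bits, scales, x, rows, cols):
    """Matrix-vector product with weights restricted to +1 / -1.

    sign_bits is a flat row-major list: 1 means weight +1, 0 means weight -1.
    The inner loop only adds or subtracts; one multiply per row applies
    the (assumed) per-row scale.
    """
    y = []
    for i in range(rows):
        base = i * cols  # hoisted row base, as in the regular matmul
        acc = 0.0
        for j in range(cols):
            if sign_bits[base + j]:
                acc += x[j]
            else:
                acc -= x[j]
        y.append(scales[i] * acc)
    return y


# Toy 2x3 example.
bits = [1, 0, 1, 0, 0, 1]
scales = [0.5, 2.0]
x = [1.0, 2.0, 3.0]
y = mm_1bit(bits, scales, x, 2, 3)
print(y)
```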