Copy-paste templates for running Marin pipelines as standalone experiments. Each template is a self-contained directory — marin is pulled in as a library via find-links wheels, no submodule, no vendoring.
| Template | Input | Pipeline |
|---|---|---|
| tiny-stories/ | HF text dataset | download → tokenize → train |
| speech-asr/ | HF audio dataset | download → Mimi-encode → train BPE → tokenize → train |
Start with tiny-stories/ if your data is text. Start with speech-asr/ if you need a pre-tokenization stage (audio, images, anything that needs to become discrete tokens before training).
cp -r tiny-stories my-experiment
cd my-experiment
Each template has its own pyproject.toml and virtual environment — nothing cross-references the source directory.
Every template is driven by one launch.py that wires ExecutorSteps together. The per-template README walks through each stage and calls out what to change:
- Data: swap the HF dataset ID + revision at the top of launch.py.
- Model: resize TINY_MODEL / SPEECH_MODEL (hidden_dim, num_layers, num_heads, max_seq_len).
- Tokenizer: swap MARIN_TOKENIZER, or (for speech-asr) change the BPE vocab size / special tokens.
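When resizing the model, it can help to sanity-check the four knobs before launching anything. The sketch below is not the marin config class: `ModelSize` is a hypothetical stand-in, the numeric values are illustrative, and the parameter estimate is the usual rough rule of thumb for a standard transformer block (about 12 × hidden_dim² weights per layer, ignoring embeddings and norms).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelSize:
    """Hypothetical stand-in for the TINY_MODEL / SPEECH_MODEL settings."""

    hidden_dim: int
    num_layers: int
    num_heads: int
    max_seq_len: int

    def __post_init__(self) -> None:
        # In a standard multi-head attention layout each head gets
        # hidden_dim / num_heads channels, so the division must be exact.
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        return self.hidden_dim // self.num_heads

    def approx_block_params(self) -> int:
        # Rough estimate: 4*d^2 for attention plus 8*d^2 for a 4x-expansion
        # MLP, per layer. Embeddings, norms, and biases are ignored.
        return 12 * self.num_layers * self.hidden_dim**2


# Illustrative values only; the real templates define their own sizes.
TINY = ModelSize(hidden_dim=64, num_layers=2, num_heads=4, max_seq_len=256)

# Doubling the width while keeping head_dim fixed means doubling num_heads.
WIDER = replace(TINY, hidden_dim=128, num_heads=8)
```

Under this estimate, doubling `hidden_dim` roughly quadruples the per-block weight count, which is worth keeping in mind before moving from the CPU smoke test to a real run.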
Every template supports a CPU smoke test that exercises the full pipeline end-to-end on a tiny subset — enough to confirm download → tokenize → train → checkpoint works before committing compute.
ACCELERATOR=cpu MARIN_PREFIX=/tmp/marin uv run python launch.py
The smoke test finishes in under a minute for tiny-stories and in about 3 minutes for speech-asr, where Mimi encoding on CPU dominates the runtime.
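As a rough illustration of how a launch script could pick up these two environment variables, here is a minimal sketch. It is not taken from the templates' launch.py: `RunSettings` and `read_run_settings` are hypothetical names, the `"tpu"` default and the rule that `MARIN_PREFIX` must be set are assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RunSettings:
    """Hypothetical container for the values read from the environment."""

    accelerator: str
    prefix: str
    smoke_test: bool


def read_run_settings(env: Mapping[str, str] = os.environ) -> RunSettings:
    # Assumption: anything other than "cpu" is treated as a real accelerator
    # run, and "tpu" is used when ACCELERATOR is unset.
    accelerator = env.get("ACCELERATOR", "tpu").lower()

    # Assumption: this sketch refuses to guess an output location.
    prefix = env.get("MARIN_PREFIX")
    if not prefix:
        raise ValueError("MARIN_PREFIX must point at an output directory or bucket")

    return RunSettings(
        accelerator=accelerator,
        prefix=prefix,
        smoke_test=(accelerator == "cpu"),
    )


# Passing an explicit mapping keeps the function easy to test.
settings = read_run_settings({"ACCELERATOR": "cpu", "MARIN_PREFIX": "/tmp/marin"})
```

Taking the environment as a parameter rather than reading `os.environ` directly inside the function makes it possible to exercise both the CPU and accelerator paths without touching the real process environment.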
Once the smoke test passes, submit the same launch.py to the shared marin TPU cluster via iris:
uv run iris --cluster=marin job run python launch.py --region=europe-west4
--cluster=marin targets the shared coordinator. --region is required because TPU availability is region-scoped and the default us-central1 has no v6e-4 capacity.
If you don't have access to the shared marin cluster, you can run your own iris cluster — see the iris docs for setup.
x Failed to download `marin-iris==0.99.devYYYYMMDD`
`-> HTTP status client error (404 Not Found) for url
(https://github.com/marin-community/marin/releases/download/marin-iris-latest/...)
The marin-* wheels are published to rolling GitHub releases whose assets are
replaced on each upstream rebuild, so a committed uv.lock eventually points
at wheels that no longer exist. Repin against the current wheels:
uv lock --upgrade
A scheduled workflow (repin-lockfiles.yml)
keeps the locks in this repo fresh, but if you copied a template into your own
repo a while ago you'll need to repin yourself.
README.md # this file
AGENTS.md # repo-level guidance for Claude / other agents
tiny-stories/ # text template
speech-asr/ # audio template
submodules/marin/ # marin source (for local iris config; not imported)