Download TinyStories, tokenize it with marin-community/marin-tokenizer, train a ~30M-param Grug decoder-only transformer.
See the repo root README for the getting-started workflow (copy, adapt, smoke-test, scale up). This file documents what's specific to the tiny-stories template.
launch.py wires three ExecutorSteps. Each step's config is hashed (via versioned(...)) so outputs cache on content; output_path_of(...) and this_output_path() thread paths between steps without hardcoding them.
tinystories_download = download_hf_step(
"raw/tinystories",
hf_dataset_id=TINYSTORIES_HF_ID,
revision=TINYSTORIES_HF_REVISION,
hf_urls_glob=["data/*.parquet"],
).as_executor_step()Pinned to an HF revision so caching is deterministic. Swap TINYSTORIES_HF_ID / TINYSTORIES_HF_REVISION for a different dataset; refresh the revision with curl -s https://huggingface.co/api/datasets/<id> | jq -r .sha.
tinystories_tokenized = ExecutorStep(
name="tokenized/tinystories",
fn=tokenize,
config=TokenizeConfig(
train_paths=[output_path_of(tinystories_download, "data/train-*.parquet")],
validation_paths=[output_path_of(tinystories_download, "data/validation-*.parquet")],
cache_path=this_output_path(),
tokenizer=versioned(MARIN_TOKENIZER),
format=TextLmDatasetFormat(),
sample_count=versioned(1_000 if _accelerator() == "cpu" else None),
),
)CPU caps sample_count=1000 per shard; TPU/GPU tokenize everything. Swap the tokenizer by changing MARIN_TOKENIZER.
run_tiny_stories_trial wraps Levanter's TrainerConfig with a GrugRunConfig (model, data, resources, optimizer) and calls run_grug. Resize the model via TINY_MODEL (hidden_dim, num_layers, num_heads, max_seq_len). Change accelerator/resources via _resolve_resources().
| Var | Default | Meaning |
|---|---|---|
ACCELERATOR |
tpu |
tpu / gpu / cpu — picks resources, batch size, precision, sample cap. |
MARIN_PREFIX |
/tmp/marin |
Pipeline output root (GCS on cluster, local dir otherwise). |
TINY_STORIES_STEPS |
2000 (1 on CPU) |
Training step count. |
GRUG_RUN_ID |
tiny-stories-30m-$ACCELERATOR |
W&B / run ID. |
WANDB_API_KEY |
unset | Optional. W&B is disabled on CPU regardless. |
launch.py— pipeline definition,TINY_MODEL,executor_mainentrypoint.model.py—GrugModelConfig.train.py—run_grug,GrugRunConfig,GrugTrainerConfig,GrugEvalConfig.pyproject.toml— marin wheel pins andfind-linksURLs.