Skip to content

allenai/EMO

Repository files navigation

Emo Logo

GitHub License Blog Post Paper URL Model Checkpoints

EMO is a new Mixture-of-Experts model trained so that modular structure emerges during pretraining without requiring human-defined priors. EMO enables selective expert use, down to 12.5% of total experts, with minimal performance degradation. We find that its expert groups specialize to higher-level topics and capabilities rather than low-level lexical patterns.

Table of Contents

Installation

git clone https://github.com/allenai/EMO.git
cd EMO
conda create -n emo python==3.12
conda activate emo
uv pip install -e .[all]
uv pip install --upgrade 'chardet>=7'

Released Models

All checkpoints are available in the EMO collection on the Hugging Face Hub.

Main Release (1T tokens)

Model Active Total Pretraining (1T) Annealing (50B) Description
allenai/Emo_1b14b_1T 1B 14B EMO [train_script] EMO [train_script] Main EMO release
allenai/StdMoE_1b14b_1T 1B 14B standard [train_script] standard [train_script] Architecture-matched standard MoE baseline

Ablation Models (130B tokens)

Smaller-scale checkpoints used for memory-matched comparisons. These models were not midtrained.

Model Active Total Pretraining (130B) Description
allenai/Emo_1b14b_130B 1B 14B EMO [train_script] EMO at the 130B-token ablation scale
allenai/StdMoE_1b14b_130B 1B 14B standard [train_script] Standard MoE baseline at the 130B-token scale
allenai/StdMoE_1b4b_130B 1B 4B standard [train_script] Memory-matched standard MoE with 32 experts ("Reg. MoE @ 32" in Figure 1), used as a memory-matched baseline for EMO's 32-expert subsets
allenai/Dense_1b_130B 1B 1B dense LM [train_script] Dense baseline matched to active parameters ("Dense @ 8" in Figure 1), used as a memory-matched baseline for EMO's 8-expert subsets

Midtraining Ablation Models

Checkpoints used in Appendix B.4 to test whether modularity can be induced after pretraining via annealing alone, rather than during pretraining.

Model Active Total Pretraining (1T) Annealing (50B) Description
allenai/StdMoE_1b14b_1T_Preanneal 1B 14B standard [train_script] Standard MoE checkpoint after 1T-token pretraining, before any annealing. Starting point for the EMO-anneal experiment
allenai/StdMoE_1b14b_1T_EmoAnnealed 1B 14B standard [train_script] EMO [train_script] EMO-anneal: a standard MoE annealed under the document-level expert pool constraint for 50B tokens

Inference

See Released Models for the available checkpoints. All inference snippets below require trust_remote_code=True since the models use custom modeling code from the ryanyxw/transformers fork (Note: you do not need to clone this fork yourself, the Hugging Face Hub will pull the necessary code when you load the model with trust_remote_code=True).

With Hugging Face Transformers

You can use our Hugging Face transformers integration to run inference on the released checkpoints:

from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "allenai/Emo_1b14b_1T"
olmo = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
message = ["Language modeling is "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
# inputs = {k: v.to('cuda') for k,v in inputs.items()} # optional verifying cuda
# olmo = olmo.to('cuda')
response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=1.0, top_p=0.7)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

Alternatively, with the Hugging Face pipeline abstraction:

from transformers import pipeline
olmo_pipe = pipeline("text-generation", model="allenai/Emo_1b14b_1T", trust_remote_code=True)
print(olmo_pipe("Language modeling is"))

With vLLM

vLLM provides high-throughput inference. We ship a small out-of-tree plugin at src/vllm_plugin/ that registers EmoForCausalLM with vLLM's native model registry

pip install vllm>=0.11.0
pip install -e src/vllm_plugin  # optional; only needed for the native path

You can run offline batched inference:

from vllm import LLM, SamplingParams
llm = LLM(model="allenai/Emo_1b14b_1T", trust_remote_code=True)
sampling_params = SamplingParams(temperature=1.0, top_p=0.7)
prompts = ["Language modeling is"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

For more details, see the vLLM documentation.

Data

All EMO models are pretrained on the exact same data as OLMoE, whose raw (text) form is publicly available on the Hugging Face Hub as allenai/OLMoE-mix-0924. The clustering/coverage analyses additionally use the publicly available WebOrganizer dataset.

These are raw-text datasets. Before they can be used by the scripts in this repo, they must be tokenized into .npy files and then registered as a data mix — a config file in this repo (under src/olmo_core/data/mixes/) that lists the tokenized files. The two steps are described below.

Tokenizing the data

The training and analysis scripts read pre-tokenized data, i.e. a collection of .npy files where each file holds a flat array of token IDs. To produce these files from the raw datasets above, tokenize them yourself with Dolma (follow the tokenization instructions there). Use the same tokenizer as the released models (allenai/dolma2-tokenizer).

Registering a data mix

Once the data has been prepared, you will have a directory of .npy files. To make them usable by the pretraining scripts, register them as a named data mix:

  1. Create a mix file. Add a new .txt file under src/olmo_core/data/mixes/, e.g. my-olmoe-mix.txt. Each non-comment line is label,path, one per .npy file. The label is an arbitrary source/domain tag (used for per-source logging); path is appended to the --data-root base directory at runtime, so list paths relative to that root (or set --data-root=/ and use absolute paths). For example:

    # my-olmoe-mix.txt
    cc,tokenized/cc/part-00-00000.npy
    cc,tokenized/cc/part-01-00000.npy
    starcoder,tokenized/starcoder/part-00-00000.npy
    

    You can also use the {TOKENIZER} placeholder in a path (it is substituted with the tokenizer id at load time) — see OLMoE-mix-0824.txt for a full example.

  2. Register the mix name. Add a corresponding member to the DataMix enum in src/olmo_core/data/mixes/__init__.py, where the value matches the file's basename:

    class DataMix(DataMixBase):
        ...
        my_olmoe_mix = "my-olmoe-mix"
  3. Point the launch script at the mix. In your pretraining launch script (see Training scripts), select the mix and point --data-root at the directory containing the .npy files:

    --dataset.mix=my-olmoe-mix \
    --data-root=/path/to/tokenized/data \

    (--data-root defaults to the internal /weka/... path, so set it to your local/remote data root.)

Training scripts

Project-specific pretraining recipes live in scripts. Please refer to Released Models for the training scripts corresponding to each released checkpoint. See Data for how to prepare and register the tokenized data these scripts read.

Run a script locally:

bash scripts/models/dense_1b_lr-4e-3_0213.sh

Submit it as a Beaker job:

MODE=beaker bash scripts/models/dense_1b_lr-4e-3_0213.sh

Override paths via env vars before launching:

  • PREFIX — output root
  • MODELS_DIR — derived from PREFIX (${PREFIX}/models)
  • DATASET_CACHE — tokenizer-mapped dataset cache

Override Beaker cluster sizing per script with BEAKER_GPUS=8 BEAKER_NODES=4 ....

After training, OLMo-core checkpoints can be converted to the HuggingFace format (suitable for inference and the evaluation pipelines below) with scripts/convert_olmo_to_hf.sh.

Evaluation scripts

Selective Expert Usage

The launch scripts in scripts/selective_hf/ exercise the full router-activation → expert selection → finetuning → eval pipeline on the released checkpoints. Each (model × keep-k × task × method) combination lands in its own subdirectory under selective_evals_final/<model>/..., with the pruned-expert model, finetuned checkpoint, and per-checkpoint metrics all colocated. Three scripts target different questions:

Script Investigates Sweep
launch_selective_hf.sh Main selective-expert evaluation — how each released model performs when only a subset of experts is retained for a given task (Figure 3 of the paper). All released models × keep-k ∈ {8, 16, 32, 64, 128} × MC9 / Gen5 / MMLU / MMLU-Pro / GSM8K task groups.
launch_selective_method_hf.sh Robustness to the choice of expert-selection method (Figure 4 of the paper). {layerwise, easy_ep, random} selection methods × main 1T models × keep-k × tasks.
launch_selective_validation_hf.sh Calibration-data ablation — how much validation data and how many few-shot examples are needed to identify the right experts (Appendix B.2 of the paper). Validation-set sizes ∈ {1, 5, 10, 100, All} × 3 shot-count configurations × Emo_1b14b_1T × keep-k ∈ {8, 16, 32, 128} × tasks.

Output layout

Every (model × keep-k × task × method) combination produces one self-contained subdirectory under selective_evals_final/:

selective_evals_final/
└── <sanitized_model>/                            # e.g. allenaiEmo_1b14b_1T
    └── <task>_keepk_<K>_bs-<B>_lr-<LR>_epoch-<E>_selectivemode-{layerwise,easy_ep,random}[_nselective-<N>][_pseed-<S>][_pshots-<X>][_eshots-<Y>]/
        ├── selected_model/                       # pruned-expert HF checkpoint + pruning_metadata.json
        ├── finetuned_model/
        │   └── checkpoint-<N>/                   # HF Trainer-format finetuned weights
        └── results/
            └── checkpoint-<N>/
                ├── task-<name>-metrics.json      # aggregate metrics for the task
                ├── task-<name>-predictions.jsonl # per-instance predictions
                └── per_subject/                  # only for MMLU category tasks
                    └── <subject>/
                        └── task-<name>-metrics.json

The optional _nselective-, _pseed-, _pshots-, _eshots- suffixes only appear when the corresponding override is set (e.g. you'll only see _nselective-100 when running with a sub-sampled calibration set).

Customization

Each script writes its config (MODELS, SELECTIVE_KEEP_K_VALUES, TASK_GROUPS_LIST, etc.) at the top — comment lines out to skip combinations. Override the output root with OUTPUT_DIR=… and the per-worker GPU count with NUM_GPUS=….

We recommend running these on a slurm or other scheduling system, since each script launches many sequential worker invocations.

Aggregating results into tables

Once one of the launchers above has populated selective_evals_final/, two scripts in scripts/plotting/ walk the per-run subdirectories and produce flat CSV/TSV/markdown tables suitable for downstream analysis:

Script Source launcher What it produces
get_table_scores_selective_evals_final.py launch_selective_hf.sh and launch_selective_method_hf.sh Per-metric tables with rows = (model × keep-k variant) and paired columns "task (lw) / task (ep) / task (rd)" — one column per selection method. Group averages (mc9_avg, gen5_avg, mmlu_merged_avg_no_other, mmlu_pro_merged_avg_no_other) are prepended automatically. Both last-checkpoint (post-finetune) and first-checkpoint (pre-finetune) variants are emitted by default.
get_table_scores_nselective_ablation.py launch_selective_validation_hf.sh Validation-data-ablation tables: rows = (model, selection-method, task group), columns = keepk_K (1) / keepk_K (5) / keepk_K (10) / keepk_K (100) / keepk_K (All) / keepk_K (Random). Includes optional 0-shot variants when _pshots-0/_eshots-0 runs are present.

Both scripts default to reading from <repo>/selective_evals_final/ and writing to <repo>/plots/. Overrides:

# Main + method-comparison tables
python -m scripts.plotting.get_table_scores_selective_evals_final \
    --selective-evals-root selective_evals_final \
    --output-dir plots

# Validation-size ablation tables
python -m scripts.plotting.get_table_scores_nselective_ablation \
    --selective-evals-root selective_evals_final \
    --output-dir plots

The model registries (MODEL_SPECS at the top of each file) currently list the released HF Hub checkpoints — add new entries there if you point either script at a directory built from a different model.

Clustering Pretraining Document Tokens

scripts/clustering/run_pretraining_compare.sh reproduces the side-by-side router-activation clustering used to compare EMO and the standard MoE baseline (Section 5.3 / Figure 5 of the paper). For each of allenai/Emo_1b14b_1T and allenai/StdMoE_1b14b_1T it:

  1. Streams ~1M tokens of the OLMoE pretraining mix from S3
  2. Runs a forward pass and saves token-level router logits
  3. Derives softmax probs, runs PCA + spherical k-means at k=32
  4. Renders an interactive side-by-side HTML explorer of both models' clusters
bash scripts/clustering/run_pretraining_compare.sh
# → cluster_eval_final/pretraining/compare_Emo_1b14b_1T_vs_StdMoE_1b14b_1T.html

Output layout

cluster_eval_final/
├── pretraining_mix.json                   # generated once, then reused
└── pretraining/
    ├── Emo_1b14b_1T/
    │   ├── embeddings_logits.npy + ...    # extract outputs (tokens, doc boundaries, metadata)
    │   ├── embeddings_probs.npy           # transform output
    │   └── probs_mean_pca_l2_spherical_kmeans_k32/
    │       ├── assignments.npy, run_info.json, summary.json
    │       └── cluster_explorer.html
    ├── StdMoE_1b14b_1T/
    │   └── (same structure)
    └── compare_Emo_1b14b_1T_vs_StdMoE_1b14b_1T.html

The underlying primitives (extract / transform / cluster / visualize) live in scripts/clustering/ — see its README for the modular pipeline.

Customization

  • CLUSTER_ROOT=… overrides the output root (default cluster_eval_final/).
  • TARGET_TOKENS=… and MAX_TOKENS_PER_DOC=… change the extraction budget and per-doc truncation.
  • CUDA_VISIBLE_DEVICES=… restricts which GPUs the model is sharded across.

Note: this script uses the exact same data as OLMoE (allenai/OLMoE-mix-0924). See Data for how to obtain, tokenize, and register this data.

Weborganizer Expert Coverage

scripts/clustering/run_weborganizer_compare.sh reproduces the per-domain expert-activation heatmaps used to compare EMO and the standard MoE baseline (Section 5.3 / Figure 6 of the paper). For each of allenai/Emo_1b14b_1T and allenai/StdMoE_1b14b_1T it:

  1. Streams ~20M tokens of the cc_all_dressed weborganizer mix from S3, sampled uniformly across the 24 topics
  2. Runs a single forward pass and aggregates router activations into per-document expert vectors (top-k frequency + softmax probs)
  3. Renders 5 expert-coverage heatmaps per embedding type (10 PNGs total per model)

Both models share a single topic_order.json (stratified row/column ordering) so the resulting heatmaps are directly comparable side-by-side.

bash scripts/clustering/run_weborganizer_compare.sh
# → cluster_eval_final/weborganizer/{Emo_1b14b_1T,StdMoE_1b14b_1T}/*.png

Output layout

cluster_eval_final/
└── weborganizer/
    ├── mix_composition.json      # auto-generated on first run by extract_document.py
    ├── topic_order.json          # shared row/column ordering for cross-model comparison
    ├── Emo_1b14b_1T/
    │   ├── embeddings_doc_topk_freq.npy
    │   ├── embeddings_doc_probs.npy
    │   └── *.png                 # 5 heatmaps × 2 embedding types = 10 PNGs
    └── StdMoE_1b14b_1T/
        └── (same structure)

The underlying primitives (extract_document / plot_doc_expert_coverage) live in scripts/clustering/weborganizer/.

Customization

  • CLUSTER_ROOT=… overrides the output root (default cluster_eval_final/).
  • TARGET_TOKENS=… changes the extraction budget (default 20M).
  • CUDA_VISIBLE_DEVICES=… restricts which GPUs the model is sharded across.

Note: this script uses the WebOrganizer dataset, which is publicly accessible here. See Data for how to tokenize and register this data.

Contact and Contributing

If you have a fix, improvement, or extension you'd like to share, please open a pull request — direct contributions are the best way to help the project, and we're happy to review them.

For other interactions:

  • Public questions, bug reports, or feature suggestions: please file a GitHub issue. This keeps the conversation visible to everyone and lets others benefit from the answer.
  • Private or sensitive inquiries (e.g. anything you'd rather not discuss in public): email ryanyxw@berkeley.edu.

Citing

@misc{wang2026emopretrainingmixtureexperts,
      title={EMO: Pretraining Mixture of Experts for Emergent Modularity}, 
      author={Ryan Wang and Akshita Bhagia and Sewon Min},
      year={2026},
      eprint={2605.06663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.06663}, 
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages