Official code repository for the paper "Genomic Next-Token Predictors are In-Context Learners".
Authors: Nathan Breslow, Aayush Mishra, Mahler Revsine, Michael C. Schatz, Anqi Liu, Daniel Khashabi
- Overview
- Key Findings
- Installation
- Quick Start
- Experimental Framework
- Usage Guide
- Understanding Results
- Reproducing Paper Results
This repository demonstrates that in-context learning (ICL) is not unique to language models. We show that genomic foundation models trained on DNA sequences exhibit the same log-linear ICL scaling trends as language models, suggesting that ICL emerges from large-scale next-token prediction over pattern-rich data rather than from language-specific properties.
ICL is the ability of a model to infer and apply patterns from examples provided within its input prompt, without any parameter updates. For example, given:
```
Input: 10100000 → Output: 01011111
Input: 11100011 → Output: 00011100
Input: 11000000 → Output: ?
```
A model with ICL can infer the pattern (bitwise NOT) and correctly predict 00111111.
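The transformation in this example is easy to verify by hand; as a quick sanity check, here it is in plain Python (not part of the repo's code):

```python
# Bitwise NOT over a bitstring: flip every bit.
def bitwise_not(bits: str) -> str:
    return "".join("1" if b == "0" else "0" for b in bits)

assert bitwise_not("10100000") == "01011111"  # Example 1
assert bitwise_not("11100011") == "00011100"  # Example 2
print(bitwise_not("11000000"))  # -> 00111111
```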
We created 100 symbolic bitstring transformation tasks that can be encoded in both:
- Genomic sequences (A/T/C/G nucleotides) for Evo2 models
- Linguistic sequences (random digits) for Qwen3 language models
This enables direct "apples-to-apples" comparison of ICL across modalities.
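The two encodings carry identical information; a minimal sketch of the idea (the symbol choices below are illustrative, since the actual scripts sample them randomly per trial):

```python
# Render the same bitstring in an arbitrary two-symbol alphabet:
# nucleotides for Evo2, digits for Qwen3.
def encode(bits: str, zero: str, one: str) -> str:
    return "".join(one if b == "1" else zero for b in bits)

bits = "10100000"
print(encode(bits, zero="T", one="A"))  # genomic:    ATATTTTT
print(encode(bits, zero="3", one="1"))  # linguistic: 13133333
```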
- ICL exists in genomic models: Evo2 shows log-linear accuracy improvements with more demonstrations (1→128 shots)
- Comparable performance: Evo2-40B matches/exceeds Qwen3-14B on many tasks
- Different strengths:
- Qwen3 excels at: global operations (parity, majority), shift operations
- Evo2 excels at: full-bitstring transformations (identity, NOT, reverse)
- Modality-agnostic ICL: Supports hypothesis that ICL emerges from predictive compression, not language structure
- Python 3.8+ (Python 3.12 preferred, Python 3.13 may not work)
- For Qwen models: macOS (Apple Silicon) or Linux with CUDA
- For Evo2 models: NVIDIA H100 GPU with CUDA 12.1+
```shell
# Install requirements
pip install -r requirements.txt
```

Evo2 requires special CUDA configuration. Follow the Evo2 installation guide.
```shell
# After setting up Evo2's environment
pip install evo2  # Follow Arc Institute's instructions
```

Note that the evaluation scripts expect all Qwen base models to be in the local evaluation directory. You can change the model flag to `Qwen/Qwen3-4B-Base` to fetch from the Hugging Face Hub instead.
```shell
# Test Qwen3-4B with 8 in-context examples
python program_synth.py \
  --programs curated_transformations.jsonl \
  --model Qwen3-4B-Base \
  --lmshuffle \
  --per-trial-random \
  --in-context-examples 8 \
  --output qwen3_4b_8shot.json

# Results are saved as JSON
cat qwen3_4b_8shot.json | jq '.overall'
```

Output:

```json
{
  "mean_accuracy": 0.1775,
  "stderr": 0.024821331246500152,
  "mean_edit_distance": 2.48625,
  "edit_distance_stderr": 0.103458120522365
}
```

Each task requires the model to infer a transformation from few-shot examples:
Linguistic Encoding (Qwen):

```
1113 3111 6   # Example 1: input → output, separator "6"
1311 1131 6   # Example 2
3331 1333 6   # Example 3
1133 ?        # Query: model must predict output
```

Genomic Encoding (Evo2):

```
AAAT TAAA G   # Same examples, encoded as nucleotides
ATAA AATA G
TTTA ATTT G
AATT ?
```
Key Design Choices:
- Random encoding prevents memorization (different symbols per trial)
- 8-bit strings (256 possible inputs)
- Exact match evaluation
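Putting these design choices together, a prompt for one trial can be assembled roughly like this (a hypothetical sketch; the repo's actual prompt format and symbol sampling may differ):

```python
import random

def build_prompt(pairs, query, alphabet="0123456789"):
    # Per-trial random encoding: fresh symbols for 0, 1, and the separator,
    # so the model cannot rely on memorized digit meanings.
    zero, one, sep = random.sample(list(alphabet), 3)
    enc = lambda bits: "".join(one if b == "1" else zero for b in bits)
    lines = [f"{enc(x)} {enc(y)} {sep}" for x, y in pairs]
    lines.append(enc(query))  # the model must complete the output
    return "\n".join(lines)

random.seed(42)
print(build_prompt([("10100000", "01011111"),
                    ("11100011", "00011100")], "11000000"))
```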
Programs are defined in curated_transformations.jsonl:
```shell
python program_synth.py \
  --programs curated_transformations.jsonl \
  --model Qwen3-1.7B-Base \
  --lmshuffle \
  --per-trial-random \
  --in-context-examples 16 \
  --trials-per-program 8 \
  --bit-length 8 \
  --seed 42 \
  --output qwen3_1.7b_16shot.json
```

- `--lmshuffle`: randomizes which digits represent 0/1/separator (prevents memorization)
- `--per-trial-random`: resamples the encoding for every trial (more rigorous)
- `--in-context-examples`: number of demonstrations (k-shot)
- `--trials-per-program`: how many random prompts to test per program
- `--bit-length`: bitstring length
- `--seed`: reproducibility
```shell
# Standard encoding: 0="0", 1="1", sep="\n", arrow="->"
python program_synth.py \
  --programs curated_transformations.jsonl \
  --model Qwen3-0.6B-Base \
  --in-context-examples 8 \
  --output qwen3_0.6b_standard.json
```

Requirements: H100 GPU, CUDA 12.1+, Evo2 properly installed.
```shell
# --model can be evo2_1b_base, evo2_7b, or evo2_40b
python program_synth_evo_smartbatch.py \
  --programs curated_transformations.jsonl \
  --model evo2_7b \
  --in-context-examples 16 \
  --trials-per-program 8 \
  --output evo2_7b_16shot.json
```

Smart Batching: the script automatically manages GPU memory:
- Starts with full batch size
- Halves batch size on OOM errors
- Caches working batch size for future runs
- Falls back to batch_size=1 if needed
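The fallback loop amounts to something like the following (an illustrative sketch with hypothetical names; the real script catches CUDA OOM exceptions and also caches the working batch size between runs):

```python
def run_with_fallback(forward, inputs, batch_size):
    # Try progressively smaller batches until the forward pass fits in memory.
    while batch_size >= 1:
        try:
            batches = [inputs[i:i + batch_size]
                       for i in range(0, len(inputs), batch_size)]
            return [forward(b) for b in batches], batch_size
        except MemoryError:  # stand-in for torch.cuda.OutOfMemoryError
            batch_size //= 2  # halve and retry
    raise RuntimeError("Out of memory even at batch_size=1")
```

In the real script the `except` clause targets CUDA OOM errors and frees cached GPU memory before retrying.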
Predicts the most common output among the in-context examples:

```shell
python program_synth.py \
  --programs curated_transformations.jsonl \
  --backend naive \
  --naive-baseline modal \
  --in-context-examples 16 \
  --trials-per-program 8 \
  --output baseline_mode_16shot.json
```

Simply returns the input unchanged:
```shell
python program_synth.py \
  --programs curated_transformations.jsonl \
  --backend naive \
  --naive-baseline identity \
  --in-context-examples 16 \
  --output baseline_identity_16shot.json
```

```shell
python sweep_program_synth_totalshuffle.py --per-trial-random
```

This evaluates:
- Models: Qwen3-0.6B, 1.7B, 4B, 8B, 14B
- Shot counts: 1, 2, 4, 8, 16, 32, 64, 128
- Saves results in: `sweep_results/`
```shell
python sweep_program_synth_naive.py --per-trial-random
```

```shell
python sweep_program_synth_evo_smartbatch.py
```

Evaluates Evo2-1B, 7B, and 40B across all shot counts.
Warning: Full sweeps take significant time and compute - typically 1-2 H100 hours.
Each evaluation produces a JSON file:

```json
{
  "config": {
    "model": "Qwen3-4B-Base",
    "in_context_examples": 16,
    "trials_per_program": 8,
    "backend": "mlx",
    "lmshuffle": true,
    "per_trial_random": true,
    "bit_length": 8,
    "seed": 42
  },
  "overall": {
    "mean_accuracy": 0.1840,
    "stderr": 0.0269,
    "mean_edit_distance": 4.1234,
    "edit_distance_stderr": 0.1456
  },
  "tasks": [
    {
      "index": 0,
      "program": ["identity"],
      "description": "Return the input unchanged.",
      "total_trials": 8,
      "correct": 1,
      "accuracy": 0.125,
      "average_edit_distance": 6.875,
      "trials": [...]
    },
    ...
  ]
}
```

- `mean_accuracy`: proportion of exactly correct predictions (0-1)
- `stderr`: standard error across tasks (for significance testing)
- `edit_distance`: Levenshtein distance between prediction and expected output
- per-task `accuracy`: success rate for each of the 100 programs
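For reference, the edit-distance metric is the standard Levenshtein distance; a minimal implementation:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic programming over prefix lengths: prev[j] holds the distance
    # between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("00111111", "00110111"))  # one substitution -> 1
```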
```shell
# Overall accuracy
cat sweep_results_final/Qwen3-4B-Base_ic16.json | jq '.overall.mean_accuracy'

# Top 10 easiest tasks
cat sweep_results_final/Qwen3-4B-Base_ic16.json | jq -r '.tasks | sort_by(.accuracy) | reverse | .[0:10] | .[] | "\(.accuracy)\t\(.description)"'

# Compare two models
python -c "
import json
with open('sweep_results_final/Qwen3-4B-Base_ic16.json') as f:
    qwen = json.load(f)
with open('sweep_results_final/evo2_7b_ic16.json') as f:
    evo = json.load(f)
print(f'Qwen3-4B: {qwen[\"overall\"][\"mean_accuracy\"]:.4f}')
print(f'Evo2-7B:  {evo[\"overall\"][\"mean_accuracy\"]:.4f}')
"
```

```shell
# Run full sweeps (this takes time!)
python sweep_program_synth_totalshuffle.py --per-trial-random
python sweep_program_synth_evo_smartbatch.py
python sweep_program_synth_naive.py --per-trial-random

# Generate figures
python analyze_tasks_final.py
python generate_graphs_final.py
# Figures saved to: graphs_final/

# Compute bootstrap confidence intervals
python compute_bootstrap_final.py

# Run hypothesis tests
python hypothesis_test_final.py

# View results
cat hypothesis_results_final.txt
```

All paper results are included in `sweep_results_final/`:
```shell
# List all result files
ls sweep_results_final/

# Example: view Evo2-40B performance at 128 shots
cat sweep_results_final/evo2_40b_ic128.json | jq '.overall'
```

- Exact replication: due to stochastic generation, results may vary slightly from the paper
- Mode baseline: the GitHub version includes fewer trials (file size limit), so p-values differ slightly
- Random seeds: set `--seed 42` for consistency
- Evo2 requirements: requires an H100 GPU; other GPUs may OOM or produce different results
- Issues: Open a GitHub issue with error logs
- Questions: Contact Nathan Breslow ([email protected])
- Evo2 setup: See Arc Institute's documentation
We thank the Arc Institute for releasing Evo2 and the Qwen team for open-sourcing their models. This work was supported by Johns Hopkins University.