Genomic Next-Token Predictors are In-Context Learners

Official code repository for the paper "Genomic Next-Token Predictors are In-Context Learners".

Authors: Nathan Breslow, Aayush Mishra, Mahler Revsine, Michael C. Schatz, Anqi Liu, Daniel Khashabi


Overview

This repository demonstrates that in-context learning (ICL) is not unique to language models. We show that genomic foundation models trained on DNA sequences exhibit the same log-linear ICL scaling trends as language models, suggesting that ICL emerges from large-scale next-token prediction over pattern-rich data rather than from language-specific properties.

What's In-Context Learning (ICL)?

ICL is the ability of a model to infer and apply patterns from examples provided within its input prompt, without any parameter updates. For example, given:

Input: 10100000 → Output: 01011111
Input: 11100011 → Output: 00011100
Input: 11000000 → Output: ?

A model with ICL can infer the pattern (bitwise NOT) and correctly predict 00111111.
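
The same check can be scripted in a few lines of Python (illustrative only, not part of the evaluation code):

# Illustrative check of the bitwise-NOT pattern above.
def bitwise_not(bits: str) -> str:
    return "".join("1" if b == "0" else "0" for b in bits)

assert bitwise_not("10100000") == "01011111"
assert bitwise_not("11100011") == "00011100"
print(bitwise_not("11000000"))  # -> 00111111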

Our Contribution

We created 100 symbolic bitstring transformation tasks that can be encoded in both:

  • Genomic sequences (A/T/C/G nucleotides) for Evo2 models
  • Linguistic sequences (random digits) for Qwen3 language models

This enables direct "apples-to-apples" comparison of ICL across modalities.
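
A minimal sketch of this dual encoding (illustrative; in the actual scripts, symbol assignments are randomized per trial):

import random

# Illustrative dual encoding of a bitstring (not the repo's exact code);
# two symbols from the alphabet are drawn at random to stand for 0 and 1.
def encode(bits: str, alphabet: str) -> str:
    zero, one = random.sample(alphabet, 2)
    return bits.translate(str.maketrans({"0": zero, "1": one}))

bits = "10100000"
print(encode(bits, "ATCG"))        # genomic encoding for Evo2
print(encode(bits, "0123456789"))  # linguistic encoding for Qwen3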


Key Findings

  1. ICL exists in genomic models: Evo2 shows log-linear accuracy improvements with more demonstrations (1→128 shots)
  2. Comparable performance: Evo2-40B matches/exceeds Qwen3-14B on many tasks
  3. Different strengths:
    • Qwen3 excels at: global operations (parity, majority), shift operations
    • Evo2 excels at: full-bitstring transformations (identity, NOT, reverse)
  4. Modality-agnostic ICL: Supports hypothesis that ICL emerges from predictive compression, not language structure

Installation

Prerequisites

  • Python 3.8+ (3.12 recommended; 3.13 may not work)
  • For Qwen models: macOS (Apple Silicon) or Linux with CUDA
  • For Evo2 models: NVIDIA H100 GPU with CUDA 12.1+

Basic Setup (Qwen/Linguistic Models)

# Install requirements
pip install -r requirements.txt

Evo2 Setup (Genomic Models)

Evo2 requires special CUDA configuration. Follow the Evo2 installation guide.

# After setting up Evo2's environment
pip install evo2  # Follow Arc Institute's instructions

Quick Start

Note that the evaluation scripts expect all Qwen base models to be available in the local working directory. To fetch from the Hugging Face Hub instead, change the model flag to Qwen/Qwen3-4B-Base.

1. Evaluate a Single Model

# Test Qwen3-4B with 8 in-context examples
python program_synth.py \
    --programs curated_transformations.jsonl \
    --model Qwen3-4B-Base \
    --lmshuffle \
    --per-trial-random \
    --in-context-examples 8 \
    --output qwen3_4b_8shot.json

2. View Results

# Results are saved as JSON
cat qwen3_4b_8shot.json | jq '.overall'

Output:

{
  "mean_accuracy": 0.1775,
  "stderr": 0.024821331246500152,
  "mean_edit_distance": 2.48625,
  "edit_distance_stderr": 0.103458120522365
}

Experimental Framework

Task Design: Bitstring Program Synthesis

Each task requires the model to infer a transformation from few-shot examples:

Linguistic Encoding (Qwen):
1113 3111 6        # Example 1: input → output, separator "6"
1311 1131 6        # Example 2
3331 1333 6        # Example 3
1133 ?             # Query: model must predict output

Genomic Encoding (Evo2):
AAAT TAAA G        # Same examples, encoded as nucleotides
ATAA AATA G
TTTA ATTT G
AATT ?

Key Design Choices:

  • Random encoding prevents memorization (different symbols per trial)
  • 8-bit strings (256 possible inputs)
  • Exact match evaluation
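
Putting these pieces together, prompt construction for a single trial looks roughly like the sketch below (illustrative; the scripts' exact formatting may differ):

# Illustrative few-shot prompt construction in the format shown above.
def build_prompt(demos, query, sep):
    # One line per demonstration: "<input> <output> <sep>"
    lines = [f"{inp} {out} {sep}" for inp, out in demos]
    # The query line is left incomplete; the model must complete the output
    lines.append(query)
    return "\n".join(lines)

demos = [("AAAT", "TAAA"), ("ATAA", "AATA"), ("TTTA", "ATTT")]
print(build_prompt(demos, "AATT", "G"))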

Programs are defined in curated_transformations.jsonl:
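
For illustration only (field names inferred from the result JSON shown under Understanding Results; the actual schema may differ), an entry could look like:

{"program": ["identity"], "description": "Return the input unchanged."}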

Usage Guide

Evaluating Qwen Models (Linguistic)

Single Evaluation

python program_synth.py \
    --programs curated_transformations.jsonl \
    --model Qwen3-1.7B-Base \
    --lmshuffle \
    --per-trial-random \
    --in-context-examples 16 \
    --trials-per-program 8 \
    --bit-length 8 \
    --seed 42 \
    --output qwen3_1.7b_16shot.json

Important Flags

  • --lmshuffle: Randomizes which digits represent 0/1/separator (prevents memorization)
  • --per-trial-random: Resamples the encoding for every trial (more rigorous)
  • --in-context-examples: Number of demonstrations (k-shot)
  • --trials-per-program: How many random prompts to test per program
  • --bit-length: Length of each bitstring
  • --seed: Random seed for reproducibility

Without Encoding Randomization (Not Recommended)

# Standard encoding: 0="0", 1="1", sep="\n", arrow="->"
python program_synth.py \
    --programs curated_transformations.jsonl \
    --model Qwen3-0.6B-Base \
    --in-context-examples 8 \
    --output qwen3_0.6b_standard.json

Evaluating Evo2 Models (Genomic)

Requirements: H100 GPU, CUDA 12.1+, Evo2 properly installed

# --model accepts evo2_1b_base, evo2_7b, or evo2_40b
python program_synth_evo_smartbatch.py \
    --programs curated_transformations.jsonl \
    --model evo2_7b \
    --in-context-examples 16 \
    --trials-per-program 8 \
    --output evo2_7b_16shot.json

Smart Batching: The script automatically handles GPU memory:

  • Starts with full batch size
  • Halves batch size on OOM errors
  • Caches working batch size for future runs
  • Falls back to batch_size=1 if needed
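
A minimal sketch of the halve-on-OOM strategy (illustrative; the script's actual implementation may differ):

import torch

# Illustrative halve-on-OOM batching loop (not the script's exact code).
def run_with_backoff(batch, run_fn, batch_size):
    while batch_size >= 1:
        try:
            for i in range(0, len(batch), batch_size):
                run_fn(batch[i:i + batch_size])
            return batch_size  # a working size that can be cached for future runs
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            batch_size //= 2  # halve and retry
    raise RuntimeError("Out of memory even at batch_size=1")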

Running Baselines

Mode Baseline

Predicts the most common output from in-context examples:

python program_synth.py \
    --programs curated_transformations.jsonl \
    --backend naive \
    --naive-baseline modal \
    --in-context-examples 16 \
    --trials-per-program 8 \
    --output baseline_mode_16shot.json
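
Conceptually, the modal baseline does something like the following (a sketch, not the repo's implementation):

from collections import Counter

# Sketch of the modal baseline: predict the most frequent output
# among the in-context demonstrations.
def modal_prediction(demonstrations):
    outputs = [out for _, out in demonstrations]
    return Counter(outputs).most_common(1)[0][0]

demos = [("1113", "3111"), ("1311", "1131"), ("3331", "1333")]
print(modal_prediction(demos))  # all outputs unique here, so the first one wins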

Identity Baseline

Simply returns the input unchanged:

python program_synth.py \
    --programs curated_transformations.jsonl \
    --backend naive \
    --naive-baseline identity \
    --in-context-examples 16 \
    --output baseline_identity_16shot.json

Full Sweeps

Sweep All Qwen Models (1-128 shots)

python sweep_program_synth_totalshuffle.py --per-trial-random

This evaluates:

  • Models: Qwen3-0.6B, 1.7B, 4B, 8B, 14B
  • Shot counts: 1, 2, 4, 8, 16, 32, 64, 128
  • Saves results in: sweep_results/
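
To run a comparable sweep by hand, a loop along these lines would work (illustrative; it simply shells out to program_synth.py with the flags described above):

import subprocess

# Illustrative manual sweep over models and shot counts
# (the provided sweep script handles this for you).
models = ["Qwen3-0.6B-Base", "Qwen3-1.7B-Base", "Qwen3-4B-Base",
          "Qwen3-8B-Base", "Qwen3-14B-Base"]
for model in models:
    for k in [1, 2, 4, 8, 16, 32, 64, 128]:
        subprocess.run([
            "python", "program_synth.py",
            "--programs", "curated_transformations.jsonl",
            "--model", model,
            "--lmshuffle", "--per-trial-random",
            "--in-context-examples", str(k),
            "--output", f"sweep_results/{model}_ic{k}.json",
        ], check=True)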

Sweep Baselines

python sweep_program_synth_naive.py --per-trial-random

Sweep Evo2 Models

python sweep_program_synth_evo_smartbatch.py

Evaluates Evo2-1B, 7B, 40B across all shot counts.

Warning: Full sweeps take significant time and compute (typically 1-2 H100 GPU-hours).


Understanding Results

Result JSON Structure

Each evaluation produces a JSON file:

{
  "config": {
    "model": "Qwen3-4B-Base",
    "in_context_examples": 16,
    "trials_per_program": 8,
    "backend": "mlx",
    "lmshuffle": true,
    "per_trial_random": true,
    "bit_length": 8,
    "seed": 42
  },
  "overall": {
    "mean_accuracy": 0.1840,
    "stderr": 0.0269,
    "mean_edit_distance": 4.1234,
    "edit_distance_stderr": 0.1456
  },
  "tasks": [
    {
      "index": 0,
      "program": ["identity"],
      "description": "Return the input unchanged.",
      "total_trials": 8,
      "correct": 1,
      "accuracy": 0.125,
      "average_edit_distance": 6.875,
      "trials": [...]
    },
    ...
  ]
}

Key Metrics

  • mean_accuracy: Proportion of exactly correct predictions (0-1)
  • stderr: Standard error across tasks (for significance testing)
  • edit_distance: Levenshtein distance between prediction and expected output
  • per-task accuracy: Success rate for each of the 100 programs
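
For reference, the edit-distance metric is the standard Levenshtein distance, computable with a textbook dynamic program (a sketch; the repo may use a library implementation):

# Sketch of Levenshtein edit distance via dynamic programming.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("00111111", "00011100"))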

Analyzing Results

# Overall accuracy
cat sweep_results_final/Qwen3-4B-Base_ic16.json | jq '.overall.mean_accuracy'

# Top 10 easiest tasks
cat sweep_results_final/Qwen3-4B-Base_ic16.json | jq -r '.tasks | sort_by(.accuracy) | reverse | .[0:10] | .[] | "\(.accuracy)\t\(.description)"'

# Compare two models
python -c "
import json
with open('sweep_results_final/Qwen3-4B-Base_ic16.json') as f:
    qwen = json.load(f)
with open('sweep_results_final/evo2_7b_ic16.json') as f:
    evo = json.load(f)
print(f'Qwen3-4B: {qwen[\"overall\"][\"mean_accuracy\"]:.4f}')
print(f'Evo2-7B:  {evo[\"overall\"][\"mean_accuracy\"]:.4f}')
"

Reproducing Paper Results

1. Reproduce Main Results (Figure 2)

# Run full sweeps (this takes time!)
python sweep_program_synth_totalshuffle.py --per-trial-random
python sweep_program_synth_evo_smartbatch.py
python sweep_program_synth_naive.py --per-trial-random

# Generate figures
python analyze_tasks_final.py
python generate_graphs_final.py

# Figures saved to: graphs_final/

2. Reproduce Statistical Tests

# Compute bootstrap confidence intervals
python compute_bootstrap_final.py

# Run hypothesis tests
python hypothesis_test_final.py

# View results
cat hypothesis_results_final.txt

3. View Pre-Computed Results

All paper results are included in sweep_results_final/:

# List all result files
ls sweep_results_final/

# Example: View Evo2-40B performance at 128 shots
cat sweep_results_final/evo2_40b_ic128.json | jq '.overall'

Notes on Reproducibility

  1. Exact replication: Due to stochastic generation, results may vary slightly from those reported in the paper
  2. Mode baseline: The GitHub version includes fewer trials (due to file-size limits), so p-values may differ slightly
  3. Random seeds: Set --seed 42 for consistency
  4. Evo2 requirements: An H100 GPU is required; other GPUs may run out of memory or produce different results


Acknowledgments

We thank the Arc Institute for releasing Evo2 and the Qwen team for open-sourcing their models. This work was supported by Johns Hopkins University.
