Genomic Next-Token Predictors are In-Context Learners

Official code repository for the paper "Genomic Next-Token Predictors are In-Context Learners".

Authors: Nathan Breslow, Aayush Mishra, Mahler Revsine, Michael C. Schatz, Anqi Liu, Daniel Khashabi


Overview

This repository demonstrates that in-context learning (ICL) is not unique to language models. We show that genomic foundation models trained on DNA sequences exhibit the same log-linear ICL scaling trends as language models, suggesting that ICL emerges from large-scale next-token prediction over pattern-rich data rather than from language-specific properties.

What's In-Context Learning (ICL)?

ICL is the ability of a model to infer and apply patterns from examples provided within its input prompt, without any parameter updates. For example, given:

Input: 10100000 → Output: 01011111
Input: 11100011 → Output: 00011100
Input: 11000000 → Output: ?

A model with ICL can infer the pattern (bitwise NOT) and correctly predict 00111111.
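
The same check can be scripted in a few lines of Python (illustrative only, not part of the evaluation code):

# Illustrative check of the bitwise-NOT pattern above.
def bitwise_not(bits: str) -> str:
    return "".join("1" if b == "0" else "0" for b in bits)

assert bitwise_not("10100000") == "01011111"
assert bitwise_not("11100011") == "00011100"
print(bitwise_not("11000000"))  # -> 00111111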

Our Contribution

We created 100 symbolic bitstring transformation tasks that can be encoded in both:

  • Genomic sequences (A/T/C/G nucleotides) for Evo2 models
  • Linguistic sequences (random digits) for Qwen3 language models

This enables direct "apples-to-apples" comparison of ICL across modalities.
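
A minimal sketch of this dual encoding (illustrative; in the actual scripts, symbol assignments are randomized per trial):

import random

# Illustrative dual encoding of a bitstring (not the repo's exact code);
# two symbols from the alphabet are drawn at random to stand for 0 and 1.
def encode(bits: str, alphabet: str) -> str:
    zero, one = random.sample(alphabet, 2)
    return bits.translate(str.maketrans({"0": zero, "1": one}))

bits = "10100000"
print(encode(bits, "ATCG"))        # genomic encoding for Evo2
print(encode(bits, "0123456789"))  # linguistic encoding for Qwen3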


Key Findings

  1. ICL exists in genomic models: Evo2 shows log-linear accuracy improvements with more demonstrations (1→128 shots)
  2. Comparable performance: Evo2-40B matches/exceeds Qwen3-14B on many tasks
  3. Different strengths:
    • Qwen3 excels at: global operations (parity, majority), shift operations
    • Evo2 excels at: full-bitstring transformations (identity, NOT, reverse)
  4. Modality-agnostic ICL: Supports hypothesis that ICL emerges from predictive compression, not language structure

Installation

Prerequisites

  • Python 3.8+ (3.12 recommended; 3.13 may not work)
  • For Qwen models: macOS (Apple Silicon) or Linux with CUDA
  • For Evo2 models: NVIDIA H100 GPU with CUDA 12.1+

Basic Setup (Qwen/Linguistic Models)

# Install requirements
pip install -r requirements.txt

Evo2 Setup (Genomic Models)

Evo2 requires special CUDA configuration. Follow the Evo2 installation guide.

# After setting up Evo2's environment
pip install evo2  # Follow Arc Institute's instructions

Quick Start

Note that the evaluation scripts expect all Qwen base models to be available in the local working directory. To fetch from the Hugging Face Hub instead, change the model flag to Qwen/Qwen3-4B-Base.

1. Evaluate a Single Model

# Test Qwen3-4B with 8 in-context examples
python program_synth.py \
    --programs curated_transformations.jsonl \
    --model Qwen3-4B-Base \
    --lmshuffle \
    --per-trial-random \
    --in-context-examples 8 \
    --output qwen3_4b_8shot.json

2. View Results

# Results are saved as JSON
cat qwen3_4b_8shot.json | jq '.overall'

Output:

{
  "mean_accuracy": 0.1775,
  "stderr": 0.024821331246500152,
  "mean_edit_distance": 2.48625,
  "edit_distance_stderr": 0.103458120522365
}

Experimental Framework

Task Design: Bitstring Program Synthesis

Each task requires the model to infer a transformation from few-shot examples:

Linguistic Encoding (Qwen):
1113 3111 6        # Example 1: input → output, separator "6"
1311 1131 6        # Example 2
3331 1333 6        # Example 3
1133 ?             # Query: model must predict output

Genomic Encoding (Evo2):
AAAT TAAA G        # Same examples, encoded as nucleotides
ATAA AATA G
TTTA ATTT G
AATT ?

Key Design Choices:

  • Random encoding prevents memorization (different symbols per trial)
  • 8-bit strings (256 possible inputs)
  • Exact match evaluation
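
Putting these pieces together, prompt construction for a single trial looks roughly like the sketch below (illustrative; the scripts' exact formatting may differ):

# Illustrative few-shot prompt construction in the format shown above.
def build_prompt(demos, query, sep):
    # One line per demonstration: "<input> <output> <sep>"
    lines = [f"{inp} {out} {sep}" for inp, out in demos]
    # The query line is left incomplete; the model must complete the output
    lines.append(query)
    return "\n".join(lines)

demos = [("AAAT", "TAAA"), ("ATAA", "AATA"), ("TTTA", "ATTT")]
print(build_prompt(demos, "AATT", "G"))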

Programs are defined in curated_transformations.jsonl:
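
For illustration only (field names inferred from the result JSON shown under Understanding Results; the actual schema may differ), an entry could look like:

{"program": ["identity"], "description": "Return the input unchanged."}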

Usage Guide

Evaluating Qwen Models (Linguistic)

Single Evaluation

python program_synth.py \
    --programs curated_transformations.jsonl \
    --model Qwen3-1.7B-Base \
    --lmshuffle \
    --per-trial-random \
    --in-context-examples 16 \
    --trials-per-program 8 \
    --bit-length 8 \
    --seed 42 \
    --output qwen3_1.7b_16shot.json

Important Flags

  • --lmshuffle: Randomizes which digits represent 0/1/separator (prevents memorization)
  • --per-trial-random: Resamples the encoding for every trial (more rigorous)
  • --in-context-examples: Number of demonstrations (k-shot)
  • --trials-per-program: How many random prompts to test per program
  • --bit-length: Length of each bitstring
  • --seed: Random seed for reproducibility

Without Encoding Randomization (Not Recommended)

# Standard encoding: 0="0", 1="1", sep="\n", arrow="->"
python program_synth.py \
    --programs curated_transformations.jsonl \
    --model Qwen3-0.6B-Base \
    --in-context-examples 8 \
    --output qwen3_0.6b_standard.json

Evaluating Evo2 Models (Genomic)

Requirements: H100 GPU, CUDA 12.1+, Evo2 properly installed

# --model accepts evo2_1b_base, evo2_7b, or evo2_40b
python program_synth_evo_smartbatch.py \
    --programs curated_transformations.jsonl \
    --model evo2_7b \
    --in-context-examples 16 \
    --trials-per-program 8 \
    --output evo2_7b_16shot.json

Smart Batching: The script automatically handles GPU memory:

  • Starts with full batch size
  • Halves batch size on OOM errors
  • Caches working batch size for future runs
  • Falls back to batch_size=1 if needed
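
A minimal sketch of the halve-on-OOM strategy (illustrative; the script's actual implementation may differ):

import torch

# Illustrative halve-on-OOM batching loop (not the script's exact code).
def run_with_backoff(batch, run_fn, batch_size):
    while batch_size >= 1:
        try:
            for i in range(0, len(batch), batch_size):
                run_fn(batch[i:i + batch_size])
            return batch_size  # a working size that can be cached for future runs
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            batch_size //= 2  # halve and retry
    raise RuntimeError("Out of memory even at batch_size=1")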

Running Baselines

Mode Baseline

Predicts the most common output from in-context examples:

python program_synth.py \
    --programs curated_transformations.jsonl \
    --backend naive \
    --naive-baseline modal \
    --in-context-examples 16 \
    --trials-per-program 8 \
    --output baseline_mode_16shot.json
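
Conceptually, the modal baseline does something like the following (a sketch, not the repo's implementation):

from collections import Counter

# Sketch of the modal baseline: predict the most frequent output
# among the in-context demonstrations.
def modal_prediction(demonstrations):
    outputs = [out for _, out in demonstrations]
    return Counter(outputs).most_common(1)[0][0]

demos = [("1113", "3111"), ("1311", "1131"), ("3331", "1333")]
print(modal_prediction(demos))  # all outputs unique here, so the first one wins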

Identity Baseline

Simply returns the input unchanged:

python program_synth.py \
    --programs curated_transformations.jsonl \
    --backend naive \
    --naive-baseline identity \
    --in-context-examples 16 \
    --output baseline_identity_16shot.json

Full Sweeps

Sweep All Qwen Models (1-128 shots)

python sweep_program_synth_totalshuffle.py --per-trial-random

This evaluates:

  • Models: Qwen3-0.6B, 1.7B, 4B, 8B, 14B
  • Shot counts: 1, 2, 4, 8, 16, 32, 64, 128
  • Saves results in: sweep_results/
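
To run a comparable sweep by hand, a loop along these lines would work (illustrative; it simply shells out to program_synth.py with the flags described above):

import subprocess

# Illustrative manual sweep over models and shot counts
# (the provided sweep script handles this for you).
models = ["Qwen3-0.6B-Base", "Qwen3-1.7B-Base", "Qwen3-4B-Base",
          "Qwen3-8B-Base", "Qwen3-14B-Base"]
for model in models:
    for k in [1, 2, 4, 8, 16, 32, 64, 128]:
        subprocess.run([
            "python", "program_synth.py",
            "--programs", "curated_transformations.jsonl",
            "--model", model,
            "--lmshuffle", "--per-trial-random",
            "--in-context-examples", str(k),
            "--output", f"sweep_results/{model}_ic{k}.json",
        ], check=True)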

Sweep Baselines

python sweep_program_synth_naive.py --per-trial-random

Sweep Evo2 Models

python sweep_program_synth_evo_smartbatch.py

Evaluates Evo2-1B, 7B, 40B across all shot counts.

Warning: Full sweeps take significant time and compute (typically 1-2 H100 GPU-hours).


Understanding Results

Result JSON Structure

Each evaluation produces a JSON file:

{
  "config": {
    "model": "Qwen3-4B-Base",
    "in_context_examples": 16,
    "trials_per_program": 8,
    "backend": "mlx",
    "lmshuffle": true,
    "per_trial_random": true,
    "bit_length": 8,
    "seed": 42
  },
  "overall": {
    "mean_accuracy": 0.1840,
    "stderr": 0.0269,
    "mean_edit_distance": 4.1234,
    "edit_distance_stderr": 0.1456
  },
  "tasks": [
    {
      "index": 0,
      "program": ["identity"],
      "description": "Return the input unchanged.",
      "total_trials": 8,
      "correct": 1,
      "accuracy": 0.125,
      "average_edit_distance": 6.875,
      "trials": [...]
    },
    ...
  ]
}

Key Metrics

  • mean_accuracy: Proportion of exactly correct predictions (0-1)
  • stderr: Standard error across tasks (for significance testing)
  • edit_distance: Levenshtein distance between prediction and expected output
  • per-task accuracy: Success rate for each of the 100 programs
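
For reference, the edit-distance metric is the standard Levenshtein distance, computable with a textbook dynamic program (a sketch; the repo may use a library implementation):

# Sketch of Levenshtein edit distance via dynamic programming.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("00111111", "00011100"))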

Analyzing Results

# Overall accuracy
cat sweep_results_final/Qwen3-4B-Base_ic16.json | jq '.overall.mean_accuracy'

# Top 10 easiest tasks
cat sweep_results_final/Qwen3-4B-Base_ic16.json | jq -r '.tasks | sort_by(.accuracy) | reverse | .[0:10] | .[] | "\(.accuracy)\t\(.description)"'

# Compare two models
python -c "
import json
with open('sweep_results_final/Qwen3-4B-Base_ic16.json') as f:
    qwen = json.load(f)
with open('sweep_results_final/evo2_7b_ic16.json') as f:
    evo = json.load(f)
print(f'Qwen3-4B: {qwen[\"overall\"][\"mean_accuracy\"]:.4f}')
print(f'Evo2-7B:  {evo[\"overall\"][\"mean_accuracy\"]:.4f}')
"

Reproducing Paper Results

1. Reproduce Main Results (Figure 2)

# Run full sweeps (this takes time!)
python sweep_program_synth_totalshuffle.py --per-trial-random
python sweep_program_synth_evo_smartbatch.py
python sweep_program_synth_naive.py --per-trial-random

# Generate figures
python analyze_tasks_final.py
python generate_graphs_final.py

# Figures saved to: graphs_final/

2. Reproduce Statistical Tests

# Compute bootstrap confidence intervals
python compute_bootstrap_final.py

# Run hypothesis tests
python hypothesis_test_final.py

# View results
cat hypothesis_results_final.txt

3. View Pre-Computed Results

All paper results are included in sweep_results_final/:

# List all result files
ls sweep_results_final/

# Example: View Evo2-40B performance at 128 shots
cat sweep_results_final/evo2_40b_ic128.json | jq '.overall'

Notes on Reproducibility

  1. Exact replication: Due to stochastic generation, results may vary slightly from those reported in the paper
  2. Mode baseline: The GitHub version includes fewer trials (due to file-size limits), so p-values may differ slightly
  3. Random seeds: Set --seed 42 for consistency
  4. Evo2 requirements: An H100 GPU is required; other GPUs may run out of memory or produce different results


Acknowledgments

We thank the Arc Institute for releasing Evo2 and the Qwen team for open-sourcing their models. This work was supported by Johns Hopkins University.
