TL;DR — DNA foundation models (Evo2, GENERator) can be steered to output pathogenic viral sequences via jailbreak attacks. We build a multi-stage post-generation guardrail that combines BLAST homology search, PathoLM classification, and activation-space linear probes to flag dangerous outputs. The probing results show that these models internally represent pathogenicity even when their outputs comply with a jailbreak request.
GeneBreaker (Zhang et al., 2025) demonstrated that Evo2 can be jailbroken to produce sequences with >90% nucleotide identity to SARS-CoV-2, HIV-1, and other select agents. The attack uses homologous non-pathogenic sequences as few-shot prompts combined with PathoLM-guided beam search. Fewer than 3% of ~370 biological AI models ship with any safety mechanism. This repository implements defenses to close that gap.
The guardrail is a three-stage cascade applied to any generated DNA sequence. Each stage catches a different class of threat; a sequence must clear all three to be released.
```mermaid
flowchart LR
    A["Generated DNA"] --> B["BLAST"]
    B -- "≥80% identity" --> F["FLAGGED"]
    B -- "No hit" --> C["PathoLM"]
    C -- "Pathogen" --> F
    C -- "Non-pathogen" --> D["Activation Probe"]
    D -- "P(unsafe) > t" --> F
    D -- "P(unsafe) ≤ t" --> E["RELEASED"]
    style F fill:#d32f2f,color:#fff
    style E fill:#388e3c,color:#fff
```
Stage 1 (BLAST) queries the generated sequence against the NCBI nt database, restricted to the taxon IDs of known dangerous organisms (e.g. Yersinia pestis, Ebola virus, Bacillus anthracis). Any hit at ≥80% nucleotide identity is flagged.
- Implementation: src/blast/checkBlast.py
- Strength: Gold-standard for known pathogens.
- Limitation: Fails on novel or sufficiently divergent sequences (<80% identity).
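The ≥80% decision rule can be sketched as a filter over BLAST tabular output (`-outfmt 6`), whose third column is percent identity. This is an illustrative stand-in, not the code in src/blast/checkBlast.py; the helper name `flag_blast_hits` is hypothetical.

```python
IDENTITY_THRESHOLD = 80.0  # percent nucleotide identity

def flag_blast_hits(outfmt6_text: str, threshold: float = IDENTITY_THRESHOLD) -> bool:
    """Return True if any BLAST hit meets the identity threshold."""
    for line in outfmt6_text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        pident = float(fields[2])  # column 3 of -outfmt 6 is % identity
        if pident >= threshold:
            return True
    return False

# Example: one hit against a hypothetical accession at 97.5% identity -> flagged
hit = "query1\tNC_003143.1\t97.5\t640\t16\t0\t1\t640\t1024\t1663\t0.0\t1100"
print(flag_blast_hits(hit))  # True
```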
Stage 2 runs the sequence through PathoLM, a DNABERT-based pathogen language model fine-tuned for binary pathogen/non-pathogen classification on 6-mer-tokenised DNA.
- Implementation: src/pipeline/patho.py
- Strength: Captures statistical patterns beyond string identity.
- Limitation: Black-box; adversarial inputs that fool PathoLM can bypass this stage.
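DNABERT-style models consume DNA as overlapping k-mer tokens rather than single bases. A minimal sketch of that 6-mer tokenisation step (the `to_kmers` helper is hypothetical; the actual preprocessing lives in src/pipeline/patho.py):

```python
def to_kmers(seq: str, k: int = 6) -> list:
    """Split a DNA sequence into overlapping k-mer tokens (DNABERT convention)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(to_kmers("ATGCGTAC"))  # ['ATGCGT', 'TGCGTA', 'GCGTAC']
```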
Stage 3 extracts internal hidden-state representations from a frozen DNA foundation model and applies a trained linear or MLP classifier. This is the repository's novel contribution; it is detailed in the Probing Analysis section below.
- Implementation (GENERator probes): scripts/02_generator_probes.py
- Implementation (Evo2 probes): evo2_probe/extract_embeddings.py, evo2_probe/train_probe.py
```python
# src/pipeline/pipeline.py (simplified)
result = blast_check(sequence)    # Stage 1: BLAST homology search
if result["flagged"]:
    block(sequence)

result = patho_check(sequence)    # Stage 2: PathoLM classification
if result:
    block(sequence)

# Stage 3: probe_screen(sequence)  # activation probe
```

See src/pipeline/pipeline.py for the full implementation.
Pathogenic and benign CDS sequences are curated from JailbreakDNABench, a benchmark of viral coding sequences used to evaluate jailbreak attacks on DNA models. Data curation proceeds in two phases:
| Phase | Script | Source | Output |
|---|---|---|---|
| 0a | scripts/00_curate_data.py | CSV files (patho/, nopatho/ per family) | 84 sequences (POC) |
| 0b | scripts/00b_curate_genbank.py | GenBank .gb files across all families | Extended dataset (~1000+ sequences) |
All sequences are filtered to ≥200 nt length with <5% ambiguous bases, normalised to 640 nt, and left-padded to a multiple of 6 (a GENERator tokeniser requirement).
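The curation rules above can be condensed into a single filter function. This is a hypothetical sketch, not the code in the curation scripts; in particular, the padding base `"N"` is an assumption and may differ from the real implementation.

```python
MAX_LEN = 640
MIN_LEN = 200
MAX_AMBIGUOUS = 0.05  # reject sequences with >=5% non-ACGT bases

def curate(seq, pad_char="N"):
    """Apply the curation filters; return the normalised sequence or None if rejected.
    NOTE: pad_char is an assumption for illustration only."""
    seq = seq.upper()
    if len(seq) < MIN_LEN:
        return None
    ambiguous = sum(base not in "ACGT" for base in seq)
    if ambiguous / len(seq) >= MAX_AMBIGUOUS:
        return None
    seq = seq[:MAX_LEN]            # normalise to 640 nt
    if len(seq) % 6:               # left-pad to a multiple of 6 for the 6-mer tokeniser
        seq = pad_char * (6 - len(seq) % 6) + seq
    return seq
```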
Baselines:
| Method | Script | Technique |
|---|---|---|
| K-mer frequency | scripts/01_kmer_baseline.py | 5-mer cosine similarity to per-family pathogen profiles |
| K-mer classifier | src/kmer/kmerClassifier.py | 6-mer bag-of-words + logistic regression |
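Both k-mer baselines reduce to comparing normalised k-mer frequency vectors. A self-contained sketch of the cosine-similarity variant (helper names are hypothetical; see scripts/01_kmer_baseline.py for the real implementation):

```python
from collections import Counter
import math

def kmer_profile(seq: str, k: int = 5) -> dict:
    """Normalised k-mer frequency vector for a DNA sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1
    return {kmer: c / total for kmer, c in counts.items()}

def cosine(p: dict, q: dict) -> float:
    """Cosine similarity between two sparse k-mer profiles."""
    dot = sum(p[kmer] * q[kmer] for kmer in p.keys() & q.keys())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Score a query against a (toy) per-family pathogen profile
query = kmer_profile("ATGCGTACGTTAGC" * 10)
profile = kmer_profile("ATGCGTACGTTAGC" * 12)
print(round(cosine(query, profile), 3))  # near-identical composition -> ~1.0
```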
```
├── src/
│   ├── pipeline/pipeline.py       # Orchestrates BLAST → PathoLM → Probe
│   ├── blast/checkBlast.py        # NCBI BLAST against dangerous taxa
│   └── kmer/kmerClassifier.py     # K-mer frequency classifier
├── scripts/
│   ├── 00_curate_data.py          # Phase 0a: CSV data curation
│   ├── 00b_curate_genbank.py      # Phase 0b: GenBank data curation
│   ├── 01_kmer_baseline.py        # K-mer cosine similarity baseline
│   └── 02_generator_probes.py     # GENERator activation probes
├── evo2_probe/
│   ├── build_dataset.py           # BRCA1 safe/unsafe dataset builder
│   ├── extract_embeddings.py      # Evo2 frozen embedding extraction
│   ├── pca_embeddings.py          # PCA dimensionality reduction
│   ├── train_probe.py             # Linear / MLP probe training
│   ├── probe.py                   # Single-sequence inference
│   ├── extract_layers.sh          # Sweep all layers (1B / 7B / 20B)
│   └── probe_layers.sh            # Train probes per layer
├── JailbreakDNABench/             # Pathogen CDS benchmark sequences
├── data/processed/                # Cached arrays, k-mer profiles
└── configs/                       # HF training & distributed configs
```
DNA foundation models trained via next-token prediction on genomic data implicitly encode pathogenicity as a linearly separable direction in their activation space — despite never being supervised on pathogen labels.
If true, a simple logistic regression on a model's hidden states can detect pathogenic sequences, and the layer at which this signal peaks reveals where in the network pathogenicity is represented.
We freeze the model weights and extract hidden-state vectors at multiple transformer layers. Each 640 nt input sequence produces one embedding vector per layer (mean-pooled across token positions). A linear classifier is then trained on these vectors to predict pathogen vs benign.
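The per-layer probe (StandardScaler → PCA at 95% variance → logistic regression, scored with 5-fold stratified cross-validation) can be sketched with scikit-learn. Here, synthetic vectors stand in for mean-pooled hidden states, and a planted linear "pathogenicity" direction plays the role of the signal the probe is hypothesised to find; the real training code is scripts/02_generator_probes.py and evo2_probe/train_probe.py.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for mean-pooled hidden states (hidden dim is 4096 in
# Evo2 7B; 256 here keeps the example fast), with a planted linear direction
# separating "pathogenic" (y = 1) from "benign" (y = 0).
n, d = 200, 256
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + 2.5 * y[:, None] * direction

probe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),          # keep 95% of the variance
    LogisticRegression(max_iter=1000),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(probe, X, y, cv=cv, scoring="roc_auc")
print(f"5-fold AUROC: {scores.mean():.3f}")
```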
```mermaid
flowchart TB
    subgraph Input
        SEQ["DNA Sequence — 640 nt"]
    end
    subgraph Model["Frozen Evo2 7B — StripedHyena, 32 layers, hidden dim 4096"]
        direction TB
        TOK["Tokeniser"] --> EMB["Embedding"]
        EMB --> B0["blocks.0"]
        B0 --> D1["..."]
        D1 --> B7["blocks.7"]
        B7 --> D2["..."]
        D2 --> B15["blocks.15"]
        B15 --> D3["..."]
        D3 --> B23["blocks.23"]
        B23 --> D4["..."]
        D4 --> B31["blocks.31"]
        B31 --> HEAD["Prediction Head"]
    end
    subgraph Probing["Probe Classifier — per layer"]
        direction TB
        POOL["Mean-Pool"] --> SCALE["StandardScaler"]
        SCALE --> PCA["PCA 95%"]
        PCA --> LR["LogReg or MLP"]
        LR --> PRED["P(pathogen)"]
    end
    SEQ --> TOK
    B7 -. "4096-dim hidden state" .-> POOL
    B15 -. "4096-dim hidden state" .-> POOL
    B23 -. "4096-dim hidden state" .-> POOL
    B31 -. "4096-dim hidden state" .-> POOL
    style B7 fill:#1565c0,color:#fff
    style B15 fill:#1565c0,color:#fff
    style B23 fill:#1565c0,color:#fff
    style B31 fill:#1565c0,color:#fff
    style PRED fill:#f57f17,color:#000
```
| Model | Architecture | Layers | Hidden Dim | Probe Script |
|---|---|---|---|---|
| GENERator-v2-eukaryote-1.2b | LLaMA decoder, 6-mer tokeniser | 26 (probed at layers 6, 13, 19, 26) | 2048 | scripts/02_generator_probes.py |
| Evo2 7B | StripedHyena (Hyena + MHA) | 32 (all layers swept) | 4096 | evo2_probe/extract_embeddings.py |
| Evo2 1B Base | StripedHyena | 25 (all layers swept) | — | Same as above |
- Metric: AUROC, AUPRC, F1 via 5-fold stratified cross-validation
- Baseline: 5-mer cosine similarity (see scripts/01_kmer_baseline.py)
- Dashboard: Interactive layer-by-layer results at the Evo2 Probing Dashboard
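The three evaluation metrics can be computed directly with scikit-learn. A toy example with hand-picked probe scores (the arrays are illustrative, not project data):

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])                # benign = 0, pathogen = 1
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.9, 0.3])   # probe scores

print(roc_auc_score(y_true, y_prob))            # AUROC (ranking quality)
print(average_precision_score(y_true, y_prob))  # AUPRC
print(f1_score(y_true, y_prob >= 0.5))          # F1 at threshold 0.5
```

On this perfectly ranked toy data all three metrics are 1.0; the repository reports them per layer via 5-fold stratified cross-validation.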
```mermaid
flowchart LR
    subgraph Internal["Internal State"]
        direction TB
        H["Hidden state at blocks.15"]
        H --> LP["Linear Probe"]
        LP --> SCORE["P(pathogen) = 0.92"]
    end
    subgraph External["External Behavior"]
        direction TB
        GEN["Autoregressive generation"]
        GEN --> SEQ["Outputs pathogenic sequence"]
    end
    INPUT["Jailbreak Prompt"] --> Internal
    INPUT --> External
    GAP["Representation vs Behavior Gap"]
    SCORE -. "Knows it is pathogenic" .-> GAP
    SEQ -. "Outputs it anyway" .-> GAP
    style SCORE fill:#d32f2f,color:#fff
    style SEQ fill:#d32f2f,color:#fff
    style GAP fill:#ff8f00,color:#000
```
The probing results reveal a fundamental asymmetry in DNA foundation models:
- **Internal representation:** A linear probe on intermediate hidden states classifies pathogenic sequences with high AUROC: the model has learned a representation that separates pathogenic from benign sequences as a byproduct of next-token prediction.
- **External behavior:** The same model, when given a jailbreak prompt, generates the pathogenic sequence anyway. The autoregressive generation objective does not condition on the pathogenicity information available in its own activations.
- **Implication:** Pathogenicity is encoded as a direction in activation space that the generation head ignores. This is analogous to findings in LLM safety research where models represent "this is harmful" internally yet still produce harmful outputs. The probe reads what the model already knows, turning implicit knowledge into an explicit guardrail.
This means activation probes are not just a classification trick — they are direct evidence that the model possesses safety-relevant information that its default behavior fails to use. A lightweight probe (single matrix multiply) can extract this signal in real time as a post-generation filter.
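At inference time the probe really is one dot product plus a sigmoid. A numpy sketch (the weights here are random stand-ins for a trained probe, and `probe_score` is a hypothetical name; single-sequence inference lives in evo2_probe/probe.py):

```python
import numpy as np

def probe_score(hidden_state: np.ndarray, w: np.ndarray, b: float) -> float:
    """P(unsafe) for one mean-pooled hidden state: a dot product and a sigmoid."""
    return float(1.0 / (1.0 + np.exp(-(hidden_state @ w + b))))

# Random stand-ins for trained probe weights (hidden dim is 4096 in Evo2 7B)
rng = np.random.default_rng(0)
w = rng.normal(size=4096) * 0.01
b = 0.0
h = rng.normal(size=4096)       # one mean-pooled hidden state
score = probe_score(h, w, b)
print(f"P(unsafe) = {score:.3f}")
```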
```bash
# 1. Environment setup
conda env create -f environment.yml   # or: pip install -r requirements.txt

# 2. Curate dataset from JailbreakDNABench
python scripts/00_curate_data.py
python scripts/00b_curate_genbank.py

# 3. Run k-mer baseline
python scripts/01_kmer_baseline.py

# 4. Train GENERator activation probes
python scripts/02_generator_probes.py

# 5. (Evo2) Extract embeddings across all layers and train probes
cd evo2_probe
./extract_layers.sh 7b
./probe_layers.sh 7b

# 6. Run full pipeline on a sequence
python src/pipeline/pipeline.py
```