arnavbathla/cell-state-engineering

Cell State Compiler

A deterministic, auditable compiler platform for cell-state engineering. Scientists define cell-state transitions, configure biological constraints, and run the compiler to receive ranked candidate intervention designs — each scored by real Evo 2 genome foundation model inference.

Research platform only. All outputs are model-derived research signals. Not biological validation. Not for clinical use. Not for pathogen, toxin, or gain-of-function research.


The Problem

Engineering cell states is one of the most important and difficult problems in modern medicine. The ability to reprogram a cell — say, converting an exhausted T cell back into a functional memory-like state, or pushing a fibroblast toward a cardiomyocyte — would unlock treatments for cancer, autoimmune disease, aging, and tissue regeneration.

But today the process is largely artisanal:

  • Researchers manually search literature to identify candidate transcription factors or CRISPR targets
  • Interventions are designed by intuition and prior knowledge
  • Screening is expensive: a single experiment can take weeks and cost tens of thousands of dollars
  • There is no principled way to rank candidates before committing to wet lab work
  • Failed candidates leave no systematic trail — knowledge is lost between labs and experiments

The core bottleneck is the translation gap between a target cell state (what we want) and a ranked set of concrete molecular interventions (what to actually try). That gap is currently filled by expert intuition — brilliant, but slow, unscalable, and hard to audit.


The Solution

Cell State Compiler treats cell-state engineering as a compilation problem.

Just as a software compiler translates high-level source code into optimized machine instructions, this platform takes a high-level biological specification — starting state, target state, constraints — and compiles it into ranked molecular intervention candidates, each scored against the genome itself using a foundation model.

The workflow:

[Starting Cell State]  →  [Compiler]  →  [Ranked Candidates]
[Target Cell State]                       [Evo 2 Scores]
[Constraint Set]                          [Assay Plan]
                                          [Audit Trail]

Each compile job runs a deterministic nine-step pipeline:

  1. Text screening — safety gate blocks disallowed research domains before any computation
  2. State encoding — starting cell state encoded as a 384-dimensional vector (marker profile + pathway scores + state labels)
  3. Candidate generation — systematic enumeration across six intervention modalities (TF payload, CRISPRa, CRISPRi, RNA payload, regulatory context, small molecule context)
  4. Genome context build — each candidate is grounded to a real DNA context sequence
  5. Evo 2 scoring — genome foundation model scores sequence plausibility and produces a dense embedding for each candidate
  6. Trajectory prediction — deterministic model predicts the state transition path from start to target
  7. Risk assessment — flags hard constraint violations, biosecurity concerns, and high-uncertainty candidates
  8. Safety filtering — rejects any candidate that fails any gate; never silently degrades
  9. Ranking — candidates sorted by weighted composite score; assay plan and full report generated

The result is a ranked, explainable, auditable list of intervention candidates grounded in genome-level sequence plausibility — not just literature association.
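
Steps 8 and 9 can be illustrated with a small sketch: a candidate that fails any gate is removed rather than ranked lower, and only the survivors are sorted. The `Candidate` class, its field names, and the example candidate names are hypothetical and are not taken from the repository code.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    # Hypothetical record; the platform's real payload model is not shown here.
    name: str
    final_score: float
    gate_failures: list[str] = field(default_factory=list)  # empty = all gates passed


def filter_and_rank(candidates: list[Candidate]) -> tuple[list[Candidate], list[Candidate]]:
    """Reject any candidate with a gate failure, then sort survivors by score."""
    accepted = [c for c in candidates if not c.gate_failures]
    rejected = [c for c in candidates if c.gate_failures]
    accepted.sort(key=lambda c: c.final_score, reverse=True)
    return accepted, rejected


candidates = [
    Candidate("crispra_TCF7", 0.71),
    Candidate("tf_payload_A", 0.93, ["hard_constraint_violation"]),
    Candidate("crispri_TOX", 0.78),
]
accepted, rejected = filter_and_rank(candidates)
print([c.name for c in accepted])  # ['crispri_TOX', 'crispra_TCF7']
print([c.name for c in rejected])  # ['tf_payload_A']
```

Note that the highest-scoring candidate is the one that gets rejected: a gate failure is never traded off against score.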


Why Foundation Models Change Everything

Classical computational biology approaches to cell state analysis rely on curated gene regulatory networks, transcription factor binding databases, and pathway enrichment scores. These methods are powerful but limited: they can only reason about what has already been measured and annotated.

Genome foundation models — large neural networks trained on billions of base pairs of DNA — represent a qualitative shift. By learning the statistical structure of genomic sequences at scale, they develop internal representations that capture:

  • Sequence plausibility — how likely a given DNA sequence is under the distribution of real genomic sequences
  • Functional context — which sequence features associate with gene expression, chromatin accessibility, and regulatory activity
  • Variant sensitivity — how a single nucleotide change alters the model's assessment of a locus
  • Transferable embeddings — dense vector representations that can be used for downstream prediction tasks with relatively little labeled data

The analogy to large language models is direct. Just as GPT-scale models learn the statistical structure of language and generalize to new tasks, genome foundation models learn the statistical structure of DNA and generalize to regulatory genomics problems they were never explicitly trained on.

Evo 2

Evo 2 is a genome foundation model developed by the Arc Institute, trained on a large corpus of prokaryotic and eukaryotic sequences at single-nucleotide resolution. Its largest checkpoint has 40 billion parameters, which makes it one of the largest publicly available genome models to date.

Key capabilities used in this platform:

| Operation | What it computes | How it's used |
| --- | --- | --- |
| Sequence scoring | Mean log-likelihood of a DNA sequence under the model | Measures how "native" a candidate context sequence looks to the genome: high plausibility means the genome can produce this; low plausibility means an unusual sequence that may not function as intended |
| Embedding | Dense vector representation of a sequence from an intermediate layer | Used to compute embedding feature scores and stored for future retrieval/comparison |
| Variant effect | Delta log-likelihood between a reference and alternate sequence | Quantifies the effect of a proposed edit relative to the reference context |

These scores enter the ranking formula as explicit weighted terms — Evo 2 is a first-class input to ranking, not an annotation added afterward.
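
To make the variant effect operation concrete, here is a minimal sketch of a delta log-likelihood computed with a pluggable scoring function. The toy scorer below is a stand-in that treats bases as independent; it is not Evo 2, and the sign convention (alternate minus reference) is an assumption, since the service's convention is not stated here.

```python
import math
from typing import Callable


def variant_effect(score: Callable[[str], float], ref: str, alt: str) -> float:
    """Delta mean log-likelihood, assumed here to be score(alt) - score(ref)."""
    return score(alt) - score(ref)


# Toy stand-in scorer (NOT Evo 2): independent bases with fixed probabilities.
TOY_BASE_PROBS = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def toy_mean_log_likelihood(seq: str) -> float:
    """Mean per-base log-probability under the toy model."""
    return sum(math.log(TOY_BASE_PROBS[base]) for base in seq) / len(seq)


ref = "ACGTACGT"
alt = "ACGTAAGT"  # single-nucleotide change C -> A at position 6 (1-based)
delta = variant_effect(toy_mean_log_likelihood, ref, alt)
print(delta > 0)  # True: the toy model assigns A a higher probability than C
```

Under this convention a positive delta means the model finds the edited sequence more plausible than the reference, and a negative delta means less plausible.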

Ranking weights

target_state_similarity      0.26
identity_preservation        0.18
safety                       0.22
manufacturability            0.10
evo2_sequence_plausibility   0.12   ← Evo 2 score
evo2_context_confidence      0.08   ← derived from Evo 2 uncertainty
evo2_embedding_feature_score 0.04   ← from Evo 2 embedding
uncertainty_penalty         -0.15   ← Evo 2 uncertainty penalizes rank

The goal: candidates that look plausible to the genome itself rank higher than candidates that are merely mechanistically appealing on paper.
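
The weights above can be applied as a plain weighted sum. The sketch below assumes every component is already normalized to the range 0 to 1 and that the uncertainty term is passed as a non-negative value; the function name and the input dictionary shape are illustrative, not the repository's `compute_final_score()`.

```python
WEIGHTS = {
    "target_state_similarity": 0.26,
    "identity_preservation": 0.18,
    "safety": 0.22,
    "manufacturability": 0.10,
    "evo2_sequence_plausibility": 0.12,
    "evo2_context_confidence": 0.08,
    "evo2_embedding_feature_score": 0.04,
    "uncertainty_penalty": -0.15,  # negative weight: uncertainty lowers the score
}


def composite_score(components: dict[str, float]) -> float:
    """Weighted sum over all ranking terms; every term must be supplied."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


perfect = {name: 1.0 for name in WEIGHTS}
perfect["uncertainty_penalty"] = 0.0
uncertain = dict(perfect, uncertainty_penalty=1.0)

print(round(composite_score(perfect), 2))    # 1.0
print(round(composite_score(uncertain), 2))  # 0.85
```

The seven positive weights sum to 1.00, so a candidate with perfect components and zero uncertainty scores 1.0, and maximal uncertainty subtracts at most 0.15.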

CPU fallback (development mode)

When no CUDA GPU is available, the genome model service automatically falls back to a CPU composition scorer: a 4-mer background frequency model built from human genome composition. This is genuine bioinformatics computation (the same family of k-mer background models that motif tools such as FIMO and HOMER rely on for sequence background scoring), clearly labeled provider: cpu_composition in all outputs. It is not a mock and it is not silent: every result tells you which scoring method was used.

On GPU hardware with Evo 2 installed, all scoring switches automatically to real neural network inference.
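
For readers unfamiliar with k-mer composition scoring, the idea can be sketched as follows: estimate 4-mer frequencies from a reference, then score a sequence by the mean log-probability of its 4-mers. This is an illustration only. The toy reference string stands in for the human genome frequency table, and the function names and the pseudocount are assumptions rather than the service's actual implementation.

```python
import math
from collections import Counter
from itertools import product

K = 4


def kmer_counts(seq: str, k: int = K) -> Counter:
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in windows if set(w) <= set("ACGT"))


def background_from_sequence(reference: str, k: int = K,
                             pseudocount: float = 1.0) -> dict[str, float]:
    """Smoothed k-mer frequencies so unseen k-mers keep a nonzero probability."""
    counts = kmer_counts(reference, k)
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values()) + pseudocount * len(kmers)
    return {km: (counts[km] + pseudocount) / total for km in kmers}


def composition_score(seq: str, background: dict[str, float], k: int = K) -> float:
    """Mean log-probability of the sequence's k-mers under the background."""
    counts = kmer_counts(seq, k)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("sequence has no valid k-mers")
    return sum(c * math.log(background[km]) for km, c in counts.items()) / n


# Toy reference (hypothetical); a real table would come from genome-wide counts.
background = background_from_sequence("ACGT" * 50)
typical = composition_score("ACGTACGTACGT", background)
atypical = composition_score("TTTTTTTTTTTT", background)
print(typical > atypical)  # True: the poly-T run is rare under this background
```

A higher (less negative) score means the sequence's composition looks more like the background. Unlike Evo 2, a model of this kind sees only local composition and has no notion of long-range context.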


Technical Architecture

┌─────────────────────────────────────────────────────────────┐
│  Browser                                                    │
│  Next.js 14 App Router · TypeScript · Tailwind              │
│  TanStack Query · Recharts · Radix UI                       │
│  localhost:3000                                             │
└───────────────────────┬─────────────────────────────────────┘
                        │ REST / JSON
┌───────────────────────▼─────────────────────────────────────┐
│  FastAPI (Python 3.11)                      localhost:8000  │
│  JWT auth · SQLAlchemy 2 · Alembic                         │
│  Compiler pipeline · Audit logging                          │
└──────┬──────────────────────────┬────────────────────────────┘
       │ RQ job queue             │ httpx calls
       ▼                          ▼
┌──────────────┐    ┌─────────────────────────────────────────┐
│  RQ Worker   │    │  Genome Model Service       :8100       │
│  (Python)    │    │  FastAPI · Safety gateway               │
│              │    │  ┌─────────────────────────────────┐    │
│              │    │  │ Evo 2 (GPU)   OR  CPU scorer    │    │
│              │    │  │ arcinstitute/evo2_40b            │    │
│              │    │  │ arcinstitute/evo2_7b             │    │
│              │    │  │ arcinstitute/evo2_1b_base        │    │
│              │    │  └─────────────────────────────────┘    │
└──────────────┘    └─────────────────────────────────────────┘
       │
       ▼
┌──────────────┐    ┌──────────────┐
│  PostgreSQL  │    │  Redis       │
│  + pgvector  │    │  (RQ broker) │
│  :5432       │    │  :6379       │
└──────────────┘    └──────────────┘

Services

| Service | Technology | Purpose |
| --- | --- | --- |
| web | Next.js 14, TypeScript, Tailwind | Full UI: projects, states, compile, candidates, reports |
| api | FastAPI, SQLAlchemy 2, pgvector | REST API, auth, compiler pipeline orchestration |
| worker | Python, RQ | Async compile job execution |
| genome-model-service | FastAPI | Evo 2 inference: scoring, embedding, variant effect |
| postgres | PostgreSQL 16 + pgvector | All relational data + 384-dim vector columns |
| redis | Redis 7 | RQ job queue |

Database schema (key tables)

users · organizations · organization_members
projects
  cell_states (vector(384))
  target_states (vector(384))
  constraint_sets
  compile_jobs
    genome_assets
    candidate_payloads
      state_trajectories
      risk_assessments
    evo2_model_runs        ← one record per Evo 2 API call
    assay_plans
    reports
  experiments
  audit_logs

Compiler pipeline (detail)

compile_cell_program(request, db):
    1. screen_text_fields()           # biosecurity gate
    2. check_evo2_health()            # hard fail if service down
    3. encode_cell_state() → vec384   # marker + pathway + label encoding
    4. generate_candidates()          # 5-6 modalities × target objectives
    5. for each candidate:
         build_genome_context()       # ground to real DNA sequence
         store GenomeAsset
         score_candidate_with_evo2()  # → ScoreSequenceResponse
         embed_candidate_with_evo2()  # → EmbedSequenceResponse
         store Evo2ModelRun records
         predict_trajectory()         # deterministic state path
         assess_risk()
         apply_safety_filter()
         compute_final_score()        # weighted formula
    6. rank_candidates()
    7. generate_assay_plan()
    8. generate_report()
    9. write_audit_logs()

Deployment

Prerequisites

  • Docker Desktop 4.x or Docker Engine 24+ with Compose V2
  • 16 GB RAM minimum (CPU-only mode)

GPU requirements for Evo 2:

| Model | VRAM | Notes |
| --- | --- | --- |
| evo2_1b_base | ~4 GB | CPU also works (slow, ~60s/sequence) |
| evo2_7b | ~16 GB | Single A100/H100 40 GB |
| evo2_40b | ~80 GB | Two H100 80 GB, use docker-compose.gpu.yml |

1. Clone and configure

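# Clone into an explicit directory so the next step works regardless of the repository name
git clone <repo> cell-state-compiler
cd cell-state-compiler
cp .env.example .env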

Key .env settings:

# For CPU-only local development:
EVO2_MODEL_NAME=evo2_1b_base
EVO2_DEVICE=cpu

# For GPU with 40B model:
EVO2_MODEL_NAME=evo2_40b
EVO2_DEVICE=cuda:0
HUGGINGFACE_TOKEN=hf_...   # required if repo is gated

2. Install NVIDIA Container Toolkit (GPU only)

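# Ubuntu / Debian
# (apt-key and the nvidia-docker repository are deprecated; this uses the
# libnvidia-container repository with a signed keyring instead)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker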

3. Start all services

CPU (development, any machine):

docker compose up --build

GPU with Evo 2 40B:

docker compose -f docker-compose.yml -f docker-compose.gpu.yml up --build

The docker-compose.gpu.yml override:

  • Uses Dockerfile.gpu (CUDA 12.4 + flash-attn + evo2)
  • Sets EVO2_MODEL_NAME=evo2_40b
  • Allocates all available GPUs to the genome model service
  • Evo 2 weights download automatically from HuggingFace on first start (~80 GB for 40B)

4. Seed demo data

# Wait for the api container to be healthy, then:
docker compose exec api python /scripts/seed_demo_data.py

This creates:

  • Admin user: demo@cellcompiler.local / password123
  • Demo organization, project, cell state, target state, constraint set

5. Verify

# All services healthy
docker compose ps

# API
curl http://localhost:8000/health

# Evo 2 status
curl http://localhost:8100/v1/evo2/health | python3 -m json.tool

# Frontend (macOS; on Linux use xdg-open, or visit the URL in a browser)
open http://localhost:3000

Using the Platform

Endpoints

| URL | Description |
| --- | --- |
| http://localhost:3000 | Web application |
| http://localhost:8000/docs | FastAPI interactive docs |
| http://localhost:8100/docs | Genome model service docs |

End-to-end flow

  1. Log in at http://localhost:3000 with demo credentials or create an account
  2. Create a project — name, cell type, disease context
  3. Define the starting state — marker profile (CD8, PD1, TOX, TCF7, ...), pathway scores, state labels
  4. Define the target state — desired markers, functional objectives
  5. Configure constraints — allowed modalities, forbidden mechanisms, risk thresholds
  6. Run compile — click Compile Cell Program; job runs asynchronously in the worker
  7. Review candidates — ranked table with Evo 2 scores, plausibility, uncertainty, trajectory
  8. Inspect each candidate — Overview, Scores, Evo2 Analysis, Trajectory, Risk, Assay Plan, Audit tabs
  9. Export report — full markdown + JSON compile report

Checking Evo 2 health

curl http://localhost:8100/v1/evo2/health | python3 -m json.tool

GPU with Evo 2 loaded:

{
  "healthy": true,
  "provider": "local",
  "model_name": "evo2_40b",
  "device": "cuda:0",
  "cuda_available": true,
  "model_loaded": true,
  "smoke_test_passed": true
}

CPU-only (4-mer composition scoring):

{
  "healthy": true,
  "provider": "cpu_composition",
  "model_name": "4mer_human_background",
  "device": "cpu",
  "cuda_available": false,
  "model_loaded": true,
  "details": {
    "evo2_available": false,
    "scoring_mode": "cpu_composition"
  }
}
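
A client can use these fields to decide whether scores come from Evo 2 or from the CPU fallback. The sketch below uses only the standard library and the field names shown in the two responses above; the `scoring_mode` labels are made up for illustration, and the health response shape for the nvidia_nim provider is not shown in this README, so it is simply treated as Evo 2 here.

```python
import json
import urllib.request

HEALTH_URL = "http://localhost:8100/v1/evo2/health"


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """GET the health endpoint and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def scoring_mode(health: dict) -> str:
    """Classify a health payload (labels are illustrative, not part of the API)."""
    if not health.get("healthy"):
        return "unavailable"
    if health.get("provider") == "cpu_composition":
        return "cpu_fallback"
    return "evo2"


gpu_payload = {"healthy": True, "provider": "local", "model_name": "evo2_40b"}
cpu_payload = {"healthy": True, "provider": "cpu_composition",
               "model_name": "4mer_human_background"}
print(scoring_mode(gpu_payload))  # evo2
print(scoring_mode(cpu_payload))  # cpu_fallback
```

With the services running, `scoring_mode(fetch_health())` performs the same check against the live endpoint.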

NVIDIA NIM (alternative to local Evo 2)

If you have access to NVIDIA's hosted Evo 2 NIM endpoint:

EVO2_PROVIDER=nvidia_nim
NVIDIA_NIM_API_KEY=your_key
NVIDIA_NIM_EVO2_URL=https://your-nim-endpoint

The NIM adapter calls the remote endpoint for all scoring and embedding operations. If an operation is unsupported by the endpoint it returns 501 — it never fabricates results.


Safety Design

Safety is structural, not advisory:

  • Biosecurity text gate: compile requests are rejected before computation if any text field contains one of the following terms: pathogen, virus, toxin, virulence, immune evasion, gain of function (also written gain-of-function), bioweapon, weapon, or replication competent
  • Sequence safety gateway — the genome model service screens every DNA sequence before it reaches Evo 2; blocked sequences are rejected, never silently passed
  • Candidate safety filter — candidates with a blocked Evo 2 safety status, hard constraint violations, or a biosecurity-flagged risk class are removed from results; they are not scored lower, they are rejected
  • Generation disabled by default — the sequence generation endpoint requires ENABLE_EVO2_GENERATION=true AND a purpose-specific safety gate pass before any generation occurs
  • No mock results — CI has a grep check that fails if MockEvo2 or mock_evo2 appears anywhere in the codebase; there is no path through the system that returns fabricated scores
  • Full audit log — every action (login, project creation, compile job start/complete/fail, candidate view) is written to audit_logs with user ID, timestamp, and entity reference

Disclaimer shown on every Evo 2 result:

Evo 2 scores are model-derived research signals, not biological validation. Experimental confirmation required before any research decision.


Development

Project structure

cell-state-compiler/
  apps/
    api/              FastAPI backend (Python 3.11)
      app/
        compiler/     Nine-step compile pipeline
        models/       SQLAlchemy ORM models
        api/routes/   REST endpoints
        services/     Evo2 client, audit service
        jobs/         RQ task definitions
        migrations/   Alembic migrations
    worker/           RQ worker process
    genome-model-service/
      app/
        adapters/     LocalEvo2Adapter, CpuCompositionScorer, NvidiaEvo2NimAdapter
        services/     Routing and safety application
        safety.py     SequenceSafetyGateway
        model_health.py Health check with fallback logic
    web/              Next.js 14 frontend
      app/            App Router pages
      components/     Shared components
      hooks/          TanStack Query hooks
      lib/            API client, auth utilities
  scripts/
    seed_demo_data.py
    reset_db.py
    check_evo2_runtime.py
  data/demo_sequences/ Safe synthetic DNA for smoke tests
  docker-compose.yml
  docker-compose.gpu.yml

Running tests

# Backend
cd apps/api
pip install -e ".[test]"
pytest app/tests/ -v

# Genome model service
cd apps/genome-model-service
pip install -e ".[test]"
pytest app/tests/ -v

# Frontend type check
cd apps/web
npm run build

CI mock check

# Exit non-zero when a mock is found so the CI job actually fails
if grep -rn "MockEvo2\|mock_evo2" --include="*.py" --include="*.ts" .; then
  echo "FAIL: mocks found"
  exit 1
else
  echo "PASS"
fi

Local development (API, worker, and frontend outside Docker)

# Start infra only
docker compose up postgres redis genome-model-service

# API
cd apps/api
pip install -e .
alembic upgrade head
uvicorn app.main:app --reload --port 8000

# Worker (separate terminal)
cd apps/worker
python worker.py

# Frontend
cd apps/web
npm install
npm run dev

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| EVO2_PROVIDER | local | local or nvidia_nim |
| EVO2_MODEL_NAME | evo2_1b_base | Evo 2 checkpoint: evo2_1b_base, evo2_7b, evo2_40b |
| EVO2_DEVICE | cpu | cpu or cuda:0 |
| EVO2_HEALTH_RUN_SMOKE_TEST | true | Run inference smoke test on health check |
| EVO2_MAX_CONTEXT_LENGTH | 8192 | Max tokens per forward pass |
| HUGGINGFACE_TOKEN | | HuggingFace token for downloading gated models |
| ENABLE_EVO2_GENERATION | false | Enable sequence generation endpoint |
| SEQUENCE_SAFETY_MODE | restricted | Safety screening strictness |
| MAX_SEQUENCE_LENGTH | 8192 | Max DNA sequence length accepted |
| NVIDIA_NIM_API_KEY | | API key for NVIDIA NIM endpoint |
| NVIDIA_NIM_EVO2_URL | | NVIDIA NIM Evo 2 base URL |
| JWT_SECRET | | Secret for JWT signing (change in production) |
| DATABASE_URL | | PostgreSQL connection string |
| REDIS_URL | | Redis connection string |
| GENOME_MODEL_SERVICE_URL | http://genome-model-service:8100 | Internal URL for API → genome service calls |

See .env.example for all variables with comments.
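
As a rough sketch of how a service might read the Evo 2 variables with the documented defaults, the example below loads a handful of them into a typed settings object. The `Evo2Settings` class and `load_settings` function are hypothetical; the repository may use a different settings mechanism.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass

ALLOWED_PROVIDERS = {"local", "nvidia_nim"}


@dataclass(frozen=True)
class Evo2Settings:
    # Hypothetical container covering a subset of the variables in the table.
    provider: str
    model_name: str
    device: str
    max_context_length: int
    enable_generation: bool


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Evo2Settings:
    """Read settings from a mapping, falling back to the documented defaults."""
    provider = env.get("EVO2_PROVIDER", "local")
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"EVO2_PROVIDER must be one of {sorted(ALLOWED_PROVIDERS)}")
    return Evo2Settings(
        provider=provider,
        model_name=env.get("EVO2_MODEL_NAME", "evo2_1b_base"),
        device=env.get("EVO2_DEVICE", "cpu"),
        max_context_length=int(env.get("EVO2_MAX_CONTEXT_LENGTH", "8192")),
        enable_generation=_as_bool(env.get("ENABLE_EVO2_GENERATION", "false")),
    )


defaults = load_settings({})
print(defaults.model_name, defaults.device, defaults.enable_generation)
# evo2_1b_base cpu False
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test, and rejecting an unknown provider at startup surfaces a typo in .env before the first compile job runs.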


Troubleshooting

No module named 'flash_attn_2_cuda'

Evo 2 requires a CUDA GPU and flash-attn. Without GPU hardware, the genome model service automatically falls back to CPU composition scoring. To run neural network inference, use the GPU compose override:

docker compose -f docker-compose.yml -f docker-compose.gpu.yml up --build

Model download is slow or fails

Evo 2 weights download from HuggingFace on first start. Weights are cached in the hf-model-cache Docker volume, so subsequent starts skip the download (the model still has to be loaded into memory). Ensure you have enough disk space:

  • evo2_1b_base: ~2 GB
  • evo2_7b: ~14 GB
  • evo2_40b: ~80 GB

For gated repositories, set HUGGINGFACE_TOKEN in .env.

CUDA out of memory

Switch to a smaller model:

EVO2_MODEL_NAME=evo2_7b   # or evo2_1b_base

Or use device_map="auto" across multiple GPUs (enabled automatically when torch.cuda.device_count() > 1).

Compile job stays queued

The worker container must be running and connected to Redis. Check:

docker compose logs worker
docker compose exec redis redis-cli ping

[object Object] error in UI

Typically a 422 validation error from the API. Open browser DevTools → Network tab to see the raw error response.


Roadmap

  • Retrieval-augmented candidate generation using pgvector similarity search across historical experiments
  • Evo 2 variant effect scoring to rank specific nucleotide edits, not just genomic contexts
  • Prediction vs. observation comparison with quantitative error analysis once wet lab results are uploaded
  • Support for additional foundation models (Nucleotide Transformer, HyenaDNA, DNABERT-2) as alternative or ensemble backends
  • Multi-user organizations with role-based access control
  • Export candidates as structured protocols for lab automation (Opentrons, Hamilton)
