JarvisZhang24/GeneLM-Evo2

🧬 GeneLM-Evo2

Genomic Intelligence Powered by Evo2

A full-stack genomic variant analysis platform leveraging the Evo2 DNA language model for zero-shot pathogenicity prediction

Features • Tech Stack • Quick Start • Architecture • API


✨ Features

🔬 Evo2-Powered Analysis

  • 7B parameter DNA language model
  • Zero-shot variant pathogenicity prediction
  • ~95% AUROC on BRCA1 benchmark
  • Real-time inference on H100 GPUs

🧭 Interactive Gene Browser

  • Browse by chromosome or search by gene
  • Interactive sequence viewer with nucleotide highlighting
  • Click any base to trigger variant analysis
  • Support for 24+ genome assemblies

๐Ÿฅ ClinVar Integration

  • Fetch clinically curated variants
  • Compare Evo2 predictions vs clinical labels
  • One-click analysis for SNVs
  • Confidence scoring with delta-likelihood

⚡ Modern Tech Stack

  • Next.js 15 with Turbopack
  • React 19 + TailwindCSS 4
  • Modal serverless infrastructure
  • UCSC & NCBI API integration

๐Ÿ› ๏ธ Tech Stack

Frontend

| Technology | Version | Purpose |
|---|---|---|
| Next.js | 15 | React framework with App Router |
| React | 19 | UI library |
| TailwindCSS | 4 | Utility-first CSS |
| shadcn/ui | Latest | Component library |
| Framer Motion | 12 | Animations |
| TypeScript | 5.8 | Type safety |

Backend

| Technology | Version | Purpose |
|---|---|---|
| Python | 3.12 | Runtime |
| Modal | Latest | Serverless GPU infrastructure |
| Evo2 | 7B | DNA language model |
| PyTorch | 2.8 | Deep learning framework |
| Flash Attention | 2.8.3 | Efficient attention |
| CUDA | 12.6 | GPU acceleration |

External APIs

  • UCSC Genome Browser API: reference sequence data
  • NCBI ClinVar API: clinical variant annotations
  • NCBI Gene API: gene information and coordinates
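
As a rough illustration of the reference-sequence lookup, the sketch below builds a request URL for the UCSC REST API's `getData/sequence` endpoint. It is not taken from this repository (the frontend does this in TypeScript). The helper names are hypothetical, coordinates are treated as 0-based and half-open, and the response is assumed to carry the bases in a `dna` field, so check both against the UCSC documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

UCSC_API_BASE = "https://api.genome.ucsc.edu"  # public UCSC REST API host


def ucsc_sequence_url(genome: str, chromosome: str, start: int, end: int) -> str:
    """Build a getData/sequence URL for the interval [start, end).

    Assumption: start/end are 0-based, half-open coordinates.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    query = urlencode(
        {"genome": genome, "chrom": chromosome, "start": start, "end": end}
    )
    return f"{UCSC_API_BASE}/getData/sequence?{query}"


def fetch_sequence(genome: str, chromosome: str, start: int, end: int) -> str:
    """Fetch the reference bases (hypothetical helper; performs a network call)."""
    with urlopen(ucsc_sequence_url(genome, chromosome, start, end), timeout=30) as resp:
        payload = json.load(resp)
    # Assumption: the sequence is returned under the "dna" key.
    return payload["dna"].upper()
```

For example, `ucsc_sequence_url("hg38", "chr17", 43119627, 43119628)` addresses the single base at 1-based position 43119628.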

🚀 Quick Start

Prerequisites

  • Node.js 20+ and npm
  • Python 3.12+
  • Modal account (sign up)
  • NVIDIA GPU with CUDA support (for local development, or use Modal's H100s)

Frontend Setup

# Navigate to frontend directory
cd genelm-frontend

# Install dependencies
npm install

# Create environment file
cp .env.example .env.local

# Configure your environment variables
# NEXT_PUBLIC_ANALYZE_SINGLE_VARIANT_BASE_URL=<your-modal-endpoint>

# Start development server
npm run dev

The frontend will be available at http://localhost:3000

Backend Setup

# Navigate to backend directory
cd genelm-backend

# Install Modal CLI
pip install modal

# Authenticate with Modal
modal setup

# Deploy the application
modal deploy main.py

# Or serve a temporary, hot-reloading development version (this still runs on Modal, not on your machine)
modal serve main.py

After deployment, Modal will provide an endpoint URL for the analyze_single_variant API.


๐Ÿ—๏ธ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Frontend (Next.js)                       │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │ Gene Browser│  │  Sequence   │  │   Variant Analysis      │  │
│  │  Component  │  │   Viewer    │  │      Dashboard          │  │
│  └──────┬──────┘  └──────┬──────┘  └───────────┬─────────────┘  │
│         │                │                     │                │
└─────────┼────────────────┼─────────────────────┼────────────────┘
          │                │                     │
          ▼                ▼                     ▼
   ┌──────────────┐ ┌──────────────┐    ┌────────────────┐
   │  NCBI Gene   │ │ UCSC Genome  │    │  Modal Backend │
   │     API      │ │     API      │    │    (H100)      │
   └──────────────┘ └──────────────┘    └───────┬────────┘
                                                │
                                                ▼
                                         ┌──────────────┐
                                         │    Evo2      │
                                         │  (7B Model)  │
                                         └──────────────┘

Data Flow

  1. Gene Selection → User browses/searches genes via NCBI Gene API
  2. Sequence Loading → Genomic sequence fetched from UCSC API
  3. Variant Selection → User clicks nucleotide or selects ClinVar variant
  4. Evo2 Analysis → Request sent to Modal backend with H100 GPU
  5. Prediction → Model scores reference vs variant sequences
  6. Results → Delta-likelihood score + pathogenicity prediction returned
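
Steps 4 to 6 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repository's `main.py`. `variant_pos` is treated as 1-based, the window is centred on the variant using the 8,192 bp context size from the benchmark table, and the delta is taken as variant score minus reference score. `toy_score` is a hypothetical stand-in for the Evo2 model so the example runs without a GPU.

```python
import math
from typing import Callable

WINDOW_SIZE = 8192  # context window listed in the BRCA1 benchmark table


def window_bounds(variant_pos: int, chrom_length: int,
                  window: int = WINDOW_SIZE) -> tuple[int, int]:
    """Return a 0-based, half-open window around a 1-based variant position."""
    idx = variant_pos - 1  # assumption: variant_pos is 1-based
    if not 0 <= idx < chrom_length:
        raise ValueError("variant position outside chromosome")
    start = max(0, idx - window // 2)
    end = min(chrom_length, start + window)
    return start, end


def apply_snv(ref_window: str, offset: int, alt_allele: str) -> str:
    """Substitute a single base at `offset` within the window."""
    if len(alt_allele) != 1 or alt_allele not in "ACGT":
        raise ValueError("alt_allele must be one of A, C, G, T")
    return ref_window[:offset] + alt_allele + ref_window[offset + 1:]


def delta_likelihood(ref_window: str, var_window: str,
                     score_fn: Callable[[str], float]) -> float:
    """Assumed sign convention: variant score minus reference score."""
    return score_fn(var_window) - score_fn(ref_window)


# Hypothetical stand-in for the model: mean log-probability under a fixed
# base composition. The real backend would call Evo2 here instead.
_TOY_PROBS = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def toy_score(seq: str) -> float:
    return sum(math.log(_TOY_PROBS[base]) for base in seq) / len(seq)


ref = "ATGCATTA"
var = apply_snv(ref, 4, "G")  # A -> G at window offset 4
delta = delta_likelihood(ref, var, toy_score)  # negative under the toy scorer
```

Passing the scorer in as a function keeps the windowing and substitution logic testable on a laptop, with the GPU-backed model swapped in only at deployment.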

📡 API Reference

Backend Endpoint

POST /analyze_single_variant

Analyze a single nucleotide variant using Evo2.

Request Body:

{
  "variant_pos": 43119628,
  "alt_allele": "G",
  "genome": "hg38",
  "chromosome": "chr17"
}
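
A minimal client for this request might look like the sketch below. It assumes the endpoint accepts the JSON body shown above. The function names are hypothetical, the endpoint URL is whatever Modal prints after deployment, and the generous default timeout is only there to ride out a cold start.

```python
import json
from urllib import request

VALID_BASES = {"A", "C", "G", "T"}


def build_payload(variant_pos: int, alt_allele: str,
                  genome: str, chromosome: str) -> dict:
    """Validate inputs and assemble the request body for a single SNV."""
    alt = alt_allele.upper()
    if alt not in VALID_BASES:
        raise ValueError(f"alt_allele must be one of {sorted(VALID_BASES)}")
    if variant_pos < 1:
        raise ValueError("variant_pos must be a positive integer")
    return {
        "variant_pos": variant_pos,
        "alt_allele": alt,
        "genome": genome,
        "chromosome": chromosome,
    }


def post_variant(endpoint_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload as JSON and decode the JSON response (network call)."""
    req = request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


payload = build_payload(43119628, "g", "hg38", "chr17")
```

Validating the allele client-side avoids spending a GPU request on input the backend would reject anyway.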

Response:

{
  "position": 43119628,
  "reference": "A",
  "variant": "G",
  "delta_score": -0.00234,
  "prediction": "Likely pathogenic",
  "confidence": 0.87
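}

| Field | Type | Description |
|---|---|---|
| position | int | Genomic position |
| reference | str | Reference allele |
| variant | str | Alternative allele |
| delta_score | float | Log-likelihood difference (variant - reference); negative values mean the variant lowers sequence likelihood |
| prediction | str | "Likely pathogenic" or "Likely benign" |
| confidence | float | Confidence score (0-1) |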
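
On the consuming side, the fields above map naturally onto a small typed record. The following is a sketch only. `VariantResult` and `parse_result` are hypothetical names, and the checks simply mirror the documented field types and value ranges.

```python
from dataclasses import dataclass

ALLOWED_PREDICTIONS = {"Likely pathogenic", "Likely benign"}


@dataclass(frozen=True)
class VariantResult:
    position: int
    reference: str
    variant: str
    delta_score: float
    prediction: str
    confidence: float


def parse_result(raw: dict) -> VariantResult:
    """Convert a decoded JSON response into a validated VariantResult."""
    result = VariantResult(
        position=int(raw["position"]),
        reference=str(raw["reference"]),
        variant=str(raw["variant"]),
        delta_score=float(raw["delta_score"]),
        prediction=str(raw["prediction"]),
        confidence=float(raw["confidence"]),
    )
    if result.prediction not in ALLOWED_PREDICTIONS:
        raise ValueError(f"unexpected prediction: {result.prediction!r}")
    if not 0.0 <= result.confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return result


# The sample response from this section.
sample = parse_result({
    "position": 43119628,
    "reference": "A",
    "variant": "G",
    "delta_score": -0.00234,
    "prediction": "Likely pathogenic",
    "confidence": 0.87,
})
```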

📊 Performance

BRCA1 Benchmark

| Metric | Value |
|---|---|
| AUROC | ~95% |
| Model | Evo2 7B |
| Context Window | 8,192 bp |
| Variants Tested | 500 SNVs |
| Classification | LOF vs FUNC/INT |

Classification is threshold-based: each variant's delta score is compared with a cutoff chosen to maximize Youden's J statistic (sensitivity + specificity - 1) on the BRCA1 saturation mutagenesis dataset.
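
To make the threshold selection concrete, here is a minimal pure-Python sketch of Youden's J optimization. It assumes lower delta scores indicate loss of function (LOF), and the six scores and labels are invented toy values, not BRCA1 data. The repository's actual cutoff and fitting code are not reproduced here.

```python
def youden_threshold(scores: list[float], is_lof: list[bool]) -> tuple[float, float]:
    """Return (threshold, J) maximizing J = TPR - FPR.

    A variant is called LOF when its score is <= threshold
    (assumption: more negative delta scores are more damaging).
    """
    positives = sum(is_lof)
    negatives = len(is_lof) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one LOF and one non-LOF variant")

    best_threshold, best_j = scores[0], -1.0
    for candidate in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, is_lof) if y and s <= candidate)
        false_pos = sum(1 for s, y in zip(scores, is_lof) if not y and s <= candidate)
        j = true_pos / positives - false_pos / negatives
        if j > best_j:  # keep the first (lowest) threshold reaching the maximum
            best_threshold, best_j = candidate, j
    return best_threshold, best_j


# Toy, perfectly separable data for illustration only.
toy_scores = [-0.004, -0.003, -0.002, -0.0005, 0.0001, 0.0003]
toy_labels = [True, True, True, False, False, False]
threshold, j_stat = youden_threshold(toy_scores, toy_labels)
```

The scan over every observed score is quadratic, which is fine for a few hundred variants. A sorted single pass would be the natural optimization for larger sets.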

Inference Performance

| Configuration | Latency |
|---|---|
| Modal H100 (cold start) | ~30s |
| Modal H100 (warm) | ~2-5s |
| Batch scoring (100 variants) | ~60s |

๐Ÿ“ Project Structure

GeneLM-Evo2/
├── genelm-frontend/          # Next.js frontend application
│   ├── src/
│   │   ├── app/              # App router pages
│   │   ├── components/       # React components
│   │   │   ├── gene-sequence.tsx
│   │   │   ├── known-variants.tsx
│   │   │   └── ui/           # shadcn/ui components
│   │   └── utils/            # API utilities
│   │       ├── variants-api.ts
│   │       ├── genome-api.ts
│   │       └── genes-api.ts
│   ├── package.json
│   └── tailwind.config.ts
│
├── genelm-backend/           # Modal serverless backend
│   ├── main.py               # Evo2 model & API endpoints
│   └── requirements.txt
│
└── README.md

🔮 Roadmap

  • Batch variant analysis
  • VCF file upload support
  • Additional gene benchmarks (TP53, BRCA2)
  • Variant effect visualization
  • Export results to PDF/CSV
  • Multi-model comparison (ESM, Nucleotide Transformer)

๐Ÿ™ Acknowledgments


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Built with 🧬 by Jarvis Zhang
