Executive Summary: Scientific Foundation Models → LLM Inference

Date: December 8, 2025

Research Question: How do scientific foundation models (like ESM3 for genomics) connect to LLM-based inference capabilities?

Application: Brain-genomics-LLM integration for developmental disorders research in AI-CoScientist


The Critical Discovery: Three Proven Pathways

1. BioReason Architecture (NeurIPS 2025) - RECOMMENDED FOR AI-COSCIENTIST

What it is: First successful integration of a DNA foundation model with a large language model

How it works:

  • DNA foundation model encodes genetic sequences → embeddings
  • Cross-modal connector layer bridges DNA embeddings to LLM input space
  • LLM (Qwen3) performs natural language reasoning over genetic information
  • Training: Supervised fine-tuning + targeted reinforcement learning
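The connector step above can be sketched in a few lines. This is a minimal illustration, assuming the connector is a single linear projection and that projected DNA embeddings are simply prepended to the text token embeddings; the function names, toy dimensions, and random untrained weights are all hypothetical and are not taken from the BioReason code. A real implementation would use a deep learning framework and train the projection.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Randomly initialised weights and zero bias for a linear projection (untrained)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def project(vector, weights, bias):
    """Apply y = W x + b to one embedding vector."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

def build_llm_inputs(dna_embeddings, text_embeddings, weights, bias):
    """Project each DNA embedding to the LLM width and prepend it to the text embeddings."""
    projected = [project(e, weights, bias) for e in dna_embeddings]
    return projected + text_embeddings

# Toy sizes (assumption); real encoders and LLMs use hundreds to thousands of dimensions.
DNA_DIM, LLM_DIM = 8, 16
weights, bias = make_linear(DNA_DIM, LLM_DIM)
dna = [[0.1 * i] * DNA_DIM for i in range(3)]   # 3 stand-in DNA token embeddings
text = [[0.0] * LLM_DIM for _ in range(5)]      # 5 stand-in text token embeddings
sequence = build_llm_inputs(dna, text, weights, bias)
print(len(sequence), len(sequence[0]))
```

The point of the sketch is the shape contract: whatever the DNA encoder emits, the LLM only ever sees vectors of its own embedding width.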

Performance:

  • Disease pathway prediction: 86% → 98% accuracy (a 12-percentage-point improvement)
  • Variant effect prediction: 15% improvement over baselines
  • Generates interpretable step-by-step biological reasoning

Why it matters: This is the most direct path from "scientific data" to "natural language reasoning" - exactly what you need.

Implementation timeline: 2-3 months with pretrained models

Key institutions: University of Toronto, Vector Institute, Google DeepMind, Cohere

Source: NeurIPS 2025 Poster | arXiv


2. COMICAL Architecture (Oxford Academic 2025)

What it is: Contrastive learning framework for brain imaging + genomics integration

How it works:

  • Brain imaging encoder (transformer) processes structural MRI features (154 IDPs)
  • Genomic encoder (transformer) processes genetic variants (SNPs)
  • CLIP-style contrastive learning aligns both in unified embedding space
  • Enables cross-modal retrieval: genetics ↔ brain features
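Once both encoders share an embedding space, cross-modal retrieval reduces to a nearest-neighbour search. The sketch below assumes cosine similarity as the ranking score; the embedding values and feature names are made-up placeholders, not COMICAL outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, candidates, top_k=3):
    """Return the top_k (name, similarity) pairs ranked by cosine similarity."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Hypothetical embeddings: one SNP query against three imaging-derived phenotypes.
snp_embedding = [1.0, 0.0, 0.2]
idp_embeddings = {
    "hippocampal_volume": [0.9, 0.1, 0.1],
    "cortical_thickness": [0.0, 1.0, 0.0],
    "ventricle_volume": [-1.0, 0.0, 0.0],
}
results = retrieve(snp_embedding, idp_embeddings)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Swapping the roles of query and candidates gives the reverse direction (brain feature → genetics) with no code changes.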

Performance:

  • Trained on 15.4 million IDP-SNP pairs from UK Biobank (40,426 subjects)
  • Discovered multiple novel genetic-brain associations
  • Predicts across diseases and unseen clinical outcomes

Why it matters: Directly addresses brain-genomics integration for developmental disorders

Implementation timeline: 4-6 months (requires large-scale paired data)

Key data source: UK Biobank (40K+ subjects with imaging + genetics)

Source: Oxford Academic | medRxiv


3. Med-Gemini Architecture (Google Research 2025)

What it is: Unified multimodal foundation model for all biomedical data including genomics

How it works:

  • Single transformer processes imaging, genomics, clinical notes, and text
  • Genomic data represented as "images" (polygenic risk scores projected to 2D)
  • Fine-tuned from Gemini 1.5 (1 million token context)
  • Natural language report generation across all modalities
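To make the "genomics as images" idea concrete, here is one way a vector of polygenic risk scores could be turned into a small grayscale grid. The layout (min-max scaling to 0..255, row-major fill, zero padding) is an assumption chosen for illustration; the source does not describe how Med-Gemini arranges the pixels, and `prs_to_grid` is a hypothetical helper.

```python
import math

def prs_to_grid(scores, fill=0):
    """Min-max scale scores to 0..255 and lay them out row by row in a near-square grid."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    pixels = [round(255 * (s - lo) / span) for s in scores]
    side = math.ceil(math.sqrt(len(pixels)))
    pixels += [fill] * (side * side - len(pixels))  # pad the last row
    return [pixels[r * side:(r + 1) * side] for r in range(side)]

# Placeholder scores, not real polygenic risk values.
grid = prs_to_grid([-1.2, 0.0, 0.4, 2.3, 1.1])
for row in grid:
    print(row)
```

The resulting grid can be fed to any image encoder, which is what lets a multimodal LLM consume genomic summaries without a dedicated genomic tokenizer.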

Performance:

  • Med-Gemini-Polygenic outperforms standard polygenic risk score approaches
  • Generalizes to genetically correlated diseases never seen in training
  • MedQA benchmark: 4.6 percentage points above Med-PaLM 2
  • 3D CT report generation: 53% clinically acceptable

Why it matters: Shows genomic information CAN be processed by general LLMs when properly represented

Implementation timeline: 6-12 months (requires massive compute)

Key innovation: Represents genomics as visual data for multimodal LLM

Source: Google Research Blog | arXiv


Technical Answer to Your Question

How ESM3 Enables Inference on Genetic Information

ESM3 Architecture (Evolutionary Scale, Science 2025):

  • Multimodal by design: Three separate tracks for sequence, structure, and FUNCTION
  • Function track accepts function keywords describing what the protein should do
  • Iterative sampling: Generative masked language model fills in missing information
  • Geometric attention: Combines sequence and 3D structure understanding

Key capability: You can prompt ESM3 with natural language function descriptions, and it will generate proteins with those properties. This is bidirectional reasoning:

  • Natural language → protein (generation)
  • Protein → natural language (interpretation)

Example workflow:

Input: "Design a protein that binds to amyloid-beta plaques in Alzheimer's"
ESM3 → Generates novel protein sequence with that function
Confidence: pTM > 0.8, pLDDT > 0.8
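The confidence line in the workflow above implies a filtering step after generation. A minimal sketch of that filter follows, assuming each generated candidate comes with pTM and pLDDT scores on a 0..1 scale; the `Candidate` class and the sequences are placeholders and do not reflect the ESM3 API.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    sequence: str
    ptm: float    # predicted TM-score, 0..1
    plddt: float  # predicted lDDT, on the 0..1 scale used in the thresholds above

def passes_thresholds(candidate, min_ptm=0.8, min_plddt=0.8):
    """Keep only designs whose predicted structure confidence clears both thresholds."""
    return candidate.ptm > min_ptm and candidate.plddt > min_plddt

# Placeholder candidates standing in for generated designs.
candidates = [
    Candidate("PLACEHOLDER_1", ptm=0.85, plddt=0.91),
    Candidate("PLACEHOLDER_2", ptm=0.62, plddt=0.88),
    Candidate("PLACEHOLDER_3", ptm=0.83, plddt=0.74),
]
kept = [c for c in candidates if passes_thresholds(c)]
print([c.sequence for c in kept])
```

Passing these thresholds only indicates a confident structure prediction; whether a design actually binds its target still has to be tested experimentally.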

Achievement: Generated esmGFP, a novel fluorescent protein with only 58% sequence identity to known fluorescent proteins, a distance the authors estimate as equivalent to simulating 500 million years of evolution.

Source: Evolutionary Scale


The Missing Link: Cross-Modal Contrastive Learning

What Makes These Systems Work

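All three approaches (BioReason, COMICAL, Med-Gemini) map scientific data into a representation that a language model or text encoder can work with. COMICAL does this with an explicit contrastive objective, while BioReason uses a learned connector layer and Med-Gemini re-encodes genomic data as images. The contrastive recipe has four parts: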

1. Separate Encoders for Each Modality:

  • Scientific data encoder (DNA, protein, brain imaging) - often a pretrained foundation model
  • Text encoder (natural language) - usually BERT, GPT-style LLM, or custom

2. Projection to Shared Embedding Space:

  • Both encoders output vectors in the same dimensional space (e.g., 512D)
  • Learned projection layers ensure semantic alignment

3. Contrastive Learning Objective:

  • Positive pairs (same entity, different modalities) should have high similarity
  • Negative pairs (different entities) should have low similarity
  • Training minimizes this contrastive loss

4. Cross-Modal Reasoning:

  • Once aligned, can retrieve across modalities
  • Can generate natural language descriptions of scientific data
  • Can retrieve scientific data from natural language queries

Mathematical Form (CLIP-style):

L = -log(exp(sim(sci_i, text_i) / τ) / Σ_j exp(sim(sci_i, text_j) / τ))

Where:

  • sci_i = scientific data embedding (DNA, brain image, etc.)
  • text_i = corresponding text embedding
  • sim() = cosine similarity
  • τ = temperature parameter
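The formula above gives the loss for a single example i in the science → text direction. A dependency-free sketch of the full batch loss follows; it averages that term over the batch and adds the mirrored text → science term, as CLIP does. The default τ = 0.07 and the function names are assumptions for illustration, and a real training loop would compute this on tensors in a deep learning framework.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def clip_loss(sci, text, tau=0.07):
    """Symmetric contrastive loss: mean of sci->text and text->sci cross-entropy."""
    n = len(sci)
    # sims[i][j] = sim(sci_i, text_j) / tau
    sims = [[cosine(s, t) / tau for t in text] for s in sci]

    def row_loss(logits, target):
        # -log softmax(logits)[target], computed with the log-sum-exp trick
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_sum - logits[target]

    sci_to_text = sum(row_loss(sims[i], i) for i in range(n)) / n
    text_to_sci = sum(row_loss([sims[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (sci_to_text + text_to_sci)

aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(clip_loss(aligned, aligned))   # matched pairs: loss near zero
print(clip_loss(aligned, swapped))   # mismatched pairs: large loss
```

A smaller τ sharpens the softmax, so the same similarity gap between positive and negative pairs produces a larger gradient signal.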

2025 State of the Field: Key Findings

Foundation Models Are Everywhere in Science

Nature Machine Intelligence & Nature Methods (2025):

  1. LucaOne - Unified nucleic acid + protein foundation model (169,861 species)
  2. EpiAgent - Single-cell epigenomics (5M cells, 35B tokens)
  3. META-SiM - Single-molecule behavior discovery
  4. Nicheformer - Spatial single-cell analysis (SpatialCorpus-110M)
  5. GenomeOcean - DNA language model (4B params, #1 on Hugging Face)

Common Pattern: All use transformer architectures with domain-specific tokenization


Reasoning Is the New Frontier

NeurIPS 2025:

  • 766 papers with reasoning as core focus
  • Multiple workshops dedicated to reasoning in LLMs
  • Key question: "Can LLMs generate rigorously testable hypotheses across physics, chemistry, and biology?"

Breakthrough: BioReason shows that combining domain foundation models + LLMs + reinforcement learning achieves 98% accuracy on disease pathway prediction

Key insight: Pretraining alone is insufficient - need explicit reasoning training


Multimodal Is Essential for Science

ICLR 2025 Workshops:

  • MLGenX: AI for genomics and target identification
  • Foundation Models in the Wild: Multi-step scientific reasoning
  • SCI-FM: Open science for foundation models

Trend: Moving from unimodal → multimodal → unified architectures

Example: ProTrek integrates protein sequence + structure + natural language function descriptions in one model


Specific to Brain-Genomics Integration

Current State (2025)

epiBrainLLM (medRxiv 2024/2025):

  • Leverages genomic LLM to map genotypes → brain measures → clinical phenotypes
  • Focuses on Alzheimer's disease
  • Uses epigenomic data to understand causal pathways

Key findings from related research:

  • MIT analyzed 2 million cells from 400+ Alzheimer's brains
  • Identified epigenome erosion and cell identity loss
  • Found specific histone modifications (H3K27ac, H3K9ac) linked to AD
  • Microglial enhancers enriched for AD risk loci

GIANT Atlas:

  • Genetically Informed brAiN aTlas
  • Clusters brain voxels into genetically informed regions
  • Integrates voxel-wise heritability + spatial proximity

Data source: UK Biobank with 40K+ subjects having both brain imaging (154 IDPs) and genetics


Gap Analysis: What's Missing for Developmental Disorders

Current focus: Alzheimer's, psychiatric disorders, general neuroimaging

Your opportunity: Developmental disorders are UNDER-EXPLORED

Advantages:

  1. Less crowded research space
  2. Clear clinical need
  3. Genetic components well-established
  4. Longitudinal data available (developmental trajectories)

Challenges:

  1. Need large-scale paired brain imaging + genomics data
  2. Fewer samples than adult disorders
  3. Developmental trajectories require time-series modeling

Recommended Strategy for AI-CoScientist

Phase 1: Quick Win (Month 1-2) - BioReason Pathway

Goal: DNA-LLM integration for genetic reasoning

Steps:

  1. Deploy GenomeOcean (4B params) or Nucleotide Transformer (500M)
  2. Build cross-modal connector to Qwen3-8B or LLaMA-3-8B
  3. Create reasoning dataset from ClinVar + literature (1K-10K examples)
  4. Fine-tune with supervised learning on reasoning chains
  5. Evaluate on pathway prediction tasks
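Step 3 needs a concrete record format before annotation starts. The sketch below shows one possible layout for a supervised fine-tuning example: a DNA window for the genomic encoder, a text prompt, and a reasoning chain ending in an answer. The field names, prompt wording, and placeholder values are all assumptions for illustration; they are not the BioReason dataset schema or real ClinVar entries.

```python
import json

def make_example(gene, variant, dna_window, reasoning_steps, answer):
    """Build one supervised fine-tuning record: DNA input, text prompt, reasoning target."""
    prompt = (
        f"Gene: {gene}\nVariant: {variant}\n"
        "Explain step by step which biological pathway this variant is likely to affect."
    )
    steps = [f"Step {i}: {text}" for i, text in enumerate(reasoning_steps, start=1)]
    target = "\n".join(steps + [f"Answer: {answer}"])
    return {"dna": dna_window, "prompt": prompt, "target": target}

# Placeholder values only; real records would come from ClinVar plus expert annotation.
example = make_example(
    gene="GENE_X",
    variant="c.100A>G",
    dna_window="ACGTACGTAC",
    reasoning_steps=["<expert-written step 1>", "<expert-written step 2>"],
    answer="<pathway label>",
)
line = json.dumps(example)  # one JSON object per line gives a JSONL training file
print(line)
```

Keeping the answer on its own final line makes accuracy scoring a simple string comparison, independent of how the reasoning steps are worded.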

Expected outcome:

  • Natural language reasoning over developmental disorder genetics
  • 90%+ accuracy on pathway prediction
  • Interpretable step-by-step explanations

Resources needed:

  • 8× A100/H100 GPUs for training (2-3 weeks)
  • ClinVar data (free)
  • 1-2 domain experts for annotation
  • ~$5K compute costs

Phase 2: Scale Up (Month 3-6) - COMICAL Pathway

Goal: Brain-genomics foundation model

Steps:

  1. Obtain UK Biobank access (or equivalent developmental disorder dataset)
  2. Extract brain imaging features (IDPs) from MRI
  3. Align with genetic data (SNPs, polygenic risk scores)
  4. Train COMICAL-style contrastive model
  5. Validate on known genetic-brain associations
  6. Discover novel associations

Expected outcome:

  • Cross-modal retrieval: genetics ↔ brain features
  • Novel developmental disorder associations
  • Predictive modeling of outcomes

Resources needed:

  • UK Biobank access ($$$)
  • 16-32× GPUs for training (1-2 months)
  • Brain imaging expertise
  • ~$20K compute costs
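The Phase 1 and Phase 2 budgets can be sanity-checked by working out the price per GPU-hour they imply. The helper below is plain arithmetic on the figures quoted above, with one assumption: a month is taken as 30 days of continuous use.

```python
def implied_rate(gpus, days, budget_usd):
    """Return (total GPU-hours, implied USD per GPU-hour) for continuous use."""
    gpu_hours = gpus * days * 24
    return gpu_hours, budget_usd / gpu_hours

scenarios = {
    "Phase 1, 8 GPUs x 2 weeks": (8, 14, 5_000),
    "Phase 1, 8 GPUs x 3 weeks": (8, 21, 5_000),
    "Phase 2, 16 GPUs x 1 month": (16, 30, 20_000),
    "Phase 2, 32 GPUs x 2 months": (32, 60, 20_000),
}
for label, (gpus, days, budget) in scenarios.items():
    hours, rate = implied_rate(gpus, days, budget)
    print(f"{label}: {hours} GPU-hours, ${rate:.2f} per GPU-hour")
```

The implied rates run from about $0.43 to about $1.86 per GPU-hour. If the rate actually available is higher than that, either the budget or the training duration needs revisiting before committing to a phase.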

Phase 3: Full Integration (Month 6-12) - Med-Gemini Pathway

Goal: Unified multimodal foundation model

Steps:

  1. Integrate brain imaging, genomics, clinical notes, developmental trajectories
  2. Fine-tune large multimodal model (7B-70B params)
  3. Enable natural language report generation
  4. Add continual learning capabilities
  5. Deploy as autonomous discovery agent

Expected outcome:

  • State-of-the-art developmental disorder analysis
  • Automated hypothesis generation
  • Novel scientific discoveries
  • Published research

Resources needed:

  • 64-256× GPUs
  • Large-scale multimodal dataset
  • Multi-disciplinary team
  • ~$100K+ compute costs

Key Technical Requirements

Data Requirements

Minimum (Proof-of-Concept):

  • Genomics: 10K-100K gene sequences
  • Brain imaging: 1K-10K MRI scans
  • Paired data: 1K-10K subjects with both
  • Reasoning examples: 1K-10K annotated chains

Production Scale:

  • Genomics: 1M-10M sequences (GenomeOcean scale)
  • Brain imaging: 10K-100K scans (UK Biobank scale)
  • Paired data: 10K-100K subjects
  • Reasoning examples: 10K-100K high-quality annotations

AI-CoScientist Current Status:

  • ✅ RAG system with scientific literature
  • ✅ 100+ QA benchmark pairs
  • ⚠️ Need: Brain imaging + genomics paired data
  • ⚠️ Need: Reasoning chain annotations
  • ⚠️ Need: Expert validation pipeline

Computational Requirements

Development (Proof-of-Concept):

  • 4-8× A100 GPUs (80GB each)
  • 500GB-1TB storage
  • Fast interconnect for multi-GPU training

Production:

  • 32-64× H100 GPUs for training
  • 4-8× H100 GPUs for inference
  • 10TB+ storage for datasets
  • vLLM for faster inference serving (the 3× speedup is a planning estimate; actual gains depend on the baseline and workload)

Current DGX Station Capabilities: Check GPU count and memory
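One way to close out this check is to query `nvidia-smi` and compare the result against the requirements above. The sketch below assumes the NVIDIA driver is installed and `nvidia-smi` is on the PATH; it returns an empty list otherwise. The helper names are hypothetical.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"]

def parse_inventory(csv_text):
    """Parse 'name, memory_in_MiB' lines into dicts; skip lines without a numeric memory field."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if "," not in line:
            continue
        name, mem_mib = [part.strip() for part in line.rsplit(",", 1)]
        if not mem_mib.isdigit():
            continue
        gpus.append({"name": name, "memory_gib": int(mem_mib) / 1024})
    return gpus

def query_gpus():
    """Return the local GPU inventory, or an empty list if nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_inventory(result.stdout)

if __name__ == "__main__":
    inventory = query_gpus()
    total = sum(gpu["memory_gib"] for gpu in inventory)
    print(f"{len(inventory)} GPU(s), {total:.0f} GiB total memory")
```

The proof-of-concept target above is 4-8 GPUs with 80 GB each, so the count and per-GPU memory from this check decide whether Phase 1 can run locally.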


Model Sizes and Trade-offs

| Size | Params | Training Time | Inference Speed | Quality | Best For |
|---|---|---|---|---|---|
| Small | 100M-500M | Days | Very Fast | Good | Prototyping |
| Medium | 1B-7B | Weeks | Fast | Very Good | Recommended |
| Large | 7B-70B | Months | Moderate | Excellent | Research |
| Huge | 70B+ | Months+ | Slow | SOTA | Benchmarks |

Recommendation: Start in the Medium tier by pairing a small genomic encoder (Nucleotide Transformer 500M) with a 7B-8B LLM such as Qwen3-8B


Success Metrics

Technical Metrics

Pathway Prediction (BioReason benchmark):

  • Target: >90% accuracy (BioReason achieved 98%)
  • Minimum: >80% to be useful

Cross-Modal Retrieval (COMICAL benchmark):

  • Target: Recall@10 >0.8
  • Minimum: >0.6 to be useful
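For reference, Recall@K can be computed as below. The sketch assumes each query has exactly one correct item, in which case Recall@K is the fraction of queries whose correct item appears in the top K results; the query and item identifiers are placeholders.

```python
def recall_at_k(ranked_ids, relevant_id, k=10):
    """1.0 if the single relevant item is within the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(rankings, relevant, k=10):
    """Average recall@k over all queries; both arguments are keyed by query id."""
    hits = [recall_at_k(rankings[q], relevant[q], k) for q in rankings]
    return sum(hits) / len(hits)

# Placeholder rankings: query id -> candidate ids ordered by similarity.
rankings = {
    "snp_1": ["idp_7", "idp_2", "idp_9"],
    "snp_2": ["idp_4", "idp_1", "idp_3"],
}
relevant = {"snp_1": "idp_2", "snp_2": "idp_8"}
print(mean_recall_at_k(rankings, relevant, k=2))
```

If a query can have several correct items, the per-query score should instead be the share of relevant items found in the top K.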

Reasoning Quality:

  • Target: Faithfulness >0.8 (semantic similarity to expert reasoning)
  • Minimum: >0.6 for usability

Inference Speed:

  • Target: <2 seconds per query (95th percentile)
  • Maximum: <5 seconds
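The latency target is stated at the 95th percentile, so it has to be measured over many queries rather than one. A minimal sketch, assuming the nearest-rank definition of a percentile and a callable `fn` that answers one query (both are illustrative choices):

```python
import time

def p95(samples):
    """Nearest-rank 95th percentile: the smallest value with at least 95% of samples at or below it."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]

def time_queries(fn, queries):
    """Call fn on each query and return the wall-clock latency of each call in seconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        fn(query)
        latencies.append(time.perf_counter() - start)
    return latencies

# Stand-in for the real inference call.
latencies = time_queries(lambda q: sum(range(10_000)), ["q1", "q2", "q3", "q4"])
print(f"p95 latency: {p95(latencies):.6f} s")
```

With fewer than about 20 samples the nearest-rank p95 is just the maximum, so the evaluation set should be large enough for the percentile to be meaningful.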

Scientific Metrics

Novel Discoveries:

  • Target: ≥1 novel genetic-brain association validated by experts
  • Stretch: Published in peer-reviewed journal

Hypothesis Quality:

  • Target: >50% of generated hypotheses rated as "scientifically plausible" by experts
  • Stretch: >70%

Expert Adoption:

  • Target: ≥5 domain experts actively using the system
  • Stretch: Regular citations in scientific literature

Competitive Landscape

Who's Doing This?

Academia and research labs:

  • MIT (2M cell Alzheimer's analysis)
  • University of Toronto (BioReason)
  • Google DeepMind (AlphaFold, Med-Gemini)
  • Evolutionary Scale (ESM3)

Industry:

  • Google Health (Med-PaLM, Med-Gemini)
  • NVIDIA (BioNeMo platform)
  • Anthropic (general scientific reasoning)

Your Edge:

  • Focus on developmental disorders (under-explored)
  • Integration with existing AI-CoScientist infrastructure
  • Multi-agent system for comprehensive analysis
  • Open-source approach with reproducible research

Risk Assessment

Technical Risks

Risk: Out of memory during training
Mitigation: Gradient checkpointing, smaller models, quantization, gradient accumulation

Risk: Poor reasoning quality
Mitigation: More supervised examples, expert annotations, reinforcement learning from human feedback

Risk: Slow inference
Mitigation: vLLM (3× speedup), quantization (4-bit/8-bit), model distillation
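The out-of-memory risk and the quantization mitigations can be sized with back-of-envelope arithmetic. The sketch below uses the common accounting of 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights, momentum, and variance) and ignores activations, so real usage will be higher. The 7B model size is an example input.

```python
def weights_gb(params_billion, bytes_per_param):
    """Memory for the weights alone, in GB (1e9 params x bytes per param / 1e9 bytes per GB)."""
    return params_billion * bytes_per_param

def adam_mixed_precision_gb(params_billion):
    """Weights, gradients, and optimizer state for full fine-tuning, excluding activations."""
    # 2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 master weights, momentum, variance)
    return params_billion * 16

params = 7  # billions of parameters
print(f"fp16 weights:  {weights_gb(params, 2):.1f} GB")
print(f"8-bit weights: {weights_gb(params, 1):.1f} GB")
print(f"4-bit weights: {weights_gb(params, 0.5):.1f} GB")
print(f"full fine-tune state: {adam_mixed_precision_gb(params):.0f} GB")
```

For a 7B model this gives 14 GB of fp16 weights but 112 GB of training state, which is more than a single 80 GB GPU holds. That is why full fine-tuning at this size needs sharding across GPUs or a parameter-efficient method, while quantized inference fits comfortably on one device.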


Data Risks

Risk: Insufficient paired brain-genomics data
Mitigation: Start with public datasets (UK Biobank application), synthetic data generation, transfer learning

Risk: Annotation quality issues
Mitigation: Multiple expert annotators, inter-rater reliability checks, active learning to prioritize difficult cases

Risk: Data privacy concerns
Mitigation: De-identification, federated learning, synthetic data generation
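The inter-rater reliability check mentioned under annotation quality can be made concrete with Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. Kappa is one common choice, used here as an assumption; the labels are placeholders.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Placeholder judgements on four reasoning chains.
annotator_1 = ["valid", "valid", "invalid", "invalid"]
annotator_2 = ["valid", "invalid", "invalid", "invalid"]
print(cohens_kappa(annotator_1, annotator_2))
```

Here the annotators agree on 3 of 4 items (75%), but chance agreement is 50%, so kappa is 0.5. Raw agreement alone would overstate how consistent the annotations are.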


Scientific Risks

Risk: Model generates plausible but incorrect hypotheses
Mitigation: Confidence scoring, uncertainty quantification, expert validation loops, retrieval-augmented generation

Risk: Overfitting to training data
Mitigation: Large diverse datasets, regularization, cross-validation, out-of-distribution testing

Risk: Bias towards well-studied genes/pathways
Mitigation: Balanced datasets, exploration bonuses in RL, diversity-promoting objectives


Next Steps (Week 1)

Monday-Tuesday: Environment Setup

  • Install foundation model libraries (transformers, esm, genslm)
  • Download pretrained checkpoints (Nucleotide Transformer, ESM-2, Qwen3)
  • Test GPU setup and memory capacity
  • Set up experiment tracking (Weights & Biases or MLflow)

Wednesday-Thursday: Data Preparation

  • Download ClinVar developmental disorder variants
  • Extract gene sequences from NCBI
  • Create initial reasoning dataset (100-1K examples)
  • Set up data processing pipeline

Friday: Proof-of-Concept

  • Load pretrained genomic foundation model
  • Load pretrained LLM
  • Build simple cross-modal connector
  • Test end-to-end inference on 1 example
  • Measure baseline performance

Weekend: Evaluation

  • Create evaluation dataset (50-100 examples)
  • Implement evaluation metrics
  • Run baseline evaluation
  • Document results and next steps
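With only 50-100 evaluation examples, a measured accuracy carries a wide margin of error, which matters when comparing against the >80% and >90% targets. The sketch below computes a Wilson score interval for a proportion; the choice of the Wilson interval and the 95% level (z = 1.96) are assumptions, and the counts are illustrative.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score confidence interval for an accuracy of correct/total."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

low, high = wilson_interval(45, 50)  # illustrative: 45 of 50 correct
print(f"observed 90.0%, 95% interval {low:.3f} to {high:.3f}")
```

For 45 correct out of 50, the interval runs from roughly 0.79 to 0.96. A 90% score on 50 examples therefore does not yet rule out true accuracy below the 80% minimum, which argues for the larger end of the 50-100 range or more.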

Conclusion: The Path Forward

The Bottom Line

You asked: "How do scientific foundation models connect to LLM-based inference capabilities?"

Answer: Through cross-modal alignment, using either contrastive learning or a learned connector layer, which places scientific data and natural language in a shared representation where they can interact.

The breakthrough is BioReason (NeurIPS 2025), which shows:

  1. DNA foundation model can encode genetic sequences
  2. Cross-modal connector bridges DNA → LLM input space
  3. LLM performs multi-step reasoning over genetic information
  4. Achieves 98% accuracy on disease pathway prediction

This is directly applicable to your developmental disorders research.


Immediate Action Items

This week:

  1. Deploy GenomeOcean or Nucleotide Transformer
  2. Load Qwen3-8B or LLaMA-3-8B
  3. Build cross-modal connector
  4. Test on 1 developmental disorder gene

This month:

  1. Create reasoning dataset (1K examples)
  2. Fine-tune DNA-LLM model
  3. Evaluate on pathway prediction
  4. Achieve >80% accuracy

This quarter:

  1. Scale to 10K reasoning examples
  2. Achieve >90% accuracy
  3. Integrate with AI-CoScientist RAG
  4. Discover 1 novel genetic association

Expected Outcomes

3 months: Working DNA-LLM system with natural language reasoning over developmental disorder genetics

6 months: Brain-genomics foundation model discovering novel associations

12 months: Autonomous scientific discovery agent publishing novel findings


References

Full research report: /home/juke/git/AI-CoScientist/SCIENTIFIC_FOUNDATION_MODELS_LLM_INFERENCE_RESEARCH_2025.md (12,000 words, 38 sources)

Implementation guide: /home/juke/git/AI-CoScientist/FOUNDATION_MODEL_IMPLEMENTATION_GUIDE.md (Complete code examples and deployment guide)

Key papers:

  1. BioReason (NeurIPS 2025): https://arxiv.org/abs/2505.23579
  2. COMICAL (Oxford 2025): https://academic.oup.com/bioinformaticsadvances/article/5/1/vbaf196/8233690
  3. Med-Gemini (Google 2025): https://arxiv.org/abs/2405.03162
  4. ESM3 (Science 2025): https://www.evolutionaryscale.ai/blog/esm3-release

Document Version: 1.0
Last Updated: December 8, 2025
Next Review: After Week 1 proof-of-concept completion