Executive Summary: Scientific Foundation Models → LLM Inference

Date: December 8, 2025
Research Question: How do scientific foundation models (like ESM3 for genomics) connect to LLM-based inference capabilities?
Application: Brain-genomics-LLM integration for developmental disorders research in AI-CoScientist

The Critical Discovery: Three Proven Pathways

1. BioReason Architecture (NeurIPS 2025) - RECOMMENDED FOR AI-COSCIENTIST

What it is: First successful integration of a DNA foundation model with a large language model

How it works:

DNA foundation model encodes genetic sequences → embeddings
Cross-modal connector layer bridges DNA embeddings to LLM input space
LLM (Qwen3) performs natural language reasoning over genetic information
Training: Supervised fine-tuning + targeted reinforcement learning
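As a rough illustration of the connector step, the sketch below projects DNA-encoder embeddings into an LLM's embedding width with a single linear map and prepends them to the text token embeddings. The dimensions, random weights, and function names are hypothetical, and plain Python lists stand in for tensors; BioReason's actual connector may be structured differently.

```python
import random

def linear_project(vectors, weight, bias):
    """Apply y = W x + b to each vector; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]

def build_llm_inputs(dna_embeddings, text_embeddings, weight, bias):
    """Project DNA embeddings to the LLM width and prepend them to the text tokens."""
    projected = linear_project(dna_embeddings, weight, bias)
    return projected + text_embeddings

# Toy sizes (assumptions): DNA encoder width 4, LLM embedding width 6.
DNA_DIM, LLM_DIM = 4, 6
rng = random.Random(0)
weight = [[rng.uniform(-0.1, 0.1) for _ in range(DNA_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

dna_embeddings = [[rng.gauss(0, 1) for _ in range(DNA_DIM)] for _ in range(3)]   # 3 DNA tokens
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(5)]  # 5 text tokens

llm_inputs = build_llm_inputs(dna_embeddings, text_embeddings, weight, bias)
print(len(llm_inputs), len(llm_inputs[0]))  # sequence length and embedding width
```

In a trained system the projection weights would be learned during supervised fine-tuning rather than sampled at random.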

Performance:

Disease pathway prediction: accuracy improved from 86% to 98% (12 percentage points)
Variant effect prediction: 15% improvement over baselines
Generates interpretable step-by-step biological reasoning

Why it matters: This is the most direct path from "scientific data" to "natural language reasoning" - exactly what you need.

Implementation timeline: 2-3 months with pretrained models

Key institutions: University of Toronto, Vector Institute, Google DeepMind, Cohere

Source: NeurIPS 2025 Poster | arXiv

2. COMICAL Architecture (Oxford Academic 2025)

What it is: Contrastive learning framework for brain imaging + genomics integration

How it works:

Brain imaging encoder (transformer) processes structural MRI features (154 IDPs)
Genomic encoder (transformer) processes genetic variants (SNPs)
CLIP-style contrastive learning aligns both in unified embedding space
Enables cross-modal retrieval: genetics ↔ brain features
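Once both encoders map into one embedding space, cross-modal retrieval reduces to a nearest-neighbor search. The sketch below ranks brain-feature embeddings by cosine similarity to a SNP embedding. The IDP names and vectors are made up for illustration and are not taken from COMICAL.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, candidates, k=2):
    """Return the k candidate ids most similar to the query embedding."""
    ranked = sorted(candidates, key=lambda cid: cosine(query, candidates[cid]), reverse=True)
    return ranked[:k]

# Hypothetical aligned embeddings: keys are invented IDP names, values are toy vectors.
idp_embeddings = {
    "idp_hippocampal_volume": [0.9, 0.1, 0.0],
    "idp_cortical_thickness": [0.1, 0.9, 0.1],
    "idp_white_matter_fa": [0.0, 0.2, 0.9],
}
snp_query = [0.8, 0.2, 0.1]  # embedding of an invented SNP

top = retrieve(snp_query, idp_embeddings, k=2)
print(top)  # ['idp_hippocampal_volume', 'idp_cortical_thickness']
```

The same function works in the other direction (brain feature as query, SNP embeddings as candidates), which is what makes the retrieval bidirectional.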

Performance:

Trained on 15.4 million IDP-SNP pairs from UK Biobank (40,426 subjects)
Discovered multiple novel genetic-brain associations
Predicts across diseases and unseen clinical outcomes

Why it matters: Directly addresses brain-genomics integration for developmental disorders

Implementation timeline: 4-6 months (requires large-scale paired data)

Key data source: UK Biobank (40K+ subjects with imaging + genetics)

Source: Oxford Academic | medRxiv

3. Med-Gemini Architecture (Google Research, 2024)

What it is: Unified multimodal foundation model for all biomedical data including genomics

How it works:

Single transformer processes imaging, genomics, clinical notes, and text
Genomic data represented as "images" (polygenic risk scores projected to 2D)
Fine-tuned from Gemini 1.5 (1 million token context)
Natural language report generation across all modalities

Performance:

Med-Gemini-Polygenic outperforms standard polygenic risk score approaches
Generalizes to genetically correlated diseases never seen in training
MedQA benchmark: 4.6 percentage-point improvement over Med-PaLM 2
3D CT report generation: 53% clinically acceptable

Why it matters: Shows genomic information CAN be processed by general LLMs when properly represented

Implementation timeline: 6-12 months (requires massive compute)

Key innovation: Represents genomics as visual data for multimodal LLM

Source: Google Research Blog | arXiv

Technical Answer to Your Question

How ESM3 Enables Inference on Genetic Information

ESM3 Architecture (Evolutionary Scale, Science 2025):

Multimodal by design: Three separate tracks for sequence, structure, and FUNCTION
Function track accepts function keywords describing what the protein should do
Iterative sampling: Generative masked language model fills in missing information
Geometric attention: Combines sequence and 3D structure understanding

Key capability: You can prompt ESM3 with function descriptions, and it can generate candidate proteins intended to have those properties. This supports reasoning in both directions:

Natural language → protein (generation)
Protein → natural language (interpretation)

Example workflow:

Input: "Design a protein that binds to amyloid-beta plaques in Alzheimer's"
ESM3 → Generates novel protein sequence with that function
Confidence: pTM > 0.8, pLDDT > 0.8
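The confidence step in this workflow amounts to filtering generated candidates on the two thresholds. A minimal sketch, assuming each candidate already carries pTM and mean pLDDT scores on a 0-1 scale; the `Candidate` class and the sequences are placeholders, not ESM3 API objects or real designs.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    sequence: str
    ptm: float    # predicted TM-score, 0-1
    plddt: float  # mean pLDDT on a 0-1 scale, matching the thresholds above

def passes_confidence(candidate, min_ptm=0.8, min_plddt=0.8):
    """Keep a design only if both confidence scores exceed their thresholds."""
    return candidate.ptm > min_ptm and candidate.plddt > min_plddt

# Placeholder sequences and scores, invented for illustration.
candidates = [
    Candidate("MKTAYIAKQR", ptm=0.85, plddt=0.91),
    Candidate("MSEQNNTEMT", ptm=0.62, plddt=0.88),
    Candidate("MALWMRLLPL", ptm=0.81, plddt=0.79),
]
kept = [c for c in candidates if passes_confidence(c)]
print([c.sequence for c in kept])  # ['MKTAYIAKQR']
```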

Achievement: Generated esmGFP, a novel fluorescent protein with only 58% sequence identity to the closest known fluorescent protein, a distance the authors estimate as equivalent to roughly 500 million years of evolution.

Source: Evolutionary Scale

The Missing Link: Cross-Modal Contrastive Learning

What Makes These Systems Work

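All three approaches (BioReason, COMICAL, Med-Gemini) align scientific data with language in a shared representation. COMICAL does this with an explicit contrastive objective; BioReason and Med-Gemini, as summarized above, reach a comparable alignment through a learned connector or an image-style projection followed by fine-tuning. The contrastive recipe has four parts: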

1. Separate Encoders for Each Modality:

Scientific data encoder (DNA, protein, brain imaging) - often a pretrained foundation model
Text encoder (natural language) - usually BERT, GPT-style LLM, or custom

2. Projection to Shared Embedding Space:

Both encoders output vectors in the same dimensional space (e.g., 512D)
Learned projection layers ensure semantic alignment

3. Contrastive Learning Objective:

Positive pairs (same entity, different modalities) should have high similarity
Negative pairs (different entities) should have low similarity
Training minimizes this contrastive loss

4. Cross-Modal Reasoning:

Once aligned, can retrieve across modalities
Can generate natural language descriptions of scientific data
Can retrieve scientific data from natural language queries

Mathematical Form (CLIP-style):

L = -log(exp(sim(sci_i, text_i) / τ) / Σ_j exp(sim(sci_i, text_j) / τ))

Where:

sci_i = scientific data embedding (DNA, brain image, etc.)
text_i = corresponding text embedding
sim() = cosine similarity
τ = temperature parameter
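To make the formula concrete, the sketch below evaluates it on toy embeddings and averages over the batch. It implements only the scientific-data-to-text direction written above (CLIP itself averages this with the text-to-data direction), and the value τ = 0.07 and the vectors are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def contrastive_loss(sci, text, tau=0.07):
    """Mean over i of -log( exp(sim(sci_i, text_i)/tau) / sum_j exp(sim(sci_i, text_j)/tau) )."""
    total = 0.0
    for i, s in enumerate(sci):
        logits = [cosine(s, t) / tau for t in text]
        m = max(logits)  # subtract the max before exponentiating for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(sci)

sci = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[0.9, 0.1], [0.1, 0.9]]        # text_i points the same way as sci_i
mismatched_text = list(reversed(matched_text))  # pairs deliberately swapped

loss_matched = contrastive_loss(sci, matched_text)
loss_mismatched = contrastive_loss(sci, mismatched_text)
print(loss_matched, loss_mismatched)  # the matched pairing gives the smaller loss
```

A smaller τ sharpens the softmax, so the loss penalizes near-miss negatives more heavily.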

2025 State of the Field: Key Findings

Foundation Models Are Everywhere in Science

Nature Machine Intelligence & Nature Methods (2025):

LucaOne - Unified nucleic acid + protein foundation model (169,861 species)
EpiAgent - Single-cell epigenomics (5M cells, 35B tokens)
META-SiM - Single-molecule behavior discovery
Nicheformer - Spatial single-cell analysis (SpatialCorpus-110M)
GenomeOcean - DNA language model (4B params, #1 on Hugging Face)

Common Pattern: All use transformer architectures with domain-specific tokenization

Reasoning Is the New Frontier

NeurIPS 2025:

766 papers with reasoning as core focus
Multiple workshops dedicated to reasoning in LLMs
Key question: "Can LLMs generate rigorously testable hypotheses across physics, chemistry, and biology?"

Breakthrough: BioReason shows that combining domain foundation models, LLMs, and reinforcement learning reaches 98% accuracy on disease pathway prediction, a multi-step biological reasoning task

Key insight: Pretraining alone is insufficient - need explicit reasoning training

Multimodal Is Essential for Science

ICLR 2025 Workshops:

MLGenX: AI for genomics and target identification
Foundation Models in the Wild: Multi-step scientific reasoning
SCI-FM: Open science for foundation models

Trend: Moving from unimodal → multimodal → unified architectures

Example: ProTrek integrates protein sequence + structure + natural language function descriptions in one model

Specific to Brain-Genomics Integration

Current State (2025)

epiBrainLLM (medRxiv 2024/2025):

Leverages genomic LLM to map genotypes → brain measures → clinical phenotypes
Focuses on Alzheimer's disease
Uses epigenomic data to understand causal pathways

Key findings from related research:

MIT analyzed 2 million cells from 400+ Alzheimer's brains
Identified epigenome erosion and cell identity loss
Found specific histone modifications (H3K27ac, H3K9ac) linked to AD
Microglial enhancers enriched for AD risk loci

GIANT Atlas:

Genetically Informed brAiN aTlas
Clusters brain voxels into genetically informed regions
Integrates voxel-wise heritability + spatial proximity

Data source: UK Biobank with 40K+ subjects having both brain imaging (154 IDPs) and genetics

Gap Analysis: What's Missing for Developmental Disorders

Current focus: Alzheimer's, psychiatric disorders, general neuroimaging

Your opportunity: Developmental disorders are UNDER-EXPLORED

Advantages:

Less crowded research space
Clear clinical need
Genetic components well-established
Longitudinal data available (developmental trajectories)

Challenges:

Need large-scale paired brain imaging + genomics data
Fewer samples than adult disorders
Developmental trajectories require time-series modeling

Recommended Strategy for AI-CoScientist

Phase 1: Quick Win (Month 1-2) - BioReason Pathway

Goal: DNA-LLM integration for genetic reasoning

Steps:

Deploy GenomeOcean (4B params) or Nucleotide Transformer (500M)
Build cross-modal connector to Qwen3-8B or LLaMA-3-8B
Create reasoning dataset from ClinVar + literature (1K-10K examples)
Fine-tune with supervised learning on reasoning chains
Evaluate on pathway prediction tasks

Expected outcome:

Natural language reasoning over developmental disorder genetics
90%+ accuracy on pathway prediction
Interpretable step-by-step explanations

Resources needed:

8× A100/H100 GPUs for training (2-3 weeks)
ClinVar data (free)
1-2 domain experts for annotation
~$5K compute costs
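The GPU count, duration, and budget above can be cross-checked with simple arithmetic. The sketch below computes total GPU-hours and the hourly rate the ~$5K figure implies; it assumes the GPUs run continuously for the whole period, which the plan does not state, so treat the result as a sanity check against your provider's actual pricing.

```python
def gpu_hours(num_gpus, weeks):
    """Total GPU-hours, assuming every GPU runs 24/7 for the whole period."""
    return num_gpus * weeks * 7 * 24

def implied_hourly_rate(budget_usd, num_gpus, weeks):
    """Price per GPU-hour that would make the budget cover the full run."""
    return budget_usd / gpu_hours(num_gpus, weeks)

for weeks in (2, 3):
    hours = gpu_hours(8, weeks)
    rate = implied_hourly_rate(5000, 8, weeks)
    print(f"{weeks} weeks: {hours} GPU-hours, implied ${rate:.2f} per GPU-hour")
# 2 weeks: 2688 GPU-hours, implied $1.86 per GPU-hour
# 3 weeks: 4032 GPU-hours, implied $1.24 per GPU-hour
```

If the quoted rate for A100/H100 capacity is higher than the implied rate, either the budget or the training duration needs to be revised.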

Phase 2: Scale Up (Month 3-6) - COMICAL Pathway

Goal: Brain-genomics foundation model

Steps:

Obtain UK Biobank access (or equivalent developmental disorder dataset)
Extract brain imaging features (IDPs) from MRI
Align with genetic data (SNPs, polygenic risk scores)
Train COMICAL-style contrastive model
Validate on known genetic-brain associations
Discover novel associations

Expected outcome:

Cross-modal retrieval: genetics ↔ brain features
Novel developmental disorder associations
Predictive modeling of outcomes

Resources needed:

UK Biobank access ($$$)
16-32× GPUs for training (1-2 months)
Brain imaging expertise
~$20K compute costs

Phase 3: Full Integration (Month 6-12) - Med-Gemini Pathway

Goal: Unified multimodal foundation model

Steps:

Integrate brain imaging, genomics, clinical notes, developmental trajectories
Fine-tune large multimodal model (7B-70B params)
Enable natural language report generation
Add continual learning capabilities
Deploy as autonomous discovery agent

Expected outcome:

State-of-the-art developmental disorder analysis
Automated hypothesis generation
Novel scientific discoveries
Published research

Resources needed:

64-256× GPUs
Large-scale multimodal dataset
Multi-disciplinary team
~$100K+ compute costs

Key Technical Requirements

Data Requirements

Minimum (Proof-of-Concept):

Genomics: 10K-100K gene sequences
Brain imaging: 1K-10K MRI scans
Paired data: 1K-10K subjects with both
Reasoning examples: 1K-10K annotated chains

Production Scale:

Genomics: 1M-10M sequences (GenomeOcean scale)
Brain imaging: 10K-100K scans (UK Biobank scale)
Paired data: 10K-100K subjects
Reasoning examples: 10K-100K high-quality annotations

AI-CoScientist Current Status:

✅ RAG system with scientific literature
✅ 100+ QA benchmark pairs
⚠️ Need: Brain imaging + genomics paired data
⚠️ Need: Reasoning chain annotations
⚠️ Need: Expert validation pipeline

Computational Requirements

Development (Proof-of-Concept):

4-8× A100 GPUs (80GB each)
500GB-1TB storage
Fast interconnect for multi-GPU training

Production:

32-64× H100 GPUs for training
4-8× H100 GPUs for inference
10TB+ storage for datasets
vLLM for inference serving (the ~3× speedup is an estimate; actual gains depend on model and workload)

Current DGX Station Capabilities: Check GPU count and memory
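One way to run that check is to query `nvidia-smi` and parse its CSV output. The sketch below uses the `--query-gpu=name,memory.total` and `--format=csv,noheader,nounits` options; the sample text is illustrative and does not describe the actual DGX Station.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"]

def parse_gpu_report(report):
    """Parse 'name, memory_mib' lines into (name, memory_gib) tuples."""
    gpus = []
    for line in report.strip().splitlines():
        name, memory_mib = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append((name, int(memory_mib) / 1024))
    return gpus

def query_gpus():
    """Run nvidia-smi; return [] if the tool is missing or exits with an error."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_report(result.stdout)

# Sample text in the format this query emits (values are illustrative).
sample = "NVIDIA A100-SXM4-80GB, 81920\nNVIDIA A100-SXM4-80GB, 81920\n"
for name, gib in parse_gpu_report(sample):
    print(f"{name}: {gib:.0f} GiB")
```

On the target machine, call `query_gpus()` instead of parsing the sample and compare the result with the development requirements listed above.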

Model Sizes and Trade-offs

| Size | Params | Training Time | Inference Speed | Quality | Best For |
|---|---|---|---|---|---|
| Small | 100M-500M | Days | Very Fast | Good | Prototyping |
| Medium | 1B-7B | Weeks | Fast | Very Good | Recommended |
| Large | 7B-70B | Months | Moderate | Excellent | Research |
| Huge | 70B+ | Months+ | Slow | SOTA | Benchmarks |

Recommendation: Start with a sub-1B DNA encoder and a 7B-8B LLM (Nucleotide Transformer 500M + Qwen3-8B)

Success Metrics

Technical Metrics

Pathway Prediction (BioReason benchmark):

Target: >90% accuracy (BioReason achieved 98%)
Minimum: >80% to be useful

Cross-Modal Retrieval (COMICAL benchmark):

Target: Recall@10 >0.8
Minimum: >0.6 to be useful
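Recall@10 here means the fraction of a query's true matches that appear among its top 10 retrieved items, averaged over queries. A minimal sketch with toy ids follows; the example uses k=3 so the numbers are easy to check by hand, and the data is invented.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant items that appear in the top-k ranked results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall_at_k(queries, k=10):
    """Average recall@k over (ranked_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in queries) / len(queries)

# Toy retrieval results: (ranked ids returned by the model, ground-truth ids).
queries = [
    (["a", "b", "c", "d"], {"a", "d"}),
    (["e", "f", "g", "h"], {"h", "z"}),
]
print(mean_recall_at_k(queries, k=3))  # 0.25
```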

Reasoning Quality:

Target: Faithfulness >0.8 (semantic similarity to expert reasoning)
Minimum: >0.6 for usability

Inference Speed:

Target: <2 seconds per query (95th percentile)
Maximum: <5 seconds
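The 95th-percentile latency can be computed from logged per-query timings. The sketch below uses the nearest-rank definition of a percentile, which is one of several conventions; the timings are invented.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Invented per-query latencies in seconds.
latencies_s = [0.8, 1.1, 0.9, 1.4, 1.0, 1.2, 0.7, 1.9, 1.3, 4.2]
p95 = percentile_nearest_rank(latencies_s, 95)
print(p95, p95 < 2.0, p95 < 5.0)  # 4.2 False True
```

In this example a single slow query pushes p95 past the 2-second target while staying under the 5-second maximum, which is why the tail percentile is tracked instead of the mean.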

Scientific Metrics

Novel Discoveries:

Target: ≥1 novel genetic-brain association validated by experts
Stretch: Published in peer-reviewed journal

Hypothesis Quality:

Target: >50% of generated hypotheses rated as "scientifically plausible" by experts
Stretch: >70%

Expert Adoption:

Target: ≥5 domain experts actively using the system
Stretch: Regular citations in scientific literature

Competitive Landscape

Who's Doing This?

Academia:

MIT (2M cell Alzheimer's analysis)
University of Toronto (BioReason)

Industry:

Google DeepMind (AlphaFold, Med-Gemini)
Evolutionary Scale (ESM3)
Google Health (Med-PaLM, Med-Gemini)
NVIDIA (BioNeMo platform)
Anthropic (general scientific reasoning)

Your Edge:

Focus on developmental disorders (under-explored)
Integration with existing AI-CoScientist infrastructure
Multi-agent system for comprehensive analysis
Open-source approach with reproducible research

Risk Assessment

Technical Risks

Risk: Out of memory during training
Mitigation: Gradient checkpointing, smaller models, quantization, gradient accumulation

Risk: Poor reasoning quality
Mitigation: More supervised examples, expert annotations, reinforcement learning from human feedback

Risk: Slow inference
Mitigation: vLLM (3× speedup), quantization (4-bit/8-bit), model distillation
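The effect of quantization on memory can be estimated before running anything. The sketch below computes weight storage only for an 8B-parameter model at three precisions; it ignores activations, KV cache, gradients, and optimizer state, so real training and inference footprints will be larger.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate memory for model weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(precision, weight_memory_gb(8e9, precision), "GB")
# fp16 16.0 GB
# int8 8.0 GB
# int4 4.0 GB
```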

Data Risks

Risk: Insufficient paired brain-genomics data
Mitigation: Start with public datasets (UK Biobank application), synthetic data generation, transfer learning

Risk: Annotation quality issues
Mitigation: Multiple expert annotators, inter-rater reliability checks, active learning to prioritize difficult cases

Risk: Data privacy concerns
Mitigation: De-identification, federated learning, synthetic data generation
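For the inter-rater reliability check listed under annotation quality, Cohen's kappa is one common statistic for two annotators; it is offered here as an example, since the plan does not name a specific measure. The labels below are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for agreement expected by chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented judgments on six reasoning chains.
annotator_1 = ["valid", "valid", "invalid", "valid", "invalid", "valid"]
annotator_2 = ["valid", "invalid", "invalid", "valid", "invalid", "valid"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.667
```

Here the annotators agree on 5 of 6 items (0.833 observed) against 0.5 expected by chance, giving kappa of about 0.667.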

Scientific Risks

Risk: Model generates plausible but incorrect hypotheses
Mitigation: Confidence scoring, uncertainty quantification, expert validation loops, retrieval-augmented generation

Risk: Overfitting to training data
Mitigation: Large diverse datasets, regularization, cross-validation, out-of-distribution testing

Risk: Bias towards well-studied genes/pathways
Mitigation: Balanced datasets, exploration bonuses in RL, diversity-promoting objectives

Next Steps (Week 1)

Monday-Tuesday: Environment Setup

Install foundation model libraries (transformers, esm, genslm)
Download pretrained checkpoints (Nucleotide Transformer, ESM-2, Qwen3)
Test GPU setup and memory capacity
Set up experiment tracking (Weights & Biases or MLflow)

Wednesday-Thursday: Data Preparation

Download ClinVar developmental disorder variants
Extract gene sequences from NCBI
Create initial reasoning dataset (100-1K examples)
Set up data processing pipeline

Friday: Proof-of-Concept

Load pretrained genomic foundation model
Load pretrained LLM
Build simple cross-modal connector
Test end-to-end inference on 1 example
Measure baseline performance

Weekend: Evaluation

Create evaluation dataset (50-100 examples)
Implement evaluation metrics
Run baseline evaluation
Document results and next steps

Conclusion: The Path Forward

The Bottom Line

You asked: "How do scientific foundation models connect to LLM-based inference capabilities?"

Answer: Through cross-modal alignment, via contrastive learning or a learned connector, that maps scientific data and natural language into a shared embedding space where they can interact.

The breakthrough is BioReason (NeurIPS 2025), which shows:

DNA foundation model can encode genetic sequences
Cross-modal connector bridges DNA → LLM input space
LLM performs multi-step reasoning over genetic information
Achieves 98% accuracy on disease pathway prediction

This is directly applicable to your developmental disorders research.

Immediate Action Items

This week:

Deploy GenomeOcean or Nucleotide Transformer
Load Qwen3-8B or LLaMA-3-8B
Build cross-modal connector
Test on 1 developmental disorder gene

This month:

Create reasoning dataset (1K examples)
Fine-tune DNA-LLM model
Evaluate on pathway prediction
Achieve >80% accuracy

This quarter:

Scale to 10K reasoning examples
Achieve >90% accuracy
Integrate with AI-CoScientist RAG
Discover 1 novel genetic association

Expected Outcomes

3 months: Working DNA-LLM system with natural language reasoning over developmental disorder genetics

6 months: Brain-genomics foundation model discovering novel associations

12 months: Autonomous scientific discovery agent publishing novel findings

References

Full research report: /home/juke/git/AI-CoScientist/SCIENTIFIC_FOUNDATION_MODELS_LLM_INFERENCE_RESEARCH_2025.md (12,000 words, 38 sources)

Implementation guide: /home/juke/git/AI-CoScientist/FOUNDATION_MODEL_IMPLEMENTATION_GUIDE.md (Complete code examples and deployment guide)

Key papers:

BioReason (NeurIPS 2025): https://arxiv.org/abs/2505.23579
COMICAL (Oxford 2025): https://academic.oup.com/bioinformaticsadvances/article/5/1/vbaf196/8233690
Med-Gemini (Google, arXiv 2024): https://arxiv.org/abs/2405.03162
ESM3 (Science 2025): https://www.evolutionaryscale.ai/blog/esm3-release

Document Version: 1.0
Last Updated: December 8, 2025
Next Review: After Week 1 proof-of-concept completion
