Date: December 8, 2025
Research Question: How do scientific foundation models (like ESM3 for genomics) connect to LLM-based inference capabilities?
Application: Brain-genomics-LLM integration for developmental disorders research in AI-CoScientist
What it is: BioReason, reported as the first successful integration of a DNA foundation model with a large language model
How it works:
- DNA foundation model encodes genetic sequences → embeddings
- Cross-modal connector layer bridges DNA embeddings to LLM input space
- LLM (Qwen3) performs natural language reasoning over genetic information
- Training: Supervised fine-tuning + targeted reinforcement learning
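The connector step above can be sketched as a learned linear projection that maps each DNA embedding to the LLM's hidden width, so the projected vectors can be placed in front of the text token embeddings. This is a minimal sketch under stated assumptions, not BioReason's actual code: the toy dimensions, the random initialisation, and the names `make_linear`, `project`, and `build_llm_inputs` are all hypothetical.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Randomly initialised weight matrix (out_dim x in_dim) and zero bias."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def project(vec, weights, bias):
    """Apply y = W x + b to a single embedding vector."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def build_llm_inputs(dna_embeddings, text_embeddings, weights, bias):
    """Project DNA embeddings to the LLM width and prepend them to the text tokens."""
    dna_as_tokens = [project(e, weights, bias) for e in dna_embeddings]
    return dna_as_tokens + text_embeddings

# Toy sizes (assumptions): DNA encoder width 6, LLM hidden width 4.
DNA_DIM, LLM_DIM = 6, 4
W, b = make_linear(DNA_DIM, LLM_DIM)
dna = [[0.1 * (i + j) for j in range(DNA_DIM)] for i in range(3)]  # 3 DNA positions
text = [[0.0] * LLM_DIM for _ in range(5)]                         # 5 text tokens
sequence = build_llm_inputs(dna, text, W, b)
print(len(sequence), len(sequence[0]))  # 8 4
```

In a real system the projection weights would be trained during supervised fine-tuning while the DNA encoder and LLM are loaded from pretrained checkpoints.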
Performance:
- Disease pathway prediction: 86% → 98% accuracy (a 12-percentage-point improvement)
- Variant effect prediction: 15% improvement over baselines
- Generates interpretable step-by-step biological reasoning
Why it matters: This is the most direct path from "scientific data" to "natural language reasoning" - exactly what you need.
Implementation timeline: 2-3 months with pretrained models
Key institutions: University of Toronto, Vector Institute, Google DeepMind, Cohere
Source: NeurIPS 2025 Poster | arXiv
What it is: COMICAL, a contrastive learning framework for integrating brain imaging and genomics
How it works:
- Brain imaging encoder (transformer) processes structural MRI features (154 imaging-derived phenotypes, IDPs)
- Genomic encoder (transformer) processes genetic variants (SNPs)
- CLIP-style contrastive learning aligns both in unified embedding space
- Enables cross-modal retrieval: genetics ↔ brain features
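Once both encoders write into the same embedding space, cross-modal retrieval reduces to a nearest-neighbour search by cosine similarity. The sketch below assumes the embeddings already exist; the three-dimensional vectors, the IDP names, and the `retrieve` helper are made up for illustration and are not COMICAL outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, candidates, k=2):
    """Return the ids of the k candidates most similar to the query embedding."""
    ranked = sorted(candidates, key=lambda cid: cosine(query, candidates[cid]), reverse=True)
    return ranked[:k]

# Hypothetical IDP embeddings in the shared space (illustrative values only).
idp_embeddings = {
    "idp_hippocampal_volume": [0.9, 0.1, 0.0],
    "idp_cortical_thickness": [0.1, 0.9, 0.1],
    "idp_white_matter_fa": [0.0, 0.2, 0.9],
}
snp_query = [0.8, 0.2, 0.1]  # hypothetical SNP embedding
top = retrieve(snp_query, idp_embeddings, k=2)
print(top)
```

The same function works in the other direction (IDP query against SNP embeddings), which is what the genetics ↔ brain features arrow refers to.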
Performance:
- Trained on 15.4 million IDP-SNP pairs from UK Biobank (40,426 subjects)
- Discovered multiple novel genetic-brain associations
- Generalizes to prediction across diseases, including clinical outcomes not seen during training
Why it matters: Directly addresses brain-genomics integration for developmental disorders
Implementation timeline: 4-6 months (requires large-scale paired data)
Key data source: UK Biobank (40K+ subjects with imaging + genetics)
Source: Oxford Academic | medRxiv
What it is: Med-Gemini, a family of multimodal foundation models for biomedical data, including genomics
How it works:
- Single transformer processes imaging, genomics, clinical notes, and text
- Genomic data represented as "images" (polygenic risk scores projected to 2D)
- Fine-tuned from Gemini 1.5 (1 million token context)
- Natural language report generation across all modalities
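To make the "genomics as images" idea concrete, the sketch below lays a vector of polygenic risk scores out as a small grayscale grid. How Med-Gemini actually builds its images is not described here, so the square layout, the min-max scaling to 0-255, and the `prs_to_image` name are all assumptions chosen only to illustrate the representation.

```python
import math

def prs_to_image(scores):
    """Lay a 1D list of polygenic risk scores out as a square grid of 0-255 pixels."""
    side = math.ceil(math.sqrt(len(scores)))
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    # Pad with the minimum score so unused cells render as black pixels.
    padded = list(scores) + [lo] * (side * side - len(scores))
    pixels = [round(255 * (s - lo) / span) for s in padded]
    return [pixels[r * side:(r + 1) * side] for r in range(side)]

# Five illustrative scores become a 3x3 grid with four padding cells.
image = prs_to_image([-1.0, 0.0, 0.5, 1.0, 2.0])
for row in image:
    print(row)
```

The point of such an encoding is that a vision-capable LLM can consume the grid through its existing image pathway without a new genomic tokenizer.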
Performance:
- Med-Gemini-Polygenic outperforms standard polygenic risk score approaches
- Generalizes to genetically correlated diseases never seen in training
- MedQA benchmark: 4.6 percentage points higher than Med-PaLM 2
- 3D CT report generation: 53% clinically acceptable
Why it matters: Shows genomic information CAN be processed by general LLMs when properly represented
Implementation timeline: 6-12 months (requires massive compute)
Key innovation: Represents genomics as visual data for multimodal LLM
Source: Google Research Blog | arXiv
ESM3 Architecture (EvolutionaryScale, Science 2025):
- Multimodal by design: Three separate tracks for sequence, structure, and FUNCTION
- Function track accepts natural language describing what protein should do
- Iterative sampling: Generative masked language model fills in missing information
- Geometric attention: Combines sequence and 3D structure understanding
Key capability: You can prompt ESM3 with natural language function descriptions, and it will generate proteins with those properties. This is bidirectional reasoning:
- Natural language → protein (generation)
- Protein → natural language (interpretation)
Example workflow:
Input: "Design a protein that binds to amyloid-beta plaques in Alzheimer's"
ESM3 → Generates novel protein sequence with that function
Confidence: pTM > 0.8, pLDDT > 0.8
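The confidence thresholds in the workflow above can be applied as a plain filter over generated candidates. This sketch assumes each candidate already carries pTM and pLDDT scores on a 0-1 scale; the `Candidate` class and the scores shown are placeholders, not real ESM3 output.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One generated design with its predicted-structure confidence scores (0-1 scale)."""
    name: str
    ptm: float
    plddt: float

def passes_confidence(candidate, min_ptm=0.8, min_plddt=0.8):
    """Keep designs whose pTM and pLDDT both exceed the thresholds."""
    return candidate.ptm > min_ptm and candidate.plddt > min_plddt

# Placeholder scores for illustration only.
candidates = [
    Candidate("design_a", ptm=0.86, plddt=0.91),
    Candidate("design_b", ptm=0.74, plddt=0.88),
    Candidate("design_c", ptm=0.83, plddt=0.79),
]
kept = [c.name for c in candidates if passes_confidence(c)]
print(kept)  # ['design_a']
```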
Achievement: Generated esmGFP, a novel fluorescent protein with only 58% sequence identity to the closest known fluorescent protein (described by the authors as equivalent to over 500 million years of evolution).
Source: EvolutionaryScale
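All three approaches (BioReason, COMICAL, Med-Gemini) rest on the same idea of mapping each modality into a shared representation space, though only COMICAL trains with the explicit contrastive objective in step 3; BioReason and Med-Gemini project scientific inputs into the LLM's input space and train generatively. The general recipe: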
1. Separate Encoders for Each Modality:
- Scientific data encoder (DNA, protein, brain imaging) - often a pretrained foundation model
- Text encoder (natural language) - usually BERT, GPT-style LLM, or custom
2. Projection to Shared Embedding Space:
- Both encoders output vectors in the same dimensional space (e.g., 512D)
- Learned projection layers ensure semantic alignment
3. Contrastive Learning Objective:
- Positive pairs (same entity, different modalities) should have high similarity
- Negative pairs (different entities) should have low similarity
- Training minimizes this contrastive loss
4. Cross-Modal Reasoning:
- Once aligned, can retrieve across modalities
- Can generate natural language descriptions of scientific data
- Can retrieve scientific data from natural language queries
Mathematical Form (CLIP-style):
L = -log(exp(sim(sci_i, text_i) / τ) / Σ_j exp(sim(sci_i, text_j) / τ))
Where:
- sci_i = scientific data embedding (DNA, brain image, etc.)
- text_i = corresponding text embedding
- sim() = cosine similarity
- τ = temperature parameter
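The loss above can be evaluated directly for a small batch. The sketch below implements the one-directional form exactly as written (science → text); CLIP itself averages this with the mirrored text → science term. The default τ = 0.07 and the two-dimensional toy embeddings are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity, the sim() in the formula."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def contrastive_loss_row(sci, texts, i, tau=0.07):
    """-log( exp(sim(sci, text_i)/tau) / sum_j exp(sim(sci, text_j)/tau) )."""
    logits = [cosine(sci, t) / tau for t in texts]
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[i]

def contrastive_loss(sci_batch, text_batch, tau=0.07):
    """Mean loss over the batch; row i of each batch forms the positive pair."""
    n = len(sci_batch)
    return sum(contrastive_loss_row(sci_batch[i], text_batch, i, tau) for i in range(n)) / n

sci = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]   # positives match their science embeddings
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]  # positives point the wrong way
print(contrastive_loss(sci, aligned_text), contrastive_loss(sci, shuffled_text))
```

The aligned batch gives a loss near zero and the shuffled batch a large one, which is the gradient signal that pulls matching pairs together during training.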
Nature Machine Intelligence & Nature Methods (2025):
- LucaOne - Unified nucleic acid + protein foundation model (169,861 species)
- EpiAgent - Single-cell epigenomics (5M cells, 35B tokens)
- META-SiM - Single-molecule behavior discovery
- Nicheformer - Spatial single-cell analysis (SpatialCorpus-110M)
- GenomeOcean - DNA language model (4B params, #1 on Hugging Face)
Common Pattern: All use transformer architectures with domain-specific tokenization
NeurIPS 2025:
- 766 papers with reasoning as core focus
- Multiple workshops dedicated to reasoning in LLMs
- Key question: "Can LLMs generate rigorously testable hypotheses across physics, chemistry, and biology?"
Breakthrough: BioReason shows that combining domain foundation models + LLMs + reinforcement learning achieves 98% accuracy on disease pathway prediction, a multi-step biological reasoning task
Key insight: Pretraining alone is insufficient - need explicit reasoning training
ICLR 2025 Workshops:
- MLGenX: AI for genomics and target identification
- Foundation Models in the Wild: Multi-step scientific reasoning
- SCI-FM: Open science for foundation models
Trend: Moving from unimodal → multimodal → unified architectures
Example: ProTrek integrates protein sequence + structure + natural language function descriptions in one model
epiBrainLLM (medRxiv 2024/2025):
- Leverages genomic LLM to map genotypes → brain measures → clinical phenotypes
- Focuses on Alzheimer's disease
- Uses epigenomic data to understand causal pathways
Key findings from related research:
- MIT analyzed 2 million cells from 400+ Alzheimer's brains
- Identified epigenome erosion and cell identity loss
- Found specific histone modifications (H3K27ac, H3K9ac) linked to AD
- Microglial enhancers enriched for AD risk loci
GIANT Atlas:
- Genetically Informed brAiN aTlas
- Clusters brain voxels into genetically informed regions
- Integrates voxel-wise heritability + spatial proximity
Data source: UK Biobank with 40K+ subjects having both brain imaging (154 IDPs) and genetics
Current focus: Alzheimer's, psychiatric disorders, general neuroimaging
Your opportunity: Developmental disorders are UNDER-EXPLORED
Advantages:
- Less crowded research space
- Clear clinical need
- Genetic components well-established
- Longitudinal data available (developmental trajectories)
Challenges:
- Need large-scale paired brain imaging + genomics data
- Fewer samples than adult disorders
- Developmental trajectories require time-series modeling
Goal: DNA-LLM integration for genetic reasoning
Steps:
- Deploy GenomeOcean (4B params) or Nucleotide Transformer (500M)
- Build cross-modal connector to Qwen3-8B or LLaMA-3-8B
- Create reasoning dataset from ClinVar + literature (1K-10K examples)
- Fine-tune with supervised learning on reasoning chains
- Evaluate on pathway prediction tasks
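For the reasoning dataset step, one possible record layout is a JSON Lines file with one example per line. The field names below are a hypothetical schema, not a ClinVar format, and every value shown is a placeholder rather than a real variant or pathway.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ReasoningExample:
    """One supervised example: a sequence, a question, and an annotated reasoning chain."""
    variant_id: str
    gene: str
    dna_sequence: str
    question: str
    reasoning_steps: list = field(default_factory=list)
    answer: str = ""

def is_valid(example, min_steps=2):
    """Basic completeness check to run before an example enters the training set."""
    return bool(example.dna_sequence) and len(example.reasoning_steps) >= min_steps and bool(example.answer)

def to_jsonl(examples):
    """Serialise examples as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(ex)) for ex in examples)

# Placeholder content only; real records would come from ClinVar plus expert annotation.
example = ReasoningExample(
    variant_id="PLACEHOLDER-0001",
    gene="GENE_X",
    dna_sequence="ACGTACGTACGT",
    question="Which pathway is most likely disrupted by this variant?",
    reasoning_steps=[
        "Step 1: locate the variant within the gene and its functional domain.",
        "Step 2: relate the affected domain to a known pathway.",
    ],
    answer="PLACEHOLDER_PATHWAY",
)
line = to_jsonl([example])
print(line)
```

Keeping the reasoning chain as an ordered list of steps makes it straightforward to score intermediate steps separately from the final answer during evaluation.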
Expected outcome:
- Natural language reasoning over developmental disorder genetics
- 90%+ accuracy on pathway prediction
- Interpretable step-by-step explanations
Resources needed:
- 8× A100/H100 GPUs for training (2-3 weeks)
- ClinVar data (free)
- 1-2 domain experts for annotation
- ~$5K compute costs
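As a sanity check on the budget, the GPU count and duration above can be converted into GPU-hours and compared against the ~$5K figure. The sketch derives the hourly rate implied by the document's own numbers instead of assuming a market price; the helper names are illustrative.

```python
def gpu_hours(num_gpus, days):
    """Total GPU-hours for a run that keeps every GPU busy around the clock."""
    return num_gpus * days * 24

def implied_hourly_rate(budget_usd, num_gpus, days):
    """Price per GPU-hour that would make the run cost exactly budget_usd."""
    return budget_usd / gpu_hours(num_gpus, days)

two_weeks = gpu_hours(8, 14)    # 2688 GPU-hours
three_weeks = gpu_hours(8, 21)  # 4032 GPU-hours
rate_high = implied_hourly_rate(5000, 8, 14)
rate_low = implied_hourly_rate(5000, 8, 21)
print(two_weeks, three_weeks, round(rate_low, 2), round(rate_high, 2))
```

The ~$5K estimate therefore corresponds to roughly $1.24 to $1.86 per GPU-hour at full utilisation; if the available rate is higher, either the run must be shorter or the budget larger.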
Goal: Brain-genomics foundation model
Steps:
- Obtain UK Biobank access (or equivalent developmental disorder dataset)
- Extract brain imaging features (IDPs) from MRI
- Align with genetic data (SNPs, polygenic risk scores)
- Train COMICAL-style contrastive model
- Validate on known genetic-brain associations
- Discover novel associations
Expected outcome:
- Cross-modal retrieval: genetics ↔ brain features
- Novel developmental disorder associations
- Predictive modeling of outcomes
Resources needed:
- UK Biobank access (application and access fees apply)
- 16-32× GPUs for training (1-2 months)
- Brain imaging expertise
- ~$20K compute costs
Goal: Unified multimodal foundation model
Steps:
- Integrate brain imaging, genomics, clinical notes, developmental trajectories
- Fine-tune large multimodal model (7B-70B params)
- Enable natural language report generation
- Add continual learning capabilities
- Deploy as autonomous discovery agent
Expected outcome:
- State-of-the-art developmental disorder analysis
- Automated hypothesis generation
- Novel scientific discoveries
- Published research
Resources needed:
- 64-256× GPUs
- Large-scale multimodal dataset
- Multi-disciplinary team
- ~$100K+ compute costs
Minimum (Proof-of-Concept):
- Genomics: 10K-100K gene sequences
- Brain imaging: 1K-10K MRI scans
- Paired data: 1K-10K subjects with both
- Reasoning examples: 1K-10K annotated chains
Production Scale:
- Genomics: 1M-10M sequences (GenomeOcean scale)
- Brain imaging: 10K-100K scans (UK Biobank scale)
- Paired data: 10K-100K subjects
- Reasoning examples: 10K-100K high-quality annotations
AI-CoScientist Current Status:
- ✅ RAG system with scientific literature
- ✅ 100+ QA benchmark pairs
- ⚠️ Need: Brain imaging + genomics paired data
- ⚠️ Need: Reasoning chain annotations
- ⚠️ Need: Expert validation pipeline
Development (Proof-of-Concept):
- 4-8× A100 GPUs (80GB each)
- 500GB-1TB storage
- Fast interconnect for multi-GPU training
Production:
- 32-64× H100 GPUs for training
- 4-8× H100 GPUs for inference
- 10TB+ storage for datasets
- vLLM for faster inference (roughly 3× speedup, workload-dependent)
Current DGX Station Capabilities: Check GPU count and memory
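One way to run that check is to query `nvidia-smi` and parse its CSV output. The sketch below assumes the NVIDIA driver tools are installed; it returns an empty list when `nvidia-smi` is missing or fails, and the function names are illustrative.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"]

def parse_gpu_csv(text):
    """Parse 'name, memory_in_MiB' lines into dicts with memory in GiB."""
    gpus = []
    for line in text.strip().splitlines():
        name, mem_mib = [part.strip() for part in line.rsplit(",", 1)]
        try:
            memory_gib = round(int(mem_mib) / 1024, 1)
        except ValueError:
            memory_gib = None  # some devices report a non-numeric value
        gpus.append({"name": name, "memory_gib": memory_gib})
    return gpus

def list_gpus():
    """Return the visible GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []

gpus = list_gpus()
print(f"{len(gpus)} GPU(s) found")
for gpu in gpus:
    print(gpu["name"], gpu["memory_gib"], "GiB")
```

Comparing the reported count and per-GPU memory against the 4-8× 80GB proof-of-concept target shows whether the station can host the planned model sizes without offloading.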
| Size | Params | Training Time | Inference Speed | Quality | Best For |
|---|---|---|---|---|---|
| Small | 100M-500M | Days | Very Fast | Good | Prototyping |
| Medium | 1B-7B | Weeks | Fast | Very Good | Recommended |
| Large | 7B-70B | Months | Moderate | Excellent | Research |
| Huge | 70B+ | Months+ | Slow | SOTA | Benchmarks |
Recommendation: Start in the medium tier by pairing a small genomic encoder with a 7B-8B class LLM (Nucleotide Transformer 500M + Qwen3-8B)
Pathway Prediction (BioReason benchmark):
- Target: >90% accuracy (BioReason achieved 98%)
- Minimum: >80% to be useful
Cross-Modal Retrieval (COMICAL benchmark):
- Target: Recall@10 >0.8
- Minimum: >0.6 to be useful
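Recall@10 here means the fraction of truly associated items that appear among the top 10 retrieved results, averaged over queries. A minimal sketch, assuming each query comes with a ranked list of retrieved ids and a set of known-relevant ids (the ids below are made up):

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall_at_k(queries, k=10):
    """Average recall@k over (ranked_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in queries) / len(queries)

# Two toy queries with hypothetical ids.
queries = [
    (["idp_1", "idp_7", "idp_3"], {"idp_1", "idp_3"}),  # both relevant items retrieved
    (["idp_2", "idp_9", "idp_4"], {"idp_5", "idp_9"}),  # one of two retrieved
]
score = mean_recall_at_k(queries, k=10)
print(score)  # 0.75
```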
Reasoning Quality:
- Target: Faithfulness >0.8 (semantic similarity to expert reasoning)
- Minimum: >0.6 for usability
Inference Speed:
- Target: <2 seconds per query (95th percentile)
- Maximum: <5 seconds
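The 95th-percentile target can be checked by timing a batch of queries and taking a nearest-rank percentile. In this sketch `run_query` stands in for whatever function serves a request; it and the other names are hypothetical.

```python
import math
import time

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def time_queries(run_query, queries):
    """Wall-clock latency in seconds for each query."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies.append(time.perf_counter() - start)
    return latencies

def meets_latency_target(latencies, target_s=2.0, pct=95):
    """True when the pct-th percentile latency is under the target."""
    return percentile(latencies, pct) < target_s

# Stand-in for a real model call (assumption): a trivial function.
latencies = time_queries(lambda q: q.upper(), ["query one", "query two", "query three"])
print(meets_latency_target(latencies))
```

A percentile target tolerates a small share of slow outliers, which a mean would hide and a maximum would over-penalise; the separate 5-second ceiling covers the worst case.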
Novel Discoveries:
- Target: ≥1 novel genetic-brain association validated by experts
- Stretch: Published in peer-reviewed journal
Hypothesis Quality:
- Target: >50% of generated hypotheses rated as "scientifically plausible" by experts
- Stretch: >70%
Expert Adoption:
- Target: ≥5 domain experts actively using the system
- Stretch: Regular citations in scientific literature
Academia:
- MIT (2M cell Alzheimer's analysis)
- University of Toronto (BioReason)
- Google DeepMind (AlphaFold, Med-Gemini)
- EvolutionaryScale (ESM3)
Industry:
- Google Health (Med-PaLM, Med-Gemini)
- NVIDIA (BioNeMo platform)
- Anthropic (general scientific reasoning)
Your Edge:
- Focus on developmental disorders (under-explored)
- Integration with existing AI-CoScientist infrastructure
- Multi-agent system for comprehensive analysis
- Open-source approach with reproducible research
Risk: Out of memory during training
Mitigation: Gradient checkpointing, smaller models, quantization, gradient accumulation
Risk: Poor reasoning quality
Mitigation: More supervised examples, expert annotations, reinforcement learning from human feedback
Risk: Slow inference
Mitigation: vLLM (3× speedup), quantization (4-bit/8-bit), model distillation
Risk: Insufficient paired brain-genomics data
Mitigation: Start with public datasets (UK Biobank application), synthetic data generation, transfer learning
Risk: Annotation quality issues
Mitigation: Multiple expert annotators, inter-rater reliability checks, active learning to prioritize difficult cases
Risk: Data privacy concerns
Mitigation: De-identification, federated learning, synthetic data generation
Risk: Model generates plausible but incorrect hypotheses
Mitigation: Confidence scoring, uncertainty quantification, expert validation loops, retrieval-augmented generation
Risk: Overfitting to training data
Mitigation: Large diverse datasets, regularization, cross-validation, out-of-distribution testing
Risk: Bias towards well-studied genes/pathways
Mitigation: Balanced datasets, exploration bonuses in RL, diversity-promoting objectives
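The inter-rater reliability check listed among the annotation mitigations is commonly measured with Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. A minimal sketch for two annotators labelling the same items; the labels are illustrative:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["valid", "valid", "flawed", "flawed"]
annotator_2 = ["valid", "valid", "flawed", "valid"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75 observed) against 0.5 expected by chance, giving kappa = 0.5; reasoning chains with low agreement are natural candidates for a third review.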
- Install foundation model libraries (transformers, esm, genslm)
- Download pretrained checkpoints (Nucleotide Transformer, ESM-2, Qwen3)
- Test GPU setup and memory capacity
- Set up experiment tracking (Weights & Biases or MLflow)
- Download ClinVar developmental disorder variants
- Extract gene sequences from NCBI
- Create initial reasoning dataset (100-1K examples)
- Set up data processing pipeline
- Load pretrained genomic foundation model
- Load pretrained LLM
- Build simple cross-modal connector
- Test end-to-end inference on 1 example
- Measure baseline performance
- Create evaluation dataset (50-100 examples)
- Implement evaluation metrics
- Run baseline evaluation
- Document results and next steps
You asked: "How do scientific foundation models connect to LLM-based inference capabilities?"
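Answer: Through cross-modal alignment. Learned projections, trained contrastively (COMICAL) or generatively through a connector into the LLM (BioReason), place scientific data and natural language in a shared embedding space where they can interact.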
The breakthrough is BioReason (NeurIPS 2025), which shows:
- DNA foundation model can encode genetic sequences
- Cross-modal connector bridges DNA → LLM input space
- LLM performs multi-step reasoning over genetic information
- Achieves 98% accuracy on disease pathway prediction
This is directly applicable to your developmental disorders research.
This week:
- Deploy GenomeOcean or Nucleotide Transformer
- Load Qwen3-8B or LLaMA-3-8B
- Build cross-modal connector
- Test on 1 developmental disorder gene
This month:
- Create reasoning dataset (1K examples)
- Fine-tune DNA-LLM model
- Evaluate on pathway prediction
- Achieve >80% accuracy
This quarter:
- Scale to 10K reasoning examples
- Achieve >90% accuracy
- Integrate with AI-CoScientist RAG
- Discover 1 novel genetic association
3 months: Working DNA-LLM system with natural language reasoning over developmental disorder genetics
6 months: Brain-genomics foundation model discovering novel associations
12 months: Autonomous scientific discovery agent publishing novel findings
Full research report: /home/juke/git/AI-CoScientist/SCIENTIFIC_FOUNDATION_MODELS_LLM_INFERENCE_RESEARCH_2025.md (12,000 words, 38 sources)
Implementation guide: /home/juke/git/AI-CoScientist/FOUNDATION_MODEL_IMPLEMENTATION_GUIDE.md (Complete code examples and deployment guide)
Key papers:
- BioReason (NeurIPS 2025): https://arxiv.org/abs/2505.23579
- COMICAL (Oxford 2025): https://academic.oup.com/bioinformaticsadvances/article/5/1/vbaf196/8233690
- Med-Gemini (Google 2024): https://arxiv.org/abs/2405.03162
- ESM3 (Science 2025): https://www.evolutionaryscale.ai/blog/esm3-release
Document Version: 1.0
Last Updated: December 8, 2025
Next Review: After Week 1 proof-of-concept completion