lamalab-org/GPMs-book

General Purpose Models for the Chemical Sciences ✨

📘 Read the Review

You can read the review directly on arXiv or access it as a collaborative online book:

👉 arXiv

👉 Online Book

We also welcome community contributions. This is a living resource of the references used in the review, and we aim to keep it current with the evolving ecosystem. If you don't find your own work or a favourite paper here, please add it.

✨ Join the Community

Help us grow and improve this resource by sharing your feedback or contributing directly via the online platform.

Together, we can keep this reference useful, relevant, and up to date for everyone!

This document provides an overview of the research sections and their associated references.

Introduction

The Shape and Structure of Chemical Data

Shape of Scientific Data

Scale of Chemical Data

Dataset Creation

Filtering
Synthetic Data

Building Principles of GPMs

Taxonomy of Foundation Models

Representations

Common Representations of Molecules and Materials
Tokenization
Embeddings

General Training Workflow

Pre-training: Learning the Shape of Data

Self-Supervision
Families of Self-Supervised Learning
Generative Methods
Contrastive Learning

The Holy Grail of Building Good Internal Representation

Fine-Tuning: Learning the Coloring of Data

Post-Supervised Adaptation: Learning to Align and Shape Behavior

Example Architectures

Multimodality

Multimodal Integration in Chemistry

Optimizations

Mixture-of-Experts
Quantization and Mixed Precision
Parameter-Efficient Tuning
Distillation

Model Level Adaptation

System-level Integration: Agents

Core Components of an Agentic System
Approaches for Building Agentic Systems

Evaluations

The Evolution of Model Evaluation

Design of Evaluations

Evaluation Methodologies

Future Directions

Applications

Existing GPMs for Chemical Science

Knowledge Gathering

Structured Data Extraction
Question Answering

Experiment Planning

Conventional Planning
LLMs to Decompose Problems into Plans
Pruning of Search Spaces
Evaluation

Experiment Execution

Compiled Automation
Interpreted Automation
Hybrid Approaches
Comparison and Outlook

Data Analysis

Prompting
Agentic Systems
Current Limitations

Reporting

From Data to Explanation
Writing Assistance
Vision

Accelerating Applications

Property Prediction

Prompting
Fine-Tuning
Agents
Core Limitations

Molecular and Material Generation

Generation
Validation

Retrosynthesis

Accelerating Applications

Automating the Scientific Workflow

Coding and ML Applications of AI Scientists
Chemistry and Related Fields
Are these Systems Capable of Real Autonomous Research?

Implications of GPMs: Education, Safety, and Ethics

Education

Vision
Current Status
Outlook and Limitations

Safety

Evaluating Risk Amplification in the Chemical Discovery Cycle
Existing Approaches to Safety
Solutions

Ethics

Environmental Impact and Climate Ethics
Copyright Infringement and Plagiarism Concerns
Bias and Discrimination
Democratization of Power

Outlook and Conclusions

Optimizers

LLMs as Optimizers

LLMs as Surrogate Models
LLMs as Next Candidate Generators
LLMs as Prior Knowledge Sources
How to Face Optimization Problems?

Hypothesis

Hypothesis Generation

Initial Sparks
Chemistry-Focused Hypotheses
Are LLMs Actually Capable of Novel Hypothesis Generation?
