Skip to content

luigibonati/DEAL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

96 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

DEAL

Data Efficient Active Learning for Machine Learning Potentials

DEAL selects non-redundant structures from atomistic trajectories via Sparse Gaussian Processes (SGP), to be used to train machine-learning interatomic potentials.

It consists of two steps:

  1. preselection using the MLP uncertainty (e.g. max uncertainty obtained with query-by-committee)
  2. select a dataset of non-redundant configurations using the local predictive variance of a Gaussian Process

The method is described in:

Perego S. & Bonati L. Data efficient machine learning potentials for modeling catalytic reactivity via active learning and enhanced sampling, npj Computational Materials 10, 291 (2024) doi: 10.1038/s41524-024-01481-6

In addition, step 2. can be also used to perform subsampling of a trajectory even without the uncertainty pre-selection (see examples)

Highlights

  • Select structures based on SGP predictive variance.

  • Analyze selected structures (e.g. along the trajectory or as a function of a CV)

    drawing
  • Interactive visualization using chemiscope

    drawing


Table of contents


πŸ“š Contents

  • deal/ – The core Python package.
  • examples/ – Two realistic workflows demonstrating how to use DEAL in practice.
  • npj_supporting_data/ – Jupyter notebooks reproducing the full workflow described in the publication, including the use of Gaussian Process Regression for reaction-pathway exploration.
  • tests/ – A test suite to verify that the installation is correct and all components work as expected.

πŸ”§ Dependencies

DEAL requires:


πŸš€ Installation

Below is a complete installation sequence.

Create environment

conda create -n deal python=3.12
conda activate deal

Install FLARE 1.3.3b

(See: https://mir-group.github.io/flare/installation/install.html)

conda install -y gcc gxx cmake openmp liblapacke openblas -c conda-forge
git clone https://github.com/mir-group/flare.git -b 1.3.3b
cd flare
pip install .
cd ..
  1. Install DEAL
git clone https://github.com/luigibonati/DEAL.git
cd DEAL
pip install .

πŸ§ͺ Usage

DEAL can be run either with a command-line tool (deal) or using the python class (DEAL).


🟒 Minimal example

deal --file traj.xyz --threshold 0.1

DEAL will automatically:

  • detect atomic species from the first frame
  • use default GP/kernel/descriptor parameters
  • use default output names

πŸ“„ With a YAML config file

For more customization you can create an input.yaml file (below are the default):

deal -c input.yaml
data:
  files: ["traj.xyz"]     # can be a single file or a list of files
  format: "extxyz"        # file format (e.g. extxyz, xyz, ...)
  index: ":"              # frame selection (e.g. ":", "0:100", [0,10,20]) [see ASE notation]
  colvar: none            # collective variables file associated with the trajectory (optional, used for monitoring CVs in chemiscope)
  shuffle: false          # whether to shuffle the frames before processing (suggested true for MD data)
  seed: 42

deal:
  threshold: 0.1         #can be a single value or a list of values
  update_threshold: 0.08 # if not set it is chosen as 0.8 * threshold 
  max_atoms_added: 0.2    # limit the number of selected environments added per configuration (can be int (number of atoms) float (0,1) (fraction of total atoms), or -1 (no limit)
  initial_atoms: none     # specify which atoms to use for GP initialization (list, fraction or number. Default: none, 1 atom per species)
  output_prefix: deal    # prefix for output files (threshold will be appended as suffix)
  force_only: true
  train_hyps: false      # whether to re-train hyperparameters at each iteration (slower) 
  verbose: true          # allowed values: true/false/"debug" (default: true)
  save_gp: false

flare_calc:
  gp: SGP_Wrapper        # (see flare's documentation)
  kernels:
    - name: NormalizedDotProduct
      sigma: 2 
      power: 2
  descriptors:
    - name: B2
      nmax: 8
      lmax: 3
      cutoff_function: cosine
      radial_basis: chebyshev
  cutoff: 4.5
  

One can also use a base config file and override via CLI the options:

deal -c input.yaml -t 0.15

🐍 Python Usage

# Import 
from deal import DataConfig, DEALConfig, FlareConfig, DEAL

# Define Config (uses defaults where not provided)
data_cfg = DataConfig(files="traj.xyz")
deal_cfg = DEALConfig(
    threshold=0.1,
    output_prefix="deal",    
)
flare_cfg = FlareConfig()

# Instantiate DEAL class
deal = DEAL(data_cfg, deal_cfg, flare_cfg)

# Run 
deal.run()

Output files

In both cases the following files (with the default output_prefix=deal):

  1. deal_selected.xyz – selected frames

Contains the atomic configurations where the GP uncertainty exceeded the threshold. Includes atoms.info["frame"] indicating the original trajectory index.

  1. deal_chemiscope.json.gz – chemiscope visualization file

Can be viewed online at https://chemiscope.org/ or inside Python:

import chemiscope
chemiscope.show_input('deal_chemiscope.json.gz')

Multiple thresholds

If the CLI receives a list of thresholds, DEAL will run once per threshold.

deal:
  threshold:
    - 0.10
    - 0.15
    - 0.20

Equivalent behaviour in Python:

for thr in [0.10, 0.15, 0.20]:
    deal_cfg.threshold = thr
    deal_cfg.output_prefix = f"run_thr{thr}"
    DEAL(data_cfg, deal_cfg, flare_cfg).run()

πŸŽ›οΈ Choice of the parameters

Descriptors

Local environments are characterized via the Atomic Cluster Expansion formalism as implemented in flare. Key hyperparameters: body order (B1/B2), radial degree nmax, angular degree lmax, and cutoff (in Γ…).

  descriptors:
    - name: B2
      nmax: 8
      lmax: 3
      cutoff_function: cosine
      radial_basis: chebyshev
  cutoff: 4.5

Threshold

  threshold: 0.1
  update_threshold: 0.08  # if not set it is chosen as 0.8 * threshold      
  max_atoms_added: -1 # no limit on the number of selected environment of a given configuration to the GP.
  initial_atoms: 0.15 # use up to 15% of the atoms (of each species) for GP initialization

Thethreshold parameter in the DEAL configuration controls when a local environment is flagged by the SGP’s predictive variance (normalized by the noise hyperparameter). If any environment exceeds the threshold, the GP is updated and that environment (plus any others above update_threshold * threshold, up to max_atoms_added) is added.

Some tips:

  • A good starting point is around 0.1. As a rule of thumb, homogeneous, condensed and/or crystalline systems tend to have fewer different local environments and require smaller thresholds (<<0.1), whereas heterogeneous systems may require larger ones (>0.1).
  • Try a few values and compare how many structures are selected; distributions often are very similar across thresholds, what changes is the number of structures.
  • For active learning of ML potentials, a possible practical strategy is to pick an initial threshold (large for quickly covering all configurations), run the single-point calculations on the selected structure, and update the potential. Then, re-evaluate the already screened configuration (all) and, if it is not adequately described, restart DEAL selection with a tighter threshold. Given the generally low cost of ML simulations, it is still advisable to perform a greater number of AL-cycles rather than re-evaluating existing structures, as the newly generated structures are expected to be more relevant as generated by an increasingly accurate interatomic potential.

About

Data efficient active learning for machine learning potentials

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •