With scientific machine learning becoming an ever larger mainstay at machine learning conferences, and with ever more venues and research centres appearing at the intersection of machine learning and the natural sciences and engineering, there exist ever more impressive examples of algorithms that combine the best of machine learning with deep scientific insight into the underlying problem to advance the field.
Below are a few prime examples of recent flagship algorithms in scientific machine learning, each of which embodies some of the best algorithmic approaches available to us today.
AlphaFold - predicts 3D protein structure given its sequence:
---
width: 500px
align: center
name: alphafold
---
AlphaFold model. (Source: [Jumper et al., 2021](https://www.nature.com/articles/s41586-021-03819-2))
GNS - capable of simulating the motion of water particles:
---
width: 500px
align: center
name: gns
---
GNS model. (Source: [Sanchez-Gonzalez et al., 2020](https://proceedings.mlr.press/v119/sanchez-gonzalez20a.html))
Codex - translating natural language to code:
---
width: 500px
align: center
name: codex
---
Codex demo (Source: [openai.com](https://openai.com/blog/openai-codex))
Geometric deep learning aims to generalize neural network models to non-Euclidean domains such as graphs and manifolds. Good examples of this line of research include:
SFCNN - steerable rotation equivariant CNN, e.g. for image segmentation
---
width: 500px
align: center
name: sfcnn
---
SFCNN model. (Source: [Weiler et al., 2018](https://arxiv.org/abs/1711.07289))
SEGNN - molecular property prediction model
---
width: 500px
align: center
name: segnn
---
SEGNN model. (Source: [Brandstetter et al., 2022](https://arxiv.org/abs/2110.02905))
Stable Diffusion - generating images from natural text description
---
width: 500px
align: center
name: stablediffusion
---
Stable Diffusion art. (Source: [stability.ai](https://stability.ai/blog/stable-diffusion-public-release))
---
width: 700px
align: center
name: stablediffusion_brain
---
Stable Diffusion brain signal reconstruction. (Source: [Takagi & Nishimoto, 2023](https://sites.google.com/view/stablediffusion-with-brain/?s=09))
ImageBind - Holistic AI learning across six modalities
---
width: 600px
align: center
name: imagebind
---
ImageBind modalities. (Source: [ai.meta.com](https://ai.meta.com/blog/imagebind-six-modalities-binding-ai/))
Scientific machine learning sits at the intersection of engineering, physics, chemistry, computational biology, etc. and core machine learning. Its aim is to improve existing scientific workflows, derive new scientific insight, or bridge the gap between scientific data and our current state of knowledge.
It is important to recall the difference in approaches between engineering & physics on the one side, and machine learning on the other:
Engineering & Physics
Models are derived from conservation laws, observations, and established physical principles.
Machine Learning
Models are derived from data, with priors imprinted on the model space either through the data itself or through the design of the machine learning algorithm.
There exist three main types of modern-day machine learning:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
In supervised learning we want to learn a mapping from inputs to outputs, for which we are given a labeled dataset
$$\mathcal{D}_{N} = \left\{ \left( x_{n}, y_{n} \right) \right\}_{n=1:N}$$ (supervised_data)
with $x_{n}$ the inputs and $y_{n}$ the corresponding outputs (labels).
Some also call it "glorified curve-fitting"
In regression, the target $y$ is a real-valued, continuous quantity.
---
width: 600px
align: center
name: 2d_regression_ex
---
2D regression example. (Source: {cite}`murphy2022`, Introduction)
Example of a response surface fitted to a number of data points in three dimensions, where in this instance the x- and y-axes span a two-dimensional space and the z-axis is the temperature at each location in that space.
In classification, the labels $y$ take values from a discrete set of classes, e.g. $y \in \{1, 2, \ldots, C\}$.
---
width: 350px
align: center
name: iris_classification
---
Classification example. (Source: {cite}`murphy2022`, Introduction)
Example of flower classification, where we aim to find the decision boundaries which sort each individual sample into its respective class.
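As a rough illustration of this workflow, here is a minimal sketch (not part of the original figure) of fitting a classifier to the iris flowers with scikit-learn; the specific model choice, logistic regression on two features, is an assumption made purely for brevity.

```python
# Minimal sketch: learn decision boundaries on the iris dataset.
# Assumes scikit-learn is available; logistic regression is an illustrative choice.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # keep sepal length & sepal width only, so the boundary lives in 2D

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
# Each new flower is assigned the class of the decision region it falls into.
print("predicted class:", clf.predict([[5.0, 3.5]]))
```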
In unsupervised learning, we only receive a dataset of inputs
$$\mathcal{D}_{N} = \left\{ x_{n} \right\}_{n=1:N}$$ (unsupervised_data)
without the respective outputs $y_{n}$.
The implicit goal here is to describe the system, and identify features in the high-dimensional inputs.
Two famous examples of unsupervised learning are clustering (e.g. k-means) and, especially, dimensionality reduction (e.g. principal component analysis), which is commonly used in engineering and scientific applications.
---
width: 400px
align: center
name: pca_clustering
---
Clustering based on principal components. (Source: {cite}`brunton2019`, Section 1.5)
Combining clustering with principal component analysis to show which samples have cancer, plotted in the coordinates of the first three principal components.
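To make this unsupervised pipeline concrete, below is a minimal sketch, assuming scikit-learn and synthetic data rather than the cancer dataset of the figure, that projects high-dimensional samples onto their first principal components and then clusters them with k-means.

```python
# Minimal sketch of unsupervised learning: reduce dimensionality with PCA,
# then cluster the low-dimensional representation with k-means.
# Assumes scikit-learn; the data here is synthetic, not the data of the figure.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 200 samples in 50 dimensions, consisting of two shifted groups
X = np.concatenate([
    rng.normal(0.0, 1.0, size=(100, 50)),
    rng.normal(3.0, 1.0, size=(100, 50)),
])

X_pca = PCA(n_components=3).fit_transform(X)                   # first 3 principal components
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_pca)    # cluster in PC space

print(X_pca.shape)          # (200, 3)
print(np.bincount(labels))  # roughly 100 samples per cluster
```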
The difference can furthermore be expressed in probabilistic terms: in supervised learning we are fitting a model over the outputs conditioned on the inputs, $p(y|x)$, whereas in unsupervised learning we are fitting an unconditional model over the inputs themselves, $p(x)$.
In reinforcement learning, an agent sequentially interacts with an unknown environment to obtain a trajectory of interactions (states, actions, and rewards), from which it learns to maximize its cumulative reward.
---
width: 500px
align: center
name: rl
---
Reinforcement learning overview. (Source: [lilianweng](https://lilianweng.github.io/posts/2018-02-19-rl-overview/))
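This interaction loop can be sketched in a few lines of Python. The snippet below assumes the gymnasium package and uses a random policy purely as a placeholder for the agent; it only illustrates how a trajectory of states, actions, and rewards is collected.

```python
# Minimal sketch of the agent-environment interaction loop in reinforcement
# learning. Assumes the gymnasium package; the "agent" is just a random policy.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

trajectory = []  # sequence of (state, action, reward) tuples
done = False
while not done:
    action = env.action_space.sample()               # random policy as placeholder
    next_obs, reward, terminated, truncated, info = env.step(action)
    trajectory.append((obs, action, reward))
    obs = next_obs
    done = terminated or truncated

print("episode length:", len(trajectory))
print("episode return:", sum(r for _, _, r in trajectory))
```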
Let's presume we have a simple regression problem, e.g. the one-dimensional example shown below.
---
width: 400px
align: center
name: lin_reg_1d
---
Linear regression example. (Source: {cite}`murphy2022`, Introduction)
We then have a number of scalar observations $x_{n}$ with corresponding targets $y_{n}$, to which we fit a polynomial model

$$y(x, \mathbf{w}) = w_{0} + w_{1} x + w_{2} x^{2} + \ldots + w_{M} x^{M} = \sum_{j=0}^{M} w_{j} x^{j}.$$ (polynomial_model)

A crucial choice is then the degree $M$ of the polynomial function.
This class of models is called Linear Models, because we only learn the linear scaling coefficients $\mathbf{w}$, given any choice of basis for the variable $x$, such as the polynomial basis shown here.
We can then construct an error function with the sum-of-squares approach, in which we compute the distance of every target data point to our polynomial,

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_{n}, \mathbf{w}) - y_{n} \right\}^{2},$$ (sum_of_squares_error)

which we then optimize for the value of the coefficients $\mathbf{w}$.
---
width: 400px
align: center
name: lin_reg_1d_distances
---
Linear regression error computation. (Source: {cite}`murphy2022`, Introduction)
To minimize this error we then have to take the derivative with respect to the coefficients $\mathbf{w}$ which we are optimizing for. Setting this derivative to 0 yields a system of linear equations for the minimizing coefficients, since the error is quadratic in $\mathbf{w}$.
This system can be solved by trusty old Gaussian elimination. A general problem with this approach is that the degree of the polynomial is a decisive factor, which often leads to over-fitting and hence makes this a less desirable approach. Gaussian elimination, or a matrix-inversion approach when implemented on a computer, can also be a computationally very expensive operation for large datasets.
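As a small numerical sketch of the above (with an illustrative noisy-sine dataset as an assumption), the least-squares solution can be computed directly from the polynomial design matrix with NumPy:

```python
# Minimal sketch of polynomial curve fitting by least squares (NumPy only).
# The noisy-sine data is an illustrative assumption, mirroring the classic example.
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                                    # number of points, polynomial degree
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=N)

# Design matrix with polynomial basis: Phi[n, j] = x_n ** j
Phi = np.vander(x, M + 1, increasing=True)

# Normal equations (Phi^T Phi) w = Phi^T y; lstsq is the numerically safer route
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = Phi @ w
sse = 0.5 * np.sum((y_hat - y) ** 2)            # sum-of-squares error E(w)
print("coefficients:", w)
print("sum-of-squares error:", sse)
```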
This is a special case of the Maximum Likelihood method.
Recap: Bayes Theorem
If we now seek to reformulate the curve-fitting in probabilistic terms, then we have to begin by expressing our uncertainty over the target $y$ with a probability distribution. For this we assume a Gaussian distribution over each target, centred on the polynomial value $y(x, \mathbf{w})$ and with precision $\beta$,
where $\beta$ corresponds to the inverse variance of the distribution. Presuming that the data points are drawn independently, the likelihood of the dataset is then
$$p(y|x, \mathbf{w}, \beta)=\prod^{N}_{n=1}\mathcal{N}(y_{n}|y(x_{n},\mathbf{w}), \beta^{-1}).$$ (bayesian_lin_reg_joint_likelihood)
---
width: 400px
align: center
name: bayesian_reg_1d
---
Bayesian regression example. (Source: {cite}`bishop2006`, Section 1.2)
Taking the logarithm of the likelihood, we are then able to derive the optimal parameters,
$$\text{ln } p(y|x, \mathbf{w}, \beta) = - \frac{\beta}{2} \sum^{N}_{n=1} \left\{ y(x_{n}, \mathbf{w}) - y_{n} \right\}^{2} + \frac{N}{2} \text{ln } \beta - \frac{N}{2} \text{ln }(2 \pi)$$ (bayesian_lin_reg_joint_likelihood_log)
which we can then maximize with respect to $\mathbf{w}$ and $\beta$.
If we consider the special case of optimizing only for $\mathbf{w}$ (the last two terms do not depend on $\mathbf{w}$), and instead of maximizing the log-likelihood we minimize the negative log-likelihood, then this is equivalent to the sum-of-squares error function.
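For completeness, maximizing the log-likelihood with respect to $\beta$ as well gives a closed-form estimate of the noise precision (following the standard treatment in {cite}`bishop2006`):

$$\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \left\{ y(x_{n}, \mathbf{w}_{ML}) - y_{n} \right\}^{2}.$$ (bayesian_lin_reg_beta_ml)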
The optimal maximum likelihood parameters $\mathbf{w}_{ML}$ and $\beta_{ML}$ obtained this way can then be resubstituted to obtain the predictive distribution for the targets,
$$p(y|x, \mathbf{w}_{ML}, \beta_{ML})=\mathcal{N}(y|y(x, \mathbf{w}_{ML}),\beta_{ML}^{-1}).$$ (bayesian_lin_reg_ml_sol)
To arrive at the full Bayesian curve-fitting approach we now have to apply the sum and product rules of probability
Recap: Sum Rules of (disjoint) Probability
Recap: Product Rules of Probability - for independent events
The Bayesian curve fitting formula is hence

$$p(y|x, \mathcal{D}_{N}) = \int p(y|x, \mathbf{w}) \, p(\mathbf{w}|\mathcal{D}_{N}) \, d\mathbf{w},$$ (bayesian_curve_fitting)

with the dependence on $\alpha$ and $\beta$ omitted for notational brevity. This predictive distribution can be evaluated analytically and is itself a Gaussian,

$$p(y|x, \mathcal{D}_{N}) = \mathcal{N}\left(y \,|\, m(x), s^{2}(x)\right),$$ (bayesian_curve_fitting_gaussian)

with mean and variance

$$m(x) = \beta \, \phi(x)^{\top} \mathbf{S} \sum_{n=1}^{N} \phi(x_{n}) \, y_{n}$$ (bayesian_curve_fitting_mean)

and

$$s^{2}(x) = \beta^{-1} + \phi(x)^{\top} \mathbf{S} \, \phi(x),$$ (bayesian_curve_fitting_variance)

and

$$\mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^{N} \phi(x_{n}) \, \phi(x_{n})^{\top},$$ (bayesian_curve_fitting_S)

where $\phi(x)$ is the vector of polynomial basis functions with elements $\phi_{i}(x) = x^{i}$ for $i = 0, \ldots, M$, and $\alpha$ is the precision of the Gaussian prior over the coefficients $\mathbf{w}$.

- The first term of the variance, $\beta^{-1}$, represents the prediction uncertainty caused by the noise on the target values.
- The second term of the variance is caused by the uncertainty in the parameters $\mathbf{w}$.
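The predictive mean and variance above can be evaluated directly. Below is a minimal NumPy sketch of these formulas; the values of $\alpha$, $\beta$, and the toy dataset are illustrative assumptions, not taken from the text.

```python
# Minimal sketch of the Bayesian curve-fitting predictive distribution
# (polynomial basis). alpha, beta, and the toy data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                    # data points, polynomial degree
alpha, beta = 5e-3, 11.1        # prior precision, noise precision
x = np.linspace(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=N)

def phi(x_query, M):
    """Polynomial basis vectors (1, x, x^2, ..., x^M) for each query point."""
    return np.power.outer(np.atleast_1d(x_query), np.arange(M + 1))

Phi = phi(x, M)                                   # (N, M+1) design matrix
S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
S = np.linalg.inv(S_inv)

x_star = 0.5                                      # query point
phi_star = phi(x_star, M)[0]
mean = beta * phi_star @ S @ Phi.T @ y            # m(x)
var = 1.0 / beta + phi_star @ S @ phi_star        # s^2(x): noise + parameter uncertainty

print(f"predictive mean m(x)={mean:.3f}, std s(x)={np.sqrt(var):.3f}")
```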
Formalizing the principle of maximum likelihood estimation, we proceed much like the computation of function extrema familiar from high school: we differentiate our likelihood function and then find its maximum by setting the derivative equal to zero,
where the derivative is taken with respect to the parameters of our model.
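As a concrete illustration (an assumption for these notes, not an example from the text above), consider the maximum likelihood estimate of a one-dimensional Gaussian: setting the derivative of the log-likelihood to zero yields the sample mean and the (biased) sample variance, which we can verify against a direct numerical optimization.

```python
# Minimal sketch: maximum likelihood for a 1D Gaussian. Setting the derivative
# of the log-likelihood to zero gives the sample mean and (biased) sample variance;
# here we check this against direct numerical optimization. Data values are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=200)

def neg_log_likelihood(params, x):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                    # parametrize sigma > 0
    return 0.5 * np.sum((x - mu) ** 2) / sigma**2 + x.size * log_sigma

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_ml, sigma_ml = res.x[0], np.exp(res.x[1])

print("closed form :", data.mean(), data.std())   # biased (ML) standard deviation
print("numerical   :", mu_ml, sigma_ml)
```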
- Supervised Learning: mapping from inputs $x_{n}$ to outputs $y_{n}$
- Unsupervised Learning: only receives the datapoints $x_{n}$, with no access to the true labels $y_{n}$
- Maximum Likelihood Principle
  - Polynomial curve fitting is a special case
  - Wasteful of training data, and tends to overfit
- Bayesian approach less prone to overfitting
- Using AI to Accelerate Scientific Discovery - inspirational video by Demis Hassabis