ML_in_QM_JC_02-10-18

WMD Journal Club
2nd October 2018
Suzanne Wallace

Machine learning (ML) overview
 Subfield of artificial intelligence (AI)
 Involves algorithms whose performance improves with data
 Identify/ exploit non-randomness in data and use for prediction or analysis
 Types
 Supervised
 Mapping function is learned to map inputs x to labels y with a training data set;
the mapping function is then used to predict labels for new data
 c.f. company on click episode with employees labelling pixels of images for image recognition models
 e.g. regression
 Unsupervised
 e.g. dimensionality reduction

ML for quantum chemistry?
Examples of applications in QM
 Ab initio molecular dynamics to learn potential energy surface (PES)
 Orbital-free DFT to learn mapping from electron densities to their kinetic energy
 Molecular property prediction to map molecules to property values
 c.f. Dan’s work using ML to predict band gaps?

Principles of ML for quantum chemistry
 ‘Similarity principle’ – exploit redundancy
 In QM, could avoid having to repeat calculations for similar systems
 Interpolate between calculations to obtain approximate solutions
for the remaining systems
 Decisive factor is control of the interpolation error!
 How far could we push this...?
 ...train models with other models?
 E.g. use ML to map out PES of a molecule… but use various PES’s to
predict PES of other molecules based on similarity of species and
coordination environments...?

Main technical topics of the tutorial
 Kernel-based ML methods + assessing model performance
 (more details to follow…)
 Numerical representation of system (descriptor), e.g.
 Where domain knowledge for specific system is important?
 ‘The problem of learning a function from a finite sample of its values has no
unique solution (there are infinitely many functions that are compatible with
the training data) […]Essentially, one chooses the simplest model that is
compatible with the data (Occam’s razor)’
 c.f. Berkeley blog for interesting discussion of underfitting and overfitting (wrt
Fukushima)  https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/
 Overfitting may represent training data well, but perform poorly for unseen data
 ‘too much predictive power to quirks in our training data’ [cite Berkeley blog]
 Underfitting will just give nonsense for training and unseen data
Importance of choice of method + testing model once built
But now onto the nuts and bolts of this learning algorithm...

Kernel-based ML methods
(alternatives include artificial neural networks)
 Central idea  derive non-linear versions of ML algorithms by mapping
inputs into a higher dimensional space and applying the linear algorithm there
 ‘Kernel trick’  re-write linear ML algorithms to use only inner products
between inputs (norms, angles, distances between inputs)
 Functions called kernels operate on input space vectors, but give the same
results as evaluating inner products in feature space
 Essentially, are able to avoid explicit calculations in a high-dimensional
feature space
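A minimal sketch of the kernel trick, assuming a degree-2 polynomial kernel on 2-D inputs (picked purely for illustration, it is not one of the kernels covered in these slides). The names `feature_map`, `dot` and `poly2_kernel` are hypothetical. The point is that the kernel, evaluated in input space, gives the same number as an explicit inner product in feature space:

```python
import math

def feature_map(x):
    """Explicit map of a 2-D input into a 3-D feature space (assumed example)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """k(x, z) = <x, z>^2, computed entirely in input space."""
    return dot(x, z) ** 2

x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(feature_map(x), feature_map(z))  # inner product in feature space
via_kernel = poly2_kernel(x, z)                 # same value, no feature map needed
print(explicit, via_kernel)
```

For a Gaussian kernel the feature space is infinite-dimensional, so only the kernel route is practical.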

Dusting off the mathematical cobwebs…
 ∀ = for all
 ∈ = is an element of
 ℝ = real numbers
 ⟨x, z⟩ = dot product of vectors (inner product), way of multiplying
two vectors together to obtain a scalar (I was initially massively confused by
use of the cross here…)
 ‖x‖ = non-negative norm of a vector (a scalar value)
 ‖x‖₂ = Euclidean norm (see later)
 ‖x‖₁ = 1-norm (see later)

Kernel functions
 Section outlines various general conditions for an inner product of vectors in a
given vector space
 A kernel is a function that corresponds to an inner product in a feature space
 A function is only a kernel if there exists a map between the vector space and feature
space (but do not need to know form, existence is sufficient)
 A vector space with an inner product = an ‘inner product space’
 Kernel functions allow replacing computations in high-dimensional feature space
by computations in input space

Specific kernels: linear
Simple, linear kernel  k(x, z) = ⟨x, z⟩
 Has identical input and feature space 
 Equivalent of using original linear algorithm (use as initial test for new system?)
 Gives a linear regression model
(Figure: linear regression illustration, source: www.matlabsolutions.com/blog/Tensorflow-Linear-regression-understanding-the-concept.php)
(Equation labels: regression co-effs, training inputs, inputs to predict)
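The three equation labels on this slide presumably annotate the prediction formula for the linear-kernel model. A hedged reconstruction, where the symbols are assumptions rather than recovered from the slide (𝛼ᵢ for the regression co-effs, xᵢ for the n training inputs, z for the input to predict):

```latex
f(z) = \sum_{i=1}^{n} \alpha_i \, k(x_i, z)
     = \sum_{i=1}^{n} \alpha_i \, \langle x_i, z \rangle
```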

Specific kernels: Gaussian
(or squared exponential kernel or radial basis function kernel)
 Non-linear kernel  k(x, z) = exp(−‖x − z‖₂² / (2𝜎²))
 (non-linear  change of output not proportional to change of input)
 Maps into an infinite-dimensional feature space
 𝝈>0  hyperparameter determining the length scale on which the kernel
operates
 Something to tune for optimal model performance
 Limiting cases for 𝜎  0 or ∞ relate to overfitting and underfitting respectively
 For intermediate values of 𝜎, kernel value depends on ‖x − z‖/𝜎
 Kernel approaches 1 as ‖x − z‖/𝜎  0
 Kernel approaches 0 as ‖x − z‖/𝜎  ∞
 Samples close in input space are correlated in feature space
 Samples faraway however are mapped to orthogonal subspaces
Gaussian kernel is a local approximator, where the scale depends on 𝝈
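A small sketch of the Gaussian kernel and its limiting behaviour, assuming the common convention with 2𝜎² in the denominator (other texts absorb the factor of 2 into 𝜎). The function name `gaussian_kernel` is hypothetical:

```python
import math

def gaussian_kernel(x, z, sigma):
    """k(x, z) = exp(-||x - z||_2^2 / (2 sigma^2)), with sigma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, z = (0.0, 0.0), (1.0, 1.0)

# Same pair of points, three length scales: the kernel value goes from
# "unrelated" (near 0) to "indistinguishable" (near 1) as sigma grows.
for sigma in (0.1, 1.0, 100.0):
    print(f"sigma={sigma:>6}: k(x, z) = {gaussian_kernel(x, z, sigma):.4f}")
```

With a very small 𝜎 every sample only resembles itself (the overfitting limit); with a very large 𝜎 all samples look alike (the underfitting limit).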

Specific kernels: Laplacian
 Similar to Gaussian
 Uses exp, but with the 1-norm instead of the
Euclidean norm (see next)
 Demonstrated to perform better for
molecular properties in refs 45 and 59-61
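A matching sketch for the Laplacian kernel, assuming the form exp(−‖x − z‖₁/𝜎) (the exact scaling of 𝜎 varies between texts). The function name `laplacian_kernel` is hypothetical:

```python
import math

def laplacian_kernel(x, z, sigma):
    """k(x, z) = exp(-||x - z||_1 / sigma), with sigma > 0."""
    l1_dist = sum(abs(a - b) for a, b in zip(x, z))
    return math.exp(-l1_dist / sigma)

x, z = (0.0, 0.0), (1.0, 1.0)

# The 1-norm distance between x and z is 2, so with sigma = 2 the
# exponent is exactly -1.
value = laplacian_kernel(x, z, sigma=2.0)
print(value)
```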

Aside: 1-norm vs. Euclidean norm
(source wikipedia)
The one we all know and love:
Summat to do with taxi drivers and the American road system:
(kind of like a constrained norm?)
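The two formulas did not survive on this slide; the standard definitions for a vector x with d components are:

```latex
\|x\|_2 = \sqrt{\sum_{i=1}^{d} x_i^2},
\qquad
\|x\|_1 = \sum_{i=1}^{d} |x_i|
```

For example, x = (3, −4) has ‖x‖₂ = 5 (straight-line distance) and ‖x‖₁ = 7 (distance along a grid of streets).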

Regression methods
Multiple linear regression
 Find co-effs to minimize generalization error (av. error on new inputs)
 However, in practice, due to finite size of training set can only minimize empirical
error  care must be taken to avoid over or underfitting
Ridge regression
 Added regularization to avoid overfitting
 Increases bias but reduces variance
 c.f. bias-variance tradeoff, e.g. https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/
 Adds a penalty term where strength of regularization is determined by
hyperparameter, 𝜆 (larger values give simpler and smoother models)
Kernel ridge regression
 Applying ‘kernel trick’ to linear ridge regression  nonlinear version
(linear model in d dimensions, each weighted by a regression co-eff) +
(term to allow modelling functions that do not pass through the origin)
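The three regression methods can be summarised in equations. This is a hedged sketch in standard textbook form, not copied from the slides: the symbols 𝛽ⱼ, 𝛽₀, 𝛼ᵢ and K are assumptions, the penalty is taken over 𝛽₁…𝛽_d only, and the kernel version is written without a separate offset term.

```latex
\begin{align}
  % Multiple linear regression: weighted inputs plus an offset
  f(x) &= \sum_{j=1}^{d} \beta_j x_j + \beta_0 \\
  % Ridge regression: empirical squared error plus a penalty of strength lambda
  \min_{\beta} \; & \sum_{i=1}^{n} \bigl( f(x_i) - y_i \bigr)^2
                    + \lambda \, \|\beta\|_2^2 \\
  % Kernel ridge regression: same idea after the kernel trick
  f(z) &= \sum_{i=1}^{n} \alpha_i \, k(x_i, z),
  \qquad \alpha = (K + \lambda I)^{-1} y,
  \qquad K_{ij} = k(x_i, x_j)
\end{align}
```

Larger 𝜆 shrinks the coefficients more strongly, which is where the simpler, smoother models come from.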

Implementation
Importance of model selection
 How to choose between different ML models
 Choice of kernel, k
 How to choose hyperparameters?
 e.g. 𝜆 for regularisation
 and 𝜎 if using Gaussian or Laplacian kernels
 Regression coefficients 𝛼 and 𝛽 for set hyperparameters determined by kernel?
 Therefore choice is dependent upon quality of training set? (methods such as bootstrapping and
cross-validation allow for re-use of data if set is small)
 Occam’s razor as general guiding principle  use simplest model that fits data
Estimating model performance
 ‘Risk of model’ f  R(f), the expected value of a loss function measuring the error of a prediction
 R has to be estimated from a finite set of training data as the empirical risk
 Again, use regularization to avoid over-fitting to training set
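A minimal sketch of the empirical risk as the average loss over a finite sample, assuming a squared-error loss. The toy data and the two models (`good_model`, `poor_model`) are hypothetical and only there to show the calculation:

```python
def squared_loss(prediction, label):
    """Loss function measuring the error of a single prediction."""
    return (prediction - label) ** 2

def empirical_risk(model, inputs, labels, loss=squared_loss):
    """Average loss of `model` over a finite set of (input, label) pairs."""
    return sum(loss(model(x), y) for x, y in zip(inputs, labels)) / len(inputs)

# Hypothetical toy data: labels follow y = 2x exactly.
inputs = [0.0, 1.0, 2.0, 3.0]
labels = [0.0, 2.0, 4.0, 6.0]

def good_model(x):
    return 2.0 * x

def poor_model(x):
    return 3.0  # ignores the input entirely

risk_good = empirical_risk(good_model, inputs, labels)
risk_poor = empirical_risk(poor_model, inputs, labels)
print(risk_good, risk_poor)
```

A low empirical risk on the training set alone says little about new inputs, which is why the risk is also evaluated on data the model has not seen.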

Kernel ridge regression fits to 5 data points to
represent cosine function with different values of
hyperparameter, 𝜎
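The figure described in this caption can be approximated with a short NumPy sketch. Everything specific here is an assumption: the five training points are taken evenly spaced on [0, 2π], the kernel is Gaussian with 2𝜎² in the denominator, 𝜆 is fixed at 1e-4, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(a, b, sigma):
    """Pairwise k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2)) for 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

def krr_fit(x_train, y_train, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression coefficients alpha."""
    K = gaussian_kernel_matrix(x_train, x_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def krr_predict(x_train, alpha, x_new, sigma):
    """f(z) = sum_i alpha_i k(x_i, z) for every z in x_new."""
    return gaussian_kernel_matrix(x_new, x_train, sigma) @ alpha

x_train = np.linspace(0.0, 2.0 * np.pi, 5)   # 5 training points
y_train = np.cos(x_train)
x_test = np.linspace(0.0, 2.0 * np.pi, 200)  # dense grid for checking the fit

rmse = {}
for sigma in (0.05, 1.5, 50.0):
    alpha = krr_fit(x_train, y_train, sigma, lam=1e-4)
    pred = krr_predict(x_train, alpha, x_test, sigma)
    rmse[sigma] = float(np.sqrt(np.mean((pred - np.cos(x_test)) ** 2)))
    print(f"sigma={sigma:>5}: RMSE on dense grid = {rmse[sigma]:.3f}")
```

The smallest 𝜎 reproduces the training points but is close to zero everywhere else, the largest 𝜎 gives an almost flat curve, and the intermediate value follows the cosine closely.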

E.g. for predicting atomization energies
 Use 1k reference DFT calculations for atomization energy of organic
molecules to estimate for remaining molecules in full set of 7k molecules
(dataset and notebook for e.g. included in SI)

Considerations for…
Preparation of training dataset
 How large and homogeneous?
 In this e.g., inhomogeneous wrt no. of non-H atoms (so had to include all with
four or fewer)  requires insights for relevant inhomogeneities?
 Split the training set and ‘hold out’ set  requires insights for size of sets?
 Can use methods such as cross-validation to reuse data if dataset is fairly small
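A minimal sketch of how k-fold cross-validation reuses a small dataset: every sample is used for validation exactly once and for training in all other folds. The helper `k_fold_indices` and the sizes used are hypothetical:

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, validation_idx) pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)                  # reproducible shuffle
    folds = [indices[i::n_folds] for i in range(n_folds)]  # disjoint folds
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train, validation in splits:
    print("train:", sorted(train), "validation:", sorted(validation))
```

A single train / ‘hold out’ split is the special case where only one of these pairs is used.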
Representation of data

Considerations for…
The model
 Choose kernel
 Choose hyperparameters 𝜆 and 𝜎; then, for the chosen params:
 Compute kernel matrices
 Algorithm computes regression coefficients
 Compute prediction performance statistics (using ‘hold out dataset’)
 Perform grid search to determine values of 𝜆 and 𝜎 for best
performance
 … Although is this not also influenced by regression coefficients which are
determined by initial choice of hyperparameters?
 Try different kernel (and repeat above steps)
 Compare performance with different kernels
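The steps above can be sketched as a small grid search with NumPy. This is an illustration under stated assumptions, not the notebook from the SI: the data are a synthetic noisy sine curve, the kernel is Gaussian, the grids of 𝜎 and 𝜆 values are arbitrary, and all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma):
    """Pairwise Gaussian kernel between the rows of A and the rows of B."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def holdout_rmse(X_tr, y_tr, X_ho, y_ho, sigma, lam):
    """Fit kernel ridge regression on the training set, score on the hold-out set."""
    K = gaussian_kernel_matrix(X_tr, X_tr, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr)  # regression co-effs
    pred = gaussian_kernel_matrix(X_ho, X_tr, sigma) @ alpha
    return float(np.sqrt(np.mean((pred - y_ho) ** 2)))

# Synthetic stand-in for reference calculations (assumed, not real QM data).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(120, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=120)

X_tr, y_tr = X[:80], y[:80]   # training set
X_ho, y_ho = X[80:], y[80:]   # 'hold out' set

sigmas = [0.01, 0.1, 1.0, 10.0]
lams = [1e-6, 1e-4, 1e-2, 1.0]

# The coefficients alpha are re-solved for every (sigma, lam) pair.
results = {(s, l): holdout_rmse(X_tr, y_tr, X_ho, y_ho, s, l)
           for s in sigmas for l in lams}
best_sigma, best_lam = min(results, key=results.get)
print(f"best sigma={best_sigma}, lambda={best_lam}, "
      f"hold-out RMSE={results[(best_sigma, best_lam)]:.3f}")
```

Because the coefficients are recomputed inside the loop, each grid point is scored with the coefficients that belong to it. Repeating the search with a different kernel function gives the comparison between kernels.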

Grid search to determine values of 𝜆 and 𝜎
for best performance

Comparing predictions using different
kernels

Key themes/ central ideas to method
 Exploit non-randomness in data to avoid having to perform additional QM calculations
 Minimising interpolation error when using some QM calculations to predict results of
others
 Kernel-based ML methods systematically derive nonlinear versions of linear ML
algorithms
 Avoid costly evaluations in a high-dimensional feature space through use of inner
products
 Avoiding underfitting or overfitting to training data (since we want a general model but
due to finite training data set, this will always be an empirical fit)
 Principle of Occam’s razor
 Choice of kernel
 Choice of kernel hyperparameters to minimise underfitting or overfitting
 Regularization to penalize for overfitting
 Build, optimize, tweak, repeat, optimize, tweak, repeat!

…verdict!
How useful as a starting point?
 A little hard to follow at the start of the more technical bits/ gauge what the
point of all the definitions was + some notation hard to follow, especially use of
cross when discussing dot products
 Had to google a fair bit!
 Later sections kind of easier to follow + nice example at the end
 Possibly easier to follow by not reading in order? Or re-reading the start after!
 …but nice plots to explain use of different kernels and influence of
hyperparameters on fits!
