WMD Journal Club
2nd October 2018
Suzanne Wallace
ML_in_QM_JC_02-10-18
Motivation?
Machine learning (ML) overview
 • Subfield of artificial intelligence (AI)
 • Involves algorithms whose performance improves with data
 • Identify and exploit non-randomness in data, and use it for prediction or analysis
 • Types
   • Supervised
     • A mapping function from inputs x to labels y is learned from a training data set; the mapping function is then used to predict labels for new data
     • c.f. company on click episode with employees labelling pixels of images for image recognition models
     • e.g. regression
   • Unsupervised
     • e.g. dimensionality reduction
ML for quantum chemistry?
Examples of applications in QM
 • Ab initio molecular dynamics: learn the potential energy surface (PES)
 • Orbital-free DFT: learn the mapping from electron densities to their kinetic energy
 • Molecular property prediction: map molecules to property values
   • c.f. Dan’s work using ML to predict band gaps?
Principles of ML for quantum chemistry
 • ‘Similarity principle’: exploit redundancy
   • In QM, could avoid having to repeat calculations for similar systems
   • Interpolate between calculations to obtain approximate solutions for the remaining systems
   • Decisive factor is control of the interpolation error!
 • How far could we push this...?
   • ...train models with other models?
   • E.g. use ML to map out the PES of a molecule… but then use various PESs to predict the PES of other molecules, based on similarity of species and coordination environments...?
Main technical topics of the tutorial
 • Kernel-based ML methods + assessing model performance
   • (more details to follow…)
 • Numerical representation of system (descriptor)
   • Where domain knowledge for the specific system is important?
 • ‘The problem of learning a function from a finite sample of its values has no unique solution (there are infinitely many functions that are compatible with the training data) […] Essentially, one chooses the simplest model that is compatible with the data (Occam’s razor)’
 • c.f. Berkeley blog for interesting discussion of underfitting and overfitting (wrt Fukushima) → https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/
   • Overfitting may represent training data well, but perform poorly for unseen data
     • ‘too much predictive power to quirks in our training data’ [Berkeley blog]
   • Underfitting gives poor predictions for both training and unseen data
Importance of choice of method + testing the model once built
But now onto the nuts and bolts of this learning algorithm...
Kernel-based ML methods
(alternatives include artificial neural networks)
 • Central idea → derive non-linear versions of linear ML algorithms by mapping inputs into a higher-dimensional feature space and applying the linear algorithm there
 • ‘Kernel trick’ → re-write linear ML algorithms to use only inner products between inputs (norms, angles and distances between inputs can all be written in terms of inner products)
 • Functions called kernels operate on input-space vectors, but give the same results as evaluating inner products in feature space
 • Essentially, we are able to avoid explicit calculations in a high-dimensional feature space
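A minimal sketch of the kernel trick, assuming a degree-2 polynomial kernel on 2D inputs (this kernel and the feature map `phi` are illustrative choices, not taken from the tutorial). The inner product computed explicitly in the 3D feature space matches the kernel evaluated directly in input space:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (2D input -> 3D features)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, evaluated in input space only."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # inner product in feature space
implicit = poly2_kernel(x, z)             # same value, no feature map needed

print(explicit, implicit)
```

For the Gaussian kernel the feature space is infinite-dimensional, so only the input-space route is actually available.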
Dusting off the mathematical cobwebs…
 • ∀ = for all
 • ∈ = is an element of
 • ℝ = real numbers
 • ⟨x, z⟩ = dot product of vectors (inner product), a way of multiplying two vectors together to obtain a scalar (I was initially massively confused by use of the cross here…)
 • ‖x‖ = non-negative norm of a vector (a scalar value)
 • ‖x‖₂ = Euclidean norm (see later)
 • ‖x‖₁ = 1-norm (see later)
Kernel functions
 • Section outlines the general conditions that define an inner product of vectors in a given vector space
 • A kernel is a function that corresponds to an inner product in a feature space
   • A function is only a kernel if there exists a map from the input space into a feature space (we do not need to know its form; existence is sufficient)
 • A vector space with an inner product = an ‘inner product space’
 • Kernel functions allow computations in a high-dimensional feature space to be replaced by computations in input space
Specific kernels: linear
 • Simple, linear kernel → k(x, z) = ⟨x, z⟩
 • Has identical input and feature space
 • Equivalent to using the original linear algorithm (use as initial test for new system?)
 • Gives a linear regression model: f(z) = Σᵢ 𝛼ᵢ ⟨xᵢ, z⟩, with regression co-effs 𝛼ᵢ, training inputs xᵢ and input to predict z
[Figure: linear regression illustration, from www.matlabsolutions.com/blog/Tensorflow-Linear-regression-understanding-the-concept.php]
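A quick check of the 'equivalent to the original linear algorithm' point, as a sketch on made-up random data (the arrays and the value of `lam` are placeholders). Ridge regression solved for weights in input space and kernel ridge regression with the linear kernel give the same predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # 10 training inputs, 3 features (made-up data)
y = rng.normal(size=10)        # training labels
Z = rng.normal(size=(4, 3))    # inputs to predict
lam = 0.1                      # regularization strength

# Primal ridge regression: solve for weights w in input space
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
primal_pred = Z @ w

# Kernel ridge regression with linear kernel k(x, z) = <x, z>
K = X @ X.T                                       # kernel matrix of training inputs
alpha = np.linalg.solve(K + lam * np.eye(10), y)  # one co-eff per training input
dual_pred = (Z @ X.T) @ alpha

print(np.max(np.abs(primal_pred - dual_pred)))
```

The kernel version only ever touches inner products between inputs, which is what lets the linear kernel be swapped for a non-linear one.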
Specific kernels: Gaussian
(also called the squared exponential kernel or radial basis function kernel)
 • Non-linear kernel → k(x, z) = exp(−‖x − z‖₂² / (2𝜎²))
   • (non-linear → change of output not proportional to change of input)
 • Maps into an infinite-dimensional feature space
 • 𝜎 > 0 → hyperparameter determining the length scale on which the kernel operates
   • Something to tune for optimal model performance
   • Limiting cases 𝜎 → 0 and 𝜎 → ∞ relate to overfitting and underfitting respectively
 • For intermediate values of 𝜎, the kernel value depends on ‖x − z‖/𝜎
   • Kernel approaches 1 as ‖x − z‖/𝜎 → 0
   • Kernel approaches 0 as ‖x − z‖/𝜎 → ∞
 • Samples close in input space are correlated in feature space
 • Samples far away, however, are mapped to orthogonal subspaces
Gaussian kernel is local approximator
where scale depends on 𝜎
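To make the limiting cases concrete, a small sketch assuming the common form k(x, z) = exp(−‖x − z‖² / (2σ²)) (the three 1D sample points and the σ values are made up). For tiny σ the kernel matrix tends to the identity (every sample only 'sees' itself), and for huge σ it tends to a matrix of ones (every sample looks the same):

```python
import numpy as np

def gaussian_kernel_matrix(X, Z, sigma):
    """Gaussian kernel between all rows of X and all rows of Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0], [1.0], [2.0]])  # three made-up 1D samples

K_small = gaussian_kernel_matrix(X, X, sigma=0.01)    # sigma -> 0
K_mid = gaussian_kernel_matrix(X, X, sigma=1.0)       # intermediate sigma
K_large = gaussian_kernel_matrix(X, X, sigma=1000.0)  # sigma -> infinity

print(np.round(K_small, 3))
print(np.round(K_mid, 3))
print(np.round(K_large, 3))
```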
Specific kernels: Laplacian
 • Similar to Gaussian
 • Also uses exp, but with the 1-norm instead of the Euclidean norm (see next)
 • Demonstrated to perform better for molecular properties in refs 45 and 59-61
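Aside: 1-norm vs. Euclidean norm
(source: Wikipedia)
The one we all know and love (Euclidean norm): ‖x‖₂ = √(x₁² + … + xₙ²)
Summat to do with taxi drivers and the American road system (1-norm, a.k.a. taxicab or Manhattan norm): ‖x‖₁ = |x₁| + … + |xₙ|
(kind of like a constrained norm?)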
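A small sketch of the two norms and the kernels built on them. The forms exp(−‖x − z‖₂² / (2σ²)) and exp(−‖x − z‖₁ / σ) are the usual textbook definitions and are assumed here rather than copied from the slides; the vectors are made up:

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([4.0, 6.0])
diff = x - z  # [-3, -4]

euclidean = float(np.sqrt(np.sum(diff ** 2)))  # straight-line distance
one_norm = float(np.sum(np.abs(diff)))         # taxicab distance along the grid

def gaussian_kernel(x, z, sigma):
    """Gaussian kernel: exponential of the squared Euclidean distance."""
    return float(np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2)))

def laplacian_kernel(x, z, sigma):
    """Laplacian kernel: exponential of the 1-norm distance."""
    return float(np.exp(-np.sum(np.abs(x - z)) / sigma))

print(euclidean, one_norm)
print(gaussian_kernel(x, z, sigma=5.0), laplacian_kernel(x, z, sigma=7.0))
```

The 1-norm is never smaller than the Euclidean norm, and the Laplacian kernel has a cusp at x = z where the Gaussian is smooth.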
Regression methods
Multiple linear regression
 • Find co-effs that minimize the generalization error (av. error on new inputs)
 • However, in practice, due to the finite size of the training set, can only minimize the empirical error → care must be taken to avoid over- or underfitting
Ridge regression
 • Adds regularization to avoid overfitting
   • Increases bias but reduces variance
   • c.f. bias-variance tradeoff, e.g. https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/
 • Adds a penalty term whose strength is determined by the hyperparameter 𝜆 (larger values give simpler and smoother models)
Kernel ridge regression
 • Applying the ‘kernel trick’ to linear ridge regression → nonlinear version
[Equation annotations: (linear model in d dimensions, each weighted by a regression co-eff) + (term to allow modelling functions that do not pass through the origin)]
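A minimal sketch of kernel ridge regression with a Gaussian kernel, assuming the standard closed-form solution α = (K + λI)⁻¹y. The function names, the sine toy data and the hyperparameter values are illustrative, not from the tutorial:

```python
import numpy as np

def gaussian_kernel_matrix(X, Z, sigma):
    """Gaussian kernel between all rows of X and all rows of Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def krr_fit(X_train, y_train, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression coefficients alpha."""
    K = gaussian_kernel_matrix(X_train, X_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

def krr_predict(X_train, alpha, X_new, sigma):
    """Prediction f(z) = sum_i alpha_i k(x_i, z) for each row z of X_new."""
    return gaussian_kernel_matrix(X_new, X_train, sigma) @ alpha

# Toy problem: learn sin(x) from 20 samples
X_train = np.linspace(0.0, 2.0 * np.pi, 20)[:, None]
y_train = np.sin(X_train[:, 0])

alpha = krr_fit(X_train, y_train, sigma=1.0, lam=1e-6)
X_new = np.array([[1.0], [2.5]])
y_pred = krr_predict(X_train, alpha, X_new, sigma=1.0)

print(y_pred, np.sin(X_new[:, 0]))
```

Training is a single linear solve in the number of training samples, so the cost is set by the size of the training set rather than by the dimension of the feature space.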
Implementation
Importance of model selection
 • How to choose between different ML models?
   • Choice of kernel, k
 • How to choose hyperparameters?
   • e.g. 𝜆 for regularisation
   • and 𝜎 if using Gaussian or Laplacian kernels
 • Regression coefficients 𝛼 and 𝛽, for set hyperparameters, are determined by the kernel?
   • Therefore the choice is dependent upon the quality of the training set? (methods such as bootstrapping and cross-validation allow for re-use of data if the set is small)
 • Occam’s razor as general guiding principle → use the simplest model that fits the data
Estimating model performance
 • ‘Risk of model’ f → R(f), the expected value of a loss function measuring the error of a prediction
 • R has to be estimated from a finite set of training data as the empirical risk
 • Again, use regularization to avoid over-fitting to the training set
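As a sketch of what 'empirical risk' means in practice: average the loss over the finite sample available. The squared loss, the toy model `f` and the data points below are assumptions for illustration only:

```python
import numpy as np

def squared_loss(y_pred, y_true):
    """Loss function measuring the error of each prediction."""
    return (y_pred - y_true) ** 2

def empirical_risk(f, X, y, loss=squared_loss):
    """Mean loss of model f over a finite sample (X, y)."""
    return float(np.mean(loss(f(X), y)))

# Made-up sample and a trivial model f(x) = x
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.8])

def f(x):
    return x

risk = empirical_risk(f, X, y)
print(risk)
```

Evaluated on the training set this number is optimistic, which is why a separate hold-out set is used later on.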
[Figure: kernel ridge regression fits to 5 data points sampled from a cosine function, for different values of the hyperparameter 𝜎]
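A rough re-creation of the idea behind this figure, not the tutorial's actual code: fit 5 points of a cosine with a Gaussian kernel for a small, intermediate and large σ. The specific σ values, λ and the test grid are assumptions:

```python
import numpy as np

def gaussian_kernel_matrix(x, z, sigma):
    """Gaussian kernel between two sets of 1D points."""
    sq_dists = (x[:, None] - z[None, :]) ** 2
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

x_train = np.linspace(0.0, 2.0 * np.pi, 5)  # 5 training points
y_train = np.cos(x_train)
x_test = np.linspace(0.0, 2.0 * np.pi, 200)  # dense grid for checking the fit
y_test = np.cos(x_test)
lam = 1e-3                                   # fixed regularization strength

rmse = {}
for sigma in (0.05, 1.0, 1000.0):
    K = gaussian_kernel_matrix(x_train, x_train, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)
    y_pred = gaussian_kernel_matrix(x_test, x_train, sigma) @ alpha
    rmse[sigma] = float(np.sqrt(np.mean((y_pred - y_test) ** 2)))
    print(f"sigma = {sigma}: RMSE on dense grid = {rmse[sigma]:.3f}")
```

The smallest σ reproduces the training points but collapses to zero between them (overfitting), the largest gives a nearly constant prediction (underfitting), and the intermediate value tracks the cosine.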
E.g. for predicting atomization energies
 • Use 1k reference DFT calculations of atomization energies of organic molecules to estimate the energies of the remaining molecules in the full set of 7k molecules (dataset and notebook for the e.g. included in SI)
Considerations for…
Preparation of training dataset
 • How large and homogeneous?
   • In this e.g. the set is inhomogeneous wrt no. of non-H atoms, so had to include all molecules with four or fewer → requires insight into relevant inhomogeneities?
 • Split into the training set and ‘hold out’ set → requires insight into the size of the sets?
   • Can use methods such as cross-validation to reuse data if the dataset is fairly small
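The split and the re-use of data via cross-validation can be sketched with plain index bookkeeping. The 7000/1000 sizes echo the example above, while the helper names, the random seed and the choice of 5 folds are assumptions:

```python
import numpy as np

def train_holdout_split(n_samples, n_train, seed=0):
    """Randomly pick n_train indices for training; the rest form the hold-out set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    return perm[:n_train], perm[n_train:]

def k_fold_indices(indices, k):
    """Yield (train, validation) index arrays, each fold acting as validation once."""
    folds = np.array_split(indices, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

train_idx, holdout_idx = train_holdout_split(n_samples=7000, n_train=1000)
splits = list(k_fold_indices(train_idx, k=5))

print(len(train_idx), len(holdout_idx), len(splits))
```

Note that a purely random split like this ignores the inhomogeneity issue raised above; forcing all the smallest molecules into the training set would need an extra step.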
Representation of data
Considerations for…
The model
 • Choose kernel
 • Choose hyperparameters 𝜆 and 𝜎; then, for the chosen params:
   • Compute kernel matrices
   • Algorithm computes regression coefficients
   • Compute prediction performance statistics (using ‘hold out’ dataset)
 • Perform grid search to determine values of 𝜆 and 𝜎 for best performance
   • … Although is this not also influenced by the regression coefficients, which are determined by the initial choice of hyperparameters?
 • Try a different kernel (and repeat above steps)
 • Compare performance with different kernels
[Figure: grid search to determine values of 𝜆 and 𝜎 for best performance]
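A sketch of such a grid search on made-up noisy cosine data (the grid values, data sizes and split are all assumptions). The regression coefficients are re-solved for every (𝜎, 𝜆) pair, so they are an output of each grid point rather than a separate choice:

```python
import numpy as np

def gaussian_kernel_matrix(X, Z, sigma):
    """Gaussian kernel between all rows of X and all rows of Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def validation_rmse(X_tr, y_tr, X_val, y_val, sigma, lam):
    """Fit kernel ridge regression on the training part, score on the validation part."""
    K = gaussian_kernel_matrix(X_tr, X_tr, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr)
    pred = gaussian_kernel_matrix(X_val, X_tr, sigma) @ alpha
    return float(np.sqrt(np.mean((pred - y_val) ** 2)))

# Made-up data: noisy cosine, 40 training and 20 validation samples
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0 * np.pi, size=(60, 1))
y = np.cos(X[:, 0]) + 0.05 * rng.normal(size=60)
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

sigmas = [0.01, 0.1, 1.0, 10.0]
lams = [1e-6, 1e-3, 1.0]

results = {
    (sigma, lam): validation_rmse(X_tr, y_tr, X_val, y_val, sigma, lam)
    for sigma in sigmas
    for lam in lams
}
best = min(results, key=results.get)
print("best (sigma, lam):", best, "RMSE:", results[best])
```

To compare kernels, the same loop would be repeated with a different kernel function and the best scores compared.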
[Figure: comparing predictions using different kernels]
Key themes/ central ideas to method
 • Exploit non-randomness in data to avoid having to perform additional QM calculations
 • Minimise the interpolation error when using some QM calculations to predict the results of others
 • Kernel-based ML methods systematically derive nonlinear versions of linear ML algorithms
   • Avoid costly evaluations in a high-dimensional feature space through use of kernels (inner products evaluated in input space)
 • Avoid underfitting or overfitting to training data (we want a general model but, due to the finite training data set, this will always be an empirical fit)
   • Principle of Occam’s razor
   • Choice of kernel
   • Choice of kernel hyperparameters to minimise underfitting or overfitting
   • Regularization to penalize overfitting
 • Build, optimize, tweak, repeat, optimize, tweak, repeat!
…verdict!
How useful as a starting point?
 • A little hard to follow at the start of the more technical bits / to gauge what the point of all the definitions was + some notation hard to follow, especially use of the cross when discussing dot products
   • Had to google a fair bit!
 • Later sections kind of easier to follow + nice example at the end
   • Possibly easier to follow by not reading in order? Or re-reading the start after!
 • …but nice plots to explain use of different kernels and influence of hyperparameters on fits!
