Skip to content
PaperThe following article is Open access

Unified representation of molecules and crystals for machine learning

and

Published 21 November 2022 © 2022 The Author(s). Published by IOP Publishing Ltd
, , Citation Haoyan Huo and Matthias Rupp 2022 Mach. Learn.: Sci. Technol. 3 045017DOI 10.1088/2632-2153/aca005

2632-2153/3/4/045017

Abstract

Accurate simulations of atomistic systems from first principles are limited by computational cost. In high-throughput settings, machine learning can reduce these costs significantly by accurately interpolating between reference calculations. For this, kernel learning approaches crucially require a representation that accommodates arbitrary atomistic systems. We introduce a many-body tensor representation that is invariant to translations, rotations, and nuclear permutations of same elements, unique, differentiable, can represent molecules and crystals, and is fast to compute. Empirical evidence for competitive energy and force prediction errors is presented for changes in molecular structure, crystal chemistry, and molecular dynamics using kernel regression and symmetric gradient-domain machine learning as models. Applicability is demonstrated for phase diagrams of Pt-group/transition-metal binary systems.

Export citation and abstractBibTeXRIS

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The computational study of atomistic systems such as molecules and crystals requires accurate treatment of interactions at the atomic and electronic scale. Accurate first-principles methods, however, are limited by their high computational cost. In settings that require many calculations, such as dynamics simulations, phase diagrams, or high-throughput searches, machine learning (ML) [1, 2] can reduce overall costs by orders of magnitude via accurate interpolation between reference calculations [35]. For this, the problem of repeatedly solving a complex equation such as Schrödinger’s equation for many related inputs is mapped onto a non-linear regression problem: instead of numerically solving new systems, they are statistically estimated based on a reference set of known solutions [6, 7]. This ansatz enables, among other applications, screening larger databases of molecules and materials [5, 8], running longer dynamics simulations [9], investigating larger systems [10], and increasing the accuracy of calculations [5, 11].

Kernel-based ML models [1215] for data-efficient accurate prediction of ab initio properties require a single space in which regression is carried out. Representations [16] are functions that map atomistic systems to elements in such spaces, either directly or via a kernel [17]. Representations should be (a) invariant against transformations preserving the predicted property, in particular translations, rotations, and nuclear permutations of same elements, as learning these invariances from data would require many reference calculations; non-scalar properties can require equivariance instead of invariance; (b) unique, that is, variant against transformations changing the property, as systems with identical representation that differ in property would introduce errors [18]; (c) continuous, and ideally differentiable, as discontinuities work against the smoothness assumption of the ML model and model gradients are often useful; (d) general in the sense of being able to encode any atomistic system, including finite and periodic systems; (e) fast to compute, as the goal is to reduce computational cost; (f) data-efficient in the sense of requiring few reference calculations to reach a given target error. Constant size is an advantage, [19] as is the ability to encode the whole system as well as local atomic environments. Requirements (e) and (f) are in practice determined empirically. See [16, 2022] for details on these and further requirements.

Some representations fulfill these requirements only partially, such as the Coulomb matrix (CM) [6] and bag of bonds (BoB) [23] discussed below. State-of-the-art representations often fulfill these requirements in some limit, such as infinite expansion order. See [16] for a comprehensive and detailed discussion. The descriptors used in cheminformatics, [24] and sometimes in materials informatics, often violate (b) and (c), in particular if they do not include atomic coordinate information or rely on cutoff-based definitions of chemical bonds. Such descriptors serve the different purpose of directly predicting derived properties that are not functions of a single conformation, such as solubility or binding affinity to a macromolecule.

We introduce a many-body tensor representation (MBTR) derived from CM/BoB and concepts of many-body expansions. It is related [16] to Behler–Parrinello symmetry functions [25] and histograms of distances, angles, and dihedral angles [26]. MBTR fulfills the above requirements in the limit, is interpretable, allows visualization (figure 1), and describes finite and periodic systems. State-of-the-art empirical performance is demonstrated by us for organic molecules and inorganic crystals, as well as applicability to phase diagrams of Pt-group / transition metal binary systems, and by others for predicting and optimizing various molecular [2733] and crystalline properties [3437]. Implementations of MBTR are publicly available (see Code Availability section at the end).

Figure 1. Refer to the following caption and surrounding text.

Figure 1.  Visualization of many-body tensor representation. Top panels show distributions of inverse distances (k = 2, quadratic weighting) for aspirin (C9O4H8, left) and distributions of angles (k = 3, exponential weighting) for fcc salt (NaCl, right). Bottom panels show derivatives of the representation, obtained by differentiating with respect to the Cartesian coordinates of C atom r6 connecting the ester group (left) and the Na atom r1 at lattice point (right).

Standard image High-resolution image

2. Method

We start from the CM [6, 10, 38], which represents a molecule $\mathcal{M}$ as a symmetric atom-by-atom matrix

Equation (1)

where Zi are atomic numbers and $d_{i,j} = ||\boldsymbol{R_i}-\boldsymbol{R_j}||$ is Euclidean distance between atoms i and j. To avoid dependence on atom ordering (in the input), which would violate (a), M is either diagonalized, loosing information which violates (b) [18], or sorted, causing discontinuities that violate (c). Another shortcoming is the use of Z, which is not well suited for interpolation [39] as it overly decorrelates chemical elements from the same column of the periodic table.

The related BoB [23] representation uses the same terms, but arranges them differently. For each pair of chemical elements, corresponding CM terms are stored in sorted order, which can be viewed as an $N_\mathrm{e} \times N_\mathrm{e} \times d$ tensor, or an $N_\mathrm{e} \times (N_\mathrm{e} + 1)/2 \times d$ tensor if symmetry is taken into account, where $N_\mathrm{e}$ is number of elements and d is sufficiently large. Unlike the CM, it can not distinguish homometric molecules [21], which might distort its feature space [40]. While the BoB tensor itself does not suffer from discontinuities, its derivative does.

To derive MBTR, we retain stratification by elements, but avoid the sorting by arranging distances on a real-space axis:

Equation (2)

where x is a real number, $z_1,z_2$ are atomic numbers, $N_\mathrm{a}$ is number of atoms, $\delta(\cdot)$ is Dirac’s delta, and $\delta(\cdot,\cdot)$ is Kronecker’s delta. fBoB has mixed continuous-discrete domain and encodes all (inverse) distances between atoms with elements z1 and z2. For a smoother measure, we replace Dirac’s δ with another probability distribution $\mathcal{D}$, ‘broadening’ or ‘smearing’ it [20, 41]. In this work, we use the normal distribution. Other distributions can be used, in particular symmetric and short-tailed ones, for example, the Laplace distribution or the uniform distribution. We did not observe significant differences in performance, however. Adding a weighting function w2 and replacing the Kronecker δ functions by an element correlation matrix $C \in \boldsymbol{R}^{N_\mathrm{e} \times N_\mathrm{e}}$ yields

Equation (3)

of which equation (2) is a special case. In general, g2 describes a relation between atoms i and j, such as their inverse distance, $\mathcal{D}$ broadens the result of g2, and w2 allows to weight down contributions, for example, from far-away atoms. Building on the idea of many-body expansions, [42, 43] we generalize from f2 in equation (3), which encodes two-body terms, to the MBTR equation

Equation (4)

where $\boldsymbol{z} \in \boldsymbol{N}^k$ are atomic numbers, $\boldsymbol{i} = (i_1,\ldots,i_k) \in \{1,\ldots,N_\mathrm{a}\}^k$ are index tuples, and wk , gk assign a scalar to k atoms in $\mathcal{M}$ [44]. Canonical choices of gk for $k \! = \! 1,2,3,4$ are atom counts, (inverse) distances, angles, and dihedral angles. The element correlation matrices C allow exploitation of similarities between chemical element species (‘alchemical learning’), for example, within the same column of the periodic table [4547].

We measure the similarity of two atomistic systems $\mathcal{M}$ and $\mathcal{M}^{^{\prime}}$ as the Euclidean distance between their representations,

Equation (5)

In practice, we adjust equation (4) for symmetries. Discretizing the continuous axis as $\left(x_{\min},x_{\min}+\right.$ $\left.\Delta x,\ldots,x_{\max}\right)$ results in a rank $k\!+\!1$ tensor of dimensions $N_\mathrm{e} \times \cdots \times N_\mathrm{e} \times N_x$ with $N_x = (x_{\max}-x_{\min})/\Delta x$, where $x_{\min}$ and $x_{\max}$ are the smallest and largest values for which $f_k(x,\boldsymbol{z}) \neq 0$ for all z and $\mathcal{M}$. Linearizing element ranks yields $N_\mathrm{e}^k \times N_x$ matrices, allowing for visualization (figure 1) and efficient numerical implementation via linear algebra routines. For systems with many element species, discretization can lead to large matrices, requiring substantial amounts of memory. In such settings, memory-efficient implementation via sparse matrix formats or on-the-fly calculation of distances and inner products (see, e.g. [45]) of MBTR matrices might be preferable.

Periodic systems, used to model bulk crystals and surfaces, can be viewed as unit cells surrounded by infinitely many translated images of themselves. For such systems, $N_\mathrm{a} = \infty$ and the sum in equation (4) diverges. We prevent this by requiring one index of i to be in the (same) primitive unit cell [48]. This accounts for translational symmetry and prevents double-counting. Use of weighting functions wk such as exponentially decaying weights [49] then ensures convergence of the sum. Figure 1 (right) presents the resulting distributions of angles for face-centered cubic (FCC) NaCl as an example. Note that the k-body terms gk do not depend on choice of unit cell geometry (lattice vectors). This ensures unique representation of Bravais lattices where the choice of basis vectors is not unique, for example 2D hexagonal lattices where the angle between lattice vectors can be $\frac{1}{3}\pi$ or $\frac{2}{3}\pi$.

Many applications, including dynamics simulations and structural relaxation, require forces, the negative gradient of the energy with respect to atomic coordinates. The gradient of equation (4) is given by

Equation (6)

The gradient $\nabla f_k(x,\boldsymbol{z}$) can be derived analytically if this is possible for $\nabla w_k$, $\nabla g_k$, and $\nabla \mathcal{D}$. Alternatively, automatic differentiation [50, 51] can be used, removing the need for manual derivation. Figure 1 visualizes MBTR gradients.

3. Results

To validate MBTR, we demonstrate accurate predictions for properties of molecules and crystals. Focusing on the representation, we employ plain kernel ridge regression (KRR) models [7] unless stated otherwise.

3.1. Changes in molecular structure

To demonstrate interpolation across changes in the chemical structure of molecules we utilize a benchmark dataset [38] of 7211 small organic molecules composed of up to seven C, N, O, S and Cl atoms, saturated with H. Molecules were relaxed to their ground state using the Perdew–Burke–Ernzerhof (PBE) [52] approximation to Kohn–Sham density functional theory (DFT). Restriction to relaxed structures projects out spatial variability and allows focusing on changes in chemical structure. Table 1 presents prediction errors for atomization energies and isotropic polarizabilities obtained from single point calculations with the hybrid PBE0 [53, 54] functional. For 5k training samples, prediction errors are below 1 kcal mol−1 (‘chemical accuracy’), with the MBTR model’s mean absolute error (MAE) of 0.6 kcal mol−1 corresponding to thermal fluctuations at room temperature. Note that MBTR achieves similar performance with a linear regression model, allowing constant-time predictions.

Table 1.  Prediction errors for small organic molecules. Machine-learning models of atomization energies E and isotropic polarizabilities α, obtained at hybrid density functional level of theory, were trained on 5k molecules and evaluated on 2k others using different representations. RMSE = root mean square error, MAE = mean absolute error, CM = Coulomb matrix, BoB = bag of bonds, BAML = bonding angular machine learning, SOAP = smooth overlap of atomic positions, FCHL19 = Faber–Christensen–Huang–Lilienfeld representation, MBTR = many-body tensor representation.

   E (kcal mol−1) α (Å)3
RepresentationKernelMAERMSEMAERMSE
CM [6]Laplacian3.474.760.130.17
BoB [23]Laplacian1.782.860.090.12
BAML [42]Laplacian1.152.540.070.12
SOAP [55]REMatch0.921.610.050.07
FCHL19 [45, 47]Gaussian0.44
MBTRLinear0.741.140.070.10
MBTRGaussian0.600.970.040.06

3.2. Changes in crystal chemistry

Interpolation across changes in the chemistry of crystalline solids is demonstrated for a dataset of 11k elpasolite structures (ABC2D6, AlNaK2F6 prototype) [56, 57] composed of 12 different elements, with geometries and energies computed at DFT/PBE level of theory. Predicting formation energies with MBTR yields a root-mean-squared-error (RMSE) of 8.1 meV/atom and MAE of 4.7 meV/atom (figure 2) for a training set of 9k crystals.

Figure 2. Refer to the following caption and surrounding text.

Figure 2.  Formation energy predictions for ABC2D6 elpasolite structures containing 12 different elements. Shown are reference energies (DFT $\Delta E$) and predicted energies (ML $\Delta E$), as well as distribution of errors (inset) for 2272 crystals, from an MBTR ML model trained on 9086 other ones.

Standard image High-resolution image

Adding chemical elements should increase the intrinsic dimensionality of the learning problem, and thus prediction errors. To verify this, we created a dataset of 4611 ABC2 ternary alloys containing 22 non-radioactive elements from groups 1, 2, 13–15, spanning five rows and columns of the periodic table. Structures were taken from the Open Quantum Materials Database (OQMD) [58, 59], with geometries and properties also computed via DFT/PBE. As expected, energy predictions exhibit larger errors (RMSE 31 meV/atom, MAE 23 meV/atom) compared to an elpasolite model of same training set size (RMSE 23 meV/atom, MAE 15 meV/atom).

3.3. Changes in molecular geometry

For interpolation of changes in molecular geometry, we employ a benchmark dataset [13, 62] of ab initio molecular dynamics trajectories of eight organic molecules. Each molecule was simulated at a temperature of 500 K for between 150k and 1M time steps of 0.5 fs, with energies and forces computed at the DFT/PBE level of theory and the Tkatchenko–Scheffler model [63] for van der Waals interactions. Table 2 presents results for models trained and evaluated only on energies (right-hand side) and only on forces (left-hand side).

Table 2.  Energy and force prediction errors for changes in geometry of organic molecules. Shown are prediction errors for total energies (kcal mol−1) and atomic forces (kcal mol−1 Å−1). MAE = mean absolute error, RMSE = root mean squared error, PaiNN = polarizable atom interaction neural network [60], FCHL19 = Faber–Christensen–Huang–Lilienfeld representation [45, 47], sGDML = symmetric gradient domain machine learning [13], sMatérn = Matérn kernel augmented with symmetric permutations for sGDML [61], CMmd = Coulomb matrix variant, MBTR = many-body tensor representation.

 Trained only on forces, 1k referencesTrained only on energies, 10k references
 PaiNNFCHL19sGDML/CMmd sGDML/MBTRCMmd MBTRMBTR
KernelGaussiansMatérnsMatérnGaussianlinearGaussian
 EnergyForceEnergyForceEnergyForceEnergyForceEnergyEnergyEnergy
MoleculeMAEMAEMAEMAEMAEMAEMAEMAEMAEMAEMAE
Benzene0.07 a 0.06 a 0.070.150.030.030.03
Uracil0.110.130.100.100.110.240.110.170.050.100.03
Naphthalene0.120.080.120.150.120.110.110.090.120.100.07
Aspirin0.170.340.170.500.190.680.170.480.360.210.25
Salicylic acid0.120.200.120.220.120.280.110.180.110.130.07
Malonaldehyde0.100.340.080.250.100.410.090.360.180.210.10
Ethanol0.060.220.050.140.070.330.060.260.170.170.06
Toluene0.100.090.100.200.100.140.090.130.160.110.10

a We observed higher noise in predictions of benzene, whose reported prediction errors are also inconsistent in different publications. To make results comparable, we retrained the sGDML/CMmd model (originally reported values are 0.10 and 0.06).

Energy-only models were trained on 10k configurations and validated on 2k other ones, employing MBTR (parametrized for dynamics data, see supplement) and a similarly modified CM (CMmd, see supplement). Non-linear MBTR regression performs best overall, with the linear kernel again being competitive.

On the one hand, differentiating energy-based ML potentials can introduce errors, for example, from small oscillations between training samples due to insufficient regularization, and from insufficient model constraints in directions not covered by the training data [64]. On the other hand, electronic structure calculations often provide reference forces at not additional cost. It is therefore beneficial to include these in model training. This often reduces the required number of reference calculations by an order of magnitude [9, 13, 61, 65].

Force-only models require an adaptation of plain KRR. To accommodate forces into training and to demonstrate use of MBTR with other regression approaches, we employ MBTR in the framework of symmetrized gradient-domain machine-learning (sGDML) [61]. This approach uses the Matérn kernel, augmentation by symmetric molecular permutations, and reference forces for training, while providing energy and force predictions.

Table 2 (left-hand side) compares performance of the original sGDML approach (based on the CMmd representation [13, 61]) and of sGDML based on a two-body MBTR representation. Both models were trained on 1000 configurations, leading to kernel matrices with dimensionalities between 27k and 63k. For reference, we also present results for the Faber–Christensen–Huang–Lilienfeld (FCHL) representation [45, 47] and the polarizable atom interaction neural network (PaiNN) [60]. sGDML/MBTR performs as good or better than sGDML/CMmd for energy and force predictions. Compared to PaiNN, sGDML/MBTR performs better for energy predictions, but worse for forces. For a more fine-grained comparison between the sGDML models, figure 3 presents learning curves of relative MAE ratios of sGDML/MBTR over sGDML/CMmd, together with standard deviations over five runs starting from different random seeds. sGDML/MBTR consistently outperforms sGDML/CMmd with error reductions up to 50%–60%, especially when less than 100 training samples are used. For more symmetric molecules such as benzene, malonaldehyde, and ethanol, use of MBTR is less beneficial but still an improvement.

Figure 3. Refer to the following caption and surrounding text.

Figure 3.  Relative improvement in predictive accuracy on dynamics data of eight different organic molecules. Shown are force and energy prediction MAE ratios of sGDML/MBTR over sGDML/CMmd as a function of training set size. Error bars show the standard deviation of the ratios over five runs with different random seeds.

Standard image High-resolution image

3.4. Phase diagrams

We demonstrate applicability by identifying the convex hull of the phase diagram for Pt-group/transition metal binary alloys, relevant for industrial applications [66]. For a given dataset of candidate structures, we predict the energy of each structure and identify those with the lowest energy, which form the convex hull in a phase diagram. Compositions that lie on or slightly above the convex hull correspond to stable and meta-stable alloys, respectively.

To demonstrate this, we use a dataset [66] of 153 alloys computed at the DFT/PBE level of theory. This dataset contains at most a few hundred structures for each alloy. Due to this small amount of data direct application of ML models results in errors in predicted energies that are large enough to lead to wrong convex hulls. We address this by employing a simple active learning [67] scheme.

Starting with a few randomly selected structures, we iteratively train ML models on these and predict energies of candidate structures. In each iteration, we calculate (look up) DFT energies only for structures predicted to be low in energy and include these in the training dataset of the next iteration. This procedure prevents computationally expensive DFT calculations for high-energy structures that lie above the convex hull, saving up to 48% of all DFT calculations while still identifying the correct convex hull.

Figure 4 presents results for AgPt. The active learning model selected 357 DFT calculations for training and predicted energies of 331 (48%) other structures, with a MAE of 39 meV/atom. The trade-off between the number of saved calculations and the probability of failing to identify the correct convex hull can be explicitly controlled by adjusting the energy threshold below which DFT calculations are requested. In this simple demonstration, structures are given and not derived from composition by relaxation. While structural relaxation is possible with ML, it brings its own challenges [6876].

Figure 4. Refer to the following caption and surrounding text.

Figure 4.  Phase diagrams of Pt-group/transition metal binary alloys. Shown are Agx Pt$_{1-x}$ structures (points) and their convex hull (dashed line) as given by DFT and identified by ML. All structures are shown at their DFT energy. Symbols indicate whether a structure was selected for training (blue circles) or predicted as high-energy (orange triangles). See main text for details.

Standard image High-resolution image

3.5. Other uses

MBTR has been used to study structure and properties of molecules, clusters, crystals and other atomistic systems. Studies related to properties include predictions of

  • gas-particle partition coefficients, such as saturation vapor pressure and equilibrium partitioning coefficients, of atmospheric molecules via KRR [27]
  • Heisenberg exchange spin coupling constants for dicopper complexes via Gaussian process regression [28]
  • total and orbital energies of diverse larger organic molecules from the OE62 dataset [77] via a graph neural network [78]
  • extrapolation of size-extensive properties [79] at the example of atomization energies of organic molecules in the QM9 [8] and OE62 [77] datasets
  • formation energies of Al–Ni and Cd–Te binary compounds via support vector regression [34]
  • band gaps and formation energies of perovskite-like materials [35]
  • energetics of compositionally disordered compounds via KRR [80].

Studies related to structure include

  • visualization of the conformational space of tannic acid molecules via principal component analysis [29]
  • global optimization of atomic clusters, including electronic spin multiplicities, via active learning and Gaussian process regression [30, 31]
  • derivative-free structural relaxation of water and small unbranched alkanes via KRR and simulated annealing [32]
  • identification of low-energy point defects in solids via evolutionary algorithms, clustering, and Gaussian process regression [36]
  • Monte Carlo simulations of the thermodynamics of thiolate-protected gold nanoclusters via minimal learning machine regression [37]
  • developing data-efficient ML potentials at the example of Cs+ in water via active learning and Gaussian process regression [33].

4. Discussion and outlook

MBTR is a general representation (numerical description, feature set) of atomistic systems for fast accurate interpolation between quantum-mechanical calculations via ML. It is based on distributions of k-atom terms stratified by chemical elements. Despite, or because of, this simple principle, it is connected to many other representations, including CM [6], BoB [23], histograms of distances, angles and dihedral angles [26], atom-centered symmetry functions [25], partial radial distribution functions [81], FCHL representation [45, 47], as well as cluster expansion [82]. See [16] for further details on these and other relationships.

MBTR represents whole molecules and crystals. With increasing number of atoms, and thus degrees of freedom, this approach is likely to degrade, and exploitation of locality via prediction of additive atomic energy contributions becomes appealing [9, 83]. This requires representing local chemical environments [20], for which MBTR can be modified [34, 84, 85].

We note in passing that problems in the training of ML models, such as outliers, can often be traced back to problems in the underlying reference calculations, such as unconverged fast Fourier transform grids or inconsistent settings (violating the assumption that a single function is being fitted), a phenomenon also observed by others [86]. This suggests that automated identification of errors in big datasets of electronic structure calculations via parametrization of ML models might be a general approach for validation of such datasets. We rationalize this hypothesis by ML models identifying regularity (correlations) in data, and faulty calculations deviating in some way from correct ones.

Continuing advances in electronic structure codes and increasing availability of large-scale computing resources have led to large collections of ab initio calculations, such as Materials Project [87], AFLOWlib [88], OQMD [59], and Novel Materials Discovery Laboratory [89]. Representations such as MBTR are key to combine quantum mechanics with machine learning (QM/ML) for fast, accurate and precise interpolation in these settings.

Acknowledgments

M R and H H thank Matthias Scheffler for helpful discussions and support. M R thanks the Institute for Pure and Applied Mathematics (IPAM) for hospitality and support, participants of its program on Understanding Many-Particle Systems with Machine Learning for feedback, and Albert Bartók-Pártay, Gábor Csányi, Alexander Shapeev, Alexandre Tkatchenko and the late Alessandro de Vita for extended discussions. M R acknowledges funding from EU Horizon 2020 program Grant 676580, The Novel Materials Discovery (NOMAD) Laboratory, a European Center of Excellence.

This article is an extended and updated version of the original arXiv preprint 1704.06439, whose formal publication was delayed for non-technical reasons.

Data availability statement

All datasets used in this study are publicly available. Implementations of MBTR are available as part of the DScribe [85] and qmmlpack [90] libraries. Code to reproduce results of reported experiments is available at: https://github.com/hhaoyan/mbtr.

The data that support the findings of this study are openly available at the following URL/DOI: https://doi.org/10.6084/m9.figshare.19567324.v1.

Please wait… references are loading.