Introduction
Machine learning (ML) has emerged as a powerful tool for developing interatomic models (IAM) capable of bridging the computational efficiency of classical molecular mechanics approaches and the predictive power of first-principles-based methods. In essence, this is achieved by directly learning the target potential energy surface (PES) topography onto a flexible set of basis functions, rather than trying to reconstruct the PES through analytical expressions as is typically done in classical “force field” strategies. This model generation approach is particularly useful for simulating complex systems (e.g., condensed-phase reacting systems and materials under extremely high temperature and pressure conditions) for which suitable molecular mechanics descriptions are not known a priori.
ML-IAM development is typically accomplished through five high-level steps: (1) an initial training data set is generated, comprising a series of system configurations with corresponding forces, energies, and/or stresses assigned via, e.g., Kohn–Sham Density Functional Theory1 (DFT); (2) system configurations are then recast as a series of descriptors that encode the local chemical environments; (3) a model architecture is selected, e.g., a smooth set of basis functions that ingest these descriptors and can be used to predict corresponding energies and, by consequence, forces and stresses; (4) model parameters are determined via optimization; (5) the model is iteratively refined until the desired accuracy is achieved through strategies such as active learning.
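The five steps above can be caricatured in a few lines of code. The following is a deliberately minimal, purely illustrative sketch: a 1-D quadratic "PES" stands in for the DFT labeler, a fixed monomial plays the role of descriptor and architecture, and the worst-described candidate configuration is flagged for relabeling. None of the names correspond to any real ML-IAM code base.

```python
import random

def ground_truth(x):
    # Step 1 surrogate: the "DFT" labeler, here a toy 1-D PES E(x) = 0.5 x^2
    return 0.5 * x * x

def fit(xs, ys):
    # Steps 2-4: descriptor x**2, one-parameter linear model E(x) = c * x**2,
    # solved in closed form by linear least squares
    return sum(y * x * x for x, y in zip(xs, ys)) / sum(x ** 4 for x in xs)

def develop_model(cycles=5, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(5)]   # step 1: initial configs
    ys = [ground_truth(x) for x in xs]                # ...with labels
    c = fit(xs, ys)
    for _ in range(cycles):                           # step 5: active learning
        # flag the candidate the current model describes worst
        cand = max((rng.uniform(-2.0, 2.0) for _ in range(20)),
                   key=lambda x: abs(c * x * x - ground_truth(x)))
        xs.append(cand)
        ys.append(ground_truth(cand))
        c = fit(xs, ys)                               # refit with new labels
    return c
```

Because the toy model can represent the toy PES exactly, the fitted coefficient converges to 0.5; in real ML-IAM development, of course, every one of these steps is vastly more involved.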
Many open source tools for each of these steps are now available. For example, for the ML portion (steps 2–5), a multitude of descriptor approaches2,3,4,5,6, ML model architectures7,8,9,10,11,12,13,14,15,16 and even active learning tools6,12,13,14,17,18,19,20, have been published, each of which excels in different problem spaces (e.g., data-rich vs data-poor fitting, large-scale vs relatively small-scale simulation, materials vs chemistry applications). Ultimately, this has enabled what were once viewed as challenging fitting problems to serve as basic model and method benchmarks21, as well as generation of highly transferable general-purpose ML-IAM parameter sets22,23,24,25,26. However, there remain a number of application spaces necessitating the balance of accuracy and efficiency afforded by ML-IAMs for which model development remains far from trivial. These problems tend to exist at the confluence of high chemical and configurational complexity due to the need for large models, large training sets, and exhaustive validation. In particular, efforts to develop ML-IAMs for systems with highly complex potential energy surfaces – such as those involving both bonded and non-bonded interactions that span widely separated energy scales, as seen in condensed-phase molecular reactions and covalent phase transformations – have largely been restricted to systems with fewer than three element types, due to the roughly exponential growth in the number of required parameters with increasing chemical complexity. For example, modeling evolution in carbon-containing systems under reactive conditions is particularly difficult due to the disparate energy scales relevant for conformational change, bond formation/breaking, and non-bonded interactions and the fact that these materials tend to contain three or more element types (e.g., C, N, O, and H)6,26,27,28.
Active learning, in which the fitting framework autonomously attempts to identify maximally informative candidate (i.e., unlabeled) training data for labeling and subsequent addition to the training set, can make these fitting problems more tractable by reducing training data volume requirements6,13,18,19. However, it does not address the remaining practical challenges of large models, large training sets, and the need for exhaustive validation. Therefore, in this work, we describe a new transfer learning strategy to (1) reduce fit complexity, (2) enable the physicochemical space over which models are suited to be better defined, and (3) enable generation of models that are more transferable and efficiently generated than those fit using traditional “direct-learned” methods. Our transfer learning approach is distinct from that used for neural-network-type ML-IAMs29, and is designed for use with parametrically linear ML-IAMs that employ a descriptor providing unique and fully separable representations of n-body interaction clusters based on both order and chemical composition7,26. Our strategy draws inspiration from classical transferable force fields30,31,32, allowing the fitting problem to be decomposed into small, reusable parameter blocks and thereby enabling far greater chemical extensibility than currently available strategies. Our hierarchical strategy also enables immediate application to targeted subsets of chemical space while concurrently refining and expanding model chemical scope. This allows for real-time problem-solving within the initial domain as fitting efforts progressively scale to accommodate increasingly complex chemistry.
In the following sections, we provide an overview of our approach within the context of the Chebyshev Interaction Model for Efficient Simulation (ChIMES) ML-IAM, followed by discussion of our target application and initial training data generation strategy. We apply our approach to a testbed system comprising mixtures of carbon (C) and nitrogen (N) under conditions ranging from nominally ambient temperature (T) and pressure (P) up to 10,000 K and 200 GPa. We note that this testbed was chosen due to its relatively low atomic complexity (i.e., only two species are present) but high configurational complexity (e.g., spanning multiple phases, compositions, and chemistry). Performance of transfer- and direct-learned models is compared with DFT, and results are discussed within the context of the following guiding questions:
Given models for pure-C and -N systems, can we build a high-quality C/N model without having to refit any parameters?
How does the performance of hierarchically transfer-learned models compare with a model fit via the standard approach?
Beyond agility and parameter reuse, do any other advantages emerge from this strategy?
Results
ML-IAMs are characterized by two main features: model architecture and descriptor. For the majority of ML-IAMs, these features are structured such that parameters for interactions between atoms of various types are inseparable; introduction of an additional atom type requires generating a new model with all parameters updated2,3,5,11,13,15,22. As a consequence, ML-IAMs are generally fit à la carte for a given target system, atomic composition, and set of conditions. Specifically, use of an atom-centered descriptor or of a complex architecture (e.g., neural network, graph-based, etc.) generally precludes generating models that are explicitly compositionally extensible. In the section below, we show that the cluster-based descriptor and parametrically linear form used by the ChIMES ML-IAM overcome this limitation, enabling a unique opportunity for developing chemically extensible models through a hierarchical transfer learning strategy.
Model overview
In this section, we provide a brief overview of the ChIMES ML-IAM, emphasizing features salient to the presently described transfer learning strategy. For a more detailed discussion of the model and its underlying form, we direct the reader to refs. 6 and 33.
ChIMES describes system energy through an explicit many-body cluster expansion, i.e.:
$$E=\mathop{\sum }\limits_{i}^{{n}_{{\rm{a}}}}{E}_{i}+\mathop{\sum \sum }\limits_{i > j}^{{n}_{{\rm{a}}}}{E}_{ij}+\mathop{\sum \sum \sum }\limits_{i > j > k}^{{n}_{{\rm{a}}}}{E}_{ijk}+\cdots ,$$
(1)
where E is the total ChIMES energy for a system of na atoms, Ei is the energy for a single atom, i, and Eij and Eijk are the energies for a cluster of two or three atoms (i.e., ij or ijk), respectively. This expansion can extend to arbitrary bodiedness, though all models produced to date contain a maximum of 4-body terms6,7,26,27,33,34,35,36,37,38. Interactions between pairs of atoms are described through Chebyshev polynomial series that take as input interatomic pair distances, i.e., for a pair of atoms i and j:
$${E}_{ij}\propto \mathop{\sum }\limits_{\alpha }^{{{\mathcal{O}}}_{2{\rm{B}}}}{c}_{\alpha }^{{e}_{i}{e}_{j}}{T}_{\alpha }\left({s}_{ij}^{{e}_{i}{e}_{j}}\right),$$
(2)
where Tα is a Chebyshev polynomial of order α, ({s}_{ij}^{{e}_{i}{e}_{j}}) is a transformed pair distance between atoms i and j of element types ei and ej, ({{\mathcal{O}}}_{2{\rm{B}}}) is the user-defined two-body order for the polynomial series, and ({c}_{\alpha }^{{e}_{i}{e}_{j}}) are the Chebyshev polynomial coefficients that comprise the fitting parameters of the model. Note that we use the “∝” symbol to indicate that smoothing and penalty functions have been excluded from these equations for simplicity. Many-body interactions are treated as the product of interactions for constituent atom pairs. For example, a three-body interaction is given by:
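As an illustration of Eq. (2), the two-body series can be evaluated with the standard Chebyshev recurrence. The sketch below computes the bare polynomial sum for a single, already-transformed pair distance s ∈ [−1, 1]; the coefficient values and the omission of smoothing/penalty terms are illustrative assumptions, matching the proportionality used in the text.

```python
def chebyshev(order, s):
    """T_0(s)..T_order(s) via the recurrence T_n = 2 s T_{n-1} - T_{n-2}."""
    T = [1.0, s]
    for _ in range(2, order + 1):
        T.append(2.0 * s * T[-1] - T[-2])
    return T[:order + 1]

def pair_energy(coeffs, s):
    """Bare two-body series of Eq. (2): E_ij = sum_alpha c_alpha T_alpha(s),
    with coeffs c_alpha for one element pair (e.g., CN) and s the transformed
    pair distance. Coefficient values here are arbitrary illustrations."""
    return sum(c * t for c, t in zip(coeffs, chebyshev(len(coeffs) - 1, s)))
```

For example, `pair_energy([0.0, 0.0, 1.0], 0.5)` returns T2(0.5) = 2(0.5)² − 1 = −0.5.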
$${E}_{ijk}\propto \mathop{{\sum }^{{\prime} }}\limits_{\beta }^{{{\mathcal{O}}}_{3{\rm{B}}}}\mathop{\sum }\limits_{\gamma }^{{{\mathcal{O}}}_{3{\rm{B}}}}\mathop{\sum }\limits_{\delta }^{{{\mathcal{O}}}_{3{\rm{B}}}}{c}_{\beta \gamma \delta }^{{e}_{i}{e}_{j},{e}_{i}{e}_{k},{e}_{j}{e}_{k}}{T}_{\beta }\left({s}_{ij}^{{e}_{i}{e}_{j}}\right){T}_{\gamma }\left({s}_{ik}^{{e}_{i}{e}_{k}}\right){T}_{\delta }\left({s}_{jk}^{{e}_{j}{e}_{k}}\right),$$
(3)
where the prime on the first sum indicates that only terms for which at least two of β, γ, and δ are non-zero are considered, ensuring a true three-body interaction. Previous work has shown that this functional form is well suited for describing a variety of systems, including inorganic materials, covalent materials, condensed-phase reacting systems, and molecular systems and materials23,26,33,34,36,38.
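The primed triple sum of Eq. (3) can be sketched directly. In the minimal illustration below, the coefficient table and distances are invented for demonstration; a real model holds one such table per element triplet (CCC, CCN, …) and includes the smoothing terms omitted here.

```python
def chebyshev(order, s):
    """T_0(s)..T_order(s) via the recurrence T_n = 2 s T_{n-1} - T_{n-2}."""
    T = [1.0, s]
    for _ in range(2, order + 1):
        T.append(2.0 * s * T[-1] - T[-2])
    return T[:order + 1]

def three_body_energy(coeffs, s_ij, s_ik, s_jk):
    """Primed sum of Eq. (3): index triples (beta, gamma, delta) with fewer
    than two non-zero entries are skipped, so no purely two-body term sneaks
    into the three-body energy. `coeffs` maps (beta, gamma, delta) -> c for
    one element triplet (illustrative values only)."""
    order = max(max(key) for key in coeffs)
    T_ij, T_ik, T_jk = (chebyshev(order, s) for s in (s_ij, s_ik, s_jk))
    energy = 0.0
    for (b, g, d), c in coeffs.items():
        if sum(n > 0 for n in (b, g, d)) >= 2:   # the "prime" constraint
            energy += c * T_ij[b] * T_ik[g] * T_jk[d]
    return energy
```

Note how the `(1, 0, 0)`-type terms are excluded: with only one non-zero index, the product of polynomials would depend on a single pair distance and therefore describe a two-body, not three-body, interaction.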
Models are fit by force-, energy-, and/or stress matching to gas- and/or condensed-phase atomic configurations labeled by a ground-truth method such as DFT; this is achieved by minimizing an objective function of the form:
$${F}_{{\rm{obj}}}\propto \sqrt{\mathop{\sum }\limits_{i=1}^{{n}_{{\rm{f}}}}\left({w}_{{{\rm{E}}}_{i}}^{2}\Delta {E}_{i}^{2}+\mathop{\sum }\limits_{j=1}^{{n}_{{\rm{a}}}}\mathop{\sum }\limits_{k=1}^{3}{w}_{{{\rm{F}}}_{ijk}}^{2}\Delta {F}_{ijk}^{2}+\mathop{\sum }\limits_{j=1}^{9}{w}_{{\sigma }_{ij}}^{2}\Delta {\sigma }_{ij}^{2}\right)},$$
(4)
where (\Delta X={X}^{{\rm{DFT}}}-{X}^{{\rm{ChIMES}}\{c\}}) and X is a force, energy, or stress predicted by the superscripted method. Fobj and {c} are the weighted root-mean-squared error and model coefficients, respectively. The numbers of frames and atoms are given by nf and na, respectively. Fijk indicates the kth Cartesian component of the force on atom j in configuration i, while σij and Ei indicate the jth component of the stress tensor and the energy for configuration i, respectively. Weights for each force, energy, and stress are given by w.
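In code, the quantity under the square root in Eq. (4) is simply a weighted sum of squared residuals. The sketch below uses an invented, flattened data layout (one dict per frame) purely to make the bookkeeping explicit; it is not the layout of any real fitting code.

```python
from math import sqrt

def f_obj(frames):
    """Eq. (4) up to the proportionality: the root of the weighted sum of
    squared energy, force-component, and stress-component residuals, each
    residual being DFT minus ChIMES. `frames` is a list of dicts with keys
    "energy": (w, dE), "forces": [(w, dF), ...], "stress": [(w, ds), ...]."""
    total = 0.0
    for frame in frames:
        w_e, d_e = frame["energy"]
        total += (w_e * d_e) ** 2
        total += sum((w * d) ** 2 for w, d in frame["forces"])
        total += sum((w * d) ** 2 for w, d in frame["stress"])
    return sqrt(total)
```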
Since ChIMES is entirely linear in its fitted parameters {c}, the model optimization problem can be recast as the following over-determined matrix equation:
$${\boldsymbol{w}}{\boldsymbol{M}}{\boldsymbol{c}}={\boldsymbol{w}}{{\boldsymbol{X}}}_{{\rm{DFT}}},$$
(5)
where XDFT is the vector of ({F}_{ijk}^{{\rm{DFT}}}), ({\sigma }_{ij}^{{\rm{DFT}}}), and EDFT values, w is a diagonal matrix of weights to be applied to the elements of XDFT and rows of M, and the elements of the design matrix M are given by:
$${M}_{ab}=\frac{\partial {X}_{a}^{{\rm{ChIMES}}\{c\}}}{\partial {c}_{b}}.$$
(6)
In the above, a represents a combined index over force and energy components, and b is the index over permutationally invariant model parameters. This allows model parameters to be rapidly generated by application of advanced linear solvers38,39,40,41 and makes the model well suited for iterative and/or active-learning training strategies. For additional details, the reader is directed to refs. 6 and 26.
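Because the fit reduces to the over-determined linear system of Eq. (5), any standard linear least-squares routine applies. The self-contained sketch below solves the weighted problem via the normal equations with naive Gaussian elimination; this is adequate for illustration only, and the production solvers cited in the text use regularized methods instead.

```python
def solve_weighted_lsq(M, x_dft, w):
    """Least-squares solution of (w M) c = (w x_DFT) via the normal equations
    A c = b, with A = M^T W^2 M and b = M^T W^2 x_DFT (W diagonal).
    Illustration only: no regularization, naive O(n^3) elimination."""
    rows, n = len(M), len(M[0])
    A = [[sum(w[r] ** 2 * M[r][i] * M[r][j] for r in range(rows))
          for j in range(n)] for i in range(n)]
    b = [sum(w[r] ** 2 * M[r][i] * x_dft[r] for r in range(rows))
         for i in range(n)]
    for col in range(n):                        # forward elimination, pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for j in range(col, n):
                A[r][j] -= f * A[col][j]
            b[r] -= f * b[col]
    c = [0.0] * n                               # back substitution
    for i in range(n - 1, -1, -1):
        c[i] = (b[i] - sum(A[i][j] * c[j] for j in range(i + 1, n))) / A[i][i]
    return c
```

The normal-equations route squares the condition number of M, which is one reason production ChIMES fits favor the advanced solvers referenced above.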
Parameter hierarchy and transfer learning overview
ChIMES models view system configurations as a collection of fully connected graphs between atoms in n-body clusters, where atoms form nodes and the transformed distances between those atoms form the edges. We refer to these cluster graphs as the ChIMES descriptor. The parametric linearity characteristic to ChIMES models coupled with use of a cluster-centered many-body descriptor gives rise to an inherently hierarchical parameter structure that can be leveraged for transfer learning. Specifically, the atom cluster energy terms given in Eq. (1) can be further decomposed based on constituent atom types. For example, the two-body energy contributions for a system comprised entirely of C and N can be written as:
$$\mathop{\sum \sum }\limits_{i > j}^{{n}_{{\rm{a}}}}{E}_{ij}=\mathop{\sum \sum }\limits_{i > j}^{{n}_{{\rm{C}}}}{E}_{ij}^{{\rm{CC}}}+\mathop{\sum \sum }\limits_{i > j}^{{n}_{{\rm{N}}}}{E}_{ij}^{{\rm{NN}}}+\mathop{\sum }\limits_{i}^{{n}_{{\rm{C}}}}\mathop{\sum }\limits_{j}^{{n}_{{\rm{N}}}}{E}_{ij}^{{\rm{CN}}},$$
(7)
where nC and nN are the numbers of C and N atoms in the system, respectively (i.e., nC + nN = na), and ({E}_{ij}^{{e}_{i}{e}_{j}}) is the two-body energy for a pair of atoms i and j of element types ei and ej. Similarly, for a three-body interaction:
$$\begin{array}{l}\mathop{\sum \sum \sum }\limits_{i > j > k}^{{n}_{{\rm{a}}}}{E}_{ijk}=\mathop{\sum \sum \sum }\limits_{i > j > k}^{{n}_{{\rm{C}}}}{E}_{ijk}^{{\rm{CCC}}}+\mathop{\sum \sum \sum }\limits_{i > j > k}^{{n}_{{\rm{N}}}}{E}_{ijk}^{{\rm{NNN}}}\\ +\mathop{\sum \sum }\limits_{i > j}^{{n}_{{\rm{C}}}}\mathop{\sum }\limits_{k}^{{n}_{{\rm{N}}}}{E}_{ijk}^{{\rm{CCN}}}+\mathop{\sum }\limits_{i}^{{n}_{{\rm{C}}}}\mathop{\sum \sum }\limits_{j > k}^{{n}_{{\rm{N}}}}{E}_{ijk}^{{\rm{CNN}}}.\end{array}$$
(8)
This logic can be extended for construction of higher-body interactions. Notably, pure-component terms (and thus, parameters) for C and N interactions are confined to the first two terms of Eqs. (7) and (8), and are non-interacting. Two-atom-type cross-interactions are contained in the remaining terms. This point is illustrated in Fig. 1. For a C, H, O, and N ChIMES model, all terms for interactions between only C atoms and between only N atoms are contained in the orange C and N blocks, respectively, while all cross-interaction terms are contained within the yellow CN block. Critically, this means parameters for blocks in the same column can be fit in parallel, completely independently of one another. For example, a ChIMES model containing up to four-body interactions for a C/N system is comprised of two “building blocks” containing {C, CC, CCC, CCCC} and {N, NN, NNN, NNNN} parameters, and a CN cross-term building block containing {CN, CCN, CNN, CCCN, CCNN, CNNN}. We note that, throughout this text, ‘C/N’ refers to systems or datasets containing both carbon and nitrogen atoms (e.g., C/N systems), as well as the corresponding models that fully describe them, including pure-C, pure-N, and C–N cross-interactions. In contrast, ‘CN’ is used specifically to denote parameters associated with C–N cross-interactions.
Fig. 1: Schematic of a ChIMES parameter hierarchy for a model describing C, H, O, and N containing systems.
Parameters in a given column can be fit completely independently of one another. Parameter blocks with two or more atoms represent cross-interactions between the indicated atom types.
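The block bookkeeping described above is simple to enumerate programmatically: each multiset of n element symbols names one independent parameter block. A small sketch (the function name is ours, not part of any ChIMES tool):

```python
from itertools import combinations_with_replacement

def parameter_blocks(elements, max_body=4):
    """Name every independent ChIMES parameter block for the given element
    types: one block per multiset of n element symbols, n = 1..max_body."""
    return {n: ["".join(combo)
                for combo in combinations_with_replacement(sorted(elements), n)]
            for n in range(1, max_body + 1)}
```

For {C, N} this reproduces the building blocks listed in the text: the pure blocks {C, CC, CCC, CCCC} and {N, NN, NNN, NNNN}, plus the cross-term block {CN, CCN, CNN, CCCN, CCNN, CNNN}.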
Following the precedent set by previous ChIMES model development efforts, a ChIMES-CN model would traditionally be generated by fitting all of these parameters at once. However, the unique model structure also allows a “hierarchical transfer learning” approach wherein the C and N building blocks are fit independently to pure-C and pure-N training data, respectively. CN-block parameters can then be fit to two-element-system training data by replacing the definition of ΔX used in Eq. (4) with (\Delta X={X}^{{{\rm{DFT}}}^{{\prime} }}-{X}^{{\rm{ChIMES}}\{c\}}), where ({X}^{{{\rm{DFT}}}^{{\prime} }}={X}^{{\rm{DFT}}}-{X}^{{\rm{ChIMES-C}}}-{X}^{{\rm{ChIMES-N}}}) and ChIMES–C and ChIMES–N indicate X computed using the C and N block parameters, respectively. This same logic can be extended to ternary and quaternary systems; e.g., the resulting CN parameter block, along with the previously fit C and N parameter blocks, could be used in development of CHN, CON, and CHON models.
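The relabeling step at the heart of this scheme is literally a subtraction. A minimal sketch, assuming a flat array layout for the labels (forces, energies, and stresses concatenated, as in the XDFT vector of Eq. (5)):

```python
def residual_labels(x_dft, x_chimes_c, x_chimes_n):
    """Hierarchical transfer-learning targets X^{DFT'} = X^{DFT}
    - X^{ChIMES-C} - X^{ChIMES-N}: the frozen pure-element block predictions
    are subtracted from the DFT labels, so that only the CN cross-interaction
    parameters are fit to what remains."""
    return [x - c - n for x, c, n in zip(x_dft, x_chimes_c, x_chimes_n)]
```

The pure-C and pure-N parameters never change during this fit; they simply contribute a fixed baseline that the CN block is trained to correct.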
Prototypical system overview and training strategy
Like other ML-IAMs, these building blocks have historically been fit all at once, yielding ChIMES models whose applications are confined to a specific set of atom types and limited to certain compositional ranges27,28,36,38,43. Here, we explore the efficacy of the hierarchical transfer learning strategy described above to develop a C/N model valid from near-ambient conditions to extreme conditions of ~10,000 K and 200 GPa that is suitable for describing any range of compositions from 0 to 100% N, by building upon previously generated ChIMES models for C (i.e., the 2024-Large model26) and N33. We note that this testbed C/N system is also interesting within the context of synthesis of N-doped graphitic materials for applications including catalysis, energy storage, and sensing44,45,46,47,48,49. For example, shock-compression of C/N-rich precursor materials has been shown capable of producing nitrogen-containing graphitic nano-onions on sub-μs timescales, which holds incredible promise as a high-throughput strategy for tailored synthesis of exotic and technologically relevant carbon nanomaterials50,51. However, the governing phenomena and associated kinetics remain poorly understood due to the extreme associated T and P and the far-from-equilibrium events that occur. Hence, the present pure C/N systems can serve as a reductionist model for understanding this process.
Training data were generated through a combination of Kohn–Sham DFT molecular dynamics (MD) simulations and single-point calculations, details of which can be found in Section “Methods”. The C/N binary phase diagram is unknown under our conditions of interest; thus, to generate training data, DFT-MD simulations were launched for three different C/N compositions at a variety of temperatures and densities spanning 300 K/1 g cm−3 to 9000 K/4 g cm−3, as shown in Fig. 2. Systems at densities below 2.9 g cm−3 were initialized with a graphitic structure, while higher-density configurations were initialized with a diamond-like structure; N atoms were then introduced by random substitution. Simulations were run for at least 5 ps; 20 training configurations were taken from each of the 10 simulations to build the initial 298-configuration training set. As shown in Fig. 2, the resulting configurations span graphitic, compressed-gas, and high-density-liquid states, containing both small molecules and larger, polymeric structures. This training set was supplemented with 98 configurations for 3 solid C/N materials, mp-1985, mp-571653, and mp-563, found in the Materials Project database52, comprising cell optimization trajectories under 0, 5, 10, 20, and 40 GPa. Note that the former two structures have been observed experimentally53,54,55.
Fig. 2: Overview of the training data used for model development in this work.
All training data for mixed C/N systems with nitrogen fraction, temperature, and pressure as given in the plot inset. Simulations used to generate training data points were initialized as N-doped graphite or diamond, as indicated below the plot. Representative snapshots of training configurations at each composition and state point are provided below the plot, with N atoms in blue and C atoms in cyan and the corresponding composition, temperature, and label (“case”) given in the figure. Connections are drawn between atoms within 1.8 Å of one another.
Models were generated using version 2.0.0 of the ChIMES-LSQ package8. Fitting was automated via version 2.0.0 of the ChIMES Active Learning Driver (ALDriver)6,17, an open-source Python workflow tool for automated ChIMES model generation via iterative fitting. ChIMES simulations were conducted using ChIMES_MD, which is part of the ChIMES-LSQ package. The standard iterative learning strategy coupled with the newly implemented hierarchical learning capability was used for the present work. Briefly, in the iterative learning strategy, an initial (It-1) model is trained on the DFT-MD-generated training set. The model is then deployed in parallel simulations at a selection of the training compositions, temperatures, and densities. Generally, early models are not adequately informed by the available training data and can give rise to unstable simulations that frequently sample poorly informed regions of the model (e.g., near the inner cutoff) and do not conserve the appropriate quantity. Our active learning strategy33 attempts to select up to 20 such configurations from each simulation, as well as 20 configurations from otherwise stable portions of the simulation. These ChIMES-generated configurations are then assigned labels (forces, energies, and stresses) via single-point DFT calculations and added to the training set, from which the next-iteration model is generated; this is repeated for a user-specified number of iterative learning cycles. A total of 10 cycles were used in the present work. A weighting factor of w = nI/I is applied to each training point, where nI is the total number of requested iterative learning cycles and I is the cycle at which the point was added, counting from 1. This has the effect of giving DFT-MD-generated configurations the highest-priority weights, which prevents the unphysical configurations generated by early ChIMES models (e.g., those with extremely small interatomic distances) from driving the fit away from relevant physicochemical space.
We note that the need for this weighting strategy arises from our desire to generate maximally efficient models, meaning that at our target level of model complexity we may not be able to describe near- and extremely-far-from-optimal structures equally well at the same time. Instead, by applying this decaying weighting scheme, we maintain the importance of “ground truth” DFT configurations and ensure that models converge over subsequent iterations, while still adding the benefit of “rare event sampling” and the longer timescales accessible to ChIMES-based MD.
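The decaying weight itself is a one-liner; for the 10-cycle schedule used here, it runs from 10 (the original DFT-MD data, cycle 1) down to 1 (the final cycle's configurations).

```python
def cycle_weight(cycle, n_cycles=10):
    """w = n_I / I for a training point added at iterative-learning cycle I
    (counting from 1), giving earlier-cycle data proportionally larger weight
    than late, model-generated configurations."""
    if not 1 <= cycle <= n_cycles:
        raise ValueError("cycle must lie in [1, n_cycles]")
    return n_cycles / cycle
```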
As in previous work33, initial weights were set to wF = 1.0 kcal mol−1 Å, wE = 0.3 kcal mol−1, and wσ = 100.0 kcal mol−1 Å−3. These weights account for differences in the relative abundance and magnitude of the force, energy, and stress tensor training data. Models contained 1- through 4-body interactions, with corresponding polynomial orders of ({{\mathcal{O}}}_{{\rm{2b}}}=25), ({{\mathcal{O}}}_{{\rm{3b}}}=10), and ({{\mathcal{O}}}_{{\rm{4b}}}=4). Remaining hyperparameters for the fits were selected using previously described ChIMES heuristic approaches33,36 and are given in Table 1. All ChIMES simulations were run using either the ChIMES_MD code available in the CHIMES_LSQ repository8 or via LAMMPS56 through version 2.0.0 of the ChIMES_Calculator library57. Simulations used a 0.2 fs time step and were run for up to 100 ps; however, all analysis was performed on only the first 5 ps to enable consistent comparison with DFT. Additional details on the use of these tools are provided in Supplementary Section I.
Table 1 Model hyperparameters including the inner cutoff (rcut,in), 2-, 3-, and 4-body outer cutoffs (rcut,out,2b, rcut,out,3b, and rcut,out,4b, respectively), and Morse variable for distance transformation (λ)
To assess the efficacy of the proposed hierarchical transfer learning capability, three models were generated, henceforth referred to as “Standard,” “Hierarch,” and “Partial” for models fit using the standard à la carte approach, the new hierarchical transfer learning strategy, or a partial hierarchical strategy in which two parameter blocks are learned simultaneously, respectively. We begin by comparing model performance relative to DFT for C/N systems, and then extend our study to pure C and pure N.
Performance for C/N systems
Though one might intuitively expect the Standard and Partial models to outperform the Hierarch model for the C/N system, we find that, in general, all models perform equally well for this system. Thus, in this subsection, we only present data from the Standard model when it shows significant deviations from the Hierarch model.
We begin with discussion of numerical metrics. The complete Hierarch C/N model contains a total of 3026 parameters. Of those parameters, the 442 for C and 462 for N were fixed, taken from earlier work, and only the remaining 2122 corresponding to CN cross-interactions were fit. Figure 3 provides force, energy, and stress parity plots for fitting iterations 1, 2, 4, 8, and 10 of 10. Note that each plot only shows new training data introduced at that iteration. Overall, we find excellent agreement with DFT. Distributions of stress and energy remain relatively unchanged between iterations but there is a clear increase in the spread of forces at It-2, arising from configurations generated by the poorly-constrained It-1 ChIMES model. By It-4, the DFT generated range and distribution of forces (i.e., in It-1) are recovered by the ChIMES model; by It-10, models yield simulations for which the relevant quantity is conserved (see Supplementary Information Fig. 1).
Fig. 3: Force, stress, and energy parity plots for successive active learning iterations of the Hierarch model.
Data in each plot represent only new training data added at each cycle and are given in terms of point density as indicated in the color bar.
Moving on to physical property metrics, we find that pressure predictions at each training state point are within error of the DFT-predicted value (see Supplementary Information Fig. 2), while C/N crystal cold-compression curves exhibit absolute percent differences ranging from 0.1 to 1.7% (see Supplementary Information Table II). Diffusion coefficients are also in good agreement with DFT (see Supplementary Fig. 3), though ChIMES models underpredict these values for the two graphitic structures comprising cases 1 and 2. This disagreement is due to the use of rcut,out ≤ 5 Å, which precludes recovery of the low-lying dispersion forces that modulate inter-sheet interactions, but greatly reduces the model’s computational expense. Ongoing work is exploring overcoming this limitation by explicitly including D2 corrections in the ChIMES model26.