Literature searches, simulations, and practical experiments have been part of the materials science toolkit for decades, but the last few years have seen an explosion of machine learning-driven software tools that promise to accelerate all three.
Many of the challenges facing the semiconductor manufacturing industry are fundamentally materials science problems. What metal has the lowest resistance at nanowire dimensions, and what precursors can be used to deposit it? What photoresist offers the best combination of etch resistance and sensitivity to EUV photons? What oxide semiconductors with good carrier mobility are the most compatible with CMOS BEOL processes? What happens — chemically, electrically, and thermally — at the interfaces between all those layers?
As the manufacturing process has grown more complex, the sheer number of materials involved has skyrocketed. Once-exotic elements like ruthenium are now key components of leading-edge processes. Adhesion promoters, deposition precursors, and many more auxiliary compounds support each material that remains on the wafer.
To identify and evaluate candidate materials, process engineers must analyze an enormous amount of data. Bulk properties like resistivity or thermal conductivity are a starting point, but these properties often change as feature sizes shrink. Device integration raises new questions, too, from surface interactions to long-term stability.
Materials discovery is often described as a funnel, with a large number of initial candidates gradually narrowing down to a handful of potential solutions. At each step, engineers need more detailed information about the candidate’s behavior. Some of this data already exists, either in the technical literature or in an organization’s own institutional knowledge. Some data can be calculated via simulations. And some data is really only obtainable through experimental studies.
Atomistic models: Filling the funnel

The first step, the foundation on which other tools rest, is atomistic, physics-based simulations. The most rigorous of these, density functional theory (DFT) models, seek to solve Schrödinger’s equation for the system of interest. Increases in computing power have made larger systems more approachable, but DFT calculations are still most appropriate for ideal crystals. Disordered systems like glasses, systems with symmetry-breaking defects, and thermal fluctuations — all of which are important aspects of real materials — are very computationally challenging for DFT methods.
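To make that starting point concrete, here is a minimal atomistic calculation using ASE, the widely used Atomic Simulation Environment. The cheap classical EMT calculator below is a stand-in chosen so the sketch runs anywhere; in a real screening workflow, the same interface would be attached to a DFT code.

```python
# A minimal sketch of an atomistic calculation with ASE. The EMT
# calculator is a fast classical stand-in; a production workflow would
# attach a DFT code through the same Atoms/calculator interface.
from ase.build import bulk
from ase.calculators.emt import EMT

# Build an ideal fcc copper crystal -- the kind of defect-free,
# periodic system that DFT-level methods handle well.
atoms = bulk("Cu", "fcc", a=3.6)
atoms.calc = EMT()

energy = atoms.get_potential_energy()  # total energy in eV
forces = atoms.get_forces()            # per-atom forces in eV/Angstrom

print(f"Total energy: {energy:.3f} eV")
print(f"Max force component: {abs(forces).max():.3e} eV/A")
```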
Michele Ceriotti, professor at École Polytechnique Fédérale de Lausanne, pointed to the development of machine-learned interatomic potentials (MLIPs) in the last decade as a significant breakthrough.[1] Given DFT simulations of key reference systems, a machine learning tool can interpolate the potential energy surface between them. For instance, such a model might start with an exact simulation of a perfect crystal and an exact simulation of the immediate neighborhood of a defect, and use an MLIP to examine defect-driven distortions of the potential energy surface. As the number of defects increases and the space between them decreases, how does the energy landscape change?
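As a toy illustration of the interpolation idea, the sketch below fits a Gaussian process to placeholder “DFT” reference energies and then predicts energies, with uncertainty estimates, at configurations that were never computed explicitly. Real MLIPs use symmetry-adapted descriptors (SOAP, ACE, and similar) and dedicated packages; the two-component descriptor and the synthetic energies here are purely hypothetical.

```python
# Toy MLIP sketch: regress reference energies against structural
# descriptors, then interpolate to unseen configurations. Descriptors
# and energies below are synthetic placeholders, not real DFT data.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Placeholder descriptors for reference structures (imagine strain and
# a defect-concentration coordinate) and their reference energies.
X_ref = rng.uniform(0.0, 1.0, size=(40, 2))
y_ref = np.sin(3 * X_ref[:, 0]) + 0.5 * X_ref[:, 1] ** 2  # synthetic "DFT" energies

model = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-6)
model.fit(X_ref, y_ref)

# Interpolate between the reference points; the uncertainty estimate
# flags regions where more DFT reference data would be needed.
X_new = rng.uniform(0.0, 1.0, size=(5, 2))
y_pred, y_std = model.predict(X_new, return_std=True)
for e, s in zip(y_pred, y_std):
    print(f"predicted energy {e:+.3f}  (+/- {s:.3f})")
```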
Imec principal researcher Geoffrey Pourtois said in a recent interview that large-scale efforts like the Materials Project have used atomistic simulations to build data libraries characterizing the fundamental properties of an enormous range of compounds. These libraries, in turn, support the fat end of the materials discovery funnel — initial screening.
Pourtois emphasized that a clearly formulated description of the problem to be solved is essential at this point. For example, if a team is trying to identify a “better” interlayer dielectric, what does that mean? A lower dielectric constant? Greater mechanical stability? Less interaction with a new metal being introduced to the process? A tool trained on something like the Materials Project dataset might offer hundreds, or even thousands of candidates with good dielectric properties, but probably only a few of them will meet the other requirements.
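A first screening pass over such a library can be as simple as a filtered table, once “better” has been pinned down to explicit criteria. The sketch below is purely illustrative: the column names, property values, and thresholds are hypothetical stand-ins for what a real query against something like the Materials Project dataset would return.

```python
# Hypothetical screening pass over a small table of candidate
# interlayer dielectrics. All values and thresholds are illustrative.
import pandas as pd

candidates = pd.DataFrame({
    "formula":             ["SiO2", "SiOC:H", "BN", "Al2O3", "polymer_X"],
    "dielectric_constant": [3.9,    2.8,      4.0,  9.0,     2.2],
    "youngs_modulus_GPa":  [70,     8,        40,   300,     3],
    "reacts_with_Ru":      [False,  False,    False, False,  True],
})

# "Better" must be defined precisely: here, k below 3.0 AND enough
# stiffness to survive downstream processing AND no reaction with a
# newly introduced metal (ruthenium in this hypothetical).
screened = candidates[
    (candidates["dielectric_constant"] < 3.0)
    & (candidates["youngs_modulus_GPa"] > 5)
    & (~candidates["reacts_with_Ru"])
]
print(screened[["formula", "dielectric_constant"]])
```

Of the five candidates, only one survives all three filters, which mirrors how quickly a list of good dielectrics shrinks once the secondary requirements are applied.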
Anders Blom, principal solutions engineer at Synopsys, noted that while atomistic simulations are limited in their ability to predict real-world behavior, their results can also provide parameters for higher-level tools. A list of candidate materials might include well-characterized materials that already are used in the application, alongside others for which there is little data beyond their basic properties. By combining atomistic simulations with whatever experimental data exists, a machine learning model can predict where candidates might fall relative to “known” materials. As experimental work proceeds, it can be used to further refine the model, and therefore the list of candidates.
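One way to picture that workflow is a surrogate model trained on the known materials, used to place the poorly characterized candidates relative to them, and retrained as measurements arrive. Everything below (the model choice, the descriptors, the numbers) is an illustrative assumption, not a description of any vendor’s implementation.

```python
# Sketch of ranking candidates against known materials and refining
# the model with new experimental results. All data is illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated descriptors for well-characterized materials, plus their
# measured figure of merit (e.g., resistivity at nanowire dimensions).
X_known = np.array([[1.0, 0.2], [0.8, 0.5], [0.6, 0.9]])
y_measured = np.array([5.2, 7.1, 9.8])

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_known, y_measured)

# Candidates with only simulated descriptors: predict where they fall
# relative to the known materials.
X_candidates = np.array([[0.9, 0.3], [0.7, 0.7]])
print(model.predict(X_candidates))

# As experiments on a candidate complete, fold the measurement back in
# and retrain, tightening the ranking of the remaining candidates.
X_known = np.vstack([X_known, X_candidates[:1]])
y_measured = np.append(y_measured, 6.0)
model.fit(X_known, y_measured)
```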
Large language models: Analyzing the literature

In addition to machine-readable databases like the Materials Project, a lot of materials information is stored in media designed to be read by humans. Technical journals and materials datasheets contain decades of experimental and theoretical results. Large-scale analysis of these sources is the domain of large language models (LLMs), said Surésh Rajaraman, executive vice president and head of the Thin Film business unit at EMD.
Unfortunately, using existing literature to support automated materials discovery is really two separate tasks. First, the model used to analyze the literature must be able to “understand” language generally. Words in close proximity, such as “thermal conductivity of silicon,” automatically convey a complete concept to a human, but recognizing that concept poses a monumental challenge for a machine.
Commercial LLMs develop their model of language by analyzing enormous datasets and incorporating billions or even trillions of parameters. Still, Xue Jiang and colleagues at the Institute for Advanced Materials and Technology, University of Science and Technology Beijing, observed that these general-purpose models cannot provide the specific, quantitative answers needed for materials discovery tasks.[2]
The next step, then, involves retraining a general-purpose model on a more focused, topic-specific database. Once that’s done, well-worded queries can find correlations that might not have been directly analyzed before. For example, Jiang and colleagues identified materials that appeared in context with words like “cathodes” and “electrochemical” and were also associated with thermoelectricity, but had not yet been analyzed for thermoelectric properties. (See figure 1, below.)

Figure 1: Using context analysis to identify candidate materials [Ref. 2].
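A stripped-down version of this kind of context analysis can be sketched with off-the-shelf word embeddings. The three toy “abstracts” and the similarity query below stand in for the much larger corpus and purpose-built models described in Ref. 2; with a corpus this small the scores are meaningless, but the mechanics are the same.

```python
# Toy context-analysis sketch: train word embeddings on a (tiny,
# synthetic) materials corpus, then rank tokens by similarity to
# "thermoelectric". Compounds that score highly without ever being
# measured for that property become screening candidates.
from gensim.models import Word2Vec

corpus = [
    "bi2te3 shows excellent thermoelectric performance and low thermal conductivity".split(),
    "snse exhibits a high thermoelectric figure of merit".split(),
    "cufes2 was studied as a cathode with promising electrochemical behavior".split(),
]

model = Word2Vec(
    sentences=corpus, vector_size=50, window=5,
    min_count=1, seed=0, workers=1,
)

for word, score in model.wv.most_similar("thermoelectric", topn=5):
    print(f"{word:12s} {score:+.3f}")
```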
Generative models: Designing new materials

All of the tools discussed so far deal with materials that already exist in the literature. Someone has either synthesized and analyzed them, or modeled them in software.
For precursors, complex oxides, and similar compounds, though, the library of possible materials extends well beyond such pre-existing datasets. Here, generative tools come to the fore. Given a set of desirable properties and a training set of materials that have those properties, a generative neural network attempts to find new materials that “belong” in the training set.
An evaluation tool, such as a simulator, tests the proposed materials and provides feedback to the generation tool, which in turn refines its model to produce better candidates. Such generated candidate materials can then be evaluated with the same simulation tools as “real” materials to decide which are worth actually synthesizing.[3]
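In skeletal form, that feedback loop is simple, even if production systems are not. The sketch below uses a toy numeric “composition” and a placeholder scoring function in place of a generative neural network and a physics-based evaluator; only the generate-evaluate-refine structure is the point.

```python
# Bare-bones generate-and-evaluate loop. The "generator" proposes
# composition vectors, a placeholder "simulator" scores them, and the
# best candidates pull the generator's distribution toward better
# proposals. Every function and number here is illustrative.
import numpy as np

rng = np.random.default_rng(0)

def evaluate(x):
    # Placeholder simulator: score peaks at a hidden target composition.
    target = np.array([0.2, 0.5, 0.3])
    return -np.sum((x - target) ** 2)

mean = np.full(3, 1 / 3)  # generator's current best guess
for generation in range(20):
    # Generate candidates near the current mean, renormalized so each
    # composition vector sums to 1.
    pop = np.abs(mean + 0.05 * rng.standard_normal((30, 3)))
    pop /= pop.sum(axis=1, keepdims=True)
    scores = np.array([evaluate(x) for x in pop])
    # Feedback: refit the generator on the best-scoring candidates.
    elite = pop[np.argsort(scores)[-5:]]
    mean = elite.mean(axis=0)

print("best composition found:", np.round(mean, 3))
```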
It’s difficult to overstate the degree to which machine learning tools and GPU hardware are changing how materials discovery is done. Without these tools, few researchers would even consider evaluating hundreds, much less thousands, of candidate compounds. With them, the initial evaluation of a candidate material might only take a few milliseconds. Simulations that took weeks or months can be completed during a lunch break or overnight, even with relatively modest computing hardware.
Nonetheless, as a material progresses from initial screening to process integration, experimental studies become more and more important. For new materials, Pourtois explained, there simply isn’t enough data to build a process model or a design-technology co-optimization (DTCO) model. Process integration schemes inherently involve more variables, increasing the number of details a full simulation must capture. AI tools cannot replace experimentalists, but they do allow more focused examination of a narrower range of candidates.
References

[1] Ceriotti, M., “Beyond potentials: Integrated machine learning models for materials,” MRS Bulletin 47, 1045–1053 (2022). https://doi.org/10.1557/s43577-022-00440-0
[2] Jiang, X., Wang, W., Tian, S., et al., “Applications of natural language processing and large language models in materials discovery,” npj Comput Mater 11, 79 (2025). https://doi.org/10.1038/s41524-025-01554-0
[3] Pyzer-Knapp, E.O., Pitera, J.W., Staar, P.W.J., et al., “Accelerating materials discovery using artificial intelligence, high performance computing and robotics,” npj Comput Mater 8, 84 (2022). https://doi.org/10.1038/s41524-022-00765-z