Accurate and cost-effective models are the bottleneck to digitizing materials development. At the atomic scale, scientists are currently limited to choosing between density functional theory (DFT), which is accurate and generalizable but expensive, and classical interatomic potentials, which are inexpensive but not generalizable. While DFT can be used for materials across the periodic table, classical interatomic potentials are specialized and developed for specific materials. The same paradigm exists at larger length scales, creating a set of trade-offs for computational methods that must balance accuracy, generalizability, and cost.
Recently, machine learning models have emerged as a third option alongside classical interatomic potentials and DFT. At the atomic scale, machine learning interatomic potentials (MLIPs) trained on DFT data are significantly more accurate than classical interatomic potentials while being orders of magnitude less expensive than DFT. MLIPs, just like their classical counterparts, need to be trained on datasets that cover the material properties to be predicted. Initial MLIPs trained on open-source datasets have been unable to accurately capture common materials properties, which highlights the sparsity of those datasets: they cover only crystalline, bulk materials at low temperatures. Simulating surfaces, interfaces, or finite temperatures with such models is risky, because outside their training data the models can hallucinate, yielding incorrect property predictions.
At Physics Inverted Materials, we believe that to benefit from the increased accuracy and decreased cost of MLIPs, we have to create bespoke datasets that sufficiently cover the desired region of materials phase space. This is conventionally considered infeasible because chemical space is high-dimensional and effectively infinite. We therefore needed an automated way of generating datasets: an active learning algorithm that builds out the dataset on request. To make this practical, the algorithm must be fast and data efficient enough to overcome the typical MLIP development time of more than six months. The benefit is clear: with active learning, every material property we predict is backed by bespoke data relevant to that property.
Over the past year, we have made our active learning algorithm more data efficient and significantly faster so that we can scale to the diverse properties our customers need. The graphs below show the improvement on a benchmark learning task. The task starts from a seed dataset of 10 consecutive frames from an ab initio molecular dynamics simulation of diamond-cubic silicon; the algorithm then expands the dataset until a calculation at 3000 K is accurate. Although silicon is a simple material, this is a challenging task: the model must remain accurate as silicon goes through a phase transformation from solid to liquid.
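For readers unfamiliar with active learning, the general idea of growing a dataset from a small seed can be illustrated with a toy example. The sketch below is not our production algorithm; it is a generic committee-disagreement loop under stated assumptions. The function `reference_energy` is a hypothetical one-dimensional stand-in for a DFT calculation, the committee is a set of small ridge-regularized polynomial fits on bootstrap resamples, and the names, pool size, budget, and tolerance are all chosen purely for illustration. As in the benchmark, the loop starts from 10 consecutive samples and labels only the candidates the committee is least sure about.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander

HALF_WIDTH = 2.0  # candidate pool spans [-HALF_WIDTH, HALF_WIDTH] (assumption)


def reference_energy(x):
    """Hypothetical stand-in for an expensive reference (e.g. DFT) calculation."""
    return np.sin(3.0 * x) + 0.5 * x**2


def committee_predict(x_train, y_train, x_query, rng,
                      n_models=8, degree=6, ridge=1e-3):
    """Predict with a committee of ridge-regularized polynomial fits.

    Each member is fit to a bootstrap resample of the labeled data, so the
    members disagree most where the data constrain the fit the least.
    """
    a_query = chebvander(x_query / HALF_WIDTH, degree)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x_train), size=len(x_train))
        a = chebvander(x_train[idx] / HALF_WIDTH, degree)
        # Ridge term keeps the normal equations well posed for tiny datasets.
        w = np.linalg.solve(a.T @ a + ridge * np.eye(degree + 1),
                            a.T @ y_train[idx])
        preds.append(a_query @ w)
    return np.array(preds)  # shape: (n_models, len(x_query))


def run_active_learning(pool, seed_idx, budget=40, tol=0.05, seed=0):
    """Grow the labeled set until the committee agrees everywhere in the pool."""
    rng = np.random.default_rng(seed)
    labeled = list(seed_idx)
    energies = [reference_energy(pool[i]) for i in labeled]
    for _ in range(budget):
        preds = committee_predict(pool[labeled], np.array(energies), pool, rng)
        sigma = preds.std(axis=0)   # committee disagreement per candidate
        sigma[labeled] = 0.0        # never re-label an existing point
        worst = int(np.argmax(sigma))
        if sigma[worst] < tol:      # committee agrees everywhere: stop
            break
        labeled.append(worst)       # "run the reference calculation" here
        energies.append(reference_energy(pool[worst]))
    return labeled


pool = np.linspace(-HALF_WIDTH, HALF_WIDTH, 200)
selected = run_active_learning(pool, seed_idx=range(10))  # 10 consecutive seeds
print(f"labeled {len(selected)} of {len(pool)} candidates")
```

The design point this illustrates is that the expensive reference calculation is run only where the current models disagree, which is what allows a dataset to stay small while still covering the conditions of interest.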
On this benchmark, we have reduced the data requirement from over 3,000 frames to just 32, and the generation time from more than two weeks to under half a day. These represent reductions of roughly 99% in dataset cost and 97% in generation time.
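As a quick sanity check on those percentages, the short snippet below computes the fractional reductions from the boundary values of the quoted figures (exactly 3,000 frames, 14 days, and 0.5 days). Because the text says "over", "more than", and "under", these boundary values are an assumption and the results are lower bounds on the actual reductions.

```python
def fractional_reduction(before, after):
    """Fraction by which a quantity shrank going from `before` to `after`."""
    return 1.0 - after / before


# Boundary values taken from the text; the true reductions are at least this large.
data_reduction = fractional_reduction(3000, 32)    # frames
time_reduction = fractional_reduction(14.0, 0.5)   # days

print(f"data: at least {data_reduction:.1%}")
print(f"time: at least {time_reduction:.1%}")
```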
If you are interested in learning more about our active learning technology and how our accurate MLIPs can be used to digitally test your materials properties, reach out to info@phinmaterials.com.