Eliminating hallucinations in machine learning models

Digitizing materials development requires materials models that can predict materials behavior accurately, quickly and cheaply. Digital models have not been able to meet all three criteria at the same time, leaving experimental development as the status quo. Physics Inverted Materials’ novel approach to digital modeling embeds efficient machine learning models into physics-based simulations, breaking existing trade-offs and simultaneously increasing the accuracy and decreasing the time and cost of predicting materials behavior.

A common scenario where current digital tools are inadequate is the prediction of a new material property. Imagine you have just measured the elastic constants of silicon and now want to know the melting temperature. Or you have done both and now want to know the surface energy. Quantum mechanics simulations can predict all of these properties accurately but take months and cost thousands of dollars. Classical interatomic potentials are significantly faster and cheaper but will likely give inaccurate property estimates.

Machine learning interatomic potentials (MLIPs) provide an alternative to quantum mechanics and classical interatomic potentials. MLIPs can be trained from quantum mechanics data, promising comparable accuracy at 10,000 times lower cost. However, as with all machine learning models, their accuracy can decrease significantly when they evaluate data dissimilar from their training dataset, leading to hallucinations and inaccurate results. This limits their use in the exact scenario described above.

Physics Inverted Materials has developed uncertainty quantification tools that eliminate hallucinations from machine learning interatomic potentials and ensure the accurate prediction of material properties. Here we show how we are able to accurately estimate material properties and compare them to pretrained foundation models. We show that while pretrained foundation models hallucinate, PHIN is able to accurately predict material behavior quickly and at low cost.

From elastic constants and melting temperature to vacancies and surfaces

In our previous blog post, we showed how pretrained foundation models were less accurate than our models at predicting the elastic constants – a mechanical property – and the melting temperature – a thermodynamic property. This suggests that the open-source datasets they were trained on did not cover the appropriate properties while our datasets contained many more relevant structures. However, moving to new properties can introduce similar effects, requiring additional data to eliminate hallucinations. Because PHIN-atomic predicts uncertainty along with energies, forces, and stresses, it is able to identify when it is extrapolating and when it is interpolating, adding data as needed. This ensures the accuracy of the machine learning model predictions and eliminates the worry of having to validate the performance. See our previous blog post for further description of the active learning algorithm.
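The idea of using predicted uncertainty to separate interpolation from extrapolation can be illustrated with a small sketch. This post does not describe how PHIN-atomic computes its uncertainty, so the example below swaps in a common stand-in: the disagreement (standard deviation) across a small committee of models. The toy models, the one-number "descriptor" per structure, the structure names, and the threshold are all hypothetical and chosen only for illustration.

```python
import statistics

def committee_uncertainty(predictions):
    """Spread of the committee's predictions (population standard deviation)."""
    return statistics.pstdev(predictions)

def select_for_labeling(structures, committee, threshold):
    """Return the names of structures where the committee disagrees by more
    than `threshold`, i.e. where the model is likely extrapolating."""
    flagged = []
    for name, descriptor in structures:
        predictions = [model(descriptor) for model in committee]
        if committee_uncertainty(predictions) > threshold:
            flagged.append(name)
    return flagged

# Hypothetical committee: three toy models that agree exactly at
# descriptor = 1.0 (the "training region") and drift apart away from it.
committee = [
    lambda x: -5.40 + 0.10 * (x - 1.0),
    lambda x: -5.40 + 0.30 * (x - 1.0),
    lambda x: -5.40 - 0.20 * (x - 1.0),
]

# Hypothetical structures, each reduced to a single made-up descriptor value.
structures = [
    ("bulk_like", 1.0),
    ("slightly_strained", 1.02),
    ("surface_like", 1.5),
]

flagged = select_for_labeling(structures, committee, threshold=0.01)
print(flagged)  # only the structure far from the training region is flagged
```

In an active learning loop, the flagged structures would be the ones sent for reference calculations and added to the training set before the simulation continues.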

To show how PHIN-atomic differs in predicting material performance, we predict two more properties, vacancy energy and surface energy, and compare our results to DFT, classical interatomic potentials, and a pretrained foundation model. We also compare against the machine learning model used to predict the melting temperature in the first blog post, which is labeled “Pretrained Bulk”.

Vacancy Energy

Vacancies are present in all crystalline materials and impact properties from band gaps to diffusion coefficients. In silicon, vacancies decrease the semiconducting performance and are painstakingly minimized during manufacturing. The natural prevalence of vacancies can be determined by the energy required to create a vacancy. Calculating the zero Kelvin vacancy energy is a common practice using DFT because it only requires minimizing the energy of the atomic configuration. At higher temperatures, however, an interatomic potential is needed to perform molecular dynamics simulations.

The figure below shows the energy of multiple structures with different vacancy concentrations, along with a best-fit linear model for the total energy versus the number of atoms. We show results from DFT, classical interatomic potentials, PHIN’s model, a pretrained foundation model, and the Pretrained Bulk model. The vacancy energy calculated from PHIN-atomic agrees almost perfectly with DFT, whereas the pretrained foundation model underestimates the DFT vacancy energy and the classical model overestimates it. The Pretrained Bulk model agrees better with DFT than the pretrained foundation model does. This can most likely be attributed to the diversity of atomic environments in the melted silicon the model was trained on, which is not present in the foundation model dataset.
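One way to read a vacancy energy off such a linear fit is sketched below. It assumes, beyond what this post states, that every structure uses the same supercell with a fixed number of lattice sites and that the vacancies do not interact. Under those assumptions the total energy of a cell with N atoms is E(N) = N * e_bulk + (N_sites - N) * E_vac, so the fitted intercept divided by the number of sites gives the vacancy energy. The numbers are synthetic and are not silicon results.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def vacancy_energy_from_fit(n_atoms, total_energies, n_sites):
    """Vacancy energy from a fit of total energy versus atom count.

    Assumes a fixed supercell of `n_sites` lattice sites and non-interacting
    vacancies, so that E(N) = N * (e_bulk - E_vac) + n_sites * E_vac.
    """
    _slope, intercept = linear_fit(n_atoms, total_energies)
    return intercept / n_sites

# Synthetic inputs (made-up values in eV, for illustration only).
N_SITES = 64
E_BULK = -5.0   # hypothetical bulk energy per atom
E_VAC = 3.5     # hypothetical vacancy energy used to generate the data

n_atoms = [64, 63, 62, 61]  # zero to three vacancies in the same supercell
total_energies = [n * E_BULK + (N_SITES - n) * E_VAC for n in n_atoms]

e_vac = vacancy_energy_from_fit(n_atoms, total_energies, N_SITES)
print(f"fitted vacancy energy: {e_vac:.3f} eV")
```

Fitting several vacancy concentrations rather than differencing a single pair of structures averages out noise in the individual energies, which matters more once the energies come from finite-temperature simulations.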

The near-perfect agreement of PHIN with DFT highlights the power of PHIN’s software to reproduce DFT quantities, and the improvement of PHIN over its previous checkpoint (Pretrained Bulk) highlights the efficacy of PHIN’s active learning and uncertainty quantification.

PHIN 0K vacancy energy is indistinguishable from DFT

By showing the accuracy of PHIN’s models on zero Kelvin data, we establish confidence in the model’s ability to predict properties at the same accuracy as DFT. When we transition to simulating the vacancy energy at room temperature, we can therefore trust the values output by PHIN-atomic, even though we cannot run the same simulation with DFT because of its high cost.

When we move to 300K in the figure below, the discrepancies in predictions demonstrate exactly why there is often hesitation in using digital predictions for evaluating material behavior. Without assurances of model accuracy, it is difficult to determine which value to trust. With PHIN, however, we assure the accuracy of the model with uncertainty quantification such that the model predictions are always at quantum mechanical accuracy, eliminating the issue of choosing an accurate model.

300K Vacancy Energy predicted with different models

Surface Energy

Up to now, all the simulations have assumed the silicon repeats infinitely in space. This is a good approximation most of the time, since crystalline materials are periodic and repeat the same structure. In real life, however, materials end, terminating in particular surfaces. The crystalline nature of many materials means that there are a few preferential surface terminations with significantly lower energies. The lowest energy surface termination normally dictates the structure observed experimentally as well as how the material is faceted. For silicon, the diamond cubic crystal structure has two common surface terminations, defined by crystal lattice planes (1, 0, 0) and (1, 1, 1), the latter of which has lower energy and is therefore experimentally observed.

We simulate the (1, 1, 1) surface at 0K and 300K to show the accuracy of PHIN, classical, and pretrained foundation models relative to DFT. In the figure below, the PHIN and DFT data are nearly indistinguishable, while all the other models significantly underestimate the surface energy. This quality of agreement is only possible because PHIN ensures the accuracy of not just the final structure but of all the structures throughout the relaxation, which is necessary to arrive at the correct final structure.
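For readers unfamiliar with how a surface energy is obtained from a simulation, the standard slab definition is sketched below: the excess energy of a slab relative to the same number of bulk atoms, divided by the total exposed area. The sketch assumes a symmetric slab with two identical surfaces, which is why the area is doubled. The post does not state which slab setup was used, and the input numbers are made up.

```python
# Unit conversion: 1 eV/Å^2 = 1.602176634e-19 J / 1e-20 m^2.
EV_PER_A2_TO_J_PER_M2 = 16.02176634

def surface_energy(e_slab, n_atoms, e_bulk_per_atom, area):
    """Surface energy in eV/Å^2 for a slab with two identical surfaces.

    e_slab          -- total energy of the slab (eV)
    n_atoms         -- number of atoms in the slab
    e_bulk_per_atom -- bulk reference energy per atom (eV)
    area            -- area of one surface (Å^2)
    """
    excess = e_slab - n_atoms * e_bulk_per_atom
    return excess / (2.0 * area)

# Hypothetical inputs, for illustration only (not silicon results).
gamma = surface_energy(e_slab=-238.0, n_atoms=48, e_bulk_per_atom=-5.0, area=12.5)
print(f"{gamma:.3f} eV/Å^2 = {gamma * EV_PER_A2_TO_J_PER_M2:.3f} J/m^2")
```

Because the result is a small difference between two large energies, modest per-atom errors in either the slab or the bulk reference translate into large relative errors in the surface energy.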

PHIN 0K (1, 1, 1) surface energy is indistinguishable from DFT

Moving from 0K to 300K introduces kinetic energy to the silicon atoms. The kinetic energy allows the atoms to traverse the potential energy surface instead of remaining in the ground state configuration.

The introduction of entropy can have a large effect on the energetics of the system. The energies are now computed as free energies by taking an ensemble average from an NVT simulation. The ensemble average thermodynamically weights the diverse structures sampled during the simulation. The increase in temperature decreases all the surface energies due to the configurational entropy of the surface reconstructions. Interestingly, the ordering of the different methods changes as well, with PHIN now falling between the classical and bulk models.
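The mechanics of an ensemble average can be sketched as follows: discard an initial equilibration window, then average the remaining samples, using block averaging to get an error bar that is less sensitive to correlation between consecutive steps. This is a generic illustration, not PHIN's workflow. The energy series is synthetic, and the equilibration length and block count are arbitrary choices.

```python
import math
import random
import statistics

def ensemble_average(samples, n_equil, n_blocks):
    """Mean and standard error of a time series via block averaging.

    The first `n_equil` samples are discarded as equilibration; the rest are
    split into `n_blocks` equal blocks whose means are then averaged.
    Trailing samples that do not fill a whole block are ignored.
    """
    production = samples[n_equil:]
    block_len = len(production) // n_blocks
    block_means = [
        statistics.fmean(production[i * block_len:(i + 1) * block_len])
        for i in range(n_blocks)
    ]
    mean = statistics.fmean(block_means)
    stderr = statistics.stdev(block_means) / math.sqrt(n_blocks)
    return mean, stderr

# Synthetic "trajectory": a decaying transient plus noise around -320 eV.
random.seed(0)
energies = [
    -320.0 + 5.0 * math.exp(-step / 50.0) + random.gauss(0.0, 0.2)
    for step in range(2000)
]

mean_energy, error = ensemble_average(energies, n_equil=500, n_blocks=10)
print(f"<E> = {mean_energy:.3f} +/- {error:.3f} eV")
```

The same averaging would be applied to both the defective (or slab) system and the bulk reference before taking their difference.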

300K (1, 1, 1) Surface Energy predicted with different models

The significant changes in the surface energy at different temperatures highlight the importance of using an accurate and trustworthy interatomic potential. Since DFT simply cannot be used to compute non-zero temperature properties due to the excessive cost of the calculations, PHIN’s use of uncertainty quantification ensures we can trust MLIP predictions and their agreement with DFT. PHIN’s models are the only ones that enable us to accurately model surfaces across temperatures at DFT accuracy.

Conclusion

Digitizing materials development requires accurate, timely, and inexpensive prediction of materials behavior. While DFT is accurate, properties such as room temperature vacancy energies or surface energies can take months and thousands of dollars to evaluate. Classical interatomic potentials can compute these properties in minutes, but have significantly lower accuracy due to their strict functional forms. PHIN Materials’ use of uncertainty quantification eliminates hallucinations of machine learning interatomic potentials by ensuring that the machine learning models are always interpolating within their training datasets. We can therefore ensure the machine learning models always produce DFT-accurate data at 10,000x lower cost.

PHIN-atomic builds a foundation model by expanding the dataset the machine learning model is trained on as more material properties are calculated. The impact of active learning is highlighted by comparing the final model from our previous blog post – the Bulk checkpoint – to the dynamic model used within PHIN-atomic. The material behavior prediction is significantly more accurate due to fine-tuning with structures identified during the vacancy and surface energy evaluations. As with other foundation models, fine-tuning is (relatively) inexpensive, in our case taking less than a day to refit. PHIN’s models therefore satisfy all requirements to digitize materials development: PHIN accurately predicts materials behavior, returns predictions in a timely manner, and is significantly less expensive than DFT.