Data-driven machine learning (ML) is widely used in materials performance optimization and novel materials design because it can quickly fit latent patterns in data and make accurate predictions. However, the results of data-driven ML are often inconsistent with the basic theories and principles of materials science, mainly because the models lack guidance from materials domain knowledge, e.g., the correlations among descriptors and the mechanisms that drive the target properties. Herein, by analyzing the characteristics of materials data and the modeling principles of data-driven ML methods, we identify three main contradictions that arise when ML is applied in materials science: between high dimensionality and small sample size, between model complexity and model accuracy, and between learning results and domain knowledge. To reconcile these three contradictions, we propose an ML method embedded with materials domain knowledge. Further, across the whole ML process, including target definition, data collection and preprocessing, feature engineering, and model construction and application, we explore key techniques for embedding domain knowledge by summarizing related fundamental and exploratory work. Finally, the opportunities and challenges facing ML embedded with domain knowledge are discussed.
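As a minimal illustration of what embedding domain knowledge in the feature engineering step might look like, the Python sketch below prunes descriptors that an expert has flagged as mutually correlated, keeping only the one with the highest data-driven relevance score in each group. It is not the method proposed in this work: the function name `prune_descriptors`, the descriptor names, the scores, and the correlated group are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def prune_descriptors(
    relevance: Dict[str, float],
    correlated_groups: List[Set[str]],
) -> List[str]:
    """Keep one descriptor per expert-defined group of correlated descriptors.

    relevance: hypothetical data-driven score per descriptor (higher is better).
    correlated_groups: hypothetical expert knowledge, each set naming
        descriptors assumed to carry overlapping information.
    """
    grouped = set().union(*correlated_groups) if correlated_groups else set()
    # Descriptors not covered by any expert rule are kept unchanged.
    kept = [name for name in relevance if name not in grouped]
    for group in correlated_groups:
        candidates = [name for name in group if name in relevance]
        if candidates:
            # Within a redundant group, keep the highest-scoring descriptor.
            kept.append(max(candidates, key=lambda name: relevance[name]))
    return sorted(kept)


# Hypothetical relevance scores, e.g. from any data-driven ranking.
relevance = {
    "ionic_radius": 0.62,
    "atomic_volume": 0.58,
    "electronegativity": 0.40,
    "valence": 0.35,
}
# Hypothetical expert rule: radius and volume are treated as redundant.
correlated_groups = [{"ionic_radius", "atomic_volume"}]

selected = prune_descriptors(relevance, correlated_groups)
print(selected)
```

Reducing the descriptor set in this way is one simple route to easing the tension between high dimensionality and small sample size, since the expert rule removes redundancy that a small dataset alone may not reveal reliably.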