Review | Publishing Language: Chinese

A Data Quality and Quantity Governance for Machine Learning in Materials Science

Yue LIU (1,2), Shuchang MA (1), Zhengwei YANG (1), Xinxin ZOU (1), Siqi SHI (3,4)
1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
2. Shanghai Engineering Research Center of Intelligent Computing System, Shanghai 200444, China
3. School of Materials Science and Engineering, Shanghai University, Shanghai 200444, China
4. Materials Genome Institute, Shanghai University, Shanghai 200444, China

Abstract

Data-driven machine learning is widely used in materials property prediction and structure-activity relationship research because it predicts accurately and efficiently. Data determine the upper limit of machine learning. However, materials data often suffer from quality and quantity problems (i.e., multiple sources, high noise, small samples, and high dimensionality) that hinder the application of machine learning in the materials field. In this paper, by analyzing these data quality and quantity problems and the related governance work, we find that data quality and data quantity jointly determine this upper limit. On this basis, we propose a data quality and quantity governance framework that embeds materials domain knowledge throughout the entire materials machine learning process. We define twelve dimensions to characterize what materials data quality and quantity mean. A life cycle model of data quality and quantity governance is constructed to ensure that governance activities are carried out in an orderly manner. To manage data quality and quantity accurately and comprehensively, a series of corresponding governance processing models are established from both domain-knowledge and data-driven perspectives, providing technical support for the concrete implementation of the life cycle model. The framework enables overall evaluation and improvement of materials data quality and quantity, offers theoretical guidance and candidate solutions for acquiring data of high quality and appropriate quantity, and accelerates the in-depth application of machine learning in materials research and development.

CLC number: TP181; TB3 Document code: A Article ID: 0454-5648(2023)02-0427-11


Journal of the Chinese Ceramic Society
Pages 427-437
Cite this article:
LIU Y, MA S, YANG Z, et al. A Data Quality and Quantity Governance for Machine Learning in Materials Science. Journal of the Chinese Ceramic Society, 2023, 51(2): 427-437. https://doi.org/10.14062/j.issn.0454-5648.20220991