Review Article | Open Access

Pretreating and normalizing metabolomics data for statistical analysis

Division of Gastroenterology and Hepatology, Department of Medicine, Department of Microbiology/Immunology, UIC Cancer Center, University of Illinois Chicago, Jesse Brown VA Medical Center Chicago (537), Chicago, IL 60612, USA
Division of Gastroenterology and Hepatology, Department of Medicine, University of Illinois Chicago, Chicago, IL 60612, USA

Peer review under responsibility of Chongqing Medical University.


Abstract

Metabolomics, as both a research field and a set of techniques, studies the complete set of small molecules in biological samples. It is emerging as a powerful tool for precision medicine. In particular, integrating microbiome and metabolome data has helped reveal the mechanisms and functions of the microbiome in human health and disease. However, metabolomics data are complex, and preprocessing (pretreatment) and normalization are usually required before statistical analysis. In this review article, we survey the methods used to preprocess and pretreat metabolomics data: preprocessing of MS-based and NMR-based data, handling of zero and/or missing values, outlier detection, data normalization, data centering and scaling, and data transformation. We discuss the advantages and limitations of each method. The choice of a suitable preprocessing method depends on the biological hypothesis, the characteristics of the data set, and the statistical analysis method selected. We then offer a perspective on the application of these methods in microbiome and metabolome research.
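As a purely illustrative sketch, three of the steps named in the abstract (handling zero or missing values, transformation, and centering with scaling) could be chained for a single metabolite as shown below. The half-minimum substitution, the log2 transform, and unit-variance scaling are assumptions chosen for the example, not a pipeline prescribed by the review, and the function names and intensity values are hypothetical.

```python
import math
import statistics


def impute_half_min(values):
    """Replace missing (None) or zero intensities with half the smallest
    positive value observed for this metabolite (an assumed, simple rule)."""
    positives = [v for v in values if v is not None and v > 0]
    floor = min(positives) / 2.0
    return [v if (v is not None and v > 0) else floor for v in values]


def log_transform(values):
    """Log2-transform strictly positive intensities to reduce right skew."""
    return [math.log2(v) for v in values]


def autoscale(values):
    """Mean-center and divide by the sample standard deviation (n - 1).
    Fails with a division by zero if all values are identical."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def pretreat(values):
    """Chain the three steps: impute, transform, then center and scale."""
    return autoscale(log_transform(impute_half_min(values)))


# Hypothetical intensities for one metabolite across five samples.
raw = [120.0, 0.0, 480.0, None, 240.0]
scaled = pretreat(raw)
```

After this treatment the values for the metabolite have mean zero and unit sample standard deviation, so metabolites measured on very different intensity scales contribute comparably to a downstream multivariate analysis. Whether that is desirable depends on the hypothesis and the analysis method, as the abstract notes.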


Genes & Diseases
Article number: 100979
Cite this article:
Sun J, Xia Y. Pretreating and normalizing metabolomics data for statistical analysis. Genes & Diseases, 2024, 11(3): 100979. https://doi.org/10.1016/j.gendis.2023.04.018


Received: 12 October 2022
Accepted: 09 April 2023
Published: 07 July 2023
© 2023 The Authors.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
