Implementation of Metabolomic Data Normalization Strategies
Dmitry Grapov, PhD
Summary
Five normalization methods were compared, of which the combination of
qc-LOESS and cubic splines showed the best performance based on within-batch and
between-batch variable relative standard deviations (RSDs) for QCs. This approach was used to
normalize sample measurements, the results of which were analyzed using principal
components analysis. Based on this analysis, an unknown source of variance was
identified among the samples (batches ~1-7 versus 8-25), which was absent from the QC
samples and was concluded to stem from biological variability due to the experimental
design.
Results
The complete data set, acquired over a one-year period (3/6/2013 to
2/20/2014), consisted of 1262 measurements of 319 variables. Analytical variance over
the duration of the data acquisition was estimated based on 105 evenly interspersed
quality control (QC) samples (1:10 QC/samples). To aid the overview of temporal trends,
the full data acquisition time was segmented into 1-3 day increments, giving 25 batches
(median samples per batch, 53; range, 13 to 84).
QC samples were used to evaluate five common data normalization procedures:
quantile, cubic splines, cyclic LOESS [1], batch ratio and (qc-)LOESS [2]. Normalization
performance was assessed based on within-batch (Figure 1A) and between-batch
(Figure 1B and C) variable relative standard deviations (RSD) of QC samples. The qc-LOESS
approach, which is a modification of the LOESS procedure (Figure 2), displayed the best
performance for QC samples (median batch RSD, 30%; range, 20-42%; raw data, 35%; range,
19-51%), with 78% of normalized variables showing RSD<40% compared to 65% for raw
data. However, 113 variables (35%) displayed inconsistent trends between the qc-LOESS
model training and test sets and were identified as inappropriate for qc-LOESS
normalization. These variables were instead normalized using the cubic splines method,
which does not require a similar consistency criterion and showed the second best
performance for QC samples (median batch RSD, 31%; range, 18-44%; 77% of variables
with RSD<40%). The combination of qc-LOESS and cubic splines normalizations was
shown to improve data quality by reducing within-batch and between-batch analytical
variance (Figure 3).
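
The RSD summaries quoted above can be illustrated with a minimal sketch. It assumes RSD is the sample standard deviation divided by the mean, expressed as a percentage; the variable names and intensities are hypothetical and are not taken from the study data.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation (sample SD / mean) as a percentage."""
    return 100.0 * stdev(values) / mean(values)


def fraction_below(rsds, threshold=40.0):
    """Share of variables whose RSD is below the threshold."""
    return sum(r < threshold for r in rsds) / len(rsds)


# Hypothetical QC intensities: one list of QC injections per variable.
qc_intensities = {
    "var_a": [100.0, 110.0, 90.0, 105.0],
    "var_b": [50.0, 120.0, 20.0, 90.0],
}

rsds = {name: rsd_percent(vals) for name, vals in qc_intensities.items()}
share_ok = fraction_below(list(rsds.values()))
```

Applied to all 319 variables, a function like `fraction_below` would give the kind of "percent of variables with RSD<40%" figure reported for each normalization method.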
Principal components analysis was used to evaluate raw and normalized QC and
sample measurements for batch effects (Figure 4). Raw QC data displayed slight
differences between batches 1-7 and 8-25 (Figure 4A, red points), which were removed
after normalization (Figure 4B). However, both raw and normalized samples displayed a
large mode of variance between samples in batches ~1-7 and all other batches
(Figure 4C and D). After confirming that this trend was not due to the biological design of
the study, based on evaluation of the same samples measured by an orthogonal
metabolomic platform (LC-Q-TOF), a semi-supervised approach of model-based
clustering was used to define the members of the unique modes of variance. A linear
model was used to adjust the normalized data based on the clusters defined by
model-based clustering (Figure 5).
Methods
Principal components analysis (PCA) on autoscaled data was used to overview
raw and normalized data and QC sample variance based on acquisition batch, and
to identify one outlier QC sample (Bio Rec 94), which was removed from all further analyses
(Figure 6). Quantile, cubic splines and cyclic LOESS normalizations were implemented
without cross-validation [1]. Within- and between-batch RSDs were calculated based on
batch and aggregated medians. Batch ratio (BR) and qc-LOESS were implemented using
cross-validation, where 2/3 of the QC samples were used to train the model, which was then
applied to the remaining 1/3 of the data. For consistency with the other normalization
methods, performance is reported for the combined training and test sets.
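
The 2/3 training and 1/3 test partition of the QC samples described above could be sketched as follows. This is an illustration only: the text does not state how QCs were assigned to each set, so the seeded random assignment, the function name `split_qcs`, and the count of 104 QCs (105 less the one outlier) are all assumptions.

```python
import random


def split_qcs(qc_ids, train_fraction=2 / 3, seed=0):
    """Randomly assign QC sample IDs to training and test sets (assumed scheme)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(qc_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


# Hypothetical QC identifiers 1..104.
train_ids, test_ids = split_qcs(range(1, 105))
```

A normalization model would then be fit on `train_ids` only and evaluated on `test_ids` before reporting performance on the combined set.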
BR normalization implements a batch-specific correction factor for each
variable, calculated as the ratio of the within-batch to the study-wide variable
medians. The qc-LOESS normalization is an adaptation of the LOESS normalization which
uses QC samples, but also includes a step to determine if the LOESS-based normalization
is applicable to the data by testing the correlation between LOESS models for the
training and test sets (cubic splines interpolated). LOESS model span was selected using
leave-one-out cross-validation on the training data. Variables inappropriate for the
qc-LOESS normalization were instead normalized by the cubic splines method. Cubic splines
normalization displayed the best performance of all algorithms for variables with
intensities < 1000, but displayed slightly higher RSD compared to no normalization for
variables with intensities > 1000 (Figure 6). The combination of qc-LOESS and cubic splines
was used to fully normalize the dataset, but variables with intensities > 1000 and
showing poor cubic splines performance could instead be presented as raw
(non-normalized) data.
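
The batch ratio correction described above can be sketched for a single variable as follows. The factor follows the stated definition (within-batch median divided by the study-wide median); computing the medians from QC samples and applying the factor by division are assumptions, and the function names and values are hypothetical rather than taken from the Devium package.

```python
from statistics import median


def batch_ratio_factors(qc_values, qc_batches):
    """Per-batch factor: within-batch QC median / study-wide QC median."""
    study_median = median(qc_values)
    by_batch = {}
    for value, batch in zip(qc_values, qc_batches):
        by_batch.setdefault(batch, []).append(value)
    return {b: median(v) / study_median for b, v in by_batch.items()}


def apply_batch_ratio(values, batches, factors):
    """Divide each measurement by its batch factor (assumed application step)."""
    return [v / factors[b] for v, b in zip(values, batches)]


# Hypothetical single variable measured in QCs from two batches.
qc_values = [100.0, 110.0, 90.0, 200.0, 220.0, 180.0]
qc_batches = [1, 1, 1, 2, 2, 2]

factors = batch_ratio_factors(qc_values, qc_batches)
normalized = apply_batch_ratio(qc_values, qc_batches, factors)
```

After this correction both batches share the study-wide median, which is the intended effect of a per-batch scaling factor.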
Model-based clustering was carried out using Bayesian information criterion
(BIC)-optimized Gaussian finite mixture models, fit by EM with hierarchical clustering
initialization [3]. The best two-cluster model was selected based on BIC.
Analyte-specific linear models were used to adjust sample means based on the model-based
cluster memberships.
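
The cluster-based mean adjustment can be illustrated for one analyte with a minimal sketch. It assumes the linear model contains cluster membership as its only (categorical) term, in which case the model residuals are the deviations from each cluster mean; adding the overall mean back to restore the original scale is a further assumption. The function name and values are hypothetical.

```python
from statistics import mean


def adjust_for_cluster(values, clusters):
    """Remove cluster mean differences: deviation from cluster mean + overall mean."""
    overall = mean(values)
    groups = {}
    for v, c in zip(values, clusters):
        groups.setdefault(c, []).append(v)
    cluster_means = {c: mean(vs) for c, vs in groups.items()}
    return [v - cluster_means[c] + overall for v, c in zip(values, clusters)]


# Hypothetical analyte with a mean shift between two clusters.
values = [10.0, 12.0, 11.0, 20.0, 22.0, 21.0]
clusters = [1, 1, 1, 2, 2, 2]

adjusted = adjust_for_cluster(values, clusters)
```

The adjustment removes the difference between cluster means while leaving the within-cluster spread of each analyte unchanged.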
All analyses were implemented in R v3.0.2 [4] using the Devium package
(https://github.com/dgrapov/devium).
Figure 1. Overview of common data normalization approaches applied to the QC samples.
A) Smoothed trend line (LOESS) for the relationship between the relative standard deviation (RSD) and the
logarithm (base 10) of the variable mean for each of the normalization methods, where lower RSD is
preferable; B) number of variables displaying RSDs in the specified intervals for each normalization method;
and C) number of batches displaying median variable RSDs in the specified intervals.
Figure 2. Modified workflow for qc-LOESS normalization.
Figure 3. Comparison of raw and normalized sample relative standard deviations.
A) Smoothed trend line (LOESS) for the relationship between the relative standard deviation (RSD) and the
logarithm (base 10) of the variable mean for each of the normalization methods, where lower RSD is
preferable; B) number of variables displaying RSDs in the specified intervals for each normalization method;
and C) number of batches displaying median variable RSDs in the specified intervals.
Figure 4. PCA scores of raw and normalized samples and QCs, annotated by batch and acquisition order.
PCA sample scores for the first 2 components for A) raw QCs, B) normalized QCs, C) raw samples and
D) normalized samples.
Figure 5. PCA scores of normalized samples before and after covariate adjustment defined by
non-supervised model-based clustering.
PCA sample scores for the first 2 components for A) clusters defined by model-based clustering and
B) cluster-membership adjusted data.
Figure 6. Principal components analysis of QCs, with annotation of acquisition order (sample label).
PCA scores from the first two components displaying QC sample label IDs (duplicated labels are expressed
as X.1). Sample 94, circled in red, was identified as an outlier (no other QC scores with similar dates in
its proximity).
Figure 6. Performance of the cubic splines normalization on QC samples.
References
1. Kohl, S.M., et al., State-of-the art data normalization methods improve NMR-based
metabolomic analysis. Metabolomics, 2012. 8(Suppl 1): p. 146-160.
2. Dunn, W.B., et al., Procedures for large-scale metabolic profiling of serum and
plasma using gas chromatography and liquid chromatography coupled to mass
spectrometry. Nat Protoc, 2011. 6(7): p. 1060-83.
3. Fraley, C. and A.E. Raftery, Model-based Clustering, Discriminant Analysis
and Density Estimation. Journal of the American Statistical Association,
2002. 97: p. 611-631.
4. R Development Core Team, R: A language and environment for statistical
computing. R Foundation for Statistical Computing, 2011. ISBN 3-900051-07-0,
URL http://www.R-project.org/.
