Added introduction vignette.

Jaehyun Joo authored on 04/01/2022 01:43:14
Showing 2 changed files

1 1
new file mode 100644
@@ -0,0 +1,2 @@
1
+*.html
2
+*.R
0 3
new file mode 100644
@@ -0,0 +1,293 @@
1
+---
2
+title: "Introduction to the poplin package"
3
+author: 
4
+  - name: Jaehyun Joo
5
+    affiliation: University of Pennsylvania
6
+    email: [email protected]
7
+package: poplin
8
+output: 
9
+  BiocStyle::html_document:
10
+      toc_float: true
11
+vignette: >
12
+  %\VignetteIndexEntry{Introduction to the poplin package}
13
+  %\VignetteEngine{knitr::rmarkdown}
14
+  %\VignetteEncoding{UTF-8}
15
+---
16
+
17
+```{r, include = FALSE}
18
+knitr::opts_chunk$set(
19
+  collapse = TRUE,
20
+  comment = "#>",
21
+  warning = FALSE,
22
+  error = FALSE,
23
+  message = FALSE
24
+)
25
+```
26
+
27
+# Overview {-}
28
+
29
+The `poplin` package aims to provide data processing utilities (imputation,
30
+normalization, and dimension reduction) for LC/MS metabolomics data and an S4
31
+class container for storing and retrieving the resulting outputs, motivated by
32
+the `r Biocpkg("SingleCellExperiment")`.
33
+
34
+# The poplin class
35
+
36
+The `poplin` class, an extension of the `SummarizedExperiment` class, provides
37
+additional containers for data processing results and dimension-reduced
38
+representations of metabolomics data. `poplin` objects can be created via the
39
+constructor of the same name. Suppose we have a feature intensity matrix
40
+generated by a spectral data processing package, such as `r Biocpkg("xcms")`.
41
+Note that rows represent features and columns represent samples, as in
42
+`SummarizedExperiment` objects.
43
+
44
+```{r constructor}
45
+library(poplin)
46
+nsample <- 20
47
+nfeature <- 100
48
+features <- rlnorm(nsample * nfeature, 10, 1)
49
+fmat <- matrix(features, ncol = nsample)
50
+cn <- paste0("S", seq_len(nsample))
51
+rn <- paste0("F", seq_len(nfeature))
52
+colnames(fmat) <- cn
53
+rownames(fmat) <- rn
54
+pd <- poplin(
55
+    assays = list(raw = fmat),
56
+    colData = DataFrame(sample_id = cn),
57
+    rowData = DataFrame(feature_id = rn)
58
+    )
59
+pd
60
+
61
+```
62
+Alternatively, a `poplin` object can be constructed by coercing an existing
63
+`SummarizedExperiment` object.
64
+
65
+```{r coercion}
66
+se <- SummarizedExperiment(
67
+    assays = list(raw = fmat),
68
+    colData = DataFrame(sample_id = cn),
69
+    rowData = DataFrame(feature_id = rn)
70
+    )
71
+pd <- as(se, "poplin")
72
+
73
+# any operation applied to "se" also works on "pd"
74
+assays(pd)
75
+dim(pd)
76
+head(rowData(pd), 3)
77
+head(colData(pd), 3)
78
+
79
+```
80
+
81
+To illustrate the methods of the class, we will use the `faahko_poplin` data
82
+included in the `poplin` package. This data set was generated from the `faahko3`
83
+object in the `r Biocpkg("faahKO")` package, which consists of 12 samples (6
84
+wild-type and 6 FAAH knockout mice) and 206 LC/MS peaks.
85
+
86
+```{r faahko}
87
+data(faahko_poplin)
88
+faahko_poplin
89
+```
90
+
91
+The `poplin` class has three data containers: `poplinRaw`, `poplinData`, and
92
+`poplinReduced`.
93
+
94
+`poplinRaw` corresponds to `assays` in the `SummarizedExperiment` class and is
95
+intended to store raw intensity data. To retrieve the data in this container,
96
+one can use the `poplin_raw_list()` accessor, which is an alias of the `assays()`
97
+method from the `SummarizedExperiment` class.
98
+  
99
+```{r poplinRaw}
100
+## Get a list of raw intensity data sets.
101
+poplin_raw_list(faahko_poplin) # alias of assays()
102
+
103
+## Get the names of data sets
104
+poplin_raw_names(faahko_poplin) # alias of assayNames()
105
+
106
+## Get individual entries
107
+head(poplin_raw(faahko_poplin, 1), 3) # alias of assay()
108
+```
109
+`poplinData` is intended to store processed data that are typically returned
110
+ by utility functions in the `poplin` package. To retrieve the data in this
111
+ container, one can use the `poplin_data_list()` accessor. Note that each entry
112
+ must have the same dimensions as the object itself, as returned by `dim()`.
113
+    
114
+```{r poplinData}
115
+poplin_data_list(faahko_poplin)
116
+poplin_data_names(faahko_poplin)
117
+head(poplin_data(faahko_poplin, "knn"), 3)
118
+```
119
+
120
+`poplinReduced` is intended to store dimension-reduced data. To retrieve the
121
+data in this container, one can use the `poplin_reduced_list()` accessor. Note
122
+that each entry must have as many rows as the object has samples (`ncol()`).
123
+ 
124
+```{r poplinReduced}
125
+poplin_reduced_list(faahko_poplin)
126
+poplin_reduced_names(faahko_poplin)
127
+head(poplin_reduced(faahko_poplin, "pca"), 3)
128
+```
129
+
130
+In the `poplin` class, each of these accessors has a corresponding setter
131
+method, so users can assign data to individual containers.
132
+
133
+```{r assignment}
134
+## Operations also work on poplinRaw and poplinReduced containers.
135
+knn <- poplin_data(faahko_poplin, "knn")
136
+empty <- faahko_poplin
137
+poplin_data_list(empty) <- list() # replace with empty data
138
+poplin_data_list(empty) <- list(knn1 = knn, knn2 = knn) # add data list
139
+poplin_data(empty, "knn3") <- knn # add data
140
+poplin_data_names(empty)
141
+poplin_data_names(empty) <- c("imp1", "imp2", "imp3") # change names
142
+poplin_data_names(empty)
143
+```
144
+# Missing value imputation
145
+
146
+In the poplin package, commonly used missing value imputation algorithms are
147
+available via the `poplin_impute()` function. These include k-nearest neighbor
148
+(using Gower's distance), random forest, PCA-based methods (e.g., NIPALS PCA,
149
+Bayesian PCA, Probabilistic PCA), and univariate replacement (e.g.,
150
+half-minimum, median, mean). `poplin_impute()` can be applied either to a
151
+`poplin` object or a `matrix`. Please refer to the [Visualization] section to
152
+visualize the missingness of the data.
153
+
154
+```{r imputation}
155
+## missing % of raw intensity matrix
156
+m <- poplin_raw(faahko_poplin, "raw")
157
+100 * sum(is.na(m)) / prod(dim(faahko_poplin))
158
+
159
+## apply half-minimum imputation to a poplin object
160
+res <- poplin_impute(x = faahko_poplin, xin = "raw", xout = "halfmin",
161
+                     method = "univariate", type = "halfmin")
162
+                     
163
+## apply random forest imputation to a matrix
164
+poplin_data(res, "rf") <- poplin_impute(x = m, method = "randomforest")
165
+
166
+poplin_data_list(res)
167
+
168
+```
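
To make the univariate option more concrete, here is a minimal base R sketch of half-minimum replacement on a toy matrix. The helper `halfmin_impute()` and the toy data are hypothetical and are not part of `poplin`; the sketch assumes the minimum is taken per feature (row), which may differ from what `poplin_impute()` does internally.

```r
## Toy feature-by-sample intensity matrix with a few missing values
## (illustrative data only).
set.seed(1)
m_toy <- matrix(rlnorm(20, meanlog = 10, sdlog = 1), nrow = 5,
                dimnames = list(paste0("F", 1:5), paste0("S", 1:4)))
m_toy[cbind(c(1, 3, 5), c(2, 4, 1))] <- NA

## Hypothetical helper: replace each missing value with half of the
## observed minimum of its feature (row). Assumes per-feature minima.
halfmin_impute <- function(x) {
  half_min <- apply(x, 1, min, na.rm = TRUE) / 2
  idx <- which(is.na(x), arr.ind = TRUE)  # row/col positions of the NAs
  x[idx] <- half_min[idx[, "row"]]
  x
}

imputed <- halfmin_impute(m_toy)

## missing % before imputation, and a check that no NA remains
100 * sum(is.na(m_toy)) / length(m_toy)
anyNA(imputed)
```

Observed values are left untouched; only the `NA` positions are filled.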
169
+
170
+# Normalization
171
+
172
+In metabolomics analysis, the data may need to be normalized to reduce unwanted
173
+sample-to-sample variability. The `poplin_normalize()` function provides several
174
+data-driven normalization approaches, such as probabilistic quotient
175
+normalization (PQN), cyclic LOESS normalization, variance stabilizing
176
+normalization (generalized log transformation), sum normalization, median
177
+normalization, and feature-based scaling (e.g., auto scaling, Pareto scaling,
178
+and level scaling).
179
+
180
+```{r normalization}
181
+## Apply sum normalization to a poplin object
182
+res <- poplin_normalize(x = faahko_poplin, method = "sum", 
183
+                        xin = "knn", xout = "knn_sum")
184
+
185
+## Apply VSN normalization to a matrix
186
+m <- poplin_data(faahko_poplin, "knn")
187
+poplin_data(res, "knn_vsn") <- poplin_normalize(x = m, method = "vsn")
188
+
189
+poplin_data_list(res)
190
+```
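
As a rough illustration of two of the approaches named above, the following base R sketch implements sum normalization and a simplified PQN on a toy matrix. The helpers `sum_normalize()` and `pqn_normalize()` are hypothetical names, not `poplin` functions. The PQN sketch assumes the feature-wise median across all samples as the reference spectrum and omits the preliminary total-intensity scaling that is often applied first, so it may differ from `poplin_normalize()`.

```r
## Toy feature-by-sample intensity matrix (illustrative data only).
set.seed(1)
x <- matrix(rlnorm(24, meanlog = 10, sdlog = 1), nrow = 6,
            dimnames = list(paste0("F", 1:6), paste0("S", 1:4)))

## Sum normalization: divide each sample (column) by its total intensity.
sum_normalize <- function(x) {
  sweep(x, 2, colSums(x, na.rm = TRUE), "/")
}

## Simplified PQN: divide each sample by the median of its feature-wise
## quotients against a reference spectrum (assumed: feature-wise median).
pqn_normalize <- function(x) {
  ref <- apply(x, 1, median, na.rm = TRUE)
  quotients <- x / ref  # recycling divides row i by ref[i]
  dilution <- apply(quotients, 2, median, na.rm = TRUE)
  sweep(x, 2, dilution, "/")
}

x_sum <- sum_normalize(x)
x_pqn <- pqn_normalize(x)

colSums(x_sum)  # each sample now sums to 1
```

Both helpers rescale whole samples by a single factor, so the relative intensities of features within a sample are preserved.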
191
+
192
+# Dimension reduction
193
+
194
+In metabolomics, dimension reduction methods are often used for modeling and
195
+visualization. Currently, the poplin package supports three dimension-reduction
196
+methods: principal component analysis (PCA), t-distributed stochastic neighbor
197
+embedding (t-SNE), and partial least squares-discriminant analysis (PLS-DA). The
198
+`poplin_reduce()` function performs dimension reduction of the data and stores
199
+the result in the `poplinReduced` container.
200
+
201
+```{r dimension reduction}
202
+empty <- faahko_poplin
203
+poplin_reduced_list(empty) <- list()
204
+poplin_reduced_names(empty)
205
+
206
+## Apply PCA to a poplin object
207
+res <- poplin_reduce(x = empty, xin = "knn_cyclic", xout = "pca", 
208
+                     method = "pca", ncomp = 3)
209
+
210
+## Apply t-SNE to a matrix
211
+poplin_reduced(res, "tsne") <- poplin_reduce(m, method = "tsne", 
212
+                                             ncomp = 3, perplexity = 3)
213
+
214
+## Apply PLS-DA to a poplin object
215
+y <- factor(colData(res)$sample_group, levels = c("WT", "KO"))
216
+res <- poplin_reduce(x = res, xin = "knn_cyclic", xout = "plsda", 
217
+                     method = "plsda", y = y, ncomp = 3)
218
+
219
+```
220
+
221
+The `poplin_reduce()` function returns the result containing custom attributes
222
+that are used for summary and visualization. Please refer to [Visualization] for
223
+details.
224
+
225
+```{r dimension reduction summary}
226
+summary(poplin_reduced(res, "pca"))
227
+summary(poplin_reduced(res, "tsne"))
228
+summary(poplin_reduced(res, "plsda"))
229
+```
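
For readers who want to see what a PCA score matrix looks like outside of `poplin`, here is a minimal sketch using `stats::prcomp()` on simulated data. It assumes centering without scaling; the preprocessing used by `poplin_reduce()` may differ, and the attributes it attaches are not reproduced here.

```r
## Simulated feature-by-sample matrix (100 features, 12 samples).
set.seed(1)
x <- matrix(rnorm(100 * 12), nrow = 100,
            dimnames = list(paste0("F", 1:100), paste0("S", 1:12)))

## prcomp() expects samples in rows, so transpose the input.
## Centering only is an assumption made for this sketch.
pca <- prcomp(t(x), center = TRUE, scale. = FALSE)

## Keep the first three components: one row per sample.
scores <- pca$x[, 1:3]
dim(scores)

## Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:3], 3)
```

The score matrix has one row per sample, which matches the requirement that entries of `poplinReduced` have as many rows as the object has columns.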
230
+
231
+# Visualization
232
+
233
+The poplin package provides common visualizations for metabolomics data based on
234
+`r CRANpkg("ggplot2")`. The plot functions in the poplin package can be applied
235
+either to a `poplin` object or a `matrix`.
236
+
237
+## poplin_naplot
238
+
239
+The `poplin_naplot()` function helps visually inspect the missingness of the data.
240
+
241
+```{r naplot, fig.width = 8, fig.height = 8}
242
+poplin_naplot(x = faahko_poplin, xin = "raw")
243
+```
244
+
245
+## poplin_corplot
246
+
247
+The `poplin_corplot()` function visualizes correlations between samples (or
248
+features) to quickly check the grouping structure in the data.
249
+
250
+```{r corplot, fig.width = 8, fig.height = 8}
251
+poplin_corplot(x = faahko_poplin, xin = "knn_cyclic")
252
+```
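
The quantity behind such a plot can be sketched in base R with `stats::cor()`. This is only an illustration on simulated data; it assumes Pearson correlation on log2 intensities, which is not necessarily the default of `poplin_corplot()`.

```r
## Simulated feature-by-sample intensity matrix (50 features, 6 samples).
set.seed(1)
x <- matrix(rlnorm(50 * 6, meanlog = 10, sdlog = 1), nrow = 50,
            dimnames = list(paste0("F", 1:50), paste0("S", 1:6)))

## Sample-by-sample correlations: cor() correlates columns, and samples
## are already in columns. Log2 and Pearson are assumptions of this sketch.
sample_cor <- cor(log2(x), method = "pearson",
                  use = "pairwise.complete.obs")

## Feature-by-feature correlations: transpose so features are columns.
feature_cor <- cor(t(log2(x)), method = "pearson",
                   use = "pairwise.complete.obs")

dim(sample_cor)
dim(feature_cor)
```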
253
+
254
+## poplin_boxplot
255
+
256
+The `poplin_boxplot()` function produces a box-and-whisker plot of the feature
257
+intensity values across the samples.
258
+
259
+```{r poplin_boxplot, fig.wide = TRUE}
260
+group <- colData(faahko_poplin)$sample_group
261
+
262
+## distribution of intensities before normalization
263
+poplin_boxplot(faahko_poplin, xin = "knn", group = group, 
264
+               pre_log2 = TRUE)
265
+
266
+## distribution of intensities after cyclic LOESS normalization
267
+poplin_boxplot(faahko_poplin, xin = "knn_cyclic", group = group)
268
+
269
+```
270
+
271
+## poplin_scoreplot
272
+
273
+The `poplin_scoreplot()` function visualizes the data in a lower-dimensional
274
+space using the `poplin_reduce()` output.
275
+
276
+```{r poplin_scoreplot}
277
+group <- colData(faahko_poplin)$sample_group
278
+
279
+## PCA output
280
+poplin_scoreplot(faahko_poplin, xin = "pca", group = group, 
281
+                 ellipse = TRUE)
282
+
283
+## PLS-DA output 
284
+poplin_scoreplot(faahko_poplin, xin = "plsda", ellipse = TRUE,
285
+                 label = TRUE)
286
+
287
+```
288
+
289
+# Session info {-}
290
+
291
+```{r session info}
292
+sessionInfo()
293
+```