new file mode 100644 |
... | ... |
@@ -0,0 +1,293 @@ |
1 |
+--- |
|
2 |
+title: "Introduction to the poplin package" |
|
3 |
+author: |
|
4 |
+ - name: Jaehyun Joo |
|
5 |
+ affiliation: University of Pennsylvania |
|
6 |
+ email: [email protected] |
|
7 |
+package: poplin |
|
8 |
+output: |
|
9 |
+ BiocStyle::html_document: |
|
10 |
+ toc_float: true |
|
11 |
+vignette: > |
|
12 |
+ %\VignetteIndexEntry{Introduction to the poplin package} |
|
13 |
+ %\VignetteEngine{knitr::rmarkdown} |
|
14 |
+ %\VignetteEncoding{UTF-8} |
|
15 |
+--- |
|
16 |
+ |
|
17 |
+```{r, include = FALSE} |
|
18 |
+knitr::opts_chunk$set( |
|
19 |
+ collapse = TRUE, |
|
20 |
+ comment = "#>", |
|
21 |
+ warning = FALSE, |
|
22 |
+ error = FALSE, |
|
23 |
+ message = FALSE |
|
24 |
+) |
|
25 |
+``` |
|
26 |
+ |
|
27 |
+# Overview {-} |
|
28 |
+ |
|
29 |
+The `poplin` package aims to provide data processing utilities (imputation, |
|
30 |
+normalization, and dimension reduction) for LC/MS metabolomics data and an S4 |
|
31 |
+class container for storing and retrieving the resulting outputs, motivated by |
|
32 |
+the `r Biocpkg("SingleCellExperiment")` package. |
|
33 |
+ |
|
34 |
+# The poplin class |
|
35 |
+ |
|
36 |
+The `poplin` class, an extension of the `SummarizedExperiment` class, provides |
|
37 |
+additional containers for data processing results and dimension-reduced |
|
38 |
+representations of metabolomics data. `poplin` objects can be created via the |
|
39 |
+constructor of the same name. Suppose that we have a feature intensity matrix |
|
40 |
+generated by a spectral data processing package, such as `r Biocpkg("xcms")`. |
|
41 |
+Note that rows represent features and columns represent samples, as in |
|
42 |
+`SummarizedExperiment` objects. |
|
43 |
+ |
|
44 |
+```{r constructor} |
|
45 |
+library(poplin) |
|
46 |
+nsample <- 20 |
|
47 |
+nfeature <- 100 |
|
48 |
+features <- rlnorm(nsample * nfeature, 10, 1) |
|
49 |
+fmat <- matrix(features, ncol = nsample) |
|
50 |
+cn <- paste0("S", seq_len(nsample)) |
|
51 |
+rn <- paste0("F", seq_len(nfeature)) |
|
52 |
+colnames(fmat) <- cn |
|
53 |
+rownames(fmat) <- rn |
|
54 |
+pd <- poplin( |
|
55 |
+ assays = list(raw = fmat), |
|
56 |
+ colData = DataFrame(sample_id = cn), |
|
57 |
+ rowData = DataFrame(feature_id = rn) |
|
58 |
+ ) |
|
59 |
+pd |
|
60 |
+ |
|
61 |
+``` |
|
62 |
+Alternatively, a `poplin` object can be constructed by coercing an existing |
|
63 |
+`SummarizedExperiment` object. |
|
64 |
+ |
|
65 |
+```{r coercion} |
|
66 |
+se <- SummarizedExperiment( |
|
67 |
+ assays = list(raw = fmat), |
|
68 |
+ colData = DataFrame(sample_id = cn), |
|
69 |
+ rowData = DataFrame(feature_id = rn) |
|
70 |
+ ) |
|
71 |
+pd <- as(se, "poplin") |
|
72 |
+ |
|
73 |
+# any operation applied to "se" also works on "pd" |
|
74 |
+assays(pd) |
|
75 |
+dim(pd) |
|
76 |
+head(rowData(pd), 3) |
|
77 |
+head(colData(pd), 3) |
|
78 |
+ |
|
79 |
+``` |
|
80 |
+ |
|
81 |
+To illustrate the methods of the class, we will use the `faahko_poplin` data |
|
82 |
+included in the `poplin` package. This data set was generated from the `faahko3` |
|
83 |
+object in the `r Biocpkg("faahKO")` package, which consists of 12 samples (6 |
|
84 |
+wild-type and 6 FAAH knockout mice) and 206 LC/MS peaks. |
|
85 |
+ |
|
86 |
+```{r faahko} |
|
87 |
+data(faahko_poplin) |
|
88 |
+faahko_poplin |
|
89 |
+``` |
|
90 |
+ |
|
91 |
+The `poplin` class has three data containers: `poplinRaw`, `poplinData`, and |
|
92 |
+`poplinReduced`. |
|
93 |
+ |
|
94 |
+`poplinRaw` corresponds to `assays` in the `SummarizedExperiment` class and is |
|
95 |
+intended to store raw intensity data. To retrieve the data in this container, |
|
96 |
+one can use the `poplin_raw_list()` accessor. This is an alias of the `assays()` |
|
97 |
+method from the `SummarizedExperiment` class. |
|
98 |
+ |
|
99 |
+```{r poplinRaw} |
|
100 |
+## Get a list of raw intensity data sets. |
|
101 |
+poplin_raw_list(faahko_poplin) # alias of assays() |
|
102 |
+ |
|
103 |
+## Get the names of data sets |
|
104 |
+poplin_raw_names(faahko_poplin) # alias of assayNames() |
|
105 |
+ |
|
106 |
+## Get individual entries |
|
107 |
+head(poplin_raw(faahko_poplin, 1), 3) # alias of assay() |
|
108 |
+``` |
|
109 |
+`poplinData` is intended to store processed data that are typically returned |
|
110 |
+ by utility functions in the `poplin` package. To retrieve the data in this |
|
111 |
+ container, one can use the `poplin_data_list()` accessor. Note that each entry |
|
112 |
+ must have the same dimensions as the `poplin` object, as returned by `dim()`. |
|
113 |
+ |
|
114 |
+```{r poplinData} |
|
115 |
+poplin_data_list(faahko_poplin) |
|
116 |
+poplin_data_names(faahko_poplin) |
|
117 |
+head(poplin_data(faahko_poplin, "knn"), 3) |
|
118 |
+``` |
|
119 |
+ |
|
120 |
+`poplinReduced` is intended to store dimension-reduced data. To retrieve the |
|
121 |
+data in this container, one can use the `poplin_reduced_list()` accessor. Note that |
|
122 |
+each entry must have one row per sample, that is, as many rows as `ncol()` returns. |
|
123 |
+ |
|
124 |
+```{r poplinReduced} |
|
125 |
+poplin_reduced_list(faahko_poplin) |
|
126 |
+poplin_reduced_names(faahko_poplin) |
|
127 |
+head(poplin_reduced(faahko_poplin, "pca"), 3) |
|
128 |
+``` |
|
129 |
+ |
|
130 |
+In the `poplin` class, each of these accessors has a matching setter so that users |
|
131 |
+can assign data to individual containers. |
|
132 |
+ |
|
133 |
+```{r assignment} |
|
134 |
+## The same operations also work on the poplinRaw and poplinReduced containers. |
|
135 |
+knn <- poplin_data(faahko_poplin, "knn") |
|
136 |
+empty <- faahko_poplin |
|
137 |
+poplin_data_list(empty) <- list() # replace with empty data |
|
138 |
+poplin_data_list(empty) <- list(knn1 = knn, knn2 = knn) # add data list |
|
139 |
+poplin_data(empty, "knn3") <- knn # add data |
|
140 |
+poplin_data_names(empty) |
|
141 |
+poplin_data_names(empty) <- c("imp1", "imp2", "imp3") # change names |
|
142 |
+poplin_data_names(empty) |
|
143 |
+``` |
|
144 |
+# Missing value imputation |
|
145 |
+ |
|
146 |
+In the poplin package, commonly used missing value imputation algorithms are |
|
147 |
+available via the `poplin_impute()` function, including k-nearest neighbor |
|
148 |
+(using Gower's distance), random forest, PCA-based methods (e.g., NIPALS PCA, |
|
149 |
+Bayesian PCA, Probabilistic PCA), and univariate replacement (e.g., |
|
150 |
+half-minimum, median, mean). `poplin_impute()` can be applied either to a |
|
151 |
+`poplin` object or a `matrix`. Please refer to the [Visualization] section to |
|
152 |
+visualize the missingness of the data. |
|
153 |
+ |
|
154 |
+```{r imputation} |
|
155 |
+## missing % of raw intensity matrix |
|
156 |
+m <- poplin_raw(faahko_poplin, "raw") |
|
157 |
+100 * sum(is.na(m)) / prod(dim(faahko_poplin)) |
|
158 |
+ |
|
159 |
+## apply half-minimum imputation to a poplin object |
|
160 |
+res <- poplin_impute(x = faahko_poplin, xin = "raw", xout = "halfmin", |
|
161 |
+ method = "univariate", type = "halfmin") |
|
162 |
+ |
|
163 |
+## apply random forest imputation to a matrix |
|
164 |
+poplin_data(res, "rf") <- poplin_impute(x = m, method = "randomforest") |
|
165 |
+ |
|
166 |
+poplin_data_list(res) |
|
167 |
+ |
|
168 |
+``` |
|
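For readers unfamiliar with univariate replacement, the following base R sketch shows what a half-minimum rule computes on a toy matrix. It is only an illustration and not the code behind `poplin_impute()`: the helper name `halfmin_impute()` is hypothetical, and it assumes the minimum is taken per feature (row), which the vignette does not state.

```r
## Illustrative base R only; NOT the poplin implementation.
set.seed(1)

## toy intensity matrix: 5 features (rows) x 4 samples (columns)
m <- matrix(rlnorm(20, meanlog = 10, sdlog = 1), nrow = 5,
            dimnames = list(paste0("F", 1:5), paste0("S", 1:4)))
m[1, 2] <- NA
m[3, c(1, 4)] <- NA

## percentage of missing values in the matrix
pct_missing <- 100 * mean(is.na(m))

## hypothetical helper: replace each feature's NAs with half of its
## smallest observed value (assumption: the minimum is per feature)
halfmin_impute <- function(x) {
  t(apply(x, 1, function(v) {
    v[is.na(v)] <- min(v, na.rm = TRUE) / 2
    v
  }))
}

imp <- halfmin_impute(m)
anyNA(imp)
```

Because the replacement value is derived from each row independently, the imputed matrix keeps the same dimensions and dimnames as the input.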
169 |
+ |
|
170 |
+# Normalization |
|
171 |
+ |
|
172 |
+In metabolomics analysis, the data may need to be normalized to reduce unwanted |
|
173 |
+sample-to-sample variability. The `poplin_normalize()` function provides several |
|
174 |
+data-driven normalization approaches, such as probabilistic quotient |
|
175 |
+normalization (PQN), cyclic LOESS normalization, variance stabilizing |
|
176 |
+normalization (generalized log transformation), sum normalization, median |
|
177 |
+normalization, and feature-based scaling (e.g., auto scaling, Pareto scaling, and |
|
178 |
+level scaling). |
|
179 |
+ |
|
180 |
+```{r normalization} |
|
181 |
+## Apply sum normalization to a poplin object |
|
182 |
+res <- poplin_normalize(x = faahko_poplin, method = "sum", |
|
183 |
+ xin = "knn", xout = "knn_sum") |
|
184 |
+ |
|
185 |
+## Apply VSN normalization to a matrix |
|
186 |
+m <- poplin_data(faahko_poplin, "knn") |
|
187 |
+poplin_data(res, "knn_vsn") <- poplin_normalize(x = m, method = "vsn") |
|
188 |
+ |
|
189 |
+poplin_data_list(res) |
|
190 |
+``` |
|
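To make the difference between sum normalization and PQN concrete, here is a minimal base R sketch on a toy matrix. It follows the textbook definitions (scale each sample to unit total, then divide by the median quotient against a reference spectrum) and is not taken from `poplin_normalize()`. The choice of the feature-wise median as the reference is an assumption.

```r
## Illustrative base R only; NOT the poplin implementation.
set.seed(1)

## toy intensity matrix: 6 features (rows) x 4 samples (columns)
m <- matrix(rlnorm(24, meanlog = 10, sdlog = 1), nrow = 6)

## sum normalization: scale each sample so that its intensities sum to 1
sum_norm <- sweep(m, 2, colSums(m), "/")

## PQN, step 1: reference spectrum (assumption: feature-wise median)
ref <- apply(sum_norm, 1, median)

## PQN, step 2: quotient of every feature against the reference
quot <- sweep(sum_norm, 1, ref, "/")

## PQN, step 3: the median quotient of a sample is its dilution factor
dilution <- apply(quot, 2, median)

## PQN, step 4: divide each sample by its dilution factor
pqn <- sweep(sum_norm, 2, dilution, "/")
```

After the last step, the median quotient of every sample against the reference is 1, which is the defining property of PQN.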
191 |
+ |
|
192 |
+# Dimension reduction |
|
193 |
+ |
|
194 |
+In metabolomics, dimension reduction methods are often used for modeling and |
|
195 |
+visualization. Currently, the poplin package supports three dimension-reduction |
|
196 |
+methods: principal component analysis (PCA), t-distributed stochastic neighbor |
|
197 |
+embedding (t-SNE), and partial least squares-discriminant analysis (PLS-DA). The |
|
198 |
+`poplin_reduce()` function performs dimension reduction of the data and stores |
|
199 |
+the result in the `poplinReduced` container. |
|
200 |
+ |
|
201 |
+```{r dimension reduction} |
|
202 |
+empty <- faahko_poplin |
|
203 |
+poplin_reduced_list(empty) <- list() |
|
204 |
+poplin_reduced_names(empty) |
|
205 |
+ |
|
206 |
+## Apply PCA to a poplin object |
|
207 |
+res <- poplin_reduce(x = empty, xin = "knn_cyclic", xout = "pca", |
|
208 |
+ method = "pca", ncomp = 3) |
|
209 |
+ |
|
210 |
+## Apply t-SNE to a matrix |
|
211 |
+poplin_reduced(res, "tsne") <- poplin_reduce(m, method = "tsne", |
|
212 |
+ ncomp = 3, perplexity = 3) |
|
213 |
+ |
|
214 |
+## Apply PLS-DA to a poplin object |
|
215 |
+y <- factor(colData(res)$sample_group, levels = c("WT", "KO")) |
|
216 |
+res <- poplin_reduce(x = res, xin = "knn_cyclic", xout = "plsda", |
|
217 |
+ method = "plsda", y = y, ncomp = 3) |
|
218 |
+ |
|
219 |
+``` |
|
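The requirement that reduced data have one row per sample can be seen with plain `prcomp()` from the stats package. This sketch is not how `poplin_reduce()` is implemented. It only shows, under the assumption of a features-by-samples matrix, why the matrix is transposed before PCA and what shape the scores take.

```r
## Illustrative base R only; NOT the poplin implementation.
set.seed(1)

## toy matrix: 100 features (rows) x 12 samples (columns)
m <- matrix(rnorm(100 * 12), nrow = 100)

## prcomp() expects observations in rows, so transpose to samples x features
pca <- prcomp(t(m), center = TRUE, scale. = TRUE)

## keep the first three components: one row per sample
scores <- pca$x[, 1:3]
dim(scores)

## proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

The score matrix has `ncol(m)` rows, matching the constraint described for the `poplinReduced` container.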
220 |
+ |
|
221 |
+The `poplin_reduce()` function returns a result with custom attributes |
|
222 |
+that are used for summary and visualization. Please refer to [Visualization] for |
|
223 |
+details. |
|
224 |
+ |
|
225 |
+```{r dimension reduction summary} |
|
226 |
+summary(poplin_reduced(res, "pca")) |
|
227 |
+summary(poplin_reduced(res, "tsne")) |
|
228 |
+summary(poplin_reduced(res, "plsda")) |
|
229 |
+``` |
|
230 |
+ |
|
231 |
+# Visualization |
|
232 |
+ |
|
233 |
+The poplin package provides common visualizations for metabolomics data based on |
|
234 |
+`r CRANpkg("ggplot2")`. The plot functions in the poplin package can be applied |
|
235 |
+either to a `poplin` object or a `matrix`. |
|
236 |
+ |
|
237 |
+## poplin_naplot |
|
238 |
+ |
|
239 |
+The `poplin_naplot()` function helps to visually inspect the missingness of the data. |
|
240 |
+ |
|
241 |
+```{r naplot, fig.width = 8, fig.height = 8} |
|
242 |
+poplin_naplot(x = faahko_poplin, xin = "raw") |
|
243 |
+``` |
|
244 |
+ |
|
245 |
+## poplin_corplot |
|
246 |
+ |
|
247 |
+The `poplin_corplot()` function visualizes correlations between samples (or features) to |
|
248 |
+quickly check the grouping structure in the data. |
|
249 |
+ |
|
250 |
+```{r corplot, fig.width = 8, fig.height = 8} |
|
251 |
+poplin_corplot(x = faahko_poplin, xin = "knn_cyclic") |
|
252 |
+``` |
|
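As a rough idea of the quantity behind a sample (or feature) correlation plot, the base R sketch below computes both correlation matrices for a toy data set. Whether `poplin_corplot()` uses Pearson correlation or log-transforms the data first is not stated in the vignette, so both choices here are assumptions.

```r
## Illustrative base R only; NOT the poplin implementation.
set.seed(1)

## toy intensity matrix: 50 features (rows) x 6 samples (columns)
m <- matrix(rlnorm(50 * 6, meanlog = 10, sdlog = 1), nrow = 50,
            dimnames = list(NULL, paste0("S", 1:6)))

## sample-to-sample correlations: cor() correlates the columns
sample_cor <- cor(log2(m), method = "pearson")

## feature-to-feature correlations: transpose so features become columns
feature_cor <- cor(t(log2(m)), method = "pearson")

dim(sample_cor)
dim(feature_cor)
```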
253 |
+ |
|
254 |
+## poplin_boxplot |
|
255 |
+ |
|
256 |
+The `poplin_boxplot()` function produces a box-and-whisker plot of the feature intensity |
|
257 |
+values across the samples. |
|
258 |
+ |
|
259 |
+```{r poplin_boxplot, fig.wide = TRUE} |
|
260 |
+group <- colData(faahko_poplin)$sample_group |
|
261 |
+ |
|
262 |
+## distribution of intensities before normalization |
|
263 |
+poplin_boxplot(faahko_poplin, xin = "knn", group = group, |
|
264 |
+ pre_log2 = TRUE) |
|
265 |
+ |
|
266 |
+## distribution of intensities after cyclic LOESS normalization |
|
267 |
+poplin_boxplot(faahko_poplin, xin = "knn_cyclic", group = group) |
|
268 |
+ |
|
269 |
+``` |
|
270 |
+ |
|
271 |
+## poplin_scoreplot |
|
272 |
+ |
|
273 |
+The `poplin_scoreplot()` function visualizes the data in a lower-dimensional |
|
274 |
+space using the `poplin_reduce()` output. |
|
275 |
+ |
|
276 |
+```{r poplin_scoreplot} |
|
277 |
+group <- colData(faahko_poplin)$sample_group |
|
278 |
+ |
|
279 |
+## PCA output |
|
280 |
+poplin_scoreplot(faahko_poplin, xin = "pca", group = group, |
|
281 |
+ ellipse = TRUE) |
|
282 |
+ |
|
283 |
+## PLS-DA output |
|
284 |
+poplin_scoreplot(faahko_poplin, xin = "plsda", ellipse = TRUE, |
|
285 |
+ label = TRUE) |
|
286 |
+ |
|
287 |
+``` |
|
288 |
+ |
|
289 |
+# Session info {-} |
|
290 |
+ |
|
291 |
+```{r session info} |
|
292 |
+sessionInfo() |
|
293 |
+``` |