Bioconductor Code: qmtools

History View file @ 31501c1

                     new file mode 100644
@@ -0,0 +1,293 @@
                     +---
                     +title: "Introduction to the poplin package"
                     +author:
                     +  - name: Jaehyun Joo
                     +    affiliation: University of Pennsylvania
                     +    email: [email protected]
                     +package: poplin
                     +output:
                     +  BiocStyle::html_document:
                     +      toc_float: true
                     +vignette: >
                     +  %\VignetteIndexEntry{Introduction to the poplin package}
                     +  %\VignetteEngine{knitr::rmarkdown}
                     +  %\VignetteEncoding{UTF-8}
                     +---
+                    +
                     +```{r, include = FALSE}
                     +knitr::opts_chunk$set(
                     +  collapse = TRUE,
                     +  comment = "#>",
                     +  warning = FALSE,
                     +  error = FALSE,
                     +  message = FALSE
                     +)
                     +```
+                    +
                     +# Overview {-}
+                    +
                     +The `poplin` package provides data processing utilities (imputation,
                     +normalization, and dimension reduction) for LC/MS metabolomics data, along with
                     +an S4 class container for storing and retrieving the resulting outputs. The
                     +design is motivated by the `r Biocpkg("SingleCellExperiment")` package.
+                    +
                     +# The poplin class
+                    +
                     +The `poplin` class, an extension of the `SummarizedExperiment` class, provides
                     +additional containers for data processing results and dimension-reduced
                     +representations of metabolomics data. `poplin` objects can be created via the
                     +constructor of the same name. Suppose we have a feature intensity matrix
                     +generated by a spectral data processing package, such as `r Biocpkg("xcms")`.
                     +As in `SummarizedExperiment` objects, rows represent features and columns
                     +represent samples.
+                    +
                     +```{r constructor}
                     +library(poplin)
                     +nsample <- 20
                     +nfeature <- 100
                     +features <- rlnorm(nsample * nfeature, 10, 1)
                     +fmat <- matrix(features, ncol = nsample)
                     +cn <- paste0("S", seq_len(nsample))
                     +rn <- paste0("F", seq_len(nfeature))
                     +colnames(fmat) <- cn
                     +rownames(fmat) <- rn
                     +pd <- poplin(
                     +    assays = list(raw = fmat),
                     +    colData = DataFrame(sample_id = cn),
                     +    rowData = DataFrame(feature_id = rn)
                     +    )
                     +pd
+                    +
                     +```
                     +
                     +Alternatively, a `poplin` object can be constructed by coercing an existing
                     +`SummarizedExperiment` object.
+                    +
                     +```{r coercion}
                     +se <- SummarizedExperiment(
                     +    assays = list(raw = fmat),
                     +    colData = DataFrame(sample_id = cn),
                     +    rowData = DataFrame(feature_id = rn)
                     +    )
                     +pd <- as(se, "poplin")
+                    +
                     +# any operation applied to "se" also works on "pd"
                     +assays(pd)
                     +dim(pd)
                     +head(rowData(pd), 3)
                     +head(colData(pd), 3)
+                    +
                     +```
+                    +
                     +To illustrate the methods of the class, we will use the `faahko_poplin` data
                     +included in the `poplin` package. This data set was generated from the `faahko3`
                     +object in the `r Biocpkg("faahKO")` package, which consists of 12 samples (6
                     +wild-type and 6 FAAH knockout mice) and 206 LC/MS peaks.
+                    +
                     +```{r faahko}
                     +data(faahko_poplin)
                     +faahko_poplin
                     +```
+                    +
                     +The `poplin` class has three data containers: `poplinRaw`, `poplinData`, and
                     +`poplinReduced`.
+                    +
                     +`poplinRaw` corresponds to `assays` in the `SummarizedExperiment` class and is
                     +intended to store raw intensity data. To retrieve the data in this container,
                     +use the `poplin_raw_list()` accessor, which is an alias of the `assays()`
                     +method from the `SummarizedExperiment` class.
+                    +
                     +```{r poplinRaw}
                     +## Get a list of raw intensity data sets.
                     +poplin_raw_list(faahko_poplin) # alias of assays()
+                    +
                     +## Get the names of data sets
                     +poplin_raw_names(faahko_poplin) # alias of assayNames()
+                    +
                     +## Get individual entries
                     +head(poplin_raw(faahko_poplin, 1), 3) # alias of assay()
                     +```
                     +
                     +`poplinData` is intended to store processed data that are typically returned
                     +by utility functions in the `poplin` package. To retrieve the data in this
                     +container, use the `poplin_data_list()` accessor. Note that each entry must
                     +have the same dimensions as the `poplin` object itself, as returned by `dim()`.
+                    +
                     +```{r poplinData}
                     +poplin_data_list(faahko_poplin)
                     +poplin_data_names(faahko_poplin)
                     +head(poplin_data(faahko_poplin, "knn"), 3)
                     +```
+                    +
                     +`poplinReduced` is intended to store dimension-reduced data. To retrieve the
                     +data in this container, use the `poplin_reduced_list()` accessor. Note that
                     +each entry must have as many rows as the `poplin` object has samples, as
                     +returned by `ncol()`.
+                    +
                     +```{r poplinReduced}
                     +poplin_reduced_list(faahko_poplin)
                     +poplin_reduced_names(faahko_poplin)
                     +head(poplin_reduced(faahko_poplin, "pca"), 3)
                     +```
+                    +
                     +Each of these accessors has a corresponding setter method, so users can assign
                     +data to individual containers.
+                    +
                     +```{r assignment}
                     +## Operations also work on poplinRaw and poplinReduced containers.
                     +knn <- poplin_data(faahko_poplin, "knn")
                     +empty <- faahko_poplin
                     +poplin_data_list(empty) <- list() # replace with empty data
                     +poplin_data_list(empty) <- list(knn1 = knn, knn2 = knn) # add data list
                     +poplin_data(empty, "knn3") <- knn # add data
                     +poplin_data_names(empty)
                     +poplin_data_names(empty) <- c("imp1", "imp2", "imp3") # change names
                     +poplin_data_names(empty)
                     +```
                     +
                     +# Missing value imputation
+                    +
                     +The `poplin_impute()` function provides commonly used missing value imputation
                     +algorithms, including k-nearest neighbor (using Gower's distance), random
                     +forest, PCA-based methods (e.g., NIPALS PCA, Bayesian PCA, probabilistic PCA),
                     +and univariate replacement (e.g., half-minimum, median, mean). `poplin_impute()`
                     +can be applied to either a `poplin` object or a `matrix`. Please refer to the
                     +[Visualization] section for a way to visualize the missingness of the data.
+                    +
                     +```{r imputation}
                     +## missing % of raw intensity matrix
                     +m <- poplin_raw(faahko_poplin, "raw")
                     +100 * sum(is.na(m)) / prod(dim(faahko_poplin))
+                    +
                     +## apply half-minimum imputation to a poplin object
                     +res <- poplin_impute(x = faahko_poplin, xin = "raw", xout = "halfmin",
                     +                     method = "univariate", type = "halfmin")
+                    +
                     +## apply random forest imputation to a matrix
                     +poplin_data(res, "rf") <- poplin_impute(x = m, method = "randomforest")
+                    +
                     +poplin_data_list(res)
+                    +
                     +```
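To make the univariate option concrete, here is a minimal base R sketch of half-minimum imputation. It is not the code behind `poplin_impute()`: the helper `halfmin_impute()` and the toy matrix are hypothetical, and the sketch assumes the half-minimum is computed per feature (row).

```r
# Hypothetical helper (not part of poplin): replace each feature's missing
# values with half of that feature's smallest observed intensity.
halfmin_impute <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x))
  for (i in seq_len(nrow(x))) {
    miss <- is.na(x[i, ])
    # rows that are entirely missing are left untouched
    if (any(miss) && !all(miss)) {
      x[i, miss] <- min(x[i, !miss]) / 2
    }
  }
  x
}

# Toy feature-by-sample matrix with two missing values
toy <- matrix(c(10, NA, 30,
                 4,  8, NA),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("F1", "F2"), c("S1", "S2", "S3")))

# Percentage of missing values before imputation
pct_missing <- 100 * mean(is.na(toy))

imputed <- halfmin_impute(toy)
imputed
```

Half-minimum replacement reflects the assumption that values are missing because they fall below the detection limit, which is why the fill-in value sits below every observed intensity of the feature.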
+                    +
                     +# Normalization
+                    +
                     +In metabolomics analysis, the data may need to be normalized to reduce unwanted
                     +sample-to-sample variability. The `poplin_normalize()` function provides several
                     +data-driven normalization approaches, including probabilistic quotient
                     +normalization (PQN), cyclic LOESS normalization, variance stabilizing
                     +normalization (generalized log transformation), sum normalization, median
                     +normalization, and feature-based scaling (e.g., auto scaling, Pareto scaling,
                     +and level scaling).
+                    +
                     +```{r normalization}
                     +## Apply sum normalization to a poplin object
                     +res <- poplin_normalize(x = faahko_poplin, method = "sum",
                     +                        xin = "knn", xout = "knn_sum")
+                    +
                     +## Apply VSN normalization to a matrix
                     +m <- poplin_data(faahko_poplin, "knn")
                     +poplin_data(res, "knn_vsn") <- poplin_normalize(x = m, method = "vsn")
+                    +
                     +poplin_data_list(res)
                     +```
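As an illustration of what a data-driven normalization does, the following base R sketch implements the core of probabilistic quotient normalization. It is an assumption-laden sketch rather than the `poplin_normalize()` implementation: the helper `pqn_normalize()` is hypothetical, the reference profile is taken as the per-feature median across samples, and the sum normalization step that often precedes PQN is omitted.

```r
# Hypothetical helper (not part of poplin): PQN for a feature-by-sample matrix
pqn_normalize <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x))
  # reference profile: median intensity of each feature across samples
  ref <- apply(x, 1, median, na.rm = TRUE)
  # quotients of every sample (column) against the reference profile
  quotients <- x / ref
  # per-sample dilution factor: the median quotient within each sample
  factors <- apply(quotients, 2, median, na.rm = TRUE)
  # divide each sample by its dilution factor
  sweep(x, 2, factors, "/")
}

# Toy data: three samples that differ only by a dilution factor
base_profile <- c(100, 200, 400, 800)
toy <- cbind(S1 = base_profile, S2 = 2 * base_profile, S3 = 0.5 * base_profile)
rownames(toy) <- paste0("F", 1:4)

normed <- pqn_normalize(toy)
normed
```

Because the toy samples differ only by dilution, all three columns coincide after normalization. Real data would retain the biological differences between samples.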
+                    +
                     +# Dimension reduction
+                    +
                     +In metabolomics, dimension reduction methods are often used for modeling and
                     +visualization. Currently, the poplin package supports three dimension-reduction
                     +methods: principal component analysis (PCA), t-distributed stochastic neighbor
                     +embedding (t-SNE), and partial least squares-discriminant analysis (PLS-DA). The
                     +`poplin_reduce()` function performs dimension reduction on the data and stores
                     +the result in the `poplinReduced` container.
+                    +
                     +```{r dimension reduction}
                     +empty <- faahko_poplin
                     +poplin_reduced_list(empty) <- list()
                     +poplin_reduced_names(empty)
+                    +
                     +## Apply PCA to a poplin object
                     +res <- poplin_reduce(x = empty, xin = "knn_cyclic", xout = "pca",
                     +                     method = "pca", ncomp = 3)
+                    +
                     +## Apply t-SNE to a matrix
                     +poplin_reduced(res, "tsne") <- poplin_reduce(m, method = "tsne",
                     +                                             ncomp = 3, perplexity = 3)
+                    +
                     +## Apply PLS-DA to a poplin object
                     +y <- factor(colData(res)$sample_group, levels = c("WT", "KO"))
                     +res <- poplin_reduce(x = res, xin = "knn_cyclic", xout = "plsda",
                     +                     method = "plsda", y = y, ncomp = 3)
+                    +
                     +```
+                    +
                     +The `poplin_reduce()` function returns a result that carries custom attributes
                     +used for summary and visualization. Please refer to [Visualization] for
                     +details.
+                    +
                     +```{r dimension reduction summary}
                     +summary(poplin_reduced(res, "pca"))
                     +summary(poplin_reduced(res, "tsne"))
                     +summary(poplin_reduced(res, "plsda"))
                     +```
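The shape of a dimension-reduced result can be illustrated with base R alone. The sketch below runs PCA with `stats::prcomp()` on simulated data and is not necessarily how `poplin_reduce()` computes it. The log2 transform, the centering and scaling choices, and all object names are assumptions made for the example.

```r
set.seed(1)

# Toy feature-by-sample matrix: 50 features (rows) x 8 samples (columns)
x <- matrix(rlnorm(50 * 8, meanlog = 10, sdlog = 1), nrow = 50,
            dimnames = list(paste0("F", 1:50), paste0("S", 1:8)))

# prcomp() expects observations in rows, so transpose to samples x features
pca <- prcomp(t(log2(x)), center = TRUE, scale. = TRUE)

# keep the first three components, one row of scores per sample
ncomp <- 3
scores <- pca$x[, seq_len(ncomp)]

# proportion of variance explained by each retained component
var_explained <- (pca$sdev^2 / sum(pca$sdev^2))[seq_len(ncomp)]

dim(scores)
var_explained
```

The score matrix has one row per sample, which matches the requirement that each `poplinReduced` entry has as many rows as `ncol()` of the object.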
+                    +
                     +# Visualization
+                    +
                     +The poplin package provides common visualizations for metabolomics data based on
                     +`r CRANpkg("ggplot2")`. The plot functions in the poplin package can be applied
                     +to either a `poplin` object or a `matrix`.
+                    +
                     +## poplin_naplot
+                    +
                     +The `poplin_naplot()` function helps visually inspect the missingness of the
                     +data.
+                    +
                     +```{r naplot, fig.width = 8, fig.height = 8}
                     +poplin_naplot(x = faahko_poplin, xin = "raw")
                     +```
+                    +
                     +## poplin_corplot
+                    +
                     +The `poplin_corplot()` function visualizes correlations between samples (or
                     +features) to quickly check the grouping structure in the data.
+                    +
                     +```{r corplot, fig.width = 8, fig.height = 8}
                     +poplin_corplot(x = faahko_poplin, xin = "knn_cyclic")
                     +```
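The quantity behind such a plot can be sketched in base R. The example below computes sample-by-sample and feature-by-feature correlation matrices with `stats::cor()` on simulated data. Whether `poplin_corplot()` uses Pearson correlation or a log transform is not stated here, so both are assumptions of this sketch.

```r
set.seed(1)

# Toy feature-by-sample matrix: 30 features (rows) x 6 samples (columns)
x <- matrix(rlnorm(30 * 6, meanlog = 10, sdlog = 1), nrow = 30,
            dimnames = list(paste0("F", 1:30), paste0("S", 1:6)))

# cor() correlates columns, so the untransposed matrix compares samples
sample_cor <- cor(log2(x), method = "pearson")

# transpose first to compare features instead
feature_cor <- cor(t(log2(x)), method = "pearson")

dim(sample_cor)
dim(feature_cor)
```

With grouped data, samples from the same group would be expected to show higher pairwise correlations, which is the structure a correlation plot makes visible.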
+                    +
                     +## poplin_boxplot
+                    +
                     +The `poplin_boxplot()` function produces a box-and-whisker plot of the feature
                     +intensity values across the samples.
+                    +
                     +```{r poplin_boxplot, fig.wide = TRUE}
                     +group <- colData(faahko_poplin)$sample_group
+                    +
                     +## distribution of intensities before normalization
                     +poplin_boxplot(faahko_poplin, xin = "knn", group = group,
                     +               pre_log2 = TRUE)
+                    +
                     +## distribution of intensities after cyclic LOESS normalization
                     +poplin_boxplot(faahko_poplin, xin = "knn_cyclic", group = group)
+                    +
                     +```
+                    +
                     +## poplin_scoreplot
+                    +
                     +The `poplin_scoreplot()` function visualizes the data in a lower-dimensional
                     +space using the `poplin_reduce()` output.
+                    +
                     +```{r poplin_scoreplot}
                     +group <- colData(faahko_poplin)$sample_group
+                    +
                     +## PCA output
                     +poplin_scoreplot(faahko_poplin, xin = "pca", group = group,
                     +                 ellipse = TRUE)
+                    +
                     +## PLS-DA output
                     +poplin_scoreplot(faahko_poplin, xin = "plsda", ellipse = TRUE,
                     +                 label = TRUE)
+                    +
                     +```
+                    +
                     +# Session info {-}
+                    +
                     +```{r session info}
                     +sessionInfo()
                     +```

Added introduction vignette.