updated vignette

Elijah Willie authored on 16/09/2022 03:55:47
new file mode 100644
... ...
@@ -0,0 +1,209 @@
1
+---
2
+title: "FuseSOM package manual"
3
+author: "Elijah Willie"
4
+date: "`r Sys.Date()`"
5
+output:
6
+    BiocStyle::html_document:
7
+        toc: true
8
+vignette: >
9
+  %\VignetteIndexEntry{FuseSOM package manual}
10
+  %\VignetteEngine{knitr::rmarkdown}
11
+  \usepackage[utf8]{inputenc}
12
+---
13
+
14
+```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE}
15
+library(knitr)
16
+opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
17
+```
18
+
19
+# Installation
20
+```{r, eval = FALSE}
21
+if (!require("BiocManager"))
22
+    install.packages("BiocManager")
23
+BiocManager::install("FuseSOM")
24
+```
25
+
26
+
27
+# Introduction
28
+
29
+`FuseSOM` (a correlation-based multiview self-organizing map for the characterization of cell types) is a tool for unsupervised clustering. `FuseSOM` is robust and achieves high accuracy by combining a `Self Organizing Map` architecture with a `Multiview` integration of correlation-based metrics to cluster highly multiplexed in situ imaging cytometry assays. The `FuseSOM` pipeline accepts commonly used data structures, including `SingleCellExperiment` and `SpatialExperiment` objects as well as `DataFrames`.
30
+
31
+# Disclaimer
32
+
33
+`FuseSOM` is purely a clustering tool and does not provide any means for quality control (QC) or feature selection. We advise users to perform quality control and feature selection with other tools before running `FuseSOM`.
34
+
35
+# Getting Started
36
+
37
+## `FuseSOM` Matrix Input
38
+
39
+If you have a matrix of expression data that has been QCed and normalised by another tool, the next step is to run the `FuseSOM` algorithm. This can be done by calling the `runFuseSOM()` function, which takes the matrix of interest (columns are markers and rows are observations), the markers of interest (if these are not provided, all columns are assumed to be markers), and the number of clusters.
40
+
41
+```{r, message=FALSE, warning=FALSE}
42
+# load FuseSOM
43
+library(FuseSOM)
44
+
45
+```
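As a rough illustration of the input layout described above, the following sketch builds a small toy data frame with observations in rows and markers in columns. The object `toyDat`, the vector `toyMarkers`, and all values are made up for illustration and are not part of the package. The `runFuseSOM()` call is left commented out and simply mirrors the arguments described in this section.

```r
# toy data: 6 cells (rows) by 3 hypothetical markers (columns)
set.seed(1)
toyDat <- data.frame(
    CD3  = runif(6),
    CD20 = runif(6),
    CD68 = runif(6)
)
toyMarkers <- c("CD3", "CD20", "CD68")

# rows are observations and columns are markers
dim(toyDat)

# every marker of interest should be a column of the input
all(toyMarkers %in% colnames(toyDat))

# with a real dataset the call would then take this form (not run here):
# res <- runFuseSOM(data = toyDat, markers = toyMarkers, numClusters = 2)
```

Checking that all markers of interest are present as columns before clustering is a cheap way to catch naming mismatches early.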
46
+
47
+Next we will load the [`Risom et al`](https://www.sciencedirect.com/science/article/pii/S0092867421014860?via%3Dihub) dataset and run it through the `FuseSOM` pipeline. This dataset profiles the spatial landscape of ductal carcinoma in situ (DCIS), a pre-invasive lesion that is thought to be a precursor to invasive breast cancer (IBC). One of the key conclusions of this manuscript is that spatial information about cells can be used to predict disease progression in patients. We will also be using the markers used in the original study.
48
+
49
+```{r}
50
+# load in the data
51
+data("risom_dat")
52
+
53
+# define the markers of interest
54
+risomMarkers <- c('CD45','SMA','CK7','CK5','VIM','CD31','PanKRT','ECAD',
55
+                   'Tryptase','MPO','CD20','CD3','CD8','CD4','CD14','CD68','FAP',
56
+                   'CD36','CD11c','HLADRDPDQ','P63','CD44')
57
+
58
+# we will be using the manual_gating_phenotype as the true cell type to gauge 
59
+# performance
60
+names(risom_dat)[names(risom_dat) == 'manual_gating_phenotype'] <- 'CellType'
61
+
62
+```
63
+
64
+Now that we have loaded the data and defined the markers of interest, we can run the `FuseSOM` algorithm. The `runFuseSOM()` function runs the pipeline from top to bottom and returns the cluster labels as well as the `Self Organizing Map` model.
65
+```{r}
66
+risomRes <- runFuseSOM(data = risom_dat, markers = risomMarkers, 
67
+                        numClusters = 23)
68
+```
69
+
70
+
71
+Let's look at the distribution of the clusters.
72
+```{r}
73
+# get the distribution of the clusters
74
+table(risomRes$clusters)/sum(table(risomRes$clusters))
75
+
76
+```
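The same kind of summary can be written with base R's `prop.table()`, and a contingency table is one simple way to compare cluster labels against a manual annotation such as `CellType`. The sketch below uses made-up label vectors (`toyClusters` and `toyCellType`) rather than the `FuseSOM` output, so the numbers carry no biological meaning.

```r
# made-up cluster labels and manual annotations for eight cells
toyClusters <- c("cluster_1", "cluster_1", "cluster_2", "cluster_3",
                 "cluster_1", "cluster_2", "cluster_1", "cluster_3")
toyCellType <- c("Tcell", "Tcell", "Bcell", "Myeloid",
                 "Tcell", "Bcell", "Bcell", "Myeloid")

# proportion of cells assigned to each cluster
clusterProps <- prop.table(table(toyClusters))
clusterProps

# contingency table of clusters against the manual annotation
confusion <- table(cluster = toyClusters, cellType = toyCellType)
confusion
```

With real data, `toyClusters` would be replaced by the cluster labels returned by the pipeline and `toyCellType` by the `CellType` column.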
77
+
78
+It looks like `cluster_1` contains about $32\%$ of the cells, which is interesting.
79
+Next, let's generate a heatmap of the marker expression for each cluster.
80
+
81
+```{r}
82
+risomHeat <- FuseSOM::markerHeatmap(data = risom_dat, markers = risomMarkers,
83
+                            clusters = risomRes$clusters, clusterMarkers = TRUE)
84
+```
85
+
86
+## Using `FuseSOM` to estimate the number of clusters
87
+`FuseSOM` also provides functionality for estimating the number of clusters in a dataset using two classes of methods:
88
+
89
+1.  Discriminant-based method.
90
+    + A method developed in house based on discriminant based maximum clusterability projection pursuit
91
+2.  Distance-based methods, which include:
92
+    + The Gap Statistic
93
+    + The Jump Statistic
94
+    + The Slope Statistic
95
+    + The Within Cluster Dissimilarity Statistic
96
+    + The Silhouette Statistic
97
+
98
+We can estimate the number of clusters using the `estimateNumCluster()` function. Run `help(estimateNumCluster)` to see its complete functionality.
99
+
100
+```{r}
101
+# let's estimate the number of clusters using all the methods
102
+# original clustering has 23 clusters so we will set kSeq to 2:25
103
+# we pass it the som model generated in the previous step
104
+risomKest <- estimateNumCluster(data = risomRes$model, kSeq = 2:25, 
105
+                                  method = c("Discriminant", "Distance"))
106
+
107
+```
108
+We can then use this result to determine the best number of clusters for this dataset based on the different metrics. The `FuseSOM` package provides a plotting function (`optiPlot()`) which generates an elbow plot marking the optimal number of clusters for each of the distance-based methods.
109
+
110
+```{r}
111
+# what is the best number of clusters determined by the discriminant method?
112
+# optimal number of clusters according to the discriminant method is 7
113
+risomKest$Discriminant 
114
+
115
+# we can plot the results using the optiPlot() function
116
+pSlope <- optiPlot(risomKest, method = 'slope')
117
+pSlope
118
+pJump <- optiPlot(risomKest, method = 'jump')
119
+pJump
120
+pWcd <- optiPlot(risomKest, method = 'wcd')
121
+pWcd
122
+pGap <- optiPlot(risomKest, method = 'gap')
123
+pGap
124
+pSil <- optiPlot(risomKest, method = 'silhouette')
125
+pSil
126
+
127
+```
128
+From the plots, we see that the `Jump` statistic almost perfectly captures the number of clusters. The `Gap` method is a close second with $15$ clusters. All the other methods significantly underestimate the number of clusters.
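To make the idea behind an elbow plot concrete, here is a minimal, generic sketch using base R's `kmeans()` on simulated data. It is not how `FuseSOM` computes its statistics. The data (`toyX`), the candidate range (`kSeq`), and the use of the total within-cluster sum of squares are all assumptions chosen for illustration.

```r
set.seed(42)

# toy data: three well separated groups in two dimensions
toyX <- rbind(
    matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
    matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
    matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

# candidate numbers of clusters
kSeq <- 1:8

# total within-cluster sum of squares for each candidate k
wss <- vapply(kSeq, function(k) {
    kmeans(toyX, centers = k, nstart = 10)$tot.withinss
}, numeric(1))

# the curve drops sharply up to the simulated number of groups (3)
# and flattens afterwards, which is the "elbow"
plot(kSeq, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

Each distance-based method summarises such a curve differently, which is one reason their estimates can disagree on the same data.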
129
+
130
+## `FuseSOM` Single Cell Experiment object as input
131
+
132
+The `FuseSOM` algorithm can also take a `SingleCellExperiment` object as input. The results of the pipeline will be written to either the metadata or the colData fields. See below.
133
+
134
+First we create a `SingleCellExperiment` object.
135
+```{r, message=FALSE, warning=FALSE}
136
+library(SingleCellExperiment)
137
+
138
+# create a SingleCellExperiment object
139
+colDat <- risom_dat[, setdiff(colnames(risom_dat), risomMarkers)]
140
+sce <- SingleCellExperiment(assays = list(counts = t(risom_dat[, risomMarkers])),
141
+                                 colData = colDat)
142
+
143
+sce
144
+```
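A `SingleCellExperiment` assay stores features in rows and cells in columns, which is why the cell-by-marker table is transposed. The base R sketch below, built on a made-up data frame (`toyDat`), shows the resulting shape and why it can matter which columns are transposed: transposing a data frame that still contains a character column yields a character matrix.

```r
# made-up table: 3 cells (rows), 2 markers plus one annotation column
toyDat <- data.frame(
    CD3      = c(0.1, 0.9, 0.4),
    CD20     = c(0.7, 0.2, 0.5),
    CellType = c("Tcell", "Bcell", "Tcell")
)
toyMarkers <- c("CD3", "CD20")

# transposing only the marker columns gives a numeric markers x cells matrix
exprs <- t(toyDat[, toyMarkers])
dim(exprs)
is.numeric(exprs)

# transposing everything coerces all values to character
mixed <- t(toyDat)
is.character(mixed)
```

Keeping annotation columns such as `CellType` in the colData rather than in the assay avoids this coercion.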
145
+
146
+Next we pass it to the `runFuseSOM()` function. Here, we can provide the assay in which the data is stored and what name to store the clusters under in the colData section. Note that the `Self Organizing Map` that is generated will be stored in the metadata field.
147
+
148
+```{r}
149
+risomRessce <- runFuseSOM(sce, markers = risomMarkers, assay = 'counts', 
150
+                      numClusters = 23, verbose = FALSE)
151
+
152
+colnames(colData(risomRessce))
153
+names(metadata(risomRessce))
154
+```
155
+Notice that there is now a `clusters` column in the colData and a `SOM` field in the metadata. You can run this function again with a different number of clusters. If you provide a new name for the clusters, they will be stored under that new column; otherwise, the current `clusters` column will be overwritten. Running it again on the same object will overwrite the `SOM` field in the metadata.
156
+
157
+Just like before, let's plot the heatmap of the resulting clusters across all markers.
158
+```{r}
159
+data <- risom_dat[, risomMarkers] # get the original data used
160
+clusters <- colData(risomRessce)$clusters # extract the clusters from the sce object
161
+# generate the heatmap
162
+risomHeatsce <- markerHeatmap(data = risom_dat, markers = risomMarkers,
163
+                            clusters = clusters, clusterMarkers = TRUE)
164
+```
165
+
166
+## Using `FuseSOM` to estimate the number of clusters for `SingleCellExperiment` objects
167
+Just like before, we can estimate the number of clusters.
168
+```{r}
169
+# let's estimate the number of clusters using all the methods
170
+# original clustering has 23 clusters so we will set kSeq to 2:25
171
+# now we pass it a SingleCellExperiment object instead of the SOM model as before
172
+# this will return a SingleCellExperiment object where the metadata contains the
173
+# cluster estimation information
174
+risomRessce <- estimateNumCluster(data = risomRessce, kSeq = 2:25, 
175
+                                  method = c("Discriminant", "Distance"))
176
+
177
+names(metadata(risomRessce))
178
+```
179
+Notice that the metadata now contains a `clusterEstimation` field, which holds the results from the `estimateNumCluster()` function.
180
+
181
+We can assess the results in the same way as before.
182
+```{r}
183
+# what is the best number of clusters determined by the discriminant method?
184
+# optimal number of clusters according to the discriminant method is 8
185
+metadata(risomRessce)$clusterEstimation$Discriminant 
186
+
187
+# we can plot the results using the optiPlot() function
188
+pSlope <- optiPlot(risomRessce, method = 'slope')
189
+pSlope
190
+pJump <- optiPlot(risomRessce, method = 'jump')
191
+pJump
192
+pWcd <- optiPlot(risomRessce, method = 'wcd')
193
+pWcd
194
+pGap <- optiPlot(risomRessce, method = 'gap')
195
+pGap
196
+pSil <- optiPlot(risomRessce, method = 'silhouette')
197
+pSil
198
+
199
+```
200
+Again, we see that the `Jump` statistic almost perfectly captures the number of clusters. The `Gap` method is a close second with $15$ clusters. All the other methods significantly underestimate the number of clusters.
201
+
202
+## `FuseSOM` Spatial Experiment object as input
203
+The methodology for `SpatialExperiment` objects is exactly the same as that for `SingleCellExperiment` objects.
204
+
205
+# sessionInfo()
206
+
207
+```{r}
208
+sessionInfo()
209
+```
0 210
\ No newline at end of file