... | ... |
@@ -69,6 +69,10 @@ First, we load the dataset. To reduce the computational complexity of this vigne |
69 | 69 |
```{r} |
70 | 70 |
zilionis <- ZilionisLungData() |
71 | 71 |
zilionis <- zilionis[, 1:5000] |
72 |
+ |
|
73 |
+# now we add simple colnames (= cell ids) to the dataset |
|
74 |
+# Note: This is normally not necessary |
|
75 |
+colnames(zilionis) <- paste0("Cell", 1:ncol(zilionis)) |
|
72 | 76 |
``` |
73 | 77 |
|
74 | 78 |
We split this dataset into two parts, one for the training and the other for the testing. |
... | ... |
@@ -24,7 +24,7 @@ easy tools to train their own model classifying new cell types from labeled |
24 | 24 |
scRNA-seq data. |
25 | 25 |
|
26 | 26 |
This vignette shows how to train a basic |
27 |
-classification model for an independant cell type, which is not a child of |
|
27 |
+classification model for an independent cell type, which is not a child of |
|
28 | 28 |
any other cell type. |
29 | 29 |
|
30 | 30 |
## Preparing train object and test object |
... | ... |
@@ -148,8 +148,8 @@ times for one model, users can use `set.seed`. |
148 | 148 |
```{r} |
149 | 149 |
set.seed(123) |
150 | 150 |
classifier_B <- train_classifier(train_obj = train_set, cell_type = "B cells", |
151 |
- marker_genes = selected_marker_genes_B, |
|
152 |
- sce_assay = 'counts', sce_tag_slot = 'B_cell') |
|
151 |
+ marker_genes = selected_marker_genes_B, |
|
152 |
+ assay = 'counts', tag_slot = 'B_cell') |
|
153 | 153 |
``` |
154 | 154 |
```{r} |
155 | 155 |
classifier_B |
... | ... |
@@ -169,8 +169,8 @@ The `test_classifier` model automatically tests a classifier's performance |
169 | 169 |
against another dataset. Here, we used the `test_set` created before: |
170 | 170 |
|
171 | 171 |
```{r} |
172 |
-classifier_B_test <- test_classifier(test_obj = test_set, classifier = classifier_B, |
|
173 |
- sce_assay = 'counts', sce_tag_slot = 'B_cell') |
|
172 |
+classifier_B_test <- test_classifier(classifier = classifier_B, test_obj = test_set, |
|
173 |
+ assay = 'counts', tag_slot = 'B_cell') |
|
174 | 174 |
``` |
175 | 175 |
|
176 | 176 |
### Interpreting test model result |
Former-commit-id: 4a69d2314d0fe5849d5002a1dc1cdd474f9afaff
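Taken together, the renamed arguments in the hunks above are used as follows. This is a minimal sketch assuming the `train_set`, `test_set` and `selected_marker_genes_B` objects defined earlier in the vignette; the argument names (`marker_genes`, `assay`, `tag_slot`) are taken verbatim from the added lines.

```r
# train a binary B cell classifier; 'assay' and 'tag_slot' replace the older
# 'sce_assay' and 'sce_tag_slot' argument names
set.seed(123)
classifier_B <- train_classifier(train_obj = train_set, cell_type = "B cells",
                                 marker_genes = selected_marker_genes_B,
                                 assay = 'counts', tag_slot = 'B_cell')

# evaluate the trained classifier on the held-out test set
classifier_B_test <- test_classifier(classifier = classifier_B, test_obj = test_set,
                                     assay = 'counts', tag_slot = 'B_cell')
```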
... | ... |
@@ -233,7 +233,7 @@ New classification models can be stored using the `save_new_model` function: |
233 | 233 |
|
234 | 234 |
```{r} |
235 | 235 |
# no copy of pretrained models is performed |
236 |
-save_new_model(new_model = classifier_B, path.to.models = tempdir(), |
|
236 |
+save_new_model(new_model = classifier_B, path_to_models = tempdir(), |
|
237 | 237 |
include.default = FALSE) |
238 | 238 |
``` |
239 | 239 |
|
... | ... |
@@ -241,7 +241,7 @@ Parameters: |
241 | 241 |
|
242 | 242 |
* **new_model**: The new model that should be added to the database in the |
243 | 243 |
specified directory. |
244 |
- * **path.to.models**: The directory where the new models should be stored. |
|
244 |
+ * **path_to_models**: The directory where the new models should be stored. |
|
245 | 245 |
* **include.default**: If set, the default models shipped with the package |
246 | 246 |
are added to the database. |
247 | 247 |
|
... | ... |
@@ -253,7 +253,7 @@ Models can be deleted from the model database using the `delete_model` function: |
253 | 253 |
|
254 | 254 |
```{r} |
255 | 255 |
# delete the "B cells" model from the new database |
256 |
-delete_model("B cells", path.to.models = tempdir()) |
|
256 |
+delete_model("B cells", path_to_models = tempdir()) |
|
257 | 257 |
``` |
258 | 258 |
|
259 | 259 |
## Session Info |
Former-commit-id: 91ec8da24f45c110ab4f2e728ab71e9c1d255ca8
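For completeness, the two renamed calls above combine into the following round trip; it is a sketch assuming the `classifier_B` model trained earlier in the vignette and the `path_to_models` argument introduced in this commit.

```r
# store the new model in a temporary model database
# (no copy of the pretrained models is performed)
save_new_model(new_model = classifier_B, path_to_models = tempdir(),
               include.default = FALSE)

# remove the "B cells" model from that database again
delete_model("B cells", path_to_models = tempdir())
```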
... | ... |
@@ -147,12 +147,12 @@ from the majority group. To use the same set of cells while training multiple |
147 | 147 |
times for one model, users can use `set.seed`. |
148 | 148 |
```{r} |
149 | 149 |
set.seed(123) |
150 |
-clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", |
|
150 |
+classifier_B <- train_classifier(train_obj = train_set, cell_type = "B cells", |
|
151 | 151 |
marker_genes = selected_marker_genes_B, |
152 | 152 |
sce_assay = 'counts', sce_tag_slot = 'B_cell') |
153 | 153 |
``` |
154 | 154 |
```{r} |
155 |
-clf_B |
|
155 |
+classifier_B |
|
156 | 156 |
``` |
157 | 157 |
The classification model is a `scAnnotatR` object. |
158 | 158 |
Details about the classification model are accessible via getter methods. |
... | ... |
@@ -160,7 +160,7 @@ Details about the classification model are accessible via getter methods. |
160 | 160 |
For example: |
161 | 161 |
|
162 | 162 |
```{r} |
163 |
-clf(clf_B) |
|
163 |
+caret_model(classifier_B) |
|
164 | 164 |
``` |
165 | 165 |
|
166 | 166 |
## Test model |
... | ... |
@@ -169,7 +169,7 @@ The `test_classifier` model automatically tests a classifier's performance |
169 | 169 |
against another dataset. Here, we used the `test_set` created before: |
170 | 170 |
|
171 | 171 |
```{r} |
172 |
-clf_B_test <- test_classifier(test_obj = test_set, classifier = clf_B, |
|
172 |
+classifier_B_test <- test_classifier(test_obj = test_set, classifier = classifier_B, |
|
173 | 173 |
sce_assay = 'counts', sce_tag_slot = 'B_cell') |
174 | 174 |
``` |
175 | 175 |
|
... | ... |
@@ -191,7 +191,7 @@ Every classifier internally consists of a trained SVM and a probability threshol |
191 | 191 |
this threshold are classified as the respective cell type. The *overall_roc* slot summarizes the True Positive Rate (sensitivity) and False Positive Rate (1 - specificity) obtained by the trained model according to different thresholds. |
192 | 192 |
|
193 | 193 |
```{r} |
194 |
-clf_B_test$overall_roc |
|
194 |
+classifier_B_test$overall_roc |
|
195 | 195 |
``` |
196 | 196 |
|
197 | 197 |
In this example of the B cell classifier, the current threshold is at 0.5. A higher sensitivity can be reached if we set the p_thres at 0.4. However, we will then have lower specificity, which means that we will incorrectly classify some cells as B cells. At the same time, we may not retrieve all actual B cells with a higher p_thres (0.6, for example). |
... | ... |
@@ -202,7 +202,7 @@ There is of course a certain trade-off between the sensitivity and the specifici |
202 | 202 |
|
203 | 203 |
Apart from numbers, we also provide a method to plot the ROC curve. |
204 | 204 |
```{r} |
205 |
-roc_curve <- plot_roc_curve(test_result = clf_B_test) |
|
205 |
+roc_curve <- plot_roc_curve(test_result = classifier_B_test) |
|
206 | 206 |
``` |
207 | 207 |
```{r} |
208 | 208 |
plot(roc_curve) |
... | ... |
@@ -233,7 +233,7 @@ New classification models can be stored using the `save_new_model` function: |
233 | 233 |
|
234 | 234 |
```{r} |
235 | 235 |
# no copy of pretrained models is performed |
236 |
-save_new_model(new_model = clf_B, path.to.models = tempdir(), |
|
236 |
+save_new_model(new_model = classifier_B, path.to.models = tempdir(), |
|
237 | 237 |
include.default = FALSE) |
238 | 238 |
``` |
239 | 239 |
|
Former-commit-id: b8f82581ead9bfe31c088db02857a1d1c996d832
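With the renamed objects above, inspecting a test result looks roughly like this. The `acc`, `auc` and `overall_roc` fields are the ones listed in the vignette's "Interpreting test model result" section; treat the snippet as a sketch, not additional vignette code.

```r
# elements of the list returned by test_classifier
classifier_B_test$acc           # accuracy at the fixed probability threshold
classifier_B_test$auc           # AUC score of the classifier
classifier_B_test$overall_roc   # TPR/FPR over a range of probability thresholds

# ROC curve of the tested classifier
roc_curve <- plot_roc_curve(test_result = classifier_B_test)
plot(roc_curve)
```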
... | ... |
@@ -112,13 +112,13 @@ We may want to check the number of cells in each category: |
112 | 112 |
table(train_set$B_cell) |
113 | 113 |
``` |
114 | 114 |
|
115 |
-## Defining features |
|
115 |
+## Defining marker genes |
|
116 | 116 |
|
117 |
-Next, we define a set of features, which will be used in training the |
|
117 |
+Next, we define a set of marker genes, which will be used in training the |
|
118 | 118 |
classification model. Supposing we are training a model for classifying |
119 |
-B cells, we define the set of features as follows: |
|
119 |
+B cells, we define the set of marker genes as follows: |
|
120 | 120 |
```{r} |
121 |
-selected_features_B <- c("CD19", "MS4A1", "CD79A", "CD79B", 'CD27', 'IGHG1', 'IGHG2', 'IGHM', |
|
121 |
+selected_marker_genes_B <- c("CD19", "MS4A1", "CD79A", "CD79B", 'CD27', 'IGHG1', 'IGHG2', 'IGHM', |
|
122 | 122 |
"CR2", "MEF2C", 'VPREB3', 'CD86', 'LY86', "BLK", "DERL3") |
123 | 123 |
``` |
124 | 124 |
|
... | ... |
@@ -128,7 +128,7 @@ When the model is being trained, three pieces of information must be |
128 | 128 |
provided: |
129 | 129 |
|
130 | 130 |
* the Seurat/SCE object used for training |
131 |
- * the set of applied features |
|
131 |
+ * the set of applied marker genes |
|
132 | 132 |
* the cell type defining the trained model |
133 | 133 |
|
134 | 134 |
In case the dataset does not contain any cell classified as the target |
... | ... |
@@ -148,7 +148,7 @@ times for one model, users can use `set.seed`. |
148 | 148 |
```{r} |
149 | 149 |
set.seed(123) |
150 | 150 |
clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", |
151 |
- features = selected_features_B, |
|
151 |
+ marker_genes = selected_marker_genes_B, |
|
152 | 152 |
sce_assay = 'counts', sce_tag_slot = 'B_cell') |
153 | 153 |
``` |
154 | 154 |
```{r} |
... | ... |
@@ -210,8 +210,8 @@ plot(roc_curve) |
210 | 210 |
|
211 | 211 |
### Which model to choose? |
212 | 212 |
|
213 |
-Changes in the training data, in the set of features and in the prediction probability |
|
214 |
-threshold will all lead to a change in model performance. |
|
213 |
+Changes in the training data, in the set of marker genes and in the prediction |
|
214 |
+probability threshold will all lead to a change in model performance. |
|
215 | 215 |
|
216 | 216 |
There are several ways to evaluate the trained model, including the overall |
217 | 217 |
accuracy, the AUC score and the sensitivity/specificity of the model when |
Former-commit-id: 15b64d9e0186c72b444cb313c15cb97cd80a3b4e
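Before training with the renamed `marker_genes` argument, it can be useful to confirm that the chosen marker genes actually occur in the training object. The check below is not part of the vignette; it is a small sketch using base R on the SCE object defined above.

```r
# the trimmed B cell marker set from this commit
selected_marker_genes_B <- c("CD19", "MS4A1", "CD79A", "CD79B", "CD27", "IGHG1", "IGHG2",
                             "IGHM", "CR2", "MEF2C", "VPREB3", "CD86", "LY86", "BLK", "DERL3")

# marker genes missing from the training data; ideally this returns character(0)
setdiff(selected_marker_genes_B, rownames(train_set))
```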
... | ... |
@@ -147,7 +147,8 @@ from the majority group. To use the same set of cells while training multiple |
147 | 147 |
times for one model, users can use `set.seed`. |
148 | 148 |
```{r} |
149 | 149 |
set.seed(123) |
150 |
-clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features = selected_features_B, |
|
150 |
+clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", |
|
151 |
+ features = selected_features_B, |
|
151 | 152 |
sce_assay = 'counts', sce_tag_slot = 'B_cell') |
152 | 153 |
``` |
153 | 154 |
```{r} |
... | ... |
@@ -232,7 +233,8 @@ New classification models can be stored using the `save_new_model` function: |
232 | 233 |
|
233 | 234 |
```{r} |
234 | 235 |
# no copy of pretrained models is performed |
235 |
-save_new_model(new_model = clf_B, path.to.models = getwd(),include.default = FALSE) |
|
236 |
+save_new_model(new_model = clf_B, path.to.models = tempdir(), |
|
237 |
+ include.default = FALSE) |
|
236 | 238 |
``` |
237 | 239 |
|
238 | 240 |
Parameters: |
... | ... |
@@ -251,7 +253,7 @@ Models can be deleted from the model database using the `delete_model` function: |
251 | 253 |
|
252 | 254 |
```{r} |
253 | 255 |
# delete the "B cells" model from the new database |
254 |
-delete_model("B cells", path.to.models = getwd()) |
|
256 |
+delete_model("B cells", path.to.models = tempdir()) |
|
255 | 257 |
``` |
256 | 258 |
|
257 | 259 |
## Session Info |
Former-commit-id: 9823f9b8cf0ad38ebbe2bcd6f870a63a98c29d07
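Switching from `getwd()` to `tempdir()` keeps the vignette from writing into the user's working directory. Outside of a vignette one would typically point the model database at a persistent directory instead; the path below is purely illustrative, and the argument is still called `path.to.models` at this point in the history.

```r
# hypothetical persistent location for the model database (example path only)
model_dir <- file.path(path.expand("~"), "scAnnotatR_models")
dir.create(model_dir, showWarnings = FALSE)

# no copy of pretrained models is performed
save_new_model(new_model = clf_B, path.to.models = model_dir,
               include.default = FALSE)
```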
... | ... |
@@ -19,7 +19,7 @@ options(rmarkdown.html_vignette.check_title = FALSE) |
19 | 19 |
|
20 | 20 |
## Introduction |
21 | 21 |
|
22 |
-One of key functions of the scClassifR package is to provide users |
|
22 |
+One of the key functions of the scAnnotatR package is to provide users |
|
23 | 23 |
easy tools to train their own model classifying new cell types from labeled |
24 | 24 |
scRNA-seq data. |
25 | 25 |
|
... | ... |
@@ -50,9 +50,9 @@ To start the training workflow, we first install and load the necessary librarie |
50 | 50 |
if (!requireNamespace("BiocManager", quietly = TRUE)) |
51 | 51 |
install.packages("BiocManager") |
52 | 52 |
|
53 |
-# the scClassifR package |
|
54 |
-if (!require(scClassifR)) |
|
55 |
- BiocManager::install("scClassifR") |
|
53 |
+# the scAnnotatR package |
|
54 |
+if (!require(scAnnotatR)) |
|
55 |
+ BiocManager::install("scAnnotatR") |
|
56 | 56 |
|
57 | 57 |
# we use the scRNAseq package to load example data |
58 | 58 |
if (!require(scRNAseq)) |
... | ... |
@@ -61,7 +61,7 @@ if (!require(scRNAseq)) |
61 | 61 |
|
62 | 62 |
```{r} |
63 | 63 |
library(scRNAseq) |
64 |
-library(scClassifR) |
|
64 |
+library(scAnnotatR) |
|
65 | 65 |
``` |
66 | 66 |
|
67 | 67 |
First, we load the dataset. To reduce the computational complexity of this vignette, we only use the first 5000 cells of the dataset. |
... | ... |
@@ -105,7 +105,7 @@ test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`, |
105 | 105 |
function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'})) |
106 | 106 |
``` |
107 | 107 |
|
108 |
-We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scClassifR from training and testing. |
|
108 |
+We observe that there are cells marked as NA. Those can be understood as (1) different from all indicated cell types or (2) unknown cell types. Here we consider the second case, i.e. we don't know whether they are positive or negative for B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scAnnotatR during training and testing. |
|
109 | 109 |
|
110 | 110 |
We may want to check the number of cells in each category: |
111 | 111 |
```{r} |
... | ... |
@@ -153,7 +153,7 @@ clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features |
153 | 153 |
```{r} |
154 | 154 |
clf_B |
155 | 155 |
``` |
156 |
-The classification model is a `scClassifR` object. |
|
156 |
+The classification model is a `scAnnotatR` object. |
|
157 | 157 |
Details about the classification model are accessible via getter methods. |
158 | 158 |
|
159 | 159 |
For example: |
Former-commit-id: c41bda4739a20cb0a346da77aaa02256d756b67f
... | ... |
@@ -19,7 +19,7 @@ options(rmarkdown.html_vignette.check_title = FALSE) |
19 | 19 |
|
20 | 20 |
## Introduction |
21 | 21 |
|
22 |
-One of key functions of the scTypeR package is to provide users |
|
22 |
+One of key functions of the scClassifR package is to provide users |
|
23 | 23 |
easy tools to train their own model classifying new cell types from labeled |
24 | 24 |
scRNA-seq data. |
25 | 25 |
|
... | ... |
@@ -50,9 +50,9 @@ To start the training workflow, we first install and load the necessary librarie |
50 | 50 |
if (!requireNamespace("BiocManager", quietly = TRUE)) |
51 | 51 |
install.packages("BiocManager") |
52 | 52 |
|
53 |
-# the scTypeR package |
|
54 |
-if (!require(scTypeR)) |
|
55 |
- BiocManager::install("scTypeR") |
|
53 |
+# the scClassifR package |
|
54 |
+if (!require(scClassifR)) |
|
55 |
+ BiocManager::install("scClassifR") |
|
56 | 56 |
|
57 | 57 |
# we use the scRNAseq package to load example data |
58 | 58 |
if (!require(scRNAseq)) |
... | ... |
@@ -61,7 +61,7 @@ if (!require(scRNAseq)) |
61 | 61 |
|
62 | 62 |
```{r} |
63 | 63 |
library(scRNAseq) |
64 |
-library(scTypeR) |
|
64 |
+library(scClassifR) |
|
65 | 65 |
``` |
66 | 66 |
|
67 | 67 |
First, we load the dataset. To reduce the computational complexity of this vignette, we only use the first 5000 cells of the dataset. |
... | ... |
@@ -105,7 +105,7 @@ test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`, |
105 | 105 |
function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'})) |
106 | 106 |
``` |
107 | 107 |
|
108 |
-We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scTypeR from training and testing. |
|
108 |
+We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scClassifR from training and testing. |
|
109 | 109 |
|
110 | 110 |
We may want to check the number of cells in each category: |
111 | 111 |
```{r} |
... | ... |
@@ -153,7 +153,7 @@ clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features |
153 | 153 |
```{r} |
154 | 154 |
clf_B |
155 | 155 |
``` |
156 |
-The classification model is a `scTypeR` object. |
|
156 |
+The classification model is a `scClassifR` object. |
|
157 | 157 |
Details about the classification model are accessible via getter methods. |
158 | 158 |
|
159 | 159 |
For example: |
Former-commit-id: 864885ec7ff787bf98b464fcc188c50693100b77
... | ... |
@@ -45,7 +45,7 @@ which is available in the scRNAseq (2.4.0) library. The dataset is stored as a |
45 | 45 |
SCE object. |
46 | 46 |
|
47 | 47 |
To start the training workflow, we first install and load the necessary libraries. |
48 |
-```{r} |
|
48 |
+```{r, eval = FALSE} |
|
49 | 49 |
# use BiocManager to install from Bioconductor |
50 | 50 |
if (!requireNamespace("BiocManager", quietly = TRUE)) |
51 | 51 |
install.packages("BiocManager") |
... | ... |
@@ -59,6 +59,11 @@ if (!require(scRNAseq)) |
59 | 59 |
BiocManager::install("scRNAseq") |
60 | 60 |
``` |
61 | 61 |
|
62 |
+```{r} |
|
63 |
+library(scRNAseq) |
|
64 |
+library(scTypeR) |
|
65 |
+``` |
|
66 |
+ |
|
62 | 67 |
First, we load the dataset. To reduce the computational complexity of this vignette, we only use the first 5000 cells of the dataset. |
63 | 68 |
|
64 | 69 |
```{r} |
Former-commit-id: d3cd57efacc4694501f4f615a49b7fe899a06e5b
... | ... |
@@ -59,10 +59,11 @@ if (!require(scRNAseq)) |
59 | 59 |
BiocManager::install("scRNAseq") |
60 | 60 |
``` |
61 | 61 |
|
62 |
-Load the dataset: |
|
62 |
+First, we load the dataset. To reduce the computational complexity of this vignette, we only use the first 5000 cells of the dataset. |
|
63 | 63 |
|
64 | 64 |
```{r} |
65 | 65 |
zilionis <- ZilionisLungData() |
66 |
+zilionis <- zilionis[, 1:5000] |
|
66 | 67 |
``` |
67 | 68 |
|
68 | 69 |
We split this dataset into two parts, one for the training and the other for the testing. |
... | ... |
@@ -112,12 +113,8 @@ Next, we define a set of features, which will be used in training the |
112 | 113 |
classification model. Supposing we are training a model for classifying |
113 | 114 |
B cells, we define the set of features as follows: |
114 | 115 |
```{r} |
115 |
-selected_features_B <- c("CD19", "MS4A1", "SDC1", "CD79A", "CD79B", |
|
116 |
- "CD38", "CD37", "CD83", "CR2", "MVK", "MME", |
|
117 |
- "IL2RA", "PTEN", "POU2AF1", "MEF2C", "IRF8", |
|
118 |
- "TCF3", "BACH2", "MZB1", 'VPREB3', 'RASGRP2', |
|
119 |
- 'CD86', 'CD84', 'LY86', 'CD74', 'SP140', "BLK", |
|
120 |
- 'FLI1', 'CD14', "DERL3", "LRMP") |
|
116 |
+selected_features_B <- c("CD19", "MS4A1", "CD79A", "CD79B", 'CD27', 'IGHG1', 'IGHG2', 'IGHM', |
|
117 |
+ "CR2", "MEF2C", 'VPREB3', 'CD86', 'LY86', "BLK", "DERL3") |
|
121 | 118 |
``` |
122 | 119 |
|
123 | 120 |
## Train model |
Former-commit-id: af53b823be827fde1d0a8435136b9455293ccbdc
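After the subsetting introduced above, the train/test split later in the vignette operates on 5000 cells, i.e. 2500 cells per half. A compact sketch of the combined loading, subsetting and splitting steps (the individual lines all appear in the vignette):

```r
zilionis <- ZilionisLungData()
zilionis <- zilionis[, 1:5000]                        # keep only the first 5000 cells

pivot <- ncol(zilionis) %/% 2                         # integer division: 2500
train_set <- zilionis[, 1:pivot]                      # cells 1..2500
test_set  <- zilionis[, (pivot + 1):ncol(zilionis)]   # cells 2501..5000
```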
... | ... |
@@ -19,7 +19,7 @@ options(rmarkdown.html_vignette.check_title = FALSE) |
19 | 19 |
|
20 | 20 |
## Introduction |
21 | 21 |
|
22 |
-One of key functions of the SingleCellClassR package is to provide users |
|
22 |
+One of key functions of the scTypeR package is to provide users |
|
23 | 23 |
easy tools to train their own model classifying new cell types from labeled |
24 | 24 |
scRNA-seq data. |
25 | 25 |
|
... | ... |
@@ -50,9 +50,9 @@ To start the training workflow, we first install and load the necessary librarie |
50 | 50 |
if (!requireNamespace("BiocManager", quietly = TRUE)) |
51 | 51 |
install.packages("BiocManager") |
52 | 52 |
|
53 |
-# the SingleCellClassR package |
|
54 |
-if (!require(SingleCellClassR)) |
|
55 |
- BiocManager::install("SingleCellClassR") |
|
53 |
+# the scTypeR package |
|
54 |
+if (!require(scTypeR)) |
|
55 |
+ BiocManager::install("scTypeR") |
|
56 | 56 |
|
57 | 57 |
# we use the scRNAseq package to load example data |
58 | 58 |
if (!require(scRNAseq)) |
... | ... |
@@ -99,7 +99,7 @@ test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`, |
99 | 99 |
function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'})) |
100 | 100 |
``` |
101 | 101 |
|
102 |
-We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by SingleCellClassR from training and testing. |
|
102 |
+We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scTypeR from training and testing. |
|
103 | 103 |
|
104 | 104 |
We may want to check the number of cells in each category: |
105 | 105 |
```{r} |
... | ... |
@@ -151,7 +151,7 @@ clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features |
151 | 151 |
```{r} |
152 | 152 |
clf_B |
153 | 153 |
``` |
154 |
-The classification model is a `SingleCellClassR` object. |
|
154 |
+The classification model is a `scTypeR` object. |
|
155 | 155 |
Details about the classification model are accessible via getter methods. |
156 | 156 |
|
157 | 157 |
For example: |
Former-commit-id: f8479b40bc3ff528d07dd7c08bb7b95ed163e1fb
... | ... |
@@ -73,10 +73,12 @@ test_set <- zilionis[, (1+pivot):ncol(zilionis)] |
73 | 73 |
``` |
74 | 74 |
|
75 | 75 |
In this dataset, the cell type meta data is stored in the *Most likely LM22 cell type* |
76 |
-slot of the SingleCellExperiment object (in both train object and test object). |
|
77 |
-If cell type is stored in another slot of object meta data, |
|
78 |
-the slot/tag slot name must be then provided as a parameter in the |
|
79 |
-train and test method. |
|
76 |
+slot of the SingleCellExperiment object (in both the train object and test object). |
|
77 |
+ |
|
78 |
+If the cell type is not stored as the default identification (set through |
|
79 |
+`Idents` for Seurat objects), the slot must be set as a parameter in the training |
|
80 |
+and testing functions (see below). |
|
81 |
+ |
|
80 | 82 |
```{r} |
81 | 83 |
unique(train_set$`Most likely LM22 cell type`) |
82 | 84 |
``` |
Former-commit-id: e5f1d8de2f05f7473f211949af702734a4c14cd9
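In other words, the cell type labels are simply one column of the SCE's cell metadata, and the name of that column is what gets passed to the tag slot argument of the train and test functions. A short sketch, assuming the SingleCellExperiment accessors are available through the packages loaded above:

```r
# list the cell metadata columns of the training SCE; the column holding the labels
# ("Most likely LM22 cell type" here, or "B_cell" after the relabelling below)
# is the one handed to the tag slot argument
colnames(colData(train_set))
```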
... | ... |
@@ -19,18 +19,21 @@ options(rmarkdown.html_vignette.check_title = FALSE) |
19 | 19 |
|
20 | 20 |
## Introduction |
21 | 21 |
|
22 |
-One of basic functions of the SingleCellClassR package is to provide users |
|
22 |
+One of key functions of the SingleCellClassR package is to provide users |
|
23 | 23 |
easy tools to train their own model classifying new cell types from labeled |
24 | 24 |
scRNA-seq data. |
25 | 25 |
|
26 |
-From the very beginning, this vignette shows how to train a basic |
|
26 |
+This vignette shows how to train a basic |
|
27 | 27 |
classification model for an independant cell type, which is not a child of |
28 | 28 |
any other cell type. |
29 | 29 |
|
30 | 30 |
## Preparing train object and test object |
31 | 31 |
|
32 |
-The workflow starts from a couple of objects where cells have been |
|
33 |
-assigned to be different cell types. To do this, users may have annotated |
|
32 |
+The workflow starts with either a [Seurat](https://satijalab.org/seurat/) or |
|
33 |
+[SingleCellExperiment](https://osca.bioconductor.org/) object where cells have already |
|
34 |
+been assigned to different cell types. |
|
35 |
+ |
|
36 |
+To do this, users may have annotated |
|
34 | 37 |
scRNA-seq data (by a FACS-sorting process, for example), create a Seurat/ |
35 | 38 |
SingleCellExperiment (SCE) object based on the sequencing data and assign the |
36 | 39 |
predetermined cell types as cell meta data. If the scRNA-seq data has not |
... | ... |
@@ -41,10 +44,19 @@ In this vignette, we use the human lung dataset from Zilionis et al., 2019, |
41 | 44 |
which is available in the scRNAseq (2.4.0) library. The dataset is stored as a |
42 | 45 |
SCE object. |
43 | 46 |
|
44 |
-To start the training workflow, we first load the neccessary libraries. |
|
47 |
+To start the training workflow, we first install and load the necessary libraries. |
|
45 | 48 |
```{r} |
46 |
-library(SingleCellClassR) |
|
47 |
-library(scRNAseq) |
|
49 |
+# use BiocManager to install from Bioconductor |
|
50 |
+if (!requireNamespace("BiocManager", quietly = TRUE)) |
|
51 |
+ install.packages("BiocManager") |
|
52 |
+ |
|
53 |
+# the SingleCellClassR package |
|
54 |
+if (!require(SingleCellClassR)) |
|
55 |
+ BiocManager::install("SingleCellClassR") |
|
56 |
+ |
|
57 |
+# we use the scRNAseq package to load example data |
|
58 |
+if (!require(scRNAseq)) |
|
59 |
+ BiocManager::install("scRNAseq") |
|
48 | 60 |
``` |
49 | 61 |
|
50 | 62 |
Load the dataset: |
... | ... |
@@ -53,7 +65,7 @@ Load the dataset: |
53 | 65 |
zilionis <- ZilionisLungData() |
54 | 66 |
``` |
55 | 67 |
|
56 |
-We cut this dataset into two parts, one for the training and the other for the testing. |
|
68 |
+We split this dataset into two parts, one for the training and the other for the testing. |
|
57 | 69 |
```{r} |
58 | 70 |
pivot = ncol(zilionis)%/%2 |
59 | 71 |
train_set <- zilionis[, 1:pivot] |
... | ... |
@@ -85,14 +97,14 @@ test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`, |
85 | 97 |
function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'})) |
86 | 98 |
``` |
87 | 99 |
|
88 |
-We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid the affect of those NAs cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by SingleCellClassR from training and testing. |
|
100 |
+We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by SingleCellClassR from training and testing. |
|
89 | 101 |
|
90 | 102 |
We may want to check the number of cells in each category: |
91 | 103 |
```{r} |
92 | 104 |
table(train_set$B_cell) |
93 | 105 |
``` |
94 | 106 |
|
95 |
-## Defining set of features |
|
107 |
+## Defining features |
|
96 | 108 |
|
97 | 109 |
Next, we define a set of features, which will be used in training the |
98 | 110 |
classification model. Supposing we are training a model for classifying |
... | ... |
@@ -108,21 +120,27 @@ selected_features_B <- c("CD19", "MS4A1", "SDC1", "CD79A", "CD79B", |
108 | 120 |
|
109 | 121 |
## Train model |
110 | 122 |
|
111 |
-When the model is being trained, three most important information must be |
|
112 |
-provided are: the Seurat/SCE object used for training, the set of applied features |
|
113 |
-and the cell type defining the trained model. |
|
123 |
+When the model is being trained, three pieces of information must be |
|
124 |
+provided: |
|
125 |
+ |
|
126 |
+ * the Seurat/SCE object used for training |
|
127 |
+ * the set of applied features |
|
128 |
+ * the cell type defining the trained model |
|
129 |
+ |
|
130 |
+In case the dataset does not contain any cell classified as the target |
|
131 |
+cell type, the function will fail. |
|
114 | 132 |
|
115 |
-Cell type corresponding to the trained model must exist among identities |
|
116 |
-assigned to cells in the trained Seurat object. Remember if cell types |
|
117 |
-are not indicated as active identification of the trained object, name |
|
118 |
-of the tag slot in object meta data must be provided to the sce_tag_slot parameter. |
|
133 |
+If the cell type annotation is not set in the default identification slot |
|
134 |
+(`Idents` for `Seurat` objects), the name |
|
135 |
+of the metadata field must be provided to the `sce_tag_slot` parameter. |
|
119 | 136 |
|
120 |
-When training on a imbalanced dataset, the trained model may bias toward the |
|
137 |
+When training on an imbalanced dataset (e.g. a dataset containing 90% B cells and |
|
138 |
+only very few other cell types), the trained model may bias toward the |
|
121 | 139 |
majority group and ignore the presence of the minority group. To avoid this, |
122 |
-the number of positive cells and negative cells will automatically be balanced |
|
140 |
+the number of positive cells and negative cells will be automatically balanced |
|
123 | 141 |
before training. Therefore, a smaller number of cells will be randomly picked |
124 | 142 |
from the majority group. To use the same set of cells while training multiple |
125 |
-times for one model, users can use set.seed. |
|
143 |
+times for one model, users can use `set.seed`. |
|
126 | 144 |
```{r} |
127 | 145 |
set.seed(123) |
128 | 146 |
clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features = selected_features_B, |
... | ... |
@@ -131,15 +149,20 @@ clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features |
131 | 149 |
```{r} |
132 | 150 |
clf_B |
133 | 151 |
``` |
134 |
-The classifying model is a S4 object named SingleCellClassR. |
|
135 |
-Details about the classifying model is accessible via getter methods. |
|
152 |
+The classification model is a `SingleCellClassR` object. |
|
153 |
+Details about the classification model are accessible via getter methods. |
|
154 |
+ |
|
136 | 155 |
For example: |
156 |
+ |
|
137 | 157 |
```{r} |
138 | 158 |
clf(clf_B) |
139 | 159 |
``` |
140 | 160 |
|
141 | 161 |
## Test model |
142 | 162 |
|
163 |
+The `test_classifier` function automatically tests a classifier's performance |
|
164 |
+against another dataset. Here, we used the `test_set` created before: |
|
165 |
+ |
|
143 | 166 |
```{r} |
144 | 167 |
clf_B_test <- test_classifier(test_obj = test_set, classifier = clf_B, |
145 | 168 |
sce_assay = 'counts', sce_tag_slot = 'B_cell') |
... | ... |
@@ -155,17 +178,18 @@ Apart from the output exported to console, test classifier function also returns |
155 | 178 |
|
156 | 179 |
* **acc**: prediction accuracy at the fixed probability threshold, the probability threshold value can also be queried using *p_thres(classifier)* |
157 | 180 |
|
158 |
- * **auc**: AUC score of provided by current classifier |
|
181 |
+ * **auc**: AUC score provided by current classifier |
|
159 | 182 |
|
160 | 183 |
* **overall_roc**: True Positive Rate and False Positive Rate with a certain number of prediction probability thresholds |
161 | 184 |
|
162 |
-With the same classification model, the sensitivity and the specification of classification can be different because of the prediction probability threshold. To optimize user experience, we have the *overall_roc* as a summary of True Positive Rate (sensitivity) and False Positive Rate (1 - specificity) obtained by the trained model according to different thresholds: |
|
185 |
+Every classifier internally consists of a trained SVM and a probability threshold. Only cells whose prediction probability exceeds |
|
186 |
+this threshold are classified as the respective cell type. The *overall_roc* slot summarizes the True Positive Rate (sensitivity) and False Positive Rate (1 - specificity) obtained by the trained model according to different thresholds. |
|
163 | 187 |
|
164 | 188 |
```{r} |
165 | 189 |
clf_B_test$overall_roc |
166 | 190 |
``` |
167 | 191 |
|
168 |
-In this example of B cell classifier, the current threshold is at 0.5. The higher sensitivity can be reached if we set the p_thres at 0.4. However, we will then have lower specificity, which means that we misclassify more stranger cells as B cells. In contradiction, we may not retrieve all actual B cells with higher p_thres (0.6, for example). |
|
192 |
+In this example of the B cell classifier, the current threshold is at 0.5. A higher sensitivity can be reached if we set the p_thres at 0.4. However, we will then have lower specificity, which means that we will incorrectly classify some cells as B cells. At the same time, we may not retrieve all actual B cells with a higher p_thres (0.6, for example). |
|
169 | 193 |
|
170 | 194 |
There is of course a certain trade-off between the sensitivity and the specificity of the model. Depending on the need of the project or the user-own preference, a probability threshold giving higher sensitivity or higher specificity can be chosen. In our perspective, p_thres at 0.5 is a good choice for the current B cell model. |
171 | 195 |
|
... | ... |
@@ -181,13 +205,12 @@ plot(roc_curve) |
181 | 205 |
|
182 | 206 |
### Which model to choose? |
183 | 207 |
|
184 |
-Changes in train data, in the set of features and in the prediction probability |
|
208 |
+Changes in the training data, in the set of features and in the prediction probability |
|
185 | 209 |
threshold will all lead to a change in model performance. |
186 | 210 |
|
187 | 211 |
There are several ways to evaluate the trained model, including the overall |
188 | 212 |
accuracy, the AUC score and the sensitivity/specificity of the model when |
189 |
-testing on an independent dataset. Here, we export all these statistics to |
|
190 |
-help user have a wider range of choices. In this example, we choose the model |
|
213 |
+testing on an independent dataset. In this example, we choose the model |
|
191 | 214 |
which has the best AUC score. |
192 | 215 |
|
193 | 216 |
*Tip: Using more general markers of the whole population leads to higher |
... | ... |
@@ -195,29 +218,36 @@ sensitivity. This sometimes produces lower specificity because of close |
195 | 218 |
cell types (T cells and NK cells, for example). While training some models, |
196 | 219 |
we observed that we can use the markers producing high sensitivity but at |
197 | 220 |
the same time can improve the specificity by increasing the probability |
198 |
-threshold. Of course, this tip can only applied in some cases, because |
|
221 |
+threshold. Of course, this can only be applied in some cases, because |
|
199 | 222 |
some markers can even have a larger effect on the specificity than the |
200 | 223 |
prediction probability threshold.* |
201 | 224 |
|
202 | 225 |
## Save classification model for further use |
203 | 226 |
|
204 |
-After having obtained a good classification model, users may want to save it |
|
205 |
-for future classification. To do this, we provide a method that helps the user |
|
206 |
-step-by-step store all new classification models. |
|
227 |
+New classification models can be stored using the `save_new_model` function: |
|
207 | 228 |
|
208 |
-To use this method, two information must be provided: the to be saved model and |
|
209 |
-the directory path where the new model will be stored. This method will then |
|
210 |
-create a small database containing all trained models. Therefore, users must |
|
211 |
-indicate the same path to models in order to use multiple classification models |
|
212 |
-at the same time. |
|
229 |
+```{r} |
|
230 |
+# no copy of pretrained models is performed |
|
231 |
+save_new_model(new_model = clf_B, path.to.models = getwd(),include.default = FALSE) |
|
232 |
+``` |
|
233 |
+ |
|
234 |
+Parameters: |
|
235 |
+ |
|
236 |
+ * **new_model**: The new model that should be added to the database in the |
|
237 |
+ specified directory. |
|
238 |
+ * **path.to.models**: The directory where the new models should be stored. |
|
239 |
+ * **include.default**: If set, the default models shipped with the package |
|
240 |
+ are added to the database. |
|
213 | 241 |
|
214 | 242 |
Users can also choose whether to copy all pretrained models of the package to the |
215 | 243 |
new model database. If not, users can later only use either the default |
216 | 244 |
pretrained models or the new models, since only one path to models can be specified. |
217 | 245 |
|
246 |
+Models can be deleted from the model database using the `delete_model` function: |
|
247 |
+ |
|
218 | 248 |
```{r} |
219 |
-# no copy of pretrained models is performed |
|
220 |
-save_new_model(new_model = clf_B, path.to.models = getwd(),include.default = FALSE) |
|
249 |
+# delete the "B cells" model from the new database |
|
250 |
+delete_model("B cells", path.to.models = getwd()) |
|
221 | 251 |
``` |
222 | 252 |
|
223 | 253 |
## Session Info |
Former-commit-id: fa2385410e4cefdc694545daa48892d15ffe8921
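To make the threshold discussion in this commit concrete: the fixed probability threshold can be queried from the classifier and compared against the TPR/FPR table in the test result. This is a sketch using only the getters named above (`p_thres`, `overall_roc`); changing the threshold itself is not shown, since no setter appears in this vignette.

```r
# current probability threshold of the classifier (0.5 in the vignette's example)
p_thres(clf_B)

# TPR/FPR summary used to weigh alternative thresholds such as 0.4 or 0.6
clf_B_test$overall_roc
```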
... | ... |
@@ -29,41 +29,67 @@ any other cell type. |
29 | 29 |
|
30 | 30 |
## Preparing train object and test object |
31 | 31 |
|
32 |
-The workflow starts from a couple of Seurat objects where cells have been |
|
32 |
+The workflow starts from a couple of objects where cells have been |
|
33 | 33 |
assigned to be different cell types. To do this, users may have annotated |
34 |
-scRNA-seq data (by a FACS-sorting process, for example), create a Seurat |
|
35 |
-object based on the sequencing data and assign the predetermined cell types |
|
36 |
-as Seurat meta data. If the scRNA-seq data has not been annotated yet, |
|
37 |
-another possible approach is to follow the basic Seurat workflow until |
|
38 |
-assigning cell type identity to clusters. |
|
34 |
+scRNA-seq data (by a FACS-sorting process, for example), create a Seurat/ |
|
35 |
+SingleCellExperiment (SCE) object based on the sequencing data and assign the |
|
36 |
+predetermined cell types as cell meta data. If the scRNA-seq data has not |
|
37 |
+been annotated yet, another possible approach is to follow the basic |
|
38 |
+workflow (Seurat, for example) until assigning cell type identity to clusters. |
|
39 |
+ |
|
40 |
+In this vignette, we use the human lung dataset from Zilionis et al., 2019, |
|
41 |
+which is available in the scRNAseq (2.4.0) library. The dataset is stored as a |
|
42 |
+SCE object. |
|
39 | 43 |
|
40 | 44 |
To start the training workflow, we first load the neccessary libraries. |
41 | 45 |
```{r} |
42 | 46 |
library(SingleCellClassR) |
43 |
-library(SingleCellClassR.data) |
|
47 |
+library(scRNAseq) |
|
44 | 48 |
``` |
45 | 49 |
|
46 |
-One Seurat object will be used as train object, while the other is the test |
|
47 |
-object. In this example, we used Sade-Feldman dataset to create the train |
|
48 |
-object. |
|
50 |
+Load the dataset: |
|
49 | 51 |
|
50 | 52 |
```{r} |
51 |
-data("feldman_seurat") |
|
52 |
-feldman_seurat |
|
53 |
+zilionis <- ZilionisLungData() |
|
53 | 54 |
``` |
54 | 55 |
|
55 |
-We load Jerby-Arnon dataset for the testing object. |
|
56 |
+We cut this dataset into two parts, one for the training and the other for the testing. |
|
56 | 57 |
```{r} |
57 |
-data("jerby_seurat") |
|
58 |
-jerby_seurat |
|
58 |
+pivot = ncol(zilionis)%/%2 |
|
59 |
+train_set <- zilionis[, 1:pivot] |
|
60 |
+test_set <- zilionis[, (1+pivot):ncol(zilionis)] |
|
59 | 61 |
``` |
60 |
-In our example, the cell type meta data is indicated as the active |
|
61 |
-identification of the Seurat object (in both train object and test |
|
62 |
-object). If cell type is stored in another slot of object meta data, |
|
62 |
+ |
|
63 |
+In this dataset, the cell type meta data is stored in the *Most likely LM22 cell type* |
|
64 |
+slot of the SingleCellExperiment object (in both train object and test object). |
|
65 |
+If cell type is stored in another slot of object meta data, |
|
63 | 66 |
the slot/tag slot name must be then provided as a parameter in the |
64 | 67 |
train and test method. |
65 | 68 |
```{r} |
66 |
-head(Idents(feldman_seurat)) |
|
69 |
+unique(train_set$`Most likely LM22 cell type`) |
|
70 |
+``` |
|
71 |
+```{r} |
|
72 |
+unique(test_set$`Most likely LM22 cell type`) |
|
73 |
+``` |
|
74 |
+ |
|
75 |
+We want to train a classifier for B cells and their phenotypes. Considering memory B cells, |
|
76 |
+naive B cells and plasma cells as B cell phenotypes, we convert all those cells to a uniform |
|
77 |
+cell label, i.e. B cells. All non-B cells are converted into 'others'. |
|
78 |
+ |
|
79 |
+```{r} |
|
80 |
+# change cell label |
|
81 |
+train_set$B_cell <- unlist(lapply(train_set$`Most likely LM22 cell type`, |
|
82 |
+ function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'})) |
|
83 |
+ |
|
84 |
+test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`, |
|
85 |
+ function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'})) |
|
86 |
+``` |
|
87 |
+ |
|
88 |
+We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid the affect of those NAs cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by SingleCellClassR from training and testing. |
|
89 |
+ |
|
90 |
+We may want to check the number of cells in each category: |
|
91 |
+```{r} |
|
92 |
+table(train_set$B_cell) |
|
67 | 93 |
``` |
68 | 94 |
|
69 | 95 |
## Defining set of features |
... | ... |
@@ -83,13 +109,13 @@ selected_features_B <- c("CD19", "MS4A1", "SDC1", "CD79A", "CD79B", |
83 | 109 |
## Train model |
84 | 110 |
|
85 | 111 |
When the model is being trained, three most important information must be |
86 |
-provided are: the Seurat object used for training, the set of applied features |
|
112 |
+provided are: the Seurat/SCE object used for training, the set of applied features |
|
87 | 113 |
and the cell type defining the trained model. |
88 | 114 |
|
89 | 115 |
Cell type corresponding to the trained model must exist among identities |
90 | 116 |
assigned to cells in the trained Seurat object. Remember if cell types |
91 | 117 |
are not indicated as active identification of the trained object, name |
92 |
-of the tag slot in object meta data must be provided to the tag_slot parameter. |
|
118 |
+of the tag slot in object meta data must be provided to the sce_tag_slot parameter. |
|
93 | 119 |
|
94 | 120 |
When training on a imbalanced dataset, the trained model may bias toward the |
95 | 121 |
majority group and ignore the presence of the minority group. To avoid this, |
... | ... |
@@ -99,8 +125,8 @@ from the majority group. To use the same set of cells while training multiple |
99 | 125 |
times for one model, users can use set.seed. |
100 | 126 |
```{r} |
101 | 127 |
set.seed(123) |
102 |
-clf_B <- train_classifier(train_obj = feldman_seurat, |
|
103 |
-features = selected_features_B, cell_type = "B cells") |
|
128 |
+clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features = selected_features_B, |
|
129 |
+ sce_assay = 'counts', sce_tag_slot = 'B_cell') |
|
104 | 130 |
``` |
105 | 131 |
```{r} |
106 | 132 |
clf_B |
... | ... |
@@ -115,7 +141,8 @@ clf(clf_B) |
115 | 141 |
## Test model |
116 | 142 |
|
117 | 143 |
```{r} |
118 |
-clf_B_test <- test_classifier(test_obj = jerby_seurat, classifier = clf_B) |
|
144 |
+clf_B_test <- test_classifier(test_obj = test_set, classifier = clf_B, |
|
145 |
+ sce_assay = 'counts', sce_tag_slot = 'B_cell') |
|
119 | 146 |
``` |
120 | 147 |
|
121 | 148 |
### Interpreting test model result |
... | ... |
@@ -138,7 +165,7 @@ With the same classification model, the sensitivity and the specification of cla |
138 | 165 |
clf_B_test$overall_roc |
139 | 166 |
``` |
140 | 167 |
|
141 |
-In this example of B cell classifier, the current threshold is at 0.5. The sensitivity is 0.9932203, and the specificity is 0.9890493 (FPR = 0.010950643). The higher sensitivity (0.9943503) can be reached if we set the p_thres at 0.4. However, we will have lower specificity (FPR = 0.013966037), which means that we misclassify more stranger cells as B cells. In contradiction, we may not retrieve all actual B cells with higher p_thres (0.6, for example). |
|
168 |
+In this example of B cell classifier, the current threshold is at 0.5. The higher sensitivity can be reached if we set the p_thres at 0.4. However, we will then have lower specificity, which means that we misclassify more stranger cells as B cells. In contradiction, we may not retrieve all actual B cells with higher p_thres (0.6, for example). |
|
142 | 169 |
|
143 | 170 |
There is of course a certain trade-off between the sensitivity and the specificity of the model. Depending on the need of the project or the user-own preference, a probability threshold giving higher sensitivity or higher specificity can be chosen. In our perspective, p_thres at 0.5 is a good choice for the current B cell model. |
144 | 171 |
|
Former-commit-id: c30d5e0eb17749597b12316855f05517b93b7be9
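The relabelling added in this commit collapses the LM22 annotations into 'B cells', 'others' and 'ambiguous'. The same logic, written with a small named helper purely for readability (the helper name is not part of the vignette):

```r
# collapse LM22 labels: B cell phenotypes -> 'B cells', NA -> 'ambiguous', rest -> 'others'
relabel_b_cell <- function(x) {
  if (is.na(x)) {
    'ambiguous'
  } else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {
    'B cells'
  } else {
    'others'
  }
}

train_set$B_cell <- unlist(lapply(train_set$`Most likely LM22 cell type`, relabel_b_cell))
test_set$B_cell  <- unlist(lapply(test_set$`Most likely LM22 cell type`, relabel_b_cell))

table(train_set$B_cell)   # number of cells per category, as checked in the vignette
```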
1 | 1 |
new file mode 100644 |
... | ... |
@@ -0,0 +1,199 @@ |
1 |
+--- |
|
2 |
+title: "Training basic model classifying a cell type from scRNA-seq data" |
|
3 |
+author: "Vy Nguyen" |
|
4 |
+date: "`r Sys.Date()`" |
|
5 |
+output: rmarkdown::html_vignette |
|
6 |
+vignette: > |
|
7 |
+ %\VignetteIndexEntry{2. Training basic model} |
|
8 |
+ %\VignetteEngine{knitr::rmarkdown} |
|
9 |
+ \usepackage[utf8]{inputenc} |
|
10 |
+--- |
|
11 |
+ |
|
12 |
+```{r setup, include = FALSE} |
|
13 |
+knitr::opts_chunk$set( |
|
14 |
+ collapse = TRUE, |
|
15 |
+ comment = "#>" |
|
16 |
+) |
|
17 |
+options(rmarkdown.html_vignette.check_title = FALSE) |
|
18 |
+``` |
|
19 |
+ |
|
20 |
+## Introduction |
|
21 |
+ |
|
22 |
+One of basic functions of the SingleCellClassR package is to provide users |
|
23 |
+easy tools to train their own model classifying new cell types from labeled |
|
24 |
+scRNA-seq data. |
|
25 |
+ |
|
26 |
+From the very beginning, this vignette shows how to train a basic |
|
27 |
+classification model for an independant cell type, which is not a child of |
|
28 |
+any other cell type. |
|
29 |
+ |
|
30 |
+## Preparing train object and test object |
|
31 |
+ |
|
32 |
+The workflow starts from a couple of Seurat objects where cells have been |
|
33 |
+assigned to be different cell types. To do this, users may have annotated |
|
34 |
+scRNA-seq data (by a FACS-sorting process, for example), create a Seurat |
|
35 |
+object based on the sequencing data and assign the predetermined cell types |
|
36 |
+as Seurat meta data. If the scRNA-seq data has not been annotated yet, |
|
37 |
+another possible approach is to follow the basic Seurat workflow until |
|
38 |
+assigning cell type identity to clusters. |
|
39 |
+ |
|
40 |
+To start the training workflow, we first load the neccessary libraries. |
|
41 |
+```{r} |
|
42 |
+library(SingleCellClassR) |
|
43 |
+library(SingleCellClassR.data) |
|
44 |
+``` |
|
45 |
+ |
|
46 |
+One Seurat object will be used as train object, while the other is the test |
|
47 |
+object. In this example, we used Sade-Feldman dataset to create the train |
|
48 |
+object. |
|
49 |
+ |
|
50 |
+```{r} |
|
51 |
+data("feldman_seurat") |
|
52 |
+feldman_seurat |
|
53 |
+``` |
|
54 |
+ |
|
55 |
+We load Jerby-Arnon dataset for the testing object. |
|
56 |
+```{r} |
|
57 |
+data("jerby_seurat") |
|
58 |
+jerby_seurat |
|
59 |
+``` |
|
60 |
+In our example, the cell type meta data is indicated as the active |
|
61 |
+identification of the Seurat object (in both train object and test |
|
62 |
+object). If cell type is stored in another slot of object meta data, |
|
63 |
+the slot/tag slot name must be then provided as a parameter in the |
|
64 |
+train and test method. |
|
65 |
+```{r} |
|
66 |
+head(Idents(feldman_seurat)) |
|
67 |
+``` |
|
68 |
+ |
|
69 |
+## Defining set of features |
|
70 |
+ |
|
71 |
+Next, we define a set of features, which will be used in training the |
|
72 |
+classification model. Supposing we are training a model for classifying |
|
73 |
+B cells, we define the set of features as follows: |
|
74 |
+```{r} |
|
75 |
+selected_features_B <- c("CD19", "MS4A1", "SDC1", "CD79A", "CD79B", |
|
76 |
+ "CD38", "CD37", "CD83", "CR2", "MVK", "MME", |
|
77 |
+ "IL2RA", "PTEN", "POU2AF1", "MEF2C", "IRF8", |
|
78 |
+ "TCF3", "BACH2", "MZB1", 'VPREB3', 'RASGRP2', |
|
79 |
+ 'CD86', 'CD84', 'LY86', 'CD74', 'SP140', "BLK", |
|
80 |
+ 'FLI1', 'CD14', "DERL3", "LRMP") |
|
81 |
+``` |
|
82 |
+ |
|
83 |
+## Train model |
|
84 |
+ |
|
85 |
+When the model is being trained, three most important information must be |
|
86 |
+provided are: the Seurat object used for training, the set of applied features |
|
87 |
+and the cell type defining the trained model. |
|
88 |
+ |
|
89 |
+Cell type corresponding to the trained model must exist among identities |
|
90 |
+assigned to cells in the trained Seurat object. Remember if cell types |
|
91 |
+are not indicated as active identification of the trained object, name |
|
92 |
+of the tag slot in object meta data must be provided to the tag_slot parameter. |
|
93 |
+ |
|
94 |
+When training on a imbalanced dataset, the trained model may bias toward the |
|
95 |
+majority group and ignore the presence of the minority group. To avoid this, |
|
96 |
+the number of positive cells and negative cells will automatically be balanced |
|
97 |
+before training. Therefore, a smaller number of cells will be randomly picked |
|
98 |
+from the majority group. To use the same set of cells while training multiple |
|
99 |
+times for one model, users can use set.seed. |
|
100 |
+```{r} |
|
101 |
+set.seed(123) |
|
102 |
+clf_B <- train_classifier(train_obj = feldman_seurat, |
|
103 |
+features = selected_features_B, cell_type = "B cells") |
|
104 |
+``` |
|
105 |
+```{r} |
|
106 |
+clf_B |
|
107 |
+``` |
|
108 |
+The classifying model is a S4 object named SingleCellClassR. |
|
109 |
+Details about the classifying model is accessible via getter methods. |
|
110 |
+For example: |
|
111 |
+```{r} |
|
112 |
+clf(clf_B) |
|
113 |
+``` |
|
114 |
+ |
|
115 |
+## Test model |
|
116 |
+ |
|
117 |
+```{r} |
|
118 |
+clf_B_test <- test_classifier(test_obj = jerby_seurat, classifier = clf_B) |
|
119 |
+``` |
|
120 |
+ |
|
121 |
+### Interpreting test model result |
|
122 |
+ |
|
123 |
+Apart from the output exported to console, test classifier function also returns an object, which is a list of: |
|
124 |
+ |
|
125 |
+ * **test_tag**: actual cell label, this can be different from the label provided by users because of ambiguous characters or the incoherence in cell type and sub cell type label assignment. |
|
126 |
+ |
|
127 |
+ * **pred**: cell type prediction using current classifier |
|
128 |
+ |
|
129 |
+ * **acc**: prediction accuracy at the fixed probability threshold, the probability threshold value can also be queried using *p_thres(classifier)* |
|
130 |
+ |
|
131 |
+ * **auc**: AUC score of provided by current classifier |
|
132 |
+ |
|
133 |
+ * **overall_roc**: True Positive Rate and False Positive Rate with a certain number of prediction probability thresholds |
|
134 |
+ |
|
135 |
+With the same classification model, the sensitivity and the specification of classification can be different because of the prediction probability threshold. To optimize user experience, we have the *overall_roc* as a summary of True Positive Rate (sensitivity) and False Positive Rate (1 - specificity) obtained by the trained model according to different thresholds: |
|
136 |
+ |
|
137 |
+```{r} |
|
138 |
+clf_B_test$overall_roc |
|
139 |
+``` |
|
140 |
+ |
|
141 |
+In this example of B cell classifier, the current threshold is at 0.5. The sensitivity is 0.9932203, and the specificity is 0.9890493 (FPR = 0.010950643). The higher sensitivity (0.9943503) can be reached if we set the p_thres at 0.4. However, we will have lower specificity (FPR = 0.013966037), which means that we misclassify more stranger cells as B cells. In contradiction, we may not retrieve all actual B cells with higher p_thres (0.6, for example). |
|
142 |
+ |
|
143 |
+There is of course a certain trade-off between the sensitivity and the specificity of the model. Depending on the need of the project or the user-own preference, a probability threshold giving higher sensitivity or higher specificity can be chosen. In our perspective, p_thres at 0.5 is a good choice for the current B cell model. |
|
144 |
+ |
|
145 |
+### Plotting ROC curve |
|
146 |
+ |
|
147 |
+Apart from numbers, we also provide a method to plot the ROC curve. |
|
148 |
+```{r} |
|
149 |
+roc_curve <- plot_roc_curve(test_result = clf_B_test) |
|
150 |
+``` |
|
151 |
+```{r} |
|
152 |
+plot(roc_curve) |
|
153 |
+``` |
|
154 |
+ |
|
155 |
+### Which model to choose? |
|
156 |
+ |
|
157 |
+Changes in train data, in the set of features and in the prediction probability |
|
158 |
+threshold will all lead to a change in model performance. |
|
159 |
+ |
|
160 |
+There are several ways to evaluate the trained model, including the overall |
|
161 |
+accuracy, the AUC score and the sensitivity/specificity of the model when |
|
162 |
+testing on an independent dataset. Here, we export all these statistics to |
|
163 |
+help user have a wider range of choices. In this example, we choose the model |
|
164 |
+which has the best AUC score. |
|
165 |
+ |
|
166 |
+*Tip: Using more general markers of the whole population leads to higher |
|
167 |
+sensitivity. This sometimes produces lower specificity because of close |
|
168 |
+cell types (T cells and NK cells, for example). While training some models, |
|
169 |
+we observed that we can use the markers producing high sensitivity but at |
|
170 |
+the same time can improve the specificity by increasing the probability |
|
171 |
+threshold. Of course, this tip can only applied in some cases, because |
|
172 |
+some markers can even have a larger effect on the specificity than the |
|
173 |
+prediction probability threshold.* |
|
174 |
+ |
|
175 |
+## Save classification model for further use |
|
176 |
+ |
|
177 |
+After having obtained a good classification model, users may want to save it |
|
178 |
+for future classification. To do this, we provide a method that helps the user |
|
179 |
+step-by-step store all new classification models. |
|
180 |
+ |
|
181 |
+To use this method, two information must be provided: the to be saved model and |
|
182 |
+the directory path where the new model will be stored. This method will then |
|
183 |
+create a small database containing all trained models. Therefore, users must |
|
184 |
+indicate the same path to models in order to use multiple classification models |
|
185 |
+at the same time. |
|
186 |
+ |
|
187 |
+Users can also choose whether to copy all pretrained models of the package to the |
|
188 |
+new model database. If not, users can later only use either the default |
|
189 |
+pretrained models or the new models, since only one path to models can be specified. |
|
190 |
+ |
|
191 |
+```{r} |
|
192 |
+# no copy of pretrained models is performed |
|
193 |
+save_new_model(new_model = clf_B, path.to.models = getwd(),include.default = FALSE) |
|
194 |
+``` |
|
195 |
+ |
|
196 |
+## Session Info |
|
197 |
+```{r} |
|
198 |
+sessionInfo() |
|
199 |
+``` |
|
0 | 200 |
\ No newline at end of file |