
Fix vignettes

Johannes Griss authored on 21/03/2024 14:09:36
Showing 1 changed file
... ...
@@ -69,6 +69,10 @@ First, we load the dataset. To reduce the computational complexity of this vigne
69 69
 ```{r}
70 70
 zilionis <- ZilionisLungData()
71 71
 zilionis <- zilionis[, 1:5000]
72
+
73
+# now we add simple colnames (= cell ids) to the dataset
74
+# Note: This is normally not necessary
75
+colnames(zilionis) <- paste0("Cell", 1:ncol(zilionis))
72 76
 ```
73 77
 
74 78
 We split this dataset into two parts, one for the training and the other for the testing.
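The 50/50 split mentioned in this context line is performed later in the vignette; a minimal sketch, assuming `zilionis` is the 5000-cell SCE object loaded above:

```r
# Integer division gives the midpoint of the retained cells;
# the first half is used for training, the second half for testing.
pivot <- ncol(zilionis) %/% 2
train_set <- zilionis[, 1:pivot]
test_set <- zilionis[, (1 + pivot):ncol(zilionis)]
```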

Fixed typo

Johannes Griss authored on 30/07/2021 13:53:47
Showing 1 changed file
... ...
@@ -24,7 +24,7 @@ easy tools to train their own model classifying new cell types from labeled
24 24
 scRNA-seq data.
25 25
 
26 26
 This vignette shows how to train a basic 
27
-classification model for an independant cell type, which is not a child of 
27
+classification model for an independent cell type, which is not a child of 
28 28
 any other cell type.
29 29
 
30 30
 ## Preparing train object and test object

merge separate methods for Seurat and SCE signatures into one function

nttvy authored on 29/07/2021 22:23:02
Showing 1 changed file
... ...
@@ -148,8 +148,8 @@ times for one model, users can use `set.seed`.
148 148
 ```{r}
149 149
 set.seed(123)
150 150
 classifier_B <- train_classifier(train_obj = train_set, cell_type = "B cells", 
151
-                          marker_genes = selected_marker_genes_B,
152
-                          sce_assay = 'counts', sce_tag_slot = 'B_cell')
151
+                                 marker_genes = selected_marker_genes_B,
152
+                                 assay = 'counts', tag_slot = 'B_cell')
153 153
 ```
154 154
 ```{r}
155 155
 classifier_B
... ...
@@ -169,8 +169,8 @@ The `test_classifier` function automatically tests a classifier's performance
169 169
 against another dataset. Here, we used the `test_set` created before:
170 170
 
171 171
 ```{r}
172
-classifier_B_test <- test_classifier(test_obj = test_set, classifier = classifier_B, 
173
-                              sce_assay = 'counts', sce_tag_slot = 'B_cell')
172
+classifier_B_test <- test_classifier(classifier = classifier_B, test_obj = test_set,  
173
+                                     assay = 'counts', tag_slot = 'B_cell')
174 174
 ```
175 175
 
176 176
 ### Interpreting test model result
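After this merge, the Seurat- and SCE-specific parameters share one name; a hedged sketch of the unified calls, assuming `train_set`, `test_set`, and `selected_marker_genes_B` are prepared as earlier in the vignette:

```r
# Unified interface: `assay` and `tag_slot` replace the former
# `sce_assay`/`sce_tag_slot` (and, per the commit message, the
# corresponding Seurat-specific parameters).
set.seed(123)
classifier_B <- train_classifier(train_obj = train_set, cell_type = "B cells",
                                 marker_genes = selected_marker_genes_B,
                                 assay = 'counts', tag_slot = 'B_cell')
classifier_B_test <- test_classifier(classifier = classifier_B,
                                     test_obj = test_set,
                                     assay = 'counts', tag_slot = 'B_cell')
```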

remove default_models stored in package; use BiocFileCache to manage pretrained models

Former-commit-id: 4a69d2314d0fe5849d5002a1dc1cdd474f9afaff

nttvy authored on 29/07/2021 14:57:11
Showing 1 changed file
... ...
@@ -233,7 +233,7 @@ New classification models can be stored using the `save_new_model` function:
233 233
 
234 234
 ```{r}
235 235
 # no copy of pretrained models is performed
236
-save_new_model(new_model = classifier_B, path.to.models = tempdir(),
236
+save_new_model(new_model = classifier_B, path_to_models = tempdir(),
237 237
                include.default = FALSE) 
238 238
 ```
239 239
 
... ...
@@ -241,7 +241,7 @@ Parameters:
241 241
 
242 242
   * **new_model**: The new model that should be added to the database in the
243 243
                    specified directory.
244
-  * **path.to.models**: The directory where the new models should be stored.
244
+  * **path_to_models**: The directory where the new models should be stored.
245 245
   * **include.default**: If set, the default models shipped with the package
246 246
                          are added to the database.
247 247
 
... ...
@@ -253,7 +253,7 @@ Models can be deleted from the model database using the `delete_model` function:
253 253
 
254 254
 ```{r}
255 255
 # delete the "B cells" model from the new database
256
-delete_model("B cells", path.to.models = tempdir())
256
+delete_model("B cells", path_to_models = tempdir())
257 257
 ```
258 258
 
259 259
 ## Session Info
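The two renamed calls from this commit, shown together; a sketch assuming `classifier_B` was trained as earlier in the vignette:

```r
# Store the new model in a writable model database; the snake_case
# `path_to_models` replaces the former `path.to.models`.
save_new_model(new_model = classifier_B, path_to_models = tempdir(),
               include.default = FALSE)

# Remove it again from the same database.
delete_model("B cells", path_to_models = tempdir())
```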

replace core clf by caret_model; replace all other clf by classifiers

Former-commit-id: 91ec8da24f45c110ab4f2e728ab71e9c1d255ca8

nttvy authored on 22/07/2021 13:51:48
Showing 1 changed file
... ...
@@ -147,12 +147,12 @@ from the majority group. To use the same set of cells while training multiple
147 147
 times for one model, users can use `set.seed`. 
148 148
 ```{r}
149 149
 set.seed(123)
150
-clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", 
150
+classifier_B <- train_classifier(train_obj = train_set, cell_type = "B cells", 
151 151
                           marker_genes = selected_marker_genes_B,
152 152
                           sce_assay = 'counts', sce_tag_slot = 'B_cell')
153 153
 ```
154 154
 ```{r}
155
-clf_B
155
+classifier_B
156 156
 ```
157 157
 The classification model is a `scAnnotatR` object. 
158 158
 Details about the classification model are accessible via getter methods. 
... ...
@@ -160,7 +160,7 @@ Details about the classification model are accessible via getter methods.
160 160
 For example:
161 161
 
162 162
 ```{r}
163
-clf(clf_B)
163
+caret_model(classifier_B)
164 164
 ```
165 165
 
166 166
 ## Test model
... ...
@@ -169,7 +169,7 @@ The `test_classifier` function automatically tests a classifier's performance
169 169
 against another dataset. Here, we used the `test_set` created before:
170 170
 
171 171
 ```{r}
172
-clf_B_test <- test_classifier(test_obj = test_set, classifier = clf_B, 
172
+classifier_B_test <- test_classifier(test_obj = test_set, classifier = classifier_B, 
173 173
                               sce_assay = 'counts', sce_tag_slot = 'B_cell')
174 174
 ```
175 175
 
... ...
@@ -191,7 +191,7 @@ Every classifier internally consists of a trained SVM and a probability threshol
191 191
 this threshold are classified as the respective cell type. The *overall_roc* slot summarizes the True Positive Rate (sensitivity) and False Positive Rate (1 - specificity) obtained by the trained model according to different thresholds.
192 192
 
193 193
 ```{r}
194
-clf_B_test$overall_roc
194
+classifier_B_test$overall_roc
195 195
 ```
196 196
 
197 197
 In this example of the B cell classifier, the current threshold is at 0.5. A higher sensitivity can be reached if we set p_thres to 0.4. However, we will then have lower specificity, which means that we will incorrectly classify some cells as B cells. At the same time, we may not retrieve all actual B cells with a higher p_thres (0.6, for example).
... ...
@@ -202,7 +202,7 @@ There is of course a certain trade-off between the sensitivity and the specifici
202 202
 
203 203
 Apart from numbers, we also provide a method to plot the ROC curve. 
204 204
 ```{r}
205
-roc_curve <- plot_roc_curve(test_result = clf_B_test)
205
+roc_curve <- plot_roc_curve(test_result = classifier_B_test)
206 206
 ```
207 207
 ```{r}
208 208
 plot(roc_curve)
... ...
@@ -233,7 +233,7 @@ New classification models can be stored using the `save_new_model` function:
233 233
 
234 234
 ```{r}
235 235
 # no copy of pretrained models is performed
236
-save_new_model(new_model = clf_B, path.to.models = tempdir(),
236
+save_new_model(new_model = classifier_B, path.to.models = tempdir(),
237 237
                include.default = FALSE) 
238 238
 ```
239 239
 

change features to marker_genes

Former-commit-id: b8f82581ead9bfe31c088db02857a1d1c996d832

nttvy authored on 22/07/2021 11:32:05
Showing 1 changed file
... ...
@@ -112,13 +112,13 @@ We may want to check the number of cells in each category:
112 112
 table(train_set$B_cell)
113 113
 ```
114 114
 
115
-## Defining features
115
+## Defining marker genes
116 116
 
117
-Next, we define a set of features, which will be used in training the 
117
+Next, we define a set of marker genes, which will be used in training the 
118 118
 classification model. Supposing we are training a model for classifying 
119
-B cells, we define the set of features as follows:
119
+B cells, we define the set of marker genes as follows:
120 120
 ```{r}
121
-selected_features_B <- c("CD19", "MS4A1", "CD79A", "CD79B", 'CD27', 'IGHG1', 'IGHG2', 'IGHM',
121
+selected_marker_genes_B <- c("CD19", "MS4A1", "CD79A", "CD79B", 'CD27', 'IGHG1', 'IGHG2', 'IGHM',
122 122
                          "CR2", "MEF2C", 'VPREB3', 'CD86', 'LY86', "BLK", "DERL3")
123 123
 ```
124 124
 
... ...
@@ -128,7 +128,7 @@ When the model is being trained, three pieces of information must be
128 128
 provided: 
129 129
 
130 130
   * the Seurat/SCE object used for training
131
-  * the set of applied features
131
+  * the set of applied marker genes
132 132
   * the cell type defining the trained model
133 133
 
134 134
 In case the dataset does not contain any cell classified as the target
... ...
@@ -148,7 +148,7 @@ times for one model, users can use `set.seed`.
148 148
 ```{r}
149 149
 set.seed(123)
150 150
 clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", 
151
-                          features = selected_features_B,
151
+                          marker_genes = selected_marker_genes_B,
152 152
                           sce_assay = 'counts', sce_tag_slot = 'B_cell')
153 153
 ```
154 154
 ```{r}
... ...
@@ -210,8 +210,8 @@ plot(roc_curve)
210 210
 
211 211
 ### Which model to choose?
212 212
 
213
-Changes in the training data, in the set of features and in the prediction probability
214
-threshold will all lead to a change in model performance.
213
+Changes in the training data, in the set of marker genes and in the prediction 
214
+probability threshold will all lead to a change in model performance.
215 215
 
216 216
 There are several ways to evaluate the trained model, including the overall 
217 217
 accuracy, the AUC score and the sensitivity/specificity of the model when 

export models in vignettes to tempdir instead of current directory

Former-commit-id: 15b64d9e0186c72b444cb313c15cb97cd80a3b4e

nttvy authored on 20/07/2021 11:35:28
Showing 1 changed file
... ...
@@ -147,7 +147,8 @@ from the majority group. To use the same set of cells while training multiple
147 147
 times for one model, users can use `set.seed`. 
148 148
 ```{r}
149 149
 set.seed(123)
150
-clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features = selected_features_B,
150
+clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", 
151
+                          features = selected_features_B,
151 152
                           sce_assay = 'counts', sce_tag_slot = 'B_cell')
152 153
 ```
153 154
 ```{r}
... ...
@@ -232,7 +233,8 @@ New classification models can be stored using the `save_new_model` function:
232 233
 
233 234
 ```{r}
234 235
 # no copy of pretrained models is performed
235
-save_new_model(new_model = clf_B, path.to.models = getwd(),include.default = FALSE) 
236
+save_new_model(new_model = clf_B, path.to.models = tempdir(),
237
+               include.default = FALSE) 
236 238
 ```
237 239
 
238 240
 Parameters:
... ...
@@ -251,7 +253,7 @@ Models can be deleted from the model database using the `delete_model` function:
251 253
 
252 254
 ```{r}
253 255
 # delete the "B cells" model from the new database
254
-delete_model("B cells", path.to.models = getwd())
256
+delete_model("B cells", path.to.models = tempdir())
255 257
 ```
256 258
 
257 259
 ## Session Info

rename package from scClassifR to scAnnotatR

Former-commit-id: 9823f9b8cf0ad38ebbe2bcd6f870a63a98c29d07

nttvy authored on 01/07/2021 12:02:51
Showing 1 changed file
... ...
@@ -19,7 +19,7 @@ options(rmarkdown.html_vignette.check_title = FALSE)
19 19
 
20 20
 ## Introduction
21 21
 
22
-One of key functions of the scClassifR package is to provide users 
22
+One of key functions of the scAnnotatR package is to provide users 
23 23
 easy tools to train their own model classifying new cell types from labeled 
24 24
 scRNA-seq data.
25 25
 
... ...
@@ -50,9 +50,9 @@ To start the training workflow, we first install and load the necessary librarie
50 50
 if (!requireNamespace("BiocManager", quietly = TRUE))
51 51
     install.packages("BiocManager")
52 52
 
53
-# the scClassifR package
54
-if (!require(scClassifR))
55
-  BiocManager::install("scClassifR")
53
+# the scAnnotatR package
54
+if (!require(scAnnotatR))
55
+  BiocManager::install("scAnnotatR")
56 56
 
57 57
 # we use the scRNAseq package to load example data
58 58
 if (!require(scRNAseq))
... ...
@@ -61,7 +61,7 @@ if (!require(scRNAseq))
61 61
 
62 62
 ```{r}
63 63
 library(scRNAseq)
64
-library(scClassifR)
64
+library(scAnnotatR)
65 65
 ```
66 66
 
67 67
 First, we load the dataset. To reduce the computational complexity of this vignette, we only use the first 5000 cells of the dataset.
... ...
@@ -105,7 +105,7 @@ test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`,
105 105
                                  function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'}))
106 106
 ```
107 107
 
108
-We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scClassifR from training and testing. 
108
+We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scAnnotatR from training and testing. 
109 109
 
110 110
 We may want to check the number of cells in each category:
111 111
 ```{r}
... ...
@@ -153,7 +153,7 @@ clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features
153 153
 ```{r}
154 154
 clf_B
155 155
 ```
156
-The classification model is a `scClassifR` object. 
156
+The classification model is a `scAnnotatR` object. 
157 157
 Details about the classification model are accessible via getter methods. 
158 158
 
159 159
 For example:

rename package

Former-commit-id: c41bda4739a20cb0a346da77aaa02256d756b67f

nttvy authored on 29/12/2020 00:49:20
Showing 1 changed file
... ...
@@ -19,7 +19,7 @@ options(rmarkdown.html_vignette.check_title = FALSE)
19 19
 
20 20
 ## Introduction
21 21
 
22
-One of key functions of the scTypeR package is to provide users 
22
+One of key functions of the scClassifR package is to provide users 
23 23
 easy tools to train their own model classifying new cell types from labeled 
24 24
 scRNA-seq data.
25 25
 
... ...
@@ -50,9 +50,9 @@ To start the training workflow, we first install and load the necessary librarie
50 50
 if (!requireNamespace("BiocManager", quietly = TRUE))
51 51
     install.packages("BiocManager")
52 52
 
53
-# the scTypeR package
54
-if (!require(scTypeR))
55
-  BiocManager::install("scTypeR")
53
+# the scClassifR package
54
+if (!require(scClassifR))
55
+  BiocManager::install("scClassifR")
56 56
 
57 57
 # we use the scRNAseq package to load example data
58 58
 if (!require(scRNAseq))
... ...
@@ -61,7 +61,7 @@ if (!require(scRNAseq))
61 61
 
62 62
 ```{r}
63 63
 library(scRNAseq)
64
-library(scTypeR)
64
+library(scClassifR)
65 65
 ```
66 66
 
67 67
 First, we load the dataset. To reduce the computational complexity of this vignette, we only use the first 5000 cells of the dataset.
... ...
@@ -105,7 +105,7 @@ test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`,
105 105
                                  function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'}))
106 106
 ```
107 107
 
108
-We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scTypeR from training and testing. 
108
+We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scClassifR from training and testing. 
109 109
 
110 110
 We may want to check the number of cells in each category:
111 111
 ```{r}
... ...
@@ -153,7 +153,7 @@ clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features
153 153
 ```{r}
154 154
 clf_B
155 155
 ```
156
-The classification model is a `scTypeR` object. 
156
+The classification model is a `scClassifR` object. 
157 157
 Details about the classification model are accessible via getter methods. 
158 158
 
159 159
 For example:

add eval = FALSE to the code chunks involving package installation

Former-commit-id: 864885ec7ff787bf98b464fcc188c50693100b77

nttvy authored on 24/12/2020 22:29:17
Showing 1 changed file
... ...
@@ -45,7 +45,7 @@ which is available in the scRNAseq (2.4.0) library. The dataset is stored as a
45 45
 SCE object.
46 46
 
47 47
 To start the training workflow, we first install and load the necessary libraries. 
48
-```{r}
48
+```{r, eval = FALSE}
49 49
 # use BiocManager to install from Bioconductor
50 50
 if (!requireNamespace("BiocManager", quietly = TRUE))
51 51
     install.packages("BiocManager")
... ...
@@ -59,6 +59,11 @@ if (!require(scRNAseq))
59 59
   BiocManager::install("scRNAseq")
60 60
 ```
61 61
 
62
+```{r}
63
+library(scRNAseq)
64
+library(scTypeR)
65
+```
66
+
62 67
 First, we load the dataset. To reduce the computational complexity of this vignette, we only use the first 5000 cells of the dataset.
63 68
 
64 69
 ```{r}

reduce dataset size in vignettes

Former-commit-id: d3cd57efacc4694501f4f615a49b7fe899a06e5b

nttvy authored on 09/12/2020 13:02:17
Showing 1 changed file
... ...
@@ -59,10 +59,11 @@ if (!require(scRNAseq))
59 59
   BiocManager::install("scRNAseq")
60 60
 ```
61 61
 
62
-Load the dataset:
62
+First, we load the dataset. To reduce the computational complexity of this vignette, we only use the first 5000 cells of the dataset.
63 63
 
64 64
 ```{r}
65 65
 zilionis <- ZilionisLungData()
66
+zilionis <- zilionis[, 1:5000]
66 67
 ```
67 68
 
68 69
 We split this dataset into two parts, one for the training and the other for the testing.
... ...
@@ -112,12 +113,8 @@ Next, we define a set of features, which will be used in training the
112 113
 classification model. Supposing we are training a model for classifying 
113 114
 B cells, we define the set of features as follows:
114 115
 ```{r}
115
-selected_features_B <- c("CD19", "MS4A1", "SDC1", "CD79A", "CD79B", 
116
-                         "CD38", "CD37", "CD83", "CR2", "MVK", "MME", 
117
-                         "IL2RA", "PTEN", "POU2AF1", "MEF2C", "IRF8", 
118
-                         "TCF3", "BACH2", "MZB1", 'VPREB3', 'RASGRP2', 
119
-                         'CD86', 'CD84', 'LY86', 'CD74', 'SP140', "BLK", 
120
-                         'FLI1', 'CD14', "DERL3", "LRMP")
116
+selected_features_B <- c("CD19", "MS4A1", "CD79A", "CD79B", 'CD27', 'IGHG1', 'IGHG2', 'IGHM',
117
+                         "CR2", "MEF2C", 'VPREB3', 'CD86', 'LY86', "BLK", "DERL3")
121 118
 ```
122 119
 
123 120
 ## Train model

Change package name

Former-commit-id: af53b823be827fde1d0a8435136b9455293ccbdc

nttvy authored on 01/12/2020 09:03:46
Showing 1 changed file
... ...
@@ -19,7 +19,7 @@ options(rmarkdown.html_vignette.check_title = FALSE)
19 19
 
20 20
 ## Introduction
21 21
 
22
-One of key functions of the SingleCellClassR package is to provide users 
22
+One of key functions of the scTypeR package is to provide users 
23 23
 easy tools to train their own model classifying new cell types from labeled 
24 24
 scRNA-seq data.
25 25
 
... ...
@@ -50,9 +50,9 @@ To start the training workflow, we first install and load the necessary librarie
50 50
 if (!requireNamespace("BiocManager", quietly = TRUE))
51 51
     install.packages("BiocManager")
52 52
 
53
-# the SingleCellClassR package
54
-if (!require(SingleCellClassR))
55
-  BiocManager::install("SingleCellClassR")
53
+# the scTypeR package
54
+if (!require(scTypeR))
55
+  BiocManager::install("scTypeR")
56 56
 
57 57
 # we use the scRNAseq package to load example data
58 58
 if (!require(scRNAseq))
... ...
@@ -99,7 +99,7 @@ test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`,
99 99
                                  function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'}))
100 100
 ```
101 101
 
102
-We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by SingleCellClassR from training and testing. 
102
+We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scTypeR from training and testing. 
103 103
 
104 104
 We may want to check the number of cells in each category:
105 105
 ```{r}
... ...
@@ -151,7 +151,7 @@ clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features
151 151
 ```{r}
152 152
 clf_B
153 153
 ```
154
-The classification model is a `SingleCellClassR` object. 
154
+The classification model is a `scTypeR` object. 
155 155
 Details about the classification model are accessible via getter methods. 
156 156
 
157 157
 For example:

Adapted vignette text

Former-commit-id: f8479b40bc3ff528d07dd7c08bb7b95ed163e1fb

Johannes Griss authored on 16/11/2020 23:02:38
Showing 1 changed file
... ...
@@ -73,10 +73,12 @@ test_set <- zilionis[, (1+pivot):ncol(zilionis)]
73 73
 ```
74 74
 
75 75
 In this dataset, the cell type meta data is stored in the *Most likely LM22 cell type*
76
-slot of the SingleCellExperiment object (in both train object and test object). 
77
-If cell type is stored in another slot of object meta data, 
78
-the slot/tag slot name must be then provided as a parameter in the 
79
-train and test method. 
76
+slot of the SingleCellExperiment object (in both the train object and test object). 
77
+
78
+If the cell type is not stored as the default identification (set through
79
+`Idents` for Seurat objects), the slot must be set as a parameter in the training
80
+and testing functions (see below).
81
+
80 82
 ```{r}
81 83
 unique(train_set$`Most likely LM22 cell type`)
82 84
 ```
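Since the annotation here is not the default identity, its metadata column name has to be handed to the train/test calls; a sketch using the parameter names in use at the time of this commit (`features`, `sce_tag_slot`):

```r
# Inspect the labels available in the metadata column
unique(train_set$`Most likely LM22 cell type`)

# Train against the derived binary tag column rather than the
# default identity (`Idents` for Seurat objects)
set.seed(123)
clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells",
                          features = selected_features_B,
                          sce_assay = 'counts', sce_tag_slot = 'B_cell')
```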

Adapted training-basic-model vignette

Former-commit-id: e5f1d8de2f05f7473f211949af702734a4c14cd9

Johannes Griss authored on 16/11/2020 17:28:26
Showing 1 changed file
... ...
@@ -19,18 +19,21 @@ options(rmarkdown.html_vignette.check_title = FALSE)
19 19
 
20 20
 ## Introduction
21 21
 
22
-One of basic functions of the SingleCellClassR package is to provide users 
22
+One of key functions of the SingleCellClassR package is to provide users 
23 23
 easy tools to train their own model classifying new cell types from labeled 
24 24
 scRNA-seq data.
25 25
 
26
-From the very beginning, this vignette shows how to train a basic 
26
+This vignette shows how to train a basic 
27 27
 classification model for an independant cell type, which is not a child of 
28 28
 any other cell type.
29 29
 
30 30
 ## Preparing train object and test object
31 31
 
32
-The workflow starts from a couple of objects where cells have been 
33
-assigned to be different cell types. To do this, users may have annotated 
32
+The workflow starts with either a [Seurat](https://satijalab.org/seurat/) or 
33
+[SingleCellExperiment](https://osca.bioconductor.org/) object where cells have already
34
+been assigned to different cell types. 
35
+
36
+To do this, users may have annotated 
34 37
 scRNA-seq data (by a FACS-sorting process, for example), create a Seurat/
35 38
 SingleCellExperiment (SCE) object based on the sequencing data and assign the 
36 39
 predetermined cell types as cell meta data. If the scRNA-seq data has not 
... ...
@@ -41,10 +44,19 @@ In this vignette, we use the human lung dataset from Zilionis et al., 2019,
41 44
 which is available in the scRNAseq (2.4.0) library. The dataset is stored as a
42 45
 SCE object.
43 46
 
44
-To start the training workflow, we first load the neccessary libraries. 
47
+To start the training workflow, we first install and load the necessary libraries. 
45 48
 ```{r}
46
-library(SingleCellClassR)
47
-library(scRNAseq)
49
+# use BiocManager to install from Bioconductor
50
+if (!requireNamespace("BiocManager", quietly = TRUE))
51
+    install.packages("BiocManager")
52
+
53
+# the SingleCellClassR package
54
+if (!require(SingleCellClassR))
55
+  BiocManager::install("SingleCellClassR")
56
+
57
+# we use the scRNAseq package to load example data
58
+if (!require(scRNAseq))
59
+  BiocManager::install("scRNAseq")
48 60
 ```
49 61
 
50 62
 Load the dataset:
... ...
@@ -53,7 +65,7 @@ Load the dataset:
53 65
 zilionis <- ZilionisLungData()
54 66
 ```
55 67
 
56
-We cut this dataset into two parts, one for the training and the other for the testing.
68
+We split this dataset into two parts, one for the training and the other for the testing.
57 69
 ```{r}
58 70
 pivot = ncol(zilionis)%/%2
59 71
 train_set <- zilionis[, 1:pivot]
... ...
@@ -85,14 +97,14 @@ test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`,
85 97
                                  function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'}))
86 98
 ```
87 99
 
88
-We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid the affect of those NAs cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by SingleCellClassR from training and testing. 
100
+We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by SingleCellClassR from training and testing. 
89 101
 
90 102
 We may want to check the number of cells in each category:
91 103
 ```{r}
92 104
 table(train_set$B_cell)
93 105
 ```
94 106
 
95
-## Defining set of features
107
+## Defining features
96 108
 
97 109
 Next, we define a set of features, which will be used in training the 
98 110
 classification model. Supposing we are training a model for classifying 
... ...
@@ -108,21 +120,27 @@ selected_features_B <- c("CD19", "MS4A1", "SDC1", "CD79A", "CD79B",
108 120
 
109 121
 ## Train model
110 122
 
111
-When the model is being trained, three most important information must be 
112
-provided are: the Seurat/SCE object used for training, the set of applied features
113
-and the cell type defining the trained model. 
123
+When the model is being trained, three pieces of information must be 
124
+provided: 
125
+
126
+  * the Seurat/SCE object used for training
127
+  * the set of applied features
128
+  * the cell type defining the trained model
129
+
130
+In case the dataset does not contain any cell classified as the target
131
+cell type, the function will fail. 
114 132
 
115
-Cell type corresponding to the trained model must exist among identities 
116
-assigned to cells in the trained Seurat object. Remember if cell types 
117
-are not indicated as active identification of the trained object, name 
118
-of the tag slot in object meta data must be provided to the sce_tag_slot parameter. 
133
+If the cell type annotation is not set in the default identification slot
134
+(`Idents` for `Seurat` objects), the name 
135
+of the metadata field must be provided to the `sce_tag_slot` parameter. 
119 136
 
120
-When training on a imbalanced dataset, the trained model may bias toward the 
137
+When training on an imbalanced dataset (e.g. a dataset containing 90% B cells and
138
+only very few other cell types), the trained model may bias toward the 
121 139
 majority group and ignore the presence of the minority group. To avoid this, 
122
-the number of positive cells and negative cells will automatically be balanced 
140
+the number of positive cells and negative cells will be automatically balanced 
123 141
 before training. Therefore, a smaller number of cells will be randomly picked  
124 142
 from the majority group. To use the same set of cells while training multiple 
125
-times for one model, users can use set.seed. 
143
+times for one model, users can use `set.seed`. 
126 144
 ```{r}
127 145
 set.seed(123)
128 146
 clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features = selected_features_B,
... ...
@@ -131,15 +149,20 @@ clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features
131 149
 ```{r}
132 150
 clf_B
133 151
 ```
134
-The classifying model is a S4 object named SingleCellClassR. 
135
-Details about the classifying model is accessible via getter methods. 
152
+The classification model is a `SingleCellClassR` object. 
153
+Details about the classification model are accessible via getter methods. 
154
+
136 155
 For example:
156
+
137 157
 ```{r}
138 158
 clf(clf_B)
139 159
 ```
140 160
 
141 161
 ## Test model
142 162
 
163
+The `test_classifier` function automatically tests a classifier's performance
164
+against another dataset. Here, we used the `test_set` created before:
165
+
143 166
 ```{r}
144 167
 clf_B_test <- test_classifier(test_obj = test_set, classifier = clf_B, 
145 168
                               sce_assay = 'counts', sce_tag_slot = 'B_cell')
... ...
@@ -155,17 +178,18 @@ Apart from the output exported to the console, the test_classifier function also returns
155 178
 
156 179
   * **acc**: prediction accuracy at the fixed probability threshold, the probability threshold value can also be queried using *p_thres(classifier)*
157 180
 
158
-  * **auc**: AUC score of provided by current classifier
181
+  * **auc**: AUC score provided by current classifier
159 182
   
160 183
   * **overall_roc**: True Positive Rate and False Positive Rate with a certain number of prediction probability thresholds
161 184
   
162
-With the same classification model, the sensitivity and the specification of classification can be different because of the prediction probability threshold. To optimize user experience, we have the *overall_roc* as a summary of True Positive Rate (sensitivity) and False Positive Rate (1 - specificity) obtained by the trained model according to different thresholds:
185
+Every classifier internally consists of a trained SVM and a probability threshold. Only cells that are classified with a probability exceeding
186
+this threshold are classified as the respective cell type. The *overall_roc* slot summarizes the True Positive Rate (sensitivity) and False Positive Rate (1 - specificity) obtained by the trained model according to different thresholds.
163 187
 
164 188
 ```{r}
165 189
 clf_B_test$overall_roc
166 190
 ```
167 191
 
168
-In this example of B cell classifier, the current threshold is at 0.5. The higher sensitivity can be reached if we set the p_thres at 0.4. However, we will then have lower specificity, which means that we misclassify more stranger cells as B cells. In contradiction, we may not retrieve all actual B cells with higher p_thres (0.6, for example).
192
+In this example of B cell classifier, the current threshold is at 0.5. The higher sensitivity can be reached if we set the p_thres at 0.4. However, we will then have lower specificity, which means that we will incorrectly classify some cells as B cells. At the sime time, we may not retrieve all actual B cells with higher p_thres (0.6, for example).
169 193
 
170 194
 There is of course a certain trade-off between the sensitivity and the specificity of the model. Depending on the need of the project or the user-own preference, a probability threshold giving higher sensitivity or higher specificity can be chosen. In our perspective, p_thres at 0.5 is a good choice for the current B cell model.
171 195
 
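The effect of the probability threshold can be illustrated with a few lines of plain R on toy values (the probabilities and labels below are made up for illustration; they are not output of the classifier):

```r
# hypothetical prediction probabilities and true labels (1 = B cell)
probs <- c(0.95, 0.80, 0.65, 0.55, 0.45, 0.40, 0.30, 0.10)
truth <- c(1, 1, 1, 0, 1, 0, 0, 0)

for (p_thres in c(0.4, 0.5, 0.6)) {
  pred <- probs >= p_thres
  tpr <- sum(pred & truth == 1) / sum(truth == 1)  # sensitivity
  fpr <- sum(pred & truth == 0) / sum(truth == 0)  # 1 - specificity
  cat(sprintf("p_thres = %.1f: TPR = %.2f, FPR = %.2f\n", p_thres, tpr, fpr))
}
```

Lowering the threshold raises both rates, raising it lowers both, which is exactly the trade-off described above.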
@@ -181,13 +205,12 @@ plot(roc_curve)
 
 ### Which model to choose?
 
-Changes in train data, in the set of features and in the prediction probability
+Changes in the training data, in the set of features and in the prediction probability
 threshold will all lead to a change in model performance.
 
 There are several ways to evaluate the trained model, including the overall 
 accuracy, the AUC score and the sensitivity/specificity of the model when 
-testing on an independent dataset. Here, we export all these statistics to 
-help user have a wider range of choices. In this example, we choose the model
+testing on an independent dataset. In this example, we choose the model
 which has the best AUC score.
 
 *Tip: Using more general markers of the whole population leads to higher 
@@ -195,29 +218,36 @@ sensitivity. This sometimes produces lower specificity because of close
 cell types (T cells and NK cells, for example). While training some models, 
 we observed that we can use the markers producing high sensitivity but at 
 the same time can improve the specificity by increasing the probability 
-threshold. Of course, this tip can only applied in some cases, because 
+threshold. Of course, this can only be applied in some cases, because 
 some markers can even have a larger effect on the specificity than the 
 prediction probability threshold.*
 
 ## Save classification model for further use
 
-After having obtained a good classification model, users may want to save it 
-for future classification. To do this, we provide a method that helps the user 
-step-by-step store all new classification models.
+New classification models can be stored using the `save_new_model` function:
 
-To use this method, two information must be provided: the to be saved model and
-the directory path where the new model will be stored. This method will then 
-create a small database containing all trained models. Therefore, users must 
-indicate the same path to models in order to use multiple classification models
-at the same time.
+```{r}
+# no copy of pretrained models is performed
+save_new_model(new_model = clf_B, path.to.models = getwd(), include.default = FALSE)
+```
+
+Parameters:
+
+  * **new_model**: The new model that should be added to the database in the
+                   specified directory.
+  * **path.to.models**: The directory where the new models should be stored.
+  * **include.default**: If set, the default models shipped with the package
+                         are added to the database.
 
 Users can also choose whether to copy all pretrained models of the package to the
 new model database. If not, users can later only use either the 
 default pretrained models or the new models, since only one path to models can be specified.
 
+Models can be deleted from the model database using the `delete_model` function:
+
 ```{r}
-# no copy of pretrained models is performed
-save_new_model(new_model = clf_B, path.to.models = getwd(),include.default = FALSE) 
+# delete the "B cells" model from the new database
+delete_model("B cells", path.to.models = getwd())
 ```
 
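When experimenting, it can be convenient to keep the model database outside the working directory. The following sketch uses the same two functions with a temporary directory (same API as shown above; the directory choice is only an example):

```r
# store the new model in a temporary database instead of the working directory
model_dir <- tempdir()
save_new_model(new_model = clf_B, path.to.models = model_dir,
               include.default = FALSE)

# remove the "B cells" model again once it is no longer needed
delete_model("B cells", path.to.models = model_dir)
```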
 ## Session Info

fix bugs + adapt vignettes to scRNAseq dataset

Former-commit-id: fa2385410e4cefdc694545daa48892d15ffe8921

nttvy authored on 14/11/2020 11:50:55
Showing 1 changed file
@@ -29,41 +29,67 @@ any other cell type.
 
 ## Preparing train object and test object
 
-The workflow starts from a couple of Seurat objects where cells have been 
+The workflow starts from a couple of objects where cells have been 
 assigned to different cell types. To do this, users may have annotated 
-scRNA-seq data (by a FACS-sorting process, for example), create a Seurat 
-object based on the sequencing data and assign the predetermined cell types 
-as Seurat meta data. If the scRNA-seq data has not been annotated yet, 
-another possible approach is to follow the basic Seurat workflow until 
-assigning cell type identity to clusters.
+scRNA-seq data (by a FACS-sorting process, for example), create a Seurat/
+SingleCellExperiment (SCE) object based on the sequencing data and assign the 
+predetermined cell types as cell meta data. If the scRNA-seq data has not 
+been annotated yet, another possible approach is to follow the basic 
+workflow (Seurat, for example) until assigning cell type identity to clusters.
+
+In this vignette, we use the human lung dataset from Zilionis et al., 2019,
+which is available in the scRNAseq (2.4.0) library. The dataset is stored as an
+SCE object.
 
 To start the training workflow, we first load the necessary libraries. 
 ```{r}
 library(SingleCellClassR)
-library(SingleCellClassR.data)
+library(scRNAseq)
 ```
 
-One Seurat object will be used as train object, while the other is the test 
-object. In this example, we used Sade-Feldman dataset to create the train 
-object.
+Load the dataset:
 
 ```{r}
-data("feldman_seurat")
-feldman_seurat
+zilionis <- ZilionisLungData()
 ```
 
-We load Jerby-Arnon dataset for the testing object.
+We split this dataset into two parts, one for the training and the other for the testing.
 ```{r}
-data("jerby_seurat")
-jerby_seurat
+pivot <- ncol(zilionis) %/% 2
+train_set <- zilionis[, 1:pivot]
+test_set <- zilionis[, (1 + pivot):ncol(zilionis)]
 ```
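The split above simply takes the first and second half of the columns. If the cells are ordered in any systematic way, a random split may be preferable; a small sketch in base R (the vignette itself keeps the simpler consecutive split):

```r
# alternative: draw a random half of the cells for training
set.seed(42)
train_idx <- sample(ncol(zilionis), size = ncol(zilionis) %/% 2)
train_set <- zilionis[, train_idx]
test_set <- zilionis[, -train_idx]
```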
-In our example, the cell type meta data is indicated as the active 
-identification of the Seurat object (in both train object and test 
-object). If cell type is stored in another slot of object meta data, 
+
+In this dataset, the cell type meta data is stored in the *Most likely LM22 cell type*
+slot of the SingleCellExperiment object (in both the train object and the test object). 
+If the cell type is stored in another slot of the object meta data, 
 the slot/tag slot name must then be provided as a parameter in the 
 train and test method. 
 ```{r}
-head(Idents(feldman_seurat))
+unique(train_set$`Most likely LM22 cell type`)
+```
+```{r}
+unique(test_set$`Most likely LM22 cell type`)
+```
+
+We want to train a classifier for B cells and their phenotypes. Considering memory B cells,
+naive B cells and plasma cells as B cell phenotypes, we convert all those cells to a uniform 
+cell label, i.e. B cells. All non-B cells are converted into 'others'.
+
+```{r}
+# change cell label
+train_set$B_cell <- unlist(lapply(train_set$`Most likely LM22 cell type`,
+                                  function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'}))
+
+test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`,
+                                 function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'}))
+```
+
+We observe that there are cells marked NA. These can be understood as either (1) cells that differ from all indicated cell types or (2) cells of an unknown type. Here we consider the second case, i.e. we don't know whether they are positive or negative for B cells. To avoid the effect of those NA cells, we can assign them the label 'ambiguous'. All cells tagged 'ambiguous' will be ignored by SingleCellClassR during training and testing. 
+
+We may want to check the number of cells in each category:
+```{r}
+table(train_set$B_cell)
 ```
 
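The `lapply` construct used for the relabelling can also be written with nested `ifelse` calls; a self-contained toy example of the same logic (the label vector below is made up):

```r
# toy LM22-style labels, including an NA
labels <- c('B cells naive', 'T cells CD8', NA, 'Plasma cells')

b_cell <- ifelse(is.na(labels), 'ambiguous',
                 ifelse(labels %in% c('Plasma cells', 'B cells memory', 'B cells naive'),
                        'B cells', 'others'))

# b_cell is now c("B cells", "others", "ambiguous", "B cells")
```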
 ## Defining set of features
@@ -83,13 +109,13 @@ selected_features_B <- c("CD19", "MS4A1", "SDC1", "CD79A", "CD79B", 
 ## Train model
 
 When the model is being trained, the three most important pieces of information that must be 
-provided are: the Seurat object used for training, the set of applied features
+provided are: the Seurat/SCE object used for training, the set of applied features
 and the cell type defining the trained model. 
 
 The cell type corresponding to the trained model must exist among the identities 
 assigned to cells in the trained Seurat object. Remember that if cell types 
 are not indicated as the active identification of the trained object, the name 
-of the tag slot in object meta data must be provided to the tag_slot parameter. 
+of the tag slot in object meta data must be provided to the sce_tag_slot parameter. 
 
 When training on an imbalanced dataset, the trained model may be biased toward the 
 majority group and ignore the presence of the minority group. To avoid this, 
@@ -99,8 +125,8 @@ from the majority group. To use the same set of cells while training multiple
 times for one model, users can use set.seed. 
 ```{r}
 set.seed(123)
-clf_B <- train_classifier(train_obj = feldman_seurat, 
-features = selected_features_B, cell_type = "B cells")
+clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features = selected_features_B,
+                          sce_assay = 'counts', sce_tag_slot = 'B_cell')
 ```
 ```{r}
 clf_B
@@ -115,7 +141,8 @@ clf(clf_B)
 ## Test model
 
 ```{r}
-clf_B_test <- test_classifier(test_obj = jerby_seurat, classifier = clf_B)
+clf_B_test <- test_classifier(test_obj = test_set, classifier = clf_B, 
+                              sce_assay = 'counts', sce_tag_slot = 'B_cell')
 ```
 
 ### Interpreting test model result
@@ -138,7 +165,7 @@ With the same classification model, the sensitivity and the specification of cla
 clf_B_test$overall_roc
 ```
 
-In this example of B cell classifier, the current threshold is at 0.5. The sensitivity is 0.9932203, and the specificity is 0.9890493 (FPR = 0.010950643). The higher sensitivity (0.9943503) can be reached if we set the p_thres at 0.4. However, we will have lower specificity (FPR = 0.013966037), which means that we misclassify more stranger cells as B cells. In contradiction, we may not retrieve all actual B cells with higher p_thres (0.6, for example).
+In this example of B cell classifier, the current threshold is at 0.5. The higher sensitivity can be reached if we set the p_thres at 0.4. However, we will then have lower specificity, which means that we misclassify more stranger cells as B cells. In contradiction, we may not retrieve all actual B cells with higher p_thres (0.6, for example).
 
 There is of course a certain trade-off between the sensitivity and the specificity of the model. Depending on the needs of the project or the user's own preference, a probability threshold giving higher sensitivity or higher specificity can be chosen. In our perspective, a p_thres of 0.5 is a good choice for the current B cell model.
 

Initial commit

Former-commit-id: c30d5e0eb17749597b12316855f05517b93b7be9

Johannes Griss authored on 10/11/2020 10:34:39
Showing 1 changed file
new file mode 100644
@@ -0,0 +1,199 @@
+---
+title: "Training basic model classifying a cell type from scRNA-seq data"
+author: "Vy Nguyen"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{2. Training basic model}
+  %\VignetteEngine{knitr::rmarkdown}
+  \usepackage[utf8]{inputenc}
+---
+
+```{r setup, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+options(rmarkdown.html_vignette.check_title = FALSE)
+```
+
+## Introduction
+
+One of basic functions of the SingleCellClassR package is to provide users 
+easy tools to train their own model classifying new cell types from labeled 
+scRNA-seq data.
+
+From the very beginning, this vignette shows how to train a basic 
+classification model for an independant cell type, which is not a child of 
+any other cell type.
+
+## Preparing train object and test object
+
+The workflow starts from a couple of Seurat objects where cells have been 
+assigned to different cell types. To do this, users may have annotated 
+scRNA-seq data (by a FACS-sorting process, for example), create a Seurat 
+object based on the sequencing data and assign the predetermined cell types 
+as Seurat meta data. If the scRNA-seq data has not been annotated yet, 
+another possible approach is to follow the basic Seurat workflow until 
+assigning cell type identity to clusters.
+
+To start the training workflow, we first load the necessary libraries. 
+```{r}
+library(SingleCellClassR)
+library(SingleCellClassR.data)
+```
+
+One Seurat object will be used as train object, while the other is the test 
+object. In this example, we used Sade-Feldman dataset to create the train 
+object.
+
+```{r}
+data("feldman_seurat")
+feldman_seurat
+```
+
+We load Jerby-Arnon dataset for the testing object.
+```{r}
+data("jerby_seurat")
+jerby_seurat
+```
+In our example, the cell type meta data is indicated as the active 
+identification of the Seurat object (in both train object and test 
+object). If cell type is stored in another slot of object meta data, 
+the slot/tag slot name must then be provided as a parameter in the 
+train and test method. 
+```{r}
+head(Idents(feldman_seurat))
+```
+
+## Defining set of features
+
+Next, we define a set of features, which will be used in training the 
+classification model. Supposing we are training a model for classifying 
+B cells, we define the set of features as follows:
+```{r}
+selected_features_B <- c("CD19", "MS4A1", "SDC1", "CD79A", "CD79B", 
+                         "CD38", "CD37", "CD83", "CR2", "MVK", "MME", 
+                         "IL2RA", "PTEN", "POU2AF1", "MEF2C", "IRF8", 
+                         "TCF3", "BACH2", "MZB1", 'VPREB3', 'RASGRP2', 
+                         'CD86', 'CD84', 'LY86', 'CD74', 'SP140', "BLK", 
+                         'FLI1', 'CD14', "DERL3", "LRMP")
+```
+
+## Train model
+
+When the model is being trained, the three most important pieces of information that must be 
+provided are: the Seurat object used for training, the set of applied features
+and the cell type defining the trained model. 
+
+The cell type corresponding to the trained model must exist among the identities 
+assigned to cells in the trained Seurat object. Remember that if cell types 
+are not indicated as the active identification of the trained object, the name 
+of the tag slot in object meta data must be provided to the tag_slot parameter. 
+
+When training on an imbalanced dataset, the trained model may be biased toward the 
+majority group and ignore the presence of the minority group. To avoid this, 
+the number of positive cells and negative cells will automatically be balanced 
+before training. Therefore, a smaller number of cells will be randomly picked 
+from the majority group. To use the same set of cells while training multiple 
+times for one model, users can use set.seed. 
+```{r}
+set.seed(123)
+clf_B <- train_classifier(train_obj = feldman_seurat, 
+features = selected_features_B, cell_type = "B cells")
+```
+```{r}
+clf_B
+```
+The classifying model is an S4 object named SingleCellClassR. 
+Details about the classifying model is accessible via getter methods. 
+For example:
+```{r}
+clf(clf_B)
+```
+
+## Test model
+
+```{r}
+clf_B_test <- test_classifier(test_obj = jerby_seurat, classifier = clf_B)
+```
+
+### Interpreting test model result
+
+Apart from the output exported to console, test classifier function also returns an object, which is a list of:
+
+  * **test_tag**: actual cell label, this can be different from the label provided by users because of ambiguous characters or the incoherence in cell type and sub cell type label assignment.  
+
+  * **pred**: cell type prediction using current classifier
+
+  * **acc**: prediction accuracy at the fixed probability threshold, the probability threshold value can also be queried using *p_thres(classifier)*
+
+  * **auc**: AUC score of provided by current classifier
+
+  * **overall_roc**: True Positive Rate and False Positive Rate with a certain number of prediction probability thresholds
+
+With the same classification model, the sensitivity and the specification of classification can be different because of the prediction probability threshold. To optimize user experience, we have the *overall_roc* as a summary of True Positive Rate (sensitivity) and False Positive Rate (1 - specificity) obtained by the trained model according to different thresholds:
+
+```{r}
+clf_B_test$overall_roc
+```
+
+In this example of B cell classifier, the current threshold is at 0.5. The sensitivity is 0.9932203, and the specificity is 0.9890493 (FPR = 0.010950643). The higher sensitivity (0.9943503) can be reached if we set the p_thres at 0.4. However, we will have lower specificity (FPR = 0.013966037), which means that we misclassify more stranger cells as B cells. In contradiction, we may not retrieve all actual B cells with higher p_thres (0.6, for example).
+
+There is of course a certain trade-off between the sensitivity and the specificity of the model. Depending on the needs of the project or the user's own preference, a probability threshold giving higher sensitivity or higher specificity can be chosen. In our perspective, a p_thres of 0.5 is a good choice for the current B cell model.
+
+### Plotting ROC curve
+
+Apart from numbers, we also provide a method to plot the ROC curve. 
+```{r}
+roc_curve <- plot_roc_curve(test_result = clf_B_test)
+```
+```{r}
+plot(roc_curve)
+```
+
+### Which model to choose?
+
+Changes in train data, in the set of features and in the prediction probability
+threshold will all lead to a change in model performance.
+
+There are several ways to evaluate the trained model, including the overall 
+accuracy, the AUC score and the sensitivity/specificity of the model when 
+testing on an independent dataset. Here, we export all these statistics to 
+help user have a wider range of choices. In this example, we choose the model
+which has the best AUC score.
+
+*Tip: Using more general markers of the whole population leads to higher 
+sensitivity. This sometimes produces lower specificity because of close 
+cell types (T cells and NK cells, for example). While training some models, 
+we observed that we can use the markers producing high sensitivity but at 
+the same time can improve the specificity by increasing the probability 
+threshold. Of course, this tip can only applied in some cases, because 
+some markers can even have a larger effect on the specificity than the 
+prediction probability threshold.*
+
+## Save classification model for further use
+
+After having obtained a good classification model, users may want to save it 
+for future classification. To do this, we provide a method that helps the user 
+step-by-step store all new classification models.
+
+To use this method, two information must be provided: the to be saved model and
+the directory path where the new model will be stored. This method will then 
+create a small database containing all trained models. Therefore, users must 
+indicate the same path to models in order to use multiple classification models
+at the same time.
+
+Users can also choose whether to copy all pretrained models of the package to the
+new model database. If not, users can later only use either the 
+default pretrained models or the new models, since only one path to models can be specified.
+
+```{r}
+# no copy of pretrained models is performed
+save_new_model(new_model = clf_B, path.to.models = getwd(),include.default = FALSE) 
+```
+
+## Session Info
+```{r}
+sessionInfo()
+```
\ No newline at end of file