vignettes/training-basic-model.Rmd
0065b788
 ---
 title: "Training basic model classifying a cell type from scRNA-seq data"
 author: "Vy Nguyen"
 date: "`r Sys.Date()`"
 output: rmarkdown::html_vignette
 vignette: >
   %\VignetteIndexEntry{2. Training basic model}
   %\VignetteEngine{knitr::rmarkdown}
   \usepackage[utf8]{inputenc}
 ---
 
 ```{r setup, include = FALSE}
 knitr::opts_chunk$set(
   collapse = TRUE,
   comment = "#>"
 )
 options(rmarkdown.html_vignette.check_title = FALSE)
 ```
 
 ## Introduction
 
5796ba1f
 One of key functions of the scTypeR package is to provide users 
0065b788
 easy tools to train their own model classifying new cell types from labeled 
 scRNA-seq data.
 
c604829a
 This vignette shows how to train a basic 
0065b788
 classification model for an independant cell type, which is not a child of 
 any other cell type.
 
 ## Preparing train object and test object
 
c604829a
 The workflow starts with either a [Seurat](https://satijalab.org/seurat/) or 
 [SingleCellExperiment](https://osca.bioconductor.org/) object where cells have already
 been assigned to different cell types. 
 
 To do this, users may have annotated 
855ac3ab
 scRNA-seq data (by a FACS-sorting process, for example), create a Seurat/
 SingleCellExperiment (SCE) object based on the sequencing data and assign the 
 predetermined cell types as cell meta data. If the scRNA-seq data has not 
 been annotated yet, another possible approach is to follow the basic 
 workflow (Seurat, for example) until assigning cell type identity to clusters.
 
 In this vignette, we use the human lung dataset from Zilionis et al., 2019,
 which is available in the scRNAseq (2.4.0) library. The dataset is stored as a
 SCE object.
0065b788
 
c604829a
 To start the training workflow, we first install and load the necessary libraries. 
0065b788
 ```{r}
c604829a
 # use BiocManager to install from Bioconductor
 if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
 
5796ba1f
 # the scTypeR package
 if (!require(scTypeR))
   BiocManager::install("scTypeR")
c604829a
 
 # we use the scRNAseq package to load example data
 if (!require(scRNAseq))
   BiocManager::install("scRNAseq")
0065b788
 ```
 
a6da6ff0
 First, we load the dataset. To reduce the computational complexity of this vignette, we only use the first 5000 cells of the dataset.
0065b788
 
 ```{r}
855ac3ab
 zilionis <- ZilionisLungData()
a6da6ff0
 zilionis <- zilionis[, 1:5000]
0065b788
 ```
 
c604829a
 We split this dataset into two parts, one for the training and the other for the testing.
0065b788
 ```{r}
855ac3ab
 pivot = ncol(zilionis)%/%2
 train_set <- zilionis[, 1:pivot]
 test_set <- zilionis[, (1+pivot):ncol(zilionis)]
0065b788
 ```
855ac3ab
 
 In this dataset, the cell type meta data is stored in the *Most likely LM22 cell type*
2ddab674
 slot of the SingleCellExperiment object (in both the train object and test object). 
 
 If the cell type is stored not stored as the default identification (set through
 `Idents` for Seurat object) the slot must be set as a parameter in the training
 and testing function (see below).
 
0065b788
 ```{r}
855ac3ab
 unique(train_set$`Most likely LM22 cell type`)
 ```
 ```{r}
 unique(test_set$`Most likely LM22 cell type`)
 ```
 
 We want to train a classifier for B cells and their phenotypes. Considering memory B cells,
 naive B cells and plasma cells as B cell phenotypes, we convert all those cells to a uniform 
 cell label, ie. B cells. All non B cells are converted into 'others'.
 
 ```{r}
 # change cell label
 train_set$B_cell <- unlist(lapply(train_set$`Most likely LM22 cell type`,
                                   function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'}))
 
 test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`,
                                  function(x) if (is.na(x)) {'ambiguous'} else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'} else {'others'}))
 ```
 
5796ba1f
 We observe that there are cells marked NAs. Those can be understood as 1/different from all indicated cell types or 2/any unknown cell types. Here we consider the second case, ie. we don't know whether they are positive or negative to B cells. To avoid any effect of these cells, we can assign them as 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scTypeR from training and testing. 
855ac3ab
 
 We may want to check the number of cells in each category:
 ```{r}
 table(train_set$B_cell)
0065b788
 ```
 
c604829a
 ## Defining features
0065b788
 
 Next, we define a set of features, which will be used in training the 
 classification model. Supposing we are training a model for classifying 
 B cells, we define the set of features as follows:
 ```{r}
a6da6ff0
 selected_features_B <- c("CD19", "MS4A1", "CD79A", "CD79B", 'CD27', 'IGHG1', 'IGHG2', 'IGHM',
                          "CR2", "MEF2C", 'VPREB3', 'CD86', 'LY86', "BLK", "DERL3")
0065b788
 ```
 
 ## Train model
 
c604829a
 When the model is being trained, three pieces of information must be 
 provided: 
 
   * the Seurat/SCE object used for training
   * the set of applied features
   * the cell type defining the trained model
 
 In case the dataset does not contain any cell classified as the target
 cell type, the function will fail. 
0065b788
 
c604829a
 If the cell type annotation is not set in the default identification slot
 (`Idents` for `Seurat` objects) the name 
 of the metadata field must be provided to the `sce_tag_slot parameter`. 
0065b788
 
c604829a
 When training on an imbalanced dataset (f.e. a datasets containing 90% B cells and
 only very few other cell types), the trained model may bias toward the 
0065b788
 majority group and ignore the presence of the minority group. To avoid this, 
c604829a
 the number of positive cells and negative cells will be automatically balanced 
0065b788
 before training. Therefore, a smaller number cells will be randomly picked  
 from the majority group. To use the same set of cells while training multiple 
c604829a
 times for one model, users can use `set.seed`. 
0065b788
 ```{r}
 set.seed(123)
855ac3ab
 clf_B <- train_classifier(train_obj = train_set, cell_type = "B cells", features = selected_features_B,
                           sce_assay = 'counts', sce_tag_slot = 'B_cell')
0065b788
 ```
 ```{r}
 clf_B
 ```
5796ba1f
 The classification model is a `scTypeR` object. 
c604829a
 Details about the classification model are accessible via getter methods. 
 
0065b788
 For example:
c604829a
 
0065b788
 ```{r}
 clf(clf_B)
 ```
 
 ## Test model
 
c604829a
 The `test_classifier` model automatically tests a classifier's performance
 against another dataset. Here, we used the `test_set` created before:
 
0065b788
 ```{r}
855ac3ab
 clf_B_test <- test_classifier(test_obj = test_set, classifier = clf_B, 
                               sce_assay = 'counts', sce_tag_slot = 'B_cell')
0065b788
 ```
 
 ### Interpreting test model result
 
 Apart from the output exported to console, test classifier function also returns an object, which is a list of:
   
   * **test_tag**: actual cell label, this can be different from the label provided by users because of ambiguous characters or the incoherence in cell type and sub cell type label assignment.  
 
   * **pred**: cell type prediction using current classifier
 
   * **acc**: prediction accuracy at the fixed probability threshold, the probability threshold value can also be queried using *p_thres(classifier)*
 
c604829a
   * **auc**: AUC score provided by current classifier
0065b788
   
   * **overall_roc**: True Positive Rate and False Positive Rate with a certain number of prediction probability thresholds
   
c604829a
 Every classifier internally consists of a trained SVM and a probability threshold. Only cells that are classified with a probability exceeding
 this threshold are classified as the respective cell type. The *overall_roc* slot summarizes the True Positive Rate (sensitivity) and False Positive Rate (1 - specificity) obtained by the trained model according to different thresholds.
0065b788
 
 ```{r}
 clf_B_test$overall_roc
 ```
 
c604829a
 In this example of B cell classifier, the current threshold is at 0.5. The higher sensitivity can be reached if we set the p_thres at 0.4. However, we will then have lower specificity, which means that we will incorrectly classify some cells as B cells. At the sime time, we may not retrieve all actual B cells with higher p_thres (0.6, for example).
0065b788
 
 There is of course a certain trade-off between the sensitivity and the specificity of the model. Depending on the need of the project or the user-own preference, a probability threshold giving higher sensitivity or higher specificity can be chosen. In our perspective, p_thres at 0.5 is a good choice for the current B cell model.
 
 ### Plotting ROC curve
 
 Apart from numbers, we also provide a method to plot the ROC curve. 
 ```{r}
 roc_curve <- plot_roc_curve(test_result = clf_B_test)
 ```
 ```{r}
 plot(roc_curve)
 ```
 
 ### Which model to choose?
 
c604829a
 Changes in the training data, in the set of features and in the prediction probability
0065b788
 threshold will all lead to a change in model performance.
 
 There are several ways to evaluate the trained model, including the overall 
 accuracy, the AUC score and the sensitivity/specificity of the model when 
c604829a
 testing on an independent dataset. In this example, we choose the model
0065b788
 which has the best AUC score.
 
 *Tip: Using more general markers of the whole population leads to higher 
 sensitivity. This sometimes produces lower specificity because of close 
 cell types (T cells and NK cells, for example). While training some models, 
 we observed that we can use the markers producing high sensitivity but at 
 the same time can improve the specificity by increasing the probability 
c604829a
 threshold. Of course, this can only applied in some cases, because 
0065b788
 some markers can even have a larger affect on the specificity than the 
 prediction probability threshold.*
 
 ## Save classification model for further use
 
c604829a
 New classification models can be stored using the `save_new_model` function:
0065b788
 
c604829a
 ```{r}
 # no copy of pretrained models is performed
 save_new_model(new_model = clf_B, path.to.models = getwd(),include.default = FALSE) 
 ```
 
 Parameters:
 
   * **new_model**: The new model that should be added to the database in the
                    specified directory.
   * **path.to.models**: The directory where the new models should be stored.
   * **include.default**: If set, the default models shipped with the package
                          are added to the database.
0065b788
 
 Users can also choose whether copy all pretrained models of the packages to the
 new model database. If not, in the future, user can only choose to use either 
 default pretrained models or new models by specifying only one path to models.
 
c604829a
 Models can be deleted from the model database using the `delete_model` function:
 
0065b788
 ```{r}
c604829a
 # delete the "B cells" model from the new database
 delete_model("B cells", path.to.models = getwd())
0065b788
 ```
 
 ## Session Info
 ```{r}
 sessionInfo()
 ```