Browse code

Merge pull request #562 from adeschen/main

Update main vignette

Astrid DeschĂȘnes authored on 16/10/2024 06:28:13 • GitHub committed on 16/10/2024 06:28:13
Showing 2 changed files

... ...
@@ -24,7 +24,6 @@ BiocStyle::markdown()
24 24
 suppressPackageStartupMessages({
25 25
     library(knitr)
26 26
     library(RAIDS)
27
-    library(gdsfmt)
28 27
 })
29 28
 
30 29
 set.seed(121444)
... ...
@@ -106,194 +105,178 @@ BiocManager::install("RAIDS")
106 105
 
107 106
 # Main Steps
108 107
 
109
-
110
-This is an overview of genetic ancestry inference from cancer-derived 
108
+This is an overview of the genetic ancestry inference from cancer-derived 
111 109
 molecular data:
112 110
 
113 111
 ```{r graphMainSteps, echo=FALSE, fig.align="center", fig.cap="An overview of the genetic ancestry inference process.", out.width='130%', results='asis', warning=FALSE, message=FALSE}
114
-knitr::include_graphics("MainSteps_v04.png")
112
+knitr::include_graphics("MainSteps_v05.png")
115 113
 ```
116 114
 
117 115
 The main steps are:
118 116
 
119
-**Step 1.** Format reference data from the population reference dataset (optional)
120
-
121
-**Step 2.1** Optimize ancestry inference parameters
122
-
123
-**Step 2.2** Infer ancestry for the subjects of the external study
124
-
125
-These steps are described in detail in the following. Steps 2.1 and 2.2 can be 
126
-run together using one wrapper function.
127
-
128
-<br>
129
-<br>
130
-
117
+**Step 1.** Set-up and provide population reference files
131 118
 
132
-## Main Step - Ancestry Inference
119
+**Step 2** Sample the reference data for donor genotypes, to be used for synthesis and optimize ancestry inference parameters
133 120
 
134
-A wrapper function encapsulates multiple steps of the workflow. 
135
-
136
-```{r graphWrapper, echo=FALSE, fig.align="center", fig.cap="Final step - The wrapper function encapsulates multiple steps of the workflow.", out.width='120%', results='asis', warning=FALSE, message=FALSE}
137
-knitr::include_graphics("MainSteps_Wrapper_v04.png")
138
-```
121
+**Step 3** Infer ancestry for the subjects of the external study
139 122
 
140
-In summary, the wrapper function generates the synthetic dataset and uses 
141
-it to selected the optimal parameters before calling the genetic ancestry 
142
-on the current profiles.
123
+**Step 4** Present and interpret the results
143 124
 
144
-According to the type of input data (RNA or DNA), a specific wrapper function 
145
-is available.
125
+These steps are described in detail in the following. 
146 126
 
147 127
 <br>
128
+<br>
148 129
 
149
-### DNA Data - Wrapper function to run ancestry inference on DNA data
150 130
 
151
-The wrapper function, called _runExomeAncestry()_, requires 4 files as input:
131
+## Step 1. Set-up and provide population reference files
152 132
 
153
-- The **population reference GDS file**
154
-- The **population reference SNV Annotation GDS file**
155
-- The **Profile SNP file** (one per sample present in the study)
156
-- The **Profile PED RDS file** (one file with information for all 
157
-profiles in the study)
158 133
 
159
-In addition, a *data.frame* containing the general information about the 
160
-study is also required. The *data.frame* must contain those 3 columns:
134
+### 1.1 Create a directory structure
161 135
 
162
-- _study.id_: The study identifier (example: TCGA-BRCA).
163
-- _study.desc_: The description of the study.
164
-- _study.platform_: The type of sequencing (example: RNA-seq).
136
+First, a specific directory structure must be created. The structure must 
137
+correspond to this:
165 138
 
166
-<br>
167
-
168
-#### **Population reference files**
139
+```
169 140
 
170
-For demonstration purpose, a small 
171
-**population reference GDS file** (called _ex1_good_small_1KG.gds_) and a small 
172
-**population reference SNV Annotation GDS file** (called 
173
-_ex1_good_small_1KG_Annot.gds_) are 
174
-included in this package. Beware that those two files should not be used to 
175
-run a real ancestry inference.The results obtained with those files won't be 
176
-reliable.
141
+#############################################################################
142
+## Working directory structure
143
+#############################################################################
144
+workingDirectory/  
145
+	data/  
146
+		refGDS  
147
+		profileGDS 
177 148
 
178
-The required **population reference GDS file** and 
179
-**population reference SNV Annotation GDS file** should be stored in the same
180
-directory. In the example below, this directory is referred to 
181
-as **pathReference**. 
149
+```
182 150
 
183 151
 <br>
184 152
 
185
-#### **Profile SNP file**
153
+This following running example creates a temporary working directory structure 
154
+when the demo samples will be run. Some sub-directories in 
155
+*workingDirectory/data* will be created in subsequent steps.
186 156
 
187
-The **Profile SNP file** can be either in a VCF format or in a generic format.
188 157
 
189
-The **Profile SNP VCF file**  follows the VCF standard with at least 
190
-those genotype fields: _GT_, _AD_ and _DP_. The identifier of the genotype 
191
-in the VCF file must correspond to the profile identifier _Name.ID_. 
192
-The SNVs  must be germline variants and should include the genotype of the 
193
-wild-type homozygous at the selected positions in the reference. One file per
194
-profile is need and the VCF file must be gzipped.
158
+```{r createDir, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
195 159
 
196
-Note that the name assigned to the **Profile SNP VCF file** has to 
197
-correspond to the profile identifier _Name.ID_ in the following analysis. 
198
-For example, a SNP file called "Sample.01.vcf.gz" would be 
199
-associated to the "Sample.01" profile.
160
+#############################################################################
161
+## Create a temporary working directory structure
162
+#############################################################################
163
+pathWorkingDirectory <- file.path(tempdir(), "workingDirectory")
164
+pathWorkingDirectoryData <- file.path(pathWorkingDirectory, "data")
200 165
 
201
-A generic SNP file can replace the VCF file. The **Profile SNP Generic file** 
202
-format is coma separated and the mandatory columns are:
166
+if (!dir.exists(pathWorkingDirectory)) {
167
+        dir.create(pathWorkingDirectory)
168
+        dir.create(pathWorkingDirectoryData)
169
+        dir.create(file.path(pathWorkingDirectoryData, "refGDS"))
170
+}
203 171
 
204
-* _Chromosome_: The name of the chromosome
205
-* _Position_: The position on the chromosome
206
-* _Ref_: The reference nucleotide
207
-* _Alt_: The aternative nucleotide
208
-* _Count_: The total count
209
-* _File1R_: The count for the reference nucleotide
210
-* _File1A_: The count for the alternative nucleotide
172
+```
211 173
 
212
-Beware that the starting position in the **population reference GDS File** is 
213
-zero (like BED files). The **Profile SNP Generic file** should also start 
214
-at position zero.
174
+<br>
215 175
 
216
-Note that the name assigned to the **Profile SNP Generic file** has to 
217
-correspond to the profile identifier _Name.ID_ in the following analysis. 
218
-For example, a SNP file called "Sample.01.generic.txt.gz" would be 
219
-associated to the "Sample.01" profile.
176
+### 1.2 Download the population reference files
220 177
 
221
-<br>
222 178
 
223
-#### **Profile PED RDS file**
179
+The population reference files should be downloaded in the *data/refGDS* 
180
+sub-directory. This following code downloads the complete pre-processed files 
181
+for 1000 Genomes (1KG), in hg38. The size of the 1KG GDS file is 15GB. 
224 182
 
225
-The **Profile PED RDS file** must contain a *data.frame* describing all 
226
-the profiles to be analyzed. These 5 mandatory columns: 
183
+```
227 184
 
228
-- _Name.ID_: The unique sample identifier. The associated **profile SNP file** 
229
-should be called "Name.ID.txt.gz". 
230
-- _Case.ID_: The patient identifier associated to the sample.
231
-- _Sample.Type_: The information about the profile tissue source 
232
-(primary tumor, metastatic tumor, normal, etc..). 
233
-- _Diagnosis_: The donor's diagnosis.
234
-- _Source_: The source of the profile sequence data (example: dbGAP_XYZ).
185
+#############################################################################
186
+## How to download the pre-processed files for 1000 Genomes (1KG) (15 GB)
187
+#############################################################################
188
+cd workingDirectory
189
+cd data/refGDS
235 190
 
236
-Important: The row names of the *data.frame* must be the profiles *Name.ID*.
191
+wget https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper/matGeno1000g.gds
192
+wget https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper/matAnnot1000g.gds
193
+cd -
237 194
 
238
-This file is referred to as the **Profile PED RDS file** (PED for pedigree). 
239
-Alternatively, the PED information can be saved in another type of 
240
-file (CVS, etc..) as long as the *data.frame* information can be regenerated 
241
-in R (with _read.csv()_ or else). 
195
+```
242 196
 
243 197
 <br>
244 198
 
245
-#### **Example**
246
-
247
-This example run an ancestry inference on an exome sample. Both population 
248
-reference files are demonstration files and should not be 
249
-used for a real ancestry inference. Beware that running an ancestry inference 
250
-on real data will take longer to run.
199
+For demonstration purpose, a small 
200
+**population reference GDS file** (called _ex1_good_small_1KG.gds_) and a small 
201
+**population reference SNV Annotation GDS file** (called 
202
+_ex1_good_small_1KG_Annot.gds_) are 
203
+included in this package. Beware that those two files should not be used to 
204
+run a real ancestry inference. The results obtained with those files won't be 
205
+reliable.
251 206
 
252
-```{r runExomeAncestry, echo=TRUE, eval=TRUE, collapse=FALSE, warning=FALSE, message=FALSE}
253
-#############################################################################
254
-## Load required packages
255
-#############################################################################
256
-library(RAIDS)    
257
-library(gdsfmt)
207
+In this running example, the demonstration files are copied in the required 
208
+*data/refGDS* directory.
258 209
 
259
-## Path to the demo 1KG GDS file is located in this package
260
-dataDir <- system.file("extdata", package="RAIDS")
210
+```{r copyRefFile, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
261 211
 
262 212
 #############################################################################
263
-## Load the information about the profile
213
+## Load RAIDS package
264 214
 #############################################################################
265
-data(demoPedigreeEx1)
266
-head(demoPedigreeEx1)
215
+library(RAIDS)   
267 216
 
268 217
 #############################################################################
269 218
 ## The population reference GDS file and SNV Annotation GDS file
270
-## need to be located in the same directory.
219
+## need to be located in the same sub-directory.
271 220
 ## Note that the population reference GDS file used for this example is a
272 221
 ## simplified version and CANNOT be used for any real analysis
273 222
 #############################################################################
223
+## Path to the demo 1KG GDS file is located in this package
224
+dataDir <- system.file("extdata", package="RAIDS")
274 225
 pathReference <- file.path(dataDir, "tests")
275 226
 
276 227
 fileGDS <- file.path(pathReference, "ex1_good_small_1KG.gds")
277 228
 fileAnnotGDS <- file.path(pathReference, "ex1_good_small_1KG_Annot.gds")
278 229
 
230
+file.copy(fileGDS, file.path(pathWorkingDirectoryData, "refGDS"))
231
+file.copy(fileAnnotGDS, file.path(pathWorkingDirectoryData, "refGDS"))
232
+
233
+```
234
+<br>
235
+<br>
236
+
237
+## Step 2 Ancestry inference with RAIDS
238
+
239
+### 2.1 Set-up required directories
240
+
241
+```{r installRaids, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
242
+
279 243
 #############################################################################
280
-## A data frame containing general information about the study
281
-## is also required. The data frame must have
282
-## those 3 columns: "study.id", "study.desc", "study.platform"
244
+## The file path to the population reference GDS file 
245
+##     is required (refGenotype will be used as input later)
246
+## The file path to the population reference SNV Annotation GDS file
247
+##     is also required (refAnnotation will be used as input later)
283 248
 #############################################################################
284
-studyDF <- data.frame(study.id="MYDATA",
285
-                   study.desc="Description",
286
-                   study.platform="PLATFORM",
287
-                   stringsAsFactors=FALSE)
249
+pathReference <- file.path(pathWorkingDirectoryData, "refGDS")
250
+
251
+refGenotype <- file.path(pathReference, "ex1_good_small_1KG.gds")
252
+refAnnotation <- file.path(pathReference, "ex1_good_small_1KG_Annot.gds")
288 253
 
289 254
 #############################################################################
290
-## The Sample SNP VCF files (one per sample) need
291
-## to be all located in the same directory.
255
+## The output directories inside workingDirectory/data must be created
256
+##    (pathProfileGDS will be used as input later)
292 257
 #############################################################################
293
-pathGeno <- file.path(dataDir, "example", "snpPileup")
258
+pathProfileGDS <- file.path(pathWorkingDirectoryData, "profileGDS")
259
+
260
+if (!dir.exists(pathProfileGDS)) {
261
+    dir.create(pathProfileGDS)
262
+}
263
+
264
+```
265
+
266
+
267
+<br>
268
+
269
+### 2.2 Sample reference donor profiles from the reference data
270
+
271
+With the 1KG reference, we recommend sampling 30 donor profiles per population. 
272
+For reproducibility, be sure to use the same random-number generator seed.
273
+
274
+In the following code, only 2 profiles per population are sampled:
275
+
276
+```{r sampling, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
294 277
 
295 278
 #############################################################################
296
-## Fix RNG seed to ensure reproducible results
279
+## Fix seed to ensure reproducible results
297 280
 #############################################################################
298 281
 set.seed(3043)
299 282
 
... ...
@@ -302,393 +285,173 @@ set.seed(3043)
302 285
 ## the synthetic data.
303 286
 ## Here we select 2 profiles from the simplified 1KG GDS for each 
304 287
 ## subcontinental-level.
305
-## Normally, we use 30 profile for each 
306
-## subcontinental-level but it is too big for the example.
288
+## Normally, we would use 30 profiles for each subcontinental-level.
307 289
 ## The 1KG files in this example only have 6 profiles for each 
308 290
 ## subcontinental-level (for demo purpose only).
309 291
 #############################################################################
310
-gds1KG <- snpgdsOpen(fileGDS)
311
-dataRef <- select1KGPop(gds1KG, nbProfiles=2L)
312
-closefn.gds(gds1KG)
313
-
314
-## GenomeInfoDb and BSgenome are required libraries to run this example
315
-if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
316
-      requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) {
317
-
318
-    ## Chromosome length information
319
-    ## chr23 is chrX, chr24 is chrY and chrM is 25
320
-    chrInfo <- GenomeInfoDb::seqlengths(BSgenome.Hsapiens.UCSC.hg38::Hsapiens)[1:25]
321
-
322
-    ###########################################################################
323
-    ## The path where the Sample GDS files (one per sample)
324
-    ## will be created needs to be specified.
325
-    ###########################################################################
326
-    pathProfileGDS <- file.path(tempdir(), "exampleDNA", "out.tmp")
327
-
328
-    ###########################################################################
329
-    ## The path where the result files will be created needs to 
330
-    ## be specified
331
-    ###########################################################################
332
-    pathOut <- file.path(tempdir(), "exampleDNA", "res.out")
333
-
334
-    ## Example can only be run if the current directory is in writing mode
335
-    if (!dir.exists(file.path(tempdir(), "exampleDNA"))) {
336
-
337
-        dir.create(file.path(tempdir(), "exampleDNA"))
338
-        dir.create(pathProfileGDS)
339
-        dir.create(pathOut)
340
-    
341
-        #########################################################################
342
-        ## The wrapper function generates the synthetic dataset and uses it 
343
-        ## to selected the optimal parameters before calling the genetic 
344
-        ## ancestry on the current profiles.
345
-        ## All important information, for each step, are saved in 
346
-        ## multiple output files.
347
-        ## The 'genoSource' parameter has 2 options depending on how the 
348
-        ##   SNP files have been generated: 
349
-        ##   SNP VCF files have been generated: 
350
-        ##  "VCF" or "generic" (other software)
351
-        ##
352
-        #########################################################################
353
-        runExomeAncestry(pedStudy=demoPedigreeEx1, studyDF=studyDF,
354
-                 pathProfileGDS=pathProfileGDS,
355
-                 pathGeno=pathGeno,
356
-                 pathOut=pathOut,
357
-                 fileReferenceGDS=fileGDS,
358
-                 fileReferenceAnnotGDS=fileAnnotGDS,
359
-                 chrInfo=chrInfo,
360
-                 syntheticRefDF=dataRef,
361
-                 genoSource="VCF")
362
-        list.files(pathOut)
363
-        list.files(file.path(pathOut, demoPedigreeEx1$Name.ID[1]))
364
-
365
-        #######################################################################
366
-        ## The file containing the ancestry inference (SuperPop column) and 
367
-        ## optimal number of PCA component (D column)
368
-        ## optimal number of neighbours (K column)
369
-        #######################################################################
370
-        resAncestry <- read.csv(file.path(pathOut, 
371
-                        paste0(demoPedigreeEx1$Name.ID[1], ".Ancestry.csv")))
372
-        print(resAncestry)
373
-
374
-        ## Remove temporary files created for this demo
375
-        unlink(pathProfileGDS, recursive=TRUE, force=TRUE)
376
-        unlink(pathOut, recursive=TRUE, force=TRUE)
377
-        unlink(file.path(tempdir(), "exampleDNA"), recursive=TRUE, force=TRUE)
378
-    }
379
-}
380
-  
292
+dataRef <- select1KGPopForSynthetic(fileReferenceGDS=refGenotype,
293
+                                        nbProfiles=2L)
381 294
 
382 295
 ```
383 296
 
384
-<br>
297
+The output object is going to be used later at the ancestry inference step.
298
+
385 299
 <br>
386 300
 
387
-The *runExomeAncestry()* function generates 3 types of files 
388
-in the *pathOut* directory.
301
+### 2.3 Perform the ancestry inference
389 302
 
390
-* The ancestry inference CSV file (**".Ancestry.csv"** file) 
391
-* The inference information RDS file (**".infoCall.rds"** file)
392
-* The parameter information RDS files from the synthetic inference 
393
-(__"KNN.synt.__*__.rds"__ files in a sub-directory)
303
+Within a single function call, data synthesis is performed, the synthetic 
304
+data are used to optimize the inference parameters and, with these, the 
305
+ancestry of the input profile donor is inferred.
394 306
 
395
-In addition, a sub-directory (named using the *profile ID*) is 
396
-also created. 
307
+According to the type of input data (RNA or DNA), a specific function 
308
+is available.
397 309
 
398
-The inferred ancestry is stored in the ancestry inference CSV 
399
-file (**".Ancestry.csv"** file) which also contains those columns:
310
+The *inferAncestry()* function is used for DNA profiles while 
311
+the *inferAncestryGeneAware()* function is RNA specific.
400 312
 
401
-* _sample.id_: The unique identifier of the sample 
402
-* _D_: The optimal PCA dimension value used to infer the ancestry
403
-* _k_: The optimal number of neighbors value used to infer the ancestry
404
-* _SuperPop_: The inferred ancestry
313
+In this example, the profile is from DNA source and requires the use of the 
314
+*inferAncestry()* function.
405 315
 
406
-<br>
407
-<br>
316
+```{r infere, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
408 317
 
318
+###########################################################################
319
+## GenomeInfoDb and BSgenome are required libraries to run this example
320
+###########################################################################
321
+if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
322
+      requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) {
409 323
 
324
+    #######################################################################
325
+    ## Chromosome length information is required
326
+    ## chr23 is chrX, chr24 is chrY and chrM is 25
327
+    #######################################################################
328
+    genome <- BSgenome.Hsapiens.UCSC.hg38::Hsapiens
329
+    chrInfo <- GenomeInfoDb::seqlengths(genome)[1:25]
330
+
331
+    #######################################################################
332
+    ## The SNP VCF file of the DNA profile donor
333
+    #######################################################################
334
+    fileDonorVCF <- file.path(dataDir, "example", "snpPileup", "ex1.vcf.gz")
335
+
336
+    #######################################################################
337
+    ## The ancestry inference call
338
+    #######################################################################
339
+    resOut <- inferAncestry(profileFile=fileDonorVCF, 
340
+        pathProfileGDS=pathProfileGDS,
341
+        fileReferenceGDS=refGenotype,
342
+        fileReferenceAnnotGDS=refAnnotation,
343
+        chrInfo=chrInfo,
344
+        syntheticRefDF=dataRef,
345
+        genoSource=c("VCF"))
346
+}
410 347
 
411
-### RNA data - Wrapper function to run ancestry inference on RNA data 
348
+```
412 349
 
413
-The process is the same as for the DNA but use the wrapper function 
414
-called _runRNAAncestry()_. Internally the data is process differently.
415
-It requires 4 files as input:
350
+A profile GDS file is created in the *pathProfileGDS* directory while all the 
351
+ancestry and optimal parameters information are integrated in the output 
352
+object.
416 353
 
417
-- The **population reference GDS file**
418
-- The **population reference SNV Annotation GDS file**
419
-- The **Profile SNP file** (one per sample present in the study)
420
-- The **Profile PED RDS file** (one file with information for all 
421
-profiles in the study)
354
+At last, all temporary files created in this example should be deleted.
422 355
 
423
-A *data.frame* containing the general information about the study is 
424
-also required. The *data.frame* must contain those 3 columns:
356
+```{r removeTmp, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
425 357
 
426
-- _study.id_: The study identifier (example: TCGA-BRCA).
427
-- _study.desc_: The description of the study.
428
-- _study.platform_: The type of sequencing (example: RNA-seq).
358
+#######################################################################
359
+## Remove temporary files created for this demo
360
+#######################################################################
361
+unlink(pathWorkingDirectory, recursive=TRUE, force=TRUE)
362
+      
363
+```
429 364
 
430 365
 <br>
431
-
432
-#### **Population reference files**
433
-
434
-For demonstration purpose, a small 
435
-**population reference GDS file** (called _ex1_good_small_1KG.gds_) and a small 
436
-**population reference SNV Annotation GDS file** (called 
437
-_ex1_good_small_1KG_Annot.gds_) are 
438
-included in this package. Beware that those two files should not be used to 
439
-run a real ancestry inference.The results obtained with those files won't be 
440
-reliable.
441
-
442
-The required **population reference GDS file** and 
443
-**population reference SNV Annotation GDS file** should be stored in the same
444
-directory. In the example below, this directory is referred to 
445
-as **pathReference**.
446
-
447 366
 <br>
448 367
 
449
-#### **Profile SNP file**
450
-
451
-The **Profile SNP file** can be either in a VCF format or in a generic format.
452
-
453
-The **Profile SNP VCF file**  follows the VCF standard with at least 
454
-those genotype fields: _GT_, _AD_ and _DP_. The identifier of the genotype 
455
-in the VCF file must correspond to the profile identifier _Name.ID_. 
456
-The SNVs  must be germline variants and should include the genotype of the 
457
-wild-type homozygous at the selected positions in the reference. One file per
458
-profile is need and the VCF file must be gzipped.
459
-
460
-Note that the name assigned to the **Profile SNP VCF file** has to 
461
-correspond to the profile identifier _Name.ID_ in the following analysis. 
462
-For example, a SNP file called "Sample.01.vcf.gz" would be 
463
-associated to the "Sample.01" profile.
464
-
465
-A generic SNP file can replace the VCF file. The **Profile SNP Generic file** 
466
-format is coma separated and the mandatory columns are:
467 368
 
468
-* _Chromosome_: The name of the chromosome
469
-* _Position_: The position on the chromosome
470
-* _Ref_: The reference nucleotide
471
-* _Alt_: The aternative nucleotide
472
-* _Count_: The total count
473
-* _File1R_: The count for the reference nucleotide
474
-* _File1A_: The count for the alternative nucleotide
369
+## Step 3. Examine the value of the inference call
475 370
 
476
-Beware that the starting position in the **population reference GDS File** is 
477
-zero (like BED files). The **Profile SNP Generic file** should also start 
478
-at position zero.
371
+The inferred ancestry and the optimal parameters are present in the *list* 
372
+object generated by the *inferAncestry()* and *inferAncestryGeneAware()* 
373
+functions.
479 374
 
480
-Note that the name assigned to the **Profile SNP Generic file** has to 
481
-correspond to the profile identifier _Name.ID_ in the following analysis. 
482
-For example, a SNP file called "Sample.01.generic.txt.gz" would be 
483
-associated to the "Sample.01" profile.
484 375
 
485
-<br>
486
-
487
-#### **Profile PED RDS file**
488
-
489
-The **Profile PED RDS file** must contain a *data.frame* describing all 
490
-the profiles to be analyzed. These 5 mandatory columns: 
376
+```{r printRes, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
491 377
 
492
-- _Name.ID_: The unique sample identifier. The associated **profile SNP file**  
493
-should be called "Name.ID.txt.gz". 
494
-- _Case.ID_: The patient identifier associated to the sample.
495
-- _Sample.Type_: The information about the profile tissue source 
496
-(primary tumor, metastatic tumor, normal, etc..). 
497
-- _Diagnosis_: The donor's diagnosis.
498
-- _Source_: The source of the profile sequence data (example: dbGAP_XYZ).
378
+###########################################################################
379
+## The output is a list object with multiple entries
380
+###########################################################################
381
+class(resOut)
382
+names(resOut)
499 383
 
500
-Important: The row names of the *data.frame* must be the profiles _Name.ID_.
384
+```
501 385
 
502
-This file is referred to as the **Profile PED RDS file** (PED for pedigree). 
503
-Alternatively, the PED information can be saved in another type of 
504
-file (CVS, etc..) as long as the *data.frame* information can be regenerated 
505
-in R (with _read.csv()_ or else). 
506 386
 
507 387
 <br>
508 388
 
509
-#### **Example**
510
-
511
-This example run an ancestry inference on an RNA sample. Both population 
512
-reference files are demonstration files and should not be 
513
-used for a real ancestry inference. Beware that running an ancestry inference 
514
-on real data will take longer to run.
389
+### 3.1 Inspect the inference and the optimal parameters
515 390
 
516
-```{r runRNAAncestry, echo=TRUE, eval=TRUE, collapse=FALSE, warning=FALSE, message=FALSE}
517
-#############################################################################
518
-## Load required packages
519
-#############################################################################
520
-library(RAIDS)    
521
-library(gdsfmt)
522 391
 
523
-## Path to the demo 1KG GDS file is located in this package
524
-dataDir <- system.file("extdata", package="RAIDS")
392
+For the global ancestry inference using PCA followed by nearest neighbor 
393
+classification these parameters are *D* (the number of the top principal 
394
+directions retained) and *k* (the number of nearest neighbors).  
525 395
 
526
-#############################################################################
527
-## Load the information about the profile
528
-#############################################################################
529
-data(demoPedigreeEx1)
530
-head(demoPedigreeEx1)
396
+The information is stored in the *Ancestry* entry as a *data.frame* object. 
397
+It is a contains those columns:
531 398
 
532
-#############################################################################
533
-## The population reference GDS file and SNV Annotation GDS file
534
-## need to be located in the same directory.
535
-## Note that the population reference GDS file used for this example is a
536
-## simplified version and CANNOT be used for any real analysis
537
-#############################################################################
538
-pathReference <- file.path(dataDir, "tests")
539
-
540
-fileGDS <- file.path(pathReference, "ex1_good_small_1KG.gds")
541
-fileAnnotGDS <- file.path(pathReference, "ex1_good_small_1KG_Annot.gds")
399
+* _sample.id_: The unique identifier of the sample 
400
+* _D_: The optimal PCA dimension value used to infer the ancestry
401
+* _k_: The optimal number of neighbors value used to infer the ancestry
402
+* _SuperPop_: The inferred ancestry
542 403
 
543
-#############################################################################
544
-## A data frame containing general information about the study
545
-## is also required. The data frame must have
546
-## those 3 columns: "study.id", "study.desc", "study.platform"
547
-#############################################################################
548
-studyDF <- data.frame(study.id="MYDATA",
549
-                   study.desc="Description",
550
-                   study.platform="PLATFORM",
551
-                   stringsAsFactors=FALSE)
552 404
 
553
-#############################################################################
554
-## The Sample SNP VCF files (one per sample) need
555
-## to be all located in the same directory.
556
-#############################################################################
557
-pathGeno <- file.path(dataDir, "example", "snpPileupRNA")
405
+```{r print, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
558 406
 
559
-#############################################################################
560
-## Fix RNG seed to ensure reproducible results
561
-#############################################################################
562
-set.seed(3043)
407
+###########################################################################
408
+## The ancestry information is stored in the 'Ancestry' entry 
409
+###########################################################################
410
+print(resOut$Ancestry)
563 411
 
564
-#############################################################################
565
-## Select the profiles from the population reference GDS file for 
566
-## the synthetic data.
567
-## Here we select 2 profiles from the simplified 1KG GDS for each 
568
-## subcontinental-level.
569
-## Normally, we use 30 profile for each 
570
-## subcontinental-level but it is too big for the example.
571
-## The 1KG files in this example only have 6 profiles for each 
572
-## subcontinental-level (for demo purpose only).
573
-#############################################################################
574
-gds1KG <- snpgdsOpen(fileGDS)
575
-dataRef <- select1KGPop(gds1KG, nbProfiles=2L)
576
-closefn.gds(gds1KG)
577
-
578
-## GenomeInfoDb and BSgenome are required libraries to run this example
579
-if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
580
-      requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) {
581
-
582
-  ## Chromosome length information
583
-  ## chr23 is chrX, chr24 is chrY and chrM is 25
584
-  chrInfo <- GenomeInfoDb::seqlengths(BSgenome.Hsapiens.UCSC.hg38::Hsapiens)[1:25]
585
-
586
-  #############################################################################
587
-  ## The path where the Sample GDS files (one per sample)
588
-  ## will be created needs to be specified.
589
-  #############################################################################
590
-  pathProfileGDS <- file.path(tempdir(), "exampleRNA", "outRNA.tmp")
591
-
592
-  #############################################################################
593
-  ## The path where the result files will be created needs to 
594
-  ## be specified
595
-  #############################################################################
596
-  pathOut <- file.path(tempdir(), "exampleRNA", "resRNA.out")
597
-
598
-  ## Example can only be run if the current directory is in writing mode
599
-  if (!dir.exists(file.path(tempdir(), "exampleRNA"))) {
600
-
601
-      dir.create(file.path(tempdir(), "exampleRNA"))
602
-      dir.create(pathProfileGDS)
603
-      dir.create(pathOut)
604
-    
605
-      #########################################################################
606
-      ## The wrapper function generates the synthetic dataset and uses it 
607
-      ## to selected the optimal parameters before calling the genetic 
608
-      ## ancestry on the current profiles.
609
-      ## All important information, for each step, are saved in 
610
-      ## multiple output files.
611
-      ## The 'genoSource' parameter has 2 options depending on how the 
612
-      ##   SNP files have been generated: 
613
-      ##   SNP VCF files have been generated: 
614
-      ##  "VCF" or "generic" (other software)
615
-      #########################################################################
616
-      runRNAAncestry(pedStudy=demoPedigreeEx1, studyDF=studyDF,
617
-                    pathProfileGDS=pathProfileGDS,
618
-                    pathGeno=pathGeno,
619
-                    pathOut=pathOut,
620
-                    fileReferenceGDS=fileGDS,
621
-                    fileReferenceAnnotGDS=fileAnnotGDS,
622
-                    chrInfo=chrInfo,
623
-                    syntheticRefDF=dataRef,
624
-                    blockTypeID="GeneS.Ensembl.Hsapiens.v86",
625
-                    genoSource="VCF")
626
-    
627
-      list.files(pathOut)
628
-      list.files(file.path(pathOut, demoPedigreeEx1$Name.ID[1]))
629
-
630
-      #########################################################################
631
-      ## The file containing the ancestry inference (SuperPop column) and 
632
-      ## optimal number of PCA component (D column)
633
-      ## optimal number of neighbours (K column)
634
-      #########################################################################
635
-      resAncestry <- read.csv(file.path(pathOut, 
636
-                        paste0(demoPedigreeEx1$Name.ID[1], ".Ancestry.csv")))
637
-      print(resAncestry)
638
-
639
-      ## Remove temporary files created for this demo
640
-      unlink(pathProfileGDS, recursive=TRUE, force=TRUE)
641
-      unlink(pathOut, recursive=TRUE, force=TRUE)
642
-      unlink(file.path(tempdir(), "example"), recursive=TRUE, force=TRUE)
643
-  }
644
-}
645
- 
646 412
 ```
647 413
 
648
-<br>
649 414
 <br>
650 415
 
651
-The *runRNAAncestry()* function generates 3 types of files 
652
-in the *pathOut* directory.
416
+### 3.2 Visualize the RAIDS performance for the synthetic data
653 417
 
654
-* The ancestry inference CSV file (**".Ancestry.csv"** file) 
655
-* The inference information RDS file (**".infoCall.rds"** file)
656
-* The parameter information RDS files from the synthetic inference 
657
-(__"KNN.synt.__*__.rds"__ files in a sub-directory)
658 418
 
659
-In addition, a sub-directory (named using the *profile ID*) is 
660
-also created. 
419
+The *createAUROCGraph()* function enable the visualization of RAIDS 
420
+performance for the synthetic data, as a function of *D* and *k*.
661 421
 
662
-The inferred ancestry is stored in the ancestry inference CSV 
663
-file (**".Ancestry.csv"** file) which also contains those columns:
422
+```{r visualize, echo=TRUE, eval=TRUE, fig.align="center", fig.cap="RAIDS performance for the synthtic data.", results='asis', collapse=FALSE, warning=FALSE, message=FALSE}
664 423
 
665
-* _sample.id_: The unique identifier of the sample 
666
-* _D_: The optimal PCA dimension value used to infer the ancestry
667
-* _k_: The optimal number of neighbors value used to infer the ancestry
668
-* _SuperPop_: The inferred ancestry
424
+###########################################################################
425
+## Create a graph showing the perfomance for the synthetic data
426
+## The output is a ggplot object
427
+###########################################################################
428
+createAUROCGraph(dfAUROC=resOut$paraSample$dfAUROC, title="Example ex1")
669 429
 
430
+```
431
+
432
+In this specific demonstration, the performances are lower than expected 
433
+with a real profile and a complete reference population file.
670 434
 
671 435
 <br>
672 436
 <br>
673 437
 
438
+# Format population reference dataset (optional)
674 439
 
675
-## Format population reference dataset (optional)
676 440
 
677
-
678
-```{r graphStep1, echo=FALSE, fig.align="center", fig.cap="Step 1 - Formatting the information from the population reference dataset (optional)", out.width='120%', results='asis', warning=FALSE, message=FALSE}
679
-knitr::include_graphics("MainSteps_Step1_v04.png")
441
+```{r graphStep1, echo=FALSE, fig.align="center", fig.cap="Step 1 - Provide population reference data", out.width='120%', results='asis', warning=FALSE, message=FALSE}
442
+knitr::include_graphics("Step1_population_file_v01.png")
680 443
 ```
681 444
 
682 445
 
683 446
 A population reference dataset with known ancestry is required to infer 
684
-ancestry. The population must be large enough to ensure ???
447
+ancestry. 
685 448
 
686 449
 Three important reference files, containing formatted information about 
687 450
 the reference dataset, are required:
688 451
 
689 452
 - The population reference GDS File
690 453
 - The population reference SNV Annotation GDS file
691
-- The population reference SNV Retained VCF file
454
+- The population reference SNV Retained VCF file (optional)
692 455
 
693 456
 
694 457
 The format of those files are described 
... ...
@@ -717,7 +480,6 @@ you choose to use the pre-processed files.</span>
717 480
 <br>
718 481
 
719 482
 
720
-
721 483
 # Session info
722 484
 
723 485
 Here is the output of `sessionInfo()` in the environment in which this 
724 486
new file mode 100644
725 487
Binary files /dev/null and b/vignettes/Step1_population_file_v01.png differ