@@ -66,10 +66,13 @@ sequences of:
 * targeted gene panels
 * RNA,
 
-including those from cancer-derived nucleic acids. The **RAIDS** package implements a data synthesis method that, for any given
-molecular profile, enables, on the one hand, profile-specific inference
+including those from cancer-derived nucleic acids. The **RAIDS** package implements a
+data synthesis method that, for any given
+molecular profile of an individual, enables, on the one hand, profile-specific inference
 parameter optimization and, on the other hand, a profile-specific inference
-accuracy estimate.
+accuracy estimate. By the molecular profile we mean a table of the individual's
+germline genotypes at genome positions with sufficient read coverage in the
+individual's input data, where sequence variants are frequent in the population reference data.
 
 <br>
 <br>
@@ -298,33 +301,30 @@ data are used to optimize the inference parameters and, with these, the
 ancestry is inferred from the input sequence profile.
 
 According to the type of input data (RNA or DNA sequence), a specific function
-should be called. The *inferAncestry()* function is used for DNA profiles while
+should be called. The *inferAncestry()* function (*inferAncestryDNA()* is
+the same as *inferAncestry()*) is used for DNA profiles while
 the *inferAncestryGeneAware()* function is RNA specific.
 
-The *inferAncestry()* function requires a specific profile input format. The
-format is set by the *genoSource* parameter.
+The *inferAncestry()* function requires a specific input format for the individual's
+genotyping profile as explained in the Introduction. The format is set by
+the *genoSource* parameter.
 
-One of those formats is in a VCF format (*genoSource=c("VCF")*).
-This format follows the VCF standard with at least those genotype
-fields: _GT_, _AD_ and _DP_.
-The SNVs must be germline variants and should include the genotype of the
-wild-type homozygous at the selected positions in the reference. The VCF file
-must be gzipped.
+One of the allowed formats is VCF (*genoSource=c("VCF")*), with the following
+mandatory fields: _GT_, _AD_ and _DP_.
+The VCF file must be gzipped.
 
-A generic SNP file can replace the VCF file (*genoSource=c("generic")*).
-The format is comma separated and the mandatory columns are:
+Also allowed is a "generic" file format (*genoSource=c("generic")*), specified as
+a comma-separated table. The following columns are mandatory:
 
-* _Chromosome_: The name of the chromosome
+* _Chromosome_: The name of the chromosome; can be formatted as chr1 or 1
 * _Position_: The position on the chromosome
 * _Ref_: The reference nucleotide
-* _Alt_: The aternative nucleotide
-* _Count_: The total count
-* _File1R_: The count for the reference nucleotide
-* _File1A_: The count for the alternative nucleotide
-
-Beware that the starting position in the **population reference GDS file** is
-zero (like BED files). The generic SNP file should also start
-at position zero.
+* _Alt_: The alternative nucleotide
+* _Count_: The total read count
+* _File1R_: Read count for the reference nucleotide
+* _File1A_: Read count for the alternative nucleotide
+
+Note: a header with identical column names is required.
 
 In this example, the profile is from DNA source and requires the use of the
 *inferAncestry()* function.
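As an illustration of the "generic" input format described in the hunk above, the following R sketch writes a small comma-separated table with the seven mandatory columns and a header row. The variant rows, the file name `ex1.generic.csv`, and the check that *Count* is at least the sum of the two allele counts are assumptions made for illustration; they are not taken from the RAIDS documentation. Whether the package expects a particular file extension or compression is not stated in this diff.

```r
# Hypothetical example data: three SNV positions with read counts.
# All values are invented for illustration only.
genericProfile <- data.frame(
    Chromosome = c("chr1", "chr1", "chr2"),
    Position   = c(1042457L, 1163041L, 233760L),
    Ref        = c("A", "C", "G"),
    Alt        = c("G", "T", "A"),
    Count      = c(30L, 25L, 41L),   # total read count at the position
    File1R     = c(16L, 25L, 0L),    # reads supporting the reference allele
    File1A     = c(14L, 0L, 41L),    # reads supporting the alternative allele
    stringsAsFactors = FALSE
)

# Column names listed as mandatory in the vignette text.
requiredCols <- c("Chromosome", "Position", "Ref", "Alt",
                  "Count", "File1R", "File1A")
stopifnot(identical(colnames(genericProfile), requiredCols))

# Assumed sanity check: allele counts cannot exceed the total count.
stopifnot(all(genericProfile$File1R + genericProfile$File1A <=
              genericProfile$Count))

# Write a comma-separated file with a header row and no row names.
genericFile <- file.path(tempdir(), "ex1.generic.csv")
write.csv(genericProfile, file = genericFile,
          row.names = FALSE, quote = FALSE)

# Read the file back to confirm that the header survived the round trip.
readBack <- read.csv(genericFile, stringsAsFactors = FALSE)
print(colnames(readBack))

# Remove the temporary file.
unlink(genericFile)
```

A file produced this way would then be the profile passed to *inferAncestry()* with *genoSource=c("generic")*; the remaining arguments of that call are not shown in this diff and are left out here.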
@@ -364,11 +364,8 @@ if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
 
 ```
 
-A profile GDS file is created in the *pathProfileGDS* directory while all the
-ancestry and optimal parameters information are integrated in the output
-object.
 
-At last, all temporary files created in this example should be deleted.
+The temporary files created in this example are deleted as follows.
 
 ```{r removeTmp, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
 
@@ -406,16 +403,17 @@ names(resOut)
 ### 3.1 Inspect the inference and the optimal parameters
 
 
-For the global ancestry inference using PCA followed by nearest neighbor
-classification these parameters are *D* (the number of the top principal
-directions retained) and *k* (the number of nearest neighbors).
+Global ancestry is inferred using principal-component decomposition
+followed by nearest neighbor classification. Two parameters are defined and optimized:
+*D*, the number of the top principal directions retained and *k*, the number of nearest
+neighbors.
 
-The information is stored in the *Ancestry* entry as a *data.frame* object.
-It is a contains those columns:
+The results of the inference are provided as the *Ancestry* item in the *resOut* list.
+It is a *data.frame* with the following columns:
 
 * _sample.id_: The unique identifier of the sample
-* _D_: The optimal PCA dimension value used to infer the ancestry
-* _k_: The optimal number of neighbors value used to infer the ancestry
+* _D_: The optimal *D* inference parameter
+* _k_: The optimal *k* inference parameter
 * _SuperPop_: The inferred ancestry
 
 
@@ -446,7 +444,7 @@ createAUROCGraph(dfAUROC=resOut$paraSample$dfAUROC, title="Example ex1")
 
 ```
 
-In this specific example, the performances are lower than expected
+In this illustrative example, the performance estimates are lower than expected
 with a realistic sequence profile and a complete reference population file.
 
 <br>
@@ -471,7 +469,7 @@ the reference dataset, are required:
 - The population reference SNV Retained VCF file (optional)
 
 
-The format of those files are described
+The formats of those files are described in
 the [Population reference dataset GDS files](Create_Reference_GDS_File.html)
 vignette.
 