
Merge branch 'main' of https://github.com/belleau/RAIDS

belleau authored on 16/04/2025 01:55:53
Showing 1 changed file

... ...
@@ -66,10 +66,13 @@ sequences of:
66 66
 * targeted gene panels 
67 67
 * RNA,
68 68
 
69
-including those from cancer-derived nucleic acids. The **RAIDS** package implements a data synthesis method that, for any given
70
-molecular profile, enables, on the one hand, profile-specific inference
69
+including those from cancer-derived nucleic acids. The **RAIDS** package implements a 
70
+data synthesis method that, for any given
71
+molecular profile of an individual, enables, on the one hand, profile-specific inference
71 72
 parameter optimization and, on the other hand, a profile-specific inference
72
-accuracy estimate.
73
+accuracy estimate. By molecular profile, we mean a table of the individual's 
74
+germline genotypes at genome positions with sufficient read coverage in the 
75
+individual's input data, where sequence variants are frequent in the population reference data. 
73 76
 
74 77
 <br>
75 78
 <br>
... ...
@@ -298,33 +301,30 @@ data are used to optimize the inference parameters and, with these, the
298 301
 ancestry is inferred from the input sequence profile.
299 302
 
300 303
 According to the type of input data (RNA or DNA sequence), a specific function 
301
-should be called. The *inferAncestry()* function is used for DNA profiles while 
304
+should be called. The *inferAncestry()* function (equivalently, 
305
+*inferAncestryDNA()*) is used for DNA profiles while 
302 306
 the *inferAncestryGeneAware()* function is RNA specific.
303 307
 
304
-The *inferAncestry()* function requires a specific profile input format. The 
305
-format is set by the *genoSource* parameter. 
308
+The *inferAncestry()* function requires a specific input format for the individual's 
309
+genotyping profile as explained in the Introduction. The format is set by 
310
+the *genoSource* parameter. 
306 311
 
307
-One of those formats is in a VCF format (*genoSource=c("VCF")*). 
308
-This format follows the VCF standard with at least those genotype 
309
-fields: _GT_, _AD_ and _DP_. 
310
-The SNVs  must be germline variants and should include the genotype of the 
311
-wild-type homozygous at the selected positions in the reference. The VCF file 
312
-must be gzipped.
312
+One of the allowed formats is VCF (*genoSource=c("VCF")*), with the following 
313
+mandatory fields: _GT_, _AD_ and _DP_. 
314
+The VCF file must be gzipped.
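
The VCF requirements above can be checked before running the inference. The following is a minimal sketch in base R, not part of **RAIDS**: it writes a tiny gzipped VCF and confirms that every record declares the _GT_, _AD_ and _DP_ fields. The file name, sample name, positions and read counts are all hypothetical and serve only as an illustration.

```r
# Hypothetical VCF content: one sample ("ex1") and two SNV records.
# All values are invented for illustration.
vcfLines <- c(
    "##fileformat=VCFv4.2",
    "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
    "##FORMAT=<ID=AD,Number=R,Type=Integer,Description=\"Allelic depths\">",
    "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read depth\">",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tex1",
    "chr1\t10583\t.\tG\tA\t.\tPASS\t.\tGT:AD:DP\t0/0:24,0:24",
    "chr1\t13116\t.\tT\tG\t.\tPASS\t.\tGT:AD:DP\t0/1:15,16:31"
)

# Write the file gzipped, as required for genoSource=c("VCF")
vcfFile <- file.path(tempdir(), "ex1.vcf.gz")
con <- gzfile(vcfFile, open = "wt")
writeLines(vcfLines, con)
close(con)

# Read the file back and keep only the variant records (no header lines)
con <- gzfile(vcfFile, open = "rt")
records <- readLines(con)
close(con)
records <- records[!startsWith(records, "#")]

# The FORMAT column is the 9th tab-separated field; split it on ":"
formatKeys <- strsplit(vapply(strsplit(records, "\t"), `[`, character(1), 9), ":")

# TRUE for each record that declares all three mandatory fields
hasRequired <- vapply(formatKeys,
    function(keys) all(c("GT", "AD", "DP") %in% keys), logical(1))
stopifnot(all(hasRequired))
```

A check of this kind only inspects the FORMAT column; it does not validate the genotype values themselves.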
313 315
 
314
-A generic SNP file can replace the VCF file (*genoSource=c("generic")*). 
315
-The format is comma separated and the mandatory columns are:
316
+Also allowed is a "generic" file format (*genoSource=c("generic")*), specified as 
317
+a comma-separated table. The following columns are mandatory:
316 318
 
317
-* _Chromosome_: The name of the chromosome
319
+* _Chromosome_: The name of the chromosome, formatted as chr1 or 1
318 320
 * _Position_: The position on the chromosome
319 321
 * _Ref_: The reference nucleotide
320
-* _Alt_: The aternative nucleotide
321
-* _Count_: The total count
322
-* _File1R_: The count for the reference nucleotide
323
-* _File1A_: The count for the alternative nucleotide
324
-
325
-Beware that the starting position in the **population reference GDS file** is 
326
-zero (like BED files). The generic SNP file should also start 
327
-at position zero.
322
+* _Alt_: The alternative nucleotide
323
+* _Count_: The total read count
324
+* _File1R_: Read count for the reference nucleotide
325
+* _File1A_: Read count for the alternative nucleotide
326
+ 
327
+Note: a header with identical column names is required.
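
The generic format described above can be sketched in base R as follows. This is an illustration only: the file name, positions and read counts are hypothetical, and the check at the end is not a **RAIDS** function, just a quick way to confirm that the header carries the mandatory column names.

```r
# Hypothetical read counts at three SNV positions (invented values)
snpCounts <- data.frame(
    Chromosome = c("chr1", "chr1", "chr2"),
    Position = c(10583L, 13116L, 45895L),
    Ref = c("G", "T", "A"),
    Alt = c("A", "G", "G"),
    Count = c(24L, 31L, 18L),   # total read count
    File1R = c(24L, 15L, 0L),   # reads supporting the reference nucleotide
    File1A = c(0L, 16L, 18L),   # reads supporting the alternative nucleotide
    stringsAsFactors = FALSE
)

# Write a comma-separated table with a header row
genericFile <- file.path(tempdir(), "ex1_generic.csv")
write.csv(snpCounts, file = genericFile, row.names = FALSE, quote = FALSE)

# Read the file back and confirm that all mandatory columns are present
mandatoryCols <- c("Chromosome", "Position", "Ref", "Alt",
                   "Count", "File1R", "File1A")
readBack <- read.csv(genericFile, stringsAsFactors = FALSE)
stopifnot(all(mandatoryCols %in% colnames(readBack)))
```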
328 328
 
329 329
 In this example, the profile is from DNA source and requires the use of the 
330 330
 *inferAncestry()* function.
... ...
@@ -364,11 +364,8 @@ if (requireNamespace("GenomeInfoDb", quietly=TRUE) &&
364 364
 
365 365
 ```
366 366
 
367
-A profile GDS file is created in the *pathProfileGDS* directory while all the 
368
-ancestry and optimal parameters information are integrated in the output 
369
-object.
370 367
 
371
-At last, all temporary files created in this example should be deleted.
368
+The temporary files created in this example are deleted as follows.
372 369
 
373 370
 ```{r removeTmp, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
374 371
 
... ...
@@ -406,16 +403,17 @@ names(resOut)
406 403
 ### 3.1 Inspect the inference and the optimal parameters
407 404
 
408 405
 
409
-For the global ancestry inference using PCA followed by nearest neighbor 
410
-classification these parameters are *D* (the number of the top principal 
411
-directions retained) and *k* (the number of nearest neighbors).  
406
+Global ancestry is inferred using principal-component decomposition
407
+followed by nearest neighbor classification. Two parameters are defined and optimized: 
408
+*D*, the number of the top principal directions retained, and *k*, the number of nearest 
409
+neighbors.  
412 410
 
413
-The information is stored in the *Ancestry* entry as a *data.frame* object. 
414
-It is a contains those columns:
411
+The results of the inference are provided as the *Ancestry* item in the *resOut* list. 
412
+It is a *data.frame* with the following columns:
415 413
 
416 414
 * _sample.id_: The unique identifier of the sample 
417
-* _D_: The optimal PCA dimension value used to infer the ancestry
418
-* _k_: The optimal number of neighbors value used to infer the ancestry
415
+* _D_: The optimal *D* inference parameter
416
+* _k_: The optimal *k* inference parameter
419 417
 * _SuperPop_: The inferred ancestry
420 418
 
421 419
 
... ...
@@ -446,7 +444,7 @@ createAUROCGraph(dfAUROC=resOut$paraSample$dfAUROC, title="Example ex1")
446 444
 
447 445
 ```
448 446
 
449
-In this specific example, the performances are lower than expected 
447
+In this illustrative example, the performance estimates are lower than expected 
450 448
 with a realistic sequence profile and a complete reference population file.
451 449
 
452 450
 <br>
... ...
@@ -471,7 +469,7 @@ the reference dataset, are required:
471 469
 - The population reference SNV Retained VCF file (optional)
472 470
 
473 471
 
474
-The format of those files are described 
472
+The formats of those files are described in 
475 473
 the [Population reference dataset GDS files](Create_Reference_GDS_File.html) 
476 474
 vignette.
477 475