Update main vignette
... | ... |
@@ -24,7 +24,6 @@ BiocStyle::markdown() |
24 | 24 |
suppressPackageStartupMessages({ |
25 | 25 |
library(knitr) |
26 | 26 |
library(RAIDS) |
27 |
- library(gdsfmt) |
|
28 | 27 |
}) |
29 | 28 |
|
30 | 29 |
set.seed(121444) |
... | ... |
@@ -106,194 +105,178 @@ BiocManager::install("RAIDS") |
106 | 105 |
|
107 | 106 |
# Main Steps |
108 | 107 |
|
109 |
- |
|
110 |
-This is an overview of genetic ancestry inference from cancer-derived |
|
108 |
+This is an overview of the genetic ancestry inference from cancer-derived |
|
111 | 109 |
molecular data: |
112 | 110 |
|
113 | 111 |
```{r graphMainSteps, echo=FALSE, fig.align="center", fig.cap="An overview of the genetic ancestry inference process.", out.width='130%', results='asis', warning=FALSE, message=FALSE} |
114 |
-knitr::include_graphics("MainSteps_v04.png") |
|
112 |
+knitr::include_graphics("MainSteps_v05.png") |
|
115 | 113 |
``` |
116 | 114 |
|
117 | 115 |
The main steps are: |
118 | 116 |
|
119 |
-**Step 1.** Format reference data from the population reference dataset (optional) |
|
120 |
- |
|
121 |
-**Step 2.1** Optimize ancestry inference parameters |
|
122 |
- |
|
123 |
-**Step 2.2** Infer ancestry for the subjects of the external study |
|
124 |
- |
|
125 |
-These steps are described in detail in the following. Steps 2.1 and 2.2 can be |
|
126 |
-run together using one wrapper function. |
|
127 |
- |
|
128 |
-<br> |
|
129 |
-<br> |
|
130 |
- |
|
117 |
+**Step 1.** Set-up and provide population reference files |
|
131 | 118 |
|
132 |
-## Main Step - Ancestry Inference |
|
119 |
+**Step 2** Sample the reference data for donor genotypes, to be used for synthesis and optimize ancestry inference parameters |
|
133 | 120 |
|
134 |
-A wrapper function encapsulates multiple steps of the workflow. |
|
135 |
- |
|
136 |
-```{r graphWrapper, echo=FALSE, fig.align="center", fig.cap="Final step - The wrapper function encapsulates multiple steps of the workflow.", out.width='120%', results='asis', warning=FALSE, message=FALSE} |
|
137 |
-knitr::include_graphics("MainSteps_Wrapper_v04.png") |
|
138 |
-``` |
|
121 |
+**Step 3** Infer ancestry for the subjects of the external study |
|
139 | 122 |
|
140 |
-In summary, the wrapper function generates the synthetic dataset and uses |
|
141 |
-it to selected the optimal parameters before calling the genetic ancestry |
|
142 |
-on the current profiles. |
|
123 |
+**Step 4** Present and interpret the results |
|
143 | 124 |
|
144 |
-According to the type of input data (RNA or DNA), a specific wrapper function |
|
145 |
-is available. |
|
125 |
+These steps are described in detail in the following. |
|
146 | 126 |
|
147 | 127 |
<br> |
128 |
+<br> |
|
148 | 129 |
|
149 |
-### DNA Data - Wrapper function to run ancestry inference on DNA data |
|
150 | 130 |
|
151 |
-The wrapper function, called _runExomeAncestry()_, requires 4 files as input: |
|
131 |
+## Step 1. Set-up and provide population reference files |
|
152 | 132 |
|
153 |
-- The **population reference GDS file** |
|
154 |
-- The **population reference SNV Annotation GDS file** |
|
155 |
-- The **Profile SNP file** (one per sample present in the study) |
|
156 |
-- The **Profile PED RDS file** (one file with information for all |
|
157 |
-profiles in the study) |
|
158 | 133 |
|
159 |
-In addition, a *data.frame* containing the general information about the |
|
160 |
-study is also required. The *data.frame* must contain those 3 columns: |
|
134 |
+### 1.1 Create a directory structure |
|
161 | 135 |
|
162 |
-- _study.id_: The study identifier (example: TCGA-BRCA). |
|
163 |
-- _study.desc_: The description of the study. |
|
164 |
-- _study.platform_: The type of sequencing (example: RNA-seq). |
|
136 |
+First, a specific directory structure must be created. The structure must |
|
137 |
+correspond to this: |
|
165 | 138 |
|
166 |
-<br> |
|
167 |
- |
|
168 |
-#### **Population reference files** |
|
139 |
+``` |
|
169 | 140 |
|
170 |
-For demonstration purpose, a small |
|
171 |
-**population reference GDS file** (called _ex1_good_small_1KG.gds_) and a small |
|
172 |
-**population reference SNV Annotation GDS file** (called |
|
173 |
-_ex1_good_small_1KG_Annot.gds_) are |
|
174 |
-included in this package. Beware that those two files should not be used to |
|
175 |
-run a real ancestry inference.The results obtained with those files won't be |
|
176 |
-reliable. |
|
141 |
+############################################################################# |
|
142 |
+## Working directory structure |
|
143 |
+############################################################################# |
|
144 |
+workingDirectory/ |
|
145 |
+ data/ |
|
146 |
+ refGDS |
|
147 |
+ profileGDS |
|
177 | 148 |
|
178 |
-The required **population reference GDS file** and |
|
179 |
-**population reference SNV Annotation GDS file** should be stored in the same |
|
180 |
-directory. In the example below, this directory is referred to |
|
181 |
-as **pathReference**. |
|
149 |
+``` |
|
182 | 150 |
|
183 | 151 |
<br> |
184 | 152 |
|
185 |
-#### **Profile SNP file** |
|
153 |
+This following running example creates a temporary working directory structure |
|
154 |
+when the demo samples will be run. Some sub-directories in |
|
155 |
+*workingDirectory/data* will be created in subsequent steps. |
|
186 | 156 |
|
187 |
-The **Profile SNP file** can be either in a VCF format or in a generic format. |
|
188 | 157 |
|
189 |
-The **Profile SNP VCF file** follows the VCF standard with at least |
|
190 |
-those genotype fields: _GT_, _AD_ and _DP_. The identifier of the genotype |
|
191 |
-in the VCF file must correspond to the profile identifier _Name.ID_. |
|
192 |
-The SNVs must be germline variants and should include the genotype of the |
|
193 |
-wild-type homozygous at the selected positions in the reference. One file per |
|
194 |
-profile is need and the VCF file must be gzipped. |
|
158 |
+```{r createDir, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE} |
|
195 | 159 |
|
196 |
-Note that the name assigned to the **Profile SNP VCF file** has to |
|
197 |
-correspond to the profile identifier _Name.ID_ in the following analysis. |
|
198 |
-For example, a SNP file called "Sample.01.vcf.gz" would be |
|
199 |
-associated to the "Sample.01" profile. |
|
160 |
+############################################################################# |
|
161 |
+## Create a temporary working directory structure |
|
162 |
+############################################################################# |
|
163 |
+pathWorkingDirectory <- file.path(tempdir(), "workingDirectory") |
|
164 |
+pathWorkingDirectoryData <- file.path(pathWorkingDirectory, "data") |
|
200 | 165 |
|
201 |
-A generic SNP file can replace the VCF file. The **Profile SNP Generic file** |
|
202 |
-format is coma separated and the mandatory columns are: |
|
166 |
+if (!dir.exists(pathWorkingDirectory)) { |
|
167 |
+ dir.create(pathWorkingDirectory) |
|
168 |
+ dir.create(pathWorkingDirectoryData) |
|
169 |
+ dir.create(file.path(pathWorkingDirectoryData, "refGDS")) |
|
170 |
+} |
|
203 | 171 |
|
204 |
-* _Chromosome_: The name of the chromosome |
|
205 |
-* _Position_: The position on the chromosome |
|
206 |
-* _Ref_: The reference nucleotide |
|
207 |
-* _Alt_: The aternative nucleotide |
|
208 |
-* _Count_: The total count |
|
209 |
-* _File1R_: The count for the reference nucleotide |
|
210 |
-* _File1A_: The count for the alternative nucleotide |
|
172 |
+``` |
|
211 | 173 |
|
212 |
-Beware that the starting position in the **population reference GDS File** is |
|
213 |
-zero (like BED files). The **Profile SNP Generic file** should also start |
|
214 |
-at position zero. |
|
174 |
+<br> |
|
215 | 175 |
|
216 |
-Note that the name assigned to the **Profile SNP Generic file** has to |
|
217 |
-correspond to the profile identifier _Name.ID_ in the following analysis. |
|
218 |
-For example, a SNP file called "Sample.01.generic.txt.gz" would be |
|
219 |
-associated to the "Sample.01" profile. |
|
176 |
+### 1.2 Download the population reference files |
|
220 | 177 |
|
221 |
-<br> |
|
222 | 178 |
|
223 |
-#### **Profile PED RDS file** |
|
179 |
+The population reference files should be downloaded in the *data/refGDS* |
|
180 |
+sub-directory. This following code downloads the complete pre-processed files |
|
181 |
+for 1000 Genomes (1KG), in hg38. The size of the 1KG GDS file is 15GB. |
|
224 | 182 |
|
225 |
-The **Profile PED RDS file** must contain a *data.frame* describing all |
|
226 |
-the profiles to be analyzed. These 5 mandatory columns: |
|
183 |
+``` |
|
227 | 184 |
|
228 |
-- _Name.ID_: The unique sample identifier. The associated **profile SNP file** |
|
229 |
-should be called "Name.ID.txt.gz". |
|
230 |
-- _Case.ID_: The patient identifier associated to the sample. |
|
231 |
-- _Sample.Type_: The information about the profile tissue source |
|
232 |
-(primary tumor, metastatic tumor, normal, etc..). |
|
233 |
-- _Diagnosis_: The donor's diagnosis. |
|
234 |
-- _Source_: The source of the profile sequence data (example: dbGAP_XYZ). |
|
185 |
+############################################################################# |
|
186 |
+## How to download the pre-processed files for 1000 Genomes (1KG) (15 GB) |
|
187 |
+############################################################################# |
|
188 |
+cd workingDirectory |
|
189 |
+cd data/refGDS |
|
235 | 190 |
|
236 |
-Important: The row names of the *data.frame* must be the profiles *Name.ID*. |
|
191 |
+wget https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper/matGeno1000g.gds |
|
192 |
+wget https://labshare.cshl.edu/shares/krasnitzlab/aicsPaper/matAnnot1000g.gds |
|
193 |
+cd - |
|
237 | 194 |
|
238 |
-This file is referred to as the **Profile PED RDS file** (PED for pedigree). |
|
239 |
-Alternatively, the PED information can be saved in another type of |
|
240 |
-file (CVS, etc..) as long as the *data.frame* information can be regenerated |
|
241 |
-in R (with _read.csv()_ or else). |
|
195 |
+``` |
|
242 | 196 |
|
243 | 197 |
<br> |
244 | 198 |
|
245 |
-#### **Example** |
|
246 |
- |
|
247 |
-This example run an ancestry inference on an exome sample. Both population |
|
248 |
-reference files are demonstration files and should not be |
|
249 |
-used for a real ancestry inference. Beware that running an ancestry inference |
|
250 |
-on real data will take longer to run. |
|
199 |
+For demonstration purpose, a small |
|
200 |
+**population reference GDS file** (called _ex1_good_small_1KG.gds_) and a small |
|
201 |
+**population reference SNV Annotation GDS file** (called |
|
202 |
+_ex1_good_small_1KG_Annot.gds_) are |
|
203 |
+included in this package. Beware that those two files should not be used to |
|
204 |
+run a real ancestry inference. The results obtained with those files won't be |
|
205 |
+reliable. |
|
251 | 206 |
|
252 |
-```{r runExomeAncestry, echo=TRUE, eval=TRUE, collapse=FALSE, warning=FALSE, message=FALSE} |
|
253 |
-############################################################################# |
|
254 |
-## Load required packages |
|
255 |
-############################################################################# |
|
256 |
-library(RAIDS) |
|
257 |
-library(gdsfmt) |
|
207 |
+In this running example, the demonstration files are copied in the required |
|
208 |
+*data/refGDS* directory. |
|
258 | 209 |
|
259 |
-## Path to the demo 1KG GDS file is located in this package |
|
260 |
-dataDir <- system.file("extdata", package="RAIDS") |
|
210 |
+```{r copyRefFile, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE} |
|
261 | 211 |
|
262 | 212 |
############################################################################# |
263 |
-## Load the information about the profile |
|
213 |
+## Load RAIDS package |
|
264 | 214 |
############################################################################# |
265 |
-data(demoPedigreeEx1) |
|
266 |
-head(demoPedigreeEx1) |
|
215 |
+library(RAIDS) |
|
267 | 216 |
|
268 | 217 |
############################################################################# |
269 | 218 |
## The population reference GDS file and SNV Annotation GDS file |
270 |
-## need to be located in the same directory. |
|
219 |
+## need to be located in the same sub-directory. |
|
271 | 220 |
## Note that the population reference GDS file used for this example is a |
272 | 221 |
## simplified version and CANNOT be used for any real analysis |
273 | 222 |
############################################################################# |
223 |
+## Path to the demo 1KG GDS file is located in this package |
|
224 |
+dataDir <- system.file("extdata", package="RAIDS") |
|
274 | 225 |
pathReference <- file.path(dataDir, "tests") |
275 | 226 |
|
276 | 227 |
fileGDS <- file.path(pathReference, "ex1_good_small_1KG.gds") |
277 | 228 |
fileAnnotGDS <- file.path(pathReference, "ex1_good_small_1KG_Annot.gds") |
278 | 229 |
|
230 |
+file.copy(fileGDS, file.path(pathWorkingDirectoryData, "refGDS")) |
|
231 |
+file.copy(fileAnnotGDS, file.path(pathWorkingDirectoryData, "refGDS")) |
|
232 |
+ |
|
233 |
+``` |
|
234 |
+<br> |
|
235 |
+<br> |
|
236 |
+ |
|
237 |
+## Step 2 Ancestry inference with RAIDS |
|
238 |
+ |
|
239 |
+### 2.1 Set-up required directories |
|
240 |
+ |
|
241 |
+```{r installRaids, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE} |
|
242 |
+ |
|
279 | 243 |
############################################################################# |
280 |
-## A data frame containing general information about the study |
|
281 |
-## is also required. The data frame must have |
|
282 |
-## those 3 columns: "study.id", "study.desc", "study.platform" |
|
244 |
+## The file path to the population reference GDS file |
|
245 |
+## is required (refGenotype will be used as input later) |
|
246 |
+## The file path to the population reference SNV Annotation GDS file |
|
247 |
+## is also required (refAnnotation will be used as input later) |
|
283 | 248 |
############################################################################# |
284 |
-studyDF <- data.frame(study.id="MYDATA", |
|
285 |
- study.desc="Description", |
|
286 |
- study.platform="PLATFORM", |
|
287 |
- stringsAsFactors=FALSE) |
|
249 |
+pathReference <- file.path(pathWorkingDirectoryData, "refGDS") |
|
250 |
+ |
|
251 |
+refGenotype <- file.path(pathReference, "ex1_good_small_1KG.gds") |
|
252 |
+refAnnotation <- file.path(pathReference, "ex1_good_small_1KG_Annot.gds") |
|
288 | 253 |
|
289 | 254 |
############################################################################# |
290 |
-## The Sample SNP VCF files (one per sample) need |
|
291 |
-## to be all located in the same directory. |
|
255 |
+## The output directories inside workingDirectory/data must be created |
|
256 |
+## (pathProfileGDS will be used as input later) |
|
292 | 257 |
############################################################################# |
293 |
-pathGeno <- file.path(dataDir, "example", "snpPileup") |
|
258 |
+pathProfileGDS <- file.path(pathWorkingDirectoryData, "profileGDS") |
|
259 |
+ |
|
260 |
+if (!dir.exists(pathProfileGDS)) { |
|
261 |
+ dir.create(pathProfileGDS) |
|
262 |
+} |
|
263 |
+ |
|
264 |
+``` |
|
265 |
+ |
|
266 |
+ |
|
267 |
+<br> |
|
268 |
+ |
|
269 |
+### 2.2 Sample reference donor profiles from the reference data |
|
270 |
+ |
|
271 |
+With the 1KG reference, we recommend sampling 30 donor profiles per population. |
|
272 |
+For reproducibility, be sure to use the same random-number generator seed. |
|
273 |
+ |
|
274 |
+In the following code, only 2 profiles per population are sampled: |
|
275 |
+ |
|
276 |
+```{r sampling, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE} |
|
294 | 277 |
|
295 | 278 |
############################################################################# |
296 |
-## Fix RNG seed to ensure reproducible results |
|
279 |
+## Fix seed to ensure reproducible results |
|
297 | 280 |
############################################################################# |
298 | 281 |
set.seed(3043) |
299 | 282 |
|
... | ... |
@@ -302,393 +285,173 @@ set.seed(3043) |
302 | 285 |
## the synthetic data. |
303 | 286 |
## Here we select 2 profiles from the simplified 1KG GDS for each |
304 | 287 |
## subcontinental-level. |
305 |
-## Normally, we use 30 profile for each |
|
306 |
-## subcontinental-level but it is too big for the example. |
|
288 |
+## Normally, we would use 30 profiles for each subcontinental-level. |
|
307 | 289 |
## The 1KG files in this example only have 6 profiles for each |
308 | 290 |
## subcontinental-level (for demo purpose only). |
309 | 291 |
############################################################################# |
310 |
-gds1KG <- snpgdsOpen(fileGDS) |
|
311 |
-dataRef <- select1KGPop(gds1KG, nbProfiles=2L) |
|
312 |
-closefn.gds(gds1KG) |
|
313 |
- |
|
314 |
-## GenomeInfoDb and BSgenome are required libraries to run this example |
|
315 |
-if (requireNamespace("GenomeInfoDb", quietly=TRUE) && |
|
316 |
- requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) { |
|
317 |
- |
|
318 |
- ## Chromosome length information |
|
319 |
- ## chr23 is chrX, chr24 is chrY and chrM is 25 |
|
320 |
- chrInfo <- GenomeInfoDb::seqlengths(BSgenome.Hsapiens.UCSC.hg38::Hsapiens)[1:25] |
|
321 |
- |
|
322 |
- ########################################################################### |
|
323 |
- ## The path where the Sample GDS files (one per sample) |
|
324 |
- ## will be created needs to be specified. |
|
325 |
- ########################################################################### |
|
326 |
- pathProfileGDS <- file.path(tempdir(), "exampleDNA", "out.tmp") |
|
327 |
- |
|
328 |
- ########################################################################### |
|
329 |
- ## The path where the result files will be created needs to |
|
330 |
- ## be specified |
|
331 |
- ########################################################################### |
|
332 |
- pathOut <- file.path(tempdir(), "exampleDNA", "res.out") |
|
333 |
- |
|
334 |
- ## Example can only be run if the current directory is in writing mode |
|
335 |
- if (!dir.exists(file.path(tempdir(), "exampleDNA"))) { |
|
336 |
- |
|
337 |
- dir.create(file.path(tempdir(), "exampleDNA")) |
|
338 |
- dir.create(pathProfileGDS) |
|
339 |
- dir.create(pathOut) |
|
340 |
- |
|
341 |
- ######################################################################### |
|
342 |
- ## The wrapper function generates the synthetic dataset and uses it |
|
343 |
- ## to selected the optimal parameters before calling the genetic |
|
344 |
- ## ancestry on the current profiles. |
|
345 |
- ## All important information, for each step, are saved in |
|
346 |
- ## multiple output files. |
|
347 |
- ## The 'genoSource' parameter has 2 options depending on how the |
|
348 |
- ## SNP files have been generated: |
|
349 |
- ## SNP VCF files have been generated: |
|
350 |
- ## "VCF" or "generic" (other software) |
|
351 |
- ## |
|
352 |
- ######################################################################### |
|
353 |
- runExomeAncestry(pedStudy=demoPedigreeEx1, studyDF=studyDF, |
|
354 |
- pathProfileGDS=pathProfileGDS, |
|
355 |
- pathGeno=pathGeno, |
|
356 |
- pathOut=pathOut, |
|
357 |
- fileReferenceGDS=fileGDS, |
|
358 |
- fileReferenceAnnotGDS=fileAnnotGDS, |
|
359 |
- chrInfo=chrInfo, |
|
360 |
- syntheticRefDF=dataRef, |
|
361 |
- genoSource="VCF") |
|
362 |
- list.files(pathOut) |
|
363 |
- list.files(file.path(pathOut, demoPedigreeEx1$Name.ID[1])) |
|
364 |
- |
|
365 |
- ####################################################################### |
|
366 |
- ## The file containing the ancestry inference (SuperPop column) and |
|
367 |
- ## optimal number of PCA component (D column) |
|
368 |
- ## optimal number of neighbours (K column) |
|
369 |
- ####################################################################### |
|
370 |
- resAncestry <- read.csv(file.path(pathOut, |
|
371 |
- paste0(demoPedigreeEx1$Name.ID[1], ".Ancestry.csv"))) |
|
372 |
- print(resAncestry) |
|
373 |
- |
|
374 |
- ## Remove temporary files created for this demo |
|
375 |
- unlink(pathProfileGDS, recursive=TRUE, force=TRUE) |
|
376 |
- unlink(pathOut, recursive=TRUE, force=TRUE) |
|
377 |
- unlink(file.path(tempdir(), "exampleDNA"), recursive=TRUE, force=TRUE) |
|
378 |
- } |
|
379 |
-} |
|
380 |
- |
|
292 |
+dataRef <- select1KGPopForSynthetic(fileReferenceGDS=refGenotype, |
|
293 |
+ nbProfiles=2L) |
|
381 | 294 |
|
382 | 295 |
``` |
383 | 296 |
|
384 |
-<br> |
|
297 |
+The output object is going to be used later at the ancestry inference step. |
|
298 |
+ |
|
385 | 299 |
<br> |
386 | 300 |
|
387 |
-The *runExomeAncestry()* function generates 3 types of files |
|
388 |
-in the *pathOut* directory. |
|
301 |
+### 2.3 Perform the ancestry inference |
|
389 | 302 |
|
390 |
-* The ancestry inference CSV file (**".Ancestry.csv"** file) |
|
391 |
-* The inference information RDS file (**".infoCall.rds"** file) |
|
392 |
-* The parameter information RDS files from the synthetic inference |
|
393 |
-(__"KNN.synt.__*__.rds"__ files in a sub-directory) |
|
303 |
+Within a single function call, data synthesis is performed, the synthetic |
|
304 |
+data are used to optimize the inference parameters and, with these, the |
|
305 |
+ancestry of the input profile donor is inferred. |
|
394 | 306 |
|
395 |
-In addition, a sub-directory (named using the *profile ID*) is |
|
396 |
-also created. |
|
307 |
+According to the type of input data (RNA or DNA), a specific function |
|
308 |
+is available. |
|
397 | 309 |
|
398 |
-The inferred ancestry is stored in the ancestry inference CSV |
|
399 |
-file (**".Ancestry.csv"** file) which also contains those columns: |
|
310 |
+The *inferAncestry()* function is used for DNA profiles while |
|
311 |
+the *inferAncestryGeneAware()* function is RNA specific. |
|
400 | 312 |
|
401 |
-* _sample.id_: The unique identifier of the sample |
|
402 |
-* _D_: The optimal PCA dimension value used to infer the ancestry |
|
403 |
-* _k_: The optimal number of neighbors value used to infer the ancestry |
|
404 |
-* _SuperPop_: The inferred ancestry |
|
313 |
+In this example, the profile is from DNA source and requires the use of the |
|
314 |
+*inferAncestry()* function. |
|
405 | 315 |
|
406 |
-<br> |
|
407 |
-<br> |
|
316 |
+```{r infere, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE} |
|
408 | 317 |
|
318 |
+########################################################################### |
|
319 |
+## GenomeInfoDb and BSgenome are required libraries to run this example |
|
320 |
+########################################################################### |
|
321 |
+if (requireNamespace("GenomeInfoDb", quietly=TRUE) && |
|
322 |
+ requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) { |
|
409 | 323 |
|
324 |
+ ####################################################################### |
|
325 |
+ ## Chromosome length information is required |
|
326 |
+ ## chr23 is chrX, chr24 is chrY and chrM is 25 |
|
327 |
+ ####################################################################### |
|
328 |
+ genome <- BSgenome.Hsapiens.UCSC.hg38::Hsapiens |
|
329 |
+ chrInfo <- GenomeInfoDb::seqlengths(genome)[1:25] |
|
330 |
+ |
|
331 |
+ ####################################################################### |
|
332 |
+ ## The SNP VCF file of the DNA profile donor |
|
333 |
+ ####################################################################### |
|
334 |
+ fileDonorVCF <- file.path(dataDir, "example", "snpPileup", "ex1.vcf.gz") |
|
335 |
+ |
|
336 |
+ ####################################################################### |
|
337 |
+ ## The ancestry inference call |
|
338 |
+ ####################################################################### |
|
339 |
+ resOut <- inferAncestry(profileFile=fileDonorVCF, |
|
340 |
+ pathProfileGDS=pathProfileGDS, |
|
341 |
+ fileReferenceGDS=refGenotype, |
|
342 |
+ fileReferenceAnnotGDS=refAnnotation, |
|
343 |
+ chrInfo=chrInfo, |
|
344 |
+ syntheticRefDF=dataRef, |
|
345 |
+ genoSource=c("VCF")) |
|
346 |
+} |
|
410 | 347 |
|
411 |
-### RNA data - Wrapper function to run ancestry inference on RNA data |
|
348 |
+``` |
|
412 | 349 |
|
413 |
-The process is the same as for the DNA but use the wrapper function |
|
414 |
-called _runRNAAncestry()_. Internally the data is process differently. |
|
415 |
-It requires 4 files as input: |
|
350 |
+A profile GDS file is created in the *pathProfileGDS* directory while all the |
|
351 |
+ancestry and optimal parameters information are integrated in the output |
|
352 |
+object. |
|
416 | 353 |
|
417 |
-- The **population reference GDS file** |
|
418 |
-- The **population reference SNV Annotation GDS file** |
|
419 |
-- The **Profile SNP file** (one per sample present in the study) |
|
420 |
-- The **Profile PED RDS file** (one file with information for all |
|
421 |
-profiles in the study) |
|
354 |
+At last, all temporary files created in this example should be deleted. |
|
422 | 355 |
|
423 |
-A *data.frame* containing the general information about the study is |
|
424 |
-also required. The *data.frame* must contain those 3 columns: |
|
356 |
+```{r removeTmp, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE} |
|
425 | 357 |
|
426 |
-- _study.id_: The study identifier (example: TCGA-BRCA). |
|
427 |
-- _study.desc_: The description of the study. |
|
428 |
-- _study.platform_: The type of sequencing (example: RNA-seq). |
|
358 |
+####################################################################### |
|
359 |
+## Remove temporary files created for this demo |
|
360 |
+####################################################################### |
|
361 |
+unlink(pathWorkingDirectory, recursive=TRUE, force=TRUE) |
|
362 |
+ |
|
363 |
+``` |
|
429 | 364 |
|
430 | 365 |
<br> |
431 |
- |
|
432 |
-#### **Population reference files** |
|
433 |
- |
|
434 |
-For demonstration purpose, a small |
|
435 |
-**population reference GDS file** (called _ex1_good_small_1KG.gds_) and a small |
|
436 |
-**population reference SNV Annotation GDS file** (called |
|
437 |
-_ex1_good_small_1KG_Annot.gds_) are |
|
438 |
-included in this package. Beware that those two files should not be used to |
|
439 |
-run a real ancestry inference.The results obtained with those files won't be |
|
440 |
-reliable. |
|
441 |
- |
|
442 |
-The required **population reference GDS file** and |
|
443 |
-**population reference SNV Annotation GDS file** should be stored in the same |
|
444 |
-directory. In the example below, this directory is referred to |
|
445 |
-as **pathReference**. |
|
446 |
- |
|
447 | 366 |
<br> |
448 | 367 |
|
449 |
-#### **Profile SNP file** |
|
450 |
- |
|
451 |
-The **Profile SNP file** can be either in a VCF format or in a generic format. |
|
452 |
- |
|
453 |
-The **Profile SNP VCF file** follows the VCF standard with at least |
|
454 |
-those genotype fields: _GT_, _AD_ and _DP_. The identifier of the genotype |
|
455 |
-in the VCF file must correspond to the profile identifier _Name.ID_. |
|
456 |
-The SNVs must be germline variants and should include the genotype of the |
|
457 |
-wild-type homozygous at the selected positions in the reference. One file per |
|
458 |
-profile is need and the VCF file must be gzipped. |
|
459 |
- |
|
460 |
-Note that the name assigned to the **Profile SNP VCF file** has to |
|
461 |
-correspond to the profile identifier _Name.ID_ in the following analysis. |
|
462 |
-For example, a SNP file called "Sample.01.vcf.gz" would be |
|
463 |
-associated to the "Sample.01" profile. |
|
464 |
- |
|
465 |
-A generic SNP file can replace the VCF file. The **Profile SNP Generic file** |
|
466 |
-format is coma separated and the mandatory columns are: |
|
467 | 368 |
|
468 |
-* _Chromosome_: The name of the chromosome |
|
469 |
-* _Position_: The position on the chromosome |
|
470 |
-* _Ref_: The reference nucleotide |
|
471 |
-* _Alt_: The aternative nucleotide |
|
472 |
-* _Count_: The total count |
|
473 |
-* _File1R_: The count for the reference nucleotide |
|
474 |
-* _File1A_: The count for the alternative nucleotide |
|
369 |
+## Step 3. Examine the value of the inference call |
|
475 | 370 |
|
476 |
-Beware that the starting position in the **population reference GDS File** is |
|
477 |
-zero (like BED files). The **Profile SNP Generic file** should also start |
|
478 |
-at position zero. |
|
371 |
+The inferred ancestry and the optimal parameters are present in the *list* |
|
372 |
+object generated by the *inferAncestry()* and *inferAncestryGeneAware()* |
|
373 |
+functions. |
|
479 | 374 |
|
480 |
-Note that the name assigned to the **Profile SNP Generic file** has to |
|
481 |
-correspond to the profile identifier _Name.ID_ in the following analysis. |
|
482 |
-For example, a SNP file called "Sample.01.generic.txt.gz" would be |
|
483 |
-associated to the "Sample.01" profile. |
|
484 | 375 |
|
485 |
-<br> |
|
486 |
- |
|
487 |
-#### **Profile PED RDS file** |
|
488 |
- |
|
489 |
-The **Profile PED RDS file** must contain a *data.frame* describing all |
|
490 |
-the profiles to be analyzed. These 5 mandatory columns: |
|
376 |
+```{r printRes, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE} |
|
491 | 377 |
|
492 |
-- _Name.ID_: The unique sample identifier. The associated **profile SNP file** |
|
493 |
-should be called "Name.ID.txt.gz". |
|
494 |
-- _Case.ID_: The patient identifier associated to the sample. |
|
495 |
-- _Sample.Type_: The information about the profile tissue source |
|
496 |
-(primary tumor, metastatic tumor, normal, etc..). |
|
497 |
-- _Diagnosis_: The donor's diagnosis. |
|
498 |
-- _Source_: The source of the profile sequence data (example: dbGAP_XYZ). |
|
378 |
+########################################################################### |
|
379 |
+## The output is a list object with multiple entries |
|
380 |
+########################################################################### |
|
381 |
+class(resOut) |
|
382 |
+names(resOut) |
|
499 | 383 |
|
500 |
-Important: The row names of the *data.frame* must be the profiles _Name.ID_. |
|
384 |
+``` |
|
501 | 385 |
|
502 |
-This file is referred to as the **Profile PED RDS file** (PED for pedigree). |
|
503 |
-Alternatively, the PED information can be saved in another type of |
|
504 |
-file (CVS, etc..) as long as the *data.frame* information can be regenerated |
|
505 |
-in R (with _read.csv()_ or else). |
|
506 | 386 |
|
507 | 387 |
<br> |
508 | 388 |
|
509 |
-#### **Example** |
|
510 |
- |
|
511 |
-This example run an ancestry inference on an RNA sample. Both population |
|
512 |
-reference files are demonstration files and should not be |
|
513 |
-used for a real ancestry inference. Beware that running an ancestry inference |
|
514 |
-on real data will take longer to run. |
|
389 |
+### 3.1 Inspect the inference and the optimal parameters |
|
515 | 390 |
|
516 |
-```{r runRNAAncestry, echo=TRUE, eval=TRUE, collapse=FALSE, warning=FALSE, message=FALSE} |
|
517 |
-############################################################################# |
|
518 |
-## Load required packages |
|
519 |
-############################################################################# |
|
520 |
-library(RAIDS) |
|
521 |
-library(gdsfmt) |
|
522 | 391 |
|
523 |
-## Path to the demo 1KG GDS file is located in this package |
|
524 |
-dataDir <- system.file("extdata", package="RAIDS") |
|
392 |
+For the global ancestry inference using PCA followed by nearest neighbor |
|
393 |
+classification these parameters are *D* (the number of the top principal |
|
394 |
+directions retained) and *k* (the number of nearest neighbors). |
|
525 | 395 |
|
526 |
-############################################################################# |
|
527 |
-## Load the information about the profile |
|
528 |
-############################################################################# |
|
529 |
-data(demoPedigreeEx1) |
|
530 |
-head(demoPedigreeEx1) |
|
396 |
+The information is stored in the *Ancestry* entry as a *data.frame* object. |
|
397 |
+It is a contains those columns: |
|
531 | 398 |
|
532 |
-############################################################################# |
|
533 |
-## The population reference GDS file and SNV Annotation GDS file |
|
534 |
-## need to be located in the same directory. |
|
535 |
-## Note that the population reference GDS file used for this example is a |
|
536 |
-## simplified version and CANNOT be used for any real analysis |
|
537 |
-############################################################################# |
|
538 |
-pathReference <- file.path(dataDir, "tests") |
|
539 |
- |
|
540 |
-fileGDS <- file.path(pathReference, "ex1_good_small_1KG.gds") |
|
541 |
-fileAnnotGDS <- file.path(pathReference, "ex1_good_small_1KG_Annot.gds") |
|
399 |
+* _sample.id_: The unique identifier of the sample |
|
400 |
+* _D_: The optimal PCA dimension value used to infer the ancestry |
|
401 |
+* _k_: The optimal number of neighbors value used to infer the ancestry |
|
402 |
+* _SuperPop_: The inferred ancestry |
|
542 | 403 |
|
543 |
-############################################################################# |
|
544 |
-## A data frame containing general information about the study |
|
545 |
-## is also required. The data frame must have |
|
546 |
-## those 3 columns: "study.id", "study.desc", "study.platform" |
|
547 |
-############################################################################# |
|
548 |
-studyDF <- data.frame(study.id="MYDATA", |
|
549 |
- study.desc="Description", |
|
550 |
- study.platform="PLATFORM", |
|
551 |
- stringsAsFactors=FALSE) |
|
552 | 404 |
|
553 |
-############################################################################# |
|
554 |
-## The Sample SNP VCF files (one per sample) need |
|
555 |
-## to be all located in the same directory. |
|
556 |
-############################################################################# |
|
557 |
-pathGeno <- file.path(dataDir, "example", "snpPileupRNA") |
|
405 |
+```{r print, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE} |
|
558 | 406 |
|
559 |
-############################################################################# |
|
560 |
-## Fix RNG seed to ensure reproducible results |
|
561 |
-############################################################################# |
|
562 |
-set.seed(3043) |
|
407 |
+########################################################################### |
|
408 |
+## The ancestry information is stored in the 'Ancestry' entry |
|
409 |
+########################################################################### |
|
410 |
+print(resOut$Ancestry) |
|
563 | 411 |
|
564 |
-############################################################################# |
|
565 |
-## Select the profiles from the population reference GDS file for |
|
566 |
-## the synthetic data. |
|
567 |
-## Here we select 2 profiles from the simplified 1KG GDS for each |
|
568 |
-## subcontinental-level. |
|
569 |
-## Normally, we use 30 profile for each |
|
570 |
-## subcontinental-level but it is too big for the example. |
|
571 |
-## The 1KG files in this example only have 6 profiles for each |
|
572 |
-## subcontinental-level (for demo purpose only). |
|
573 |
-############################################################################# |
|
574 |
-gds1KG <- snpgdsOpen(fileGDS) |
|
575 |
-dataRef <- select1KGPop(gds1KG, nbProfiles=2L) |
|
576 |
-closefn.gds(gds1KG) |
|
577 |
- |
|
578 |
-## GenomeInfoDb and BSgenome are required libraries to run this example |
|
579 |
-if (requireNamespace("GenomeInfoDb", quietly=TRUE) && |
|
580 |
- requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly=TRUE)) { |
|
581 |
- |
|
582 |
- ## Chromosome length information |
|
583 |
- ## chr23 is chrX, chr24 is chrY and chrM is 25 |
|
584 |
- chrInfo <- GenomeInfoDb::seqlengths(BSgenome.Hsapiens.UCSC.hg38::Hsapiens)[1:25] |
|
585 |
- |
|
586 |
- ############################################################################# |
|
587 |
- ## The path where the Sample GDS files (one per sample) |
|
588 |
- ## will be created needs to be specified. |
|
589 |
- ############################################################################# |
|
590 |
- pathProfileGDS <- file.path(tempdir(), "exampleRNA", "outRNA.tmp") |
|
591 |
- |
|
592 |
- ############################################################################# |
|
593 |
- ## The path where the result files will be created needs to |
|
594 |
- ## be specified |
|
595 |
- ############################################################################# |
|
596 |
- pathOut <- file.path(tempdir(), "exampleRNA", "resRNA.out") |
|
597 |
- |
|
598 |
- ## Example can only be run if the current directory is in writing mode |
|
599 |
- if (!dir.exists(file.path(tempdir(), "exampleRNA"))) { |
|
600 |
- |
|
601 |
- dir.create(file.path(tempdir(), "exampleRNA")) |
|
602 |
- dir.create(pathProfileGDS) |
|
603 |
- dir.create(pathOut) |
|
604 |
- |
|
605 |
- ######################################################################### |
|
606 |
- ## The wrapper function generates the synthetic dataset and uses it |
|
607 |
- ## to selected the optimal parameters before calling the genetic |
|
608 |
- ## ancestry on the current profiles. |
|
609 |
- ## All important information, for each step, are saved in |
|
610 |
- ## multiple output files. |
|
611 |
- ## The 'genoSource' parameter has 2 options depending on how the |
|
612 |
- ## SNP files have been generated: |
|
613 |
- ## SNP VCF files have been generated: |
|
614 |
- ## "VCF" or "generic" (other software) |
|
615 |
- ######################################################################### |
|
616 |
- runRNAAncestry(pedStudy=demoPedigreeEx1, studyDF=studyDF, |
|
617 |
- pathProfileGDS=pathProfileGDS, |
|
618 |
- pathGeno=pathGeno, |
|
619 |
- pathOut=pathOut, |
|
620 |
- fileReferenceGDS=fileGDS, |
|
621 |
- fileReferenceAnnotGDS=fileAnnotGDS, |
|
622 |
- chrInfo=chrInfo, |
|
623 |
- syntheticRefDF=dataRef, |
|
624 |
- blockTypeID="GeneS.Ensembl.Hsapiens.v86", |
|
625 |
- genoSource="VCF") |
|
626 |
- |
|
627 |
- list.files(pathOut) |
|
628 |
- list.files(file.path(pathOut, demoPedigreeEx1$Name.ID[1])) |
|
629 |
- |
|
630 |
- ######################################################################### |
|
631 |
- ## The file containing the ancestry inference (SuperPop column) and |
|
632 |
- ## optimal number of PCA component (D column) |
|
633 |
- ## optimal number of neighbours (K column) |
|
634 |
- ######################################################################### |
|
635 |
- resAncestry <- read.csv(file.path(pathOut, |
|
636 |
- paste0(demoPedigreeEx1$Name.ID[1], ".Ancestry.csv"))) |
|
637 |
- print(resAncestry) |
|
638 |
- |
|
639 |
- ## Remove temporary files created for this demo |
|
640 |
- unlink(pathProfileGDS, recursive=TRUE, force=TRUE) |
|
641 |
- unlink(pathOut, recursive=TRUE, force=TRUE) |
|
642 |
- unlink(file.path(tempdir(), "example"), recursive=TRUE, force=TRUE) |
|
643 |
- } |
|
644 |
-} |
|
645 |
- |
|
646 | 412 |
``` |
647 | 413 |
|
648 |
-<br> |
|
649 | 414 |
<br> |
650 | 415 |
|
651 |
-The *runRNAAncestry()* function generates 3 types of files |
|
652 |
-in the *pathOut* directory. |
|
416 |
+### 3.2 Visualize the RAIDS performance for the synthetic data |
|
653 | 417 |
|
654 |
-* The ancestry inference CSV file (**".Ancestry.csv"** file) |
|
655 |
-* The inference information RDS file (**".infoCall.rds"** file) |
|
656 |
-* The parameter information RDS files from the synthetic inference |
|
657 |
-(__"KNN.synt.__*__.rds"__ files in a sub-directory) |
|
658 | 418 |
|
659 |
-In addition, a sub-directory (named using the *profile ID*) is |
|
660 |
-also created. |
|
419 |
+The *createAUROCGraph()* function enable the visualization of RAIDS |
|
420 |
+performance for the synthetic data, as a function of *D* and *k*. |
|
661 | 421 |
|
662 |
-The inferred ancestry is stored in the ancestry inference CSV |
|
663 |
-file (**".Ancestry.csv"** file) which also contains those columns: |
|
422 |
+```{r visualize, echo=TRUE, eval=TRUE, fig.align="center", fig.cap="RAIDS performance for the synthtic data.", results='asis', collapse=FALSE, warning=FALSE, message=FALSE} |
|
664 | 423 |
|
665 |
-* _sample.id_: The unique identifier of the sample |
|
666 |
-* _D_: The optimal PCA dimension value used to infer the ancestry |
|
667 |
-* _k_: The optimal number of neighbors value used to infer the ancestry |
|
668 |
-* _SuperPop_: The inferred ancestry |
|
424 |
+########################################################################### |
|
425 |
+## Create a graph showing the perfomance for the synthetic data |
|
426 |
+## The output is a ggplot object |
|
427 |
+########################################################################### |
|
428 |
+createAUROCGraph(dfAUROC=resOut$paraSample$dfAUROC, title="Example ex1") |
|
669 | 429 |
|
430 |
+``` |
|
431 |
+ |
|
432 |
+In this specific demonstration, the performances are lower than expected |
|
433 |
+with a real profile and a complete reference population file. |
|
670 | 434 |
|
671 | 435 |
<br> |
672 | 436 |
<br> |
673 | 437 |
|
438 |
+# Format population reference dataset (optional) |
|
674 | 439 |
|
675 |
-## Format population reference dataset (optional) |
|
676 | 440 |
|
677 |
- |
|
678 |
-```{r graphStep1, echo=FALSE, fig.align="center", fig.cap="Step 1 - Formatting the information from the population reference dataset (optional)", out.width='120%', results='asis', warning=FALSE, message=FALSE} |
|
679 |
-knitr::include_graphics("MainSteps_Step1_v04.png") |
|
441 |
+```{r graphStep1, echo=FALSE, fig.align="center", fig.cap="Step 1 - Provide population reference data", out.width='120%', results='asis', warning=FALSE, message=FALSE} |
|
442 |
+knitr::include_graphics("Step1_population_file_v01.png") |
|
680 | 443 |
``` |
681 | 444 |
|
682 | 445 |
|
683 | 446 |
A population reference dataset with known ancestry is required to infer |
684 |
-ancestry. The population must be large enough to ensure ??? |
|
447 |
+ancestry. |
|
685 | 448 |
|
686 | 449 |
Three important reference files, containing formatted information about |
687 | 450 |
the reference dataset, are required: |
688 | 451 |
|
689 | 452 |
- The population reference GDS File |
690 | 453 |
- The population reference SNV Annotation GDS file |
691 |
-- The population reference SNV Retained VCF file |
|
454 |
+- The population reference SNV Retained VCF file (optional) |
|
692 | 455 |
|
693 | 456 |
|
694 | 457 |
The format of those files are described |
... | ... |
@@ -717,7 +480,6 @@ you choose to use the pre-processed files.</span> |
717 | 480 |
<br> |
718 | 481 |
|
719 | 482 |
|
720 |
- |
|
721 | 483 |
# Session info |
722 | 484 |
|
723 | 485 |
Here is the output of `sessionInfo()` in the environment in which this |