...
|
...
|
@@ -103,7 +103,7 @@ knitr::include_graphics("MainSteps_v05.png")
|
103
|
103
|
|
104
|
104
|
The main steps are:
|
105
|
105
|
|
106
|
|
-**Step 1.** Set-up and provide population reference files
|
|
106
|
+**Step 1.** Set-up working directory and provide population reference files
|
107
|
107
|
|
108
|
108
|
**Step 2** Sample the reference data for donors whose genotypes will be used for synthesis and optimize ancestry inference parameters using synthetic data
|
109
|
109
|
|
...
|
...
|
@@ -117,7 +117,7 @@ These steps are described in detail in the following.
|
117
|
117
|
<br>
|
118
|
118
|
|
119
|
119
|
|
120
|
|
-## Step 1. Set-up and provide population reference files
|
|
120
|
+## Step 1. Set-up working directory and provide population reference files
|
121
|
121
|
|
122
|
122
|
|
123
|
123
|
### 1.1 Create a working directory structure
|
...
|
...
|
@@ -146,6 +146,7 @@ example will be run.
|
146
|
146
|
|
147
|
147
|
#############################################################################
|
148
|
148
|
## Create a temporary working directory structure
|
|
149
|
+## using the tempdir() function
|
149
|
150
|
#############################################################################
|
150
|
151
|
pathWorkingDirectory <- file.path(tempdir(), "workingDirectory")
|
151
|
152
|
pathWorkingDirectoryData <- file.path(pathWorkingDirectory, "data")
|
...
|
...
|
@@ -166,7 +167,8 @@ if (!dir.exists(pathWorkingDirectory)) {
|
166
|
167
|
|
167
|
168
|
The population reference files should be downloaded into the *data/refGDS*
|
168
|
169
|
sub-directory. This following code downloads the complete pre-processed files
|
169
|
|
-for 1000 Genomes (1KG), for the hg38 build of the human genome, in the GDS format. The size of the 1KG GDS file is 15GB.
|
|
170
|
+for 1000 Genomes (1KG), for the hg38 build of the human genome, in the GDS
|
|
171
|
+format. The size of the 1KG GDS file is 15GB.
|
170
|
172
|
|
171
|
173
|
```
|
172
|
174
|
|
...
|
...
|
@@ -202,9 +204,9 @@ library(RAIDS)
|
202
|
204
|
|
203
|
205
|
#############################################################################
|
204
|
206
|
## The population reference GDS file and SNV Annotation GDS file
|
205
|
|
-## need to be located in the same sub-directory.
|
|
207
|
+## need to be located in the same sub-directory.
|
206
|
208
|
## Note that the mini-reference GDS file used for this example is
|
207
|
|
-## NOT sufficient for reliable inference
|
|
209
|
+## NOT sufficient for reliable inference.
|
208
|
210
|
#############################################################################
|
209
|
211
|
## Path to the demo 1KG GDS file is located in this package
|
210
|
212
|
dataDir <- system.file("extdata", package="RAIDS")
|
...
|
...
|
@@ -265,18 +267,21 @@ In the following code, only 2 individual profiles per
|
265
|
267
|
sub-continental population are sampled from the
|
266
|
268
|
demo population GDS file:
|
267
|
269
|
|
268
|
|
-```{r sampling, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
|
|
270
|
+```{r samplingProfiles, echo=TRUE, eval=TRUE, collapse=TRUE, warning=FALSE, message=FALSE}
|
269
|
271
|
|
270
|
272
|
#############################################################################
|
271
|
|
-## Set up the following random number generator seed to reproduce the expected results
|
|
273
|
+## Set up the following random number generator seed to reproduce
|
|
274
|
+## the expected results
|
272
|
275
|
#############################################################################
|
273
|
276
|
set.seed(3043)
|
274
|
277
|
|
275
|
278
|
#############################################################################
|
276
|
279
|
## Choose the profiles from the population reference GDS file for
|
277
|
|
-## data synthesis.
|
278
|
|
-## Here we choose 2 profiles perm subcontinental population from the mini 1KG GDS file.
|
279
|
|
-## Normally, we would use 30 randomly chosen profiles per subcontinental population.
|
|
280
|
+## data synthesis.
|
|
281
|
+## Here we choose 2 profiles per subcontinental population
|
|
282
|
+## from the mini 1KG GDS file.
|
|
283
|
+## Normally, we would use 30 randomly chosen profiles per
|
|
284
|
+## subcontinental population.
|
280
|
285
|
#############################################################################
|
281
|
286
|
dataRef <- select1KGPopForSynthetic(fileReferenceGDS=refGenotype,
|
282
|
287
|
nbProfiles=2L)
|
...
|
...
|
@@ -284,13 +289,11 @@ dataRef <- select1KGPopForSynthetic(fileReferenceGDS=refGenotype,
|
284
|
289
|
```
|
285
|
290
|
|
286
|
291
|
|
287
|
|
-
|
288
|
292
|
<br>
|
289
|
293
|
|
290
|
294
|
### 2.3 Infer ancestry
|
291
|
295
|
|
292
|
|
-Within a single function
|
293
|
|
-call, data synthesis is performed, the synthetic
|
|
296
|
+Within a single function call, data synthesis is performed, the synthetic
|
294
|
297
|
data are used to optimize the inference parameters and, with these, the
|
295
|
298
|
ancestry is inferred from the input sequence profile.
|
296
|
299
|
|
...
|
...
|
@@ -302,8 +305,8 @@ The *inferAncestry()* function requires a specific profile input format. The
|
302
|
305
|
format is set by the *genoSource* parameter.
|
303
|
306
|
|
304
|
307
|
One of those formats is in a VCF format (*genoSource=c("VCF")*).
|
305
|
|
-This format follows the VCF standard
|
306
|
|
-with at least those genotype fields: _GT_, _AD_ and _DP_.
|
|
308
|
+This format follows the VCF standard with at least those genotype
|
|
309
|
+fields: _GT_, _AD_ and _DP_.
|
307
|
310
|
The SNVs must be germline variants and should include the genotype of the
|
308
|
311
|
wild-type homozygous at the selected positions in the reference. The VCF file
|
309
|
312
|
must be gzipped.
|