
Add in vignette.

mnodzenski authored on 26/05/2020 19:47:09 • Michael.Nodzenski committed on 26/05/2020 19:47:09
Showing 2 changed files

@@ -7,8 +7,10 @@ Maintainer: Michael Nodzenski <[email protected]>
 Description: Runs genetic algorithm to identify multi-snp effects in case-parent triad studies. 
 License: GPL-3
 Encoding: UTF-8
-LazyData: true
 Imports: BiocParallel, data.table, matrixStats, stats, methods, survival, dplyr, igraph, batchtools, qgraph
-Suggests: snpStats, Matrix
+Suggests: snpStats, Matrix, BiocStyle, knitr, rmarkdown
 RoxygenNote: 7.1.0
 Depends: R (>= 3.5.0)
+biocViews: Genetics, SNP, GeneticVariability
+VignetteBuilder: knitr
+
new file mode 100644
@@ -0,0 +1,268 @@
+---
+title: "Use of the snpGADGET package to identify multi-SNP effects in case-parent triad studies"
+author: "Michael Nodzenski"
+date: "May 26, 2020"
+package: snpGADGET
+output: 
+  BiocStyle::html_document:
+    toc: true
+vignette: >
+  %\VignetteIndexEntry{Use of the snpGADGET package to identify multi-SNP effects in case-parent triad studies}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+## Introduction 
+
+While methods for characterizing marginal relationships between individual SNPs and disease status have been well developed in high-throughput genetic association studies of complex diseases, identifying joint associations between collections of genetic variants and disease remains challenging. To date, studies have overwhelmingly focused on detecting variant-disease associations on a SNP-by-SNP basis. This allows researchers to scan millions of SNPs for evidence of association with disease, but it can only identify SNPs with large marginal disease associations, missing collections of SNPs with large joint disease associations but small marginal associations. For many diseases, it is hypothesized that increased penetrance may result from the joint effect of variants at multiple susceptibility loci, suggesting that methods focused on identifying multi-SNP associations may offer greater insight into the genetic architecture of complex diseases. 
+
+The snpGADGET package presents one such approach. In this package, we implement the GADGET method (citation) for identifying multi-SNP disease associations in case-parent triad studies. Briefly, GADGET uses a genetic algorithm (GA) to identify collections of high-penetrance SNPs. Genetic algorithms are a class of general-purpose optimization algorithms particularly well suited to combinatorial optimization. In a genetic algorithm, optimal solutions are identified by mimicking the process of natural selection. In the first iteration of the algorithm, or 'generation', a set of potential solutions, collectively known as a 'population' and with each component referred to as a 'chromosome', is initialized. Individual chromosomes are made up of finite sets of discrete elements, just as biological chromosomes contain SNPs. A user-defined function then assigns a 'fitness score' to each chromosome, with the fitness score constructed such that better solutions have higher fitness scores. The chromosomes are then subjected to 'crossing over', where component elements of different chromosomes are exchanged, or 'mutation', where component elements are arbitrarily altered, to generate a new population for the next generation. Chromosomes with higher fitness scores are preferentially selected to produce 'offspring', analogous to how organisms with higher fitness reproduce more effectively under natural selection. The scoring and mutation/crossing over process continues iteratively until stopping criteria have been achieved, hopefully resulting in an acceptable solution to the optimization problem. 
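+
+To make these ideas concrete, the following toy example sketches a generic genetic algorithm in base R. It is purely illustrative: the binary chromosomes, the hidden target vector, and the match-counting fitness function are invented for this sketch and have nothing to do with the GADGET fitness score.
+
+```{r}
+# toy genetic algorithm: evolve binary vectors toward a hidden target
+set.seed(10)
+target <- sample(c(0, 1), 20, replace = TRUE)
+fitness <- function(chrom) sum(chrom == target)
+
+# initialize a population of candidate solutions ('chromosomes')
+pop <- replicate(30, sample(c(0, 1), 20, replace = TRUE), simplify = FALSE)
+
+for (generation in 1:50) {
+
+  scores <- vapply(pop, fitness, numeric(1))
+
+  # preferentially select higher-fitness chromosomes as parents
+  parents <- sample(pop, length(pop), replace = TRUE, prob = scores / sum(scores))
+
+  pop <- lapply(parents, function(parent) {
+
+    # crossing over: splice the parent with another randomly chosen parent
+    mate <- parents[[sample(length(parents), 1)]]
+    cut <- sample(1:19, 1)
+    child <- c(parent[1:cut], mate[(cut + 1):20])
+
+    # mutation: occasionally flip one element
+    if (runif(1) < 0.1) {
+      pos <- sample(1:20, 1)
+      child[pos] <- 1 - child[pos]
+    }
+    child
+  })
+}
+
+# best fitness in the final generation (the maximum possible is 20)
+max(vapply(pop, fitness, numeric(1)))
+```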
+
+In GADGET, the term 'chromosome' refers to a collection of SNPs we wish to examine for evidence of a strong multi-SNP association with disease. To be clear, these SNPs need not be members of the same biological chromosome. The details of the fitness score function and the crossover/mutation operations are given in (citation), but in short, the intuitive aim of the fitness score is to assign high scores to collections of SNPs that are jointly transmitted to disease-affected children (cases) much more frequently than they are not transmitted. 
+
+As a final note, users should be aware that this method does not currently scale genome-wide, but it does accommodate larger numbers of input SNPs than comparable methods. Our simulations used 10,000 input SNPs, but users are encouraged to experiment with larger numbers if desired. 
+
+## Basic Usage 
+
+### Step 1. Load Data
+
+We begin our example usage of GADGET by loading an example case-parent triad dataset. 
+
+```{r}
+library(knnGA)
+data(case)
+data(dad)
+data(mom)
+```
+
+These data were simulated based on a case-parent triad study of children with orofacial clefts. We have separate datasets for the mothers, fathers, and affected children. The rows correspond to families, and the columns correspond to SNPs. Because this is a small example, each of these datasets includes only `r ncol(case)` SNPs. The SNPs in columns 51, 52, 76, and 77 make up a true risk pathway, where the joint combination of these variants substantially increases penetrance compared to all other genotypes, with the joint association far exceeding the sum of the marginal associations.  
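+
+As a quick sanity check on this layout, we can look at the dimensions and the first few genotypes (this assumes only that the loaded objects support standard row/column indexing):
+
+```{r}
+# rows are families, columns are SNPs
+dim(case)
+case[1:5, 1:5]
+```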
+
+### Step 2. Pre-process Data
+
+The second step in the analysis pipeline is to pre-process the data. We begin pre-processing by constructing a matrix that indicates whether a given pair of SNPs comes from the same biological chromosome, with `TRUE` indicating presence on the same chromosome and `FALSE` otherwise. For the example data, the SNPs are drawn from chromosomes 10-13, with the columns sorted by chromosome and 25 SNPs per chromosome. We therefore construct the matrix as follows:
+
+```{r}
+library(Matrix)
+chrom.mat <- as.matrix(bdiag(list(matrix(rep(TRUE, 25^2), nrow = 25),
+                               matrix(rep(TRUE, 25^2), nrow = 25),
+                               matrix(rep(TRUE, 25^2), nrow = 25),
+                               matrix(rep(TRUE, 25^2), nrow = 25))))
+
+```
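+
+For real data, SNPs typically come with chromosome annotations rather than a convenient block structure. In that case, the same kind of matrix can be built by comparing a vector of per-SNP chromosome labels against itself. This is just one way to construct the input, not a package function; for the example data it reproduces the block-diagonal matrix above.
+
+```{r}
+# equivalent construction from a vector of per-SNP chromosome labels
+snp.chroms <- rep(10:13, each = 25)
+chrom.mat2 <- outer(snp.chroms, snp.chroms, "==")
+all(chrom.mat2 == chrom.mat)
+```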
+
+Now, we can execute pre-processing:
+
+```{r}
+pp.list <- preprocess.genetic.data(case, father.genetic.data = dad,
+                                mother.genetic.data = mom,
+                                chrom.mat = chrom.mat)
+```
+
+This function performs a few distinct tasks that users should note. First, it identifies the minor allele for each input SNP, based on the observed frequency in the mothers and fathers. Any SNPs in the input data where a '1' corresponds to one copy of the major allele are re-coded so that '1' corresponds to one copy of the minor allele. The identities of these re-coded SNPs can be found in the output from `preprocess.genetic.data`. Afterwards, any SNPs with minor allele frequency below a given value, 2.5\% by default, are filtered out. Following SNP filtering, $\chi^2$ statistics for univariable SNP-disease associations are computed for each remaining SNP assuming an additive model. These test statistics are incorporated into the GADGET algorithm in a later step (see (citation) for details). 
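+
+To build intuition for the minor allele frequency filter, the following snippet computes parental allele frequencies directly. This is only an illustration of the underlying calculation, not the package's internal code, and it assumes the genotypes are coded as 0/1/2 counts of one allele.
+
+```{r}
+# frequency of the coded allele among parents, per SNP
+parent.geno <- rbind(as.matrix(mom), as.matrix(dad))
+coded.freq <- colMeans(parent.geno) / 2
+
+# the minor allele frequency is the smaller of the two allele frequencies
+maf <- pmin(coded.freq, 1 - coded.freq)
+
+# number of SNPs that would fall below the default 2.5% threshold
+sum(maf < 0.025)
+```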
+
+### Step 3. Run GADGET for Observed Data
+
+We now use the `run.ga` function to identify interesting collections of SNPs that appear to jointly increase disease penetrance. The method requires a number of tuning parameters, but the function defaults are a good starting point. More information on these parameters can be found in the package documentation and the paper describing the method (citation). Briefly, the GADGET method requires the user to pre-specify the number of SNPs in each candidate collection. This is controlled by the `chromosome.size` argument. We recommend running the algorithm for a range of sizes, typically 2-6. For this simple example, however, we will only examine sizes 3 and 4.  
+
+```{r, message = FALSE}
+run.ga(pp.list, n.chromosomes = 5, chromosome.size = 3, results.dir = "size3_res",
+        cluster.type = "interactive", registryargs = list(file.dir = "size3_reg", seed = 1300),
+        n.top.chroms = 5, n.islands = 8, island.cluster.size = 4, n.migrations = 2)
+
+run.ga(pp.list, n.chromosomes = 5, chromosome.size = 4, results.dir = "size4_res",
+        cluster.type = "interactive", registryargs = list(file.dir = "size4_reg", seed = 1400),
+        n.top.chroms = 5, n.islands = 8, island.cluster.size = 4, n.migrations = 2)
+```
+
+Note that `run.ga` does not return results directly to the R session; instead, it writes results to the directory specified in `results.dir`. Furthermore, GADGET uses a technique known as an island model genetic algorithm to identify potentially interesting collections of SNPs. The details of this technique are described fully in (citation), but users will note that results for each island (parameter `n.islands`) are written separately to `results.dir`. 
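+
+After the calls complete, listing the contents of `results.dir` is a quick way to confirm that output was written for each island (we make no assumption here about the particular file names the package uses):
+
+```{r}
+# one set of results per island should appear in the results directory
+list.files("size3_res")
+```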
+
+The `cluster.type` and `registryargs` parameters are important. For the above example, the "interactive" cluster type indicates that all islands run sequentially in the R session. However, this is not how we anticipate `run.ga` will be used in most cases. Rather, we strongly recommend this function be used with high-performance computing clusters to avoid prohibitively long run-times. An example (not run) of this type of command can be seen below: 
+
+```{r, eval=F}
+library(BiocParallel)
+fname <- batchtoolsTemplate("slurm")
+run.ga(pp.list, n.chromosomes = 20, chromosome.size = 3, results.dir = "size3_res",
+        cluster.type = "slurm", registryargs = list(file.dir = "size3_reg", seed = 1300),
+        cluster.template = fname, resources = list(chunks.as.arrayjobs = TRUE),
+        n.top.chroms = 5, n.islands = 12, island.cluster.size = 4)
+
+```
+
+The `cluster.template` file must be properly calibrated to the user's HPC cluster. Packages `batchtools` and `BiocParallel` both contain good documentation on how to construct these files. It is also possible to run `run.ga` on a single machine with multiple cores (see the `run.ga` documentation). 
+
+Regardless of the chosen `cluster.type`, `run.ga` uses functionality from the `batchtools` package to run jobs. In the case of HPC cluster use, with a properly configured `cluster.template`, users simply need to execute `run.ga` from an interactive R session and the jobs will be submitted to and executed on the cluster. This approach is well described in section 4.3.2 of the vignette "Introduction to BiocParallel" from the `BiocParallel` package. Depending on `cluster.type`, jobs may take minutes to hours to complete. 
+
+The status of jobs can be queried using the functions in `batchtools`, most commonly `getStatus`. For users of HPC clusters, commands such as `squeue` can also be used. For larger numbers of submissions, users may also construct their own automated scripts to check that jobs have completed successfully. Perhaps the most obvious indication of a job failure is that `results.dir` will not contain `n.islands` different sets of results. This is particularly important when running jobs on an HPC cluster, where jobs may fail relatively cryptically. In the case of job failures due to problems with cluster schedulers (jobs failing to launch, node failures, etc.), the failed jobs can be re-run using the exact same `run.ga` command. The function will automatically identify the island jobs that still need to be run, and submit only those. This is also true for users running jobs on personal machines who need to stop computations before all island results are available.
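+
+As one concrete way to check on job status from R, the registry that `run.ga` created (at the `file.dir` given in `registryargs`) can be loaded read-only with standard `batchtools` functions. This is a sketch of the kind of check described above, not a required part of the workflow:
+
+```{r, eval = FALSE}
+library(batchtools)
+
+# load the registry created by run.ga and summarize job status
+reg <- loadRegistry("size3_reg", writeable = FALSE)
+getStatus(reg = reg)
+
+# simple completeness check: results for all n.islands islands should be present
+length(list.files("size3_res"))
+```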
+
+Once users have confirmed `run.ga` has completed and run properly, the sets of results across islands should be condensed using the function `combine.islands`:
+
+```{r}
+size3.combined.res <- combine.islands("size3_res")
+size4.combined.res <- combine.islands("size4_res")
+
+```
+
+Importantly, the function also writes these combined results to the specified directory, so users should not call it multiple times for the same directory. 
+
+The function condenses island results in two ways. First, it simply concatenates the results across islands, so that if a particular set of SNPs is identified on more than one island, it appears in more than one row of the condensed results. Second, the concatenated results are further condensed so that each row contains a unique collection of SNPs, with an additional column indicating the number of islands on which that collection was identified. These two types of results are both useful in different contexts. In both cases, the rows are sorted in decreasing order of fitness score, or, more plainly, with the most interesting collections of SNPs appearing at the top of the dataset. 
+
+```{r}
+# first method of condensing: results concatenated across islands
+head(size4.combined.res$all.results)
+
+# second method of condensing: one row per unique collection of SNPs
+head(size4.combined.res$unique.results)
+
+```
+
+In general, the `unique.results` object is more useful for examining interesting sets of SNPs because it contains all of the information in the `all.results` object, just in a more easily digestible form. In this case, we see that the SNPs corresponding to columns 51, 52, 76, and 77 are the top collection, or, more plainly, the group of 4 SNPs identified as most likely to exhibit a joint association with disease status beyond the sum of its component marginal associations. Note that these are the actual SNPs simulated to exhibit a joint effect, so we have identified the true risk pathway. 
+
+An important but subtle component of the results that users should note is the set of `diff.vec` columns. The magnitudes of these values are not particularly important, but the signs are useful. A positive value for a given SNP indicates the minor allele is positively associated with disease status, while a negative value implies the reference ('wild type') allele is positively associated with the disease. In the case of the risk pathway SNPs, all values are positive, indicating the risk pathway involves copies of the minor allele for all 4 SNPs. If any of these coefficients were negative, the interpretation would be similar. For instance, if the `snp1.diff.vec` value were negative and the remaining values positive, this would imply the proposed risk pathway includes copies of the major allele for SNP 1 and copies of the minor alleles for SNPs 2-4.      
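+
+To pull out just the signs for the top-ranked collection, something like the following can be used. This is an illustration rather than a package function, and it assumes the relevant columns are those whose names contain "diff.vec", as described above:
+
+```{r, eval = FALSE}
+# signs of the diff.vec columns for the top-ranked collection of SNPs
+top.row <- as.data.frame(size4.combined.res$unique.results)[1, ]
+diff.cols <- grep("diff.vec", names(top.row), value = TRUE)
+sign(unlist(top.row[diff.cols]))
+```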
+
+### Step 4. Create Permuted Datasets
+
+Of course, in real studies we do not know the identity of true risk pathways. We would therefore like a way to determine whether the results from the observed data are consistent with what we would expect under the null hypothesis of no association between any input SNPs and disease status. We do so via permutation testing. The first step is to create a number of permuted datasets using the `permute.dataset` function. Here we create 4 permuted datasets. In real applications, users are advised to create at least 100, or more if computationally feasible. 
+
+```{r}
+set.seed(1400)
+perm.data.list <- permute.dataset(case, father.genetic.data = dad,
+                                  mother.genetic.data = mom, n.permutations = 4)
+
+```
+
+### Step 5. Run GADGET on Permuted Datasets
+
+Once we have created the permuted datasets, we perform exactly the same sequence of analyses shown above for each permutation. Users should note that this is by far the most time-consuming step of the workflow and, in real applications, almost certainly requires access to a computing cluster. However, since this vignette provides only a very small example, we are able to run the analyses locally. We begin by pre-processing the permuted datasets:
+
+```{r}
+preprocess.lists <- lapply(perm.data.list, function(permutation){
+  
+  preprocess.genetic.data(permutation$case,
+                          complement.genetic.data = permutation$comp,
+                          chrom.mat = chrom.mat)
+})
+
+```
+
+Then, we run GADGET on each permuted dataset for each chromosome size:
+
+```{r}
+# specify chromosome sizes
+chrom.sizes <- 3:4
+
+# specify a different seed for each permutation
+seeds <- 4:7
+
+# run GADGET for each permutation and size 
+lapply(seq_along(preprocess.lists), function(permutation){
+  
+  perm.data.list <- preprocess.lists[[permutation]]
+  seed.val <- seeds[permutation]
+  
+  lapply(chrom.sizes, function(chrom.size){
+    
+    res.dir <- paste0("perm", permutation, "_size", chrom.size, "_res")
+    reg.dir <- paste0("perm", permutation, "_size", chrom.size, "_reg")
+    run.ga(perm.data.list, n.chromosomes = 5, chromosome.size = chrom.size,
+           results.dir = res.dir, cluster.type = "interactive", 
+           registryargs = list(file.dir = reg.dir, seed = seed.val),
+           n.top.chroms = 5, n.islands = 8, island.cluster.size = 4, n.migrations = 2)
+    
+  })
+  
+})
+
+# condense the results 
+perm.res.dirs <- as.vector(outer(paste0("perm", 1:4), 
+                                 paste0("_size", chrom.sizes, "_res"), paste0))
+perm.res.list <- lapply(perm.res.dirs, function(perm.res.dir){
+  
+  condensed.res <- combine.islands(perm.res.dir)
+  condensed.res$all.results
+  
+})
+
+```
+
+### Step 6. Run Global Test of Association 
+
+Now we are ready to more formally examine whether our results are consistent with what would be expected under the null hypothesis of no association between the input variants and disease status. Our first step is to run a global test, across all chromosome sizes, of whether the fitness scores from the observed data are drawn from the distribution expected under the null. The details of the global test are described in (citation). Briefly, for each chromosome size, we use the null permutations to estimate the null fitness score CDF via the mean empirical CDF. Then, for the observed data and for each permutation, for each chromosome size, we compute a one-sided Kolmogorov-Smirnov (KS) statistic against the null hypothesis that the observed CDF is not less than the null CDF, i.e., that the observed fitness scores are not stochastically larger than the null scores. Intuitively, we are interested in whether the observed fitness scores are consistently higher than the null permutation-based scores. If many observed fitness scores exceed the permutation scores, the observed fitness score CDF will be much lower than the null CDF for at least some fitness scores, yielding a large one-sided KS statistic. Otherwise, the KS statistics should be small.  
+
+We then use the permutation KS statistic vectors, with each element corresponding to a different chromosome size, to estimate a null mean KS statistic vector and covariance matrix. Finally, we compute the Mahalanobis distance between the observed KS statistic vector and the estimated null distribution, as well as the corresponding distance for each individual permutation. The p-value of the global test is the proportion of null Mahalanobis distances that exceed the observed distance. This test is implemented in the function `run.global.test`; before applying it to our results, the toy example below sketches the Mahalanobis comparison on made-up numbers. 
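+
+In this self-contained illustration (invented numbers, not output from the analysis above), each row of the matrix is a permutation's vector of KS statistics across chromosome sizes, and the p-value is the proportion of permutation distances exceeding the observed distance:
+
+```{r}
+# toy illustration of the Mahalanobis distance comparison
+set.seed(100)
+null.ks <- matrix(rnorm(40, mean = 1, sd = 0.2), nrow = 20, ncol = 2)
+obs.ks <- c(1.9, 2.1)
+
+# estimate the null mean vector and covariance matrix from the 'permutations'
+null.mean <- colMeans(null.ks)
+null.cov <- cov(null.ks)
+
+# squared Mahalanobis distances for the observed vector and each permutation
+obs.dist <- mahalanobis(obs.ks, center = null.mean, cov = null.cov)
+null.dist <- mahalanobis(null.ks, center = null.mean, cov = null.cov)
+
+# permutation p-value: proportion of null distances exceeding the observed distance
+mean(null.dist > obs.dist)
+```
+
+We now run the global test on the actual results: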
+
+```{r}
+# chromosome size 3 results
+chrom3.list <- list(observed.data = size3.combined.res$all.results,
+                     permutation.list = perm.res.list[1:4])
+
+# chromosome size 4 results
+chrom4.list <- list(observed.data = size4.combined.res$all.results,
+                     permutation.list = perm.res.list[5:8])
+
+# list of results across chromosome sizes, with each list 
+# element corresponding to a chromosome size
+final.results <- list(chrom3.list, chrom4.list)
+
+# run global test
+global.test.res <- run.global.test(final.results)
+
+# look at the global test Mahalanobis distance and p-value 
+global.test.res$obs.test.stat
+global.test.res$pval
+
+```
+
+We see the observed Mahalanobis distance is `r global.test.res$obs.test.stat`, with an associated permutation-based p-value of `r global.test.res$pval`. In cases where the p-value is zero, users may wish to report the p-value as $\frac{1}{\#\:permutations}$ or $<\frac{1}{\#\:permutations}$. Above, this would be $\frac{1}{4}$. Note that this underscores the need to run as many permutations as is computationally feasible in real applications.   
+
+Regardless of the result, it is crucial that users are clear about what the test implies. The null hypothesis is that the observed vector of KS statistics is drawn from the null distribution. In cases where we reject the null hypothesis, the conclusion is of course that the observed vector is not drawn from the null distribution. However, to be very clear, rejecting the null does not necessarily imply the observed fitness scores are consistently higher than the null permutation-based scores. That is one reason the null may be rejected, but not the only reason. For instance, it is possible that the null will be rejected solely because the covariance pattern among chromosome sizes for the observed data differs from the permutation-based estimate. Therefore, to best characterize results, it is incumbent on users to closely examine results beyond the global test. 
+
+To that end, `run.global.test` provides additional, chromosome-size-specific information that users are encouraged to examine. First, users may look at the `element.test.stats` and `element.pvals` objects from the `run.global.test` output:
+
+```{r}
+global.test.res$element.test.stats
+global.test.res$element.pvals
+
+```
+
+The `element.test.stats` object reports the one-sided KS statistics for each chromosome size for the observed data. The `element.pvals` object reports the proportion of permutation-based KS statistics that exceed the observed KS statistic for each chromosome size. In this case, we see that none of the permutation-based KS statistics exceeded the observed value for either chromosome size. This indicates that for both chromosome sizes, the distribution of observed fitness scores has a heavier upper tail than the null permutations. Given a sufficient number of permutations, this would suggest the presence of a true multi-SNP risk pathway in the observed data. 
+
+Second, users may also wish to examine the `max.obs.fitness` and `max.order.pvals` elements of the output:
+
+```{r}
+global.test.res$max.obs.fitness
+global.test.res$max.order.pvals
+
+```
+
+The `max.obs.fitness` element reports the maximum fitness score in the observed data for each chromosome size, and the `max.order.pvals` element reports the proportion of permutations whose maximum fitness score exceeds the observed maximum. In the example above, we see the maximum observed fitness score exceeds all of the permutation maxima for each chromosome size. Again, given sufficient permutations to estimate the null distribution, these p-values correspond to a test of the null hypothesis that the maximum observed fitness score for a given chromosome size is not greater than what would be expected given no SNP-disease associations. Rejecting the null implies the observed maximum fitness score exceeds what would be expected by random chance. 
+
+### Step 7. Visualize Results 
+
+As a final step in the analytic pipeline, we recommend users examine network plots of the results using the function `network.plot`. This may be particularly useful when trying to determine the true number of SNPs involved in a multi-SNP risk pathway. For instance, in the example above, the true risk pathway involves 4 SNPs, but we ran GADGET for chromosome sizes of 3 and 4. For chromosome size 3, many of the top identified collections of SNPs were subsets of the true 4-SNP pathway. If we didn't know the true pathway size was 4, a network plot might help make this clearer. We start by examining the plot for chromosome size 3:
+
+```{r}
+network.plot(size3.combined.res$all.results)
+```
+
+By default, the size of a given SNP's node is proportional to the maximum fitness score of all chromosomes in which that SNP appears, as is the node color, with larger fitness scores corresponding to greener color. Similarly, the width and color of the edge between a given pair of SNP nodes corresponds to the maximum fitness score among all chromosomes in which that pair appears, with wider edges and redder color corresponding to higher scores. Users should note that `network.plot` allows other mechanisms for coloring and weighting graph nodes and edges to be specified (see the package documentation). In this case, we immediately see that the true 4-SNP pathway appears despite having mis-specified the true pathway size. The plot for chromosome size 4 also shows the same pathway SNPs:
+
+```{r}
+network.plot(size4.combined.res$all.results)
+```
+
+### Cleanup and sessionInfo()
+
+```{r}
+# remove all example directories 
+perm.reg.dirs <- as.vector(outer(paste0("perm", 1:4), 
+                                 paste0("_size", chrom.sizes, "_reg"), paste0))
+lapply(c("size3_res", "size3_reg", "size4_res", "size4_reg", perm.reg.dirs, perm.res.dirs), 
+       unlink, recursive = TRUE)
+
+# session information 
+sessionInfo()
+
+```
+