 # GCAT (genotype conditional association test)
 
 `gcatest` implements the genotype conditional association test (GCAT).
 
 ## Installation
 
 To install the latest version from Bioconductor, open R and type:
 
 ```R
 if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
 
 BiocManager::install("gcatest")
 ```
 
 You can also install the development version from GitHub:

 ```R
 install.packages("devtools")
 library("devtools")
 install_github("Storeylab/gcatest")
 ```
 
 ## Example
 
 `gcatest` includes a simple example:
 
 ```R
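 library(gcatest)
 library(lfa)
 # fit logistic factors (d = 3) to the bundled genotype matrix
 LF <- lfa(sim_geno, 3)
 # test each SNP for association with the trait
 gcat_p <- gcat(sim_geno, LF, sim_trait)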
 ```
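
Once `gcat_p` is computed, a common next step is to account for testing many SNPs at once. The sketch below is a minimal illustration, assuming `gcat_p` is a numeric vector of per-SNP p-values as its name suggests. The `p_values` vector and the 0.05 cutoff are hypothetical stand-ins, not part of the package.

```R
# Stand-in p-values (hypothetical); replace with the `gcat_p` vector from above.
p_values <- c(0.001, 0.02, 0.5, 0.8)

# Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1.
p_bonf <- p.adjust(p_values, method = "bonferroni")

# Benjamini-Hochberg adjustment, which controls the false discovery rate.
p_bh <- p.adjust(p_values, method = "BH")

# Indices of SNPs that remain significant at the 0.05 level after Bonferroni.
hits <- which(p_bonf < 0.05)
```

Which adjustment is appropriate depends on the study, so treat both as options rather than a recommendation.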
 
 The example is also available in PLINK format at:
 
 * https://genomics.princeton.edu/storeylab/data/gcat/demo/sim_geno.bed
 * https://genomics.princeton.edu/storeylab/data/gcat/demo/sim_geno.bim
 * https://genomics.princeton.edu/storeylab/data/gcat/demo/sim_geno.fam
 
 The `genio` package provides the function `read_plink` to read this data. For example:
 
 ```R
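 library(gcatest)
 library(lfa)
 library(genio)
 # read sim_geno.bed, sim_geno.bim and sim_geno.fam from the working directory
 data <- read_plink("sim_geno")
 # genotype matrix and trait values
 sim_geno <- data$X
 sim_trait <- data$fam$pheno
 # fit logistic factors (d = 3), then run GCAT
 LF <- lfa(sim_geno, 3)
 gcat_p <- gcat(sim_geno, LF, sim_trait)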
 ```
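
The example above assumes the three PLINK files are already in the working directory. The sketch below shows one way to fetch them with base R. It assumes the files are served from the Princeton host named in the links above, so adjust `base_url` if the location differs. The helper name `fetch_demo` is hypothetical.

```R
# Base location of the demo files (assumption: taken from the links above).
base_url <- "https://genomics.princeton.edu/storeylab/data/gcat/demo"
demo_files <- paste0("sim_geno.", c("bed", "bim", "fam"))
demo_urls <- paste0(base_url, "/", demo_files)

# Hypothetical helper: download the three files into `dest_dir`.
# mode = "wb" keeps the binary .bed file intact on Windows.
fetch_demo <- function(dest_dir = ".") {
    for (i in seq_along(demo_urls)) {
        download.file(demo_urls[i], file.path(dest_dir, demo_files[i]), mode = "wb")
    }
    invisible(file.path(dest_dir, demo_files))
}

# Usage (requires network access):
# fetch_demo()
```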
 
 ## Checking genotype model fit
 
 Before using GCAT on real data, verify its main assumption: that the probabilistic model of population structure fits the genotype data well. This check does not involve the trait model, which is an important advantage of GCAT. The goodness-of-fit function (`model.gof`, called as `sHWE` in the example below) returns a p-value for each SNP by simulating a null distribution under the population structure model. The lower the p-value for a given SNP, the worse the model fits that SNP. Statistically significant p-values identify the SNPs for which the model fails, and those SNPs should be filtered out where necessary before running GCAT. The value of `d` (the number of logistic factors in the population structure model) can also be adjusted to try to maximize the number of SNPs retained for the GCAT analysis. In the example simulated data set, the last five SNPs are simulated to violate the model.
 
 ```R
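 library(gcatest)
 library(lfa)
 library(genio)
 # read the PLINK files and extract genotypes and trait
 data <- read_plink("sim_geno")
 sim_geno <- data$X
 sim_trait <- data$fam$pheno
 # fit the population structure model with d = 3
 LF <- lfa(sim_geno, 3)
 # goodness-of-fit p-value for each SNP
 gof <- sHWE(sim_geno, LF, B=2)
 # flag SNPs whose p-value falls below 1 / (number of SNPs) and drop them
 filtered <- gof < (1 / nrow(sim_geno))
 sim_geno <- sim_geno[!filtered,]
 
 # refit the model on the remaining SNPs, then run GCAT
 LF <- lfa(sim_geno, 3)
 gcat_p <- gcat(sim_geno, LF, sim_trait)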
 ```
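
To compare values of `d` as described above, one option is to repeat the goodness-of-fit check for several candidates and count how many SNPs pass each time. The following is a minimal sketch, not part of the package: `count_passing` and `sweep_d` are hypothetical helpers, the threshold mirrors the example above, and `sHWE` is assumed to be the function exported by `lfa`.

```R
# Count the SNPs that pass the goodness-of-fit filter.
# The threshold 1 / m mirrors the example above, where m is the number of SNPs.
# SNPs with a missing p-value are counted as not passing.
count_passing <- function(gof, m = length(gof)) {
    sum(gof >= 1 / m, na.rm = TRUE)
}

# Hypothetical helper: fit the model for each candidate d and report how many
# SNPs pass. Requires the `lfa` package; B = 2 follows the example above.
sweep_d <- function(X, d_values, B = 2) {
    n_pass <- vapply(d_values, function(d) {
        LF <- lfa::lfa(X, d)
        gof <- lfa::sHWE(X, LF, B)
        count_passing(gof, nrow(X))
    }, integer(1))
    data.frame(d = d_values, n_pass = n_pass)
}

# Usage with the genotype matrix from the example above:
# sweep_d(sim_geno, 2:4)
```

A larger `n_pass` means more SNPs are retained for GCAT at that value of `d`. Each candidate requires a full model fit, so the sweep can be slow on large data sets.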
 
 ## Citations
 
 Song, Minsun, Wei Hao, and John D. Storey. "Testing for Genetic Associations in Arbitrarily Structured Populations." Nature Genetics 47, no. 5 (May 2015): 550–54. [doi:10.1038/ng.3244](https://doi.org/10.1038/ng.3244).
 
 Hao, Wei, Minsun Song, and John D. Storey. "Probabilistic Models of Genetic Variation in Structured Populations Applied to Global Human Studies." Bioinformatics 32, no. 5 (March 1, 2016): 713–21. [doi:10.1093/bioinformatics/btv641](https://doi.org/10.1093/bioinformatics/btv641). [arXiv](https://arxiv.org/abs/1312.2041).