Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit East talk by Piotr Szul

FINDING NEEDLES IN GENOMIC
HAYSTACKS WITH “WIDE”
RANDOM FOREST
Piotr Szul
CSIRO Data61

Which needle is the right one?
All humans carry between 200 and 800 mutations that disrupt the function of a gene.
The human genome is 3 billion letters long.
Finding genetic underpinnings for diseases and phenotypic traits
What are the biological mechanisms?
Who is at risk for a disease?
How to prevent and treat?

• 5319 talented staff
• $1 billion+ budget
• Working with 2800+ industry partners
• 55 sites across Australia
• Top 1% of global research agencies
• Each year 6 CSIRO technologies contribute $5 billion to the economy

Agenda
• Intro to Genome Wide Association Studies
• Variant Spark and “Cursed Forest”
• GWAS use-cases

Genome Wide Association Studies
Image courtesy of Pasieka, Science Photo Library
1000+ samples
Relatively common > 1%
~ 500,000 SNPs

Look at the data
Typical GWAS: 1M variants x 5K samples
Full genome: 80M variants x 2.5K samples
[Figure: genotype matrix of samples (10^3) by variants (10^6); each entry is 0, 1 or 2]
0 1 0 … 1
1 2 0 … 0
0 0 0 … 2
[Figure: the variants x samples matrix (D x N), with a transpose step between the two layouts; the variants are the predictors and a 1 x samples vector is the response to associate them with]
[Plot: samples (0 to 50,000) versus variants (100,000 to 100,000,000) for GWAS studies and 1000 Genomes]
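To make the data shape above concrete, here is a minimal sketch that stores each genotype (0, 1 or 2) in one byte and transposes a samples x variants matrix into one row per variant. The toy values and the helper name `transpose_genotypes` are assumptions made for illustration; they are not taken from the talk.

```python
def transpose_genotypes(rows):
    """Turn a samples x variants matrix into variants x samples byte rows."""
    n_variants = len(rows[0])
    if any(len(r) != n_variants for r in rows):
        raise ValueError("all samples must have the same number of variants")
    # zip(*rows) walks the matrix column by column; bytes() packs each
    # column into one byte per genotype value.
    return [bytes(col) for col in zip(*rows)]


# Toy data (hypothetical): 3 samples, 4 variants, genotypes coded 0/1/2.
samples = [
    [0, 1, 0, 1],
    [1, 2, 0, 0],
    [0, 0, 0, 2],
]
by_variant = transpose_genotypes(samples)

# The response is a single value per sample (1 x samples).
response = [0, 1, 0]

for i, column in enumerate(by_variant):
    print(f"variant {i}: {list(column)}")
```

Each element of `by_variant` is one predictor holding a value for every sample, which is the layout a column-wise learner works on.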

GWAS
[Plot: GWAS studies and associations by year, 2008 to 2015 (y-axis 0 to 12,000)]
2713 studies
31183 associations
Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population
The NHGRI-EBI Catalog of published genome-wide association studies

Missing Heritability
Manolio et al. (2009) Finding the missing heritability of complex diseases
… human height heritability is ~80% yet more than 40 associated loci explain only about 5% of phenotypic variance …
“Dark matter” of
genomics

Epistasis
Traditional approach for interaction modeling ‘squares’ the problem size
500,000 SNPs → ~100,000,000,000 pairs
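The order of magnitude above can be checked by counting unordered SNP pairs directly. This is a small arithmetic sketch; the function name `pair_count` is only illustrative.

```python
from math import comb


def pair_count(n_snps):
    """Number of unordered pairs of distinct SNPs: n * (n - 1) / 2."""
    return comb(n_snps, 2)


pairs = pair_count(500_000)
print(f"{pairs:,}")  # 124,999,750,000
```

That is roughly 1.25 x 10^11 pairs, in line with the ~100,000,000,000 quoted on the slide.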

Random Forest to the rescue
Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests
Breiman (2001) Random Forests. Machine Learning

Random Forest in GWAS
• Non-parametric and arbitrarily expressive
• Insensitive to outliers and non-informative predictors
• Stable performance – no overfitting
• Easy to tune
• Built-in error estimate (OOB error)
• Variable importance measures
• Ability to deal with heterogeneous data
• Easy to parallelize and scale on HPC
Sun (2010) Multigenic Modeling of Complex Disease by Random Forests
RF is an appropriate candidate to capture the genetic heterogeneity underlying the trait because RF itself is an ensemble of many heterogeneous trees built from uncorrelated subsamples of the original data
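The built-in OOB error estimate listed above rests on bootstrap sampling: each tree is trained on a sample drawn with replacement, and the samples that were never drawn (the out-of-bag ones) act as a test set for that tree. The sketch below shows only this sampling step, not a forest; the function name and the sample size are assumptions chosen for illustration.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample; return (in-bag indices, out-of-bag indices)."""
    # Sampling with replacement: some indices repeat, others never appear.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 5000
in_bag, oob = bootstrap_split(n, rng)
oob_fraction = len(oob) / n
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

For large sample sizes the out-of-bag fraction approaches 1/e, about 0.37, so every tree has roughly a third of the samples available for an error estimate without a separate hold-out set.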

VariantSpark
[Plot: time in seconds (0 to 2000) by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), broken down by task (binary-conversion, clustering, pre-processing)]
It can cluster 3000 individuals and 80 million variants
O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information
Transformational Bioinformatics Team
Natalie Twine, Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson
Software
Open source (MIT) @ https://github.com/csirobigdata/variant-spark

Random Forest SparkML
Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
• Failing for millions of variables
• Relatively slow

“Cursed Forest”
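[Diagram: variables v1 … vn are spread across the executors. Each executor reports its local best split (var, point); the driver aggregates these into the global best split (var1, point1 at the first level, then var21, point21 and var22, point22) and broadcasts the resulting split subsets, starting from the initial sample.]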
Partition data by variables (columns)
• Columns are “small” – easy to partition
• An executor can find (an exact) best split for many variables
• Finding the global best split is efficient
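The column-partitioned scheme above can be sketched in plain Python, with dictionaries standing in for the executors' partitions and an ordinary function standing in for the driver. Gini impurity as the split criterion, the function names and the toy data are all assumptions for illustration; the talk does not specify them, and this is not the VariantSpark implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split_for_variable(values, labels):
    """Exact best threshold for one variable: (weighted impurity, threshold)."""
    best = (float("inf"), None)
    n = len(labels)
    # Try every distinct value except the largest as "value <= threshold".
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[0]:
            best = (score, threshold)
    return best


def local_best_split(partition, labels):
    """Executor side: best (score, variable, threshold) among its own columns."""
    candidates = []
    for var, values in partition.items():
        score, threshold = best_split_for_variable(values, labels)
        if threshold is not None:  # constant columns cannot be split
            candidates.append((score, var, threshold))
    return min(candidates) if candidates else None


def global_best_split(partitions, labels):
    """Driver side: reduce the local bests to a single global best split."""
    local = [local_best_split(p, labels) for p in partitions]
    local = [c for c in local if c is not None]
    return min(local) if local else None


# Toy data (hypothetical): 6 samples, 4 variables split over 2 "executors".
labels = [0, 0, 1, 1, 0, 1]
partitions = [
    {"v1": [0, 1, 0, 1, 2, 0], "v2": [0, 0, 0, 0, 0, 0]},
    {"v3": [0, 0, 2, 1, 0, 2], "v4": [1, 1, 1, 0, 2, 2]},
]
score, var, threshold = global_best_split(partitions, labels)
print(var, threshold, score)  # v3 0 0.0
```

Only one small tuple per partition travels to the driver, which is why the global reduction stays cheap even when the number of variables is very large.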

Some implementation tricks
• “Native” data shape
– VCF files are organized by “variables”
• Building by levels and tree batching
– Minimize communication overhead and the number of stages
• Optimized split finding for ordered factors
– The most frequent operation
– Java implementation faster than Scala
• Choice of data representation
– byte representation for variant data
– with sparsity 0.75 a sparse vector is 3x bigger than a byte array
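One way to arrive at the 3x figure above is simple storage arithmetic. The sketch assumes a sparse vector that keeps a 32-bit integer index and a 64-bit floating point value for every non-zero entry and ignores any per-object overhead; those sizes are assumptions for illustration, not something stated in the talk.

```python
INDEX_BYTES = 4  # assumed: 32-bit integer index per non-zero entry
VALUE_BYTES = 8  # assumed: 64-bit float value per non-zero entry


def sparse_vector_bytes(n_entries, sparsity):
    """Approximate payload of a sparse vector; sparsity is the zero fraction."""
    non_zero = round(n_entries * (1.0 - sparsity))
    return non_zero * (INDEX_BYTES + VALUE_BYTES)


def byte_array_bytes(n_entries):
    """A dense byte array stores every entry in exactly one byte."""
    return n_entries


n = 1_000_000
ratio = sparse_vector_bytes(n, 0.75) / byte_array_bytes(n)
print(ratio)  # 3.0
```

Under these assumptions the sparse form only becomes smaller than the byte array once fewer than 1 in 12 entries are non-zero.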

How fast is it? (16 CPU cores, 32 GB RAM, local mode)

Big data performance
• YARN cluster (12 workers)
– 16 x Intel Xeon E5-2660 @ 2.20 GHz CPU
– 128 GB of RAM
• Spark 1.6.1 on YARN
– 128 executors
– 6 GB / executor (0.75 TB)
• Synthetic dataset (mtry = 0.25)
• Typical GWAS range
– 100K trees: 5 – 50 h
– AWS: ~$215.50
• Whole genome range
– 100K trees: 200 – 2000 h
– AWS: ~$8620.00
• 50M variables x 10K samples!

Other features
• Various input formats
– VCF, CSV, parquet
• A variety of RF (fine) tuning parameters
– Sampling
– Depths
– Splitting
• Insight into RF model
– cumulative OOB error
– per tree variable importance
– per tree OOB predictions

Simulated Data Study
• Synthetic dataset of 2.5M variables and 5000 samples
• 5 informative variables with dichotomous response
• Compare RF importance ranking with the model
• Rank-biased overlap (RBO) – measure of ranking overlap (with emphasis on highly ranked elements)
[Plot: RBO (axis 0 to 1.5) for w_1, w_2, w_3, w_4, w_5]
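As an illustration of how rank-biased overlap weights the top of a ranking, here is a minimal sketch of the truncated form, RBO = (1 - p) * sum over depths d of p^(d-1) * A_d, where A_d is the fraction of items shared by the two top-d prefixes. The persistence parameter `p = 0.9`, the function name and the example rankings are assumptions; the talk does not say which RBO variant or parameter was used.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap of two duplicate-free rankings."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    overlap = 0  # size of the intersection of the two top-d prefixes
    total = 0.0
    for d in range(1, depth + 1):
        a, b = ranking_a[d - 1], ranking_b[d - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item may match something seen earlier in the other list.
            overlap += (a in seen_b) + (b in seen_a)
        seen_a.add(a)
        seen_b.add(b)
        total += p ** (d - 1) * overlap / d  # agreement at depth d, discounted
    return (1.0 - p) * total


# Hypothetical rankings of the five informative variables.
truth = ["w_1", "w_2", "w_3", "w_4", "w_5"]
swapped = ["w_2", "w_1", "w_3", "w_4", "w_5"]
print(rbo_truncated(truth, truth))
print(rbo_truncated(truth, swapped))
```

Because the sum is cut off at the length of the rankings, identical rankings of length k score 1 - p^k rather than 1; a disagreement at the very top lowers the score more than the same disagreement further down.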

Bone Mineral Density Study
• Osteoporotic fracture is a leading cause of morbidity and mortality, particularly amongst the elderly.
• In 2004 ten million Americans were estimated to have osteoporosis, resulting in 1.5 million fractures per annum.
• Hip fracture is associated with a one-year mortality rate of 36% in men and 21% in women.
Burden of disease of osteoporotic fractures overall is similar to that of colorectal cancer and greater than that of hypertension and breast cancer.
Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.

Bone Mineral Density Study
• 2036 samples & 288,768 SNPs
• Replicates 21 of 26 known associated genes
• Identifies 2 novel loci (known association with BMD)
• Provides strong evidence for a further 4 loci
Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.

BMD - VariantSpark Results
Known BMD locations have significantly higher ranking (Mann-Whitney U, p = 1.3e-7)
A few novel highly ranked locations with plausible association with BMD: COLEC10, PRODH
Not replicated DCDC5 ranked 9,667 out of 10,000
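The Mann-Whitney U comparison mentioned above asks whether one group tends to receive larger values than another. The sketch below computes only the U statistic by direct pair counting; the importance scores are invented toy numbers, not results from the study, and the function name is illustrative.

```python
def mann_whitney_u(x, y):
    """U statistic for x: count of pairs with x_i > y_j, ties counting 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Invented toy importance scores, for illustration only.
known_loci = [0.9, 0.8, 0.75, 0.4]
other_loci = [0.5, 0.3, 0.2, 0.1, 0.05]

u = mann_whitney_u(known_loci, other_loci)
# Share of (known, other) pairs in which the known locus scores higher.
effect = u / (len(known_loci) * len(other_loci))
print(u, effect)  # 19.0 0.95
```

Turning U into a p-value needs its null distribution; a library routine such as `scipy.stats.mannwhitneyu` handles that step.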

Future work and directions
Technical
• Compare and ‘merge’ with Yggdrasil
• Deployment on cloud platforms
• Further performance improvements
Functional
• Implementation of cutting edge research
• Integration within genomics platforms (GATK4)
• More ML algorithms
Research
• Applications
• Data science research
• Gradient Boosted Trees

References
1. Aleesha Bates (2016) Practical aspects of GWAS Association studies under statistical genetics and GenABEL hands-on tutorial
2. Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population
3. The NHGRI-EBI Catalog of published genome-wide association studies
4. Manolio et al. (2009) Finding the missing heritability of complex diseases
5. Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests
6. Breiman (2001) Random Forests. Machine Learning
7. Sun (2010) Multigenic Modeling of Complex Disease by Random Forests
8. Danecek et al. The Variant Call Format and VCFtools
9. O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information
10. Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
11. Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.

Conclusions
Apache Spark is a feasible platform for machine learning in population scale genomics.
VariantSpark with CursedForest is a promising alternative to traditional GWAS approaches.
Data shape, type, etc. matter – different optimizations are needed.

Thank You
Email: piotr.szul@data61.csiro.au
Github: https://github.com/csirobigdata/variant-spark
