FINDING NEEDLES IN GENOMIC
HAYSTACKS WITH “WIDE”
RANDOM FOREST
Piotr Szul
CSIRO Data61
Which needle is the right one?
All humans carry between 200 and 800 mutations that disrupt the
function of a gene.
The human genome is 3 billion letters long.
Finding genetic underpinnings for diseases and phenotypic traits
What are the biological mechanisms?
Who is at risk for a disease?
How to prevent and treat?
5319 talented staff
$1 billion+ budget
Working with over 2800+ industry partners
55 sites across Australia
Top 1% of global research agencies
Each year 6 CSIRO technologies contribute $5 billion to the economy
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit East talk by Piotr Szul
Agenda
• Intro to Genome Wide Association Studies
• Variant Spark and “Cursed Forest”
• GWAS use-cases
Genome Wide Association Studies
Image courtesy of Pasieka, Science Photo Library
1000+ samples
Relatively common > 1%
~ 500,000 SNPs
Look at the data
Typical GWAS: 1M variants x 5K samples
Full genome: 80M variants x 2.5K samples
[Figure: genotype matrix of variants (~10^6) x samples (~10^3) with entries 0, 1 or 2. The variants x samples matrix is transposed so that the variants become the predictors, which are then associated with a 1 x samples response.]
[Figure: number of samples (axis 0 to 50,000) against number of variants (axis 100,000 to 100,000,000); legend: Studies, 1000 Genomes.]
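The reshaping on this slide can be sketched on a toy matrix. This is only an illustration: the values and the `response` vector are made up, and reading 0/1/2 as the number of alternate alleles is an assumption, not something the slide states.

```python
# Toy genotype matrix in "variants x samples" layout (one row per variant).
# Entries 0/1/2 are assumed to count alternate alleles (an assumption).
variants_by_samples = [
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 2, 0, 0],
]

# Hypothetical response: one value per sample (1 x samples).
response = [0, 1, 0, 1]


def transpose(matrix):
    """Turn a variants x samples matrix into samples x variants."""
    return [list(row) for row in zip(*matrix)]


# After transposing, each row is one sample and each column one predictor.
predictors = transpose(variants_by_samples)

for sample_row, label in zip(predictors, response):
    print(sample_row, "->", label)
```

In the transposed layout the number of columns (predictors) far exceeds the number of rows (samples), which is the "wide" shape the title refers to.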
GWAS
[Figure: number of GWAS studies and associations per year, 2008 to 2015 (axis 0 to 12,000).]
2713 studies
31183 associations
Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population
The NHGRI-EBI Catalog of published genome-wide association studies
Missing Heritability
Manolio et al. (2009) Finding the missing heritability of complex diseases
… human height heritability is ~80% yet more
than 40 associated loci explain only about 5%
of phenotypic variance …
“Dark matter” of
genomics
Epistasis
Traditional approach for interaction modeling ’squares’ the problem size
500,000 SNPs → ~100,000,000,000 pairs
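The order of magnitude on this slide can be checked with a few lines of arithmetic. The sketch counts unordered SNP pairs, which is one reasonable reading of "pairs"; the function name is illustrative.

```python
import math


def snp_pairs(n_snps):
    """Number of unordered pairs of SNPs, n * (n - 1) / 2."""
    return math.comb(n_snps, 2)


pairs = snp_pairs(500_000)
print(f"{pairs:,} pairs")  # on the order of 10^11
```

Counting ordered pairs instead would double this figure, which still lands on the same order of magnitude.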
Random Forest to the rescue
Lunetta et al. (2004) Screening large-scale association study data:
exploiting interactions using random forests
Breiman (2001) Random Forests. Machine Learning
Random Forest in GWAS
• Non-parametric and arbitrarily expressive
• Insensitive to outliers and non-informative predictors
• Stable performance – no overfitting
• Easy to tune
• Built-in error estimate (OOB error)
• Variable importance measures
• Ability to deal with heterogeneous data
• Easy to parallelize and scale on HPC
Sun (2010) Multigenic Modeling of Complex Disease by Random Forests
RF is an appropriate candidate to capture the genetic heterogeneity
underlying the trait because RF itself is an ensemble of many heterogeneous
trees built from uncorrelated subsamples of the original data
VariantSpark
[Figure: time in seconds (axis 0 to 2000) by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), broken down by task: binary conversion, clustering, pre-processing.]
It can cluster 3000 individuals and 80 million variants
O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information
Transformational Bioinformatics Team
Natalie Twine, Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson
Software
Open source (MIT) @ https://github.com/csirobigdata/variant-spark
Random Forest SparkML
Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
• Failing for millions of variables
• Relatively slow
“Cursed Forest”
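[Diagram: the driver broadcasts the initial sample (and, at later levels, the split subsets) to the executors, which hold the variables v1 … vn. Each executor returns its local best split (var, point), and the driver aggregates these into the global best split for each tree node (1; 2,1; 2,2; …).]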
Partition data by variables (columns)
• Columns are “small” – easy partition
• An executor can find (an exact) best split for many variables
• Finding global best split is efficient
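The column-partitioned split search described on this slide can be sketched in plain Python, with no Spark involved. Everything here is illustrative: the function names, the use of Gini impurity, the "value <= point" split rule and the toy data are assumptions, not VariantSpark's actual implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split_for_column(values, labels, subset):
    """Exact best 'value <= point' split for one variable on the given samples."""
    parent = [labels[i] for i in subset]
    best = (0.0, None)  # (impurity decrease, split point)
    for point in sorted({values[i] for i in subset})[:-1]:
        left = [labels[i] for i in subset if values[i] <= point]
        right = [labels[i] for i in subset if values[i] > point]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
        gain = gini(parent) - weighted
        if gain > best[0]:
            best = (gain, point)
    return best


def local_best_split(partition, labels, subset):
    """'Executor' side: best (gain, variable, point) among the columns it holds."""
    best = (0.0, None, None)
    for var, values in partition.items():
        gain, point = best_split_for_column(values, labels, subset)
        if point is not None and gain > best[0]:
            best = (gain, var, point)
    return best


def global_best_split(partitions, labels, subset):
    """'Driver' side: aggregate the local winners into the global best split."""
    local_winners = [local_best_split(p, labels, subset) for p in partitions]
    return max(local_winners, key=lambda t: t[0])


# Toy data: 6 samples, 4 variables spread over 2 column partitions.
labels = [0, 0, 1, 1, 0, 1]
partitions = [
    {"v1": [0, 1, 0, 1, 2, 0], "v2": [0, 0, 0, 0, 0, 0]},
    {"v3": [0, 0, 1, 2, 0, 1], "v4": [1, 1, 1, 0, 2, 2]},
]
initial_sample = list(range(len(labels)))

gain, var, point = global_best_split(partitions, labels, initial_sample)
print(var, point, gain)
```

Only the small sample subset and one (variable, point) candidate per partition cross the "network" in this sketch, which is the property the slide highlights: each partition sees whole columns, so its local split is exact.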
Some implementation tricks
• “Native” data shape
– VCF files are organized by “variables”
• Building by levels and tree batching
– Minimize communication overhead and the number of stages
• Optimized split finding for ordered factors
– The most frequent operation
– Java implementation faster than Scala
• Choice of data representation
– byte representation for variant data
– with sparsity 0.75, a sparse vector is 3x bigger than a byte array
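One way to arrive at the 3x figure is the following back-of-the-envelope calculation. It assumes that "sparsity 0.75" means 75% of the entries are zero, and that a sparse vector stores a 4-byte integer index plus an 8-byte double value per non-zero entry; both are assumptions, and object overheads are ignored.

```python
def byte_array_size(n_values):
    """Dense representation: one byte per genotype value."""
    return n_values


def sparse_vector_size(n_values, sparsity, index_bytes=4, value_bytes=8):
    """Sparse representation: (index, value) stored only for non-zero entries."""
    nonzero = round(n_values * (1.0 - sparsity))
    return nonzero * (index_bytes + value_bytes)


n_values = 10_000  # e.g. one variant across 10,000 samples (illustrative)
ratio = sparse_vector_size(n_values, sparsity=0.75) / byte_array_size(n_values)
print(f"sparse / byte array = {ratio:.1f}x")
```

Under these assumptions the sparse form only becomes smaller than the byte array once more than 11 in 12 entries are zero.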
How fast is it?
16 CPU cores, 32 GB RAM, local mode
Big data performance
• Yarn Cluster (12 workers)
– 16 x Intel Xeon E5-2660 @ 2.20 GHz CPU
– 128 GB of RAM
• Spark 1.6.1 on YARN
– 128 executors
– 6GB / executor (0.75TB)
• Synthetic dataset (mtry = 0.25)
Typical GWAS range
100K trees: 5 – 50h
AWS: ~$215.50
Whole genome range
100K trees: 200 – 2000h
AWS: ~$8620.00
50M variables x 10K samples!
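A rough sanity check of the numbers on this slide, assuming one byte per genotype value and ignoring all JVM and Spark overheads. It says nothing about how VariantSpark actually lays out memory; it only relates the dataset size to the executor memory quoted above.

```python
GIB = 1024 ** 3

# Largest dataset mentioned on the slide.
n_variables = 50_000_000
n_samples = 10_000
raw_bytes = n_variables * n_samples  # one byte per genotype (assumption)
raw_gib = raw_bytes / GIB

# Cluster memory as configured on the slide: 128 executors x 6 GB each.
executors = 128
executor_mem_gib = 6
total_executor_gib = executors * executor_mem_gib

print(f"raw data: {raw_gib:.0f} GiB")
print(f"executor memory: {total_executor_gib} GiB ({total_executor_gib / 1024} TiB)")
```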
Other features
• Various input formats
– VCF, CSV, parquet
• A variety of RF (fine) tuning parameters
– Sampling
– Depths
– Splitting
• Insight into RF model
– cumulative OOB error
– per tree variable importance
– per tree OOB predictions
Simulated Data Study
• Synthetic dataset of 2.5M variables and 5000 samples
• 5 informative variables with dichotomous response
• Compare RF importance ranking with the model
• Rank-biased overlap (RBO) – measure of ranking
overlap (with emphasis on highly ranked elements)
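For reference, a truncated form of rank-biased overlap can be sketched as below. This is a minimal version that sums the weighted agreement only up to the shorter list's length, without the extrapolation used for indefinite rankings; the persistence parameter `p = 0.9` and the example rankings are illustrative assumptions, not values from the study.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap of two rankings without repeated items.

    At each depth d the agreement is |top-d of A ∩ top-d of B| / d,
    weighted by p ** (d - 1), so early ranks dominate the score.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score


truth = ["w_1", "w_2", "w_3", "w_4", "w_5"]
top_swapped = ["w_2", "w_1", "w_3", "w_4", "w_5"]
bottom_swapped = ["w_1", "w_2", "w_3", "w_5", "w_4"]

print(rbo_truncated(truth, top_swapped))
print(rbo_truncated(truth, bottom_swapped))
```

A swap at the top of the ranking lowers the score more than the same swap at the bottom, which is the emphasis on highly ranked elements mentioned above. Note that the truncated score for identical lists of length k is 1 - p^k rather than 1.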
[Figure: RBO (axis 0 to 1.5) for w_1, w_2, w_3, w_4, w_5.]
Bone Mineral Density Study
• Osteoporotic fracture is a leading cause of morbidity
and mortality particularly amongst the elderly.
• In 2004 ten million Americans were estimated to have
osteoporosis, resulting in 1.5 million fractures per
annum.
• Hip fracture is associated with a one year mortality
rate of 36% in men and 21% in women
Burden of disease of osteoporotic fractures overall is
similar to that of colorectal cancer and greater than that of
hypertension and breast cancer
Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel
Genes Affecting Bone Mineral Density and Fracture Risk.
Bone Mineral Density Study
• 2036 samples & 288,768 SNPs
• Replicates 21 of 26 known associated genes
• Identifies 2 novel loci (known association with BMD)
• Provides strong evidence for further 4 loci
Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel
Genes Affecting Bone Mineral Density and Fracture Risk.
BMD - VariantSpark Results
Known BMD locations have
significantly higher ranking
(Mann-Whitney U, p = 1.3e-7)
A few novel highly ranked
locations with plausible
association with BMD: COLEC10,
PRODH
Not replicated DCDC5 ranked
9,667 out of 10,000
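The kind of rank comparison reported on this slide rests on the Mann-Whitney U statistic, sketched here on made-up data. The rank values below are hypothetical and not taken from the study, and the sketch computes only U, not a p-value (a statistics library such as SciPy would normally be used for that).

```python
def mann_whitney_u(ranks_a, ranks_b):
    """U statistic for group A: number of (a, b) pairs in which a is ranked
    better (numerically smaller) than b; ties contribute 0.5."""
    u = 0.0
    for a in ranks_a:
        for b in ranks_b:
            if a < b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical importance ranks (1 = most important).
known_ranks = [3, 10, 25, 40]
other_ranks = [15, 60, 75, 90, 120]

u = mann_whitney_u(known_ranks, other_ranks)
max_u = len(known_ranks) * len(other_ranks)
print(f"U = {u} out of a possible {max_u}")
```

A U close to its maximum means that almost every "known" location outranks almost every other location, which is the pattern a small p-value would reflect.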
Future work and directions
Technical
• Compare and ’merge’ with Yggdrasil
• Deployment on cloud platforms
• Further performance improvements
Functional
• Implementation of cutting-edge research
• Integration within genomics platforms (GATK4)
• More ML algorithms
Research
• Applications
• Data science research
• Gradient Boosted Trees
References
1. Aleesha Bates (2016) Practical aspects of GWAS Association studies under statistical genetics and GenABEL hands-on tutorial
2. Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population
3. The NHGRI-EBI Catalog of published genome-wide association studies
4. Manolio et al. (2009) Finding the missing heritability of complex diseases
5. Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests
6. Breiman (2001) Random Forests. Machine Learning
7. Sun (2010) Multigenic Modeling of Complex Disease by Random Forests
8. Danecek et al. The Variant Call Format and VCFtools
9. O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information
10. Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
11. Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
Conclusions
Apache Spark is a feasible platform for machine learning
in population-scale genomics.
VariantSpark with CursedForest is a promising
alternative to traditional GWAS approaches.
Data shape, type, etc. matter – different optimizations
are needed.
Thank You
Email: piotr.szul@data61.csiro.au
Github: https://github.com/csirobigdata/variant-spark
