Spark meetup london share and analyse genomic data at scale with spark, adam, tachyon and the spark notebook

by Data Fellas,
Spark London Meetup July, 1st ‘15
Share and analyse genomic data
at scale with Spark, Adam, Tachyon and the Spark Notebook

Outline
PART I
Adam: genomics on Spark
1K Genomes in Adam on S3
Explore: Compute Stats
Learn: train a model
PART II
GA4GH: Standard for Genomics
med-at-scale project
Explore: using Standards
Create custom micro services

Andy Petrella
@noootsab
Maths
Scala
Apache Spark
Spark Notebook
Trainer
Data Banana
Xavier Tordoir
@xtordoir
Physics
Bioinformatics
Scala
Spark

PART I
Spark & Genomics
Adam: genomics on Spark
1K Genomes in Adam on S3
Explore: Compute Stats
Learn: train a model
So that’s the
thing that
separates us?

Adam
What is genomics data
Okay, sounds
good. Give me
two of them!
Genome is an important factor in health:
Medical Diagnostics
Drug response
Diseases mechanisms
…

Adam
You mean devs
are slacking
off?
On the data production:
Fast biotech progress
Not so fast IT progress?

Adam
No! They’re
just sticky
bubbles...
Sequence {A, T, G, C}
3 billion bases

Adam
Okay, a lot of
bubbles.
3 billion bases
… x 30 (x 60?)

Adam
C’mon. a big
mess of plenty
of lil’ bubbles
then.
On the data production: massively parallel
3 billion bases
… x 30 (x 60?)
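
A back-of-the-envelope sketch of what those figures mean in raw volume. The 3 billion bases and the 30x/60x coverage come from the slides; the read length of 100 bases and the cost of two bytes per sequenced base (one for the base, one for its quality score) are illustrative assumptions, not numbers from the talk.

```python
# Rough size of the raw sequencing output for one genome.
GENOME_BASES = 3_000_000_000   # from the slide: ~3 billion bases
READ_LENGTH = 100              # assumption: short reads of 100 bases
BYTES_PER_BASE = 2             # assumption: 1 byte base + 1 byte quality


def sequencing_volume(coverage):
    """Return (number of reads, uncompressed gigabytes) for a coverage depth."""
    sequenced_bases = GENOME_BASES * coverage
    reads = sequenced_bases // READ_LENGTH
    gigabytes = sequenced_bases * BYTES_PER_BASE / 1e9
    return reads, gigabytes


for coverage in (30, 60):
    reads, gigabytes = sequencing_volume(coverage)
    print(f"{coverage}x coverage: {reads:,} reads, ~{gigabytes:,.0f} GB uncompressed")
```

Hundreds of millions of independent reads per genome is the kind of workload that splits naturally across a cluster.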

Adam
Ah, that
explains why
the black bars
are different

Adam
Dude... Tens of
millions

Adam
Staaaaaaph Tens of
millions
1000’s
1,000,000’s
…

Adam
‘coz it makes
sparkling
bubbles, right?
Ok, looks like Apache Spark
makes a lot of sense here …

Adam
An understandable model
Well done, a
spec as text in
a PDF…

Adam
Take that

Adam
Dunno what is
a Genotype but
it contains a
Variant.
Apparently.

Adam
Yeaaah:
generate
client == more
slack
Adam provides an Avro
schema

Adam
An efficient storage
Machismo in
I.T., what a
flaw!
● Distribute data
● Schema based
● Read/query efficient
● Compact

Adam
That’s a quick
step
● Distribute data
● Schema based
● Compact
PARQUET!

Adam
Is Eve okay to
use the
parquet for
that?
● Distribute data
● Schema based
● Compact
PARQUET!
Adam provides Parquet as storage format
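
Why a columnar, schema-based format ends up compact and query efficient can be shown with a toy example. This is only an illustration of the idea (pivot rows into columns, then collapse repeated values); it is not Parquet's actual on-disk layout, and the record fields below are hypothetical rather than the ADAM schema.

```python
from itertools import groupby

# Hypothetical, heavily simplified genotype records (field names are illustrative).
records = [
    {"contig": "1", "position": 100, "sample": "S1", "alleles": "A/A"},
    {"contig": "1", "position": 100, "sample": "S2", "alleles": "A/G"},
    {"contig": "1", "position": 200, "sample": "S1", "alleles": "C/C"},
    {"contig": "1", "position": 200, "sample": "S2", "alleles": "C/C"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per field."""
    return {field: [row[field] for row in rows] for field in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(records)
encoded = {field: run_length_encode(values) for field, values in columns.items()}

# A query that needs only one field reads a single column, not whole rows.
print(encoded["contig"])
print(encoded["position"])
```

Genomic records repeat the same contig, sample and position values over and over, which is why this kind of layout pays off.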

Adam
A clean API
Object
Wrappedy
adam Context

Adam
A clean API
I could have
done this as a
one liner
adam Context
IO methods

Adam
A clean API
At least, it’s
going to be
simpler than
the chemistry
● Scala classes generated from Avro
● Data loaded as RDDs
● Functions on RDDs
○ Write to HDFS
○ Genomic object manipulations
○ Primitives to query genomics datasets

Adam
Part of a pipeline
human | Seq | SNAP | Avocado | Adam | GA4GH
ADAM is a JVM library leveraging
- Spark
- Avro
- Parquet
It still needs to be combined with sources
(SNAP).
Adam data is part of processes (Avocado).
It can also be the source for external
processing and learning (like MLlib).

Thousand Genomes
Open Data Set
Games without
Frontiers
1000 genomes: http://www.1000genomes.org/

Produces BAMs, VCFs, ...
Thousand Genomes
Why do you
complain, they
are
compressed …

Thousand Genomes
Where are the data
DNA Russian
roulette:
which is
fastest?
● EBI FTP: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
● NCBI FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/
● S3: http://aws.amazon.com/1000genomes/
● GS: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp

Thousand Genomes
Adam that shit on S3
Hmmm like in
the good old
days of HPC
The bad part …
● get the vcf.gz file on local disk (& time for a
coffee)
● uncompress (& go for lunch)
● put in HDFS (& take dessert)

Thousand Genomes
what?
No grappa?
The good part …
the Notebook (this one)

Thousand Genomes
Okay, good
enough to wait
a bit…
What did we gain?
● Before: 152 GB (gzipped) in 23 files
● After: 71 GB in 9172 partitions
(43,372,735,220 genotypes)
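
The ratios behind those numbers, worked out as a quick sketch. All four inputs are the figures quoted on the slide; the only assumption is that 1 GB means 10^9 bytes.

```python
# Figures quoted on the slide.
SIZE_BEFORE_GB = 152          # gzipped VCF, 23 files
SIZE_AFTER_GB = 71            # ADAM stored as Parquet
PARTITIONS = 9172
GENOTYPES = 43_372_735_220

size_ratio = SIZE_AFTER_GB / SIZE_BEFORE_GB
genotypes_per_partition = GENOTYPES / PARTITIONS
bytes_per_genotype = SIZE_AFTER_GB * 1e9 / GENOTYPES      # assumes 1 GB = 10^9 bytes
megabytes_per_partition = SIZE_AFTER_GB * 1e3 / PARTITIONS

print(f"size after / before:     {size_ratio:.2f}")
print(f"genotypes per partition: {genotypes_per_partition:,.0f}")
print(f"bytes per genotype:      {bytes_per_genotype:.2f}")
print(f"MB per partition:        {megabytes_per_partition:.2f}")
```

So the data is less than half the gzipped size, and it is split into thousands of small partitions that can be read in parallel instead of 23 large compressed files.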

Explore Genomics
Access the data
Just in case,
you don’t
believe us -_-’
Access data from this notebook

Explore Genomics
Compute statistics
We’re there to
compute,
right?
Compute allele frequencies from this Spark
Notebook
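
For readers without the notebook at hand, here is a minimal sketch of what computing allele frequencies means. It is plain Python on a handful of made-up genotype calls, not the notebook's Spark code; the tuple layout and variant identifiers are hypothetical. On a cluster the same shape would be a key-by-variant step followed by a per-key aggregation.

```python
from collections import Counter, defaultdict

# Hypothetical genotype calls: (variant id, sample id, alleles carried by the sample).
genotypes = [
    ("chr1:100", "S1", ("A", "A")),
    ("chr1:100", "S2", ("A", "G")),
    ("chr1:100", "S3", ("G", "G")),
    ("chr1:200", "S1", ("C", "C")),
    ("chr1:200", "S2", ("C", "T")),
    ("chr1:200", "S3", ("C", "C")),
]


def allele_frequencies(calls):
    """Count every observed allele per variant, then normalise to frequencies."""
    counts = defaultdict(Counter)
    for variant, _sample, alleles in calls:
        counts[variant].update(alleles)   # each sample contributes two alleles
    return {
        variant: {allele: n / sum(c.values()) for allele, n in c.items()}
        for variant, c in counts.items()
    }


freqs = allele_frequencies(genotypes)
for variant, by_allele in freqs.items():
    print(variant, by_allele)
```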

Learn Genomics
The problem
Insane, you’ll
have hard time
with me |:-[
How to deal with heterogeneous data?
● Population stratification
● Identify natural clusters
● Assign genomes to these clusters

Learn Genomics
The dimensions
Wiiiiiiiiiiiiiiiiide
rows
● 1000 Samples (Rows)
● 30,000,000 variants (columns or
variables)
Hard to explore such a feature space…

Learn Genomics
The dimensions
*LDA for
Latent
Dirichlet
Allocation…
Dimensionality reduction?
● Ideal would be a “Genetic” Mixture
measure (LDA* would do that…)
● Or a genetic distance (edit distance)
KMeans & distances to centroids

Learn Genomics
The model
Reduce, train,
validate, infer
● Split training/validation set
● Train KMeans with 25 clusters
● Compute distances to each centroid as
new features
● Train Random Forest
● Validation
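
The "distances to each centroid as new features" step is the dimensionality reduction trick here, so a small sketch may help. This is a plain Python Lloyd iteration on toy data, not the MLlib code from the notebook: it uses 2 clusters instead of 25, a deterministic farthest-first seeding chosen for reproducibility, and made-up genotype vectors (count of alternate alleles per variant). The Random Forest and validation steps are left out.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def initial_centroids(points, k):
    """Deterministic farthest-first seeding (a simplification for this sketch)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(distance(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, iterations=10):
    """Basic Lloyd iteration: assign points to nearest centroid, then recompute means."""
    centroids = initial_centroids(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def centroid_features(point, centroids):
    """Replace a wide genotype vector by its distance to each centroid."""
    return [distance(point, c) for c in centroids]


# Toy samples: alternate allele counts (0, 1 or 2) for six variants.
samples = [
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [2, 2, 1, 2, 2, 2],
    [2, 1, 2, 2, 2, 2],
    [2, 2, 2, 2, 1, 2],
]

centroids = kmeans(samples, k=2)
features = [centroid_features(s, centroids) for s in samples]
for row in features:
    print([round(v, 2) for v in row])
```

Each sample goes from one value per variant to one value per cluster, which is a feature space a classifier can actually handle.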

Learn Genomics
The notebook
Define and train the model in this
Notebook
The whole
shebang?

Adam
Our pipeline
I am a Llama
Convert VCFs to ADAM
Store ADAM to S3
Compute allele frequencies
Store allele frequencies to S3
Compute Minor Allele Frequency distribution
Train a model for stratification
Hmmm… quite some missing pieces, right?
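
One of the pipeline steps above, the Minor Allele Frequency distribution, is simple enough to sketch. This assumes biallelic variants, so the minor allele frequency is the smaller of f and 1 - f; the input frequencies and the choice of five bins are made up for illustration.

```python
from collections import Counter


def minor_allele_frequency(alt_frequency):
    """For a biallelic variant, the minor allele is the rarer of the two."""
    return min(alt_frequency, 1.0 - alt_frequency)


def maf_histogram(alt_frequencies, bins=5):
    """Count variants per MAF bin; MAF always lies in [0, 0.5]."""
    width = 0.5 / bins
    histogram = Counter()
    for f in alt_frequencies:
        maf = minor_allele_frequency(f)
        index = min(int(maf / width), bins - 1)   # MAF == 0.5 goes in the last bin
        histogram[index] += 1
    return [histogram[i] for i in range(bins)]


# Hypothetical alternate allele frequencies for eight variants.
alt_frequencies = [0.01, 0.02, 0.97, 0.15, 0.5, 0.72, 0.33, 0.999]
print(maf_histogram(alt_frequencies))
```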

PART II
Standards & Micro Services
Wake up!
GA4GH: Standard for Genomics
med-at-scale project
Explore: using Standards
Create custom micro services

GA4GH
Let’s fix the baseline
In I.T. it’s easy
everything is
standardized…
Global Alliance for Genomics and Health
http://genomicsandhealth.org/
http://ga4gh.org/
Framework for responsible data sharing
● Define schemas
● Define services
Along with ethical, legal, security and clinical aspects

GA4GH
models
… everybody
has their own
standard

GA4GH
Services
But a shared
schema is a bit
better!

GA4GH
Metadata
The data of my
data is also
my data
Work In Progress
● Individual
● Sample
● Experiment
● Dataset
● IndividualGroup
● Analysis
But still very young
and too centered on data
Beacon ⁽*⁾
Tells the world you have data.
Clearly not enough
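
To make "tells the world you have data" concrete: a beacon answers a yes/no question about whether a given allele has been seen at a given position, and nothing more. The sketch below captures only that idea; the function name, the in-memory set and the response shape are hypothetical and are not the GA4GH Beacon schema.

```python
# Hypothetical index of the alleles a site holds: (reference name, position, allele).
# A real deployment would query a datastore instead of an in-memory set.
known_alleles = {
    ("1", 100, "G"),
    ("1", 200, "T"),
    ("X", 5000, "A"),
}


def beacon_query(reference_name, position, allele):
    """Answer only whether the allele has been observed, not who carries it."""
    return {"exists": (reference_name, position, allele) in known_alleles}


print(beacon_query("1", 100, "G"))
print(beacon_query("1", 100, "C"))
```

A boolean is useful for discovery, but any actual analysis needs the richer services discussed next.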

Med At Scale
By Data Fellas
Existing scalable implementation:
Google Genomics
Uses
● BigQuery
● Google cloud computing
● Dremel
● …
That’s what
happens when
you think you
have…

Med At Scale
By Data Fellas
Google Genomics is pushing Hard
…

Med At Scale
Scalability first
BIG
There is another scalable implementation:
Med At Scale, by Data Fellas
Uses
● Apache Spark
● Adam
● S3
● HDFS
● …

Med At Scale
Scalability first
Data Fellas is pushing TOO
BIG

Med At Scale
Composability
very BIG
GA4GH defines quite a few methods, or
services
They don’t all have the same requirements
in terms of exposure and data processing
→ micro services for the win
Allows granular deployment and
composition/chaining of methods to
answer a global question

Med At Scale
Customization
Data Fellas is a data science company
Thus our goal is to expose data analyses
A data analysis is
● elaborated in a notebook
● validated on a cluster
● deployed as a micro service itself
Still defining a Schema and Service
VERY VERY BIG

Med At Scale
Ready for the load
Balls!
We saw that one row has
30,000,000 columns
The queries are slicing and dicing those columns
→ views are huge
Hence, Tachyon via RDD.persist/save will optimize
the co-located queries in space and time.
The hard part is (and will be) to size the Tachyon cluster
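
A rough sizing sketch, to give a feel for the problem. The 1,000 samples and 30,000,000 variants come from the earlier slides; everything else is an assumption made for illustration: one byte per (sample, variant) cell, 16 GB of memory per worker, and only half of it available for cached views.

```python
import math

SAMPLES = 1_000            # rows, from the slides
VARIANTS = 30_000_000      # columns, from the slides
BYTES_PER_GENOTYPE = 1     # assumption: one byte per (sample, variant) cell


def view_size_gb(samples, variants, bytes_per_cell=BYTES_PER_GENOTYPE):
    """Size of a dense sample x variant view, with 1 GB = 10^9 bytes."""
    return samples * variants * bytes_per_cell / 1e9


def workers_needed(size_gb, memory_per_worker_gb, headroom=0.5):
    """Workers required if only `headroom` of each worker's memory holds cached views."""
    usable_gb = memory_per_worker_gb * headroom
    return math.ceil(size_gb / usable_gb)


full_view = view_size_gb(SAMPLES, VARIANTS)
print(f"full matrix: ~{full_view:.0f} GB")
print(f"workers at 16 GB each: {workers_needed(full_view, 16)}")
```

The real difficulty is that the number of distinct views kept in memory is driven by the queries, which is unknown up front.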

Med At Scale
Ad Hoc Analytics
Who left the
rats out?
Standards are very important
However, they cannot define everything,
mostly OLAP.
Ad-Hoc analytics are thus allowed on the raw
data using Apache Spark directly.
Of course, interactivity is a key to
performance… hence the Spark-Notebook is
involved.

Med At Scale
How it works
Finally…

Med At Scale
ADAM (and Spark)
Finally…

Med At Scale
MLlib (and Spark)
Finally…

Med At Scale
Efficient binary data
Finally…

Med At Scale
Micro Service
Finally…

Med At Scale
Cache and Collaboration
Finally…

Explore
Using GA4GH endpoints
notebook TIME!
Use Scala/Java Avro client from
the browser.
I give you
Bananas
You give me
Ananas

Customize
Create and Use micro service (WIP)
Planning the
next gear
Remember the frequencies use case?
There is a custom endpoint manually created
We’re working on an Integrated Workflow
In a notebook:
● create the process
● create Cassandra schema
● persist (using connector)
● Define service Avro IDL
● Generate project for DCOS
● Log usage (see next)

Optimization
Query mining (Roadmap)
Always look
at the bright
side
Back to the high dimensionality problem
Caching beforehand is a good solution
but is not optimal.
Plan: analyse the Request/Response
objects and the gathered runtime metrics
to adapt the caching policies: query
mining processes
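
Since this is roadmap material, the following is only a guess at the simplest possible form of query mining: count which regions get requested and cache the most popular ones. The log format, the region tuples and the function name are all hypothetical; a real policy would also weigh the runtime metrics mentioned above.

```python
from collections import Counter

# Hypothetical request log: each entry is the (contig, start, end) region a query asked for.
request_log = [
    ("1", 0, 1_000_000),
    ("2", 5_000_000, 6_000_000),
    ("1", 0, 1_000_000),
    ("X", 100_000, 200_000),
    ("1", 0, 1_000_000),
    ("2", 5_000_000, 6_000_000),
]


def hot_regions(log, top_n=2):
    """Rank requested regions by how often they were queried."""
    return [region for region, _count in Counter(log).most_common(top_n)]


# Regions worth keeping cached under this naive frequency-only policy.
print(hot_regions(request_log))
```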

References
Adam: https://github.com/bigdatagenomics/adam
Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats
GA4GH website: http://genomicsandhealth.org/
GA4GH data working group: http://ga4gh.org/
Spark-Notebook: https://github.com/andypetrella/spark-notebook/
Med-At-Scale: https://github.com/med-at-scale/high-health
Data Fellas: http://data-fellas.guru/

Q/A⁽*⁾
THANKS!
⁽*⁾ or head to the pub (at least beers…)
