by Data Fellas,
Spark London Meetup July, 1st ‘15
Share and analyse genomic data
at scale with Spark, Adam, Tachyon and the Spark Notebook
Outline
PART I
Adam: genomics on Spark
1K Genomes in Adam on S3
Explore: Compute Stats
Learn: train a model
PART II
GA4GH: Standard for Genomics
med-at-scale project
Explore: using Standards
Create custom micro services
Andy Petrella
@noootsab
Maths
Scala
Apache Spark
Spark Notebook
Trainer
Data Banana
Xavier Tordoir
@xtordoir
Physics
Bioinformatics
Scala
Spark
PART I
Spark & Genomics
Adam: genomics on Spark
1K Genomes in Adam on S3
Explore: Compute Stats
Learn: train a model
So that’s the
thing that
separates us?
Adam
What is genomics data
Okay, sounds
good. Give me
two of them!
The genome is an important factor in health:
Medical diagnostics
Drug response
Disease mechanisms
…
Adam
What is genomics data
You mean devs
are slacking
off?
On the data production:
Fast biotech progress
Not so fast IT progress?
Adam
What is genomics data
No! They’re
just sticky
bubbles...
On the data production:
Sequence {A, T, G, C}
3 billion bases
Adam
What is genomics data
Okay, a lot of
bubbles.
On the data production:
Sequence {A, T, G, C}
3 billion bases
… x 30 (x 60?)
Adam
What is genomics data
C’mon. a big
mess of plenty
of lil’ bubbles
then.
On the data production: massively parallel
Sequence {A, T, G, C}
3 billion bases
… x 30 (x 60?)
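To make the numbers on this slide concrete, here is a rough back-of-the-envelope sketch in Python. It only multiplies the slide's figures (3 billion bases, 30x or 60x coverage). The 2-bits-per-base packing is an assumption made for illustration; real sequencing files also carry quality scores and metadata, so they are larger.

```python
# Rough size of the raw sequence data for one genome, using the slide's numbers.
GENOME_BASES = 3_000_000_000   # about 3 billion bases (from the slide)
BITS_PER_BASE = 2              # assumption: {A, T, G, C} fits in 2 bits, no quality scores


def sequenced_bases(coverage: int) -> int:
    """Total bases produced when each position is read `coverage` times on average."""
    return GENOME_BASES * coverage


def packed_gigabytes(n_bases: int) -> float:
    """Size in GB (10**9 bytes) if every base were packed into 2 bits."""
    return n_bases * BITS_PER_BASE / 8 / 1e9


for coverage in (1, 30, 60):
    n = sequenced_bases(coverage)
    print(f"{coverage:>2}x: {n:,} bases, ~{packed_gigabytes(n):.2f} GB packed")
```

Even under this optimistic packing, a single genome at 30x is tens of gigabytes, which is why the next slides talk about parallel processing.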
Adam
What is genomics data
Ah, that
explains why
the black bars
are different
Adam
What is genomics data
Dude... Tens of
millions
Adam
What is genomics data
Staaaaaaph Tens of
millions
1000’s
1,000,000’s
…
Adam
What is genomics data
‘coz it makes
sparkling
bubbles, right?
Ok, looks like Apache Spark
makes a lot of sense here …
Adam
An understandable model
Well done, a
spec as text in
a PDF…
Adam
An understandable model
Take that
Adam
An understandable model
Dunno what is
a Genotype but
it contains a
Variant.
Apparently.
Adam
An understandable model
Yeaaah:
generate
client == more
slack
Adam provides an avro
schema
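As a rough illustration of the "a Genotype contains a Variant" relationship, here is a minimal Python sketch. The class and field names are hypothetical and heavily simplified; they are not the actual Avro schema shipped by Adam (see Bdg-Formats in the references for the real one).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Variant:
    """A position in the genome where an alternate allele was observed (hypothetical fields)."""
    contig: str
    position: int
    reference_allele: str
    alternate_allele: str


@dataclass(frozen=True)
class Genotype:
    """What one sample carries at one variant (hypothetical fields)."""
    sample_id: str
    variant: Variant
    alleles: Tuple[str, ...]  # one entry per chromosome copy, e.g. ("Ref", "Alt")

    def alt_count(self) -> int:
        """Number of copies of the alternate allele in this sample."""
        return sum(1 for allele in self.alleles if allele == "Alt")


variant = Variant(contig="1", position=12345, reference_allele="A", alternate_allele="G")
genotype = Genotype(sample_id="sample-1", variant=variant, alleles=("Ref", "Alt"))
print(genotype.variant.position, genotype.alt_count())
```

With a schema-first approach, classes like these are generated from the schema rather than written by hand, which is the point of the slide.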
Adam
An efficient storage
Machismo in I.T.,
what a
flaw!
● Distribute data
● Schema based
● Read/query efficient
● Compact
Adam
An efficient storage
That’s a quick
step
● Distribute data
● Schema based
● Read/query efficient
● Compact
PARQUET!
Adam
An efficient storage
Is Eve okay to
use the
parquet for
that?
● Distribute data
● Schema based
● Read/query efficient
● Compact
PARQUET!
Adam provides parquet as storage format
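The "read/query efficient" and "compact" bullets can be illustrated with a toy column-oriented layout. This is only a sketch of the general idea in plain Python, with made-up records and field names; it is not Parquet's actual on-disk format or API.

```python
from itertools import groupby

# Row-oriented records: one dict per genotype call (toy data, hypothetical fields).
rows = [
    {"sample": "S1", "contig": "1", "position": 100, "alt_count": 0},
    {"sample": "S2", "contig": "1", "position": 100, "alt_count": 1},
    {"sample": "S3", "contig": "1", "position": 100, "alt_count": 1},
    {"sample": "S1", "contig": "1", "position": 250, "alt_count": 2},
]


def to_columns(records):
    """Pivot a list of records into one list of values per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# A query that needs a single field only touches that field's column.
total_alt = sum(columns["alt_count"])

# Columns with long runs of identical values compress well.
encoded_contig = run_length_encode(columns["contig"])
encoded_position = run_length_encode(columns["position"])
print(total_alt, encoded_contig, encoded_position)
```

Reading one column instead of whole rows, and encoding repetitive columns compactly, are the two properties the slide attributes to the storage format.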
Adam
A clean API
Object
Wrappedy
adam Context
Adam
A clean API
I could have
done this as a
one liner
adam Context
IO methods
Adam
A clean API
At least, it’s
going to be
simpler than
the chemistry
● Scala classes generated from Avro
● Data loaded as RDDs
● Functions on RDDs
○ write to HDFS
○ genomic object manipulations
○ primitives to query genomics datasets
Adam
Part of a pipeline
human | Seq | SNAP | Avocado | Adam | GA4GH
ADAM is a JVM library leveraging
- Spark
- Avro
- Parquet
It still needs to be combined with sources (SNAP).
Adam data is part of processes (Avocado).
It can also be the source for external processing and learning (like MLlib).
Thousand Genomes
Open Data Set
Games without
Frontiers
1000 genomes: http://www.1000genomes.org/
Produces BAMs, VCFs, ...
Thousand Genomes
Why do you
complain, they
are
compressed …
Thousand Genomes
Where are the data
DNA Russian
roulette:
which is
fastest?
● EBI FTP: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
● NCBI FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/
● S3: http://aws.amazon.com/1000genomes/
● GS: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp
Thousand Genomes
Adam that shit on S3
Hmmm like in
the good old
days of HPC
The bad part …
● get the vcf.gz file on local disk (& time for a
coffee)
● uncompress (& go for lunch)
● put in HDFS (& take dessert)
Thousand Genomes
Adam that shit on S3
what?
No grappa?
The good part …
the Notebook (this one)
Thousand Genomes
Adam that shit on S3
Okay, good
enough to wait
a bit…
What did we gain?
● before: 152 GB (gzipped) in 23 files
● After: 71 GB in 9172 partitions
(43,372,735,220 genotypes)
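The gain can be quantified directly from the figures on this slide. The short Python sketch below assumes "GB" means 10**9 bytes and treats the slide's sizes as rounded values, so the results are approximate.

```python
GB = 10**9  # assumption: decimal gigabytes

before_bytes = 152 * GB        # gzipped VCF, 23 files (from the slide)
after_bytes = 71 * GB          # Adam output (from the slide)
partitions = 9172              # from the slide
genotypes = 43_372_735_220     # from the slide

size_ratio = before_bytes / after_bytes
bytes_per_genotype = after_bytes / genotypes
mb_per_partition = after_bytes / partitions / 10**6

print(f"size ratio (before/after): {size_ratio:.2f}")
print(f"bytes per genotype:        {bytes_per_genotype:.2f}")
print(f"average partition size:    {mb_per_partition:.2f} MB")
```

So the converted data set is less than half the size of the gzipped input, costs under two bytes per genotype, and is split into thousands of small partitions that can be read in parallel.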
Explore Genomics
Access the data
Just in case,
you don’t
believe us -_-’
Access data from this notebook
Explore Genomics
Compute statistics
We’re there to
compute,
right?
Compute Freqs from this spark
notebook
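The notebook itself is not reproduced in these slides, so here is a minimal plain-Python sketch of what computing allele frequencies means. The variant identifiers, sample names and the diploid assumption are all made up for illustration; on the real data the same grouping would run as a distributed aggregation in Spark.

```python
from collections import defaultdict

PLOIDY = 2  # assumption: diploid calls, two allele copies per sample

# (variant_id, sample_id, alt_allele_count): toy data, not from 1000 Genomes.
genotypes = [
    ("v1", "S1", 0), ("v1", "S2", 1), ("v1", "S3", 2),
    ("v2", "S1", 1), ("v2", "S2", 0), ("v2", "S3", 0),
]


def allele_frequencies(calls):
    """Alternate allele frequency per variant: alt copies / total copies."""
    alt = defaultdict(int)
    total = defaultdict(int)
    for variant_id, _sample_id, alt_count in calls:
        alt[variant_id] += alt_count
        total[variant_id] += PLOIDY
    return {variant_id: alt[variant_id] / total[variant_id] for variant_id in total}


def minor_allele_frequency(frequency):
    """Frequency of whichever allele (reference or alternate) is rarer."""
    return min(frequency, 1.0 - frequency)


frequencies = allele_frequencies(genotypes)
for variant_id, frequency in sorted(frequencies.items()):
    print(variant_id, round(frequency, 3), round(minor_allele_frequency(frequency), 3))
```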
Learn Genomics
The problem
Insane, you’ll
have hard time
with me |:-[
How to deal with heterogeneous data?
● Population stratification
● Identify natural clusters
● Assign genomes to these clusters
Learn Genomics
The dimensions
Wiiiiiiiiiiiiiiiiide
rows
● 1000 Samples (Rows)
● 30,000,000 variants (columns or
variables)
Hard to explore such a feature space…
Learn Genomics
The dimensions
*LDA for
Latent
Dirichlet
Allocation…
Dimensionality reduction?
● Ideal would be a “Genetic” Mixture
measure (LDA* would do that…)
● Or a genetic distance (edit distance)
KMeans & distances to centroids
Learn Genomics
The model
Reduce, train,
validate, infer
● Split training/validation set
● Train KMeans with 25 clusters
● Compute distances to each centroid as
new features
● Train Random Forest
● Validation
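The dimensionality-reduction step (KMeans, then distances to each centroid as new features) can be sketched in plain Python. This is a toy version under stated assumptions: four made-up samples with four variants each, 2 clusters instead of 25, a naive deterministic initialisation, and no train/validation split or Random Forest. The talk's notebook relies on Spark for the real thing.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; centroids start at the first k points (naive init)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


def centroid_features(point, centroids):
    """Replace a very wide row by its distances to the k centroids."""
    return [distance(point, c) for c in centroids]


# Rows are samples, columns are alt-allele counts per variant (toy data).
samples = [
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [2, 2, 0, 1],
    [2, 1, 0, 2],
]

centroids = kmeans(samples, k=2)
reduced = [centroid_features(s, centroids) for s in samples]
for row in reduced:
    print([round(value, 3) for value in row])
```

Each sample goes from one column per variant to one column per cluster; with 25 clusters, a 30,000,000-column row becomes a 25-column row that a classifier can handle.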
Learn Genomics
The notebook
Define and train the model in this
Notebook
The whole
shebang?
Adam
Our pipeline
I am a Llama
Convert VCFs to ADAM
Store ADAM on S3
Compute allele frequencies
Store allele frequencies on S3
Compute the minor allele frequency distribution
Train a model for stratification
Hmmm… quite some missing pieces, right?
PART II
Standards & Micro Services
Wake up!
GA4GH: Standard for Genomics
med-at-scale project
Explore: using Standards
Create custom micro services
GA4GH
Let’s fix the baseline
In I.T. it’s easy
everything is
standardized…
Global Alliance for Genomics and Health
http://genomicsandhealth.org/
http://ga4gh.org/
Framework for responsible data sharing
● Define schemas
● Define services
Along with ethical, legal, security and clinical aspects
GA4GH
models
… everybody
has his own
standard
GA4GH
Services
But a shared
schema is a bit
better!
GA4GH
Metadata
The data of my
data is also
my data
Work In Progress
● Individual
● Sample
● Experiment
● Dataset
● IndividualGroup
● Analysis
But still very young
and too data-centric
Beacon ⁽*⁾
Tells the world you have data.
Clearly not enough
Med At Scale
By Data Fellas
Existing scalable implementation:
Google Genomics
Uses
● BigQuery
● Google cloud computing
● Dremel
● …
That’s what
happens when
you think you
have…
Med At Scale
By Data Fellas
Google Genomics is pushing Hard
…
Med At Scale
Scalability first
BIG
There is another scalable implementation:
Med At Scale, by Data Fellas
Uses
● Apache Spark
● Adam
● S3
● HDFS
● …
Med At Scale
Scalability first
Data Fellas is pushing TOO
BIG
Med At Scale
Composability
very BIG
GA4GH defines quite a few methods, or
services
They don’t all have the same requirements
in terms of exposure and data processing
→ micro services for the win
This allows granular deployment and
composition/chaining of methods to
answer a global question
Med At Scale
Customization
Data Fellas is a data science company
Thus our goal is to expose data analyses
A data analysis is
● elaborated in a notebook
● validated on a cluster
● deployed as a micro service itself
Still defining a Schema and Service
VERY VERY BIG
Med At Scale
Ready for the load
Balls!
We saw that one row has
30,000,000 columns
The queries slice and dice those columns
→ views are huge
Hence, Tachyon via RDD.persist/save will optimize
the co-located queries in space and time.
The hard part is (and will be) sizing the Tachyon cluster
Med At Scale
Ad Hoc Analytics
Who left the
rats out?
Standards are very important
However, they cannot define everything,
mostly OLAP.
Ad-Hoc analytics are thus allowed on the raw
data using Apache Spark directly.
Of course, interactivity is a key to
performance… hence the Spark-Notebook is
involved.
Med At Scale
How it works
Finally…
Med At Scale
ADAM (and Spark)
Finally…
Med At Scale
MLlib (and Spark)
Finally…
Med At Scale
Efficient binary data
Finally…
Med At Scale
Micro Service
Finally…
Med At Scale
Cache and Collaboration
Finally…
Explore
Using GA4GH endpoints
notebook TIME!
Use the Scala/Java Avro client from
the browser.
I give you
Bananas
You give me
Ananas
Customize
Create and Use micro service (WIP)
Planning the
next gear
Remember the frequencies use case?
There is a custom endpoint manually created
We’re working on an Integrated Workflow
In a notebook:
● create the process
● create Cassandra schema
● persist (using connector)
● define the service in Avro IDL
● generate project for DCOS
● log usage (see next)
Optimization
Query mining (Roadmap)
Always look
at the bright
side
Back to the high dimensionality problem
Caching beforehand is a good solution
but is not optimal.
Plan: analyse the Request/Response
objects and the gathered runtime metrics
to adapt the caching policies -- query
mining processes
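The slide only states the plan, so the following is a hypothetical sketch of what a first query-mining step could look like: count which genomic regions are requested most often and treat those as candidates for caching. The log format, the inclusive `(contig, start, end)` ranges, the bucket size and the function names are all assumptions made for illustration, not part of Med At Scale.

```python
from collections import Counter

BUCKET_SIZE = 1_000_000  # assumption: group positions into 1 Mb buckets


def touched_buckets(contig, start, end, bucket_size=BUCKET_SIZE):
    """Coarse (contig, bucket) keys covered by an inclusive position range."""
    first = start // bucket_size
    last = end // bucket_size
    return [(contig, bucket) for bucket in range(first, last + 1)]


def hot_regions(request_log, top_n):
    """Most frequently requested buckets, as candidates for the cache."""
    counts = Counter()
    for contig, start, end in request_log:
        counts.update(touched_buckets(contig, start, end))
    return [region for region, _count in counts.most_common(top_n)]


# Hypothetical request log: (contig, start, end) of each range query served.
request_log = [
    ("1", 100_000, 900_000),
    ("1", 500_000, 1_500_000),
    ("2", 0, 10),
    ("1", 200_000, 300_000),
]

print(hot_regions(request_log, top_n=1))
```

A real policy would also weigh the runtime metrics mentioned on the slide (latency, view size) rather than request counts alone.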
References
Adam: https://github.com/bigdatagenomics/adam
Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats
GA4GH website: http://genomicsandhealth.org/
GA4GH data working group: http://ga4gh.org/
Spark-Notebook: https://github.com/andypetrella/spark-notebook/
Med-At-Scale: https://github.com/med-at-scale/high-health
Data Fellas: http://data-fellas.guru/
Q/A⁽*⁾
THANKS!
⁽*⁾ or head to the pub (at least beers…)
