Cloud Accelerated Genomics
Allen Day, PhD // Science Advocate
@allenday // #genomics #ml #datascience
Table of Contents
Section 1: Getting from Research to Application… Faster
What are the bottlenecks for translating research into products? Emphasis on information processing.
Section 2: From CompBio Research to CompBio Engineering
Getting results, more of them, and predictably improving
Section 3: Data Integration - Cutting Edge Use Cases
What’s happening right now in industry and academia?
Throughout: How to use Google Cloud?
I’ll introduce specific cloud services, along with examples of how they’ve been used successfully. Compute Engine, Kubernetes, Dataflow, Cloud ML, Genomics API
How to Understand?
Linear B is a syllabic script
that was used for writing
Mycenaean Greek, the
earliest attested form of
Greek. The script predates
the Greek alphabet by
several centuries. The oldest
Mycenaean writing dates to
about 1450 BC.
Hypothetico-Deductive
Method (Iterative)
Organize
Analyze,
Interpret, and
Plan
Choose Data
Acquire
Situation:
Not enough data.
No means to get more.
Dead Language.
Outcome:
Cannot understand.
Also:
Passive learning.
No feedback.
DNA Sequencing Value Chain
[Chart: % effort (0 to 100) per stage across Pre-NGS (~2000), Now, and Future (~2020). Stages: Experiment Design; DNA Sequencing; Secondary Analytics; Analytics, Interpretation, Planning]
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Human Genetics Scenario
[Chart: % effort (0 to 100) per stage across Pre-NGS (~2000), Now, and Future (~2020). Stages: Experiment Design; DNA Sequencing; Secondary Analytics; Analytics, Interpretation, Planning]
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Situation:
Unlimited Free DNA
Result:
Slow to understand.
Q: Why Slow to Understand? A1: Data Processing
[Chart: % effort (0 to 100) per stage across Pre-NGS (~2000), Now, and Future (~2020). Stages: Experiment Design; DNA Sequencing; Secondary Analytics; Analytics, Interpretation, Planning]
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Situation:
We still have an analysis bottleneck
Result:
Slow to understand.
00:20 - Connecting…
01:22 - Link Established
Cloud Accelerated Genomics
GOOGLE CONFIDENTIAL
Google Cloud Platform lets you run your apps on the
same system as Google
So you can focus on what matters
to your science
Google confidential │ Do not distribute
Google is good at handling massive volumes of data
uploads per minute: 300hrs
users: 500M+
search index: 100PB+
query response time: 0.25s
Google is good at handling massive volumes of genomic data
uploads per minute: 300hrs (~6 WGS)
users: 500M+ (>100x US PhDs)
search index: 100PB+ (~1M WGS)
query response time: 0.25s (0.25s)
Google Genomics
August 2015
Google Genomics is more than infrastructure
General-purpose cloud infrastructure: Virtual Machines & Storage, Data Services & Tools
Genomics-specific features: Genomics API
Google’s vision to tackle complex health data
BioQuery Analysis Engine
Data types: Medical Records, Genomics, Devices, Imaging, Patient Reports
Data sources: Baseline Study Data, Private Data, Public Data
Pharma, Health Providers, …
CONFIDENTIAL & PROPRIETARY
3.75 TERABYTES PER HUMAN
1.00 TB GENOME
2.00 TB EPIGENOME
0.70 TB TRANSCRIPTOME
0.06 TB METABOLOME
0.04 TB PROTEOME
~1 MB STANDARD LAB TESTS
5-YR LONGITUDINAL STUDY
BASELINE STUDY: BIG DATA ANALYSIS
Validate a pipeline to process complex phenotypic, biochemical,
and genomic data
● Pilot Study (N=200)
○ Determine optimal biospecimen collection strategy for stable sampling
and reproducible assays
○ Determine optimal assay methodology
○ Validate quality control methods
○ Validate device data against surrogate and primary endpoints
● Baseline Study (N=10,000+)
○ 6 cohorts from low to high risk for cardiovascular and cancer
○ Characterize human systems biology
○ Define normal values for a given parameter in heterogeneous states
○ Predict meaningful events
○ Validate wearable devices for human monitoring
○ Characterize transitions in disease state
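As a rough sanity check on the per-person figures listed above, the sketch below adds up the component volumes and scales them to the two cohort sizes. It assumes decimal units (1 TB = 1000 GB), a single snapshot per participant, and ignores the ~1 MB of lab tests; the function names are illustrative only. Note that the listed components add up to 3.80 TB, slightly above the 3.75 TB headline, which may simply reflect rounding on the slide.

```python
# Per-person data volumes as listed on the slide, in gigabytes.
# Assumption: decimal units, i.e. 1 TB = 1000 GB.
PER_PERSON_GB = {
    "genome": 1000,
    "epigenome": 2000,
    "transcriptome": 700,
    "metabolome": 60,
    "proteome": 40,
}


def per_person_tb(volumes_gb=PER_PERSON_GB):
    """Total volume for one participant, in terabytes."""
    return sum(volumes_gb.values()) / 1000


def cohort_pb(n_participants, volumes_gb=PER_PERSON_GB):
    """Total volume for a cohort, in petabytes (one snapshot per person)."""
    return n_participants * sum(volumes_gb.values()) / 1_000_000


if __name__ == "__main__":
    print(f"per person: {per_person_tb():.2f} TB")
    print(f"pilot (N=200): {cohort_pb(200):.2f} PB")
    print(f"baseline (N=10,000): {cohort_pb(10_000):.2f} PB")
```

Because the study is longitudinal, repeated sampling over five years would multiply these totals by the number of time points, which the slide does not state.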
Public Datasets Project
https://cloud.google.com/bigquery/public-data/
A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a
special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications.
Google pays for the storage of these data sets and provides public access to the data via BigQuery. You pay only for the
queries that you perform on the data (the first 1TB per month is free)
Population-scale Genome Projects
Medical (Human): Platinum Genomes, 1000 Genomes
Veterinary / Agriculture: 1000 Bulls, 10K Dog Genomes
Agriculture: Open Cannabis Project, Genome To Fields, Panzea (1000 Maize)
Personal Genome Project
Human Microbiome Project
NCBI GEO Human 100K
Cancer Genome Atlas
Many Other Interesting Datasets...
PI / Biologist : variant calls for the 1,000 genomes
Information: principal coordinates analysis (1000 genomes)
Knowledge: populations cluster together
Bioinformatics scientist: BigQuery enables fast tertiary analysis
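The step from variant calls to "populations cluster together" can be illustrated with a toy principal coordinates analysis. This is a minimal sketch on simulated data, not the 1000 Genomes analysis shown in the talk: the two populations, their allele frequencies, the sample sizes, and all function names are assumptions made for illustration. It uses only the standard library and finds the leading axis by power iteration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two genotype vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def first_principal_coordinate(dist, iterations=200):
    """Return each sample's position on the leading PCoA axis."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances (Gower's transformation).
    gram = [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]
    # Power iteration converges to the dominant eigenvector.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(
        v * sum(g * w for g, w in zip(row, vec)) for v, row in zip(vec, gram)
    )
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [v * scale for v in vec]


def simulate_genotypes(allele_freqs, rng):
    """Draw one diploid genotype (0, 1 or 2 alternate alleles) per site."""
    return [int(rng.random() < f) + int(rng.random() < f) for f in allele_freqs]


# Two hypothetical populations with opposite allele frequencies at 100 sites.
rng = random.Random(0)
freqs_a = [0.9] * 50 + [0.1] * 50
freqs_b = [0.1] * 50 + [0.9] * 50
samples = [simulate_genotypes(freqs_a, rng) for _ in range(4)]
samples += [simulate_genotypes(freqs_b, rng) for _ in range(4)]

dist = [[euclidean(a, b) for b in samples] for a in samples]
coords = first_principal_coordinate(dist)

if __name__ == "__main__":
    for label, c in zip("AAAABBBB", coords):
        print(label, round(c, 2))
```

Samples from the same simulated population land on the same side of the first axis. At real scale the distance matrix would be computed in BigQuery or Dataflow rather than in nested Python loops.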
Google Cloud Platform
Dataflow + BigQuery
Dataflow: Used for Extract, Transform, Load (ETL), analytics, real-time computation and process orchestration. cloud.google.com/dataflow
BigQuery: Run SQL queries against multi-terabyte datasets in seconds. cloud.google.com/bigquery
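To make "run SQL queries" concrete, here is a hedged sketch of counting variants per chromosome with the `google-cloud-bigquery` Python client. The table name and the `reference_name` column are assumptions about how the public 1000 Genomes data is laid out and should be checked against the current public-data listing; `YOUR_PROJECT` is a placeholder. The client library is imported lazily so the query builder can be used without it installed.

```python
# Assumed location of the public 1000 Genomes variants table; verify before use.
ASSUMED_TABLE = (
    "bigquery-public-data.human_genome_variants."
    "1000_genomes_phase_3_variants_20150220_release"
)


def build_query(table=ASSUMED_TABLE, limit=25):
    """Build a standard-SQL query counting variant rows per reference."""
    return (
        "SELECT reference_name, COUNT(*) AS variant_count\n"
        f"FROM `{table}`\n"
        "GROUP BY reference_name\n"
        "ORDER BY variant_count DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_query(project_id, sql):
    """Execute the query; needs google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # imported lazily on purpose

    client = bigquery.Client(project=project_id)
    return [dict(row.items()) for row in client.query(sql).result()]


if __name__ == "__main__":
    print(build_query())
    # Uncomment once credentials and a billing project are configured:
    # for row in run_query("YOUR_PROJECT", build_query()):
    #     print(row)
```

Queries are billed by bytes scanned, so selecting only the columns needed keeps costs down.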
Example: GATK Analysis Pipeline
Old way: install applications on host (on a virtual machine)
[Diagram: several apps sharing one set of libs on one kernel]
Makefiles, CWL, WDL
Example: GATK Analysis Pipeline
Old way: install applications on host (on a virtual machine)
[Diagram: several apps sharing one set of libs on one kernel]
Makefiles, CWL, WDL
New way: deploy containers
[Diagram: each app packaged with its own libs, all running on one kernel]
Dockerflow: Dataflow + Docker
Benefits
● Decouple process management from host configuration
● Portable across OS distros and clouds
● Consistent environment from development to production
● Immutable images
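The "new way" can be sketched as a pipeline step that is nothing more than an image name plus arguments. The helper below only assembles a `docker run` command line; the image name, tool arguments, and paths are hypothetical placeholders, and running the result requires Docker to be installed. The sketch shows the idea of containerized steps, not Dockerflow itself.

```python
import shlex


def docker_run_command(image, tool_args, host_dir, container_dir="/data"):
    """Assemble argv for running one pipeline step in a container.

    --rm removes the container when the step exits; -v mounts the
    working directory so inputs and outputs persist on the host.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{container_dir}",
        image,
        *tool_args,
    ]


# Hypothetical image and arguments, for illustration only.
cmd = docker_run_command(
    image="example/variant-caller:1.0",
    tool_args=["call-variants", "--input", "/data/sample.bam"],
    host_dir="/tmp/run1",
)

if __name__ == "__main__":
    print(" ".join(shlex.quote(part) for part in cmd))
```

Pinning the image to a fixed tag is what gives the consistent development-to-production environment listed among the benefits.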
Use Case:
Reproducible Science with Docker
● Objective: Build a mutation-detection pipeline
● Provided to competitors
○ Training data set
○ Evaluation data set
● Competitors submit pipelines as Docker images to DREAM Challenge host, Sage Bionetworks
● Submitted pipelines were used to process unseen data set
● Post-competition, Docker images made public
● Incidentally, Google won this competition with a deep-learning-based variant caller called DeepVariant: cloud.google.com/genomics/v1alpha2/deepvariant
Threats to reproducible science.
An idealized version of the hypothetico-deductive model of the scientific method is shown. Various potential threats to this model exist (indicated in red), including hypothesizing after the results are known (HARKing) and lack of data sharing. Together these undermine the robustness of results, and may impact on the ability of science to self-correct.
http://www.nature.com/articles/s41562-016-0021
To run it:
> java -jar target/dockerflow*dependencies.jar \
    --project=YOUR_PROJECT \
    --workflow-file=hello.yaml \
    --workspace=gs://YOUR_BUCKET/YOUR_FOLDER \
    --runner=DataflowPipelineRunner
[Diagram components: Sequencer, DNA Reads, Genomics API, PubSub Queue, Your Variant Caller, Variant Calls, Genomics API, BigQuery, Your Other Tool]
GraphConnect SF 2015 / Graphs Are Feeding The World, Tim Williamson, Data Scientist, Monsanto
https://www.youtube.com/watch?v=6KEvLURBenM
Marker-assisted selection for quantitative traits
https://www.sec.gov/Archives/edgar/data/1110783/000095013402011773/c71992exv99w2.htm
Marker-Assisted Breeding Rapidly Increases Frequency of
Favorable Genes
https://www.slideshare.net/finance28/monsanto-082305a
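Why selection on a marker raises the frequency of a favorable allele so quickly can be seen in a deliberately simplified model. The sketch below uses a textbook haploid selection recurrence, not the breeding scheme in the linked material; the starting frequency of 1/10 and the twofold selection advantage are arbitrary assumptions.

```python
from fractions import Fraction


def next_frequency(p, relative_fitness):
    """One generation of selection in a haploid model.

    Carriers of the favourable allele are chosen as parents
    `relative_fitness` times as often as non-carriers.
    """
    favoured = p * relative_fitness
    return favoured / (favoured + (1 - p))


def trajectory(p0, relative_fitness, generations):
    """Allele frequency at generation 0, 1, ..., generations."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], relative_fitness))
    return freqs


# Assumed values: allele starts at 10%, carriers favoured 2:1.
freqs = trajectory(Fraction(1, 10), Fraction(2), 5)

if __name__ == "__main__":
    for generation, p in enumerate(freqs):
        print(generation, f"{float(p):.3f}")
```

Under these assumptions the allele moves from 10% to a majority within four generations, because the odds of carrying it double each generation.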
Q: Why Slow to Understand? A2: Limited Feedback
[Chart: effort per stage. Stages: Experiment Design; DNA Sequencing; Secondary Analytics; Analytics, Interpretation, Planning]
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Situation:
Data acquisition cost approaches zero
However, still slow to understand, because:
1. Restricted choice of what can be observed, i.e. controlled
modifications and artificial selection
2. Passive Learning. Limited feedback => Low rate of learning
Contrast with active learning...
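The gap between passive and active learning can be shown with a toy problem: locating an unknown threshold on the interval [0, 1] when each experiment only reports whether a chosen point lies at or above it. This example is an illustration added here, not part of the talk; the threshold value and the budget of 10 experiments are arbitrary. The passive learner fixes all its measurement points in advance, while the active learner uses each result to choose the next point.

```python
def make_oracle(threshold):
    """An 'experiment': reports whether x is at or above the hidden threshold."""
    return lambda x: x >= threshold


def passive_interval(oracle, n_queries):
    """Measure a pre-planned, evenly spaced grid with no feedback."""
    lo, hi = 0.0, 1.0
    for i in range(n_queries):
        x = (i + 1) / (n_queries + 1)
        if oracle(x):
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return lo, hi


def active_interval(oracle, n_queries):
    """Choose each measurement from the previous result (bisection)."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        mid = (lo + hi) / 2
        if oracle(mid):
            hi = mid
        else:
            lo = mid
    return lo, hi


oracle = make_oracle(0.37)  # hidden value, assumed for the demo
passive_lo, passive_hi = passive_interval(oracle, 10)
active_lo, active_hi = active_interval(oracle, 10)

if __name__ == "__main__":
    print("passive uncertainty:", passive_hi - passive_lo)
    print("active uncertainty: ", active_hi - active_lo)
```

With the same 10 experiments the passive design narrows the answer to about 1/11 of the range, while the feedback-driven design narrows it to 1/1024.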
[Diagram: Observe / Orient / Decide / Act feedback loop between Scientist and Biological System]
Molecular Sensors: DNA sequencer, Mass spectrometer, etc.
However...
(Technology)-Limited Experimental Capability
Even Moore’s Law / Carlson Curve - also applies to writing DNA
[Diagram: Observe / Orient / Decide / Act feedback loop between Scientist and Biological System]
Molecular Sensors: DNA sequencer, Mass spectrometer, etc.
Environmental Sensors: Laser scanners, Hyperspectral scanners, UAVs, etc.
Bioengineering Tech: DNA synthesizers, CRISPR/Cas9, etc.
Regulate/Measure System I/O
Integration with Geospatial, Management, and Terrestrial Sensor Data
anezconsulting.com/precision-agronomy/
Descartes Labs - Google Cloud Customer
medium.com/@stevenpbrumby/corn-in-the-usa-d487dce84ee1
Cloud ML Engine
TensorFlow
Phenomobile, http://www.mdpi.com/2073-4395/4/3/349/htm
See also: http://www.genomes2fields.org/
Temporo-Spatial Imaging of Growing Plants
Verily: Assisting Pathologists in Detecting Cancer with Deep Learning
research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html
Prediction heatmaps produced by the algorithm had improved so much that the localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint. We were not the only ones to see promising results, as other groups were getting scores as high as 81% with the same dataset.
Model generalized very well, even to images that were acquired from a different hospital using different scanners. For full details, see our paper “Detecting Cancer Metastases on Gigapixel Pathology Images”.
00:20 - Connecting…
01:22 - Link Established
Cloud Vision, TensorFlow, Google Genomics, Dataflow, Cloud ML Engine, Docker
Baseline Study Data, Private Data, Public Data
Build What’s Next
Thank You!
Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience

Cloud Accelerated Genomics

  • 1. Cloud Accelerated Genomics Allen Day, PhD // Science Advocate @allenday // #genomics #ml #datascience
  • 2. Table of Contents Section 1 Section 2 Section 3 Throughout Getting from Research to Application… Faster What are the bottlenecks for translating research into products? Emphasis on information processing. From CompBio Research to CompBio Engineering Getting results, more of them, and predictably improving Data Integration - Cutting Edge Use Cases What’s happening right now in industry and academia? How to use Google Cloud? I’ll introduce specific cloud services, along with examples of how they’ve been used successfully. Compute Engine, Kubernetes, Dataflow, Cloud ML, Genomics API
  • 3. How to Understand? Linear B is a syllabic script that was used for writing Mycenaean Greek, the earliest attested form of Greek. The script predates the Greek alphabet by several centuries. The oldest Mycenaean writing dates to about 1450 BC.
  • 5. Hypothetico-Deductive Method (Iterative) Organize Analyze, Interpret, and Plan Choose Data Acquire Situation: Not enough data. No means to get more. Dead Language. Outcome: Cannot understand. Also: Passive learning. No feedback.
  • 6. DNA Sequencing Value Chain %Effort 0 100 Pre-NGS ~2000 Future ~2020 Now Sboner, et al, 2011. The real cost of sequencing: higher than you think! Secondary Analytics Analytics, Intepretation, Planning Experiment Design DNA Sequencing
  • 7. Human Genetics Scenario Sboner, et al, 2011. The real cost of sequencing: higher than you think! Secondary Analytics Analytics, Intepretation, Planning Experiment Design %Effort 0 100 DNA Sequencing Situation: Unlimited Free DNA Result: Slow to understand. Pre-NGS ~2000 Future ~2020 Now
  • 8. Q: Why Slow to Understand? A1: Data Processing Sboner, et al, 2011. The real cost of sequencing: higher than you think! Secondary Analytics Analytics, Intepretation, Planning Experiment Design %Effort 0 100 DNA Sequencing Situation: We still have an analysis bottleneck Result: Slow to understand. Pre-NGS ~2000 Future ~2020 Now
  • 9. 00:20 - Connecting… 01:22 - Link Established
  • 11. GOOGLE CONFIDENTIAL Google Cloud Platform lets you run your apps on the same system as Google
  • 12. GOOGLE CONFIDENTIAL So you can focus on what matters to your science
  • 13. Google confidential │ Do not distribute Google is good at handling massive volumes of data uploads per minute users search index query response time 300hrs 500M+ 100PB+ 0.25s
  • 14. Google confidential │ Do not distribute Google can is good at handleing massive volumes of genomic data uploads per minute users search index query response time 300hrs 500M+ 100PB+ 0.25s ~6WGS >100x US PhDs ~1M WGS 0.25s
  • 15. Google confidential │ Do not distributeGoogle confidential │ Do not distribute Google Genomics August 2015
  • 16. Google confidential │ Do not distribute Google Genomics is more than infrastructure General-purpose cloud infrastructure Genomics-specific featuresGenomics API Virtual Machines & Storage Data Services & Tools
  • 17. Google confidential │ Do not distribute BioQuery Analysis Engine Medical Records Genomics Devices Imaging Patient Reports Baseline Study Data Private Data Pharma Health Providers … Google’s vision to tackle complex health data Public Data
  • 18. Google confidential │ Do not distribute BioQuery Analysis Engine Medical Records Genomics Devices Imaging Patient Reports Baseline Study Data Private Data Pharma Health Providers … Google’s vision to tackle complex health data Public Data
  • 19. CONFIDENTIAL & PROPRIETARY 3.75 TERABYTES PER HUMAN 1.00 TB GENOME 2.00 TB EPIGENOME 0.70 TB TRANSCRIPTOME 0.06 TB METABOLOME 0.04 TB PROTEOME ~1 MB STANDARD LAB TESTS 5-YR LONGITUDINAL STUDY BASELINE STUDY: BIG DATA ANALYSIS Validate a pipeline to process complex phenotypic, biochemical, and genomic data ● Pilot Study (N=200) ○ Determine optimal biospecimen collection strategy for stable sampling and reproducible assays ○ Determine optimal assay methodology ○ Validate quality control methods ○ Validate device data against surrogate and primary endpoints ● Baseline Study (N=10,000+) ○ 6 cohorts from low to high risk for cardiovascular and cancer ○ Characterize human systems biology ○ Define normal values for a given parameter in heterogeneous states ○ Predict meaningful events ○ Validate wearable devices for human monitoring ○ Characterize transitions in disease state
  • 20. Public Datasets Project https://cloud.google.com/bigquery/public-data/ A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these data sets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1TB per month is free)
  • 21. Confidential & ProprietaryGoogle Cloud Platform 21 Platinum Genomes 1000 Genomes Medical (Human) Population-scale Genome Projects 1000 Bulls 10K Dog Genomes Veterinary / Agriculture Open Cannabis Project Genome To Fields Panzea (1000 Maize) AgriculturePersonal Genome Project Human Microbiome Project NCBI GEO Human 100K Cancer Genome Atlas Many Other Interesting Datasets...
  • 22. Google confidential │ Do not distribute PI / Biologist : variant calls for the 1,000 genomes
  • 23. Google confidential │ Do not distribute Information: principal coordinates analysis (1000 genomes)
  • 24. Google confidential │ Do not distribute Knowledge: populations cluster together
  • 25. Bioinformatics scientist: BigQuery enables fast tertiary analysis
  • 26. Google Cloud Platform Dataflow + BigQuery Used for Extract, Transform, Load (ETL), analytics, real-time computation and process orchestration. cloud.google.com/dataflow Dataflow Run SQL queries against multi-terabyte datasets in seconds. cloud.google.com/bigquery BigQuery
  • 27. Google Cloud Platform Dataflow + BigQuery Used for Extract, Transform, Load (ETL), analytics, real-time computation and process orchestration. cloud.google.com/dataflow Dataflow Run SQL queries against multi-terabyte datasets in seconds. cloud.google.com/bigquery BigQuery
  • 29. Google confidential │ Do not distribute Example: GATK Analysis Pipeline Old way: install applications on host kernel libs app app app app Makefiles, CWL, WDL (on a virtual machine)
  • 33. Example: GATK Analysis Pipeline. Dockerflow: Dataflow + Docker
    Old way: install applications on the host (one kernel, shared libs, many apps), driven by Makefiles, CWL, or WDL on a virtual machine.
    New way: deploy containers (one kernel, each app packaged with its own libs).
    Benefits:
    ● Decouple process management from host configuration
    ● Portable across OS distros and clouds
    ● Consistent environment from development to production
    ● Immutable images
  • 34. Use Case: Reproducible Science with Docker
    ● Objective: Build a mutation-detection pipeline
    ● Provided to competitors: a training data set and an evaluation data set
    ● Competitors submit pipelines as Docker images to the DREAM Challenge host, Sage Bionetworks
    ● Submitted pipelines were used to process an unseen data set
    ● Post-competition, Docker images were made public
    ● Incidentally, Google won this competition with a deep-learning-based variant caller called DeepVariant: cloud.google.com/genomics/v1alpha2/deepvariant
  • 35. Threats to reproducible science. An idealized version of the hypothetico-deductive model of the scientific method is shown. Various potential threats to this model exist (indicated in red), including hypothesizing after the results are known (HARKing) and lack of data sharing. Together these undermine the robustness of results, and may impact on the ability of science to self-correct. http://www.nature.com/articles/s41562-016-0021
  • 36. To run it: java -jar target/dockerflow*dependencies.jar --project=YOUR_PROJECT --workflow-file=hello.yaml --workspace=gs://YOUR_BUCKET/YOUR_FOLDER --runner=DataflowPipelineRunner
    [Pipeline diagram labels: Sequencer, DNA Reads, PubSub Queue, Genomics API, Your Variant Caller, Variant Calls, Genomics API, BigQuery, Your Other Tool]
  • 37. GraphConnect SF 2015 / Graphs Are Feeding The World, Tim Williamson, Data Scientist, Monsanto https://www.youtube.com/watch?v=6KEvLURBenM
  • 39. Marker-assisted selection for quantitative traits
  • 40. Marker-assisted selection for quantitative traits https://www.sec.gov/Archives/edgar/data/1110783/000095013402011773/c71992exv99w2.htm
  • 41. Marker-Assisted Breeding Rapidly Increases Frequency of Favorable Genes https://www.slideshare.net/finance28/monsanto-082305a
  • 42. Q: Why Slow to Understand? A1: Data Processing. Situation: We still have an analysis bottleneck. Result: Slow to understand.
    [Figure: % effort (0 to 100) per stage (Experiment Design; DNA Sequencing; Secondary Analytics; Analytics, Interpretation, Planning) for Pre-NGS (~2000), Now, and Future (~2020). Source: Sboner, et al, 2011. The real cost of sequencing: higher than you think!]
  • 43. Q: Why Slow to Understand? A2: Limited Feedback. Situation: Data acquisition cost approaches zero. However, still slow to understand, because:
    1. Restricted choice of what can be observed, i.e. controlled modifications and artificial selection
    2. Passive learning. Limited feedback => low rate of learning
    Contrast with active learning...
    [Figure: stages Experiment Design; DNA Sequencing; Secondary Analytics; Analytics, Interpretation, Planning. Source: Sboner, et al, 2011. The real cost of sequencing: higher than you think!]
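The contrast between passive and active learning can be sketched with a toy problem that is not taken from the slides: locating an unknown threshold on the interval [0, 1] with a fixed budget of yes/no observations. The passive learner fixes all its observation points in advance. The active learner picks each new point based on the previous answer (bisection). The function names and the example threshold are hypothetical.

```python
def responds(x, threshold):
    """Toy 'experiment': does the system respond at dose x?"""
    return x >= threshold


def passive_uncertainty(threshold, n):
    """Observe n evenly spaced points chosen in advance; no feedback used."""
    lo, hi = 0.0, 1.0
    for i in range(n):
        x = (i + 1) / (n + 1)
        if responds(x, threshold):
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return hi - lo  # width of the interval still containing the threshold


def active_uncertainty(threshold, n):
    """Choose each observation from the previous result (bisection)."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        mid = (lo + hi) / 2
        if responds(mid, threshold):
            hi = mid
        else:
            lo = mid
    return hi - lo


if __name__ == "__main__":
    budget = 10
    print("passive:", passive_uncertainty(0.37, budget))
    print("active: ", active_uncertainty(0.37, budget))
```

With the same budget, the remaining uncertainty shrinks linearly for the passive design and exponentially for the active one. That is the sense in which limited feedback implies a low rate of learning.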
  • 44. [Diagram: Observe / Orient / Decide / Act loop linking Scientist and Biological System] Molecular Sensors: DNA sequencer, mass spectrometer, etc. However... (Technology)-Limited Experimental Capability
  • 45. Even Moore’s Law / Carlson Curve
  • 46. Even Moore’s Law / Carlson Curve - also applies to writing DNA
  • 47. [Diagram: Observe / Orient / Decide / Act loop linking Scientist and Biological System] Molecular Sensors: DNA sequencer, mass spectrometer, etc. Bioengineering Tech: DNA synthesizers, CRISPR/Cas9, etc.
  • 48. [Diagram: Observe / Orient / Decide / Act loop linking Scientist and Biological System] Molecular Sensors: DNA sequencer, mass spectrometer, etc. Environmental Sensors: laser scanners, hyperspectral scanners, UAVs, etc. Bioengineering Tech: DNA synthesizers, CRISPR/Cas9, etc. Regulate/Measure System I/O
  • 49. Integration with Geospatial, Management, and Terrestrial Sensor Data anezconsulting.com/precision-agronomy/
  • 50. Descartes Labs - Google Cloud Customer. Cloud ML Engine, TensorFlow. medium.com/@stevenpbrumby/corn-in-the-usa-d487dce84ee1
  • 51. Phenomobile, http://www.mdpi.com/2073-4395/4/3/349/htm See also: http://www.genomes2fields.org/
  • 52. Temporo-Spatial Imaging of Growing Plants
  • 53. Verily: Assisting Pathologists in Detecting Cancer with Deep Learning research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html Prediction heatmaps produced by the algorithm had improved so much that the localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint. We were not the only ones to see promising results, as other groups were getting scores as high as 81% with the same dataset. The model generalized very well, even to images that were acquired from a different hospital using different scanners. For full details, see our paper “Detecting Cancer Metastases on Gigapixel Pathology Images”.
  • 54. 00:20 - Connecting… 01:22 - Link Established
  • 55. Cloud Vision, TensorFlow, Google Genomics, Dataflow, Cloud ML Engine, Docker. Baseline Study Data, Private Data, Public Data
  • 56. Build What’s Next Thank You! Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience