Background
The past decade has seen a significant increase in high-throughput experimental studies that catalog variant datasets using massively parallel sequencing. New insights of biological significance can be gained by combining this information with annotations anchored to genomic locations. However, efforts to obtain such insights by integrating and mining variant data have had limited success so far, and no method has yet been developed that is scalable, practical, and applicable to millions of variants and their related annotations. We explored the use of graph data structures as a proof of concept for scalable interpretation of the impact of variant-related data.
Leveraging Graph Data Structures for Variant Data and Related Annotations
Chris Zawora1, Jesse Milzman1, Yatpang Cheung1, Akshay Bhushan1, Michael S. Atkins2, Hue Vuong3, F. Pascal Girard2, Uma Mudunuri3
1Georgetown University, Washington, DC; 2FedCentric Technologies, LLC, Fairfax, VA; 3Frederick National Laboratory for Cancer Research, Frederick, MD
Conclusion
References
The 1000 Genomes Project Consortium. (2010). A map of human genome variation from population-scale sequencing. Nature, 467, 1061–1073.
Gregg, B. (2014). Systems performance: Enterprise and the cloud. Prentice Hall: Upper Saddle River, NJ.
Acknowledgments
FedCentric acknowledges Frank D’Ippolito, Shafigh Mehraeen, Margreth Mpossi,
Supriya Nittoor, and Tianwen Chu for their invaluable assistance with this project.
This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of
Health, under contract HHSN261200800001E. The content of this publication does not necessarily reflect the views or
policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or
organizations imply endorsement by the U.S. Government.
Contract HHSN261200800001E - Funded by the National Cancer Institute
DEPARTMENT OF HEALTH AND HUMAN SERVICES • National Institutes of Health • National Cancer Institute
Frederick National Laboratory is a Federally Funded Research and Development Center
operated by Leidos Biomedical Research, Inc., for the National Cancer Institute
Introduction
Traditional approaches to data mining and integration in biomedical research have relied on relational databases or custom programming to derive insights from research data. However, as more next-generation sequencing (NGS) data become available, these approaches limit the exploration of certain hypotheses. One such limitation is the mining of variant data from publicly available resources such as the 1000 Genomes Project and TCGA.
Although applications exist for quickly finding public datasets that contain a given set of variants or for looking up minor allele frequencies, no application can be applied generically across all of these projects to let researchers mine the data globally and find patterns relevant to their specific research interests.
In this pilot project, we investigated whether graph database structures can support scalable mining of variants from individuals and populations, and interpretation of their impact through integration with known annotations.
Phase I: As an initial evaluation of the graph structure, we ran several simple queries, each also feasible in a relational architecture, and measured query performance.
Simple Query Examples
• Get all information for a single variant
• Find annotations within a range of genomic locations
• Find variants associated with specific clinical phenotypes
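The poster does not show the query code itself; the sketch below uses the generic networkx library as a stand-in for the Sparksee API to illustrate how these three lookups translate into graph traversals. Node identifiers, property names, and the toy data are assumptions for illustration only.

```python
# Minimal sketch of the Phase I lookups as graph traversals, using
# networkx as a stand-in for the actual graph database. Node identifiers
# and property names (kind, chrom, pos, condition) are illustrative only.
import networkx as nx

g = nx.Graph()
# Toy graph: a variant node, its genomic-location node, and a phenotype node.
g.add_node("rs123", kind="variant")
g.add_node("chr1:1234567", kind="location", chrom="1", pos=1234567)
g.add_node("condition:example", kind="phenotype", condition="example condition")
g.add_edge("rs123", "chr1:1234567", rel="AT")
g.add_edge("rs123", "condition:example", rel="ASSOCIATED_WITH")

# 1. Get all information for a single variant: its attributes plus
#    everything one hop away.
def variant_details(graph, variant_id):
    return {
        "attributes": dict(graph.nodes[variant_id]),
        "neighbours": {n: dict(graph.nodes[n]) for n in graph.neighbors(variant_id)},
    }

# 2. Find annotations within a range of genomic locations.
def annotations_in_range(graph, chrom, start, end):
    hits = []
    for n, data in graph.nodes(data=True):
        if (data.get("kind") == "location" and data.get("chrom") == chrom
                and start <= data.get("pos", -1) <= end):
            hits.extend(graph.neighbors(n))
    return hits

# 3. Find variants associated with a specific clinical phenotype.
def variants_for_phenotype(graph, phenotype_id):
    return [n for n in graph.neighbors(phenotype_id)
            if graph.nodes[n].get("kind") == "variant"]

print(variant_details(g, "rs123"))
print(annotations_in_range(g, "1", 1_200_000, 1_300_000))
print(variants_for_phenotype(g, "condition:example"))
```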
Performance speeds
• Query times in milliseconds
• Equal to or better than relational database query times
Queries
• Developed a new SQL-like query language called SparkQL
• Makes it easier for non-programmers to write queries
Ingestion Times
• Slower than expected
• Sparksee supports multi-threaded reads but only a single writer
• Writes one node/edge at a time
• Each write involves creating connections with existing nodes
• Ingestion slows down as the graph grows
Solution: Implement multi-threaded insertions in combination with internal data structures to efficiently find nodes and create edges.
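The ingestion code is not included on the poster; below is a minimal sketch of the general pattern the fix describes, assuming a shared in-memory index from node keys to node ids so that parallel workers can resolve edge endpoints cheaply while a single thread performs the actual writes. The GraphStore class, key format, and sample records are hypothetical.

```python
# Sketch of multi-threaded ingestion: workers resolve node ids through a
# shared in-memory index; a single writer thread applies the edge writes,
# since only one thread may write to the store at a time.
# GraphStore is a toy stand-in for the real graph database.
import queue
import threading

class GraphStore:
    def __init__(self):
        self.next_id = 0
        self.edges = []

    def add_node(self):
        self.next_id += 1
        return self.next_id

    def add_edge(self, src, dst):
        self.edges.append((src, dst))

store = GraphStore()
node_index = {}                  # node key (e.g. "variant:rs123") -> node id
index_lock = threading.Lock()
edge_queue = queue.Queue()

def resolve(key):
    """Find-or-create a node id for key, using the shared index."""
    with index_lock:
        if key not in node_index:
            node_index[key] = store.add_node()
        return node_index[key]

def worker(records):
    # Resolve endpoints in parallel; only enqueue the writes.
    for src_key, dst_key in records:
        edge_queue.put((resolve(src_key), resolve(dst_key)))

def writer():
    # A single thread drains the queue and performs the edge writes.
    while (item := edge_queue.get()) is not None:
        store.add_edge(*item)

shards = [
    [("individual:sample1", "variant:rs123")],
    [("variant:rs123", "location:chr1:1234567")],
]
workers = [threading.Thread(target=worker, args=(shard,)) for shard in shards]
writer_thread = threading.Thread(target=writer)
writer_thread.start()
for t in workers:
    t.start()
for t in workers:
    t.join()
edge_queue.put(None)             # sentinel: all workers are done
writer_thread.join()
print(len(store.edges), "edges written")
```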
High-Degree Vertices
• Nodes with millions of edges
• Stored in a non-distributed, list-like format
• Searches for a specific edge can be slow
Example: nodes representing individuals with millions of variants
Solution: Explore graph clustering approaches that condense the information attached to such nodes.
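The poster leaves the condensation approach open; as one hypothetical illustration (not necessarily the approach the authors pursued), a high-degree individual node can be split so that its variant edges hang off intermediate per-chromosome nodes, shrinking the edge list any single lookup has to scan.

```python
# Hypothetical illustration of condensing a high-degree vertex: replace
# direct individual->variant edges with per-chromosome intermediate nodes.
import networkx as nx

def condense_individual(g, individual):
    variants = [v for v in list(g.neighbors(individual))
                if g.nodes[v].get("kind") == "variant"]
    for v in variants:
        chrom = g.nodes[v].get("chrom", "unknown")
        hub = f"{individual}|chr{chrom}"          # intermediate bucket node
        g.add_node(hub, kind="chromosome_bucket", chrom=chrom)
        g.add_edge(individual, hub)
        g.add_edge(hub, v)
        g.remove_edge(individual, v)

g = nx.Graph()
g.add_node("sample1", kind="individual")
for i, chrom in enumerate(["1", "1", "2"]):
    g.add_node(f"rs{i}", kind="variant", chrom=chrom)
    g.add_edge("sample1", f"rs{i}")

condense_individual(g, "sample1")
print(g.degree("sample1"))             # 2 bucket edges instead of 3 variant edges
print(sorted(g.neighbors("sample1")))  # ['sample1|chr1', 'sample1|chr2']
```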
Phase II: We explored complex patterns and clusters within the graph, including spectral clustering queries that were not feasible in the relational architecture.
Complex Query Examples
• Compare variant profiles to find closely related individuals
• Compare annotation profiles to find clusters of populations
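The poster does not specify the similarity measure behind the variant-profile comparison; one common, simple choice is Jaccard similarity over each individual's set of variants, sketched below with toy sample IDs and variant sets.

```python
# Sketch: rank pairs of individuals by Jaccard similarity of their variant
# profiles (shared variants / union of variants). All data here are toy
# values; the poster does not state which measure was actually used.
from itertools import combinations

profiles = {
    "sample1": {"rs1", "rs2", "rs3", "rs4"},
    "sample2": {"rs1", "rs2", "rs3"},
    "sample3": {"rs7", "rs8"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

ranked = sorted(
    ((jaccard(profiles[x], profiles[y]), x, y) for x, y in combinations(profiles, 2)),
    reverse=True,
)
for score, x, y in ranked:
    print(f"{x} vs {y}: {score:.2f}")
```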
Phase II Results
• Eight populations with 25 individuals from each population
• Strong eigenvalue support (near-zero eigenvalues) for three main clusters
• Cluster pattern supported by population genetics (Fig. 4)
Performance speeds
• Spectral clustering took ca. 2 minutes
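The clustering code is not included on the poster; the sketch below shows a standard normalized-Laplacian spectral clustering recipe on a small toy similarity matrix, where near-zero eigenvalues indicate the number of well-separated clusters (the actual analysis used 200 individuals and found three clusters). The use of numpy, scipy, and scikit-learn here is an assumption, not the authors' implementation.

```python
# Sketch of spectral clustering on an individual-by-individual similarity
# matrix. The toy matrix has an obvious two-block structure; the real
# analysis used 200 individuals (8 populations x 25) from 1000 Genomes.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

# Toy similarity matrix: two tight blocks of individuals.
S = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0, 0.1],
    [0.9, 1.0, 0.9, 0.0, 0.1, 0.0],
    [0.8, 0.9, 1.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 1.0, 0.9, 0.8],
    [0.0, 0.1, 0.0, 0.9, 1.0, 0.9],
    [0.1, 0.0, 0.1, 0.8, 0.9, 1.0],
])

# Normalized graph Laplacian: L = I - D^{-1/2} S D^{-1/2}.
d = S.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt

# Near-zero eigenvalues indicate the number of well-separated clusters.
eigvals, eigvecs = eigh(L)
print("smallest eigenvalues:", np.round(eigvals[:3], 3))

# Embed each individual in the space of the first k eigenvectors and cluster.
k = 2
embedding = eigvecs[:, :k]
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
print("cluster labels:", labels)
```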
Our results indicate that a graph database, run on an in-memory machine, can be a powerful and useful tool for cancer research. Performance when retrieving details for a specific node or a list of nodes equals or exceeds that of a well-architected relational database. We also see promising initial results for identifying correlations between genetic changes and specific phenotype conditions.

We conclude that an in-memory graph database would allow researchers to run known queries while also providing the opportunity to develop algorithms that explore complex correlations. Graph models are useful for mining complex data sets and could prove essential in the development and implementation of tools aiding precision medicine.
Data
• SNPs from the 1000 Genomes Project
• Phenotype conditions from ClinVar
• Gene mappings & mRNA transcripts from Entrez Gene
• Amino acid changes from UniProt
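The integration code is not shown on the poster; below is a minimal sketch of one way to join these sources, keying records by genomic coordinates so that variants and annotations meet on shared location nodes. The file handling, the simplified ClinVar-style records, and all sample values are assumptions.

```python
# Sketch: key variants and annotations by (chrom, pos, ref, alt) so that
# records from different sources can be joined onto shared location nodes.
# The VCF column layout (CHROM POS ID REF ALT ...) is standard; the
# ClinVar-style records below are a simplified placeholder.
import gzip

def read_vcf_variants(path):
    """Yield (chrom, pos, ref, alt, rsid) tuples from a gzipped VCF file."""
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, rsid, ref, alt = line.rstrip("\n").split("\t")[:5]
            for allele in alt.split(","):        # handle multi-allelic sites
                yield chrom, int(pos), ref, allele, rsid

def index_annotations(records):
    """Index annotation dicts (chrom/pos/ref/alt plus payload) by coordinates."""
    index = {}
    for rec in records:
        key = (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])
        index.setdefault(key, []).append(rec)
    return index

def annotate(variants, annotation_index):
    """Pair each variant with annotations found at the same coordinates."""
    for chrom, pos, ref, alt, rsid in variants:
        yield rsid, annotation_index.get((chrom, pos, ref, alt), [])

# Toy in-memory demonstration (no files required).
clinvar_like = [{"chrom": "1", "pos": 1234567, "ref": "C", "alt": "T",
                 "condition": "example condition"}]
toy_variants = [("1", 1234567, "C", "T", "rs123")]
print(list(annotate(toy_variants, index_annotations(clinvar_like))))
```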
The Graph
• Variants and annotations mapped to reference genomic
locations (Fig. 3)
• Includes all chromosomes and genomic locations
• 180 million nodes and 12 billion edges.
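Fig. 3 depicts the graph model; the sketch below is a toy reconstruction of that model in networkx, with variant, gene, phenotype, amino-acid-change, and individual nodes attached to a shared reference genomic-location node. The node labels, edge labels, and values are assumptions based on the data sources listed above, not the exact schema used.

```python
# Toy reconstruction of the graph model (cf. Fig. 3): annotations from the
# different sources attach to shared reference genomic-location nodes.
# All labels and values are illustrative.
import networkx as nx

g = nx.MultiDiGraph()

g.add_node("chr1:1234567", kind="location", chrom="1", pos=1234567)
g.add_node("rs123", kind="variant", source="1000 Genomes")
g.add_node("GENE_A", kind="gene", source="Entrez Gene")
g.add_node("condition:example", kind="phenotype", source="ClinVar")
g.add_node("p.Ala10Val", kind="aa_change", source="UniProt")
g.add_node("sample1", kind="individual", population="populationA")

g.add_edge("rs123", "chr1:1234567", rel="AT")
g.add_edge("GENE_A", "chr1:1234567", rel="SPANS")
g.add_edge("rs123", "condition:example", rel="ASSOCIATED_WITH")
g.add_edge("rs123", "p.Ala10Val", rel="CAUSES")
g.add_edge("sample1", "rs123", rel="HAS_VARIANT")

# Everything known about a genomic location is then one or two hops away.
print([(u, d["rel"]) for u, _, d in g.in_edges("chr1:1234567", data=True)])
```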
Graph Architecture
• Sparsity Technologies' Sparksee graph database
• API supports C, C++, C#, Python, and Java
• Implements the graph and its attributes as maps and sparse bitmap structures
• Allows the graph to scale with very limited memory requirements
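Sparksee's internals are described on the poster only as "maps and sparse bitmap structures"; the toy class below illustrates the general bitmap-index idea (one bit per object id for each attribute value, so that filters combine as bitwise operations). It is an illustration of the concept, not Sparksee's actual implementation.

```python
# Toy illustration of a bitmap attribute index: for each attribute value,
# keep a bitmap over object ids in which bit i is set when object i has
# that value. Filters then combine as cheap bitwise operations.
class BitmapIndex:
    def __init__(self):
        self.bitmaps = {}            # attribute value -> int used as a bitset

    def add(self, value, object_id):
        self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << object_id)

    def ids(self, value):
        bits, i, out = self.bitmaps.get(value, 0), 0, []
        while bits:
            if bits & 1:
                out.append(i)
            bits >>= 1
            i += 1
        return out

kind_index = BitmapIndex()
chrom_index = BitmapIndex()
for oid, (kind, chrom) in enumerate([("variant", "1"), ("variant", "2"),
                                     ("gene", "1")]):
    kind_index.add(kind, oid)
    chrom_index.add(chrom, oid)

print(kind_index.ids("variant"))     # [0, 1]
# Variants on chromosome 1 = bitwise AND of the two bitmaps.
both = kind_index.bitmaps["variant"] & chrom_index.bitmaps["1"]
print(bin(both))                     # 0b1 -> only object id 0
```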
Hardware
• FedCentric Labs’ SGI UV300 system: x86/Linux system,
scales up to 1,152 cores & 64TB of memory
• Data in memory, very low latency, high performance (Fig. 2)
Fig. 2: Latency matters
Fig. 3: The Graph Model
Fig. 1: Graphs handle data complexity intuitively and interactively
Fig. 4: Results of spectral clustering of 1000 Genomes data