The power of graphs to analyze biological data

the power of graphs for analyzing biological datasets

Davy Suvee

Janssen Pharmaceutica

about me

who am i ...
➡ working as an IT lead / software architect @ janssen pharmaceutica
• dealing with big scientific data sets
• hands-on expertise in big data and NoSQL technologies

➡ founder of datablend
• provide big data and NoSQL consultancy
• share practical knowledge and big data use cases via blog

@DSUVEE

outline

➡ getting visual insights into big data sets
★ gene expression clustering (mongodb, Neo4j, Gephi)
★ Mutation prevalence (cassandra, Neo4j, Gephi)

➡ fluxgraph, a time machine for your graphs ...

insights in big data
➡ typical approach through warehousing
★ star schema with fact tables and dimension tables

insights in big data

★ real-time visualization
★ filtering
★ metrics
★ layouting
1, 2
★ modular

1. http://gephi.org/plugins/neo4j-graph-database-support/
2. http://github.com/datablend/gephi-blueprints-plugin

gene expression clustering

➡ oncology data set:
★ 4,800 samples
★ 27,000 genes

➡ Question:
★ for a particular subset of samples,
which genes are co-expressed?

mongodb for storing gene expressions
{ "_id" : { "$oid" : "4f1fb64a1695629dd9d916e3"} ,
  "sample_name" : "122551hp133a21.cel" ,
  "genomics_id" : 122551 ,
  "sample_id" : 343981 ,
  "donor_id" : 143981 ,
  "sample_type" : "Tissue" ,
  "sample_site" : "Ascending colon" ,
  "pathology_category" : "MALIGNANT" ,
  "pathology_morphology" : "Adenocarcinoma" ,
  "pathology_type" : "Primary malignant neoplasm of colon" ,
  "primary_site" : "Colon" ,
  "expressions" : [ { "gene" : "X1_at" , "expression" : 5.54217719084415} ,
                    { "gene" : "X10_at" , "expression" : 3.92335121981739} ,
                     … ]
}

pearson correlation through map-reduce

x    y
43   99
21   65
25   79
42   75
57   87
59   81

pearson correlation: 0.52
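As a rough illustration of the statistic itself, the sketch below computes the Pearson correlation of the two example series in plain Java. It is an in-memory version written for this note, not the map-reduce job used in the talk; the class name `PearsonSketch` and its method are hypothetical.

```java
import java.util.Arrays;

/** Illustrative sketch: in-memory Pearson correlation (not the map-reduce job itself). */
class PearsonSketch {

    /** Pearson correlation of two equally long series; NaN if either series is constant. */
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have the same length (>= 2)");
        }
        double meanX = Arrays.stream(x).average().getAsDouble();
        double meanY = Arrays.stream(y).average().getAsDouble();

        // Unnormalised sums: the 1/n factors cancel in the final ratio.
        double sumXY = 0.0, sumXX = 0.0, sumYY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumXX += dx * dx;
            sumYY += dy * dy;
        }
        return sumXY / Math.sqrt(sumXX * sumYY);
    }

    public static void main(String[] args) {
        // The x / y series from the slide.
        double[] x = {43, 21, 25, 42, 57, 59};
        double[] y = {99, 65, 79, 75, 87, 81};
        System.out.printf("r = %.4f%n", correlation(x, y));
    }
}
```

For the slide's series this gives a value of about 0.53, which the slide shows truncated as 0.52. In a map-reduce setting the same sums would typically be accumulated per gene pair across samples and combined in the reduce step.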

co-expression graph

➡ create a node for each gene
➡ if correlation between two genes >= 0.8, draw an edge between both nodes
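The two rules above can be sketched as a small in-memory graph builder. This is only an illustration under stated assumptions: the gene names and expression values are made-up toy data, the adjacency map stands in for the Neo4j graph used in the talk, and `CoExpressionSketch` is a hypothetical name.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative sketch: builds a co-expression graph as an adjacency map. */
class CoExpressionSketch {

    /** Pearson correlation of two equally long series. */
    static double correlation(double[] x, double[] y) {
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i] / x.length;
            meanY += y[i] / y.length;
        }
        double sumXY = 0.0, sumXX = 0.0, sumYY = 0.0;
        for (int i = 0; i < x.length; i++) {
            sumXY += (x[i] - meanX) * (y[i] - meanY);
            sumXX += (x[i] - meanX) * (x[i] - meanX);
            sumYY += (y[i] - meanY) * (y[i] - meanY);
        }
        return sumXY / Math.sqrt(sumXX * sumYY);
    }

    /** One node per gene; an undirected edge when the correlation is >= threshold. */
    static Map<String, Set<String>> build(String[] genes, double[][] expression, double threshold) {
        Map<String, Set<String>> graph = new TreeMap<>();
        for (String gene : genes) {
            graph.put(gene, new TreeSet<>()); // every gene gets a node, even if isolated
        }
        for (int i = 0; i < genes.length; i++) {
            for (int j = i + 1; j < genes.length; j++) {
                if (correlation(expression[i], expression[j]) >= threshold) {
                    graph.get(genes[i]).add(genes[j]);
                    graph.get(genes[j]).add(genes[i]);
                }
            }
        }
        return graph;
    }

    public static void main(String[] args) {
        // Toy data (hypothetical): one row of expression values per gene.
        String[] genes = {"geneA", "geneB", "geneC"};
        double[][] expression = {
            {1, 2, 3, 4},
            {2, 4, 6, 8.5},
            {4, 1, 3, 2}
        };
        System.out.println(build(genes, expression, 0.8));
    }
}
```

The pairwise loop is quadratic in the number of genes, which is one reason to push the correlation step into map-reduce for 27,000 genes.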

graphs and time ...
➡ reproducible graph state

➡ towards a time-aware graph ...

➡ fluxgraph: a blueprints-compatible graph on top of Datomic

➡ make FluxGraph fully time-aware
★ travel your graph through time
★ time-scoped iteration of vertices and edges
★ temporal graph comparison

travel through time

FluxGraph fg = new FluxGraph();

Vertex davy = fg.addVertex();
davy.setProperty("name", "Davy");

Vertex peter = ...
Vertex michael = ...

Edge e1 = fg.addEdge(davy, peter, "knows");

Date checkpoint = new Date();

davy.setProperty("name", "David");

Edge e2 = fg.addEdge(davy, michael, "knows");

[diagram: the graph is built up step by step: Davy (renamed David after the checkpoint), Peter and Michael, connected by "knows" edges]

travel through time

[diagram: timeline with the graph state at the checkpoint (Davy) and at the current time (David), with Peter, Michael and "knows" edges; by default the graph is at the current time]

fg.setCheckpointTime(checkpoint);
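To make the idea of reading a graph "as of" a checkpoint concrete, here is a minimal sketch of a single time-aware property. It does not use FluxGraph or Datomic; `TimeTravelSketch` and its methods are hypothetical names, and the timestamps are plain `long` values chosen for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch: one property whose full history is kept, keyed by timestamp. */
class TimeTravelSketch {

    private final TreeMap<Long, String> history = new TreeMap<>();

    /** Records a new value at the given time; older values are never overwritten. */
    void set(long timestamp, String value) {
        history.put(timestamp, value);
    }

    /** Value that was valid at the given time, or null if nothing was set yet. */
    String getAsOf(long timestamp) {
        Map.Entry<Long, String> entry = history.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    /** Most recent value, or null if nothing was set. */
    String getCurrent() {
        return history.isEmpty() ? null : history.lastEntry().getValue();
    }

    public static void main(String[] args) {
        TimeTravelSketch name = new TimeTravelSketch();
        name.set(100L, "Davy");
        long checkpoint = 150L;      // plays the role of the checkpoint date
        name.set(200L, "David");

        System.out.println(name.getAsOf(checkpoint)); // value at the checkpoint
        System.out.println(name.getCurrent());        // value at the current time
    }
}
```

Reading at the checkpoint yields "Davy" while the current value is "David", mirroring the two graph states on the slide.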

time-scoped iteration

t1      t2       t3        tcurrent
   change   change    change
Davy    Davy'    Davy''    Davy'''

➡ how to find the version of the vertex you are interested in?

[diagram: the versions at t1, t2, t3 and tcurrent are linked by "next" and "previous" pointers]

Vertex previousDavy = davy.getPreviousVersion();

Iterable<Vertex> allDavy = davy.getNextVersions();

Iterable<Vertex> selDavy = davy.getPreviousVersions(filter);
Interval valid = davy.getTimerInterval();
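One way to picture these calls is a doubly linked chain of versions, each valid from its own timestamp until the next version starts. The sketch below is a stand-alone illustration under that assumption; it is not FluxGraph's implementation, and `VersionChainSketch`, `Version`, `change`, `previousVersions` and `validUntil` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative sketch: versions of one element linked by previous / next pointers. */
class VersionChainSketch {

    static final class Version {
        final String label;
        final long validFrom;
        Version previous;
        Version next;

        Version(String label, long validFrom) {
            this.label = label;
            this.validFrom = validFrom;
        }

        /** Creates the next version and links it to this one. */
        Version change(String newLabel, long at) {
            Version created = new Version(newLabel, at);
            created.previous = this;
            this.next = created;
            return created;
        }

        /** Earlier versions that pass the filter, most recent first. */
        List<Version> previousVersions(Predicate<Version> filter) {
            List<Version> result = new ArrayList<>();
            for (Version v = previous; v != null; v = v.previous) {
                if (filter.test(v)) {
                    result.add(v);
                }
            }
            return result;
        }

        /** End of this version's validity; open-ended for the latest version. */
        long validUntil() {
            return next == null ? Long.MAX_VALUE : next.validFrom;
        }
    }

    public static void main(String[] args) {
        Version t1 = new Version("Davy", 1);
        Version current = t1.change("Davy'", 2).change("Davy''", 3).change("Davy'''", 4);
        for (Version v : current.previousVersions(v -> true)) {
            System.out.println(v.label + " valid from " + v.validFrom + " until " + v.validUntil());
        }
    }
}
```

A filter such as `ver -> ver.validFrom < 3` narrows the walk to the versions of interest, which is the role the `filter` argument plays on the slide.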

➡ When does an element change?

➡ vertex:
★ setting or removing a property
★ being added to or removed from an edge
★ being removed

➡ edge:
★ setting or removing a property
★ being removed

➡ ... and each element is time-scoped!

temporal graph comparison

[diagram: the current graph next to the checkpoint graph (Davy / David, Peter, Michael, "knows" edges): what changed?]

➡ difference (A , B) = union (A , B) - B
➡ ... as an (immutable) graph!
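The difference rule can be illustrated on plain sets of edges. The sketch below is an assumption-laden simplification: edges are encoded as strings, the result is an unmodifiable set rather than a graph, and `GraphDifferenceSketch` is a hypothetical name, not FluxGraph code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch: difference(A, B) = union(A, B) - B on sets of edges. */
class GraphDifferenceSketch {

    /** Returns the elements of union(a, b) that are not in b, as a read-only set. */
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.addAll(b);       // union(A, B)
        result.removeAll(b);    // ... - B
        return Collections.unmodifiableSet(result);
    }

    public static void main(String[] args) {
        // Edges identified by vertex ids, so renaming a vertex does not change the edge.
        Set<String> current = new HashSet<>(Arrays.asList("v1-knows-v2", "v1-knows-v3"));
        Set<String> checkpoint = new HashSet<>(Arrays.asList("v1-knows-v2"));
        System.out.println(difference(current, checkpoint));
    }
}
```

With the current state as A and the checkpoint state as B, only the edge added after the checkpoint remains, and the returned set rejects modification.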

[diagram: difference (current, checkpoint) = a graph containing David and a "knows" edge]

use case: longitudinal patient data
[diagram: patient timeline t1 ... t5, with smoking, cancer and death events attached to the patient at different points in time]

➡ historical data for 15,000 patients over a period of 10 years (2001-2010)

➡ example analysis:
★ if a male patient is no longer smoking in 2005,
★ what are the chances of getting lung cancer in 2010, comparing
  patients that smoked before 2005 with
  patients that never smoked

➡ get all male non-smokers in 2005

fg.setCheckpointTime(new DateTime(2005, 12, 31, 0, 0).toDate());

Iterator<Vertex> males =
    fg.getVertices("gender", "male").iterator();

while (males.hasNext()) {
    Vertex p2005 = males.next();
    boolean smoking2005 =
        p2005.getEdges(OUT, "smokingStatus").iterator().hasNext();
}

➡ which patients were smoking before 2005?

boolean smokingBefore2005 =
    ((FluxVertex) p2005).getPreviousVersions(new TimeAwareFilter() {

        public TimeAwareElement filter(TimeAwareVertex element) {
            return element.getEdges(OUT, "smokingStatus").iterator().hasNext()
                ? element : null;
        }

    }).iterator().hasNext();

➡ which patients have cancer in 2010

Graph g =
    fg.difference(smokerws,            // working set of smokers
                  time2010.toDate(),
                  time2005.toDate());

➡ extract the patients that have an edge to the cancer node
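Once both cohorts are extracted, the comparison itself is a matter of simple proportions. The sketch below shows that last step only, on made-up toy records; the `Patient` class, its fields and `cancerRate` are hypothetical, and every patient is assumed to already be a male non-smoker in 2005.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch: compares the 2010 cancer rate of two cohorts. */
class CohortSketch {

    static final class Patient {
        final boolean smokedBefore2005;
        final boolean cancerIn2010;

        Patient(boolean smokedBefore2005, boolean cancerIn2010) {
            this.smokedBefore2005 = smokedBefore2005;
            this.cancerIn2010 = cancerIn2010;
        }
    }

    /** Fraction of the selected cohort with cancer in 2010; NaN if the cohort is empty. */
    static double cancerRate(List<Patient> patients, boolean smokedBefore2005) {
        int cohortSize = 0;
        int cases = 0;
        for (Patient p : patients) {
            if (p.smokedBefore2005 == smokedBefore2005) {
                cohortSize++;
                if (p.cancerIn2010) {
                    cases++;
                }
            }
        }
        return cohortSize == 0 ? Double.NaN : (double) cases / cohortSize;
    }

    public static void main(String[] args) {
        // Made-up toy records, not data from the talk.
        List<Patient> patients = Arrays.asList(
            new Patient(true, true),
            new Patient(true, false),
            new Patient(false, false),
            new Patient(false, false));
        System.out.println("former smokers: " + cancerRate(patients, true));
        System.out.println("never smoked:   " + cancerRate(patients, false));
    }
}
```

In the graph-based workflow the two boolean fields would come from the previous-version filter and from the edge to the cancer node in the difference graph.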
