Storing and manipulating graphs in HBase

Dan Lynn
dan@fullcontact.com
@danklynn

Keeps Contact Information Current and Complete

Based in Denver, Colorado

CTO & Co-Founder

Turn Partial Contacts Into Full Contacts

Refresher: Graph Theory

Vertex

Edge

Tweets

[Figure: graph of @danklynn, @xorlev, and the tweet “#HBase rocks”, with edges labeled “follows”, “author”, and “retweeted”]

Web Links

http://fullcontact.com/blog/

<a href="...">TechStars</a>

http://techstars.com/

Why should you care?

Vertex Influence
- PageRank
- Social Influence
- Network bottlenecks

Identifying Communities

neo4j

Very expressive querying (e.g. Gremlin)

Data must fit on a single machine

:-(

FlockDB

Scales horizontally

No multi-hop query support

:-(

RDBMS (e.g. MySQL, Postgres)

Huge amounts of JOINing

:-(

HBase

Data model is well suited to graphs

Adjacency Matrix

[Figure: graph on vertices 1, 2, 3 with every pair connected]

    1  2  3
1   0  1  1
2   1  0  1
3   1  1  0

- Can use vectorized libraries
- Requires O(n^2) memory (n = number of vertices)
- Hard(er) to distribute
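
To make the memory cost concrete, here is a minimal Java sketch of the matrix for the three-vertex example. The class and method names are made up for illustration. The point to notice is that the array always holds n * n cells, however few edges the graph has.

```java
import java.util.Arrays;

// Illustrative sketch only: adjacency matrix for the three-vertex example graph.
class AdjacencyMatrixSketch {

    // Build an n x n matrix for an undirected graph from 1-based vertex pairs.
    static int[][] build(int n, int[][] edges) {
        int[][] m = new int[n][n]; // always n * n cells, even for sparse graphs
        for (int[] e : edges) {
            m[e[0] - 1][e[1] - 1] = 1;
            m[e[1] - 1][e[0] - 1] = 1; // undirected: mirror each edge
        }
        return m;
    }

    // The degree of a 1-based vertex is the sum of its row.
    static int degree(int[][] m, int vertex) {
        int sum = 0;
        for (int cell : m[vertex - 1]) {
            sum += cell;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] m = build(3, new int[][] {{1, 2}, {1, 3}, {2, 3}});
        System.out.println(Arrays.deepToString(m));
        System.out.println("cells allocated: " + m.length * m.length);
    }
}
```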

Adjacency List

[Figure: graph on vertices 1, 2, 3 with every pair connected]

1: 2, 3
2: 1, 3
3: 1, 2
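
The same graph as an adjacency list, sketched in Java with illustrative names. Each vertex stores only the neighbors it actually has, so storage grows with the number of edges rather than with n^2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: adjacency list for the three-vertex example graph.
class AdjacencyListSketch {

    // Build vertex -> neighbors from undirected vertex pairs.
    static Map<Integer, List<Integer>> build(int[][] edges) {
        Map<Integer, List<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]); // mirror the edge
        }
        return adj;
    }

    public static void main(String[] args) {
        // prints {1=[2, 3], 2=[1, 3], 3=[1, 2]}
        System.out.println(build(new int[][] {{1, 2}, {1, 3}, {2, 3}}));
    }
}
```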

Adjacency List Design in HBase

[Figure: graph on vertices e:dan@fullcontact.com, p:+13039316251, and t:danklynn]

row key                 “edges” column family
e:dan@fullcontact.com   p:+13039316251= ...
                        t:danklynn= ...
p:+13039316251          e:dan@fullcontact.com= ...
                        t:danklynn= ...
t:danklynn              e:dan@fullcontact.com= ...
                        p:+13039316251= ...

What to store as the edge value (“...”)?
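
The row layout can be modeled in memory as a sorted map of sorted maps: row key, then column qualifier, then value. The Java sketch below is a stand-in for the table, not HBase client code. The class name, the helper methods, and the "weight=1.0" cell value are all assumptions made for illustration.

```java
import java.util.Set;
import java.util.TreeMap;

// Illustrative in-memory stand-in for the "edges" column family (not the HBase API).
class EdgeTableSketch {

    // row key (source vertex) -> column qualifier (destination vertex) -> edge value
    final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    // Store the edge under both endpoints so either vertex can be looked up by row key.
    void addEdge(String a, String b, String value) {
        rows.computeIfAbsent(a, k -> new TreeMap<>()).put(b, value);
        rows.computeIfAbsent(b, k -> new TreeMap<>()).put(a, value);
    }

    // One row read returns every neighbor of a vertex.
    Set<String> neighbors(String vertex) {
        return rows.getOrDefault(vertex, new TreeMap<>()).keySet();
    }

    public static void main(String[] args) {
        EdgeTableSketch table = new EdgeTableSketch();
        // "weight=1.0" is a hypothetical edge value
        table.addEdge("e:dan@fullcontact.com", "p:+13039316251", "weight=1.0");
        table.addEdge("e:dan@fullcontact.com", "t:danklynn", "weight=1.0");
        table.addEdge("p:+13039316251", "t:danklynn", "weight=1.0");
        System.out.println(table.neighbors("t:danklynn"));
    }
}
```

Keeping qualifiers sorted mirrors how HBase orders columns within a row, and the type prefixes (e:, p:, t:) keep vertices of the same kind adjacent.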

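Custom Writables
package org.apache.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {

    // Serialize this object's fields to the output
    void write(DataOutput dataOutput) throws IOException;

    // Populate this object's fields from the input
    void readFields(DataInput dataInput) throws IOException;
}
java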

Custom Writables
class EdgeValueWritable implements Writable {

    EdgeValue edgeValue

    // Serialize only the fields needed to rebuild the edge value
    void write(DataOutput dataOutput) {
        dataOutput.writeDouble edgeValue.weight
    }

    // Read fields back in the same order they were written
    void readFields(DataInput dataInput) {
        Double weight = dataInput.readDouble()
        edgeValue = new EdgeValue(weight)
    }

    // ...
}
groovy
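
The write/readFields round trip can be tried without a Hadoop dependency. The Java sketch below uses only java.io. SimpleWritable is a local stand-in that mirrors the shape of Hadoop's Writable, and EdgeWeightWritable and its helpers are hypothetical names, not part of the talk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in mirroring the shape of org.apache.hadoop.io.Writable.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical edge value holding a single weight.
class EdgeWeightWritable implements SimpleWritable {
    double weight;

    public void write(DataOutput out) throws IOException {
        out.writeDouble(weight);
    }

    public void readFields(DataInput in) throws IOException {
        weight = in.readDouble();
    }

    // Serialize to a byte[] by driving write() over an in-memory stream.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild an instance by driving readFields() over the bytes.
    static EdgeWeightWritable fromBytes(byte[] bytes) {
        try {
            EdgeWeightWritable w = new EdgeWeightWritable();
            w.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return w;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Fields must be read in exactly the order they were written, since the stream carries no field names.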

Don’t get fancy with byte[]
class EdgeValueWritable implements Writable {
    EdgeValue edgeValue

    byte[] toBytes() {
        // use strings if you can help it
    }

    static EdgeValueWritable fromBytes(byte[] bytes) {
        // use strings if you can help it
    }
}
groovy
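
One way to read the "use strings" advice is to encode the value as UTF-8 text, so a cell stays legible when inspected by hand. This is a minimal Java sketch under that assumption. The class name and the choice of a bare decimal string as the format are illustrative, not the author's.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: store an edge weight as human-readable UTF-8 text.
class StringEdgeValueSketch {

    // Encode the weight as its decimal string form, e.g. 0.75 -> "0.75".
    static byte[] toBytes(double weight) {
        return Double.toString(weight).getBytes(StandardCharsets.UTF_8);
    }

    // Decode by parsing the text back into a double.
    static double fromBytes(byte[] bytes) {
        return Double.parseDouble(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] cell = toBytes(0.75);
        System.out.println(new String(cell, StandardCharsets.UTF_8)); // prints 0.75
        System.out.println(fromBytes(cell));                          // prints 0.75
    }
}
```

The trade-off is a few extra bytes per cell in exchange for values that can be read and debugged without a custom decoder.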

Querying by vertex
def get = new Get(vertexKeyBytes)
get.addFamily(edgesFamilyBytes)

Result result = table.get(get)
result.noVersionMap.each { family, data ->
    // construct edge objects as needed
    // data is a Map<byte[],byte[]> of column qualifier (destination vertex) to edge value
}

Adding edges to a vertex
def put = new Put(vertexKeyBytes)

put.add(
    edgesFamilyBytes,
    destinationVertexBytes,
    edgeValue.toBytes() // your own implementation here
)

// if writing directly
table.put(put)

// if using TableReducer
context.write(NullWritable.get(), put)

Distributed Traversal / Indexing

[Figure sequence: vertices e:dan@fullcontact.com, p:+13039316251, and t:danklynn]

- Pivot vertex
- MapReduce over outbound edges
- Emit vertexes and edge data grouped by the pivot
- Reduce key: p:+13039316251, with an “Out” vertex and an “In” vertex
- Reducer emits higher-order edge: e:dan@fullcontact.com, t:danklynn

Iteration 0

Iteration 1

Iteration 2: reuse edges created during previous iterations

Iteration 3: reuse edges created during previous iterations

[…] hops requires only […] iterations
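
The effect of reusing earlier edges can be checked with a small in-memory simulation. The Java sketch below is not a MapReduce job. It assumes each pass treats every vertex as a pivot and links every pair of that pivot's neighbors, and all names are illustrative. Under that assumption the reachable distance doubles per pass, so on a path graph two vertices 8 hops apart are linked after 3 passes.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative simulation of pivot-based higher-order edge creation (assumed scheme).
class HigherOrderEdgeSketch {

    // One pass: for each pivot, link every pair of its current neighbors.
    static Map<Integer, Set<Integer>> iterate(Map<Integer, Set<Integer>> adj) {
        Map<Integer, Set<Integer>> next = new TreeMap<>();
        for (Map.Entry<Integer, Set<Integer>> e : adj.entrySet()) {
            next.put(e.getKey(), new TreeSet<>(e.getValue())); // keep existing edges
        }
        for (Set<Integer> neighbors : adj.values()) { // one "reduce group" per pivot
            for (int in : neighbors) {
                for (int out : neighbors) {
                    if (in != out) {
                        next.get(in).add(out); // higher-order edge through the pivot
                    }
                }
            }
        }
        return next;
    }

    // Undirected path graph 0 - 1 - ... - (n - 1).
    static Map<Integer, Set<Integer>> path(int n) {
        Map<Integer, Set<Integer>> adj = new TreeMap<>();
        for (int v = 0; v < n; v++) {
            adj.put(v, new TreeSet<>());
        }
        for (int v = 0; v + 1 < n; v++) {
            adj.get(v).add(v + 1);
            adj.get(v + 1).add(v);
        }
        return adj;
    }

    // Passes needed before 'from' and 'to' share an edge, or -1 if they never do.
    static int iterationsToConnect(Map<Integer, Set<Integer>> adj, int from, int to) {
        int iterations = 0;
        while (!adj.get(from).contains(to)) {
            Map<Integer, Set<Integer>> next = iterate(adj);
            if (next.equals(adj)) {
                return -1; // no new edges: the vertices are not connected
            }
            adj = next;
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println(iterationsToConnect(path(9), 0, 8)); // prints 3
    }
}
```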

Do implement your own comparator
public static class Comparator
        extends WritableComparator {

    public int compare(
            byte[] b1, int s1, int l1,
            byte[] b2, int s2, int l2) {

        // .....
    }
}
java
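
The slide leaves the comparator body elided. As one possibility, the standalone Java sketch below compares two serialized keys byte by byte without deserializing them, which is the reason to implement the raw compare(byte[], ...) overload in the first place. It is a generic unsigned lexicographic comparison with an illustrative class name, not the author's implementation.

```java
// Illustrative sketch: unsigned lexicographic comparison of two byte ranges.
class RawBytesComparatorSketch {

    // s1/s2 are start offsets and l1/l2 are lengths within b1/b2.
    static int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        int shared = Math.min(l1, l2);
        for (int i = 0; i < shared; i++) {
            int a = b1[s1 + i] & 0xff; // mask so bytes compare as unsigned values
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2; // equal prefix: the shorter range sorts first
    }
}
```

For keys such as e:dan@fullcontact.com and p:+13039316251 stored as UTF-8, this orders them the same way their string forms sort.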

Do implement your own comparator

static {
    // Register the raw comparator for the key class
    WritableComparator.define(VertexKeyWritable.class,
            new VertexKeyWritable.Comparator());
}

java

MultiScanTableInputFormat

MultiScanTableInputFormat.setTable(conf, "graph");

MultiScanTableInputFormat.addScan(conf, new Scan());

job.setInputFormatClass(MultiScanTableInputFormat.class);

java

TableMapReduceUtil

TableMapReduceUtil.initTableReducerJob("graph", MyReducer.class, job);

java

Elastic MapReduce

[Figure sequence: bulk-processing pipeline]

- HFiles: copy to S3 as SequenceFiles
- Elastic MapReduce: SequenceFiles in, SequenceFiles out
- Elastic MapReduce: write HFiles with HFileOutputFormat.configureIncrementalLoad(job, outputTable)
- Load HFiles into HBase:
$ hadoop jar hbase-VERSION.jar completebulkload

Additional Resources
Google Pregel: BSP-based graph processing system

Apache Giraph: Implementation of Pregel for Hadoop

MultiScanTableInputFormat example

Apache Mahout - Distributed machine learning on Hadoop
