Storing and Manipulating Graphs in HBase

Dan Lynn
dan@fullcontact.com
@danklynn
Keeps Contact Information Current and Complete

Based in Denver, Colorado

CTO & Co-Founder

Turn Partial Contacts Into Full Contacts
Refresher: Graph Theory

Vertex

Edge
Social Networks
Tweets

[Diagram: vertices @danklynn, @xorlev, and the tweet “#HBase rocks”, joined by edges labeled “follows”, “retweeted”, and “author”]
Web Links

[Diagram: vertex http://fullcontact.com/blog/ linked to vertex http://techstars.com/ by the edge <a href="...">TechStars</a>]
Why should you care?

Vertex Influence
- PageRank

- Social Influence

- Network bottlenecks

Identifying Communities
Storage Options
neo4j
- Very expressive querying (e.g. Gremlin)
- Transactional
- Data must fit on a single machine :-(

FlockDB
- Scales horizontally
- Very fast
- No multi-hop query support :-(

RDBMS (e.g. MySQL, Postgres, et al.)
- Transactional
- Huge amounts of JOINing :-(
HBase
- Massively scalable
- Data model well-suited
- Multi-hop querying?
Modeling Techniques
Adjacency Matrix

[Diagram: three vertices, 1, 2, and 3, each connected to the other two]

    1   2   3
1   0   1   1
2   1   0   1
3   1   1   0

- Can use vectorized libraries
- Requires O(n^2) memory (n = number of vertices)
- Hard(er) to distribute
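As a rough illustration of the adjacency matrix trade-offs, the sketch below builds the matrix for the three-vertex example in plain Java and squares it. Entry [i][j] of the squared matrix counts the two-hop walks from vertex i to vertex j, which is the kind of operation a vectorized library would speed up. The class and method names are hypothetical and not taken from the talk.

```java
import java.util.Arrays;

// Illustrative sketch only: class and method names are hypothetical.
class AdjacencyMatrixSketch {

    // Build an n x n matrix for an undirected graph; vertices are numbered 1..n.
    static int[][] build(int n, int[][] edges) {
        int[][] m = new int[n][n]; // O(n^2) memory, regardless of the edge count
        for (int[] e : edges) {
            m[e[0] - 1][e[1] - 1] = 1;
            m[e[1] - 1][e[0] - 1] = 1; // undirected: mirror the entry
        }
        return m;
    }

    // Plain matrix product; entry [i][j] of A*A counts two-hop walks from i to j.
    static int[][] multiply(int[][] a, int[][] b) {
        int n = a.length;
        int[][] out = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    out[i][j] += a[i][k] * b[k][j];
        return out;
    }

    public static void main(String[] args) {
        int[][] a = build(3, new int[][] {{1, 2}, {1, 3}, {2, 3}});
        System.out.println(Arrays.deepToString(a));              // [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
        System.out.println(Arrays.deepToString(multiply(a, a))); // [[2, 1, 1], [1, 2, 1], [1, 1, 2]]
    }
}
```

The full n x n allocation happens even for a sparse graph, which is the memory cost the slide points at.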
Adjacency List

[Diagram: three vertices, 1, 2, and 3, each connected to the other two]

1    2,3
2    1,3
3    1,2
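A minimal in-memory sketch of the same adjacency list, assuming an undirected graph given as a list of vertex pairs. The class and method names are hypothetical. Each vertex maps to the sorted set of its neighbors, so memory grows with the number of edges rather than with the square of the number of vertices.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only: class and method names are hypothetical.
class AdjacencyListSketch {

    // Build a vertex -> sorted neighbors map for an undirected graph.
    static Map<Integer, SortedSet<Integer>> build(int[][] edges) {
        Map<Integer, SortedSet<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            // Store each undirected edge once per endpoint.
            adj.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
        }
        return adj;
    }

    public static void main(String[] args) {
        Map<Integer, SortedSet<Integer>> adj = build(new int[][] {{1, 2}, {1, 3}, {2, 3}});
        adj.forEach((vertex, neighbors) -> System.out.println(vertex + "  " + neighbors));
    }
}
```

A map of row key to a sorted set of neighbors is close in shape to an HBase row with one column per neighbor, which is why this representation carries over naturally.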
Adjacency List Design in HBase

[Diagram: three connected vertices: e:dan@fullcontact.com, p:+13039316251, t:danklynn]

row key                  “edges” column family

e:dan@fullcontact.com    p:+13039316251= ...
                         t:danklynn= ...

p:+13039316251           e:dan@fullcontact.com= ...
                         t:danklynn= ...

t:danklynn               e:dan@fullcontact.com= ...
                         p:+13039316251= ...

What to store?
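Custom Writables
// Java
package org.apache.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {

    // Serialize this object's fields to the output
    void write(DataOutput dataOutput) throws IOException;

    // Populate this object's fields from the input
    void readFields(DataInput dataInput) throws IOException;
}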
Custom Writables
// Groovy
import org.apache.hadoop.io.Writable

class EdgeValueWritable implements Writable {

    EdgeValue edgeValue

    // Serialize only the edge weight
    void write(DataOutput dataOutput) {
        dataOutput.writeDouble edgeValue.weight
    }

    // Rebuild the EdgeValue from the serialized weight
    void readFields(DataInput dataInput) {
        Double weight = dataInput.readDouble()
        edgeValue = new EdgeValue(weight)
    }

    // ...
}
Don’t get fancy with byte[]
// Groovy
class EdgeValueWritable implements Writable {

    EdgeValue edgeValue

    byte[] toBytes() {
        // use strings if you can help it
    }

    static EdgeValueWritable fromBytes(byte[] bytes) {
        // use strings if you can help it
    }
}
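One way to read "use strings if you can help it" is to encode the edge value as text before turning it into bytes. The plain-Java sketch below does this for a single double weight. The class name, the choice of UTF-8, and the decimal string format are assumptions for illustration, not the talk's actual implementation.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: names and the encoding choice are assumptions.
class EdgeValueCodecSketch {

    // Encode the weight as its decimal string rather than as 8 raw IEEE 754 bytes.
    static byte[] toBytes(double weight) {
        return Double.toString(weight).getBytes(StandardCharsets.UTF_8);
    }

    // Decode by parsing the string back into a double.
    static double fromBytes(byte[] bytes) {
        return Double.parseDouble(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] encoded = toBytes(0.75);
        System.out.println(new String(encoded, StandardCharsets.UTF_8)); // 0.75
        System.out.println(fromBytes(encoded));                          // 0.75
    }
}
```

A string-encoded cell costs a few more bytes than a raw double, but it shows up as readable text in tools that print printable bytes as-is, which makes debugging easier.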
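Querying by vertex
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.client.Result

// Fetch every cell in the "edges" family of the vertex's row
def get = new Get(vertexKeyBytes)
get.addFamily(edgesFamilyBytes)

Result result = table.get(get)
result.noVersionMap.each { family, data ->

    // construct edge objects as needed
    // data is a Map<byte[],byte[]> of column qualifier to cell value,
    // i.e. destination vertex to edge value
}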
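Adding edges to a vertex
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.io.NullWritable

// One row per source vertex, one column per destination vertex
def put = new Put(vertexKeyBytes)

// Put.add(family, qualifier, value) is the older client API;
// newer HBase releases replace it with Put.addColumn(...)
put.add(
    edgesFamilyBytes,
    destinationVertexBytes,
    edgeValue.toBytes() // your own implementation here
)

// if writing directly
table.put(put)

// if using TableReducer
context.write(NullWritable.get(), put)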
Distributed Traversal / Indexing

[Diagram: three connected vertices: e:dan@fullcontact.com, p:+13039316251, t:danklynn]

- Pivot vertex: p:+13039316251
- MapReduce over outbound edges
- Emit vertices and edge data grouped by the pivot
- Reduce key: p:+13039316251
- “Out” vertex: e:dan@fullcontact.com
- “In” vertex: t:danklynn
- Reducer emits higher-order edge: e:dan@fullcontact.com, t:danklynn

Iteration 0
Iteration 1
Iteration 2: reuse edges created during previous iterations
Iteration 3: reuse edges created during previous iterations

N hops requires only ~log2(N) iterations
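The pivot-based traversal can be simulated in memory to see why reusing edges from earlier iterations keeps the iteration count low. The sketch below is not a MapReduce job. It is a single-process approximation, assuming a symmetric adjacency list, in which the "map" step groups each row's vertex under every neighbor (the pivot) and the "reduce" step links every pair of vertices that share a pivot. All names are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative single-process simulation; not an actual Hadoop job.
class PivotTraversalSketch {

    // One simulated pass: returns the graph plus the new higher-order edges.
    static Map<String, Set<String>> iterate(Map<String, Set<String>> graph) {
        // "Map": for each outbound edge v -> pivot, emit (pivot, v).
        Map<String, Set<String>> byPivot = new TreeMap<>();
        for (Map.Entry<String, Set<String>> row : graph.entrySet())
            for (String pivot : row.getValue())
                byPivot.computeIfAbsent(pivot, k -> new TreeSet<>()).add(row.getKey());

        // Copy existing edges so later iterations can reuse them.
        Map<String, Set<String>> next = new TreeMap<>();
        for (Map.Entry<String, Set<String>> row : graph.entrySet())
            next.put(row.getKey(), new TreeSet<>(row.getValue()));

        // "Reduce": link every pair of distinct vertices grouped under one pivot.
        for (Set<String> group : byPivot.values())
            for (String a : group)
                for (String b : group)
                    if (!a.equals(b))
                        next.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        return next;
    }

    // Count passes until 'from' and 'to' are directly linked; -1 if unreachable.
    static int iterationsToConnect(Map<String, Set<String>> graph, String from, String to) {
        int iterations = 0;
        while (!graph.getOrDefault(from, Collections.<String>emptySet()).contains(to)) {
            Map<String, Set<String>> next = iterate(graph);
            if (next.equals(graph)) return -1; // no new edges were produced
            graph = next;
            iterations++;
        }
        return iterations;
    }

    // Build a simple path v0 - v1 - ... - vN with N hops.
    static Map<String, Set<String>> path(int hops) {
        Map<String, Set<String>> g = new TreeMap<>();
        for (int i = 0; i < hops; i++) {
            String a = "v" + i, b = "v" + (i + 1);
            g.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
            g.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
        }
        return g;
    }

    public static void main(String[] args) {
        System.out.println(iterationsToConnect(path(8), "v0", "v8")); // 3
    }
}
```

Each pass links vertices that are two hops apart in the current graph, so the reachable distance doubles per iteration: the two ends of an 8-hop path are linked after 3 passes.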
Tips / Gotchas
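Do implement your own comparator
// Java
public static class Comparator
               extends WritableComparator {

    public Comparator() {
        super(VertexKeyWritable.class);
    }

    // Compares two serialized keys directly on their raw bytes,
    // so keys need not be deserialized during the sort
    @Override
    public int compare(
        byte[] b1, int s1, int l1,
        byte[] b2, int s2, int l2) {

        // .....
    }
}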

// Java: register the comparator, typically in a static initializer of the key class
static {
    WritableComparator.define(VertexKeyWritable.class,
         new VertexKeyWritable.Comparator());
}
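The body of the raw compare method is elided in the slides. As a hedged illustration of what such a method might do, the plain-Java sketch below compares two byte ranges lexicographically, treating bytes as unsigned. It uses the same parameter layout as the compare method above but does not depend on Hadoop. The class name is hypothetical, and in real Hadoop code one would more likely delegate to WritableComparator.compareBytes.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: a stand-in for the elided comparator body.
class RawBytesComparatorSketch {

    // Lexicographic comparison of b1[s1 .. s1+l1) and b2[s2 .. s2+l2).
    static int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff; // mask so bytes compare as unsigned values
            int y = b2[s2 + i] & 0xff;
            if (x != y) return x - y;
        }
        return l1 - l2; // equal prefix: the shorter range sorts first
    }

    public static void main(String[] args) {
        byte[] e = "e:dan@fullcontact.com".getBytes(StandardCharsets.UTF_8);
        byte[] t = "t:danklynn".getBytes(StandardCharsets.UTF_8);
        System.out.println(compare(e, 0, e.length, t, 0, t.length) < 0); // true
    }
}
```

Comparing serialized bytes directly avoids constructing key objects for every comparison during the shuffle sort, which is the usual reason for supplying a raw comparator.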
MultiScanTableInputFormat
// Java
MultiScanTableInputFormat.setTable(conf, "graph");

MultiScanTableInputFormat.addScan(conf, new Scan());

job.setInputFormatClass(MultiScanTableInputFormat.class);
TableMapReduceUtil
// Java
TableMapReduceUtil.initTableReducerJob(
    "graph", MyReducer.class, job);
Elastic MapReduce

HFiles
  -> Copy to S3 -> SequenceFiles
  -> Elastic MapReduce -> SequenceFiles
  -> HFileOutputFormat.configureIncrementalLoad(job, outputTable) -> HFiles
  -> $ hadoop jar hbase-VERSION.jar completebulkload -> HBase
Additional Resources
Google Pregel: BSP-based graph processing system

Apache Giraph: Implementation of Pregel for Hadoop

MultiScanTableInputFormat example

Apache Mahout: Distributed machine learning on Hadoop
Thanks!
dan@fullcontact.com
