Building a Business on Open Source
       Distributed Computing


  company: www.visibletechnologies.com

      blog: www.roadtofailure.com
         twitter: @lusciouspear
Social Media and Scaling

•Scalability Matters Now.
•SM produces large, complex data
•Anyone can collect the web
•Make a Twitter in a few days
•Easy to get TBs of data
•Big Data enabling new fields for
  companies
What Visible Does



•BI and Brand Management on Social
  Media

•Listen, Monitor, Engage
Old Product: RDBMS


•A few MSSQL servers on boxes
•Lots of ETL
•Several TB, inserts slow, deletes
  impossible, random fail
Why RDBMS Bad
•Nonlinear scale cost
•Used as a storage abstraction
•Mainly Select, Join, Group, Count
•Specialized Scale-Out ones ‘meh’
•Impedance Mismatch - Try to be High-
  Throughput, Low-Latency

•Swiss-army knife, unstable,
  transactions, advanced SQL, tuning
Why OSS?

•Previously all MS
•It exists!
•Scaling + Licensing = No
•Can’t build a platform without source
•It’s Enterprise Now!
Goals for New Platform

•“Golden Timeline”
•Search/Analyze *any* data
•Linear Cost
•Not Hacked Together
•“Collect the Social Internet”
HOW TO SCALE



•What makes you special?
•What are you willing to sacrifice?
•How will you structure the data?
Avoiding Impedance Mismatch


 •Most problems can be divided into
   High or Low latency

 •Get a lot of data eventually, or a little
   now

 •MapReduce vs. Sharding/Indexing
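
The split above can be sketched in a few lines of Python. This is an illustration only, not the platform's code: the sample posts, the function names, and the modulo sharding rule are all assumptions made for the example. The batch function scans everything to answer a broad question (a lot of data, eventually); the sharded lookup touches one shard and one key (a little data, now).

```python
from collections import defaultdict

# Hypothetical sample posts: (post_id, author, text)
POSTS = [
    (1, "alice", "hadoop scales out"),
    (2, "bob", "hbase stores rows"),
    (3, "alice", "lucene indexes text"),
    (4, "carol", "hadoop runs mapreduce"),
]


def batch_posts_per_author(posts):
    """High-latency path: scan every record to answer a broad question."""
    counts = defaultdict(int)
    for _post_id, author, _text in posts:
        counts[author] += 1
    return dict(counts)


def build_sharded_index(posts, num_shards):
    """Low-latency path, prepared ahead of time: post_id -> record per shard."""
    shards = [dict() for _ in range(num_shards)]
    for post in posts:
        # Assumed sharding rule: post_id modulo the number of shards.
        shards[post[0] % num_shards][post[0]] = post
    return shards


def lookup(shards, post_id):
    """Touch exactly one shard and one key; returns None if absent."""
    return shards[post_id % len(shards)].get(post_id)


shards = build_sharded_index(POSTS, num_shards=2)
print(batch_posts_per_author(POSTS))
print(lookup(shards, 3))
```

The batch path gets slower as the data grows but can answer any question; the lookup stays fast but only answers the questions the index was built for.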
Ecosystem
•Compiled Processing: Pig, Cascading, Hive
•Raw Processing: MapReduce
•Structured Storage: HBase
•Unstructured Storage: Hadoop DFS
•Katta / Applications
•Zookeeper
Simple Workflow
•Hadoop: Collect -> Semantic Analysis -> Unstructured Analysis
•Hadoop + HBase: Store in HBase -> Structured Analysis -> Indexing -> Store in Hadoop
•Lucene + Solr + Katta: Pull Indexes -> Load/Replicate Shards -> Search
Unstructured Processing Cluster
•Internet -> Collect -> Semantic Analysis -> Unstructured Analysis -> Structured Store
•Data: HTML -> XML -> HBase Records
Hadoop + MR


•Special: Crunch web-scale data fast
•Sacrifice: Low-Latency, Transactions,
  Random Access, Updates

•Structure: Chunked flat files
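
A minimal in-memory sketch of the MapReduce model over chunked flat files, assuming tab-separated lines of ContentID and PostDT. The sample records, the chunking, and all function names are made up for illustration; real Hadoop jobs distribute the chunks across machines and are written against the Hadoop API, which this does not use.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical chunks of a flat file: "ContentID<TAB>PostDT" per line.
CHUNKS = [
    ["00BAC189\t20090718", "00BAC18A\t20090718"],
    ["00BAC18B\t20090719"],
]


def map_line(line):
    """Map step: emit (PostDT, 1) for every record."""
    _content_id, post_dt = line.split("\t")
    yield post_dt, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the values for one key."""
    return key, sum(values)


def run_job(chunks):
    """Run map over every line of every chunk, then shuffle and reduce."""
    mapped = chain.from_iterable(
        map_line(line) for chunk in chunks for line in chunk
    )
    return dict(reduce_counts(k, vs) for k, vs in shuffle(mapped).items())


print(run_job(CHUNKS))  # posts per day
```

Each chunk is read front to back and never updated in place, which is where the sacrifices listed above come from: no random access, no updates, and no answer until the whole job finishes.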
Structured Processing Cluster
•Unstructured Cluster -> Store in HBase -> Structured Analysis -> Indexing -> Store in Hadoop -> Search Cluster
•Data: HBase Records -> Enriched Data -> Lucene Index -> Sharded Lucene Index
Document Structure


ContentID: 00BAC189
Title: Iron Maiden Rules
Body: I think Janick Gers is an amazing guitarist blah blah
PostDT: 20090718
ParentID: 0FDEADBEEF
Permalink: www.roadtofailure.com/post?=20
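
One way to picture this record in code is a small typed structure parsed from the "Key: Value" lines above. The class name, the snake_case field names, and the parser are assumptions for illustration; the slide does not say how the platform represents documents internally.

```python
from dataclasses import dataclass

# The sample record from the slide, as "Key: Value" lines.
RAW = """ContentID: 00BAC189
Title: Iron Maiden Rules
Body: I think Janick Gers is an amazing guitarist blah blah
PostDT: 20090718
ParentID: 0FDEADBEEF
Permalink: www.roadtofailure.com/post?=20
"""

# Assumed mapping from slide field names to Python attribute names.
FIELD_NAMES = {
    "ContentID": "content_id",
    "Title": "title",
    "Body": "body",
    "PostDT": "post_dt",
    "ParentID": "parent_id",
    "Permalink": "permalink",
}


@dataclass(frozen=True)
class Document:
    content_id: str
    title: str
    body: str
    post_dt: str      # kept as the raw YYYYMMDD string from the slide
    parent_id: str
    permalink: str


def parse_document(raw):
    """Split each line on the first ': ' and build a Document."""
    fields = {}
    for line in raw.strip().splitlines():
        key, value = line.split(": ", 1)
        fields[FIELD_NAMES[key]] = value
    return Document(**fields)


doc = parse_document(RAW)
print(doc.content_id, doc.parent_id)
```

ParentID points at another document's ContentID, so threads can be followed by key lookup rather than by a join.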
HBase


•Special: Scalable random/sequential
  access almost as fast as RDBMS

•Sacrifice: Joins, Secondary Indexes,
  Transactions (kind of)

•Structure: BigTable - column oriented
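
The BigTable-style structure can be approximated as a sorted map from row key to a map of columns. The toy class below is a sketch of that idea only: `TinyTable`, the `post:` column family, and the second sample row are hypothetical, and none of this is the HBase client API.

```python
import bisect


class TinyTable:
    """Toy sorted map: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted for range scans
        self._rows = {}   # row key -> column map

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access by row key; empty dict if the row is absent."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Sequential access over row keys in the half-open range [start, stop)."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = TinyTable()
table.put("00BAC189", "post:title", "Iron Maiden Rules")
table.put("00BAC189", "post:parent", "0FDEADBEEF")
table.put("0FDEADBEEF", "post:title", "Parent post")  # hypothetical row

print(table.get("00BAC189"))
print([key for key, _ in table.scan("00", "0F")])
```

Lookups and scans work only through the row key, which is why joins and secondary indexes appear in the sacrifice list: anything else has to be designed into the key or built as a separate table.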
Search Cluster
•Lucene Indexes from HDFS -> Pull Indexes -> Load/Replicate Shards -> Search
•Data at each step: Lucene Indexes
Search
Katta + Solr



•Special: Sharded search
•Sacrifice: Consistency, high-throughput
•Structure: Inverted index
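
Sharded search over a term-to-documents index (usually called an inverted index) can be sketched as below. The classes, the whitespace tokenizer, and the modulo shard assignment are assumptions for illustration; this is not how Katta, Solr, or Lucene are called in practice.

```python
from collections import defaultdict


class Shard:
    """One index shard: term -> set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Assumed tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return self.postings.get(term.lower(), set())


class ShardedIndex:
    """Routes each document to one shard; queries fan out to all shards."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, term):
        hits = set()
        for shard in self.shards:
            hits |= shard.search(term)
        return sorted(hits)


index = ShardedIndex(num_shards=2)
index.add(1, "Iron Maiden Rules")
index.add(2, "Maiden voyage")
index.add(3, "Hadoop rules")

print(index.search("maiden"))
print(index.search("rules"))
```

Each shard can be rebuilt and replicated independently, so a query may briefly see shards at different versions; that is the consistency given up in exchange for scale.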
BI


•Group, Sort, Filter, Count, Sum
•Semi-additive (Avg) rare but not hard
•MapReduce Jobs
•Faceted Search
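
The point about averages can be shown with partial aggregates: counts and sums merge across partitions by simple addition, while an average has to be carried as a (sum, count) pair and divided only at the end. The partition data, brand names, and function names below are made up for the sketch.

```python
from collections import defaultdict

# Hypothetical partitions of (brand, mentions) rows, e.g. one per mapper.
PARTITIONS = [
    [("brand_a", 4), ("brand_b", 2)],
    [("brand_a", 2), ("brand_b", 6), ("brand_a", 3)],
]


def partial_aggregate(rows):
    """Per-partition step: accumulate (sum, count) for each group key."""
    acc = defaultdict(lambda: [0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return {key: tuple(pair) for key, pair in acc.items()}


def merge(partials):
    """Final step: add the partial sums and counts, then derive the average."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (part_sum, part_count) in partial.items():
            total[key][0] += part_sum
            total[key][1] += part_count
    return {
        key: {"count": c, "sum": s, "avg": s / c}
        for key, (s, c) in total.items()
    }


report = merge(partial_aggregate(rows) for rows in PARTITIONS)
print(report)
```

Averaging the per-partition averages would give a different and wrong answer for brand_a here, because the partitions hold different numbers of rows.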
Examples
Challenges


•Scaling Search
•Understanding Latency
•What do we need ‘now’? Can
 customers wait for big data?

•Monitoring
Recap: Rules for Scaling

•RDBMS is not a Swiss-Army Knife
•Know your sacrifices
•Know your specialness
•Know your data structure
•Ponder Latency
What Next?



•HBase Analytics?
•“What would make a bank trust it”
•Teach people to think about data
...
The End


company: www.visibletechnologies.com

    blog: www.roadtofailure.com
       twitter: @lusciouspear

    bradfordstephens@gmail.com
