Building a Business on Open Source
       Distributed Computing


  company: www.visibletechnologies.com

      blog: www.roadtofailure.com
         twitter: @lusciouspear
Social Media and Scaling

•Scalability Matters Now.
•SM produces large, complex data
•Anyone can collect the web
•Make a Twitter in a few days
•Easy to get TBs of data
•Big Data enabling new fields for
  companies
What Visible Does



•BI and Brand Management on Social
  Media

•Listen, Monitor, Engage
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Old Product: RDBMS


•A few MSSQL servers on boxes
•Lots of ETL
•Several TB, inserts slow, deletes
  impossible, random fail
Why RDBMS Bad
•Nonlinear scale cost
•Used as a storage abstraction
•Mainly Select, Join, Group, Count
•Specialized Scale-Out ones ‘meh’
•Impedance Mismatch - Try to be High-Throughput, Low-Latency

•Swiss-army knife, unstable,
  transactions, advanced SQL, tuning
Why OSS?

•Previously all MS
•It exists!
•Scaling + Licensing = No
•Can’t build a platform without source
•It’s Enterprise Now!
Goals for New Platform

•“Golden Timeline”
•Search/Analyze *any* data
•Linear Cost
•Not Hacked Together
•“Collect the Social Internet”
HOW TO SCALE



•What makes you special?
•What are you willing to sacrifice?
•How will you structure the data?
Avoiding Impedance Mismatch


 •Most problems can be divided into
   High or Low latency

 •Get a lot of data eventually, or a little
   now

 •MapReduce vs. Sharding/Indexing
Ecosystem

•Katta / Applications
•Compiled Processing: Pig, Cascading, Hive
•Raw Processing: MapReduce
•Structured Storage: HBase
•Unstructured Storage: Hadoop DFS
•Zookeeper (shown alongside the layers)
Simple Workflow

•Hadoop: Collect -> Semantic Analysis -> Unstructured Analysis
•Hadoop + HBase: Store in HBase -> Structured Analysis -> Indexing -> Store in Hadoop
•Lucene + Solr + Katta: Pull Indexes -> Load/Replicate Shards -> Search
Unstructured Processing Cluster

•Internet -> Collect -> Semantic Analysis -> Unstructured Analysis -> Structured Store
•Data forms along the way: HTML, XML, HBase Records
Hadoop + MR


•Special: Crunch web-scale data fast
•Sacrifice: Low-Latency, Transactions,
  Random Access, Updates

•Structure: Chunked flat files
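
A minimal sketch of the map / shuffle / reduce pattern over chunked flat files, in plain Python rather than the Hadoop API. The tab-separated record layout, the CHUNKS sample data, and all function names are assumptions made for illustration; they are not taken from the deck.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical flat-file chunks, one post per line: "<PostDT>\t<Title>"
CHUNKS = [
    ["20090718\tIron Maiden Rules", "20090718\tsecond post"],
    ["20090719\tthird post"],
]


def map_chunk(lines):
    """Map phase: emit (post_date, 1) for every record in one chunk."""
    for line in lines:
        post_date, _title = line.split("\t", 1)
        yield post_date, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce phase: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def posts_per_day(chunks):
    """Run the three phases in-process; each chunk is mapped independently."""
    mapped = chain.from_iterable(map_chunk(chunk) for chunk in chunks)
    return reduce_counts(shuffle(mapped))
```

Each chunk is read once from start to end and never updated in place, which is the trade the slide describes: high throughput over whole files, at the cost of random access and updates.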
Structured Processing Cluster

•Unstructured Cluster -> Store in HBase -> Structured Analysis -> Indexing -> Store in Hadoop -> Search Cluster
•Data forms along the way: HBase Records, Enriched Data, Lucene Index, Sharded Lucene Index
Document Structure


ContentID: 00BAC189
Title: Iron Maiden Rules
Body: I think Janick Gers is an amazing guitarist blah blah
PostDT: 20090718
ParentID: 0FDEADBEEF
Permalink: www.roadtofailure.com/post?=20
HBase


•Special: Scalable random/sequential
  access almost as fast as RDBMS

•Sacrifice: Joins, Secondary Indexes,
  Transactions (kind of)

•Structure: BigTable - column oriented
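
A toy in-memory sketch of the BigTable-style layout: rows sorted by key, each row holding "family:qualifier" columns. This is not the HBase client API. The class name TinyColumnStore and the column families "content" and "meta" are hypothetical; only the field names and values come from the Document Structure slide.

```python
from bisect import bisect_left, insort


class TinyColumnStore:
    """Toy stand-in for a BigTable-style table (illustration only)."""

    def __init__(self):
        self._keys = []  # row keys kept sorted so range scans stay cheap
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; rows may have different sets of columns."""
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access by row key."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Sequential access: rows with start <= key < stop, in key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            key = self._keys[i]
            yield key, self._rows[key]
            i += 1


store = TinyColumnStore()
store.put("00BAC189", "content:Title", "Iron Maiden Rules")
store.put("00BAC189", "content:PostDT", "20090718")
store.put("00BAC189", "meta:ParentID", "0FDEADBEEF")
```

Lookups and scans go through the row key only. Filtering by PostDT, for example, would need a second table keyed by date or a full scan, which is the "no joins, no secondary indexes" sacrifice listed above.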
Search Cluster

•Lucene Indexes from HDFS -> Pull Indexes -> Load/Replicate Shards -> Search
•Data forms along the way: Lucene Indexes
Search
Katta + Solr



•Special: Sharded search
•Sacrifice: Consistency, high-throughput
•Structure: Reverse index
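
A rough sketch of sharded search over a reverse (more commonly called inverted) index, in plain Python. It does not use Katta, Solr, or Lucene. The class names, the whitespace tokenizer, and the hash-based routing by document ID are all assumptions made for illustration.

```python
from collections import defaultdict
from zlib import crc32


class Shard:
    """One index shard: maps each term to the IDs of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return self.postings.get(term.lower(), set())


class ShardedIndex:
    """Routes documents to shards on write, queries every shard on read."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # crc32 is stable across runs, so routing is deterministic
        slot = crc32(doc_id.encode("utf-8")) % len(self.shards)
        self.shards[slot].add(doc_id, text)

    def search(self, term):
        # Scatter the query to all shards, then gather the union of hits
        hits = set()
        for shard in self.shards:
            hits |= shard.search(term)
        return hits


index = ShardedIndex(num_shards=3)
index.add("00BAC189", "Iron Maiden Rules")
index.add("00BAC18A", "iron and wine")
```

Every query touches every shard, so adding shards spreads the index size but not the per-query work, and a shard that has not yet loaded the latest index returns stale hits. That is one way to read the consistency and throughput sacrifices on the slide.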
BI


•Group, Sort, Filter, Count, Sum
•Semi-additive (Avg) rare but not hard
•MapReduce Jobs
•Faceted Search
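
One common way to make an average work across partitions, sketched in plain Python: each partition emits an additive (sum, count) pair, and the division happens once at the end. The function names are hypothetical and the deck does not say this is how Visible computed it.

```python
from functools import reduce


def partial(values):
    """Per-partition summary: (sum, count) can be added, an average cannot."""
    return (sum(values), len(values))


def merge(a, b):
    """Combine two partial summaries; order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1])


def average(partitions):
    """Average over all values in all partitions, or None if there are none."""
    total, count = reduce(merge, (partial(p) for p in partitions), (0, 0))
    return total / count if count else None
```

For partitions [1, 2, 3] and [10], averaging the per-partition averages would give 6, while merging the pairs gives (16, 4) and the correct 4.0.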
Examples
Challenges


•Scaling Search
•Understanding Latency
•What do we need ‘now’? Can
 customers wait for big data?

•Monitoring
Recap: Rules for Scaling

•RDBMS is not a Swiss-Army Knife
•Know your sacrifices
•Know your specialness
•Know your data structure
•Ponder Latency
What Next?



•HBase Analytics?
•“What would make a bank trust it”
•Teach people to think about data
...
The End


company: www.visibletechnologies.com

    blog: www.roadtofailure.com
       twitter: @lusciouspear

    bradfordstephens@gmail.com
