Realtime Analytics using
                    Hadoop & HBase
                                Lars George,
                      Solutions Architect @ Cloudera
                            lars@cloudera.com



Monday, July 25, 11
About Me

                      • Solutions Architect @ Cloudera
                      • Apache HBase & Whirr Committer
                      • Working with Hadoop & HBase since 2007
                      • Author of O’Reilly’s “HBase: The Definitive
                        Guide”



The Application Stack
                      • Solve Business Goals
                      • Rely on Proven Building Blocks
                      • Rapid Prototyping
                       ‣ Templates, MVC, Reference
                         Implementations
                      • Evolutionary Innovation Cycles
                                 “Let there be light!”
LAMP



L   Linux


                      A   Apache


                      M   MySQL


                      P   PHP/Perl




L   Linux


                      A   Apache


                      M   MySQL


                      M   Memcache


                      P   PHP/Perl


The Dawn of Big Data
                      •   Industry verticals produce a staggering amount of data
                      •   Not only web properties, but also “brick and mortar”
                          businesses
                          ‣   Smart Grid, Bioinformatics, Financial, Telco
                      •   Scalable computation frameworks allow analysis of all the data
                          ‣   No sampling anymore
                      •   Suitable algorithms derive even more data
                          ‣   Machine learning
                      •   “The Unreasonable Effectiveness of Data”
                          ‣   More data is better than smart algorithms

Hadoop

                      • HDFS + MapReduce
                      • Based on Google Papers
                      • Distributed Storage and Computation
                        Framework
                      • Affordable Hardware, Free Software
                      • Significant Adoption
HDFS
                      •   Reliably store petabytes of replicated data across
                          thousands of nodes
                          ‣ Data divided into 64MB blocks, each block replicated
                            three times
                      •   Master/Slave Architecture
                          ‣ Master NameNode holds the metadata
                          ‣ Slave DataNodes manage blocks on the local file system
                      •   Built on “commodity” hardware
                          ‣ No 15k RPM disks or RAID required (nor wanted!)
                          ‣ Commodity Server Hardware
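
As a rough illustration of the block and replication figures above, the sketch below computes how many blocks a file occupies and how much raw disk it consumes. The 64 MB block size and three replicas come from the slide; the function name and the example file sizes are assumptions made for illustration.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # replicas per block quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return (block count, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    # The last block is not padded, so raw usage is file size times replication.
    raw_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_mb


print(hdfs_footprint(1024))  # a 1 GB file
```

A 1 GB file therefore spans 16 blocks, which the cluster stores as 48 block replicas spread over different DataNodes.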

MapReduce
                      • Distributed programming model to reliably
                        process petabytes of data
                      • Locality of data to processing is vital
                        ‣ Run code where data resides
                      • Inspired by map and reduce functions in
                        functional programming

          Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
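
The pipeline above can be sketched in a few lines of single-process Python. This is only an in-memory model of the programming model, not Hadoop's API: the word-count task and the names `map_fn`, `reduce_fn`, and `run_job` are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_fn(line):
    # Map(): emit one (word, 1) pair per word in the input record.
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    # Reduce(): combine all values that share a key.
    return key, sum(values)


def run_job(records):
    # Copy/Sort: group intermediate pairs by key, then visit keys in sorted order.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


print(run_job(["HBase stores data", "Hadoop stores data"]))
```

In a real cluster the map calls run on the nodes that hold the input blocks, and the grouping step moves data across the network to the reducers.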
From Short to Long Term
                      Internet


                      LAM(M)P
                                 • Serves the Client
                                 • Stores Intermediate Data

                      Hadoop
                                 • Background Batch Processing
                                 • Stores Long-Term Data

Batch Processing
                      •   Scale is Unlimited
                          ‣ Bound only by Hardware
                      •   Harness the Power of the Cluster
                          ‣ CPUs, Disks, Memory

                      •   Disks extend Memory
                          ‣ Spills represent Swapping

                      •   Trade Size Limitations for Time
                          ‣ Jobs run from a few minutes to hours or days
From Batch to Realtime
                      •   “Time is Money”
                      •   Bridging the gap between batch and “now”
                      •   Realtime often means “faster than batch”
                      •   80/20 Rule
                          ‣ Hadoop solves the 80% easily
                          ‣ The remaining 20% takes 80% of the
                            effort
                      •   Go as close as possible, don’t overdo it!

Stop Gap Solutions
                      •   In Memory
                          ‣   Memcached
                          ‣   MemBase
                          ‣   GigaSpaces
                      •   Relational Databases
                          ‣   MySQL
                          ‣   PostgreSQL
                      •   NoSQL
                          ‣   Cassandra
                          ‣   HBase

HBase Architecture




Client Access




Auto Sharding




Distribution




HBase Key Design




Key Cardinality




Fold, Store, and Shift




Complemental Design #1
                      Internet
                                 • Keep Backup in HDFS
                                 • MapReduce over HDFS
                                 • Synchronize HBase
                      LAM(M)P      ‣Batch Puts
                                   ‣Bulk Import

                      Hadoop     HBase



Complemental Design #2
                      Internet
                                 • Add Log Support
                                 • Synchronize HBase
                      LAM(M)P      ‣Batch Puts
                       Flume
                                   ‣Bulk Import


                      Hadoop     HBase


Mitigation Planning
                      • Reliable storage has top priority
                      • Disaster Recovery
                      • HBase Backups
                        ‣ Export - but what if HBase is “down”?
                        ‣ CopyTable - same issue
                        ‣ Snapshots - not available

Complemental Design #3
                      Internet
                                  • Add Log Processing
                                  • Remove Direct Connection
                      LAM(M)P     • Synchronize HBase
                                    ‣Batch Puts
                       Flume        ‣Bulk Import

                                 Log
                      Hadoop                HBase
                                 Proc


Facebook Insights

                      • > 20B Events per Day
                      • 1M Counter Updates per Second
                        ‣ 100-Node Cluster
                        ‣ 10K OPS per Node


                      Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
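
The throughput figures on this slide can be checked with simple arithmetic. The sketch below uses only the slide's numbers; since the event volume is given as a lower bound ("> 20B"), the per-second event rate is an average floor, not a measured peak.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 20_000_000_000       # "> 20B Events per Day" (lower bound)
counter_updates_per_sec = 1_000_000   # "1M Counter Updates per Second"
nodes = 100                           # "100 Nodes Cluster"

# Average event rate implied by the daily volume.
avg_events_per_sec = events_per_day / SECONDS_PER_DAY

# Even spread of counter updates over the cluster.
updates_per_node = counter_updates_per_sec // nodes

print(round(avg_events_per_sec), updates_per_node)
```

The per-node result matches the 10K OPS per node quoted on the slide, which only holds if row keys spread the load evenly over the cluster.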

Collection Layer

                      • “Like” button triggers AJAX request
                      • Event written to log file using Scribe
                        ‣ Handles aggregation, delivery, file roll
                          over, etc.
                        ‣ Uses HDFS to store files
                      ✓ Use Flume or Scribe

Filter Layer
                      • Ptail “follows” logs written by Scribe
                      • Aggregates from multiple logs
                      • Separates into event types
                        ‣ Sharding for future growth
                      • Facebook internal tool
                      ✓ Use Flume

Batching Layer
                      • Puma batches updates
                        ‣ 1 sec, staggered
                      • Flush batch when the previous one is done
                      • Duration limited by key distribution
                      • Facebook internal tool
                      ✓ Use Coprocessors (0.92.0)
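
Puma is a Facebook-internal tool, so the following is only a sketch of the general batching idea under stated assumptions, not its implementation. The `IncrementBatcher` class, the `flush_fn` callback, and the `hourly:6pm_total` column name are hypothetical; a caller would invoke `flush()` roughly once per second to mirror the window on the slide.

```python
from collections import Counter


class IncrementBatcher:
    """Hypothetical helper: coalesce counter increments before one flush."""

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn   # called with {(row, column): delta}
        self.pending = Counter()

    def add(self, row, column, delta=1):
        # Many events on the same cell collapse into a single delta.
        self.pending[(row, column)] += delta

    def flush(self):
        # Hand the coalesced batch to the store and start a new window.
        if self.pending:
            batch, self.pending = dict(self.pending), Counter()
            self.flush_fn(batch)


flushed = []
batcher = IncrementBatcher(flushed.append)
for _ in range(3):
    batcher.add("com.cloudera.www", "hourly:6pm_total")
batcher.add("com.cloudera.blog", "hourly:6pm_total")
batcher.flush()
print(flushed)
```

Four events become two increments here. The more events land on the same cell within one window, the fewer writes reach the store.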

Counters
                      •   Store counters per Domain and per URL
                          ‣ Leverage HBase increment (atomic read-modify-
                            write) feature
                      •   Each row is one specific Domain or URL
                      •   The columns are the counters for specific metrics
                      •   Column families are used to group counters by time
                          range
                          ‣ Set time-to-live on CF level to auto-expire counters
                            by age to save space, e.g., 2 weeks on “Daily
                            Counters” family
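
To make the layout concrete, here is a toy in-memory model of counters grouped by column family, with a time-to-live per family. It is not the HBase client API and it is single-threaded, whereas HBase performs the read-modify-write atomically on the server. The class name, the family names, and the simplification that a cell's age counts from its last write are assumptions.

```python
import time

TWO_WEEKS = 14 * 24 * 60 * 60  # seconds


class CounterTable:
    """Toy in-memory stand-in for a counter table; not the HBase client API."""

    def __init__(self, family_ttl):
        self.family_ttl = family_ttl  # seconds per family, None = keep forever
        self.cells = {}               # (row, family, qualifier) -> (value, written_at)

    def increment(self, row, family, qualifier, delta=1, now=None):
        now = time.time() if now is None else now
        value = self.get(row, family, qualifier, now) + delta
        self.cells[(row, family, qualifier)] = (value, now)
        return value

    def get(self, row, family, qualifier, now=None):
        now = time.time() if now is None else now
        value, written_at = self.cells.get((row, family, qualifier), (0, now))
        ttl = self.family_ttl.get(family)
        # A cell older than its family's time-to-live reads as absent.
        if ttl is not None and now - written_at > ttl:
            return 0
        return value


table = CounterTable({"daily": TWO_WEEKS, "lifetime": None})
table.increment("com.cloudera.www", "daily", "1/1 Total", now=0)
table.increment("com.cloudera.www", "lifetime", "Total", now=0)
print(table.get("com.cloudera.www", "daily", "1/1 Total", now=TWO_WEEKS + 1))
print(table.get("com.cloudera.www", "lifetime", "Total", now=TWO_WEEKS + 1))
```

After two weeks the daily counter has expired while the lifetime counter is still readable, which is the space saving the slide describes.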

Key Design
           •          Reversed Domains, e.g. “com.cloudera.www”, “com.cloudera.blog”
                 ‣      Helps keep pages per site close together, as HBase efficiently scans blocks
                        of sorted keys
           •          Domain Row Key =
                      MD5(Reversed Domain) + Reversed Domain
                 ‣      Leading MD5 hash spreads keys randomly across all regions for
                        load balancing reasons
                 ‣      Only hashing the domain groups per site (and per subdomain if
                        needed)
           •          URL Row Key =
                      MD5(Reversed Domain) + Reversed Domain + URL ID
                 ‣      Unique ID per URL already available, make use of it
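
The two key formulas above can be sketched as follows. The slide does not say how the hash or the URL ID is encoded, so the raw 16-byte MD5 digest and the 8-byte big-endian integer ID are assumptions, as are the function names.

```python
import hashlib


def reverse_domain(domain):
    # "www.cloudera.com" -> "com.cloudera.www"
    return ".".join(reversed(domain.split(".")))


def domain_row_key(domain):
    reversed_domain = reverse_domain(domain).encode("utf-8")
    # 16-byte MD5 digest prefix followed by the readable reversed domain.
    return hashlib.md5(reversed_domain).digest() + reversed_domain


def url_row_key(domain, url_id):
    # Assumption: the URL ID is a non-negative integer stored as 8 big-endian bytes.
    return domain_row_key(domain) + url_id.to_bytes(8, "big")


domain_key = domain_row_key("www.cloudera.com")
url_key = url_row_key("www.cloudera.com", 42)
print(len(domain_key), url_key.startswith(domain_key))
```

Because only the domain is hashed, every URL key of a site shares the domain key as its prefix and sorts directly after it, so one short scan reads a whole site while different sites land in unrelated parts of the key space.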

Insights Schema
  Row Key: Domain Row Key
  Columns:
    Hourly Counters CF:     6pm Total   6pm Male   6pm US   7pm ...   ...
                            100         50         92       45
    Daily Counters CF:      1/1 Total   1/1 Male   1/1 US   2/1 ...   ...
                            1000        320        670      990
    Lifetime Counters CF:   Total       Male       Female   US        ...
                            10000       6780       3220     9900

  Row Key: URL Row Key
  Columns:
    Hourly Counters CF:     6pm Total   6pm Male   6pm US   7pm ...   ...
                            10          5          9        4
    Daily Counters CF:      1/1 Total   1/1 Male   1/1 US   2/1 ...   ...
                            100         20         70       99
    Lifetime Counters CF:   Total       Male       Female   US        ...
                            100         8          92       100




Summary
                      • Design for Use-Case
                       ‣ Read, Write, or Both?
                      • Avoid Hotspotting
                       ‣ Region and Table
                      • Manage Automation at Scale
                       ‣ For now!

Questions?


                      lars@cloudera.com
                      http://cloudera.com




