Hadoop at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter




The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage
‣   Data Analysis
‣   Data Products
The Twitter Data Lifecycle
‣   Data Input: Scribe, Crane
‣   Data Storage: Elephant Bird, HBase
‣   Data Analysis: Pig, Oink
‣   Data Products: Birdbrain



1 Community Open Source
2 Twitter Open Source (or soon)
My Background
‣   Studied Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh routing algorithms,
    GBs of data
‣   Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣   Twitter: Hadoop, Pig, machine learning, visualization, social
    graph analysis, (soon) PBs of data
The Twitter Data Lifecycle
‣   Data Input: Scribe, Crane
‣   Data Storage
‣   Data Analysis
‣   Data Products



1 Community Open Source
2 Twitter Open Source
What Data?
‣   Two main kinds of raw data
‣   	   Logs
‣   	   Tabular data
Logs
‣   Started with syslog-ng
‣   As our volume grew, it didn’t scale
Logs
‣   Started with syslog-ng
‣   As our volume grew, it didn’t scale
‣   Resources overwhelmed
‣   Lost data
Scribe
‣   Scribe daemon runs locally; reliable in network outage
‣   Nodes only know their downstream writer; hierarchical, scalable
‣   Pluggable outputs, per category

    [Diagram: FE, FE, FE → Agg, Agg → File / HDFS]

Scribe at Twitter
‣   Solved our problem, opened new vistas
‣   Currently 57 different categories logged from multiple sources
‣     FE: Javascript, Ruby on Rails
‣     Middle tier: Ruby on Rails, Scala
‣     Backend: Scala, Java, C++
‣   7 TB/day into HDFS


‣   Log first, ask questions later.
Scribe at Twitter
‣   We’ve contributed to it as we’ve used it1

‣       Improved logging, monitoring, writing to HDFS, compression
‣       Added ZooKeeper-based config
‣       Continuing to work with FB on patches


‣   Also: working with Cloudera to evaluate Flume

    1 http://github.com/traviscrawford/scribe
Tabular Data
‣   Most site data is in MySQL
‣     Tweets, users, devices, client applications, etc
‣   Need to move it between MySQL and HDFS
‣     Also between MySQL and HBase, or MySQL and MySQL


‣   Crane: configuration-driven ETL tool
Crane

    [Architecture diagram: a Driver handles configuration/batch management and
     ZooKeeper registration; data flows Source → Extract (Protobuf P1) →
     Transform → Load (Protobuf P2) → Sink]

Crane
‣   Extract
‣   	   MySQL, HDFS, HBase, Flock, GA, Facebook Insights
Crane
‣   Extract
‣   	   MySQL, HDFS, HBase, Flock, GA, Facebook Insights
‣   Transform
‣   	   IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
Crane
‣   Extract
‣   	   MySQL, HDFS, HBase, Flock, GA, Facebook Insights
‣   Transform
‣   	   IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
‣   Load
‣   	   MySQL, Local file, Stdout, HDFS, HBase
Crane
‣   Extract
‣   	   MySQL, HDFS, HBase, Flock, GA, Facebook Insights
‣   Transform
‣   	   IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
‣   Load
‣   	   MySQL, Local file, Stdout, HDFS, HBase
‣   ZooKeeper coordination, intelligent date management
‣   	   Run all the time from multiple servers, self healing
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage: Elephant Bird, HBase
‣   Data Analysis
‣   Data Products



1 Community Open Source
2 Twitter Open Source
Storage Basics
‣   Incoming data: 7 TB/day
‣   LZO encode everything
‣   	   Save 3-4x on storage, pay little CPU
‣   	   Splittable!1

‣   	   IO-bound jobs ==> 3-4x perf increase



1   http://www.github.com/kevinweil/hadoop-lzo
Image credit: http://www.flickr.com/photos/jagadish/3072134867/




Elephant Bird




1 http://github.com/kevinweil/elephant-bird
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
‣   Protobufs: codegen for efficient ser-de of data structures
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
‣   Protobufs: codegen for efficient ser-de of data structures
‣   Why shouldn’t we just continue, and codegen more glue?
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
‣   Protobufs: codegen for efficient ser-de of data structures
‣   Why shouldn’t we just continue, and codegen more glue?
‣     InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig
    HBaseLoaders
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
‣   Protobufs: codegen for efficient ser-de of data structures
‣   Why shouldn’t we just continue, and codegen more glue?
‣     InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig
    HBaseLoaders
‣     Also now does part of this with Thrift, soon Avro
‣     And JSON, W3C Logs
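‣   A rough sketch of how this looks from Pig, assuming the data was written by Crane as
    LZO-compressed, base64-encoded protobufs; the jar path and loader class name below are
    illustrative, not the exact Elephant Bird API:

      -- Sketch only: load one day of protobuf records with an Elephant Bird style loader.
      -- 'elephant-bird.jar' and the loader class name are assumptions for illustration.
      REGISTER 'elephant-bird.jar';
      statuses = LOAD '/tables/statuses/2010/06/29'
                 USING com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader('Status');
      DESCRIBE statuses;   -- the loader exposes fields matching the protobuf definition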
Challenge: Mutable Data
‣   HDFS is write-once: no seek on write, no append (yet)
‣     Logs are easy.
‣     But our tables change.
Challenge: Mutable Data
‣   HDFS is write-once: no seek on write, no append (yet)
‣       Logs are easy.
‣       But our tables change.
‣   Handling rapidly changing data in HDFS: not trivial.
‣     Don’t worry about updated data
‣     Refresh entire dataset
‣     Download updates, tombstone old versions of data, ensure jobs only run
      over current versions of data, occasionally rewrite full dataset
Challenge: Mutable Data
‣   HDFS is write-once: no seek on write, no append (yet)
‣     Logs are easy.
‣     But our tables change.
‣   Handling changing data in HDFS: not trivial.
HBase
‣   Has already solved the update problem
‣     Bonus: low-latency query API
‣     Bonus: rich, BigTable-style data model based on column families
HBase at Twitter
‣   Crane loads data directly into HBase
‣      One CF for protobuf bytes, one CF to denormalize columns for
    indexing or quicker batch access
‣     Update processing is transparent, so we always have accurate data in HBase
‣    Pig Loader for HBase in Elephant Bird makes integration with
    existing analyses easy
HBase at Twitter
‣   Crane loads data directly into HBase
‣      One CF for protobuf bytes, one CF to denormalize columns for
    indexing or quicker batch access
‣     Update processing is transparent, so we always have accurate data in HBase
‣    Pig Loader for HBase in Elephant Bird
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage
‣   Data Analysis: Pig, Oink
‣   Data Products



1 Community Open Source
2 Twitter Open Source
Enter Pig

‣   High level language
‣   Transformations on sets of records
‣   Process data one step at a time
‣   UDFs are first-class citizens
‣   Easier than SQL?
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script
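‣   The script itself appears as an image in the original deck; below is a minimal script in
    the same spirit, with hypothetical paths and field names, to illustrate the readability claim:

      -- Illustrative only: count tweets per user for one day, keep the ten most active users.
      tweets  = LOAD '/tables/statuses/2010/06/29' USING PigStorage('\t')
                AS (user_id:long, created_at:chararray, text:chararray);
      grouped = GROUP tweets BY user_id;
      counts  = FOREACH grouped GENERATE group AS user_id, COUNT(tweets) AS num_tweets;
      ordered = ORDER counts BY num_tweets DESC;
      top10   = LIMIT ordered 10;
      STORE top10 INTO '/tmp/top_tweeters_20100629';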




‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Democratizes Large-scale Data Analysis
‣   The Pig version is:
‣     5% of the code
‣     5% of the time
‣     Within 30% of the execution time.
‣   Innovation increasingly driven from large-scale data analysis
‣     Need fast iteration to understand the right questions
‣     More minds contributing = more value from your data
Pig Examples
‣   Using the HBase Loader




‣   Using the protobuf loaders
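‣   The examples are shown as screenshots; here is a small sketch of the HBase case, using
    Pig’s built-in HBaseStorage as a stand-in for the Elephant Bird HBase loader (table name,
    column family, and qualifiers are hypothetical):

      -- Sketch: read two denormalized columns from an HBase table into Pig.
      statuses = LOAD 'hbase://statuses'
                 USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('meta:user_id meta:text')
                 AS (user_id:long, text:chararray);
      sample   = LIMIT statuses 10;
      DUMP sample;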
Pig Workflow
‣   Oink: framework around Pig for loading, combining, running,
    post-processing
‣   	   Everyone I know has one of these
‣   	   Points to an opening for innovation; discussion beginning
‣   Something we’re looking at: Ruby DSL for Pig, Piglet1




1 http://github.com/ningliang/piglet
Counting Big Data
‣   Standard counts, min, max, std dev
‣   How many requests do we serve in a day?
‣   What is the average latency? 95% latency?
‣   Group by response code. What is the hourly distribution?
‣   How many searches happen each day on Twitter?
‣   How many unique queries, how many unique users?
‣   What is their geographic distribution?
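‣   For example, the request-count and latency questions above reduce to a few lines of Pig;
    the log location and field layout here are assumed for illustration:

      -- Sketch: requests and average latency per response code for one day.
      logs    = LOAD '/logs/web/2010/06/29' USING PigStorage('\t')
                AS (ts:chararray, response_code:int, latency_ms:double);
      by_code = GROUP logs BY response_code;
      stats   = FOREACH by_code GENERATE group AS response_code,
                                         COUNT(logs) AS requests,
                                         AVG(logs.latency_ms) AS avg_latency_ms;
      DUMP stats;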
Correlating Big Data
‣   Probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
‣   Site problems: what goes wrong at the same time?
‣   Which features get users hooked?
‣   Which features do successful users use often?
‣   Search corrections, search suggestions
‣   A/B testing
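‣   A sketch of the mobile-vs-desktop comparison above, assuming each tweet record carries a
    client/source field:

      -- Sketch: tweet volume and unique users per client application.
      tweets    = LOAD '/tables/statuses/2010/06/29' USING PigStorage('\t')
                  AS (user_id:long, client:chararray);
      by_client = GROUP tweets BY client;
      usage     = FOREACH by_client {
                    users = DISTINCT tweets.user_id;
                    GENERATE group AS client, COUNT(tweets) AS num_tweets,
                             COUNT(users) AS unique_users;
                  };
      DUMP usage;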
Research on Big Data
‣   Prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
‣     From the tweets of those they follow?
‣     From the tweets of their followers?
‣     From the ratio of followers/following?
‣   What graph structures lead to successful networks?
‣   User reputation
Research on Big Data
‣   Prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
‣     How deep is the corresponding retweet tree?
‣   Long-term duplicate detection
‣   Machine learning
‣   Language detection
‣   ... the list goes on.
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage
‣   Data Analysis
‣   Data Products: Birdbrain



1 Community Open Source
2 Twitter Open Source
Data Products
‣   Ad Hoc Analyses
‣     Answer questions to keep the business agile, do research
‣   Online Products
‣     Name search, other upcoming products
‣   Company Dashboard
‣     Birdbrain
Questions?
Follow me at twitter.com/kevinweil




‣   P.S. We’re hiring. Help us build the next step: realtime big data analytics.

