Hadoop and Pig @Twitter
              Kevin Weil -- @kevinweil
              Analytics Lead, Twitter








Friday, July 23, 2010
Agenda
           ‣     Hadoop Overview
           ‣     Pig: Rapid Learning Over Big Data
           ‣     Data-Driven Products
           ‣     Hadoop/Pig and Analytics




My Background
           ‣     Mathematics and Physics at Harvard, Physics at
                 Stanford
           ‣     Tropos Networks (city-wide wireless): mesh
                 routing algorithms, GBs of data
           ‣     Cooliris (web media): Hadoop and Pig for
                 analytics, TBs of data
           ‣     Twitter: Hadoop, Pig, HBase, Cassandra,
                 machine learning, visualization, social graph
                 analysis, soon to be PBs of data



Data is Getting Big
           ‣     NYSE: 1 TB/day
           ‣     Facebook: 20+ TB
                 compressed/day
           ‣     CERN/LHC: 40 TB/day
                 (15 PB/year)
           ‣     And growth is
                 accelerating
           ‣     Need multiple machines,
                 horizontal scalability


Hadoop
           ‣      Distributed file system (hard to store a PB)
           ‣      Fault-tolerant, handles replication, node failure,
                  etc
           ‣      MapReduce-based parallel computation
                  (even harder to process a PB)
           ‣      Generic key-value based computation interface
                  allows for wide applicability




Hadoop
           ‣      Open source: top-level Apache project
           ‣      Scalable: Y! has a 4000-node cluster
           ‣      Powerful: sorted a TB of random integers in 62
                  seconds


           ‣      Easy Packaging: Cloudera RPMs, DEBs




MapReduce Workflow
           [Diagram: Inputs -> Map tasks -> Shuffle/Sort -> Reduce tasks -> Outputs]
           ‣     Challenge: how many tweets per user, given tweets table?
           ‣     Input: key=row, value=tweet info
           ‣     Map: output key=user_id, value=1
           ‣     Shuffle: sort by user_id
           ‣     Reduce: for each user_id, sum
           ‣     Output: user_id, tweet count
           ‣     With 2x machines, runs 2x faster
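
The workflow above can be sketched in plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop code; the sample `tweets` records and all function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict

def map_phase(rows):
    """Map: for each (row_key, tweet) input pair, emit (user_id, 1)."""
    for _row_key, tweet in rows:
        yield tweet["user_id"], 1

def shuffle_phase(pairs):
    """Shuffle/sort: group the emitted values by key, ordered by user_id."""
    grouped = defaultdict(list)
    for user_id, one in pairs:
        grouped[user_id].append(one)
    return sorted(grouped.items())

def reduce_phase(grouped):
    """Reduce: for each user_id, sum the 1s to get a tweet count."""
    return [(user_id, sum(ones)) for user_id, ones in grouped]

def count_tweets_per_user(rows):
    return reduce_phase(shuffle_phase(map_phase(rows)))

# Hypothetical sample of a tweets table: (row key, tweet info).
tweets = [
    (1, {"user_id": 12, "text": "hello"}),
    (2, {"user_id": 7, "text": "pig"}),
    (3, {"user_id": 12, "text": "hadoop"}),
]
print(count_tweets_per_user(tweets))  # [(7, 1), (12, 2)]
```

On a real cluster the map and reduce calls run as parallel tasks on different machines, which is where the speedup from adding machines comes from.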

But...
           ‣     Analysis typically in Java
           ‣     Single-input, two-stage
                 data flow is rigid
           ‣     Projections, filters:
                 custom code
           ‣     Joins are lengthy, error-prone
           ‣     Hard to manage n-stage jobs
           ‣     Exploration requires compilation!



Enter Pig
          ‣      High level language
          ‣      Transformations on
                 sets of records
          ‣      Process data one step at a time
          ‣      Easier than SQL?


          ‣      Top-level Apache project
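
To illustrate the one-step-at-a-time style, here is a rough Python analogue of a small dataflow: filter, group and count, order, limit. It is not Pig Latin and it is not the script shown in the talk; the record fields and variable names are invented for illustration, and the comments only loosely name the Pig operators each step resembles.

```python
from collections import Counter

# Hypothetical input records: (user_id, client) pairs.
records = [
    (12, "web"), (7, "mobile"), (12, "mobile"),
    (3, "web"), (12, "web"), (7, "web"),
]

# Step 1 (like FILTER): keep only tweets sent from the web client.
web_only = [r for r in records if r[1] == "web"]

# Step 2 (like GROUP ... BY plus COUNT): count records per user_id.
counts = Counter(user_id for user_id, _client in web_only)

# Step 3 (like ORDER ... BY ... DESC): most active users first,
# breaking ties by user_id so the result is deterministic.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Step 4 (like LIMIT): keep the top two.
top = ranked[:2]
print(top)  # [(12, 2), (3, 1)]
```

Each intermediate result has a name, so any step can be inspected or reused without rewriting the ones before it.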



Why Pig?
             ‣      Because I bet you can read the following script.




A Real Pig Script
           [Slide image: the Pig script is not preserved in this text]

Now, just for fun...
             ‣      The same calculation in vanilla MapReduce




No, seriously.
           [Slide image: the MapReduce version of the code is not preserved in this text]

Pig Democratizes Large-scale
           Data Analysis
           ‣     The Pig version is:
           ‣            5% of the code
           ‣            5% of the development time
           ‣            Within 25% of the execution time
           ‣            Readable, reusable




One Thing I’ve Learned
           ‣     It’s easy to answer questions
           ‣     It’s hard to ask the right questions


           ‣     Value the system that promotes innovation and
                 iteration




MySQL, MySQL, MySQL
           ‣     We all start there.
           ‣     But MySQL is not built for analysis.
           ‣     select count(*) from users? Maybe.
           ‣     select count(*) from tweets? Uh...
           ‣     Imagine joining them.
           ‣     And grouping.
           ‣     Then sorting.



Non-Pig Hadoop at Twitter
           ‣     Data Sink via Scribe
           ‣     Distributed Grep
           ‣     A few performance-critical, simple jobs
           ‣     People Search




People Search?
           ‣     First real product built with Hadoop
           ‣     “Find People”
           ‣     Old version: offline process on
                 a single node
           ‣     New version: complex graph
                 calculations, hit internal network
                 services, custom indexing
           ‣     Faster, more reliable,
                 more observable
People Search
           ‣     Import user data into HBase
           ‣     Periodic MapReduce job reading from HBase
           ‣      Hits FlockDB, other internal services in
                 mapper
           ‣            Custom partitioning
           ‣     Data sucked across to sharded, replicated,
                 horizontally scalable, in-memory, low-latency
                 Scala service
           ‣       Build a trie, do case folding/normalization,
                 suggestions, etc
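
The last step, a trie with case folding and normalization for suggestions, might look roughly like the Python sketch below. The talk describes a Scala service whose internals are not shown, so this is only an illustration of the idea: the `normalize` helper, the `Trie` class, and the sample names are all assumptions.

```python
import unicodedata

def normalize(name):
    """Case-fold and strip accents so 'Kévin' and 'kevin' match."""
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

class Trie:
    def __init__(self):
        self.children = {}
        self.names = []  # original names whose normalized form ends here

    def insert(self, name):
        node = self
        for ch in normalize(name):
            node = node.children.setdefault(ch, Trie())
        node.names.append(name)

    def suggest(self, prefix, limit=5):
        """Return up to `limit` stored names matching the normalized prefix."""
        node = self
        for ch in normalize(prefix):
            node = node.children.get(ch)
            if node is None:
                return []
        results, stack = [], [node]
        while stack and len(results) < limit:
            current = stack.pop()
            results.extend(current.names)
            # Push children in reverse order so they are visited alphabetically.
            for ch in sorted(current.children, reverse=True):
                stack.append(current.children[ch])
        return results[:limit]

# Hypothetical user names.
index = Trie()
for name in ["Kevin Weil", "kevin smith", "Kévin Dupont", "Karen"]:
    index.insert(name)
print(index.suggest("KEV"))  # ['Kévin Dupont', 'kevin smith', 'Kevin Weil']
```

Normalizing both at insert time and at query time is what lets "KEV" match names stored with different case or accents.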
Order of Operations

          ‣      Counting



          ‣      Correlating



          ‣      Research/
                 Algorithmic
                 Learning

Counting
           ‣     How many requests per day?
           ‣     What’s the average latency? 95th-percentile latency?
           ‣     What’s the response code distribution?
           ‣     How many searches per day? Unique users?
           ‣     What’s the geographic breakdown of requests?
           ‣     How many tweets? From what clients?
           ‣     How many signups? Profile completeness?
           ‣     How many SMS notifications did we send?
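
As a small illustration of the latency questions above, the following Python sketch computes a mean and a 95th percentile over a list of request latencies. The sample values are made up, and the nearest-rank percentile definition is one common choice assumed here, not necessarily the one used at Twitter.

```python
import math

def mean(values):
    return sum(values) / len(values)

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 15, 11, 240, 14, 13, 18, 16, 17, 900,
                12, 14, 15, 13, 16, 19, 11, 12, 14, 15]

print(mean(latencies_ms))            # 69.85
print(percentile(latencies_ms, 95))  # 240
```

Two slow requests pull the mean far above what a typical request sees, which is why a percentile is worth tracking alongside the average.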


Correlating
           ‣     How does usage differ for mobile users?
           ‣     ... for desktop client users (Tweetdeck, etc)?
           ‣     Cohort analyses
           ‣     What services fail at the same time?
           ‣     What features get users hooked?
           ‣     What do successful users do often?
           ‣     How does tweet volume change over time?



Research
           ‣     What can we infer from a user’s tweets?
           ‣     ... from the tweets of their followers? followees?
           ‣     What features tend to get a tweet retweeted?
           ‣     ... and what influences the retweet tree depth?
           ‣     Duplicate detection, language detection
           ‣     What graph structures lead to increased usage?
           ‣     Sentiment analysis, entity extraction
           ‣     User reputation


If We Had More Time...
           ‣     HBase
           ‣     LZO compression and Hadoop
           ‣     Protocol buffers
           ‣     Our open source: hadoop-lzo, elephant-bird
           ‣     Analytics and Cassandra




Questions?
                   Follow me at
                   twitter.com/kevinweil








Hadoop and Pig at Twitter (OSCON 2010)
