Hadoop at Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter




The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage
‣   Data Analysis
‣   Data Products
The Twitter Data Lifecycle
‣   Data Input: Scribe, Crane
‣   Data Storage: Elephant Bird, HBase
‣   Data Analysis: Pig, Oink
‣   Data Products: Birdbrain



1 Community Open Source
2 Twitter Open Source (or soon)
My Background
‣   Studied Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh routing algorithms,
    GBs of data
‣   Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣   Twitter: Hadoop, Pig, machine learning, visualization, social
    graph analysis, (soon) PBs of data
The Twitter Data Lifecycle
‣   Data Input: Scribe, Crane
‣   Data Storage
‣   Data Analysis
‣   Data Products



1 Community Open Source
2 Twitter Open Source
What Data?
‣   Two main kinds of raw data
‣   	   Logs
‣   	   Tabular data
Logs
‣   Started with syslog-ng
‣   As our volume grew, it didn’t scale
Logs
‣   Started with syslog-ng
‣   As our volume grew, it didn’t scale
‣   Resources overwhelmed
‣   Lost data
Scribe
‣   Scribe daemon runs locally; reliable in network outage
‣   Nodes only know their downstream writer; hierarchical, scalable
‣   Pluggable outputs, per category

    [Diagram: FE, FE, FE → Agg, Agg → File / HDFS]

Scribe at Twitter
‣   Solved our problem, opened new vistas
‣   Currently 57 different categories logged from multiple sources
‣     FE: Javascript, Ruby on Rails
‣     Middle tier: Ruby on Rails, Scala
‣     Backend: Scala, Java, C++
‣   7 TB/day into HDFS


‣   Log first, ask questions later.
Scribe at Twitter
‣   We’ve contributed to it as we’ve used it1

‣       Improved logging, monitoring, writing to HDFS, compression
‣       Added ZooKeeper-based config
‣       Continuing to work with FB on patches


‣   Also: working with Cloudera to evaluate Flume

    1 http://github.com/traviscrawford/scribe
Tabular Data
‣   Most site data is in MySQL
‣     Tweets, users, devices, client applications, etc
‣   Need to move it between MySQL and HDFS
‣     Also between MySQL and HBase, or MySQL and MySQL


‣   Crane: configuration-driven ETL tool
Crane

    [Architecture diagram: a Driver handles configuration/batch management and
     ZooKeeper registration; data flows Source → Extract (Protobuf P1) →
     Transform → Load (Protobuf P2) → Sink]

Crane
‣   Extract
‣   	   MySQL, HDFS, HBase, Flock, GA, Facebook Insights
Crane
‣   Extract
‣   	   MySQL, HDFS, HBase, Flock, GA, Facebook Insights
‣   Transform
‣   	   IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
Crane
‣   Extract
‣   	   MySQL, HDFS, HBase, Flock, GA, Facebook Insights
‣   Transform
‣   	   IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
‣   Load
‣   	   MySQL, Local file, Stdout, HDFS, HBase
Crane
‣   Extract
‣   	   MySQL, HDFS, HBase, Flock, GA, Facebook Insights
‣   Transform
‣   	   IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
‣   Load
‣   	   MySQL, Local file, Stdout, HDFS, HBase
‣   ZooKeeper coordination, intelligent date management
‣   	   Run all the time from multiple servers, self healing
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage: Elephant Bird, HBase
‣   Data Analysis
‣   Data Products



1 Community Open Source
2 Twitter Open Source
Storage Basics
‣   Incoming data: 7 TB/day
‣   LZO encode everything
‣   	   Save 3-4x on storage, pay little CPU
‣   	   Splittable!1

‣   	   IO-bound jobs ==> 3-4x perf increase



1   http://www.github.com/kevinweil/hadoop-lzo
Image credit: http://www.flickr.com/photos/jagadish/3072134867/




Elephant Bird




1 http://github.com/kevinweil/elephant-bird
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
‣   Protobufs: codegen for efficient ser-de of data structures
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
‣   Protobufs: codegen for efficient ser-de of data structures
‣   Why shouldn’t we just continue, and codegen more glue?
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
‣   Protobufs: codegen for efficient ser-de of data structures
‣   Why shouldn’t we just continue, and codegen more glue?
‣     InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig
    HBaseLoaders
Elephant Bird
‣   We have data coming in as protocol buffers via Crane...
‣   Protobufs: codegen for efficient ser-de of data structures
‣   Why shouldn’t we just continue, and codegen more glue?
‣     InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig
    HBaseLoaders
‣     Also now does part of this with Thrift, soon Avro
‣     And JSON, W3C Logs
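‣   A rough sketch of how this looks from Pig, assuming the data was written by Crane as
    LZO-compressed, base64-encoded protobufs; the jar path and loader class name below are
    illustrative, not the exact Elephant Bird API:

      -- Sketch only: load one day of protobuf records with an Elephant Bird style loader.
      -- 'elephant-bird.jar' and the loader class name are assumptions for illustration.
      REGISTER 'elephant-bird.jar';
      statuses = LOAD '/tables/statuses/2010/06/29'
                 USING com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader('Status');
      DESCRIBE statuses;   -- the loader exposes fields matching the protobuf definition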
Challenge: Mutable Data
‣   HDFS is write-once: no seek on write, no append (yet)
‣     Logs are easy.
‣     But our tables change.
Challenge: Mutable Data
‣   HDFS is write-once: no seek on write, no append (yet)
‣       Logs are easy.
‣       But our tables change.
‣   Handling rapidly changing data in HDFS: not trivial.
‣     Don’t worry about updated data
‣     Refresh entire dataset
‣     Download updates, tombstone old versions of data, ensure jobs only run
      over current versions of data, occasionally rewrite full dataset
Challenge: Mutable Data
‣   HDFS is write-once: no seek on write, no append (yet)
‣     Logs are easy.
‣     But our tables change.
‣   Handling changing data in HDFS: not trivial.
HBase
‣   Has already solved the update problem
‣     Bonus: low-latency query API
‣     Bonus: rich, BigTable-style data model based on column families
HBase at Twitter
‣   Crane loads data directly into HBase
‣      One CF for protobuf bytes, one CF to denormalize columns for
    indexing or quicker batch access
‣     Update processing is transparent, so we always have accurate data in HBase
‣    Pig Loader for HBase in Elephant Bird makes integration with
    existing analyses easy
HBase at Twitter
‣   Crane loads data directly into HBase
‣      One CF for protobuf bytes, one CF to denormalize columns for
    indexing or quicker batch access
‣     Update processing is transparent, so we always have accurate data in HBase
‣    Pig Loader for HBase in Elephant Bird
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage
‣   Data Analysis: Pig, Oink
‣   Data Products



1 Community Open Source
2 Twitter Open Source
Enter Pig

‣   High level language
‣   Transformations on sets of records
‣   Process data one step at a time
‣   UDFs are first-class citizens
‣   Easier than SQL?
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script
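‣   The script itself appears as an image in the original deck; below is a minimal script in
    the same spirit, with hypothetical paths and field names, to illustrate the readability claim:

      -- Illustrative only: count tweets per user for one day, keep the ten most active users.
      tweets  = LOAD '/tables/statuses/2010/06/29' USING PigStorage('\t')
                AS (user_id:long, created_at:chararray, text:chararray);
      grouped = GROUP tweets BY user_id;
      counts  = FOREACH grouped GENERATE group AS user_id, COUNT(tweets) AS num_tweets;
      ordered = ORDER counts BY num_tweets DESC;
      top10   = LIMIT ordered 10;
      STORE top10 INTO '/tmp/top_tweeters_20100629';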




‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Democratizes Large-scale Data Analysis
‣   The Pig version is:
‣     5% of the code
‣     5% of the time
‣     Within 30% of the execution time.
‣   Innovation increasingly driven from large-scale data analysis
‣     Need fast iteration to understand the right questions
‣     More minds contributing = more value from your data
Pig Examples
‣   Using the HBase Loader




‣   Using the protobuf loaders
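‣   The examples are shown as screenshots; here is a small sketch of the HBase case, using
    Pig’s built-in HBaseStorage as a stand-in for the Elephant Bird HBase loader (table name,
    column family, and qualifiers are hypothetical):

      -- Sketch: read two denormalized columns from an HBase table into Pig.
      statuses = LOAD 'hbase://statuses'
                 USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('meta:user_id meta:text')
                 AS (user_id:long, text:chararray);
      sample   = LIMIT statuses 10;
      DUMP sample;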
Pig Workflow
‣   Oink: framework around Pig for loading, combining, running,
    post-processing
‣   	   Everyone I know has one of these
‣   	   Points to an opening for innovation; discussion beginning
‣   Something we’re looking at: Ruby DSL for Pig, Piglet1




1 http://github.com/ningliang/piglet
Counting Big Data
‣   Standard counts, min, max, std dev
‣   How many requests do we serve in a day?
‣   What is the average latency? 95% latency?
‣   Group by response code. What is the hourly distribution?
‣   How many searches happen each day on Twitter?
‣   How many unique queries, how many unique users?
‣   What is their geographic distribution?
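‣   For example, the request-count and latency questions above reduce to a few lines of Pig;
    the log location and field layout here are assumed for illustration:

      -- Sketch: requests and average latency per response code for one day.
      logs    = LOAD '/logs/web/2010/06/29' USING PigStorage('\t')
                AS (ts:chararray, response_code:int, latency_ms:double);
      by_code = GROUP logs BY response_code;
      stats   = FOREACH by_code GENERATE group AS response_code,
                                         COUNT(logs) AS requests,
                                         AVG(logs.latency_ms) AS avg_latency_ms;
      DUMP stats;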
Correlating Big Data
‣   Probabilities, covariance, influence
‣   How does usage differ for mobile users?
‣   How about for users with 3rd party desktop clients?
‣   Cohort analyses
‣   Site problems: what goes wrong at the same time?
‣   Which features get users hooked?
‣   Which features do successful users use often?
‣   Search corrections, search suggestions
‣   A/B testing
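‣   A sketch of the mobile-vs-desktop comparison above, assuming each tweet record carries a
    client/source field:

      -- Sketch: tweet volume and unique users per client application.
      tweets    = LOAD '/tables/statuses/2010/06/29' USING PigStorage('\t')
                  AS (user_id:long, client:chararray);
      by_client = GROUP tweets BY client;
      usage     = FOREACH by_client {
                    users = DISTINCT tweets.user_id;
                    GENERATE group AS client, COUNT(tweets) AS num_tweets,
                             COUNT(users) AS unique_users;
                  };
      DUMP usage;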
Research on Big Data
‣   Prediction, graph analysis, natural language
‣   What can we tell about a user from their tweets?
‣     From the tweets of those they follow?
‣     From the tweets of their followers?
‣     From the ratio of followers/following?
‣   What graph structures lead to successful networks?
‣   User reputation
Research on Big Data
‣   Prediction, graph analysis, natural language
‣   Sentiment analysis
‣   What features get a tweet retweeted?
‣     How deep is the corresponding retweet tree?
‣   Long-term duplicate detection
‣   Machine learning
‣   Language detection
‣   ... the list goes on.
The Twitter Data Lifecycle
‣   Data Input
‣   Data Storage
‣   Data Analysis
‣   Data Products: Birdbrain



1 Community Open Source
2 Twitter Open Source
Data Products
‣   Ad Hoc Analyses
‣     Answer questions to keep the business agile, do research
‣   Online Products
‣     Name search, other upcoming products
‣   Company Dashboard
‣     Birdbrain
Questions?
Follow me at twitter.com/kevinweil




‣   P.S. We’re hiring. Help us build the next step: realtime big data analytics.

