Building Scalable Big Data
Infrastructure Using Open Source
Software
Sam William
What is StumbleUpon?
Help users find content they did not expect to find
The best way to discover new and
interesting things from across the
Web.
How StumbleUpon works
1. Register 2. Tell us your interests 3. Start Stumbling and
rating web pages
We use your interests and behavior to
recommend new content for you!
StumbleUpon
The Data Challenge
1. Data collection
2. Real-time metrics
3. Batch processing / ETL
4. Data warehousing & ad-hoc analysis
5. Business intelligence & reporting
Challenges in data collection
• Different services deployed on different tech stacks
• Add minimal latency to the production services
• Application DBs for Analytics / batch processing
– From HBase & MySQL
Site (Apache / PHP) | Rec / Ad Server (Scala / Finagle) | Other internal services (Java / Scala / PHP)
Data Processing and Warehousing
Raw Data → ETL → Warehouse (HDFS) tables → massively denormalized tables
Challenges/Requirements:
• Scale to over 100 TB of data
• End product must work with easy querying tools/languages
• Reliable and scalable – powers analytics and internal reporting
Real-time analytics and metrics
• Atomic counters
• Tracking product launches
• Monitoring the health of the site
• Low latency – live metrics only make sense in real time
• A/B tests
Open Source at SU
Data Collection at SU
Activity Streams and Logs
All messages are Protocol Buffers
• Fast and efficient
• Multiple language bindings (Java / C++ / PHP)
• Compact
• Very well documented
• Extensible
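To make the message format concrete, here is a minimal sketch of producing and decoding one of these payloads. The ActivityEvent message and its fields are hypothetical stand-ins for whatever the real .proto files define; the builder/serialize/parse calls are the standard API that protoc generates for Java, used here from Scala.

```scala
import com.example.activity.ActivityProto.ActivityEvent // hypothetical protoc-generated class

object ProtobufExample {
  def main(args: Array[String]): Unit = {
    // Build a strongly typed message.
    val event = ActivityEvent.newBuilder()
      .setUserId(42L)
      .setUrl("http://example.com/some-page")
      .setAction("stumble")
      .setTimestampMs(System.currentTimeMillis())
      .build()

    // Compact binary form – this is what gets shipped over the wire.
    val bytes: Array[Byte] = event.toByteArray

    // Any consumer with the same .proto can decode it, in any supported language.
    val decoded = ActivityEvent.parseFrom(bytes)
    println(decoded.getUrl)
  }
}
```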
Apache Kafka
• Distributed pub-sub system
• Developed @ LinkedIn
• Offers message persistence
• Very high throughput
– ~300K messages/sec
• Horizontally scalable
• Multiple subscribers per topic
– Easy to rewind
Kafka
• Near-real-time processing can be taken offline and done at the consumer level
• Semantic partitioning through topics
• Partitions for parallel consumption
• High-level consumer API using ZK
• Simple to deploy: only requires ZooKeeper
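The deck is from the Kafka 0.7 era; the sketch below uses the modern kafka-clients producer API instead, so treat it as illustrative rather than what actually ran at SU. Broker addresses and the topic are made up. Keying each record (here by user id) is one way to get the semantic partitioning described above while partitions still allow parallel consumption.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.ByteArraySerializer

object ActivityProducer {
  private val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092,broker2:9092") // made-up hosts
  props.put("key.serializer", classOf[ByteArraySerializer].getName)
  props.put("value.serializer", classOf[ByteArraySerializer].getName)
  props.put("acks", "1")

  private val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)

  // Keying by user id keeps each user's events in a single partition.
  def publish(topic: String, userId: Long, payload: Array[Byte]): Unit =
    producer.send(new ProducerRecord(topic, userId.toString.getBytes("UTF-8"), payload))
}
```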
Kafka At SU
• 4 Broker nodes with RAID10 disks
• 25 topics
• Peak of 3500 msg/s
• 350 bytes avg. message size
• 30 days of data retention
Sutro
• Scala/Finagle
• Generic Kafka message producer
• Deployed on all prod servers
• Local HTTP daemon
• Publishes to Kafka asynchronously
• Snowflake to generate unique IDs
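A rough sketch of the Sutro idea: a Finagle HTTP service that co-located applications POST events to, which hands the publish off asynchronously so the caller never blocks on a Kafka broker. The port, query parameters, and the ActivityProducer object (from the producer sketch above) are assumptions, not SU's actual code.

```scala
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.util.{Await, Future, FuturePool}

object SutroDaemon {
  private val pool = FuturePool.unboundedPool

  val service: Service[Request, Response] = new Service[Request, Response] {
    def apply(req: Request): Future[Response] = {
      val topic   = req.getParam("topic", "activity")   // hypothetical query params
      val userId  = req.getLongParam("user_id", 0L)
      val payload = req.contentString.getBytes("UTF-8")

      // Fire-and-forget publish on a separate pool: the web request returns
      // immediately, so minimal latency is added to the production service.
      pool { ActivityProducer.publish(topic, userId, payload) }

      val rep = Response()
      rep.status = Status.Accepted
      Future.value(rep)
    }
  }

  def main(args: Array[String]): Unit =
    Await.ready(Http.server.serve(":8765", service))
}
```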
Sutro - Kafka
Site (Apache/PHP) → Sutro
Ad Server (Scala/Finagle) → Sutro
Rec Server (Scala/Finagle) → Sutro
Other services → Sutro
Sutro → Kafka broker cluster
Application Data for
Analytics & Batch Processing
HBase
• HBase inter-cluster replication (from production to batch cluster)
• Near real-time sync on batch cluster
• Readily available in Hive for analysis
MySQL
• MySQL replication to batch DB servers
• Sqoop incremental data transfer to HDFS
• HDFS flat files mapped to Hive tables & made available for analysis
Real-time metrics
1. HBase – atomic counters
2. AsyncHBase – coalesced counter increments
3. OpenTSDB (developed at SU)
– A distributed time-series DB on HBase
– Collects over 2 billion data points a day
– Plots time-series graphs
– Tagged data points
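A sketch of the coalesced-counter idea using the AsyncHBase client; the table, row key, and column names are invented. bufferAtomicIncrement() batches many increments to the same cell on the client and flushes them as a single RPC, which is what keeps per-event counters cheap at high request rates.

```scala
import org.hbase.async.{AtomicIncrementRequest, HBaseClient}

object RealtimeCounters {
  private val client = new HBaseClient("zk1,zk2,zk3") // ZooKeeper quorum spec

  def countStumble(urlId: Long): Unit = {
    val incr = new AtomicIncrementRequest(
      "metrics".getBytes("UTF-8"),      // table (made-up name)
      s"url:$urlId".getBytes("UTF-8"),  // row key
      "d".getBytes("UTF-8"),            // column family
      "stumbles".getBytes("UTF-8"),     // qualifier
      1L)                               // increment amount

    // Increments to the same cell are coalesced client-side and flushed as one
    // RPC; the returned Deferred can be ignored for fire-and-forget counting.
    client.bufferAtomicIncrement(incr)
  }
}
```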
Real-time counters
Real-time metrics from OpenTSDB
Kafka Consumer framework aka Postie
• Distributed system for consuming messages
• Scala/Akka – built on top of Kafka’s consumer API
• Generic consumer – understands protobuf
• Predefined sinks: HBase / HDFS (text/binary) / Redis
• Consumers are configured via configuration files
• Distributed – uses ZK to coordinate
• Extensible
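Not Postie itself, but a stripped-down sketch of its shape: a Kafka poll loop that routes each payload to a sink actor. It uses the modern kafka-clients consumer and classic Akka actors; the topic, group id, and the single logging sink are illustrative, whereas the real system dispatches to configurable HBase/HDFS/Redis sinks.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import akka.actor.{Actor, ActorSystem, Props}
import org.apache.kafka.clients.consumer.KafkaConsumer

class LoggingSink extends Actor {
  def receive: Receive = {
    case payload: Array[Byte] =>
      // A real sink would decode the protobuf and write to HBase/HDFS/Redis.
      println(s"got ${payload.length} bytes")
  }
}

object PostieLike {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("postie")
    val sink   = system.actorOf(Props(new LoggingSink), "sink")

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("group.id", "postie-activity")
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.ByteArrayDeserializer")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.subscribe(Collections.singletonList("activity"))

    while (true) {
      // Each record is fire-and-forget dispatched to the sink actor.
      consumer.poll(Duration.ofMillis(500)).asScala.foreach(r => sink ! r.value())
    }
  }
}
```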
Postie
Akka
• Makes building concurrent applications easy
• The distributed nodes sit behind remote actors
• Load balancing through custom routers
• Predefined sinks and services are accessed through local actors
• Fault-tolerance through actor monitoring
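The fault-tolerance bullet maps onto Akka supervision. Here is a sketch, reusing the LoggingSink from the previous example: a parent actor restarts a failed sink on transient I/O errors and escalates anything else. The retry limits and error mapping are illustrative, not Postie's actual configuration.

```scala
import scala.concurrent.duration._
import akka.actor.{Actor, ActorRef, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.{Escalate, Restart}

class SinkSupervisor extends Actor {
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: java.io.IOException => Restart  // transient HBase/HDFS hiccup
      case _: Exception           => Escalate // anything else goes up the tree
    }

  // Supervised child doing the actual writes.
  private val hbaseSink: ActorRef = context.actorOf(Props(new LoggingSink), "hbase-sink")

  def receive: Receive = {
    case msg => hbaseSink forward msg // route work to the supervised child
  }
}
```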
Postie
Batch processing / ETL
GOAL: Create simplified datasets from complex data
• Create highly denormalized datasets for faster querying
• Power the reporting DB with daily stats
• Output structured data for specific analysis, e.g. registration-flow analysis
Our favourite ETL tools:
• Pig
– Optional Schema
– Works directly on byte arrays
– Many simple operations can be done without UDFs
– Developing UDFs is simple (understand Bags/Tuples)
– Concise scripts compared to the M/R equivalents
• Scalding
– Functional programming in Scala over Hadoop
– Built on top of Cascading
– Operating over tuples is like operating over collections in Scala
– No UDFs: your entire program is written in a full-fledged general-purpose language
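As an example of the Scalding style, here is a tiny fields-API job that counts stumbles per URL from a TSV of events. The input schema, field names, and paths are made up; a job like this would be launched on Hadoop through com.twitter.scalding.Tool with --input and --output arguments.

```scala
import com.twitter.scalding._

// Counts how many times each URL was stumbled, from (user_id, url, action) rows.
class StumblesPerUrl(args: Args) extends Job(args) {
  Tsv(args("input"), ('userId, 'url, 'action))
    .filter('action) { a: String => a == "stumble" }   // keep only stumble events
    .groupBy('url) { _.size('stumbles) }               // count per URL
    .write(Tsv(args("output")))
}
```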
Warehouse - Hive
• Uses an SQL-like query language
• All analysts and data scientists are versed in SQL
• Supports Hadoop Streaming (Python/R)
• UDFs and SerDes make it highly extensible
• Supports partitioning, with many table properties configurable at the partition level
Hive at StumbleUpon
HBaseSerde
• Reads binary data from HBase
• Parses composite binary values into
multiple columns in Hive (mainly on key)
ProtobufSerde
• For creating Hive tables on top of binary
protobuf files stored in HDFS
• The SerDe uses Java reflection to parse and project columns
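This is not the actual ProtobufSerde, just a sketch of the reflection trick it relies on: any compiled protobuf message can enumerate its own fields through descriptors, so a generic SerDe can project them as Hive columns without compile-time knowledge of the schema.

```scala
import com.google.protobuf.Message
import scala.jdk.CollectionConverters._

object ProtoColumns {
  /** Returns (columnName -> value) pairs for every field set on the message. */
  def toColumns(msg: Message): Map[String, AnyRef] =
    msg.getAllFields.asScala.map { case (fd, value) => fd.getName -> value }.toMap
}
```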
Data Infrastructure at SU
Data Consumption
End Users of Data (data pipeline → warehouse)
Who uses this data?
• Data Scientists/Analysts
• Offline Rec pipeline
• Ads Team
…all this work allows them to focus on querying and analysis, which is critical to the business.
Business Analytics / Data
Scientists
• Feature-rich set of data to work on
• Enriched/denormalized tables reduce JOINs, simplify and speed up queries, shortening the path to analysis.
• R is our favorite tool for analysis downstream of Hadoop/Hive.
Recommendation platform
• URL score pipeline
– M/R and Hive on Oozie
– Filter / Classify into buckets
– Score / Loop
– Load ES/HBase index
• Keyword generation pipeline
– Parse URL data
– Generate Tag mappings
URL score pipeline
Advertisement Platform
• Billing Service
– Real-time Kafka consumer
– Calculates skips
– Bills customers
• Audience Estimation tool
– Pre-crunched data into multiple dimensions
– A UI tool for advertisers to estimate their target audience
• Sales team tools
– Built with PHP leveraging Hive or pre-crunched
ETL data in HBase
More stuff in the pipeline
• Storm from Twitter
– Scope for a lot more real-time analytics
– Very high throughput and extensible
– Applications in any JVM language
• BI tools
– Our current BI tools / dashboards are minimal
– Google Charts powered by our reporting DB (primarily HBase)
Open Source FTW!!
• Actively developed and maintained
• Community support
• Built with web-scale in mind
• Distributed systems are easy with Akka/ZK/Finagle
• Inexpensive
• Only one major catch: you have to hire and retain good engineers!
Thank You!
Questions?
Editor's Notes
• #2: Introduce myself
• #3: Best way to discover interesting stuff: all kinds of content and media (photos, videos, memes, news)
• #4: Pick interests like Art, Travel or Science, and we'll show you related sites, videos and photos. Every time you click the Stumble! button, we surprise you with new and interesting stuff. Rate the stuff we show you, so we get better at recommending what you'll enjoy.
• #7: Conventional log files
• #8: Add what DS and BA want