Building Web Analytics on Hadoop at CBS Interactive
Michael Sun [email_address]
Hadoop World, November 8, 2011
Deep Thoughts
$1 What is cloud computing? Convenient, on-demand, scalable, network-accessible service over a shared pool of computing resources.
$2 What is fog computing? Local cloud computing.
$5 What is vapor computing? Vaporware.
About Me
Lead Software Engineer, Manager of Data Warehouse Operations
Background in software systems architecture, design, and development
Expertise in data warehousing, data analytics, statistics, databases, and distributed systems
Ph.D. in Applied Physics from Harvard University
Brands and Websites of CBS Interactive, a Sample
GAMES & MOVIES | TECH, BIZ & NEWS | SPORTS | ENTERTAINMENT | MUSIC
CBSi Scale
Top 20 global web property; 235M worldwide monthly unique users (source: comScore, March 2011)
Hadoop cluster size:
Currently: 40 worker nodes (260 TB)
This month: add 24 nodes, 800 TB total
Next quarter: ~80 nodes, ~1 PB
DW peak processing: >500M events/day globally, doubling next quarter (ad logs)
Web Analytics Processing
Collect web logs for web metrics analysis; the logs track clicks, page views, downloads, streaming-video events, ad events, etc.
Provide internal metrics for web site monitoring and A/B testing
Biller apps, external reporting
Ad event tracking to support sales
Provide data services: support marketing by providing data for data mining; user-centric datastore (stay tuned); optimize user experience
Modernize the Platform
Web log processing on a proprietary platform had hit its limits:
Code base was 10 years old
The vendor no longer supported the version we used
Not fault-tolerant
Upgrading to the newer version was not cost-effective
Data volume keeps increasing: 300+ web sites, with video tracking growing fastest
Need to support new business initiatives
Use open-source systems as much as possible
Hadoop to the Rescue / Research
Open-source, scalable data-processing framework based on MapReduce; processes PBs of data
Hadoop Distributed File System (HDFS): high throughput, fault-tolerant
Distributed computing model: MapReduce (map | shuffle | reduce), rooted in functional programming, plus an execution engine
Used as a cluster for ETL:
Collect data (distributed harvester)
Analyze data (M/R; streaming + scripting + R; Pig/Hive)
Archive data (distributed archive)
The Plan
Build web log collection (codename Fido):
Apache web logs piped to cronolog
Hourly M/R collector job to gzip hourly log files and checksum them
Scp from web servers to Hadoop datanodes, then put on HDFS (see the sketch after this list)
Build a Python ETL framework (codename Lumberjack):
Based on stdin/stdout streaming, one process/one thread
Can run stand-alone or on Hadoop
Core concepts: Pipeline, Filter, Schema
Build web log processing with Lumberjack: parse, sessionize, look up, format data, load to DB
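As a rough illustration of the Fido collection flow, here is a minimal sketch, assuming local file paths and the stock hadoop CLI; the real collector pulls logs from web servers via scp first and runs as an hourly job:

    import gzip, hashlib, shutil, subprocess
    from pathlib import Path

    def collect_hourly_log(local_log: Path, hdfs_dir: str) -> None:
        # Gzip the hourly Apache log.
        gz_path = local_log.with_name(local_log.name + '.gz')
        with open(local_log, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)

        # Record a checksum so the transfer can be verified end to end.
        digest = hashlib.md5(gz_path.read_bytes()).hexdigest()
        gz_path.with_name(gz_path.name + '.md5').write_text(digest + '\n')

        # Load both files into HDFS (assumes the hadoop CLI is on PATH).
        subprocess.check_call(['hadoop', 'fs', '-put', str(gz_path), hdfs_dir])
        subprocess.check_call(['hadoop', 'fs', '-put',
                               str(gz_path) + '.md5', hdfs_dir])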
[Architecture diagram] Sites' Apache logs are distributed by Fido onto HDFS; Python-ETL, MapReduce, and Hive run on the Hadoop cluster, with external data sources and CMS systems feeding in; results load into the DW database, which serves web metrics, billers, and data mining.
The Python ETL Framework Lumberjack on Hadoop
Written in Python. Foundation classes:
Pipeline: stdin/stdout streaming, consisting of connected Filters
Schema: metadata describing the data sent between Filters (string encoding/decoding; datetime and timezone handling; numeric validation; null handling)
Filter: a stage in a Pipeline
Pipe: connects Filters
(A minimal sketch of the idea follows.)
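Lumberjack is under review for open-source release, so the following is only a minimal sketch of the Pipeline/Filter composition idea; the class and method names are assumptions inferred from the example pipeline below, not the framework's actual API:

    import sys

    class Filter:
        # One stage in a pipeline: consumes an iterator of records, yields records.
        def process(self, records):
            raise NotImplementedError

        def __or__(self, other):  # lets stages be chained with '|'
            return Pipeline(self, other)

    class Pipeline(Filter):
        def __init__(self, *stages):
            self.stages = list(stages)

        def __or__(self, other):
            self.stages.append(other)
            return self

        def run(self, source=sys.stdin, sink=sys.stdout):
            records = (line.rstrip('\n').split('\t') for line in source)
            for stage in self.stages:
                records = stage.process(records)
            for rec in records:  # drain the final stage
                sink.write('\t'.join(rec) + '\n')

    class StringCleaner(Filter):
        # Trivial example transform: strip stray whitespace from every field.
        def process(self, records):
            for rec in records:
                yield [f.strip() for f in rec]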
The Python ETL Framework Lumberjack on Hadoop: Filters
Extract: DelimitedFileInput, DbQuery, RegexpFileInput
Transform: Expression, Lookup, Regex, Range Lookup
Load: DelimitedFileOutput, DbLoader, PickledFileOutput
The Python ETL Framework Lumberjack on Hadoop: Example Python Schema

schema = Schema((
    SchemaField(u'anon_cookie', 'unicode', False, default=u'-', maxlen=19,
                io_encoding='utf-8', encoding_errors='replace', cache_size=0,
                on_null='strict', on_type_error='strict', on_range_error='strict'),
    SchemaField(u'client_ip_addr', 'int', False, default=0, signed=False,
                bits=32, precision=None, cache_size=1,
                on_null='strict', on_type_error='strict', on_range_error='strict'),
    SchemaField(u'session_id', 'int', False, default=-1, signed=True,
                bits=64, precision=None, cache_size=1,
                on_null='strict', on_type_error='strict', on_range_error='strict'),
    SchemaField(u'session_start_dt_ht', 'datetime', False,
                default='1970-01-01 08:00:00', timezone='PRC',
                io_format='%Y-%m-%d %H:%M:%S', cache_size=-1,
                on_null='strict', on_type_error='strict', on_range_error='strict'),
    SchemaField(u'session_start_dt_ut', 'datetime', False,
                default='1970-01-01 00:00:00', timezone='UTC',
                io_format='%Y-%m-%d %H:%M:%S', cache_size=-1,
                on_null='strict', on_type_error='strict', on_range_error='strict'),
))
The Python ETL Framework Lumberjack on Hadoop: Example Pipeline

pl = etl.Pipeline(
    etl.file.DelimitedInput(stdin, input_schema,
        drop=('empty1', 'empty2', 'bytes_sent'),
        errors_policy=etl.file.DelimitedInputSkipAndCountErrors(
            etl.counter.HadoopStreamingCounter('Dropped Records',
                                               group='DelimitedInput')))
    | etl.transform.StringCleaner()
    | dw_py_mod_utils.IPFilter('ip_address', 'CNET_Exclude_Addresses.txt', None, None)
    | cnclear_parser.CnclearParser()
    | etl.transform.StringCleaner(fields=('xref', 'src_url', 'title'))
    | parse_url.ParseURL('src_url')
    | event_url_title_cleaner.CNClearURLTitleCleaner()
    | dw_py_mod_utils.Sha1Hash(['user_agent', 'skc_url', 'title'])
    | parse_ua.ParseUA('user_agent')
    | etl.transform.ProjectToOutputSchema(output_schema)
    | etl.file.DelimitedOutput(stdout)
)
pl.setup()
pl.run()
The Python ETL Framework Lumberjack on Hadoop
Based on Hadoop streaming; written in Python
Mapper and Reducer each run as a pipeline (stdin/stdout streaming)
Mapper handles all transformations: Expression, Lookup (using Tokyo Cabinet), Range Lookup, Regex (a lookup sketch follows)
The shuffle phase (between M and R) does the sorting
Reducer handles aggregation, e.g., sessionize (detailed below)
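For the dimension-lookup step, a hedged sketch of the idea, reusing the Filter base class from the earlier sketch; a plain dict stands in here for the Tokyo Cabinet store named on the slide, and the filter name, file format, and record layout are assumptions:

    class DimensionLookup(Filter):
        # Map a natural key to a surrogate key from a pre-built lookup file.
        # Production used Tokyo Cabinet so the map can exceed memory;
        # a dict keeps this sketch self-contained.
        def __init__(self, key_index, lookup_path, default='-1'):
            self.key_index = key_index
            self.default = default
            self.table = {}
            with open(lookup_path) as f:
                for line in f:
                    natural, surrogate = line.rstrip('\n').split('\t')
                    self.table[natural] = surrogate

        def process(self, records):
            for rec in records:
                rec.append(self.table.get(rec[self.key_index], self.default))
                yield rec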
Web Log Processing by Hadoop Streaming and Python-ETL
Parsing web logs: IAB filtering and checking; parsing user agents by regex; IP range lookup; product-key lookup, etc.
Sessionization: prepare-sessionize, sessionize, filter-unpack
Process huge dimensions: URL / page title
Load facts: format data, load to DB
Sessionize on Hadoop in Detail
Group web events (page impressions, clicks, video tracking, etc.) into user sessions based on a set of business rules, e.g., a 30-minute timeout
Enables analysis of user behavior patterns
Gathers session-level facts
Input to Sessionize
Take the parsed output data of these event types: page impression, click-payable, click-nonpayable, video tracking, and optimization events
Prep-Sessionize (Mapper)
Pre-sessionize lookups, e.g., IP range lookup (see the sketch after this list)
Normalize event records of all event types to the same schema, ordering fields in the same sequence for every event type
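The IP range lookup can be implemented as a binary search over ranges sorted by start address; a minimal sketch, assuming IPv4 and non-overlapping ranges (the label values and example data are illustrative):

    import bisect, socket, struct

    def ip_to_int(ip):
        # Convert a dotted-quad IPv4 string to a 32-bit integer.
        return struct.unpack('!I', socket.inet_aton(ip))[0]

    class IPRangeLookup:
        def __init__(self, ranges):
            # ranges: iterable of (start_ip, end_ip, label) tuples
            rows = sorted((ip_to_int(s), ip_to_int(e), label)
                          for s, e, label in ranges)
            self.starts = [r[0] for r in rows]
            self.rows = rows

        def lookup(self, ip, default='-'):
            n = ip_to_int(ip)
            i = bisect.bisect_right(self.starts, n) - 1
            if i >= 0 and self.rows[i][1] >= n:
                return self.rows[i][2]
            return default

    # Example: flag internal traffic for exclusion.
    r = IPRangeLookup([('10.0.0.0', '10.255.255.255', 'internal')])
    assert r.lookup('10.1.2.3') == 'internal'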
Sorting before Sessionize
Sort events by anon_cookie (the same user) + event_dt_ht (that user's event stream in time order). Hadoop streaming does the sorting implicitly in the shuffle:

    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.partitioner.options=-k1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
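Put together, the job launch looks roughly like the following; the jar path, HDFS paths, and script names are hypothetical:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -D stream.num.map.output.key.fields=2 \
        -D mapred.text.key.partitioner.options=-k1 \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
        -input /dw/parsed/2011-11-08 \
        -output /dw/sessionized/2011-11-08 \
        -mapper prep_sessionize.py \
        -reducer sessionize.py \
        -file prep_sessionize.py -file sessionize.py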
Sessionize (Reducer)
Apply the sessionize business rules; assign session_id from a distributed sequence
Summarize to gather session-level facts; join session facts with web events
Assemble all output rows (web events, sessions, rejects) into one output stream by extending the fields: a reject_flag marks rejected rows, and an event_type value of 'session' marks session rows (a sketch follows)
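A minimal sketch of the reducer-side session assignment, assuming tab-delimited records with the cookie in field 0 and the event timestamp in field 1; the real reducer draws session_id from a distributed sequence rather than a local counter:

    import sys
    from datetime import datetime, timedelta

    SESSION_TIMEOUT = timedelta(minutes=30)

    def sessionize(lines=sys.stdin, out=sys.stdout):
        # Events arrive already sorted by (anon_cookie, event_dt_ht)
        # thanks to the shuffle configuration above.
        session_id = 0
        prev_cookie, prev_ts = None, None
        for line in lines:
            fields = line.rstrip('\n').split('\t')
            cookie = fields[0]
            ts = datetime.strptime(fields[1], '%Y-%m-%d %H:%M:%S')
            # New session on a new user, or a >30-minute gap for the same user.
            if cookie != prev_cookie or ts - prev_ts > SESSION_TIMEOUT:
                session_id += 1
            fields.append(str(session_id))
            out.write('\t'.join(fields) + '\n')
            prev_cookie, prev_ts = cookie, ts

    if __name__ == '__main__':
        sessionize()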
Output (Mapper) from Sessionize
Filter for events of a specific event type
Unpack the event-type and session facts
Provide data to prepare_load
[Architecture diagram, recap] Sites' Apache logs are distributed by Fido onto HDFS; Python-ETL, MapReduce, and Hive run on the Hadoop cluster, with external data sources and CMS systems feeding in; results load into the DW database, which serves web metrics, billers, and data mining.
Benefits to Ops
Processing time to reach SLA cut by 6 hours
Running 2 years in production without any big issues
Withstood a 50%/year data-volume increase
The architecture makes it easy to add new processing logic
Robust and fault-tolerant:
Five dead datanodes, jobs still ran OK
Upgraded the JVM on a few datanodes while jobs were running
Reprocessed old data while processing the current day's data
Conclusions I – Create a Tool Appropriate to the Job if Existing Ones Fall Short
The Python ETL framework and Hadoop streaming together can do complex, high-volume ETL work
Python ETL framework:
Home-grown, under review for open-source release
Rich functionality via Python; extensible; NLS support
Runs on top of another platform, e.g., Hadoop, for distributed/parallel sorting and aggregation
Conclusions II – Power and Flexibility for Processing Big Data
Hadoop provides scale and computing horsepower: robustness, fault tolerance, scalability
Significant reduction of processing time to reach SLA
Cost-effective: commodity hardware, free software
Currently: building multi-tenant Hadoop clusters using the Fair Scheduler
The Team (alphabetical order) Batu Ulug Dan Lescohier Jim Haas, presenting “Hadoop in Mission-critical Environment” Michael Sun Richard Zhang Ron Mahoney Slawomir Krysiak
Questions? [email_address]
Follow up on Lumberjack: [email_address]
Abstract
CBS Interactive successfully adopted Hadoop as its web analytics platform, processing one billion weblogs daily from the hundreds of web site properties that CBS Interactive oversees. After introducing Lumberjack, the extraction, transformation, and loading framework built on Python and streaming (under review for open-source release), Michael will discuss web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension lookup, sessionization, and loading into a database. Since migrating from a proprietary platform to Hadoop, CBS Interactive has achieved robustness, fault tolerance, scalability, and a significant reduction of processing time to reach SLA (over six hours so far).


Editor's Notes

  • #5: CBSi has a number of brands; this slide shows the biggest ones.
  • #6: We have a lot of traffic and data. We've been using Hadoop quite extensively for a few years now.