Building Web Analytics on Hadoop at CBS Interactive
Michael Sun [email_address]
Hadoop World, November 8, 2011
Deep Thoughts
$1 What is cloud computing? Convenient, on-demand, scalable, network-accessible service over a shared pool of computing resources.
$2 What is fog computing? Local cloud computing.
$5 What is vapor computing? Vaporware.
About Me
Lead Software Engineer, Manager of Data Warehouse Operations
Background in software systems architecture, design, and development
Expertise in data warehousing, data analytics, statistics, databases, and distributed systems
Ph.D. in Applied Physics from Harvard University
Brands and Websites of CBS Interactive, a Sample
GAMES & MOVIES | TECH, BIZ & NEWS | SPORTS | ENTERTAINMENT | MUSIC
CBSi Scale
Top 20 global web property; 235M worldwide monthly unique users (source: comScore, March 2011)
Hadoop cluster size:
Currently: 40 worker nodes (260 TB)
This month: add 24 nodes, 800 TB total
Next quarter: ~80 nodes, ~1 PB
DW peak processing: >500M events/day globally, doubling next quarter (ad logs)
Web Analytics Processing
Collect web logs for web metrics analysis; the logs track clicks, page views, downloads, streaming-video events, ad events, etc.
Provide internal metrics for web site monitoring and A/B testing
Biller apps, external reporting
Ad event tracking to support sales
Provide data services: support marketing by providing data for data mining; user-centric datastore (stay tuned); optimize user experience
Modernize the Platform
Web log processing on a proprietary platform had hit its limits:
Code base was 10 years old
The vendor no longer supported the version we used
Not fault-tolerant
Upgrading to the newer version was not cost-effective
Data volume keeps increasing: 300+ web sites, with video tracking growing fastest
Need to support new business initiatives
Use open-source systems as much as possible
Hadoop to the Rescue / Research
Open-source, scalable data-processing framework based on MapReduce; processes PBs of data
Hadoop Distributed File System (HDFS): high throughput, fault-tolerant
Distributed computing model: MapReduce (map | shuffle | reduce), rooted in functional programming, plus an execution engine
Used as a cluster for ETL:
Collect data (distributed harvester)
Analyze data (M/R; streaming + scripting + R; Pig/Hive)
Archive data (distributed archive)
The Plan
Build web log collection (codename Fido):
Apache web logs piped to cronolog
Hourly M/R collector job to gzip hourly log files and checksum them
Scp from web servers to Hadoop datanodes, then put on HDFS (see the sketch after this list)
Build a Python ETL framework (codename Lumberjack):
Based on stdin/stdout streaming, one process/one thread
Can run stand-alone or on Hadoop
Core concepts: Pipeline, Filter, Schema
Build web log processing with Lumberjack: parse, sessionize, look up, format data, load to DB
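As a rough illustration of the Fido collection flow, here is a minimal sketch, assuming local file paths and the stock hadoop CLI; the real collector pulls logs from web servers via scp first and runs as an hourly job:

    import gzip, hashlib, shutil, subprocess
    from pathlib import Path

    def collect_hourly_log(local_log: Path, hdfs_dir: str) -> None:
        # Gzip the hourly Apache log.
        gz_path = local_log.with_name(local_log.name + '.gz')
        with open(local_log, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)

        # Record a checksum so the transfer can be verified end to end.
        digest = hashlib.md5(gz_path.read_bytes()).hexdigest()
        gz_path.with_name(gz_path.name + '.md5').write_text(digest + '\n')

        # Load both files into HDFS (assumes the hadoop CLI is on PATH).
        subprocess.check_call(['hadoop', 'fs', '-put', str(gz_path), hdfs_dir])
        subprocess.check_call(['hadoop', 'fs', '-put',
                               str(gz_path) + '.md5', hdfs_dir])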
[Architecture diagram] Sites' Apache logs are distributed by Fido onto HDFS; Python-ETL, MapReduce, and Hive run on the Hadoop cluster, with external data sources and CMS systems feeding in; results load into the DW database, which serves web metrics, billers, and data mining.
The Python ETL Framework Lumberjack on Hadoop
Written in Python. Foundation classes:
Pipeline: stdin/stdout streaming, consisting of connected Filters
Schema: metadata describing the data sent between Filters (string encoding/decoding; datetime and timezone handling; numeric validation; null handling)
Filter: a stage in a Pipeline
Pipe: connects Filters
(A minimal sketch of the idea follows.)
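Lumberjack is under review for open-source release, so the following is only a minimal sketch of the Pipeline/Filter composition idea; the class and method names are assumptions inferred from the example pipeline below, not the framework's actual API:

    import sys

    class Filter:
        # One stage in a pipeline: consumes an iterator of records, yields records.
        def process(self, records):
            raise NotImplementedError

        def __or__(self, other):  # lets stages be chained with '|'
            return Pipeline(self, other)

    class Pipeline(Filter):
        def __init__(self, *stages):
            self.stages = list(stages)

        def __or__(self, other):
            self.stages.append(other)
            return self

        def run(self, source=sys.stdin, sink=sys.stdout):
            records = (line.rstrip('\n').split('\t') for line in source)
            for stage in self.stages:
                records = stage.process(records)
            for rec in records:  # drain the final stage
                sink.write('\t'.join(rec) + '\n')

    class StringCleaner(Filter):
        # Trivial example transform: strip stray whitespace from every field.
        def process(self, records):
            for rec in records:
                yield [f.strip() for f in rec]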
The Python ETL Framework Lumberjack on Hadoop: Filters
Extract: DelimitedFileInput, DbQuery, RegexpFileInput
Transform: Expression, Lookup, Regex, Range Lookup
Load: DelimitedFileOutput, DbLoader, PickledFileOutput
The Python ETL Framework Lumberjack on Hadoop: Example Python Schema

schema = Schema((
    SchemaField(u'anon_cookie', 'unicode', False, default=u'-', maxlen=19,
                io_encoding='utf-8', encoding_errors='replace', cache_size=0,
                on_null='strict', on_type_error='strict', on_range_error='strict'),
    SchemaField(u'client_ip_addr', 'int', False, default=0, signed=False,
                bits=32, precision=None, cache_size=1,
                on_null='strict', on_type_error='strict', on_range_error='strict'),
    SchemaField(u'session_id', 'int', False, default=-1, signed=True,
                bits=64, precision=None, cache_size=1,
                on_null='strict', on_type_error='strict', on_range_error='strict'),
    SchemaField(u'session_start_dt_ht', 'datetime', False,
                default='1970-01-01 08:00:00', timezone='PRC',
                io_format='%Y-%m-%d %H:%M:%S', cache_size=-1,
                on_null='strict', on_type_error='strict', on_range_error='strict'),
    SchemaField(u'session_start_dt_ut', 'datetime', False,
                default='1970-01-01 00:00:00', timezone='UTC',
                io_format='%Y-%m-%d %H:%M:%S', cache_size=-1,
                on_null='strict', on_type_error='strict', on_range_error='strict'),
))
The Python ETL Framework Lumberjack on Hadoop: Example Pipeline

pl = etl.Pipeline(
    etl.file.DelimitedInput(stdin, input_schema,
        drop=('empty1', 'empty2', 'bytes_sent'),
        errors_policy=etl.file.DelimitedInputSkipAndCountErrors(
            etl.counter.HadoopStreamingCounter('Dropped Records',
                                               group='DelimitedInput')))
    | etl.transform.StringCleaner()
    | dw_py_mod_utils.IPFilter('ip_address', 'CNET_Exclude_Addresses.txt', None, None)
    | cnclear_parser.CnclearParser()
    | etl.transform.StringCleaner(fields=('xref', 'src_url', 'title'))
    | parse_url.ParseURL('src_url')
    | event_url_title_cleaner.CNClearURLTitleCleaner()
    | dw_py_mod_utils.Sha1Hash(['user_agent', 'skc_url', 'title'])
    | parse_ua.ParseUA('user_agent')
    | etl.transform.ProjectToOutputSchema(output_schema)
    | etl.file.DelimitedOutput(stdout)
)
pl.setup()
pl.run()
The Python ETL Framework Lumberjack on Hadoop
Based on Hadoop streaming; written in Python
Mapper and Reducer each run as a pipeline (stdin/stdout streaming)
Mapper handles all transformations: Expression, Lookup (using Tokyo Cabinet), Range Lookup, Regex (a lookup sketch follows)
The shuffle phase (between M and R) does the sorting
Reducer handles aggregation, e.g., sessionize (detailed below)
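For the dimension-lookup step, a hedged sketch of the idea, reusing the Filter base class from the earlier sketch; a plain dict stands in here for the Tokyo Cabinet store named on the slide, and the filter name, file format, and record layout are assumptions:

    class DimensionLookup(Filter):
        # Map a natural key to a surrogate key from a pre-built lookup file.
        # Production used Tokyo Cabinet so the map can exceed memory;
        # a dict keeps this sketch self-contained.
        def __init__(self, key_index, lookup_path, default='-1'):
            self.key_index = key_index
            self.default = default
            self.table = {}
            with open(lookup_path) as f:
                for line in f:
                    natural, surrogate = line.rstrip('\n').split('\t')
                    self.table[natural] = surrogate

        def process(self, records):
            for rec in records:
                rec.append(self.table.get(rec[self.key_index], self.default))
                yield rec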
Web Log Processing by Hadoop Streaming and Python-ETL
Parsing web logs: IAB filtering and checking; parsing user agents by regex; IP range lookup; product-key lookup, etc.
Sessionization: prepare-sessionize, sessionize, filter-unpack
Process huge dimensions: URL / page title
Load facts: format data, load to DB
Sessionize on Hadoop in Detail
Group web events (page impressions, clicks, video tracking, etc.) into user sessions based on a set of business rules, e.g., a 30-minute timeout
Enables analysis of user behavior patterns
Gathers session-level facts
Input to Sessionize
Take the parsed output data of these event types: page impression, click-payable, click-nonpayable, video tracking, and optimization events
Prep-Sessionize (Mapper)
Pre-sessionize lookups, e.g., IP range lookup (see the sketch after this list)
Normalize event records of all event types to the same schema, ordering fields in the same sequence for every event type
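The IP range lookup can be implemented as a binary search over ranges sorted by start address; a minimal sketch, assuming IPv4 and non-overlapping ranges (the label values and example data are illustrative):

    import bisect, socket, struct

    def ip_to_int(ip):
        # Convert a dotted-quad IPv4 string to a 32-bit integer.
        return struct.unpack('!I', socket.inet_aton(ip))[0]

    class IPRangeLookup:
        def __init__(self, ranges):
            # ranges: iterable of (start_ip, end_ip, label) tuples
            rows = sorted((ip_to_int(s), ip_to_int(e), label)
                          for s, e, label in ranges)
            self.starts = [r[0] for r in rows]
            self.rows = rows

        def lookup(self, ip, default='-'):
            n = ip_to_int(ip)
            i = bisect.bisect_right(self.starts, n) - 1
            if i >= 0 and self.rows[i][1] >= n:
                return self.rows[i][2]
            return default

    # Example: flag internal traffic for exclusion.
    r = IPRangeLookup([('10.0.0.0', '10.255.255.255', 'internal')])
    assert r.lookup('10.1.2.3') == 'internal'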
Sorting before Sessionize
Sort events by anon_cookie (the same user) + event_dt_ht (that user's event stream in time order). Hadoop streaming does the sorting implicitly in the shuffle:

    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.partitioner.options=-k1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
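Put together, the job launch looks roughly like the following; the jar path, HDFS paths, and script names are hypothetical:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -D stream.num.map.output.key.fields=2 \
        -D mapred.text.key.partitioner.options=-k1 \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
        -input /dw/parsed/2011-11-08 \
        -output /dw/sessionized/2011-11-08 \
        -mapper prep_sessionize.py \
        -reducer sessionize.py \
        -file prep_sessionize.py -file sessionize.py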
Sessionize (Reducer)
Apply the sessionize business rules; assign session_id from a distributed sequence
Summarize to gather session-level facts; join session facts with web events
Assemble all output rows (web events, sessions, rejects) into one output stream by extending the fields: a reject_flag marks rejected rows, and an event_type value of 'session' marks session rows (a sketch follows)
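A minimal sketch of the reducer-side session assignment, assuming tab-delimited records with the cookie in field 0 and the event timestamp in field 1; the real reducer draws session_id from a distributed sequence rather than a local counter:

    import sys
    from datetime import datetime, timedelta

    SESSION_TIMEOUT = timedelta(minutes=30)

    def sessionize(lines=sys.stdin, out=sys.stdout):
        # Events arrive already sorted by (anon_cookie, event_dt_ht)
        # thanks to the shuffle configuration above.
        session_id = 0
        prev_cookie, prev_ts = None, None
        for line in lines:
            fields = line.rstrip('\n').split('\t')
            cookie = fields[0]
            ts = datetime.strptime(fields[1], '%Y-%m-%d %H:%M:%S')
            # New session on a new user, or a >30-minute gap for the same user.
            if cookie != prev_cookie or ts - prev_ts > SESSION_TIMEOUT:
                session_id += 1
            fields.append(str(session_id))
            out.write('\t'.join(fields) + '\n')
            prev_cookie, prev_ts = cookie, ts

    if __name__ == '__main__':
        sessionize()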
Output (Mapper) from Sessionize
Filter for events of a specific event type
Unpack the event-type and session facts
Provide data to prepare_load
[Architecture diagram, recap] Sites' Apache logs are distributed by Fido onto HDFS; Python-ETL, MapReduce, and Hive run on the Hadoop cluster, with external data sources and CMS systems feeding in; results load into the DW database, which serves web metrics, billers, and data mining.
Benefits to Ops
Processing time to reach SLA cut by 6 hours
Running 2 years in production without any big issues
Withstood a 50%/year data-volume increase
The architecture makes it easy to add new processing logic
Robust and fault-tolerant:
Five dead datanodes, jobs still ran OK
Upgraded the JVM on a few datanodes while jobs were running
Reprocessed old data while processing the current day's data
Conclusions I – Create a Tool Appropriate to the Job if Existing Ones Fall Short
The Python ETL framework and Hadoop streaming together can do complex, high-volume ETL work
Python ETL framework:
Home-grown, under review for open-source release
Rich functionality via Python; extensible; NLS support
Runs on top of another platform, e.g., Hadoop, for distributed/parallel sorting and aggregation
Conclusions II – Power and Flexibility for Processing Big Data
Hadoop provides scale and computing horsepower: robustness, fault tolerance, scalability
Significant reduction of processing time to reach SLA
Cost-effective: commodity hardware, free software
Currently: building multi-tenant Hadoop clusters using the Fair Scheduler
The Team (alphabetical order) Batu Ulug Dan Lescohier Jim Haas, presenting “Hadoop in Mission-critical Environment” Michael Sun Richard Zhang Ron Mahoney Slawomir Krysiak
Questions? [email_address]
Follow up on Lumberjack: [email_address]
Abstract
CBS Interactive successfully adopted Hadoop as its web analytics platform, processing one billion weblogs daily from the hundreds of web site properties that CBS Interactive oversees. After introducing Lumberjack, the extraction, transformation, and loading framework built on Python and streaming (under review for open-source release), Michael will discuss web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension lookup, sessionization, and loading into a database. Since migrating from a proprietary platform to Hadoop, CBS Interactive has achieved robustness, fault tolerance, scalability, and a significant reduction of processing time to reach SLA (over six hours so far).


Editor's Notes

  • #5: CBSi has a number of brands; this slide shows the biggest ones.
  • #6: We have a lot of traffic and data. We've been using Hadoop quite extensively for a few years now.