Michael Sun presented on CBS Interactive's use of Hadoop for web analytics processing. Some key points:
- CBS Interactive processes over 1 billion web logs daily from hundreds of websites on a Hadoop cluster with over 1PB of storage.
- They developed Lumberjack, an ETL framework written in Python, to extract, transform, and load web log data into Hadoop and databases.
- Lumberjack uses streaming, filters, and schemas to parse and clean web logs, look up dimensions, and sessionize the records before loading them into a data warehouse for reporting and analytics.
- Migrating to Hadoop provided significant benefits including reduced processing time, fault tolerance, scalability, and cost effectiveness compared to their
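
The summary does not show Lumberjack's code, so the following is only a minimal sketch of what a streaming filter chain of this kind might look like (parse against a schema, clean, look up a dimension, sessionize). Every name here is hypothetical: the tab-separated layout in `SCHEMA`, the `SITE_DIM` table, the 30-minute `SESSION_GAP`, and the function names are assumptions for illustration, not Lumberjack's API.

```python
from datetime import datetime, timedelta

# Hypothetical tab-separated log layout: timestamp, visitor id, site, URL.
SCHEMA = ("ts", "visitor", "site", "url")
# Made-up dimension table mapping site names to surrogate keys.
SITE_DIM = {"news.example": 1, "games.example": 2}
# Assumed inactivity timeout that separates two sessions.
SESSION_GAP = timedelta(minutes=30)


def parse(lines):
    """Split each raw line into a dict keyed by the schema's field names."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == len(SCHEMA):  # skip lines that do not fit the schema
            yield dict(zip(SCHEMA, fields))


def clean(records):
    """Drop records with an unparseable timestamp or an empty visitor id."""
    for rec in records:
        try:
            rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
        if rec["visitor"]:
            yield rec


def lookup_site(records):
    """Attach the site's surrogate key; unknown sites map to 0."""
    for rec in records:
        rec["site_id"] = SITE_DIM.get(rec["site"], 0)
        yield rec


def sessionize(records):
    """Number each visitor's sessions; a gap longer than SESSION_GAP starts a new one.

    Assumes each visitor's records arrive in timestamp order.
    """
    last_seen = {}  # visitor -> (last timestamp, current session number)
    for rec in records:
        prev = last_seen.get(rec["visitor"])
        if prev is None:
            session = 1
        elif rec["ts"] - prev[0] > SESSION_GAP:
            session = prev[1] + 1
        else:
            session = prev[1]
        last_seen[rec["visitor"]] = (rec["ts"], session)
        rec["session"] = session
        yield rec


def run_pipeline(lines):
    """Chain the filters; each stage streams records to the next."""
    return list(sessionize(lookup_site(clean(parse(lines)))))
```

Because each stage is a generator, records flow through one at a time, which is the property that lets a chain like this run as a streaming job over standard input. A real deployment would read from `sys.stdin` and write to a loader rather than collect a list.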