Learn from HomeAway Hadoop Development and Operations Best Practices

HomeAway
The world leader for vacation rentals
Over a million listings
worldwide and growing!

Hadoop is changing
You …
● Need faster ROI
● Need compelling use cases
● Need more with less
● Need to leverage existing talent

Harnessing the power of Hadoop
● MapReduce
○ Divides a problem into smaller problems, then assembles the smaller answers into the answers to the bigger problems.
● MapReduce
○ Can be hard to learn
○ Verbose; tedious
○ Historically slow
● New Engine Options
○ Apache Tez
○ Apache Spark
○ Apache Flink

Problem at HomeAway
Cascading

Speaker Panel
• Austin Tobin - Software Engineer
File Storage Quotas :: Introduction to Cascading
• Michael McAllister - Staff Data Warehouse Engineer
Supplier Analytics :: Phoenix, HBase and Driven
• Francois Forster - Architect
User Analytics :: A/B Test Readouts

File Storage Quotas :: Introduction to Cascading
© Copyright 2015 HomeAway, Inc.

Introduction
1. What are we trying to solve?
2. What is Cascading?
3. How we applied Cascading to solve this problem

What is Mesa? What is the problem with Mesa?
▪ Mesa is an internal file system
▪ Divided up into buckets, each bucket has a quota
▪ Each bucket maintains a statistics file, locked on write and delete
▪ As usage increases, this locking creates performance bottlenecks

Key Technologies
• Kafka
– High-performance messaging technology
– Used to insert a high volume of consistent log messages very quickly
• Avro
– Compact binary file format; compressible and highly portable
• Hadoop
– Distributed file store and processing framework
– Enables near-infinite horizontal scalability for storage and processing
• Cascading...

Cascading
• Taps
– Taps can be either sources or sinks
– Sources are data inputs, and sinks are data outputs
– Each tap requires a scheme: a set of column names (fields) and, for delimited text, a delimiter
– The sink of one flow can be the source of another flow
• Pipes
– Abstractions that perform functions or transformations
– Functions include split, merge, expression, and filter
– The output of one pipe can feed another pipe, so pipes chain together to perform sequences of transformations
• Flows
– Connect sources to sinks via pipes into a flow
– Multiple flows can be connected together into a Cascade

Cascading
Cascading Archetype
The Cascading Archetype is a project that makes it very easy to get started with Cascading applications. It is currently an internal project, and it uses Spring to make defining taps and flows very easy.
1. Define your taps.
2. Build your flows.
3. Cascade!

Mesa Stats - The Big Picture
[Diagram labels: Mesa; Log Events; Mesa Metadata; Hadoop; Mesa Stats Job (Old Catalog + Log Events in, New Catalog + Statistics out)]

Flow Def - Create the New Catalog
[Diagram labels: OLD CATALOG TAP; EVENT TAP; Clean Events Pipe; Build New Catalog Pipe; NEW CATALOG SINK]

Pipe - Clean the Events
1. Filter non-Mesa events
2. Split the message field into multiple fields
3. Remove extraneous fields
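
The three cleaning steps on this slide (filter non-Mesa events, split the message field, remove extraneous fields) can be sketched in plain Java. This is not the Cascading pipe from the talk, whose code did not survive in this transcript. The `RawEvent` class, the `"mesa"` source tag, and the `bucket|path|operation|sizeBytes|timestampMillis` message layout are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the cleaning steps; not Cascading code.
// The event layout is an assumption, the slides do not show the real one.
class CleanEventsSketch {

    // Hypothetical raw log event: a source tag, a host, and a delimited message.
    static final class RawEvent {
        final String source;
        final String host;
        final String message; // assumed: bucket|path|operation|sizeBytes|timestampMillis

        RawEvent(String source, String host, String message) {
            this.source = source;
            this.host = host;
            this.message = message;
        }
    }

    static List<Map<String, String>> clean(List<RawEvent> events) {
        List<Map<String, String>> cleaned = new ArrayList<>();
        for (RawEvent event : events) {
            // Step 1: drop events that did not come from Mesa.
            if (!"mesa".equals(event.source)) {
                continue;
            }
            // Step 2: split the single message field into multiple fields.
            String[] parts = event.message.split("\\|", -1);
            if (parts.length != 5) {
                continue; // skip malformed messages rather than guess at their meaning
            }
            // Step 3: keep only the needed fields; source, host and the raw message are dropped.
            Map<String, String> row = new LinkedHashMap<>();
            row.put("bucket", parts[0]);
            row.put("path", parts[1]);
            row.put("operation", parts[2]);
            row.put("sizeBytes", parts[3]);
            row.put("timestampMillis", parts[4]);
            cleaned.add(row);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        List<RawEvent> events = Arrays.asList(
            new RawEvent("mesa", "host-1", "photos|a.jpg|WRITE|2048|1000"),
            new RawEvent("web", "host-2", "GET /index.html"),
            new RawEvent("mesa", "host-1", "photos|a.jpg|DELETE|0|2000"));
        System.out.println(clean(events));
    }
}
```

In a Cascading pipe each step would typically be its own operation applied to the tuple stream; the loop here only shows what each step keeps and discards.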

Cascading
Pipe - Clean the Events

Pipe - Build the New Catalog
[Diagram labels: Cleaned Event Pipe; Catalog Pipe; Sort Events by Latest Desc; Take Top 1 Event; Remove Deleted Events; Merge Events With Catalog Pipe]
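
One way to read the catalog-building steps on this slide is sketched below in plain Java, not Cascading. The order used here (merge, sort latest first, take the top record per file, drop deletes) is an assumption, since the transcript does not preserve the arrows of the diagram. The `Entry` class and its fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of building the new catalog; not Cascading code.
class BuildCatalogSketch {

    // Hypothetical record shared by catalog rows and cleaned events.
    static final class Entry {
        final String bucket;
        final String path;
        final long sizeBytes;
        final long timestampMillis;
        final boolean deleted;

        Entry(String bucket, String path, long sizeBytes, long timestampMillis, boolean deleted) {
            this.bucket = bucket;
            this.path = path;
            this.sizeBytes = sizeBytes;
            this.timestampMillis = timestampMillis;
            this.deleted = deleted;
        }

        // A file is identified by its bucket and path.
        List<String> key() {
            return Arrays.asList(bucket, path);
        }
    }

    static List<Entry> buildNewCatalog(List<Entry> oldCatalog, List<Entry> cleanedEvents) {
        // Merge the cleaned events with the old catalog.
        List<Entry> merged = new ArrayList<>(oldCatalog);
        merged.addAll(cleanedEvents);

        // Sort by timestamp, latest first.
        merged.sort(Comparator.comparingLong((Entry e) -> e.timestampMillis).reversed());

        // Take the top 1 record per file: the first one seen is the latest.
        Map<List<String>, Entry> latestPerFile = new LinkedHashMap<>();
        for (Entry entry : merged) {
            latestPerFile.putIfAbsent(entry.key(), entry);
        }

        // Remove files whose latest record is a delete.
        List<Entry> newCatalog = new ArrayList<>();
        for (Entry entry : latestPerFile.values()) {
            if (!entry.deleted) {
                newCatalog.add(entry);
            }
        }
        return newCatalog;
    }

    public static void main(String[] args) {
        List<Entry> oldCatalog = Arrays.asList(new Entry("photos", "a.jpg", 100, 1, false));
        List<Entry> events = Arrays.asList(new Entry("photos", "a.jpg", 150, 5, false));
        for (Entry entry : buildNewCatalog(oldCatalog, events)) {
            System.out.println(entry.bucket + "/" + entry.path + " " + entry.sizeBytes);
        }
    }
}
```

Merging before taking the latest record means a delete event also removes a file that only existed in the old catalog, which is why this order was chosen for the sketch.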

Cascading
Pipe - Build the Catalog

Cascading
Update Catalog Flow Def - Revisited

Flow Def - Calculate the Statistics
[Diagram labels: NEW CATALOG TAP; MESA QUOTA TAP; Sum File Sizes Per Bucket; Merge on Bucket Names; Divide Bucket File Sizes By Quota; STATISTICS SINK]
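
The arithmetic behind this flow (sum file sizes per bucket, merge with the quotas on bucket name, divide usage by quota) is sketched below in plain Java rather than Cascading. The input shapes, the method name `usageRatios`, and the choice to skip buckets with no positive quota are assumptions made for this sketch only.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the statistics calculation; not Cascading code.
class BucketStatsSketch {

    // fileSizes: one (bucket name, file size in bytes) pair per catalog file.
    // quotaBytes: bucket name -> quota in bytes.
    static Map<String, Double> usageRatios(List<Map.Entry<String, Long>> fileSizes,
                                           Map<String, Long> quotaBytes) {
        // Sum file sizes per bucket.
        Map<String, Long> usedBytes = new TreeMap<>();
        for (Map.Entry<String, Long> file : fileSizes) {
            usedBytes.merge(file.getKey(), file.getValue(), Long::sum);
        }

        // Merge on bucket name, then divide the bucket's total size by its quota.
        Map<String, Double> ratios = new TreeMap<>();
        for (Map.Entry<String, Long> used : usedBytes.entrySet()) {
            Long quota = quotaBytes.get(used.getKey());
            if (quota == null || quota <= 0) {
                continue; // assumption: buckets without a positive quota are left out
            }
            ratios.put(used.getKey(), used.getValue() / (double) quota);
        }
        return ratios;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> files = Arrays.asList(
            new AbstractMap.SimpleEntry<>("photos", 100L),
            new AbstractMap.SimpleEntry<>("photos", 300L),
            new AbstractMap.SimpleEntry<>("docs", 50L));
        Map<String, Long> quotas = new HashMap<>();
        quotas.put("photos", 1000L);
        quotas.put("docs", 200L);
        System.out.println(usageRatios(files, quotas));
    }
}
```

Because the ratios are computed in a batch job from the catalog, no per-bucket statistics file has to be locked on every write and delete.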

Cascading
Pipe - Sum and Merge

Cascading
Flow Def - Calculate the Statistics

Cascading
Flow Def - Statistics Revisited

Thank you all!
• Cascading For the Impatient

Supplier Analytics :: Phoenix, HBase and Driven

The goal
● The goal: Expose our EDW analytics to suppliers.
● But ...
○ More users of analytics = requirement to horizontally scale
○ SQL Server EDW + Managed Storage = Expensive to horizontally scale

The solution
● Use Cascading with HBase / Phoenix
○ Cascading for ETL
○ Apache Phoenix as an abstraction layer over HBase
○ HomeAway created a Cascading Phoenix Tap to simplify use of Phoenix.

What does our Cascading ETL look like?
● Daily jobs scheduled in Oozie
● Run Cascading ETL developed as Java programs
● Examples:
○ ETL listings that have changed since yesterday from EDW to HBase
○ ETL listing metrics from the current periodic snapshot fact partition over to HBase
○ ETL market group metrics from the current periodic snapshot fact partition over to HBase

What does our Cascading ETL look like?
● Extract - SQL statement issued against a SQL Server JDBC tap
● Transform
○ Simple - do it in your SQL statement
○ Complex - do it in your pipes - filters, cogroups, user-defined functions, etc.
● Load - sink tap bound to the Apache Phoenix Cascading tap
○ This tap is in essence an HBase table
How Driven simplifies using Cascading

A real simple Cascading flow definition

User Analytics :: A/B Test Readouts

A/B Test Readouts
• We’re always running many A/B tests concurrently on our sites
• Daily Cascading Job performs A/B test readout
– Readout for all running A/B tests at once
– Rolling 3-week window
• Sliced and diced by site, by day, by test as well as various roll ups
• Multiple conversion metrics
• Millions of daily test exposures and conversions

A/B Test Readout Flow
Not The Full Cascade!

A/B Test Readout Cascade
• Includes Daily Intermediate Files
–cascade.setFlowSkipStrategy(new FlowSkipIfSinkExists());
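
The idea behind skipping a flow whose sink already exists can be illustrated without Cascading: check for the day's output before doing the work, and only compute it when it is missing. The sketch below uses local files and hypothetical names (`runUnlessOutputExists`, the `readout-sketch` directory); it is not how Cascading implements `FlowSkipIfSinkExists`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Supplier;

// Plain-Java sketch of "skip the step if its output already exists".
class SkipIfOutputExistsSketch {

    // Returns true if the step ran, false if it was skipped because the output was present.
    static boolean runUnlessOutputExists(Path output, Supplier<String> step) {
        if (Files.exists(output)) {
            return false; // intermediate file from an earlier run, reuse it
        }
        try {
            Path parent = output.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.write(output, step.get().getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical daily intermediate file under the system temp directory.
        Path output = Paths.get(System.getProperty("java.io.tmpdir"), "readout-sketch", "2015-06-01.txt");
        boolean ran = runUnlessOutputExists(output, () -> "daily intermediate data\n");
        System.out.println(ran ? "computed " + output : "skipped, output already exists");
    }
}
```

With a rolling window, only the newest day is missing on each run, so the older daily intermediate files are reused instead of being recomputed.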

Using Driven For Performance Tuning
• Driven makes it easy to look at the time it takes to execute
– Including the number of mappers or reducers
– Increase if needed:
pipe.getStepConfigDef().setProperty("mapreduce.job.reduces","20");

Cascading Tips
• Store intermediate files to avoid re-processing the same data over and over again
– When running frequent jobs on a rolling window
• Break up your complex flows
• Use Driven to tweak the number of reducers at various points
Deployment / Operational Issues

HomeAway CI/CD Pipeline
cascading-archetype
job-A
job-B
oozie-job-deployer

HomeAway
#wholevacation
Thank you!
