Cascading Webinar
HomeAway
The world leader for vacation rentals
Over a million listings worldwide and growing!
Hadoop is changing
You …
● Need faster ROI
● Need compelling use cases
● Need more with less
● Need to leverage existing talent
Harnessing the power of Hadoop
● MapReduce
○ Divides a big problem into smaller problems, then assembles the smaller answers into the answer to the bigger problem.
● MapReduce
○ Can be hard to learn
○ Verbose; tedious
○ Historically slow
● New Engine Options
○ Apache Tez
○ Apache Spark
○ Apache Flink
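As a rough illustration of the divide-and-assemble idea behind MapReduce, the plain-Java sketch below (no Hadoop involved) splits a list of log lines into chunks, counts words in each chunk, and then merges the partial counts. The class name, method names, and sample input are hypothetical and are not taken from the webinar.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapReduceSketch {

    // "Map" step: solve one small problem by counting words in a single chunk.
    static Map<String, Integer> countChunk(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // "Reduce" step: assemble the partial answers into the overall answer.
    static Map<String, Integer> mergeCounts(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    // Splits the input into chunks of chunkSize lines (chunkSize must be > 0),
    // maps each chunk independently, then reduces the partial results.
    static Map<String, Integer> wordCount(List<String> lines, int chunkSize) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, lines.size());
            partials.add(countChunk(lines.subList(i, end)));
        }
        return mergeCounts(partials);
    }
}
```

On a real cluster the chunks would be processed on different machines; here they are processed one after another, which is enough to show the shape of the computation.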
Problem at HomeAway
Cascading
Speaker Panel
• Austin Tobin - Software Engineer
File Storage Quotas :: Introduction to Cascading
• Michael McAllister - Staff Data Warehouse Engineer
Supplier Analytics :: Phoenix, HBase and Driven
• Francois Forster - Architect
User Analytics :: A/B Test Readouts
File Storage Quotas :: Introduction to Cascading
© Copyright 2015 HomeAway, Inc.
Introduction
1. What is it we are trying to solve
2. What is Cascading
3. How we applied Cascading to solve this problem
What is Mesa? What is the problem with Mesa?
▪ Mesa is an internal file system
▪ Divided up into buckets, each bucket has a quota
▪ Each bucket maintains a statistics file, locked on write and delete
▪ As usage increases, this locking creates performance bottlenecks
Key Technologies
• Kafka
– High-performance messaging technology
– Used to insert a high volume of consistent log messages very quickly
• Avro
– Compressible binary file format; highly portable
• Hadoop
– Distributed file store and processing framework
– Enables near-infinite horizontal scalability for storage and processing
• Cascading...
Cascading
• Taps
– Taps can be either sources or sinks
– Sources are data inputs, and sinks are data outputs
– They require a scheme, which defines the field (column) names and, for delimited text, the delimiter
– The sink of one flow can be the source of another flow
• Pipes
– Abstractions to perform functions or transformations
– Functions include split, merge, expression, and filter
– The output of one pipe may feed another pipe, so pipes chain together to perform sequences of transformations
• Flows
– Connect sources to sinks via pipes into a flow
– Multiple flows can be connected together into a CASCADE
Cascading Archetype
The Cascading Archetype is a project that makes it very easy to get started with Cascading applications. It is currently an internal project, and it uses Spring to make defining taps and flows very easy.
1. Define your Taps.
2. Build your Flows.
3. Cascade!
Mesa Stats - The Big Picture
[Diagram labels: Mesa, Log Events, Mesa Metadata, Hadoop, Mesa Stats Job. The job takes the Old Catalog + Log Events as input and produces the New Catalog + Statistics.]
Flow Def - Create the New Catalog
[Diagram: OLD CATALOG TAP and EVENT TAP (sources) -> Clean Events Pipe -> Build New Catalog Pipe -> NEW CATALOG SINK]
Cascading
Old Catalog Tap
Pipe - Clean the Events
[Diagram: Filter Non-Mesa Events -> Split the Message Field into Multiple Fields -> Remove Extraneous Fields]
Cascading
Pipe - Clean the Events
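The three cleaning steps named on the slide (filter, split, remove) can be sketched in plain Java as follows. This is not the Cascading pipe shown in the webinar, whose code did not survive in this transcript. The raw line layout (`source<TAB>host<TAB>message`), the `|`-separated message format, and all field names are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CleanEventsSketch {

    // Hypothetical raw log line: "<source>\t<host>\t<message>", where the
    // message is "bucket|path|action|sizeBytes|timestampMillis".
    static List<Map<String, String>> clean(List<String> rawLines) {
        List<Map<String, String>> cleaned = new ArrayList<>();
        for (String line : rawLines) {
            String[] cols = line.split("\t", -1);
            // Step 1: filter out malformed lines and non-Mesa events.
            if (cols.length != 3 || !"mesa".equals(cols[0])) {
                continue;
            }
            // Step 2: split the message field into multiple fields.
            String[] parts = cols[2].split("\\|", -1);
            if (parts.length != 5) {
                continue;
            }
            // Step 3: keep only the fields needed downstream; the source and
            // host columns are treated as extraneous here and dropped.
            Map<String, String> event = new LinkedHashMap<>();
            event.put("bucket", parts[0]);
            event.put("path", parts[1]);
            event.put("action", parts[2]);
            event.put("sizeBytes", parts[3]);
            event.put("timestamp", parts[4]);
            cleaned.add(event);
        }
        return cleaned;
    }
}
```

In Cascading terms, each step would typically be its own operation chained on a pipe, which keeps the individual transformations small and reusable.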
Pipe - Build the New Catalog
[Diagram: Cleaned Event Pipe and Catalog Pipe (inputs); steps shown: Sort Events by Latest Desc, Take Top 1 Event, Remove Deleted Events, Merge Events With Catalog Pipe]
Cascading
Pipe - Build the Catalog
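A minimal plain-Java sketch of the catalog-building logic named on the slide follows: merge the old catalog with the cleaned events, sort newest first, keep the top entry per file, and drop files whose latest event is a delete. The exact order of these steps in the original pipe is not recoverable from the transcript, so the ordering here is an assumption, as are the `FileEvent` type, its fields, and the `"DELETE"` action value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BuildCatalogSketch {

    // Hypothetical record for both catalog rows and cleaned events.
    static final class FileEvent {
        final String bucket;
        final String path;
        final String action;
        final long sizeBytes;
        final long timestamp;

        FileEvent(String bucket, String path, String action, long sizeBytes, long timestamp) {
            this.bucket = bucket;
            this.path = path;
            this.action = action;
            this.sizeBytes = sizeBytes;
            this.timestamp = timestamp;
        }

        // A file is identified by its bucket and path.
        String key() {
            return bucket + ":" + path;
        }
    }

    static Map<String, FileEvent> buildNewCatalog(List<FileEvent> oldCatalog, List<FileEvent> events) {
        // Merge: treat old catalog rows and new events as one stream.
        List<FileEvent> all = new ArrayList<>(oldCatalog);
        all.addAll(events);
        // Sort by timestamp, latest first.
        all.sort(Comparator.comparingLong((FileEvent e) -> e.timestamp).reversed());
        // Take the top 1 entry per file: the first one seen is the latest.
        Map<String, FileEvent> latest = new LinkedHashMap<>();
        for (FileEvent e : all) {
            latest.putIfAbsent(e.key(), e);
        }
        // Remove files whose latest event is a delete.
        latest.values().removeIf(e -> "DELETE".equals(e.action));
        return latest;
    }
}
```

Because only the newest entry per file survives, replaying the same events twice yields the same catalog, which is a convenient property for a daily batch job.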
Cascading
Update Catalog Flow Def - Revisited
Flow Def - Calculate the Statistics
[Diagram: NEW CATALOG TAP and MESA QUOTA TAP (sources) -> Sum File Sizes Per Bucket -> Merge on Bucket Names -> Divide Bucket File Sizes By Quota -> STATISTICS SINK]
Cascading
Pipe - Sum and Merge
Cascading
Flow Def - Calculate the Statistics
Cascading
Flow Def - Statistics Revisited
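The statistics calculation (sum file sizes per bucket, merge with quotas on bucket name, divide by quota) can be sketched in plain Java as below. This stands in for the Cascading code on the original slides, which is not present in this transcript. The `CatalogEntry` type, the method names, and the choice to skip buckets without a positive quota are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BucketStatsSketch {

    // Hypothetical catalog row: one file and the bucket it lives in.
    static final class CatalogEntry {
        final String bucket;
        final long sizeBytes;

        CatalogEntry(String bucket, long sizeBytes) {
            this.bucket = bucket;
            this.sizeBytes = sizeBytes;
        }
    }

    // Step 1: sum file sizes per bucket.
    static Map<String, Long> sumSizesPerBucket(List<CatalogEntry> catalog) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (CatalogEntry entry : catalog) {
            totals.merge(entry.bucket, entry.sizeBytes, Long::sum);
        }
        return totals;
    }

    // Steps 2 and 3: join the totals with the quotas on bucket name, then
    // divide each total by its quota to get the fraction of quota used.
    // Buckets with no quota, or a non-positive one, are skipped (an assumption).
    static Map<String, Double> quotaUsage(Map<String, Long> totals, Map<String, Long> quotas) {
        Map<String, Double> usage = new LinkedHashMap<>();
        for (Map.Entry<String, Long> total : totals.entrySet()) {
            Long quota = quotas.get(total.getKey());
            if (quota == null || quota <= 0) {
                continue;
            }
            usage.put(total.getKey(), total.getValue() / quota.doubleValue());
        }
        return usage;
    }
}
```

Computing these totals in a batch job instead of updating a per-bucket statistics file on every write is what removes the locking described earlier in the talk.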
Thank you all!
• Cascading For the Impatient
Supplier Analytics :: Phoenix, HBase and Driven
The goal
● The goal: Expose our EDW analytics to suppliers.
● But ...
○ More users of analytics = requirement to horizontally scale
○ SQL Server EDW + Managed Storage = Expensive to horizontally scale
The solution
● Use Cascading with HBase / Phoenix
○ Cascading for ETL
○ Apache Phoenix as an abstraction layer over HBase
○ HomeAway created a Cascading Phoenix Tap to simplify use of Phoenix.
What does our Cascading ETL look like?
● Daily jobs scheduled in Oozie
● Runs Cascading ETL developed as Java programs
● Examples:
○ ETL listings that have changed since yesterday from EDW to HBase
○ ETL listing metrics from the current periodic snapshot fact partition over to HBase
○ ETL market group metrics from the current periodic snapshot fact partition over to HBase
What does our Cascading ETL look like?
● Extract - SQL statement issued against SQL Server JDBC tap
● Transform
○ Simple - do it in your SQL statement
○ Complex - do it in your pipes - filters, cogroups, user-defined functions, etc.
● Load - sink tap bound to Apache Phoenix Cascading tap
○ This tap is in essence an HBase table
How Driven simplifies using Cascading
A real simple Cascading flow definition
User Analytics :: A/B Test Readouts
A/B Test Readouts
• We’re always running many A/B tests concurrently on our sites
• Daily Cascading Job performs A/B test readout
– Readout for all running A/B tests at once
– Rolling 3-week window
• Sliced and diced by site, by day, by test as well as various roll ups
• Multiple conversion metrics
• Millions of daily test exposures and conversions
A/B Test Readout Flow
Not The Full Cascade!
A/B Test Readout Cascade
• Includes Daily Intermediate Files
– cascade.setFlowSkipStrategy(new FlowSkipIfSinkExists());
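The idea behind skipping a flow when its sink already exists can be modelled in plain Java, without Cascading, as a loop over the days in a rolling window that only does work for days with no intermediate file yet. This is a sketch of the concept, not of Cascading's implementation. The file naming scheme, the placeholder file content, and the `runWindow` method are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

class SkipIfSinkExistsSketch {

    // Runs a (hypothetical) daily step for every day in the rolling window
    // ending at lastDay, skipping days whose intermediate file already exists.
    // Returns the days that actually had to be processed.
    static List<LocalDate> runWindow(Path dir, LocalDate lastDay, int windowDays) {
        List<LocalDate> processed = new ArrayList<>();
        try {
            Files.createDirectories(dir);
            for (int i = windowDays - 1; i >= 0; i--) {
                LocalDate day = lastDay.minusDays(i);
                Path sink = dir.resolve("readout-" + day + ".txt");
                if (Files.exists(sink)) {
                    continue; // sink exists: reuse the earlier result
                }
                // Placeholder for the real per-day computation.
                String summary = "placeholder summary for " + day;
                Files.write(sink, summary.getBytes(StandardCharsets.UTF_8));
                processed.add(day);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return processed;
    }
}
```

With a 21-day window, the first run processes all 21 days, and a run on the following day processes only the one new day, which is the saving that daily intermediate files are meant to provide.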
Using Driven For Performance Tuning
• Driven makes it easy to look at the time it takes to execute
– Including the number of mappers or reducers
– Increase if needed:
pipe.getStepConfigDef().setProperty("mapreduce.job.reduces", "20");
Cascading Tips
• Store intermediate files to avoid re-processing the same data over and over again
– When running frequent jobs on a rolling window
• Break up your complex flows
• Use Driven to tweak # of reducers at various points
Deployment / Operational Issues
HomeAway CI/CD Pipeline
[Diagram labels: cascading-archetype, job-A, job-B, oozie-job-deployer]
HomeAway
#wholevacation
Thank you!

Learn from HomeAway Hadoop Development and Operations Best Practices
