Hadoop and Hive at Orbitz
Jonathan Seidman and Ramesh Venkataramaiah
Hadoop World 2010
Agenda
•  Orbitz Worldwide
•  The challenge of big data at Orbitz
•  Hadoop as a solution to the data challenge
•  Applications of Hadoop and Hive at Orbitz – improving hotel
sort
•  Sample analysis and data trends
•  Other uses of Hadoop and Hive at Orbitz
•  Lessons learned and conclusion
Launched: 2001, Chicago, IL
Orbitz…
…poster children for Hadoop
Data Challenges at Orbitz
On Orbitz alone we do millions of searches and transactions daily, which leads to hundreds of gigabytes of log data every day.
So how do we store and process all of this data?
Utterly redonkulous amounts of money ($ per managed TB)
Utterly redonkulous amounts of money vs. more reasonable amounts of money ($ per managed TB)
•  Adding data to our data warehouse also requires a lengthy
plan/implement/deploy cycle.
•  Because of the expense and time, our data teams need to be
very judicious about which data gets added. This means that
potentially valuable data may not be saved.
•  We needed a solution that would allow us to economically store
and process the growing volumes of data we collect.
Hadoop brings our cost per TB down to $1500 (or even less)
•  It’s important to note that Hadoop is not a replacement for a
data warehouse, but rather a complement to it.
•  At the same time, Hadoop offers benefits beyond cost alone.
How can we improve hotel ranking?
Hey! Let’s use machine learning!
All the cool kids are doing it!
Requires data – lots of data
•  Web analytics software provides session data about user
behavior.
•  Unfortunately, the specific data fields we needed weren’t loaded
into our data warehouse, and to make things worse, the only
archive of raw logs went back just a few days.
•  We decided to turn to Hadoop to provide a long-term archive
for these logs.
•  Storing raw data in HDFS provides access to data not available
elsewhere, for example “hotel impression” data:
–  115004,1,70.00;35217,2,129.00;239756,3,99.00;83389,4,99.00!
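The impression string packs (hotel id, display position, price) triples separated by semicolons. A minimal Python sketch of how such a record might be parsed (the field meanings follow the slide's example; the function name and trailing-"!" handling are assumptions about the rendering, not Orbitz's actual code):

```python
def parse_impressions(record):
    """Parse a hotel-impression record of the form
    'hotelId,position,price;hotelId,position,price;...'
    into a list of (hotel_id, position, price) tuples."""
    impressions = []
    # The slide's sample ends with a stray '!' from the slide rendering;
    # strip it before splitting.
    for triple in record.rstrip("!").split(";"):
        hotel_id, position, price = triple.split(",")
        impressions.append((int(hotel_id), int(position), float(price)))
    return impressions

# The example from the slide:
record = "115004,1,70.00;35217,2,129.00;239756,3,99.00;83389,4,99.00"
print(parse_impressions(record))
# [(115004, 1, 70.0), (35217, 2, 129.0), (239756, 3, 99.0), (83389, 4, 99.0)]
```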
Now we need to process the data…
•  Extract data from raw Webtrends logs for input to a trained
classification process.
•  Logs provide input to MapReduce processing which extracts
required fields.
•  Previous process used a series of Perl and Bash scripts to
extract data serially.
•  Comparison of performance
–  A month’s worth of data
–  Manual process took 109m14s
–  MapReduce process took 25m58s
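The field-extraction step above can be sketched as a Hadoop Streaming-style mapper: read log lines, emit only the columns needed downstream, and skip malformed lines rather than failing. The column positions and tab delimiter here are purely illustrative, not the actual Webtrends layout:

```python
# Illustrative: which tab-separated columns to keep
# (NOT the real Webtrends field layout).
WANTED_COLUMNS = [0, 3, 7]

def map_line(line, wanted=WANTED_COLUMNS):
    """Extract the wanted columns from one log line, or return None
    if the line has too few fields (skip, don't crash)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) <= max(wanted):
        return None
    return "\t".join(fields[i] for i in wanted)

# In a real Hadoop Streaming job the driver would be:
#     for line in sys.stdin:
#         out = map_line(line)
#         if out is not None:
#             print(out)
sample = "ts\tip\turl\tsession\tbrowser\tos\tref\thotel_id"
assert map_line(sample) == "ts\tsession\thotel_id"
```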
Processing Flow – Steps 1–6 (diagrams)
Once data is in Hive…
•  Provides input data to machine learning processes.
•  Used to create data exports for further analysis with R scripts,
allowing us to derive more complex statistics and visualizations
of our data.
•  Provides useful metrics, many of which were unavailable with
our existing data stores.
•  Used for aggregating data for import into our data warehouse
for creation of new data cubes, providing analysts access to
data unavailable in existing data cubes.
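As a sketch of this export-then-analyze pattern, a Hive query result written as tab-separated text can be summarized in a few lines before deeper work in R. All file layouts and column names here are hypothetical, not Orbitz's actual schema:

```python
import csv
import io
from collections import defaultdict

def bookings_by_position(tsv_text):
    """Aggregate a hypothetical Hive export with columns
    (search_id, position, booked) into bookings per display position."""
    totals = defaultdict(int)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for search_id, position, booked in reader:
        totals[int(position)] += int(booked)
    return dict(totals)

# Four hypothetical searches: two showed position 1, one each for 2 and 3.
export = "s1\t1\t1\ns2\t1\t0\ns3\t2\t1\ns4\t3\t0\n"
print(bookings_by_position(export))  # {1: 1, 2: 1, 3: 0}
```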
Statistical Analysis: Infrastructure and Dataset
•  Hive + R platform for query processing and statistical analysis.
•  R: open-source statistics package with visualization.
•  Hive dataset:
–  Customer hotel bookings on our sites and user ratings of hotels.
•  Investigation:
–  Are there built-in data biases? Any lurking variables?
–  What approximations and biases exist?
–  Are variables pair-wise correlated?
–  Are there macro patterns?
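The pair-wise correlation question comes down to a Pearson coefficient per pair of variables. The slides' analysis used R; this dependency-free Python sketch is only illustrative of the computation:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related samples correlate at 1.0:
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
```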
Statistical Analysis - Positional Bias
•  The lurking variable is positional bias.
•  Top positions are invariably picked the most.
•  Aim to position best-ranked hotels at the top based on customer search criteria and user ratings.
Statistical Analysis - Kernel Density
•  User ratings of hotels.
•  Histograms are strongly affected by the number of bins used.
•  Kernel density plots are usually a much more effective way to overcome the limitations of histograms.
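A kernel density estimate avoids the binning problem by replacing each observation with a small Gaussian bump and averaging them, so the curve has no bin-edge artifacts. A minimal sketch (the ratings sample is made up, and the bandwidth is chosen by hand; R's density() picks it automatically):

```python
import math

def gaussian_kde(sample, x, bandwidth):
    """Kernel density estimate at point x: average of Gaussian
    kernels centered on each observation in the sample."""
    n = len(sample)
    total = sum(
        math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
        for xi in sample
    )
    return total / (n * bandwidth * math.sqrt(2 * math.pi))

# Hypothetical hotel ratings; the density peaks near the data,
# with no dependence on any bin edges.
ratings = [3.0, 3.5, 4.0, 4.0, 4.5, 5.0]
density_at_mode = gaussian_kde(ratings, x=4.0, bandwidth=0.5)
```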
Statistical Analysis - Exploratory correlation
Statistical Analysis - More seasonal variations
•  Customer hotel stays get longer during summer months.
•  This could help in designing season-aware search.
•  Outliers removed.
Analysis: takeaways…
•  The cost of cleaning and processing data is significant.
•  There is a tendency to create stories out of noise.
•  “The median is not the message”; find macro patterns first.
•  For website-originated data, watch for hidden bias in data collection.
Lessons Learned
•  Make sure you’re using the appropriate tool – avoid the temptation to
start throwing all of your data in Hadoop when a relational store may be
a better choice.
•  Expect the unexpected in your data. When processing billions of records
of data it’s inevitable that you’ll encounter at least one bad record which
will blow up your processing.
•  To get buy-in from upper management,
present a long-term, unstructured
data growth story and explain how this
will help harness long-tail opportunities.
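The "expect the unexpected" lesson usually translates into skip-and-count error handling: over billions of rows, one malformed record should be counted and skipped, not allowed to abort the job. A sketch of the pattern (the record format here is hypothetical):

```python
def process_records(lines):
    """Parse 'user_id,value' records defensively: bad records are
    counted and skipped instead of blowing up the whole job."""
    good, bad = [], 0
    for line in lines:
        try:
            user_id, value = line.split(",")
            good.append((user_id, float(value)))
        except ValueError:
            bad += 1  # wrong field count or non-numeric value
    return good, bad

records = ["u1,10.5", "GARBAGE", "u2,3.25", "u3,not-a-number"]
good, bad = process_records(records)
print(len(good), bad)  # 2 2
```

In a real MapReduce job the `bad` counter would typically be reported through Hadoop's job counters so operators can see how dirty the input was.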
Lessons Learned (continued)
•  Hadoop’s limited security model creates challenges when
trying to deploy Hadoop in the enterprise.
•  Configuration currently seems to be a black art. It can be
difficult to understand which parameters to set and how to
determine an optimal configuration.
•  Watch your memory use. Sloppy programming practices will
bite you when your code needs to process large volumes of
data.
Hadoop is a virus…
Just a few more examples of how Hadoop is being used at Orbitz…
•  Measuring page download performance: using web analytics logs as
input, a set of MapReduce scripts are used to derive detailed client
side performance metrics which allow us to track trends in page
download times.
•  Searching production logs: an effort is underway to utilize Hadoop to
store and process our large volume of production logs, allowing
developers and analysts to perform tasks such as troubleshooting
production issues.
•  Cache analysis: extraction and aggregation of data to provide input to
analyses intended to improve the performance of data caches utilized
by our web sites.
Applications of Hadoop at Orbitz are just beginning…
•  We’re in the process of quadrupling the capacity of our
production cluster.
•  Multiple teams are working on new applications of Hadoop.
•  We continue to explore the use of associated tools – HBase,
Pig, Flume, etc.
References
•  Hadoop project: http://hadoop.apache.org/
•  Hive project: http://hadoop.apache.org/hive/
•  Hive – A Petabyte Scale Data Warehouse Using Hadoop:
http://i.stanford.edu/~ragho/hive-icde2010.pdf
•  Hadoop: The Definitive Guide, Tom White, O’Reilly, 2009
•  Why Model?, J. Epstein, 2008
•  Beautiful Data, T. Segaran & J. Hammerbacher, 2009
•  Karmasphere Developer Study: http://www.karmasphere.com/
images/documents/Karmasphere-HadoopDeveloperResearch.pdf
Contact
•  Jonathan Seidman:
–  jseidman@orbitz.com
–  @jseidman
–  Chicago area Hadoop User Group: http://www.meetup.com/
Chicago-area-Hadoop-User-Group-CHUG/
•  Ramesh Venkataramaiah:
–  rvenkataramaiah@orbitz.com
