The Evolution of Data Analysis with Hadoop
Tom Wheeler | StampedeCon 2014
About the Presentation…
•  What’s ahead
•  Defining Hadoop
•  Data Processing with MapReduce
•  Simplifying Development with Apache Crunch
•  Bringing MapReduce to Analysts with Apache Hive
•  Getting Results Faster with Cloudera Impala
•  Finding Data Made Easy with Apache Solr / Cloudera Search
•  Conclusion + Q&A
Important Trends
•  Ubiquitous connectivity
•  We produce more data than ever
•  User-generated content
•  Lacks rigid structure
•  Inexpensive storage
•  Permanent retention
Big Data Can Mean Big Opportunity
•  One tweet is an anecdote
•  But a million tweets can signal important trends
•  One person’s product review is an opinion
•  But a million reviews might reveal a design flaw
•  One person’s diagnosis is an isolated case
•  But a million medical records could lead to a cure
What is Apache Hadoop?
•  Distributed data storage and processing
•  Scalable, flexible, and economical
•  Open source
•  Inspired by Google
•  Two main components
•  Hadoop Distributed File System (HDFS)
•  MapReduce
Getting Data into HDFS
•  HDFS is distinct from your local filesystem
[Diagram: Local Filesystem and Hadoop Distributed File System (HDFS)]
What is MapReduce?
•  MapReduce is a programming model
•  You supply two processing functions: Map and Reduce
•  Map: typically used to transform, parse, or filter data
•  Reduce: typically used to summarize results (optional)
•  MapReduce in Hadoop is batch-oriented
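To make the Map and Reduce roles concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: `run_job` is a hypothetical toy driver that stands in for the framework, and the input lines are made up for illustration. It only shows the shape of the model, where Map parses each record into key/value pairs and Reduce summarizes the values collected for each key.

```python
from collections import defaultdict


def map_fn(line):
    """Map: parse one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: summarize all values seen for one key."""
    return word, sum(counts)


def run_job(records, mapper, reducer):
    """Toy driver (hypothetical): map, group values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Made-up sample input, one record per line.
lines = [
    "one tweet is an anecdote",
    "a million tweets can signal trends",
    "one review is an opinion",
]
word_counts = run_job(lines, map_fn, reduce_fn)
print(word_counts["one"], word_counts["tweets"])
```

In a real cluster the grouping step is performed by the framework across many machines; here it is just a dictionary in memory.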
Why MapReduce?
•  MapReduce simplifies parallel processing
•  Code is typically written in Java
•  Shields developers from complexity of distributed computing
•  No explicit synchronization, network sockets, or file I/O
•  Still, it is tedious to write MapReduce directly…
But MapReduce is like Assembly Language…
•  MapReduce is powerful and scalable
•  But writing MapReduce code directly in Java can be tedious
•  Business logic typically comprises just a fraction of overall code
•  Many real-world computations involve a sequence of jobs
•  Chaining multiple MapReduce jobs increases the complexity
•  Apache Crunch is designed to address these problems
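The point about sequences of jobs can be sketched with two chained toy jobs in Python. Everything here is an illustrative assumption: `run_job` is a hypothetical in-memory driver rather than Hadoop, and the order records and the spending threshold of 100 are made up. The second job takes the first job's output as its input, and even in this tiny form most of the code is plumbing rather than business logic.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Toy driver (hypothetical): apply mapper, group by key, reduce each key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


def order_mapper(order):
    """Job 1 map: emit (cust_id, cost) for each order record."""
    cust_id, cost = order
    yield cust_id, cost


def sum_reducer(key, values):
    """Shared reduce: add up the values for one key."""
    return key, sum(values)


def tier_mapper(total_record):
    """Job 2 map: classify each customer total into a spending tier."""
    _cust_id, total = total_record
    yield ("high" if total >= 100 else "low"), 1


# Made-up (cust_id, cost) order records.
orders = [("c1", 60), ("c2", 20), ("c1", 70), ("c3", 150), ("c2", 5)]

totals = run_job(orders, order_mapper, sum_reducer)  # job 1: total per customer
tiers = run_job(totals, tier_mapper, sum_reducer)    # job 2: reads job 1's output
print(totals)
print(tiers)
```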
What is Apache Crunch?
•  Apache Crunch is a library that simplifies parallel processing
•  Open-source implementation of Google's internal library
•  Provides a high-level API targeted at Java developers
•  No detailed knowledge of MapReduce required
•  Faster and easier than writing MapReduce code directly
•  Retains the power and expressiveness of Java
What is Apache Hive?
•  High-level data processing on Hadoop
•  Another alternative to writing MapReduce code
•  Queries data in HDFS using a SQL-like language
SELECT customers.cust_id, SUM(cost) AS total
FROM customers
JOIN orders
ON customers.cust_id = orders.cust_id
GROUP BY customers.cust_id
ORDER BY total DESC;
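To see what a query of this shape returns, the sketch below runs it on a few made-up rows. SQLite (through Python's standard `sqlite3` module) stands in for Hive purely so the example can run locally; it says nothing about how Hive executes the query. The `name` and `order_id` columns, the assumption that `cost` lives in `orders`, and all sample rows are illustrative.

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, cost REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 1, 30.0), (13, 2, 5.0);
""")

# Same query as on the slide: total order cost per customer, largest first.
query = """
    SELECT customers.cust_id, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON customers.cust_id = orders.cust_id
    GROUP BY customers.cust_id
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Customer 3 has no orders, so the inner join leaves that customer out of the result.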
Hive Data and Metadata
•  As with a database, you query one or more tables
•  Hive tables are just a façade for a directory of data in HDFS
•  Default file format is delimited text, but many others supported
•  Table structure and location are specified during creation
•  Metadata is stored in an RDBMS
•  Tables can be populated by loading data into HDFS directory
[Diagram: table "mytable", with its metadata in the Metastore and its data in HDFS]
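The idea of a table as a façade over a directory can be sketched in Python. This is a toy model under stated assumptions, not Hive's implementation: the `metastore` dictionary, `create_table`, and `scan_table` are hypothetical names, a local temporary directory stands in for an HDFS directory, and the tab delimiter is chosen only for readability.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical stand-in for the metastore: table name -> structure and location.
metastore = {}


def create_table(name, columns, location, delimiter="\t"):
    """Record the table's structure and location; no data is moved or converted."""
    metastore[name] = {
        "columns": columns,
        "location": Path(location),
        "delimiter": delimiter,
    }


def scan_table(name):
    """Read every file in the table's directory, applying the schema at read time."""
    meta = metastore[name]
    for path in sorted(meta["location"].iterdir()):
        with path.open(newline="") as handle:
            for fields in csv.reader(handle, delimiter=meta["delimiter"]):
                yield dict(zip(meta["columns"], fields))


with tempfile.TemporaryDirectory() as table_dir:
    create_table("mytable", ["cust_id", "cost"], table_dir)
    # "Loading" data is just placing a delimited text file in the table's directory.
    Path(table_dir, "part-00000.txt").write_text("1\t25.0\n2\t40.0\n")
    rows = list(scan_table("mytable"))

print(rows)
```

The values come back as strings because this sketch records only column names; a fuller model would also store column types in the metadata.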
What is Cloudera Impala?
•  Massively parallel SQL engine for Hadoop
•  Supports ad hoc / interactive queries on data in HDFS
•  Uses custom execution engine instead of MapReduce
•  Query syntax virtually identical to HiveQL / SQL
•  Shares metadata with Hive
•  Much, much faster than Hive
•  Impala is 100% open source (Apache-licensed)
Apache Solr (and Cloudera Search)
•  Apache Solr provides high-performance indexing and search
•  Mature platform with widespread deployment
•  Requires little technical skill for end users, yet still powerful
•  Cloudera integrates Solr to search data in HDFS
•  CDH offers scalability and reliability
•  Distributed data storage and indexing
•  Cloudera Search is open source, just like Apache Solr itself
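As a rough illustration of what "indexing and search" means, here is a toy inverted index in Python. It is a generic textbook technique shown under simplifying assumptions (whitespace tokenization, lowercase matching, AND semantics) and does not reflect Solr's API or internals; the function names and sample documents are made up.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up sample documents.
documents = {
    "doc1": "Hadoop stores data in HDFS",
    "doc2": "Impala queries data in HDFS",
    "doc3": "Solr indexes documents for search",
}
index = build_index(documents)
print(sorted(search(index, "data hdfs")))
```

Building the index once up front is what lets a query be answered by set lookups instead of rescanning every document.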
Conclusion
•  Thanks for having me!
•  Any questions?