The Evolution of Data Analysis with Hadoop - StampedeCon 2014

The Evolution of Data Analysis with Hadoop
Tom Wheeler | StampedeCon 2014

About the Presentation…
•  What’s ahead
•  Defining Hadoop
•  Data Processing with MapReduce
•  Simplifying Development with Apache Crunch
•  Bringing MapReduce to Analysts with Apache Hive
•  Getting Results Faster with Cloudera Impala
•  Finding Data Made Easy with Apache Solr / Cloudera Search
•  Conclusion + Q&A

Important Trends
•  Ubiquitous connectivity
•  We produce more data than ever
•  User-generated content
•  Lacks rigid structure
•  Inexpensive storage
•  Permanent retention

Big Data Can Mean Big Opportunity
•  One tweet is an anecdote
•  But a million tweets can signal important trends
•  One person’s product review is an opinion
•  But a million reviews might reveal a design flaw
•  One person’s diagnosis is an isolated case
•  But a million medical records could lead to a cure

What is Apache Hadoop?
•  Distributed data storage and processing
•  Scalable, flexible, and economical
•  Open source
•  Inspired by Google
•  Two main components
•  Hadoop Distributed File System (HDFS)
•  MapReduce

Getting Data into HDFS
•  HDFS is distinct from your local filesystem
[Diagram: Local Filesystem | Hadoop Distributed File System (HDFS) | Local Filesystem]

What is MapReduce?
•  MapReduce is a programming model
•  You supply two processing functions: Map and Reduce
•  Map: typically used to transform, parse, or filter data
•  Reduce: typically used to summarize results (optional)
•  MapReduce in Hadoop is batch-oriented
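The Map and Reduce roles described above can be sketched in a few lines of plain Python. This is only a single-process illustration of the programming model, not Hadoop's API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the word-count task and sample input are assumptions chosen for brevity.

```python
from collections import defaultdict


def map_fn(line):
    """Map: parse one input record and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: summarize all values that share a key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for the framework."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # group values by key (the "shuffle")
    # Keys reach the reducer in sorted order; sorted() mimics that here.
    return [reducer(key, groups[key]) for key in sorted(groups)]


lines = ["big data big opportunity", "data in hdfs"]  # assumed sample input
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the framework runs many mappers and reducers in parallel across machines; only the two functions are supplied by the developer.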

Why MapReduce?
•  MapReduce simplifies parallel processing
•  Code is typically written in Java
•  Shields developers from complexity of distributed computing
•  No explicit synchronization, network sockets, or file I/O
•  Still, it is tedious to write MapReduce directly…

But MapReduce is like Assembly Language…
•  MapReduce is powerful and scalable
•  But writing MapReduce code directly in Java can be tedious
•  Business logic typically comprises just a fraction of overall code
•  Many real-world computations involve a sequence of jobs
•  Chaining multiple MapReduce jobs increases the complexity
•  Apache Crunch is designed to address these problems
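To make the "sequence of jobs" point concrete, here is a rough Python sketch in which the output of one job becomes the input of the next. It is a toy single-process model, not Hadoop or Crunch code; `run_job`, the mapper and reducer names, the sample orders, and the spending threshold of 100 are all assumptions made for illustration.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Toy single-process MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, groups[key]) for key in sorted(groups)]


# Job 1: total cost per customer from (cust_id, cost) order records.
def order_mapper(order):
    cust_id, cost = order
    yield cust_id, cost


def total_reducer(cust_id, costs):
    return cust_id, sum(costs)


# Job 2: count customers per spending tier, reading job 1's output.
def tier_mapper(customer_total):
    _, total = customer_total
    yield ("high" if total >= 100 else "low"), 1  # threshold is an assumption


def count_reducer(tier, ones):
    return tier, sum(ones)


orders = [("c1", 60), ("c2", 20), ("c1", 50), ("c3", 150)]  # assumed sample data
totals = run_job(orders, order_mapper, total_reducer)
tiers = run_job(totals, tier_mapper, count_reducer)
print(totals, tiers)
```

Even in this toy form, the business logic (four small functions) is surrounded by plumbing that passes one job's output to the next; that plumbing is the part higher-level libraries aim to hide.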

What is Apache Crunch?
•  Apache Crunch is a library that simplifies parallel processing
•  Open-source implementation of Google's internal library
•  Provides a high-level API targeted at Java developers
•  No detailed knowledge of MapReduce required
•  Faster and easier than writing MapReduce code directly
•  Retains the power and expressiveness of Java

What is Apache Hive?
•  High-level data processing on Hadoop
•  Another alternative to writing MapReduce code
•  Queries data in HDFS using a SQL-like language
SELECT customers.cust_id, SUM(cost) AS total
FROM customers
JOIN orders
ON customers.cust_id = orders.cust_id
GROUP BY customers.cust_id
ORDER BY total DESC;
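To see what the query above computes, the following sketch runs the same statement against a tiny in-memory SQLite database. SQLite is only a stand-in here, not Hive; the table layouts beyond `cust_id` and `cost`, and all sample rows, are assumptions.

```python
import sqlite3

# In-memory SQLite database used as a stand-in; this is not Hive.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, cost REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 1, 30.0);
""")

# The query from the slide, unchanged.
query = """
    SELECT customers.cust_id, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON customers.cust_id = orders.cust_id
    GROUP BY customers.cust_id
    ORDER BY total DESC;
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Each result row pairs a customer ID with that customer's total order cost, highest spender first. With Hive, the same statement would run over files in HDFS rather than a local database.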

Hive Data and Metadata
•  As with a database, you query one or more tables
•  Hive tables are just a façade for a directory of data in HDFS
•  Default file format is delimited text, but many others supported
•  Table structure and location are specified during creation
•  Metadata is stored in an RDBMS
•  Tables can be populated by loading data into HDFS directory
[Diagram: mytable | Data in HDFS | Metastore]
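The "table as a façade over a directory" idea can be sketched in plain Python using a local temporary directory in place of HDFS. Nothing here is Hive's API: the `metastore` dictionary, the `scan` function, the file names, and the tab delimiter are all assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path

# A local temp directory stands in for the table's HDFS directory.
table_dir = Path(tempfile.mkdtemp()) / "mytable"
table_dir.mkdir()

# Hypothetical metastore entry: table structure plus data location.
metastore = {
    "mytable": {
        "columns": ["cust_id", "cost"],
        "location": table_dir,
        "delimiter": "\t",  # assumed delimiter for this sketch
    }
}

# "Loading" data is just placing delimited text files in the directory.
(table_dir / "part-0.txt").write_text("1\t25.0\n2\t40.0\n")
(table_dir / "part-1.txt").write_text("1\t30.0\n")


def scan(table_name):
    """Read every file in the table's directory, applying the stored schema."""
    meta = metastore[table_name]
    for path in sorted(meta["location"].iterdir()):
        with path.open(newline="") as handle:
            for fields in csv.reader(handle, delimiter=meta["delimiter"]):
                yield dict(zip(meta["columns"], fields))


table_rows = list(scan("mytable"))
print(table_rows)
```

The data files carry no schema of their own; column names are applied only when the directory is read, which is why adding a file to the directory is enough to add rows to the table.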

What is Cloudera Impala?
•  Massively parallel SQL engine for Hadoop
•  Supports ad hoc / interactive queries on data in HDFS
•  Uses custom execution engine instead of MapReduce
•  Query syntax virtually identical to HiveQL / SQL
•  Shares metadata with Hive
•  Much, much faster than Hive
•  Impala is 100% open source (Apache-licensed)

Apache Solr (and Cloudera Search)
•  Apache Solr provides high-performance indexing and search
•  Mature platform with widespread deployment
•  Requires little technical skill for end users, yet still powerful
•  Cloudera integrates Solr to search data in HDFS
•  CDH offers scalability and reliability
•  Distributed data storage and indexing
•  Cloudera Search is open source, just like Apache Solr itself

Conclusion
•  Thanks for having me!
•  Any questions?
