Hadoop and Spark

Shravan (Sean) Pabba

1
About Me

• Diverse roles/languages and platforms.
• Middleware space in recent years.
• Worked for IBM/Grid Dynamics/GigaSpaces.
• Working as Systems Engineer for Cloudera since last July.
• Work with and educate clients/prospects.

2
Agenda

• Introduction to Spark
  – MapReduce Review
  – Why Spark
  – Architecture (Stand-alone AND Cloudera)
• Concepts
• Examples/Use Cases
• Spark Streaming
• Shark
  – Shark Vs Impala
• Demo

3
Have you done?

• Programming languages (Java/Python/Scala)
• Written multi-threaded or distributed programs
• Numerical Programming/Statistical Computing (R, MATLAB)
• Hadoop

4
INTRODUCTION TO SPARK

5
A brief review of MapReduce

[Diagram: many parallel Map tasks feeding a smaller number of Reduce tasks]

Key advances by MapReduce:

• Data Locality: automatic split computation and launch of mappers appropriately
• Fault tolerance: writing intermediate results and restartable mappers mean the ability to run on commodity hardware
• Linear scalability: combination of locality + a programming model that forces developers to write generally scalable solutions to problems

6
MapReduce sufficient for many classes of problems

Built on MapReduce: Hive, Pig, Mahout, Crunch, Solr

A bit like Haiku:

• Limited expressivity
• But can be used to approach diverse problem domains

7
BUT… Can we do better?

Areas ripe for improvement:

• Launching Mappers/Reducers takes time
• Having to write to disk (replicated) between each step
• Reading data back from disk in the next step
• Each Map/Reduce step has to go back into the queue and get its resources
• Not in memory
• Cannot iterate fast

8
What is Spark?

Spark is a general-purpose computational framework with more flexibility than MapReduce. It is an implementation of a 2010 Berkeley paper [1].

Key properties:

• Leverages distributed memory
• Full Directed Graph expressions for data parallel computations
• Improved developer experience

Yet retains: linear scalability, fault tolerance and data-locality-based computations

1 - http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

9
Spark: Easy and Fast Big Data

• Easy to Develop
  – Rich APIs in Java, Scala, Python
  – Interactive shell
• Fast to Run
  – General execution graphs
  – In-memory storage

2-5× less code; up to 10× faster on disk, 100× in memory

10
Easy: Get Started Immediately

• Multi-language support
• Interactive Shell

Python

lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala

val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();

11
Spark Ecosystem

http://www.databricks.com/spark/#sparkhadoop

12
Spring Framework

http://docs.spring.io/spring/docs/1.2.9/reference/introduction.html

13
Spark in Cloudera EDH

[Diagram: Cloudera's Enterprise Data Hub – unified, elastic, resilient, secure storage for any type of data, with 3rd-party apps on top:]

• Batch processing: MAPREDUCE, SPARK
• Analytic SQL: IMPALA
• Search engine: SOLR
• Machine learning: SPARK
• Stream processing: SPARK STREAMING
• Workload management: YARN
• Filesystem: HDFS
• Online NoSQL: HBASE
• Data management: CLOUDERA NAVIGATOR
• System management: CLOUDERA MANAGER
• Security: SENTRY

14
Adoption

• Supporting:
  – DataBricks
• Contributing:
  – UC Berkeley, DataBricks, Yahoo, etc.
• Well-known use-cases:
  – Conviva, Quantifind, Bizo

15
CONCEPTS

16
Spark Concepts - Overview

• Driver & Workers
• RDD – Resilient Distributed Dataset
• Transformations
• Actions
• Caching

17
Driver and Workers

[Diagram: one Driver coordinating three Workers, each holding Data in RAM]

18
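In code, that split looks roughly like the following minimal sketch (assuming the standard SparkConf/SparkContext API; the local master URL and the numbers are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object DriverExample {
  def main(args: Array[String]): Unit = {
    // The driver process owns the SparkContext and builds the execution graph.
    val conf = new SparkConf().setAppName("driver-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // The data is split into partitions that live on the workers.
    val data = sc.parallelize(1 to 1000000, 8)
    // map runs on the workers; reduce returns a single value to the driver.
    val sum = data.map(_.toLong * 2).reduce(_ + _)
    println(s"sum = $sum")
    sc.stop()
  }
}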
RDD – Resilient Distributed Dataset

• Read-only partitioned collection of records
• Created through:
  – Transformation of data in storage
  – Transformation of RDDs
• Contains lineage to compute from storage
• Lazy materialization
• Users control persistence and partitioning

19
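These properties can be seen in a few lines of Scala (a sketch; the HDFS path is a placeholder and sc is an existing SparkContext):

// Created by transforming data in storage; the second argument requests a
// minimum number of partitions, i.e. the user controls partitioning.
val lines = sc.textFile("hdfs://.../events.log", 16)
// Created by transforming an existing RDD; lineage is recorded, nothing runs yet.
val errors = lines.filter(_.contains("ERROR"))
// The user controls persistence.
errors.persist()
// Lazy materialization: only an action triggers the actual computation.
println(errors.count())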
Operations

Transformations
• Map
• Filter
• Sample
• Join

Actions
• Reduce
• Count
• First, Take
• SaveAs

20
Operations

• Transformations create a new RDD from an existing one
• Actions run computation on an RDD and return a value
• Transformations are lazy.
• Actions materialize RDDs by computing transformations.
• RDDs can be cached to avoid re-computing.

21
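The distinction is easiest to see in the shell (a small sketch; sc is an existing SparkContext):

val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)           // transformation: returns a new RDD, nothing computed
val evens = doubled.filter(_ % 4 == 0)  // transformation: still lazy
val total = evens.reduce(_ + _)         // action: computes the lineage and returns a value
evens.count()                           // action: re-computes the lineage unless the RDD is cached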
Fault Tolerance

• RDDs contain lineage.
• Lineage – source location and list of transformations
• Lost partitions can be re-computed from source data

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Lineage: HDFS File → Filtered RDD via filter (func = startswith(…)) → Mapped RDD via map (func = split(…))]

22
Caching

• persist() and cache() mark data
• RDD is cached after the first action
• Fault tolerant – lost partitions will re-compute
• If not enough memory, some partitions will not be cached
• Future actions are performed on cached partitions
• So they are much faster

Use caching for iterative algorithms

23
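A sketch of that iterative pattern (parsePoint, initModel and step are hypothetical helpers, not Spark API; the path is a placeholder):

val data = sc.textFile("hdfs://.../points").map(parsePoint).cache() // cache() only marks the RDD
var model = initModel()
for (i <- 1 to 10) {
  // Iteration 1 reads from HDFS and fills the cache; iterations 2..10 read the
  // cached partitions. Lost partitions are recomputed from lineage as needed.
  model = step(model, data)
}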
Caching – Storage Levels

• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2…

24
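A level other than the default is chosen through persist(); cache() is shorthand for persist(StorageLevel.MEMORY_ONLY). A small sketch (the path is a placeholder):

import org.apache.spark.storage.StorageLevel

val sessions = sc.textFile("hdfs://.../sessions")
// Keep partitions serialized in memory and spill to disk when memory runs out.
sessions.persist(StorageLevel.MEMORY_AND_DISK_SER)
sessions.count() // the first action populates the cache at this level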
SPARK EXAMPLES

25
Easy: Example – Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

26
Spark Word Count in Java

JavaSparkContext sc = new JavaSparkContext(...);
JavaRDD<String> lines = sc.textFile("hdfs://...");

JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  }
);

JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  }
);

JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  }
);

With Java 8 lambda expressions [1]:

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");

JavaRDD<String> words =
  lines.flatMap(line -> Arrays.asList(line.split(" ")));

JavaPairRDD<String, Integer> ones =
  words.mapToPair(w -> new Tuple2<String, Integer>(w, 1));

JavaPairRDD<String, Integer> counts =
  ones.reduceByKey((x, y) -> x + y);

1 - http://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html

28
Log Mining

• Load error messages from a log into memory
• Interactively search for patterns

29
Log Mining

val lines = sparkContext.textFile("hdfs://…")      // base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
val messages = errors.map(_.split("\t")(2))

val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count         // action
cachedMsgs.filter(_.contains("bar")).count
…

30
Logistic Regression

• Read two sets of points
• Look for a plane W that separates them
• Perform gradient descent:
  – Start with random W
  – On each iteration, sum a function of W over the data
  – Move W in a direction that improves it

31
Intuition

32
Logistic Regression

val points = spark.textFile(…).map(parsePoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final separating plane: " + w)

33
Conviva Use-Case [1]

• Monitor online video consumption
• Analyze trends

Need to run tens of queries like this a day:

SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2011_12_12' AND customer='XYZ'
GROUP BY videoName;

1 - http://www.conviva.com/using-spark-and-hive-to-process-bigdata-at-conviva/

34
Conviva With Spark

val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](pathToSessionSummaryOnHdfs)

val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache

val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn : (Long, Long) => Long = { (a, b) => a + b }

val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap

35
SPARK STREAMING

36
Large-Scale Stream Processing

Requires:
• Fault tolerance – for crashes and stragglers
• Efficiency

Row-by-row (continuous operator) systems do not handle straggler nodes. Batch processing provides fault tolerance efficiently: the job is divided into deterministic tasks.

37
Key Question

• How fast can the system recover?

38
Spark Streaming

http://spark.apache.org/docs/latest/streaming-programming-guide.html

39
Spark Streaming

– Run continuous processing of data using Spark's core API.
– Extends Spark's concept of RDDs to DStreams (Discretized Streams), which are fault-tolerant, transformable streams. Users can re-use existing code for batch/offline processing.
– Adds "rolling window" operations, e.g. computing rolling averages or counts for data over the last five minutes (see the sketch below).
– Example use cases:
  • "On-the-fly" ETL as data is ingested into Hadoop/HDFS.
  • Detecting anomalous behavior and triggering alerts.
  • Continuous reporting of summary metrics for incoming data.

40
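A minimal sketch of such a rolling-window count (assuming the standard Spark Streaming API; the host, port and durations are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed-counts").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    // Count words over the last five minutes, recomputed every ten seconds.
    val counts = words.map(w => (w, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(300), Seconds(10))
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}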
"Micro-batch" Architecture

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

[Diagram: the tweets DStream feeds the hashTags DStream as a sequence of batches at t, t+1, t+2; each batch is transformed with flatMap and written with save. The stream is composed of small (1-10s) batch computations.]

41
SHARK

42
Shark Architecture

• Identical to Hive
• Same CLI, JDBC, SQL Parser, Metastore
• Replaced the optimizer, plan generator and the execution engine.
• Added a Cache Manager.
• Generates Spark code instead of MapReduce

43
Hive Compatibility

• MetaStore
• HQL
• UDF / UDAF
• SerDes
• Scripts

44
Shark Vs Impala

• Shark inherits Hive limitations while Impala is purpose-built for SQL.
• Impala is significantly faster per our tests.
• Shark does not have security, audit/lineage, support for high concurrency, or operational tooling for config/monitor/reporting/debugging.
• Interactive SQL is needed for connecting BI tools; Shark is not certified by any BI vendor.

45
DEMO

46
SUMMARY

47
Why Spark?

• Flexible like MapReduce
• High performance
• Machine learning, iterative algorithms
• Interactive data exploration
• Developer productivity

48
How Spark Works?

• RDDs – resilient distributed data
• Lazy transformations
• Caching
• Fault tolerance by storing lineage
• Streams – micro-batches of RDDs
• Shark – Hive + Spark

49

Editor's Notes

• #9: MapReduce struggles with performance optimization for individual systems because of its design. Google has used both techniques in-house quite a bit, and the future will contain both.
• #25: Spark's storage levels are meant to provide different tradeoffs between memory usage and CPU efficiency. We recommend going through the following process to select one:
  – If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
  – If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
  – Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
  – Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
  – If you want to define your own storage level (say, with a replication factor of 3 instead of 2), then use the apply() factory method of the StorageLevel singleton object.