Hadoop and Spark

Shravan (Sean) Pabba

1
About Me

• Diverse roles/languages and platforms.
• Middleware space in recent years.
• Worked for IBM/Grid Dynamics/GigaSpaces.
• Working as Systems Engineer for Cloudera since last July.
• Work with and educate clients/prospects.

2
Agenda

• Introduction to Spark
  – MapReduce Review
  – Why Spark
  – Architecture (Stand-alone AND Cloudera)
• Concepts
• Examples/Use Cases
• Spark Streaming
• Shark
  – Shark Vs Impala
• Demo

3
Have you done?

• Programming languages (Java/Python/Scala)
• Written multi-threaded or distributed programs
• Numerical Programming/Statistical Computing (R, MATLAB)
• Hadoop

4
INTRODUCTION TO SPARK

5
A brief review of MapReduce

[Diagram: many parallel Map tasks feeding a smaller number of Reduce tasks]

Key advances by MapReduce:

• Data Locality: automatic split computation and launch of mappers appropriately
• Fault tolerance: writing intermediate results and restartable mappers mean the ability to run on commodity hardware
• Linear scalability: combination of locality + a programming model that forces developers to write generally scalable solutions to problems

6
MapReduce sufficient for many classes of problems

Built on MapReduce: Hive, Pig, Mahout, Crunch, Solr

A bit like Haiku:

• Limited expressivity
• But can be used to approach diverse problem domains

7
BUT… Can we do better?

Areas ripe for improvement:

• Launching Mappers/Reducers takes time
• Having to write to disk (replicated) between each step
• Reading data back from disk in the next step
• Each Map/Reduce step has to go back into the queue and get its resources
• Not in memory
• Cannot iterate fast

8
What is Spark?

Spark is a general-purpose computational framework with more flexibility than MapReduce. It is an implementation of a 2010 Berkeley paper [1].

Key properties:

• Leverages distributed memory
• Full Directed Graph expressions for data parallel computations
• Improved developer experience

Yet retains: linear scalability, fault tolerance and data-locality-based computations

1 - http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

9
Spark: Easy and Fast Big Data

• Easy to Develop
  – Rich APIs in Java, Scala, Python
  – Interactive shell
• Fast to Run
  – General execution graphs
  – In-memory storage

2-5× less code; up to 10× faster on disk, 100× in memory

10
Easy: Get Started Immediately

• Multi-language support
• Interactive Shell

Python

lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala

val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();

11
Spark Ecosystem

http://www.databricks.com/spark/#sparkhadoop

12
Spring Framework

http://docs.spring.io/spring/docs/1.2.9/reference/introduction.html

13
Spark in Cloudera EDH

[Diagram: Cloudera's Enterprise Data Hub – unified, elastic, resilient, secure storage for any type of data, with 3rd-party apps on top:]

• Batch processing: MAPREDUCE, SPARK
• Analytic SQL: IMPALA
• Search engine: SOLR
• Machine learning: SPARK
• Stream processing: SPARK STREAMING
• Workload management: YARN
• Filesystem: HDFS
• Online NoSQL: HBASE
• Data management: CLOUDERA NAVIGATOR
• System management: CLOUDERA MANAGER
• Security: SENTRY

14
Adoption

• Supporting:
  – DataBricks
• Contributing:
  – UC Berkeley, DataBricks, Yahoo, etc.
• Well-known use-cases:
  – Conviva, Quantifind, Bizo

15
CONCEPTS

16
Spark Concepts - Overview

• Driver & Workers
• RDD – Resilient Distributed Dataset
• Transformations
• Actions
• Caching

17
Driver and Workers

[Diagram: one Driver coordinating three Workers, each holding Data in RAM]

18
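In code, that split looks roughly like the following minimal sketch (assuming the standard SparkConf/SparkContext API; the local master URL and the numbers are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object DriverExample {
  def main(args: Array[String]): Unit = {
    // The driver process owns the SparkContext and builds the execution graph.
    val conf = new SparkConf().setAppName("driver-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // The data is split into partitions that live on the workers.
    val data = sc.parallelize(1 to 1000000, 8)
    // map runs on the workers; reduce returns a single value to the driver.
    val sum = data.map(_.toLong * 2).reduce(_ + _)
    println(s"sum = $sum")
    sc.stop()
  }
}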
RDD – Resilient Distributed Dataset

• Read-only partitioned collection of records
• Created through:
  – Transformation of data in storage
  – Transformation of RDDs
• Contains lineage to compute from storage
• Lazy materialization
• Users control persistence and partitioning

19
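These properties can be seen in a few lines of Scala (a sketch; the HDFS path is a placeholder and sc is an existing SparkContext):

// Created by transforming data in storage; the second argument requests a
// minimum number of partitions, i.e. the user controls partitioning.
val lines = sc.textFile("hdfs://.../events.log", 16)
// Created by transforming an existing RDD; lineage is recorded, nothing runs yet.
val errors = lines.filter(_.contains("ERROR"))
// The user controls persistence.
errors.persist()
// Lazy materialization: only an action triggers the actual computation.
println(errors.count())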
Operations

Transformations
• Map
• Filter
• Sample
• Join

Actions
• Reduce
• Count
• First, Take
• SaveAs

20
Operations

• Transformations create a new RDD from an existing one
• Actions run computation on an RDD and return a value
• Transformations are lazy.
• Actions materialize RDDs by computing transformations.
• RDDs can be cached to avoid re-computing.

21
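The distinction is easiest to see in the shell (a small sketch; sc is an existing SparkContext):

val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)           // transformation: returns a new RDD, nothing computed
val evens = doubled.filter(_ % 4 == 0)  // transformation: still lazy
val total = evens.reduce(_ + _)         // action: computes the lineage and returns a value
evens.count()                           // action: re-computes the lineage unless the RDD is cached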
Fault Tolerance

• RDDs contain lineage.
• Lineage – source location and list of transformations
• Lost partitions can be re-computed from source data

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Lineage: HDFS File → Filtered RDD via filter (func = startswith(…)) → Mapped RDD via map (func = split(…))]

22
Caching

• persist() and cache() mark data
• RDD is cached after the first action
• Fault tolerant – lost partitions will re-compute
• If not enough memory, some partitions will not be cached
• Future actions are performed on cached partitions
• So they are much faster

Use caching for iterative algorithms

23
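A sketch of that iterative pattern (parsePoint, initModel and step are hypothetical helpers, not Spark API; the path is a placeholder):

val data = sc.textFile("hdfs://.../points").map(parsePoint).cache() // cache() only marks the RDD
var model = initModel()
for (i <- 1 to 10) {
  // Iteration 1 reads from HDFS and fills the cache; iterations 2..10 read the
  // cached partitions. Lost partitions are recomputed from lineage as needed.
  model = step(model, data)
}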
Caching – Storage Levels

• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2…

24
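A level other than the default is chosen through persist(); cache() is shorthand for persist(StorageLevel.MEMORY_ONLY). A small sketch (the path is a placeholder):

import org.apache.spark.storage.StorageLevel

val sessions = sc.textFile("hdfs://.../sessions")
// Keep partitions serialized in memory and spill to disk when memory runs out.
sessions.persist(StorageLevel.MEMORY_AND_DISK_SER)
sessions.count() // the first action populates the cache at this level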
SPARK EXAMPLES

25
Easy: Example – Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

26
Spark Word Count in Java

JavaSparkContext sc = new JavaSparkContext(...);
JavaRDD<String> lines = sc.textFile("hdfs://...");

JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  }
);

JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  }
);

JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  }
);

With Java 8 lambda expressions [1]:

JavaRDD<String> lines = sc.textFile("hdfs://log.txt");

JavaRDD<String> words =
  lines.flatMap(line -> Arrays.asList(line.split(" ")));

JavaPairRDD<String, Integer> ones =
  words.mapToPair(w -> new Tuple2<String, Integer>(w, 1));

JavaPairRDD<String, Integer> counts =
  ones.reduceByKey((x, y) -> x + y);

1 - http://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html

28
Log Mining

• Load error messages from a log into memory
• Interactively search for patterns

29
Log Mining

val lines = sparkContext.textFile("hdfs://…")      // base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
val messages = errors.map(_.split("\t")(2))

val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count         // action
cachedMsgs.filter(_.contains("bar")).count
…

30
Logistic Regression

• Read two sets of points
• Look for a plane W that separates them
• Perform gradient descent:
  – Start with random W
  – On each iteration, sum a function of W over the data
  – Move W in a direction that improves it

31
Intuition

32
Logistic Regression

val points = spark.textFile(…).map(parsePoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final separating plane: " + w)

33
Conviva Use-Case [1]

• Monitor online video consumption
• Analyze trends

Need to run tens of queries like this a day:

SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2011_12_12' AND customer='XYZ'
GROUP BY videoName;

1 - http://www.conviva.com/using-spark-and-hive-to-process-bigdata-at-conviva/

34
Conviva With Spark

val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](pathToSessionSummaryOnHdfs)

val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache

val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn : (Long, Long) => Long = { (a, b) => a + b }

val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap

35
SPARK STREAMING

36
Large-Scale Stream Processing

Requires:
• Fault tolerance – for crashes and stragglers
• Efficiency

Row-by-row (continuous operator) systems do not handle straggler nodes. Batch processing provides fault tolerance efficiently: the job is divided into deterministic tasks.

37
Key Question

• How fast can the system recover?

38
Spark Streaming

http://spark.apache.org/docs/latest/streaming-programming-guide.html

39
Spark Streaming

– Run continuous processing of data using Spark's core API.
– Extends Spark's concept of RDDs to DStreams (Discretized Streams), which are fault-tolerant, transformable streams. Users can re-use existing code for batch/offline processing.
– Adds "rolling window" operations, e.g. computing rolling averages or counts for data over the last five minutes (see the sketch below).
– Example use cases:
  • "On-the-fly" ETL as data is ingested into Hadoop/HDFS.
  • Detecting anomalous behavior and triggering alerts.
  • Continuous reporting of summary metrics for incoming data.

40
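A minimal sketch of such a rolling-window count (assuming the standard Spark Streaming API; the host, port and durations are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed-counts").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    // Count words over the last five minutes, recomputed every ten seconds.
    val counts = words.map(w => (w, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(300), Seconds(10))
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}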
"Micro-batch" Architecture

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

[Diagram: the tweets DStream feeds the hashTags DStream as a sequence of batches at t, t+1, t+2; each batch is transformed with flatMap and written with save. The stream is composed of small (1-10s) batch computations.]

41
SHARK

42
Shark Architecture

• Identical to Hive
• Same CLI, JDBC, SQL Parser, Metastore
• Replaced the optimizer, plan generator and the execution engine.
• Added a Cache Manager.
• Generates Spark code instead of MapReduce

43
Hive Compatibility

• MetaStore
• HQL
• UDF / UDAF
• SerDes
• Scripts

44
Shark Vs Impala

• Shark inherits Hive limitations while Impala is purpose-built for SQL.
• Impala is significantly faster per our tests.
• Shark does not have security, audit/lineage, support for high concurrency, or operational tooling for config/monitor/reporting/debugging.
• Interactive SQL is needed for connecting BI tools; Shark is not certified by any BI vendor.

45
DEMO

46
SUMMARY

47
Why Spark?

• Flexible like MapReduce
• High performance
• Machine learning, iterative algorithms
• Interactive data exploration
• Developer productivity

48
How Spark Works?

• RDDs – resilient distributed data
• Lazy transformations
• Caching
• Fault tolerance by storing lineage
• Streams – micro-batches of RDDs
• Shark – Hive + Spark

49

Editor's Notes

• #9: MapReduce struggles with performance optimization for individual systems because of its design. Google has used both techniques in-house quite a bit, and the future will contain both.
• #25: Spark's storage levels are meant to provide different tradeoffs between memory usage and CPU efficiency. We recommend going through the following process to select one:
  – If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
  – If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
  – Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
  – Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
  – If you want to define your own storage level (say, with a replication factor of 3 instead of 2), then use the apply() factory method of the StorageLevel singleton object.