Insight on "From Hadoop to Spark" by Mark Kerzner

Webinar: From Hadoop to Spark
Introduction
Hadoop and Spark Comparison
From Hadoop to Spark

www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved.
Webinar Objectives
 Intro: what is Hadoop and what is Spark?
 Spark's capabilities and advantages vs Hadoop
 From Hadoop to Spark – how to?

Introduction
From Hadoop to Spark

Hadoop in 20 Seconds
 ‘The’ Big data platform
 Very well field tested
 Scales to petabytes of data
 MapReduce : batch-oriented compute

Hadoop Ecosystem
[Figure: Hadoop ecosystem components grouped into Batch and Real Time]

Hadoop Ecosystem – by function
 HDFS
– provides distributed storage
 MapReduce
– provides distributed computing
 Pig
– High level MapReduce
 Hive
– SQL layer over Hadoop
 HBase
– NoSQL storage for real-time queries

Spark in 20 Seconds
 Fast and expressive cluster computing engine
 Compatible with Hadoop
 Came out of Berkeley AMPLab
 Now an Apache project
 Version 1.3 just released (April 2015)
“First Big Data platform to integrate batch, streaming and
interactive computations in a unified framework” – stratio.com

Spark Eco-System
 Spark Core
 Libraries on top of Spark Core
– Spark SQL : schema / SQL
– Spark Streaming : real time
– MLlib : machine learning
– GraphX : graph processing
 Cluster managers
– Standalone
– YARN
– Mesos

Hype-meter

Spark Job Trends

Spark Benchmarks
Source : stratio.com

Spark Code / Activity
Source : stratio.com

Timeline : Hadoop & Spark

Introduction
Going from Hadoop to Spark
Session 2: Introduction to Spark

Hadoop Vs. Spark
Hadoop
Spark
Source : http://www.kwigger.com/mit-skifte-til-mac/

Comparison With Hadoop
Hadoop | Spark
Distributed storage + distributed compute | Distributed compute only
MapReduce framework | Generalized computation
Usually data on disk (HDFS) | On disk / in memory
Not ideal for iterative work | Great at iterative workloads (machine learning, etc.)
Batch process | Up to 10x faster for data on disk; up to 100x faster for data in memory
(no counterpart listed) | Compact code
(no counterpart listed) | Java, Python, Scala supported
(no counterpart listed) | Shell for ad-hoc exploration

Hadoop + YARN : OS for Distributed Compute
 Applications : Batch (MapReduce), Streaming (Storm, S4), In-memory (Spark)
 Cluster management : YARN
 Storage : HDFS
(or at least, that’s the idea)

Spark Is a Better Fit for Iterative Workloads

Spark Programming Model
 More generic than MapReduce
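To make "more generic than MapReduce" concrete, here is a rough pure-Python sketch. It uses neither Hadoop nor Spark, and every function name in it is made up for illustration. The first function squeezes a word count into the fixed map, shuffle, reduce shape; the second expresses a job as a free chain of transformations, which is closer in spirit to how Spark programs are composed.

```python
from collections import defaultdict
from functools import reduce


def map_reduce_word_count(lines):
    """Word count forced into the fixed map -> shuffle -> reduce shape."""
    # Map phase: emit a (word, 1) pair for every word.
    pairs = [(word.lower(), 1) for line in lines for word in line.split()]
    # Shuffle phase: group all values by key.
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    # Reduce phase: collapse each group to a single value.
    return {word: sum(counts) for word, counts in groups.items()}


def chained_pipeline(lines, min_length):
    """A chain of steps: flat-map, filter, map, then a final reduce."""
    words = (word.lower() for line in lines for word in line.split())
    long_words = filter(lambda w: len(w) >= min_length, words)
    lengths = map(len, long_words)
    return reduce(lambda a, b: a + b, lengths, 0)


# Hypothetical sample input, not taken from the webinar.
sample = ["Spark runs on Hadoop", "Hadoop stores data"]
counts = map_reduce_word_count(sample)
total_chars = chained_pipeline(sample, min_length=5)
```

In the chained version any number of steps can be added or reordered without introducing a new map/reduce job per step, which is the flexibility the slide refers to.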

Is Spark Replacing Hadoop?
 Spark runs on Hadoop / YARN
– Complementary
 Spark programming model is more flexible than MapReduce
 Spark is really great if data fits in memory (a few hundred gigs)
 Spark is ‘storage agnostic’ (see next slide)

Spark & Pluggable Storage
Spark (compute engine) on top of pluggable storage:
HDFS | Amazon S3 | Cassandra | ???

Spark & Hadoop
Use Case | Other | Spark
Batch processing | Hadoop’s MapReduce (Java, Pig, Hive) | Spark RDDs (Java / Scala / Python)
SQL querying | Hadoop : Hive | Spark SQL
Stream processing / real-time processing | Storm, Kafka | Spark Streaming
Machine learning | Mahout | Spark MLlib
Real-time lookups | NoSQL (HBase, Cassandra, etc.) | No Spark component, but Spark can query data in NoSQL stores

Hadoop & Spark Future ???

Introduction
Session 2: Introduction to Spark

Why Move From Hadoop to Spark?
 Spark is ‘easier’ than Hadoop
 ‘friendlier’ for data scientists / analysts
– Interactive shell
• fast development cycles
• ad-hoc exploration
 API supports multiple languages
– Java, Scala, Python
 Great for small (Gigs) to medium (100s of Gigs) data

Spark : ‘Unified’ Stack
 Spark supports multiple programming models
– MapReduce-style batch processing
– Streaming / real time processing
– Querying via SQL
– Machine learning
 All modules are tightly integrated
– Facilitates rich applications
 Spark can be the only stack you need !
– No need to run multiple clusters
(Hadoop cluster, Storm cluster, … etc.)
Image: buymeposters.com

Migrating From Hadoop → Spark
Functionality | Hadoop | Spark
Distributed storage | HDFS | Cloud storage like Amazon S3, or NFS mounts
SQL querying | Hive | Spark SQL
ETL workflow | Pig | Spork (Pig on Spark); mix of Spark SQL
Machine learning | Mahout | MLlib
NoSQL DB | HBase | ???

Five Steps of Moving From Hadoop to Spark
1. Data size
2. File System
3. SQL
4. ETL
5. Machine Learning

Data Size : “You Don’t Have Big Data”

1) Data Size (T-shirt sizing)
Image credit : blog.trumpi.co.za
[Figure: T-shirt sizes from < few G through 10 G+, 100 G+, 1 TB+ and 100 TB+ up to PB+, with the ranges suited to Hadoop and to Spark marked]
1) Data Size
 A lot of Spark adoption at SMALL – MEDIUM scale
– Good fit
– Data might fit in memory !!
– Hadoop may be overkill
 Applications
– Iterative workloads (Machine learning, etc.)
– Streaming
 Hadoop is still the preferred platform for TB + data
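The sizing guidance above can be sketched as a small helper. This is only an illustration: the function names are invented, the 100 GB boundary between "small" and "medium" is an assumption (the slides say only "Gigs" and "100s of Gigs"), and 1 TB is taken as 1024 GB.

```python
TB_IN_GB = 1024  # assumption: binary units


def size_tier(size_gb):
    """Classify a data set as small, medium or large (hypothetical cut-offs)."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if size_gb < 100:          # assumed boundary for "Gigs"
        return "small"
    if size_gb < TB_IN_GB:     # "100s of Gigs"
        return "medium"
    return "large"             # TB and beyond


def suggest_platform(size_gb):
    """Spark for small/medium data, Hadoop for TB+ data, per the slides."""
    return "hadoop" if size_tier(size_gb) == "large" else "spark"
```

A real decision would also weigh workload type (iterative, streaming) and available cluster memory, which this sketch ignores.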

2) File System
 Hadoop = Storage + Compute
Spark = Compute only
Spark needs a distributed FS
 File system choices for Spark
– HDFS - Hadoop File System
• Reliable
• Good performance (data locality)
• Field tested for PB of data
– S3 : Amazon
• Reliable cloud storage
• Huge scale
– NFS : Network File System (shared FS across machines)

Spark File Systems

File Systems For Spark
Property | HDFS | NFS | Amazon S3
Data locality | High (best) | Local enough | None (ok)
Throughput | High (best) | Medium (good) | Low (ok)
Latency | Low (best) | Low | High
Reliability | Very High (replicated) | Low | Very High
Cost | Varies | Varies | $30 / TB / Month
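As a quick back-of-the-envelope aid, the S3 figure quoted above can be turned into a monthly estimate. The sketch below assumes the slide's 2015 price of $30 per TB per month, treats 1 TB as 1024 GB, and covers storage only (no request or transfer charges); the function name is made up.

```python
def monthly_s3_cost(size_gb, price_per_tb=30.0):
    """Estimated monthly storage cost in dollars for size_gb of data."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return size_gb / 1024 * price_per_tb


one_tb = monthly_s3_cost(1024)   # a full terabyte
half_tb = monthly_s3_cost(512)   # half a terabyte
```

Current pricing should be checked before relying on any such estimate.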

File Systems Throughput Comparison
 Data : 10G + (11.3 G)
 Each file : ~1+ G ( x 10)
 400 million records total
 Partition size : 128 M
 On HDFS & S3
 Cluster :
– 8 Nodes on Amazon m3.xlarge (4 cpu , 15 G Mem, 40G SSD )
– Hadoop cluster, latest Hortonworks HDP v2.2
– Spark : on same 8 nodes, stand-alone, v 1.2
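For a feel of what this setup means in terms of parallelism, the arithmetic can be sketched as follows. The helper name is invented, 1 G is assumed to be 1024 M, and the ten files are assumed to be equal in size (1.13 G each); the real partition count depends on how each file is actually split.

```python
import math


def partition_count(total_mb, partition_mb=128):
    """Number of partitions needed to hold total_mb at partition_mb each."""
    return math.ceil(total_mb / partition_mb)


# Treating the 11.3 G data set as one contiguous block.
whole_dataset = partition_count(11.3 * 1024)

# Splitting each of the 10 files separately (assumed 1.13 G per file).
per_file = partition_count(1.13 * 1024)
across_files = per_file * 10

# Average records per partition for 400 million records.
records_per_partition = 400_000_000 / across_files
```

Either way the job has on the order of a hundred partitions to spread over 8 nodes with 4 CPUs each.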

HDFS Vs. S3 (lower is better)

HDFS Vs. S3 Conclusions
HDFS | S3
Data locality → much higher throughput | Data is streamed → lower throughput
Need to maintain a Hadoop cluster | No Hadoop cluster to maintain (convenient)
Large data sets (TB +) | Good use case: smallish data sets (few gigs); load once, cache and re-use

3) SQL in Hadoop / Spark
Aspect | Hadoop | Spark
Engine | Hive | Spark SQL
Language | HiveQL | HiveQL; RDD programming in Java / Python / Scala
Scale | Petabytes | Terabytes ?
Interoperability | (none listed) | Can read Hive tables or stand-alone data
Formats | CSV, JSON, Parquet | CSV, JSON, Parquet

Spark SQL Vs. Hive
Fast on same
HDFS data !

4) ETL on Hadoop / Spark
Aspect | Hadoop | Spark
ETL tools | Pig, Cascading, Oozie | Native RDD programming (Scala, Java, Python)
Pig | High-level ETL workflow | Spork : Pig on Spark
Cascading | High level | Spark-scalding

4) ETL On Hadoop / Spark : Conclusions
 Try spork or spark-scalding
– Code re-use
– Not re-writing from scratch
 Program RDDs directly
– More flexible
– Multiple language support : Scala / Java / Python
– Simpler / faster in some cases
 Our experience of porting a financial application
– Tresata vs. RDD

5) Machine Learning : Hadoop / Spark
Aspect | Hadoop | Spark
Tool | Mahout | MLlib
API | Java | Java / Scala / Python
Iterative algorithms | Slower | Very fast (in memory)
In-memory processing | No | YES
Notes | Mahout runs on Hadoop or on Spark | New and young lib
Latest news! | Mahout only accepts new code that runs on Spark | Mahout & MLlib on Spark
Future? | Many opinions

Our experience, legal (eDiscovery)
 FreeEed (Hadoop)
– Scalable document processing
– All Enron docs in 1 hour (50-node Hadoop)
 3VEed (Storm, Spark)
– Allows dynamically adding data sources
• Use case: more data discovered for the same lawsuit
– Allows real-time data processing
• Use case: real-time emails
– Provides much improved load balancing
• Example: 10 GB PST mailbox
– Overall: a much better fit for modern data governance

Final Thoughts
 Already on Hadoop?
– Try Spark side-by-side
– Process some data in HDFS
– Try Spark SQL for Hive tables
 Contemplating Hadoop?
– Try Spark (standalone)
– Choose NFS or S3 file system
 Take advantage of caching
– Iterative loads
– Spark Job servers
– Tachyon
 Build new class of ‘big / medium data’ apps

Thanks !
http://elephantscale.com
Expert consulting & training in Big Data
(Now offering Spark training)

Spark Caching!
 Reading data from remote FS (S3) can be slow
 For small / medium data ( 10 – 100s of GB) use caching
– Pay read penalty once
– Cache
– Then very high speed computes (in memory)
– Recommended for iterative workloads
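The "pay the read penalty once" argument can be sketched as a toy cost model. It is deliberately simplistic: the function names are invented, the timings in the example are hypothetical rather than measured, and it ignores the cost of populating the cache and any memory pressure.

```python
def total_time_without_cache(read_s, compute_s, iterations):
    """Every iteration re-reads the data from remote storage."""
    return iterations * (read_s + compute_s)


def total_time_with_cache(read_s, compute_s, iterations):
    """Data is read once, cached, and then reused by every iteration."""
    return read_s + iterations * compute_s


# Hypothetical numbers: 60 s to read from S3, 2 s of compute, 10 iterations.
uncached = total_time_without_cache(read_s=60, compute_s=2, iterations=10)
cached = total_time_with_cache(read_s=60, compute_s=2, iterations=10)
speedup = uncached / cached
```

The gap widens as the number of iterations grows, which is why the recommendation targets iterative workloads in particular.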

Caching Results
Cached!

Spark Caching
 Caching is pretty effective (small / medium data sets)
 Cached data cannot be shared across applications
(each application executes in its own sandbox)

Sharing Cached Data
 1) ‘spark job server’
– Multiplexer
– All requests are executed through same ‘context’
– Provides web-service interface
 2) Tachyon
– Distributed In-memory file system
– Memory is the new disk!
– Out of AMPLab, Berkeley
– Early stages (very promising)

Spark Job Server

Spark Job Server
 Open sourced from Ooyala
 ‘Spark as a Service’ – simple REST interface to launch jobs
 Sub-second latency !
 Pre-load jars for even faster spin-up
 Share cached RDDs across requests (NamedRDD)
App1:
ctx.saveRDD("my cached rdd", rdd1)
App2:
RDD rdd2 = ctx.loadRDD("my cached rdd")
 https://github.com/spark-jobserver/spark-jobserver
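The idea behind sharing named data sets through a single long-lived context can be illustrated with a toy registry. This is a conceptual sketch in plain Python, not the spark-jobserver API: the class and method names are hypothetical, and a plain list stands in for an RDD.

```python
class SharedContext:
    """One long-lived context through which all requests are executed."""

    def __init__(self):
        self._named = {}

    def save(self, name, dataset):
        """Register a cached data set under a name."""
        self._named[name] = dataset

    def load(self, name):
        """Return the data set registered under name, without recomputing it."""
        if name not in self._named:
            raise KeyError(f"no cached data set named {name!r}")
        return self._named[name]


ctx = SharedContext()

# "App 1" computes something once and publishes it by name.
expensive_result = [x * x for x in range(5)]
ctx.save("my cached rdd", expensive_result)

# "App 2" arrives later and picks up the same object.
reused = ctx.load("my cached rdd")
```

Because both requests go through the same context, the second one gets the already-built object instead of triggering a new load, which is the effect the slide describes.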

Tachyon + Spark

Next : New Big Data Applications With Spark

Big Data Applications : Now
 Analysis is done in batch mode (minutes / hours)
 Final results are stored in a real time data store like
Cassandra / HBase
 These results are displayed in a dashboard / web UI
 Doing interactive analysis ????
– Need special BI tools

With Spark…
 Load data set (gigabytes) from S3 and cache it (one time)
 Super fast (sub-second) queries to data
 Response time : seconds (just like a web app !)

Lessons Learned
 Build sophisticated apps !
 Web-response-time (few seconds) !!
 In-depth analytics
– Leverage existing libraries in Java / Scala / Python
 ‘data analytics as a service’

www.synerzip.com
Ashish Shanker
Ashish.Shanker@synerzip.com
469.374.0500

Synerzip in a Nutshell
 Software product development partner for small/mid-sized technology
companies
• Exclusive focus on small/mid-sized technology companies, typically venture-
backed companies in growth phase
• By definition, all Synerzip work is the IP of its respective clients
• Deep experience in full SDLC – design, dev, QA/testing, deployment
 Dedicated team of high caliber software professionals for each client
• Seamlessly extends client’s local team offering full transparency
• Stable teams with very low turn-over
• NOT just “staff augmentation”, but provides full management support
 Actually reduces risk of development/delivery
• Experienced team – uses appropriate level of engineering discipline
• Practices Agile development – responsive yet disciplined
 Reduces cost – dual-site team, 50% cost advantage
 Offers long-term flexibility – allows (facilitates) taking offshore team
captive – aka “BOT” option

Synerzip Clients

Join Us In Person
Agile Texas 2015 Tour
Presented by
Hemant Elhence & Vinayak Joglekar

Next Webinar
7 Sins of Scrum and other Agile Anti-Patterns
Complimentary Webinar:
Tuesday, September 22, 2015 @ Noon CST
Presented by: Todd Little

Connect with Synerzip
@Synerzip_Agile
linkedin.com/company/synerzip
facebook.com/Synerzip
