Webinar: From Hadoop to Spark
Introduction
Hadoop and Spark Comparison
From Hadoop to Spark
www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved.
Webinar Objectives
• Intro: what is Hadoop and what is Spark?
• Spark's capabilities and advantages vs. Hadoop
• From Hadoop to Spark – how to?
Introduction
Hadoop in 20 Seconds
• ‘The’ Big Data platform
• Very well field-tested
• Scales to petabytes of data
• MapReduce: batch-oriented compute
Hadoop Ecosystem
[Figure: Hadoop ecosystem components, grouped into batch and real time]
Hadoop Ecosystem – by function
• HDFS
  – Provides distributed storage
• MapReduce
  – Provides distributed computing
• Pig
  – High-level dataflow language that compiles to MapReduce
• Hive
  – SQL layer over Hadoop
• HBase
  – NoSQL storage for real-time queries
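To make the MapReduce entry above concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases, using word count as the task. It is plain Python rather than the Hadoop API; the function names are made up for illustration, and a real cluster would run each phase in parallel across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Spark reads data"]))
```

Each phase only sees its own inputs, which is what lets a framework distribute the work.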
Spark in 20 Seconds
• Fast and expressive cluster computing engine
• Compatible with Hadoop
• Came out of Berkeley AMPLab
• Now an Apache project
• Version 1.3 just released (April 2015)
“First Big Data platform to integrate batch, streaming and interactive computations in a unified framework” – stratio.com
Spark Eco-System
Modules on top of Spark Core:
• Spark SQL: schema / SQL
• Spark Streaming: real time
• MLlib: machine learning
• GraphX: graph processing
Cluster managers: Standalone, YARN, Mesos
Hype-o-meter :)
Spark Job Trends
Spark Benchmarks
Source : stratio.com
Spark Code / Activity
Source : stratio.com
Timeline : Hadoop & Spark
Hadoop and Spark Comparison
Hadoop Vs. Spark
[Image: side-by-side pictures labeled Hadoop and Spark]
Source: http://www.kwigger.com/mit-skifte-til-mac/
Comparison With Hadoop
Hadoop | Spark
Distributed storage + distributed compute | Distributed compute only
MapReduce framework | Generalized computation
Usually data on disk (HDFS) | On disk / in memory
Not ideal for iterative work | Great at iterative workloads (machine learning, etc.)
Batch processing | Up to 10x faster for data on disk; up to 100x faster for data in memory
– | Compact code
– | Java, Python, Scala supported
– | Shell for ad-hoc exploration
Hadoop + YARN: OS for Distributed Compute
Applications: Batch (MapReduce), Streaming (Storm, S4), In-memory (Spark)
Cluster management: YARN
Storage: HDFS
(or at least, that’s the idea)
Spark Is a Better Fit for Iterative Workloads
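A small sketch of what "iterative" means here: an algorithm such as gradient descent makes many full passes over the same data set. When every pass is a separate batch job that re-reads its input from disk, the read cost repeats on each pass; when the data stays in memory, it is paid once. The example below is plain Python with made-up data points, not Spark code, and only illustrates the access pattern.

```python
def fit_slope(points, learning_rate=0.01, iterations=200):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0
    n = len(points)
    for _ in range(iterations):
        # Every iteration makes a full pass over the same in-memory data.
        gradient = sum(2 * (w * x - y) * x for x, y in points) / n
        w -= learning_rate * gradient
    return w


# Made-up sample points that lie exactly on y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
slope = fit_slope(data)
```

With 200 iterations the data set is scanned 200 times, which is why keeping it in memory matters for this class of workload.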
Spark Programming Model
• More generic than MapReduce
Is Spark Replacing Hadoop?
• Spark runs on Hadoop / YARN
  – Complementary
• Spark's programming model is more flexible than MapReduce
• Spark is really great if the data fits in memory (a few hundred gigabytes)
• Spark is ‘storage agnostic’ (see next slide)
Spark & Pluggable Storage
Compute engine: Spark
Storage options: HDFS, Amazon S3, Cassandra, ???
Spark & Hadoop
Use case | Other | Spark
Batch processing | Hadoop’s MapReduce (Java, Pig, Hive) | Spark RDDs (Java / Scala / Python)
SQL querying | Hadoop: Hive | Spark SQL
Stream processing / real-time processing | Storm, Kafka | Spark Streaming
Machine learning | Mahout | Spark MLlib
Real-time lookups | NoSQL (HBase, Cassandra, etc.) | No Spark component, but Spark can query data in NoSQL stores
Hadoop & Spark Future ???
Going from Hadoop to Spark
Why Move From Hadoop to Spark?
• Spark is ‘easier’ than Hadoop
• ‘Friendlier’ for data scientists / analysts
  – Interactive shell
    • Fast development cycles
    • Ad-hoc exploration
• API supports multiple languages
  – Java, Scala, Python
• Great for small (gigabytes) to medium (100s of gigabytes) data
Spark: ‘Unified’ Stack
• Spark supports multiple programming models
  – MapReduce-style batch processing
  – Streaming / real-time processing
  – Querying via SQL
  – Machine learning
• All modules are tightly integrated
  – Facilitates rich applications
• Spark can be the only stack you need!
  – No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
Image: buymeposters.com
Migrating From Hadoop → Spark
Functionality | Hadoop | Spark
Distributed storage | HDFS | Cloud storage like Amazon S3, or NFS mounts
SQL querying | Hive | Spark SQL
ETL workflow | Pig | Spork (Pig on Spark); mix of Spark SQL
Machine learning | Mahout | MLlib
NoSQL DB | HBase | ???
Five Steps of Moving From Hadoop to Spark
1. Data size
2. File System
3. SQL
4. ETL
5. Machine Learning
Data Size : “You Don’t Have Big Data”
1) Data Size (T-shirt sizing)
[Figure: T-shirt sizes running from "< few GB" through 10 GB+, 100 GB+, 1 TB+, and 100 TB+ to PB+, with the ranges suited to Spark and to Hadoop marked]
Image credit: blog.trumpi.co.za
1) Data Size
• A lot of Spark adoption is at SMALL to MEDIUM scale
  – Good fit
  – Data might fit in memory!
  – Hadoop may be overkill
• Applications
  – Iterative workloads (machine learning, etc.)
  – Streaming
• Hadoop is still the preferred platform for TB+ data
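The sizing guidance above can be written down as a rough rule of thumb. The sketch below is only an illustration: the function name is hypothetical, and the cutoff of about 1 TB is an assumption chosen to match "TB+" on the slide, not a measured threshold.

```python
def suggest_platform(data_size_gb, cutoff_gb=1024):
    """Return a rough platform suggestion for a data set size in GB.

    cutoff_gb is an assumed boundary (about 1 TB) between "small to medium"
    and "TB+" data; adjust it for the memory actually available.
    """
    if data_size_gb < cutoff_gb:
        return "Spark"
    return "Hadoop"


print(suggest_platform(50))     # a data set of tens of gigabytes
print(suggest_platform(5000))   # a multi-terabyte data set
```

In practice the decision also depends on cluster memory and workload type, so treat the cutoff as a starting point.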
2) File System
• Hadoop = storage + compute
  Spark = compute only
  Spark needs a distributed FS
• File system choices for Spark
  – HDFS: Hadoop Distributed File System
    • Reliable
    • Good performance (data locality)
    • Field-tested for PB of data
  – S3: Amazon
    • Reliable cloud storage
    • Huge scale
  – NFS: Network File System (FS shared across machines)
Spark File Systems
File Systems For Spark
Property | HDFS | NFS | Amazon S3
Data locality | High (best) | Local enough | None (ok)
Throughput | High (best) | Medium (good) | Low (ok)
Latency | Low (best) | Low | High
Reliability | Very high (replicated) | Low | Very high
Cost | Varies | Varies | $30 / TB / month
File Systems Throughput Comparison
• Data: 10 GB+ (11.3 GB)
• Each file: ~1+ GB (x 10)
• 400 million records total
• Partition size: 128 MB
• On HDFS & S3
• Cluster:
  – 8 nodes on Amazon m3.xlarge (4 CPUs, 15 GB memory, 40 GB SSD)
  – Hadoop cluster: latest Hortonworks HDP v2.2
  – Spark: on the same 8 nodes, standalone, v1.2
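As a back-of-the-envelope check on this setup, the arithmetic below estimates how many partitions and task waves the test data implies. The input numbers come from the list above; the assumptions (1 GB = 1024 MB, one task per core, every core available to the job) are mine, not stated in the talk.

```python
import math

DATA_SIZE_GB = 11.3          # total data size from the benchmark description
PARTITION_SIZE_MB = 128      # partition size from the benchmark description
TOTAL_RECORDS = 400_000_000  # record count from the benchmark description
NODES = 8
CORES_PER_NODE = 4           # m3.xlarge, as listed above


def estimate_partitions(size_gb, partition_mb):
    """Number of partitions, assuming 1 GB = 1024 MB."""
    return math.ceil(size_gb * 1024 / partition_mb)


partitions = estimate_partitions(DATA_SIZE_GB, PARTITION_SIZE_MB)
records_per_partition = TOTAL_RECORDS // partitions
# Assumption: one task per core, with all 32 cores free for the job.
task_waves = math.ceil(partitions / (NODES * CORES_PER_NODE))

print(partitions, records_per_partition, task_waves)
```

Under these assumptions the data splits into roughly 91 partitions of about 4.4 million records each, processed in three waves on the 32 cores.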
HDFS Vs. S3 (lower is better)
HDFS Vs. S3 Conclusions
HDFS | S3
Data locality → much higher throughput | Data is streamed → lower throughput
Need to maintain a Hadoop cluster | No Hadoop cluster to maintain → convenient
Large data sets (TB+) | Good use case: smallish data sets (a few gigabytes); load once, cache, and re-use
3) SQL in Hadoop / Spark
Aspect | Hadoop | Spark
Engine | Hive | Spark SQL
Language | HiveQL | HiveQL; RDD programming in Java / Python / Scala
Scale | Petabytes | Terabytes?
Interoperability | – | Can read Hive tables or standalone data
Formats | CSV, JSON, Parquet | CSV, JSON, Parquet
Spark SQL Vs. Hive
Fast on the same HDFS data!
4) ETL on Hadoop / Spark
Aspect | Hadoop | Spark
ETL tools | Pig, Cascading, Oozie | Native RDD programming (Scala, Java, Python)
Pig | High-level ETL workflow | Spork: Pig on Spark
Cascading | High-level | spark-scalding
4) ETL On Hadoop / Spark: Conclusions
• Try Spork or spark-scalding
  – Code re-use
  – No rewriting from scratch
• Program RDDs directly
  – More flexible
  – Multiple language support: Scala / Java / Python
  – Simpler / faster in some cases
• Our experience of porting a financial application
  – Tresata vs. RDD
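To give a feel for the "program directly" style, here is a sketch of a tiny extract-transform-load pipeline written as a chain of map, filter, and aggregate steps. It uses plain Python built-ins rather than the Spark RDD API, and the record layout (date, account, amount) and sample lines are invented for illustration.

```python
from collections import defaultdict

RAW_LINES = [  # made-up sample input
    "2015-04-01,ACME,100.50",
    "2015-04-01,GLOBEX,20.00",
    "bad line",
    "2015-04-02,ACME,49.50",
]


def parse(line):
    """Extract: turn a CSV line into (account, amount), or None if malformed."""
    parts = line.split(",")
    if len(parts) != 3:
        return None
    try:
        return parts[1], float(parts[2])
    except ValueError:
        return None


def run_etl(lines):
    """Transform and load: drop bad records, then total amounts per account."""
    records = filter(None, map(parse, lines))
    totals = defaultdict(float)
    for account, amount in records:
        totals[account] += amount
    return dict(totals)


print(run_etl(RAW_LINES))
```

The same shape (parse, filter, aggregate by key) is what a pipeline written against a distributed collection would express, with the framework handling the distribution.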
5) Machine Learning: Hadoop / Spark
Aspect | Hadoop | Spark
Tool | Mahout | MLlib
API | Java | Java / Scala / Python
Iterative algorithms | Slower | Very fast (in memory)
In-memory processing | No | Yes
Notes | Mahout runs on Hadoop or on Spark | New and young library
Latest news! | Mahout only accepts new code that runs on Spark | Mahout & MLlib on Spark
Future? Many opinions
Our experience, legal (eDiscovery)
FreeEed (Hadoop)
• Scalable document processing
• All Enron docs in 1 hour (50-node Hadoop)
3VEed (Storm, Spark)
• Allows dynamically adding data sources
  – Use case: more data discovered for the same lawsuit
• Allows real-time data processing
  – Use case: real-time emails
• Provides much improved load balancing
  – Example: 10 GB PST mailbox
• Overall: a much better fit for modern data governance
Final Thoughts
• Already on Hadoop?
  – Try Spark side-by-side
  – Process some data in HDFS
  – Try Spark SQL for Hive tables
• Contemplating Hadoop?
  – Try Spark (standalone)
  – Choose NFS or S3 file system
• Take advantage of caching
  – Iterative loads
  – Spark job servers
  – Tachyon
• Build a new class of ‘big / medium data’ apps
Thanks !
http://elephantscale.com
Expert consulting & training in Big Data
(Now offering Spark training)
Spark Caching!
• Reading data from a remote FS (S3) can be slow
• For small / medium data (10s to 100s of GB), use caching
  – Pay the read penalty once
  – Cache
  – Then very high-speed computes (in memory)
  – Recommended for iterative workloads
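The "pay once, then compute in memory" pattern can be sketched in a few lines. This is plain Python, not the Spark caching API: `CachedDataset` and `slow_remote_read` are hypothetical names, and the sleep merely simulates the latency of a remote read.

```python
import time


class CachedDataset:
    """Loads records from a slow source once, then serves them from memory."""

    def __init__(self, loader):
        self._loader = loader   # callable that performs the slow read
        self._records = None
        self.load_count = 0     # how many times the slow read actually ran

    def records(self):
        if self._records is None:   # first access pays the read penalty
            self._records = list(self._loader())
            self.load_count += 1
        return self._records


def slow_remote_read():
    """Hypothetical stand-in for reading a file from remote storage."""
    time.sleep(0.05)   # simulated network latency
    return range(1000)


dataset = CachedDataset(slow_remote_read)
total = sum(dataset.records())                           # slow: triggers the load
evens = sum(1 for r in dataset.records() if r % 2 == 0)  # fast: served from memory
```

However many computations follow, the slow read runs only once, which is the property that makes caching attractive for iterative work.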
Caching Results
Cached!
Spark Caching
• Caching is pretty effective (small / medium data sets)
• Cached data cannot be shared across applications (each application executes in its own sandbox)
Sharing Cached Data
• 1) ‘Spark job server’
  – Multiplexer
  – All requests are executed through the same ‘context’
  – Provides a web-service interface
• 2) Tachyon
  – Distributed in-memory file system
  – Memory is the new disk!
  – Out of AMPLab, Berkeley
  – Early stages (very promising)
Spark Job Server
Spark Job Server
• Open sourced from Ooyala
• ‘Spark as a Service’: simple REST interface to launch jobs
• Sub-second latency!
• Pre-load jars for even faster spin-up
• Share cached RDDs across requests (NamedRDD)
App1:
ctx.saveRDD("my cached rdd", rdd1)
App2:
RDD rdd2 = ctx.loadRDD("my cached rdd")
• https://github.com/spark-jobserver/spark-jobserver
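The idea behind sharing by name can be illustrated with a plain Python sketch: one long-lived context holds data sets in memory, and separate requests save and load them by name. `SharedContext` and its methods are hypothetical and mirror the slide's pseudocode; they are not the spark-jobserver API.

```python
class SharedContext:
    """Hypothetical long-lived context that keeps named data sets in memory."""

    def __init__(self):
        self._named = {}

    def save(self, name, dataset):
        """Register a data set under a name so later requests can find it."""
        self._named[name] = dataset

    def load(self, name):
        """Return a previously saved data set, or fail loudly if it is missing."""
        if name not in self._named:
            raise KeyError(f"no cached data set named {name!r}")
        return self._named[name]


ctx = SharedContext()   # one context serves every request


def request_one(context):
    """First request: computes a data set and caches it by name."""
    context.save("my cached rdd", [1, 2, 3, 4])


def request_two(context):
    """Second request: reuses the cached data set without recomputing it."""
    return sum(context.load("my cached rdd"))


request_one(ctx)
result = request_two(ctx)
```

Sharing works here only because both requests go through the same context, which is the point the slide makes about routing all requests through one ‘context’.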
Tachyon + Spark
Next : New Big Data Applications With Spark
Big Data Applications: Now
• Analysis is done in batch mode (minutes / hours)
• Final results are stored in a real-time data store like Cassandra / HBase
• These results are displayed in a dashboard / web UI
• Doing interactive analysis?
  – Needs special BI tools
With Spark…
• Load the data set (gigabytes) from S3 and cache it (one time)
• Super fast (sub-second) queries on the data
• Response time: seconds (just like a web app!)
Lessons Learned
• Build sophisticated apps!
• Web response time (a few seconds)!
• In-depth analytics
  – Leverage existing libraries in Java / Scala / Python
• ‘Data analytics as a service’
www.synerzip.com
Ashish Shanker
Ashish.Shanker@synerzip.com
469.374.0500
Synerzip in a Nutshell
• Software product development partner for small/mid-sized technology companies
  – Exclusive focus on small/mid-sized technology companies, typically venture-backed companies in growth phase
  – By definition, all Synerzip work is the IP of its respective clients
  – Deep experience in full SDLC: design, dev, QA/testing, deployment
• Dedicated team of high-caliber software professionals for each client
  – Seamlessly extends client’s local team, offering full transparency
  – Stable teams with very low turnover
  – NOT just “staff augmentation”, but provides full management support
• Actually reduces risk of development/delivery
  – Experienced team: uses appropriate level of engineering discipline
  – Practices Agile development: responsive yet disciplined
• Reduces cost: dual-site team, 50% cost advantage
• Offers long-term flexibility: allows (facilitates) taking offshore team captive, aka “BOT” option
Synerzip Clients
Join Us In Person
Agile Texas 2015 Tour
Presented by
Hemant Elhence & Vinayak Joglekar
Next Webinar
7 Sins of Scrum and other Agile Anti-Patterns
Complimentary Webinar:
Tuesday, September 22, 2015 @ Noon CST
Presented by: Todd Little
IHM
Connect with Synerzip
@Synerzip_Agile
linkedin.com/company/synerzip
facebook.com/Synerzip
Insight on "From Hadoop to Spark" by Mark Kerzner

  • 1. Webinar: From Hadoop to Spark Introduction Hadoop and Spark Comparison From Hadoop to Spark
  • 2. 2 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Webinar Objectives  Intro: what is Hadoop and what is Spark?  Spark's capabilities and advantages vs Hadoop  From Hadoop to Spark – how to? 2
  • 3. Introduction Introduction Hadoop and Spark Comparison From Hadoop to Spark
  • 4. 4 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Hadoop in 20 Seconds  ‘The’ Big data platform  Very well field tested  Scales to peta-bytes of data  MapReduce : Batch oriented compute
  • 5. 5 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Hadoop Eco System BatchReal Time
  • 6. 6 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Hadoop Ecosystem – by function  HDFS – provides distributed storage  Map Reduce – Provides distributed computing  Pig – High level MapReduce  Hive – SQL layer over Hadoop  HBase – NoSQL storage for real-time queries
  • 7. 7 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Spark in 20 Seconds  Fast & Expressive Cluster computing engine  Compatible with Hadoop  Came out of Berkeley AMP Lab  Now Apache project  Version 1.3 just released (April 2015) “First Big Data platform to integrate batch, streaming and interactive computations in a unified framework” – stratio.com
  • 8. 8 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Spark Eco-System Spark Core Spark SQL Spark Streaming ML lib Schema / sql Real Time Machine Learning Stand alone YARN MESOS Cluster managers GraphX Graph processing
  • 9. 9 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Hypo-meter 
  • 10. 10 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Spark Job Trends
  • 11. 11 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Spark Benchmarks Source : stratio.com
  • 12. 12 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Spark Code / Activity © Source : stratio.com
  • 13. 13 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Timeline : Hadoop & Spark
  • 14. Hadoop and Spark Comparison Introduction Hadoop and Spark Comparison Going from Hadoop to Spark Session 2: Introduction to Spark
  • 15. 15 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Hadoop Vs. Spark Hadoop Spark Source : http://www.kwigger.com/mit-skifte-til-mac/
  • 16. 16 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Comparison With Hadoop Hadoop Spark Distributed Storage + Distributed Compute Distributed Compute Only MapReduce framework Generalized computation Usually data on disk (HDFS) On disk / in memory Not ideal for iterative work Great at Iterative workloads (machine learning ..etc) Batch process - Up 10x faster for data on disk - Up to 100x faster for data in memory Compact code Java, Python, Scala supported Shell for ad-hoc exploration
  • 17. 17 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Hadoop + Yarn : OS for Distributed Compute HDFS YARN Batch (mapreduce) Streaming (storm, S4) In-memory (spark) Storage Cluster Management Applications (or at least, that’s the idea)
  • 18. 18 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Spark Is Better Fit for Iterative Workloads
  • 19. 19 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Spark Programming Model  More generic than MapReduce
  • 20. 20 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Is Spark Replacing Hadoop?  Spark runs on Hadoop / YARN – Complimentary  Spark programming model is more flexible than MapReduce  Spark is really great if data fits in memory (few hundred gigs),  Spark is ‘storage agnostic’ (see next slide)
  • 21. 21 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Spark & Pluggable Storage Spark (compute engine) HDFS Amazon S3 Cassandra ???
  • 22. 22 www.synerzip.com Webinar Series 2015 Copyright © 2015 Elephant Scale LLC. All rights reserved. Spark & Hadoop Use Case Other Spark Batch processing Hadoop’s MapReduce (Java, Pig, Hive) Spark RDDs (java / scala / python) SQL querying Hadoop : Hive Spark SQL Stream Processing / Real Time processing Storm Kafka Spark Streaming Machine Learning Mahout Spark ML Lib Real time lookups NoSQL (Hbase, Cassandra ..etc) No Spark component. But Spark can query data in NoSQL stores
  • 23. Hadoop & Spark Future ???
  • 24. Going from Hadoop to Spark Introduction Hadoop and Spark Comparison Going from Hadoop to Spark Session 2: Introduction to Spark
  • 25. Why Move From Hadoop to Spark?
      ▪ Spark is ‘easier’ than Hadoop
      ▪ ‘Friendlier’ for data scientists / analysts
        – Interactive shell
          • Fast development cycles
          • Ad-hoc exploration
      ▪ API supports multiple languages
        – Java, Scala, Python
      ▪ Great for small (gigs) to medium (100s of gigs) data
  • 26. Spark: ‘Unified’ Stack
      ▪ Spark supports multiple programming models
        – MapReduce-style batch processing
        – Streaming / real-time processing
        – Querying via SQL
        – Machine learning
      ▪ All modules are tightly integrated
        – Facilitates rich applications
      ▪ Spark can be the only stack you need!
        – No need to run multiple clusters (Hadoop cluster, Storm cluster, etc.)
      Image: buymeposters.com
  • 27. Migrating From Hadoop → Spark
      Functionality | Hadoop | Spark
      Distributed storage | HDFS | Cloud storage like Amazon S3, or NFS mounts
      SQL querying | Hive | Spark SQL
      ETL workflow | Pig | Spork (Pig on Spark); mix of Spark SQL
      Machine learning | Mahout | MLlib
      NoSQL DB | HBase | ???
  • 28. Five Steps of Moving From Hadoop to Spark
      1. Data size
      2. File system
      3. SQL
      4. ETL
      5. Machine learning
  • 29. Data Size: “You Don’t Have Big Data”
  • 30. 1) Data Size (T-shirt sizing)
      [Figure: size scale running < few G, 10 G+, 100 G+, 1 TB+, 100 TB+, PB+, labelled with Hadoop and Spark. Image credit: blog.trumpi.co.za]
  • 31. 1) Data Size
      ▪ Lots of Spark adoption at SMALL – MEDIUM scale
        – Good fit
        – Data might fit in memory!
        – Hadoop may be overkill
      ▪ Applications
        – Iterative workloads (machine learning, etc.)
        – Streaming
      ▪ Hadoop is still the preferred platform for TB+ data
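
The sizing rule of thumb on these slides can be written down as a small helper. This is only a sketch: the function names and the exact cut-offs (100 GB for "medium", 1024 GB for "TB+") are assumptions chosen to match the wording "small (gigs) to medium (100s of gigs)" and "TB+", not thresholds given by the presenters.

```python
GB_PER_TB = 1024  # assumed conversion; the slides do not say which convention they use


def size_tier(data_size_gb):
    """Classify a data set with the T-shirt sizes used on the slides (cut-offs assumed)."""
    if data_size_gb < 0:
        raise ValueError("data size must be non-negative")
    if data_size_gb < 100:
        return "small"    # gigabytes
    if data_size_gb < GB_PER_TB:
        return "medium"   # hundreds of gigabytes
    return "large"        # terabytes and up


def suggest_platform(data_size_gb):
    """Spark for small and medium data, Hadoop for TB+ data, per the rule of thumb."""
    return "hadoop" if size_tier(data_size_gb) == "large" else "spark"
```

For example, the 11.3 G benchmark data set used later in the deck falls in the "small" tier under these assumed cut-offs.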
  • 32. 2) File System
      ▪ Hadoop = storage + compute
        Spark = compute only
        Spark needs a distributed FS
      ▪ File system choices for Spark
        – HDFS: Hadoop Distributed File System
          • Reliable
          • Good performance (data locality)
          • Field tested for PB of data
        – S3: Amazon
          • Reliable cloud storage
          • Huge scale
        – NFS: Network File System (FS shared across machines)
  • 33. Spark File Systems
  • 34. File Systems For Spark
      | HDFS | NFS | Amazon S3
      Data locality | High (best) | Local enough | None (ok)
      Throughput | High (best) | Medium (good) | Low (ok)
      Latency | Low (best) | Low | High
      Reliability | Very high (replicated) | Low | Very high
      Cost | Varies | Varies | $30 / TB / month
  • 35. File Systems Throughput Comparison
      ▪ Data: 10 G+ (11.3 G)
      ▪ Each file: ~1+ G (x 10)
      ▪ 400 million records total
      ▪ Partition size: 128 M
      ▪ On HDFS & S3
      ▪ Cluster:
        – 8 nodes on Amazon m3.xlarge (4 CPU, 15 G mem, 40 G SSD)
        – Hadoop cluster, latest Hortonworks HDP v2.2
        – Spark: on same 8 nodes, stand-alone, v1.2
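
To get a feel for the benchmark setup, the partition count can be estimated from the figures on the slide. This is a back-of-the-envelope sketch, assuming 1 G = 1024 M and treating the data as one contiguous stream; the function name is illustrative, and the real number of partitions depends on how the ten files are split.

```python
import math


def partition_count(total_gb, partition_mb=128):
    """Fixed-size partitions needed to cover the data (assumes 1 G = 1024 M)."""
    return math.ceil(total_gb * 1024 / partition_mb)


# Figures taken from the slide: 11.3 G of data, 128 M partitions, 400 million records.
partitions = partition_count(11.3)
records_per_partition = 400_000_000 // partitions
print(partitions, records_per_partition)
```

Under these assumptions the data set works out to roughly 91 partitions of a few million records each, spread over 8 nodes with 4 CPUs apiece.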
  • 36. HDFS Vs. S3 (lower is better)
  • 37. HDFS Vs. S3 Conclusions
      HDFS | S3
      Data locality → much higher throughput | Data is streamed → lower throughput
      Need to maintain a Hadoop cluster | No Hadoop cluster to maintain → convenient
      Large data sets (TB+) | Good use case: smallish data sets (few gigs); load once, cache and re-use
  • 38. 3) SQL in Hadoop / Spark
      | Hadoop | Spark
      Engine | Hive | Spark SQL
      Language | HiveQL | HiveQL; RDD programming in Java / Python / Scala
      Scale | Petabytes | Terabytes ?
      Interoperability | | Can read Hive tables or stand-alone data
      Formats | CSV, JSON, Parquet | CSV, JSON, Parquet
  • 39. Spark SQL Vs. Hive
      Fast on same HDFS data!
  • 40. 4) ETL on Hadoop / Spark
      | Hadoop | Spark
      ETL tools | Pig, Cascading, Oozie | Native RDD programming (Scala, Java, Python)
      Pig | High-level ETL workflow | Spork: Pig on Spark
      Cascading | High level | Spark-scalding
  • 41. 4) ETL On Hadoop / Spark: Conclusions
      ▪ Try Spork or spark-scalding
        – Code re-use
        – Not re-writing from scratch
      ▪ Program RDDs directly
        – More flexible
        – Multiple language support: Scala / Java / Python
        – Simpler / faster in some cases
      ▪ Our experience of porting a financial application
        – Tresata vs. RDD
  • 42. 5) Machine Learning: Hadoop / Spark
      | Hadoop | Spark
      Tool | Mahout | MLlib
      API | Java | Java / Scala / Python
      Iterative algorithms | Slower | Very fast (in memory)
      In-memory processing | No | Yes
      | Mahout runs on Hadoop or on Spark | New and young lib
      Latest news! | Mahout only accepts new code that runs on Spark | Mahout & MLlib on Spark
      Future? | Many opinions |
  • 43. Our experience, legal (eDiscovery)
      FreeEed (Hadoop)
        – Scalable document processing
        – All Enron docs in 1 hour (50-node Hadoop)
      3VEed (Storm, Spark)
        – Allows dynamically adding data sources
          • Use case: more data discovered for the same lawsuit
        – Allows real-time data processing
          • Use case: real-time emails
        – Provides much improved load balancing
          • Example: 10 GB PST mailbox
        – Overall: a much better fit for modern data governance
  • 44. Final Thoughts
      ▪ Already on Hadoop?
        – Try Spark side-by-side
        – Process some data in HDFS
        – Try Spark SQL for Hive tables
      ▪ Contemplating Hadoop?
        – Try Spark (standalone)
        – Choose NFS or S3 file system
      ▪ Take advantage of caching
        – Iterative loads
        – Spark Job servers
        – Tachyon
      ▪ Build new class of ‘big / medium data’ apps
  • 45. Thanks!
      http://elephantscale.com
      Expert consulting & training in Big Data (now offering Spark training)
  • 46. Spark Caching!
      ▪ Reading data from a remote FS (S3) can be slow
      ▪ For small / medium data (10 – 100s of GB) use caching
        – Pay read penalty once
        – Cache
        – Then very high speed computes (in memory)
        – Recommended for iterative workloads
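
The "pay the read penalty once, then compute in memory" pattern can be illustrated without a cluster. The sketch below is a toy model in plain Python, not Spark code: `CachedDataset` and `slow_remote_read` are hypothetical names, and the sleep merely stands in for a slow read from S3. In Spark itself the corresponding step is calling `cache()` (or `persist()`) on an RDD before iterating over it.

```python
import time


class CachedDataset:
    """Toy stand-in for a cached data set: load once, then serve from memory."""

    def __init__(self, loader):
        self._loader = loader   # function that performs the slow remote read
        self._data = None       # nothing held in memory yet
        self.remote_reads = 0   # counts how often the read penalty was paid

    def get(self):
        if self._data is None:  # only the first access goes to remote storage
            self._data = list(self._loader())
            self.remote_reads += 1
        return self._data


def slow_remote_read():
    """Hypothetical loader standing in for a read from remote storage."""
    time.sleep(0.01)            # simulated network latency
    return range(1000)


dataset = CachedDataset(slow_remote_read)
# Five passes over the data, as an iterative workload would do.
totals = [sum(dataset.get()) for _ in range(5)]
```

All five passes see the same data, but only the first one touches the slow loader, which is why caching pays off most for iterative workloads.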
  • 47. Caching Results
      Cached!
  • 48. Spark Caching
      ▪ Caching is pretty effective (small / medium data sets)
      ▪ Cached data cannot be shared across applications (each application executes in its own sandbox)
  • 49. Sharing Cached Data
      ▪ 1) ‘Spark job server’
        – Multiplexer
        – All requests are executed through the same ‘context’
        – Provides web-service interface
      ▪ 2) Tachyon
        – Distributed in-memory file system
        – Memory is the new disk!
        – Out of AMP Lab, Berkeley
        – Early stages (very promising)
  • 50. Spark Job Server
  • 51. Spark Job Server
      ▪ Open sourced from Ooyala
      ▪ ‘Spark as a Service’ – simple REST interface to launch jobs
      ▪ Sub-second latency!
      ▪ Pre-load jars for even faster spinup
      ▪ Share cached RDDs across requests (NamedRDD)
          App1: ctx.saveRDD("my cached rdd", rdd1)
          App2: RDD rdd2 = ctx.loadRDD("my cached rdd")
      ▪ https://github.com/spark-jobserver/spark-jobserver
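
The idea of sharing cached data by routing every request through one long-lived context can be sketched as a simple name-to-data registry. This is a toy model in plain Python: the class and method names (`SharedContext`, `save`, `load`) are hypothetical and are not the spark-jobserver API; they only mirror the App1 / App2 shorthand on the slide.

```python
class SharedContext:
    """Toy model of one long-lived context that every request goes through."""

    def __init__(self):
        self._named = {}  # name -> data set kept in memory

    def save(self, name, dataset):
        """Register a data set under a name so later requests can find it."""
        self._named[name] = dataset

    def load(self, name):
        """Return a previously saved data set, or fail loudly if it is missing."""
        if name not in self._named:
            raise KeyError(f"no cached data set named {name!r}")
        return self._named[name]


def app1(context):
    """First request: compute something and leave it in the shared context."""
    context.save("my cached rdd", [1, 2, 3, 4])


def app2(context):
    """Second request: reuse the data without recomputing or reloading it."""
    return sum(context.load("my cached rdd"))


ctx = SharedContext()
app1(ctx)
result = app2(ctx)
```

Because both requests run against the same context object, the second one can reuse what the first one left behind, which is not possible when each application holds its own isolated cache.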
  • 52. Tachyon + Spark
  • 53. Next: New Big Data Applications With Spark
  • 54. Big Data Applications: Now
      ▪ Analysis is done in batch mode (minutes / hours)
      ▪ Final results are stored in a real-time data store like Cassandra / HBase
      ▪ These results are displayed in a dashboard / web UI
      ▪ Doing interactive analysis ????
        – Need special BI tools
  • 55. With Spark…
      ▪ Load data set (gigabytes) from S3 and cache it (one time)
      ▪ Super fast (sub-second) queries to data
      ▪ Response time: seconds (just like a web app!)
  • 56. Lessons Learned
      ▪ Build sophisticated apps!
      ▪ Web response time (few seconds)!
      ▪ In-depth analytics
        – Leverage existing libraries in Java / Scala / Python
      ▪ ‘Data analytics as a service’
  • 57. www.synerzip.com Ashish Shanker [email protected] 469.374.0500
  • 58. Synerzip in a Nutshell
      ▪ Software product development partner for small/mid-sized technology companies
        • Exclusive focus on small/mid-sized technology companies, typically venture-backed companies in growth phase
        • By definition, all Synerzip work is the IP of its respective clients
        • Deep experience in full SDLC – design, dev, QA/testing, deployment
      ▪ Dedicated team of high caliber software professionals for each client
        • Seamlessly extends client’s local team, offering full transparency
        • Stable teams with very low turn-over
        • NOT just “staff augmentation”, but provides full management support
      ▪ Actually reduces risk of development/delivery
        • Experienced team – uses appropriate level of engineering discipline
        • Practices Agile development – responsive yet disciplined
      ▪ Reduces cost – dual-site team, 50% cost advantage
      ▪ Offers long-term flexibility – allows (facilitates) taking offshore team captive – aka “BOT” option
  • 59. Synerzip Clients
  • 60. Join Us In Person
      Agile Texas 2015 Tour
      Presented by Hemant Elhence & Vinayak Joglekar
  • 61. Next Webinar
      7 Sins of Scrum and other Agile Anti-Patterns
      Complimentary webinar: Tuesday, September 22, 2015 @ noon CST
      Presented by: Todd Little IHM
  • 62. Ashish Shanker [email protected] 469.374.0500
      Connect with Synerzip: @Synerzip_Agile, linkedin.com/company/synerzip, facebook.com/Synerzip

Editor's Notes

  • #18: Hadoop is evolving into a platform for other distributed applications
  • #19: In Hadoop, data has to be persisted in HDFS between jobs. In Spark, it can be kept in memory.
  • #22: Spark can work with lots of storage types
  • #26: You can use Python libraries for machine learning, etc.
  • #28: It is possible to go from Hadoop to Spark. Consider the alternatives.
  • #43: TODO: our experience.
      Ted Dunning: Mahout is true and verified, and focussed; MLlib is more of a loose collection.
      Frank Dai (Spark contributor): Mahout will concentrate on machine learning and have a rich set of algorithms, while MLlib will adopt only the most essential and mature algorithms.