Introduction to Apache Spark

Certified Apache Spark and Scala Training – DataFlair

Agenda
• Before Spark
• Need for Spark
• What is Apache Spark?
• Goals
• Why Spark?
• RDD & its Operations
• Features of Spark

Before Spark
Each workload needed its own separate engine:
• Batch processing
• Stream processing
• Interactive processing
• Graph processing
• Machine learning

Need For Spark
• Need for a powerful engine that can process data in real time (streaming) as well as in batch mode
• Need for a powerful engine that can respond in under a second and perform in-memory analytics
• Need for a powerful engine that can handle diverse workloads:
– Batch
– Streaming
– Interactive
– Graph
– Machine Learning

What is Apache Spark?
Apache Spark is a powerful open-source engine that can handle:
– Batch processing
– Real-time (stream) processing
– Interactive processing
– Graph processing
– Machine learning (iterative)
– In-memory processing

Introduction to Apache Spark
• Lightning-fast cluster computing tool
• General-purpose distributed system
• Provides APIs in Scala, Java, Python, and R

History
• 2009: Introduced at UC Berkeley
• 2010: Open sourced
• 2013: Donated to Apache
• 2014: Became a top-level Apache project
• 2014: Set a world record in sorting
• 2015: Most active project at Apache

Sort Record
                          Hadoop MapReduce     Spark
Data size                 102.5 TB             100 TB
Time taken                72 min               23 min
Number of nodes           2100                 206
Number of cores           50400 physical       6592 virtualized
Cluster disk throughput   3150 GB/s            618 GB/s
Network                   Dedicated 10 Gbps    Virtualized 10 Gbps
Src: Databricks
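The sort-record figures above can be compared with a little arithmetic. The sketch below only reuses the data size, time, and node counts from the table; the helper names are illustrative and not part of any Spark API.

```python
# Figures copied from the sort-record table (Src: Databricks).
hadoop = {"data_tb": 102.5, "minutes": 72, "nodes": 2100}
spark = {"data_tb": 100.0, "minutes": 23, "nodes": 206}


def tb_per_minute(run):
    """Overall sort throughput of the whole cluster, in TB per minute."""
    return run["data_tb"] / run["minutes"]


def tb_per_node_minute(run):
    """Sort throughput normalised by the number of nodes."""
    return tb_per_minute(run) / run["nodes"]


time_ratio = hadoop["minutes"] / spark["minutes"]  # how much less time Spark took
node_ratio = hadoop["nodes"] / spark["nodes"]      # how many fewer nodes Spark used
per_node_ratio = tb_per_node_minute(spark) / tb_per_node_minute(hadoop)

print(f"time ratio: {time_ratio:.2f}")
print(f"node ratio: {node_ratio:.2f}")
print(f"per-node throughput ratio: {per_node_ratio:.1f}")
```

In other words, Spark finished roughly 3x sooner while using roughly 10x fewer nodes. The two clusters differ in hardware and networking, so the per-node ratio is only a rough indication.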

Goals
• One stack to rule them all: batch, streaming, and interactive
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with the existing open-source ecosystem

Why Spark?
• Up to 100x faster than Hadoop MapReduce for in-memory workloads.

Why Spark?
• In-memory computation.
[Figure: Hadoop MapReduce goes through disk between operations (Disk → Operation 1 → Disk → Operation 2 → Disk → … → Operation n → Disk); Spark chains operations in memory (Disk → Operation 1 → Operation 2 → … → Operation n → Disk)]

Why Spark?
• APIs in multiple languages: Scala, Java, Python, and R.

Why Spark?
• Supports both real-time and batch processing.
[Figure: Input data stream → Spark Streaming → batches of input data → Spark Engine → batches of processed data]
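The flow in the figure (an input stream cut into batches, each batch handed to the engine) can be sketched in plain Python. This is a toy model, not Spark Streaming code: `micro_batches` and `process_batch` are hypothetical names, and batches are cut by record count for simplicity, whereas Spark Streaming groups records by a time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an incoming stream of records into batches of at most batch_size."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Stand-in for the engine's batch job: here it just sums the records."""
    return sum(batch)


# Hypothetical input stream: the numbers 1 to 10.
input_stream = range(1, 11)

# Each batch of input data becomes one batch of processed data.
processed = [process_batch(b) for b in micro_batches(input_stream, batch_size=4)]
print(processed)
```

The design point is that the same batch logic (`process_batch`) serves both modes: a batch job is one large batch, and a stream is a sequence of small ones.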

Why Spark?
• Lazy operations: the job is optimized before execution.

Why Spark?
• Support for multiple transformations and actions.
[Figure: RDD1 → Transformation 1, map() → RDD2 → Transformation 2, filter() → RDD3 → Action, collect() → Result]

Why Spark?
• Compatible with Hadoop: can process existing Hadoop data.

Spark Architecture

Spark Nodes
• Master node: Master
• Slave nodes: Worker

Basic Spark Architecture
[Figure: a unit of Work split into many Sub Work units]

Resilient Distributed Dataset (RDD)
• An RDD is a simple, immutable collection of objects.
[Figure: an RDD containing Obj1, Obj2, Obj3, ..., Obj n]

• An RDD can contain any type of Scala, Java, Python, or R object.
• Each RDD is split into partitions, which may be computed on different nodes of the cluster.
[Figure: an RDD split into Partition1 through Partition6]
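To make the idea of partitions concrete, here is a minimal plain-Python sketch. It is not Spark code: `split_into_partitions` and `square_partition` are hypothetical helpers, and a thread pool stands in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into num_partitions contiguous, nearly equal chunks."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def square_partition(partition):
    """Work done independently on one partition."""
    return [x * x for x in partition]


records = list(range(12))
partitions = split_into_partitions(records, 6)

# Each partition is processed on its own; the workers stand in for cluster nodes.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(square_partition, partitions))

# Combine the per-partition results back into one collection.
squares = [x for part in results for x in part]
print(partitions)
print(squares)
```

Because no partition depends on another, the work can be spread across as many workers as there are partitions.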

Create RDD
[Figure: Employee-data.txt stored as blocks B1 to B12 on a Hadoop cluster; the blocks are read into an RDD as Partition-1, Partition-2, Partition-3, Partition-4, Partition-5, ...]

RDD Operations
• Transformations
• Actions
• Persistence

RDD Operations – Transformation
Transformation:
• A set of operations that define how an RDD should be transformed
• Creates a new RDD from the existing one to process the data
• Lazy evaluation: computation doesn't start until an action is invoked
• E.g. map, flatMap, filter, union, groupBy, etc.
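Lazy evaluation can be illustrated with a toy class in plain Python. This is a sketch of the idea only, not how Spark is implemented: `LazyDataset` is a hypothetical stand-in for an RDD whose `map` and `filter` merely record a step, and whose `collect` (the action) runs all recorded steps.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # pending transformations, in order

    def map(self, fn):
        # Transformation: returns a NEW dataset, nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a NEW dataset, nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        """Action: run every recorded step and return the results as a list."""
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records every time the map function actually runs


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyDataset([1, 2, 3, 4, 5])
doubled = numbers.map(double)             # no computation yet
large = doubled.filter(lambda x: x > 4)   # still no computation
calls_before_action = len(calls)

result = large.collect()                  # the action triggers the work
print(calls_before_action, result)
```

Each transformation leaves the original dataset untouched and returns a new one, which mirrors the immutability of RDDs described earlier.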

RDD Operations – Action
Action:
• Triggers job execution
• Returns the result to the driver or writes it to storage
• E.g. count, collect, reduce, take, etc.

RDD Operations – Persistence
Persistence:
• Spark allows caching (persisting) an entire dataset in memory
• Caches the RDD in memory for future operations
[Figure: primary storage and cache]
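The benefit of persistence can be shown with a small plain-Python model. It is an assumption-laden sketch, not Spark's implementation: `CachedDataset` is a hypothetical class, and `compute_count` exists only to show how often the underlying computation runs.

```python
class CachedDataset:
    """Toy dataset that recomputes on every action unless cache() was called."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the records
        self._persist = False
        self._cached = None
        self.compute_count = 0    # how many times the computation actually ran

    def cache(self):
        """Mark the dataset to be kept in memory after its first computation."""
        self._persist = True
        return self

    def collect(self):
        """Action: return the records, reusing the in-memory copy if present."""
        if self._cached is not None:
            return list(self._cached)
        self.compute_count += 1
        result = list(self._compute())
        if self._persist:
            self._cached = result
        return list(result)


def squares():
    return (x * x for x in range(5))


uncached = CachedDataset(squares)
uncached.collect()
uncached.collect()   # computed again from scratch

cached = CachedDataset(squares).cache()
first = cached.collect()
second = cached.collect()   # served from memory, no recomputation

print(uncached.compute_count, cached.compute_count)
```

Caching pays off when the same dataset feeds several later operations, as in iterative or interactive workloads.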

RDD Operations
• Transformations (map(), flatMap()…): create a new RDD from the parent RDD based on custom business logic
• Actions (saveAsTextFile(), count()…): return output to the driver or export data to a storage system after computation
[Figure: Parent RDD → Transformations → RDD → Actions → Result, with the lineage linking each RDD back to its parent]

Features of Spark
• Processing: diverse processing platform
• Memory management: automatic memory management
• Window criteria: time-based window criteria
• Fault tolerance: recovers automatically
• Duplicate elimination: processes every record exactly once
• Speed: up to 100x faster than Hadoop

Thank You
DataFlair