Certified Apache Spark and Scala Training – DataFlair
Introduction to Apache Spark
Agenda
• Before Spark
• Need for Spark
• What is Apache Spark?
• Goals
• Why Spark?
• RDD & its Operations
• Features of Spark
Before Spark
Each type of workload needed its own processing engine:
• Batch processing
• Stream processing
• Interactive processing
• Graph processing
• Machine learning
Need for Spark
• Need for a powerful engine that can process data in real time (streaming) as well as in batch mode
• Need for a powerful engine that can respond in sub-second time and perform in-memory analytics
• Need for a powerful engine that can handle diverse workloads:
– Batch
– Streaming
– Interactive
– Graph
– Machine Learning
What is Apache Spark?
Apache Spark is a powerful open source engine which can handle:
– Batch processing
– Real-time (stream) processing
– Interactive processing
– Graph processing
– Machine learning (iterative) processing
– In-memory processing
Introduction to Apache Spark
• Lightning-fast cluster computing tool
• General-purpose distributed system
• Provides APIs in Scala, Java, Python, and R
History
2009: Introduced by UC Berkeley
2010: Open sourced
2013: Donated to Apache
2014: Became a top-level Apache project
2014: World record in sorting
2015: Most active project at Apache
Sort Record
                          Hadoop MapReduce     Spark
Data size                 102.5 TB             100 TB
Time taken                72 min               23 min
No. of nodes              2100                 206
No. of cores              50400 physical       6592 virtualized
Cluster disk throughput   3150 GB/s            618 GB/s
Network                   Dedicated 10 Gbps    Virtualized 10 Gbps
Src: Databricks
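The figures in the sort record can be turned into a rough per-node comparison. The Python sketch below only does arithmetic on the numbers quoted above; the dictionary keys and function names are illustrative, and the calculation ignores hardware differences (physical versus virtualized cores), so the result is a ballpark ratio rather than a benchmark.

```python
# Figures copied from the Sort Record slide.
hadoop = {"data_tb": 102.5, "minutes": 72, "nodes": 2100}
spark = {"data_tb": 100.0, "minutes": 23, "nodes": 206}


def tb_per_minute(run):
    """Cluster-wide sort rate in TB per minute."""
    return run["data_tb"] / run["minutes"]


def tb_per_minute_per_node(run):
    """Sort rate normalized by the number of nodes."""
    return tb_per_minute(run) / run["nodes"]


time_ratio = hadoop["minutes"] / spark["minutes"]
node_ratio = hadoop["nodes"] / spark["nodes"]
per_node_ratio = tb_per_minute_per_node(spark) / tb_per_minute_per_node(hadoop)

print(f"Run time ratio (Hadoop / Spark): {time_ratio:.1f}")
print(f"Node count ratio (Hadoop / Spark): {node_ratio:.1f}")
print(f"Per-node sort rate ratio (Spark / Hadoop): {per_node_ratio:.1f}")
```

In other words, the Spark run took roughly a third of the time on roughly a tenth of the nodes, which works out to about a 31x higher sort rate per node for these two runs.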
Goals
One stack to rule them all: batch, streaming, and interactive
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with the existing open source ecosystem
Why Spark?
• Up to 100x faster than Hadoop MapReduce for in-memory workloads.
• In-memory computation: intermediate results stay in memory instead of being written to disk between operations.
  Disk-based (e.g. Hadoop MapReduce): Disk → Operation 1 → Disk → Operation 2 → Disk → … → Operation n → Disk
  In-memory (Spark): Disk → Operation 1 → Operation 2 → … → Operation n → Disk
• Language support: Scala, Java, Python, and R.
• Supports real-time and batch processing.
  Input data stream → Spark Streaming → batches of input data → Spark Engine → batches of processed data
• Lazy operations: Spark optimizes the job before execution.
• Support for multiple transformations and actions.
  RDD1 → Transformation 1: map() → RDD2 → Transformation 2: filter() → RDD3 → Action: collect → Result
• Compatible with Hadoop; can process existing Hadoop data.
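The streaming point in the list above (an input data stream cut into batches that the engine then processes) can be illustrated with a toy model in plain Python. This is not the Spark Streaming API: `micro_batches` and `process_batch` are hypothetical names, and the sketch cuts batches by record count for simplicity, whereas Spark Streaming forms batches by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of at most batch_size records from an iterable stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Stand-in for the batch engine: here it just upper-cases each record."""
    return [record.upper() for record in batch]


input_stream = ["a", "b", "c", "d", "e"]  # hypothetical incoming records
processed = [process_batch(batch) for batch in micro_batches(input_stream, 2)]
print(processed)
```

The same `process_batch` function could be applied to a stored dataset, which is the sense in which one engine serves both batch and streaming workloads.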
Spark Architecture
Spark Nodes
• Master node: Master
• Slave nodes: Worker
Basic Spark Architecture
Diagram: the Work is divided into many Sub Work units.
Resilient Distributed Dataset (RDD)
• An RDD is a simple, immutable collection of objects (Obj1, Obj2, Obj3, ..., Obj n).
• An RDD can contain any type of Scala, Java, Python, or R object.
• Each RDD is split into partitions (Partition1 to Partition6 in the diagram), which may be computed on different nodes of the cluster.
Resilient Distributed Dataset (RDD): Create RDD
Diagram: Employee-data.txt is stored on the Hadoop cluster as blocks B1 to B12; creating an RDD from it yields Partition-1, Partition-2, Partition-3, Partition-4, Partition-5, . . .
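The idea of partitions that can be computed independently can be sketched in plain Python. This is a toy model, not Spark code: `split_into_partitions` and `word_count` are hypothetical helpers, the sample lines are made up, and how Spark actually maps storage blocks to partitions is not modeled here.

```python
def split_into_partitions(records, num_partitions):
    """Split a list into num_partitions contiguous slices of near-equal size."""
    base, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for index in range(num_partitions):
        size = base + (1 if index < extra else 0)
        partitions.append(records[start:start + size])
        start += size
    return partitions


def word_count(partition):
    """Work done independently on one partition."""
    return sum(len(line.split()) for line in partition)


# Hypothetical file contents; a real job would read them from storage.
lines = ["alice 30 sales", "bob 41 hr", "carol 29 it", "dan 35 ops", "eve 50 ceo"]
partitions = split_into_partitions(lines, 3)

# Each partial result depends on one partition only, so each could be
# computed on a different node before the results are combined.
partial_counts = [word_count(partition) for partition in partitions]
total = sum(partial_counts)
print(partial_counts, total)
```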
RDD Operations
• Transformations
• Actions
• Persistence
RDD Operations – Transformation
Transformation:
• A set of operations that define how an RDD should be transformed
• Creates a new RDD from the existing one to process the data
• Lazy evaluation: computation doesn’t start until an action is invoked
• E.g. map, flatMap, filter, union, groupBy, etc.
RDD Operations – Action
Action:
• Triggers job execution.
• Returns the result to the driver or writes it to storage.
• E.g. count, collect, reduce, take, etc.
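The split between lazy transformations and job-triggering actions can be shown with a small toy class in plain Python. `LazyDataset` is a hypothetical stand-in, not the RDD API; it borrows the method names `map`, `filter`, `collect`, and `count` only to mirror the operations listed above.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def double(x):
    calls.append(x)  # side effect, so we can observe when work happens
    return x * 2


numbers = LazyDataset([1, 2, 3, 4])
pipeline = numbers.map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has run yet
result = pipeline.collect()       # the action triggers execution
print(calls_before_action, result, len(calls))
```

Note that `numbers` itself is never modified: each transformation returns a new object, which matches the description of RDDs as immutable.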
RDD Operations – Persistence
Persistence:
• Spark allows caching (persisting) an entire dataset in memory
• The cached RDD is kept in memory for reuse in future operations
Diagram labels: Primary Storage, Cache
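The effect of persistence can be illustrated with a toy model in plain Python: without caching, every use rebuilds the data; with caching, the data is built once and reused. `CachedComputation` and `expensive_load` are hypothetical names for this sketch and are not part of Spark.

```python
class CachedComputation:
    """Toy model of persistence: compute once, then reuse the in-memory copy."""

    def __init__(self, compute):
        self._compute = compute  # function that rebuilds the data
        self._cached = None
        self._persist = False

    def cache(self):
        """Mark the data to be kept in memory after it is first computed."""
        self._persist = True
        return self

    def data(self):
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._persist:
            self._cached = result
        return result


runs = {"count": 0}


def expensive_load():
    runs["count"] += 1  # counts how often the data is rebuilt
    return [n * n for n in range(5)]


uncached = CachedComputation(expensive_load)
uncached.data()
uncached.data()
runs_without_cache = runs["count"]

runs["count"] = 0
cached = CachedComputation(expensive_load).cache()
cached.data()
cached.data()
runs_with_cache = runs["count"]
print(runs_without_cache, runs_with_cache)
```

In this sketch, marking the data with `cache()` does no work by itself; the copy is stored the first time the data is actually computed.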
RDD Operations
• Transformations (map(), flatMap()…): create a new RDD from the parent RDD based on custom business logic; the chain of RDDs forms the lineage.
• Actions (saveAsTextFile(), count()…): return output to the driver or export data to a storage system after computation.
Diagram: Parent RDD → Transformations → RDD → Actions → Result
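Lineage is what makes recovery possible: if the steps that produced a dataset are recorded, a lost piece can be rebuilt from its parent. The plain-Python sketch below is a toy illustration under the assumption that the parent data is still available; `build_partition` and the sample data are hypothetical and do not come from Spark.

```python
def build_partition(parent_partition, lineage):
    """Recompute a derived partition by replaying the recorded functions."""
    data = parent_partition
    for step in lineage:
        data = step(data)
    return data


# Lineage: the ordered transformations that produced the derived dataset.
lineage = [
    lambda part: [x + 1 for x in part],            # map-like step
    lambda part: [x for x in part if x % 2 == 0],  # filter-like step
]

parent = [[1, 2, 3], [4, 5, 6]]  # two parent partitions
derived = [build_partition(part, lineage) for part in parent]

derived[1] = None                                 # simulate losing one partition
derived[1] = build_partition(parent[1], lineage)  # rebuild only the lost one
print(derived)
```

Only the lost partition is recomputed; the surviving partition is left untouched.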
Features of Spark
• Speed: up to 100x faster than Hadoop
• Processing: diverse processing platform
• Memory management: automatic memory management
• Window criteria: time-based window criteria
• Fault tolerance: recovers automatically
• Duplicate elimination: processes every record exactly once
Thank You
DataFlair
