APACHE SPARK
What is Spark?
 Spark is an open-source distributed computing engine.
 It is used for processing and analyzing large amounts of data.
 It is purpose-built for fast computation in the Big Data world.
 Like Hadoop MapReduce, it distributes data and computation across a cluster.
 The most sparkling feature of Apache Spark is that it offers in-memory cluster computing, which greatly enhances the processing speed of an application.
Big Data
 Any voluminous amount of structured, semi-structured, or unstructured data that has the potential to be mined for information.
 It is a collection of large datasets that cannot be processed using traditional computing techniques.
3 Vs of Big Data: Volume, Velocity, and Variety
We have… Then why Spark?
 Batch processing: Hadoop MapReduce
 Stream processing: Apache Storm
 Interactive processing: Apache Impala / Apache Tez
 Graph processing: Neo4j / Apache Giraph
Why Spark?
Criteria            | Spark                                    | Hadoop MapReduce
Processing location | In-memory                                | Persists on disk after map and reduce functions
Ease of use         | Easy, as it is based on Scala            | Difficult, as it is based on Java
Speed               | Up to 100 times faster than MapReduce    | Slower
Latency             | Lower                                    | Higher
Computation         | Iterative computation possible           | Only single-pass computation possible
Task scheduling     | Schedules tasks itself                   | Requires external schedulers
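To illustrate the "iterative computation" row above, here is a minimal Scala sketch (the data, values, and the convergence rule are purely illustrative): the working dataset is cached in memory once and reused across iterations, instead of being written back to disk between steps as in MapReduce.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

// Keep the working set in memory so every iteration reuses it without re-reading from disk.
val values = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

var threshold = 0.0
for (_ <- 1 to 10) {              // each pass reuses the cached RDD
  threshold = values.filter(_ > threshold).mean()
}
println(s"threshold after 10 iterations = $threshold")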
Features of Apache Spark
Apache Spark Components
Spark Core Module
 Spark uses a master/slave architecture, i.e. one central coordinator and many distributed workers. The central coordinator is called the driver.
 The driver runs in its own Java process.
 The driver communicates with a potentially large number of distributed workers called executors.
 Each executor is a separate Java process.
Abstractions on which the Spark architecture is based
 Resilient Distributed Datasets (RDD)
 Directed Acyclic Graph (DAG)
RDD - Resilient Distributed Datasets
 Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph (DAG), Spark can recompute missing or damaged partitions caused by node failures.
 Distributed, since the data resides on multiple nodes.
 Dataset represents the records of the data you work with. The user can load the dataset externally from, for example, a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.
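As a concrete illustration of the points above, the following minimal Scala sketch creates the driver-side SparkContext and loads datasets into RDDs both from an in-memory collection and from external storage (the file path is illustrative; the partitions themselves are computed by the executors):

import org.apache.spark.{SparkConf, SparkContext}

// This program is the driver; executors are separate JVM processes on the workers.
val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")  // use a cluster URL in practice
val sc   = new SparkContext(conf)

// RDD from an in-memory collection, split into partitions across the cluster.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

// RDD from external storage, e.g. a text file read line by line (path is illustrative).
val lines = sc.textFile("data/input.txt")

println(numbers.getNumPartitions)   // each partition is computed on a node of the cluster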
What is an RDD?
 The fundamental data structure of Apache Spark.
 An RDD is a resilient and distributed collection of records spread over one or many partitions.
 Spark RDDs are immutable in nature.
 It is an immutable collection of objects that is computed on different nodes of the cluster.
 Each and every dataset in a Spark RDD is logically partitioned.
 It supports in-memory computation over the Spark cluster.
Features of RDD in Spark
Spark RDD Operations
Transformations
 Spark RDD transformations are functions that take an RDD as input and produce one or many RDDs as output.
 They do not change the input RDD but always produce one or more new RDDs by applying the computations they represent, e.g. map(), filter(), reduceByKey().
 Transformations are lazy operations on an RDD.
 A transformation creates one or many new RDDs, which are executed only when an Action occurs. Hence, a transformation creates a new dataset from an existing one.
 Transformations are of 2 types: narrow and wide (see the sketch after the descriptions below).
Narrow transformation
• It is the result of map, filter, and similar operations where the data comes from a single partition only, i.e. it is self-sufficient.
• Each partition of the output RDD has records that originate from a single partition in the parent RDD.
Wide transformation
• The data required to compute the records in a single partition may live in many partitions of the parent RDD.
• Wide transformations are also known as shuffle transformations because they typically require shuffling data across partitions.
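A minimal Scala sketch of both kinds of transformation (the sample data is illustrative, and sc is a SparkContext created as shown earlier); note that nothing executes yet, because transformations are lazy:

// Narrow transformations: each output partition depends on a single parent partition.
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "storm"))
val pairs = words.map(word => (word, 1))          // narrow: map
val kept  = pairs.filter(_._1.nonEmpty)           // narrow: filter

// Wide transformation: records with the same key may live in many parent partitions,
// so the data has to be shuffled across the cluster.
val counts = kept.reduceByKey(_ + _)

// No job has run yet; transformations stay lazy until an action (e.g. collect()) is called.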
Actions
 An Action in Spark returns the final result of the RDD computations.
 An action triggers execution using the lineage graph: Spark loads the data into the original RDD, carries out all intermediate transformations, and returns the final result to the driver program.
 E.g. first(), take(), reduce(), collect(), count()
[Lineage graph example: raw data → cars data → American cars data → sum / average]
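A minimal Scala sketch mirroring the lineage example above (the file path, record layout, and column positions are illustrative): the transformations only describe the lineage, and the actions at the end trigger the actual computation and return results to the driver.

val raw      = sc.textFile("data/cars.csv")                 // transformation: raw data
val cars     = raw.map(_.split(","))                        // transformation: cars data
val american = cars.filter(fields => fields(1) == "USA")    // transformation: American cars
val prices   = american.map(fields => fields(2).toDouble)

// Actions: each one walks the lineage above, computes the RDDs, and returns a result.
val count   = prices.count()
val sum     = prices.reduce(_ + _)
val average = sum / count
println(s"first American car: ${american.first().mkString(",")}, sum = $sum, avg = $average")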
Spark In-Memory Computing
 Data is kept in random access memory (RAM) instead of slow disk drives and is processed in parallel.
 This approach has become popular as the cost of memory has come down, making in-memory processing economical for applications.
 The two main pillars of in-memory computation are:
 RAM storage
 Parallel distributed processing
RDD Persistence and Caching Mechanism
 Spark RDD persistence is an optimization technique that saves the result of an RDD evaluation. Using it, we save the intermediate result so that we can reuse it if required. It reduces the computation overhead.
 We can persist an RDD through the cache() and persist() methods.
 When we use the cache() method, the RDD is stored in memory.
 We can persist the RDD in memory and reuse it efficiently across parallel operations.
What's the difference between cache() and persist()?
Cache
 Stores the RDD in memory.
 The default storage level is MEMORY_ONLY.
Persistence
 Persists the RDD in memory (or on disk) and reuses it efficiently across parallel operations.
 Allows choosing one of the following storage levels:
 MEMORY_ONLY
 MEMORY_AND_DISK
 MEMORY_ONLY_SER
 MEMORY_AND_DISK_SER
 DISK_ONLY
 MEMORY_ONLY_2 and MEMORY_AND_DISK_2
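A minimal Scala sketch of both methods (the log file path and filter are illustrative): cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), while persist() lets you pick any of the storage levels listed above.

import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("data/events.log")
val errors = logs.filter(_.contains("ERROR"))

errors.cache()                                    // equivalent to persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK)  // or choose an explicit storage level instead

println(errors.count())                 // first action materialises the persisted partitions
println(errors.take(5).mkString("\n"))  // later actions reuse them without recomputation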
Benefits of RDD
 Time efficient
 Cost efficient
 Lessens execution time
Limitations of Spark RDD
DAG - Directed Acyclic Graph
 Directed: the edges point from one node to another, which creates a sequence.
 Acyclic: there is no cycle or loop in the graph.
 Graph: a combination of vertices and edges, with all the connections in a sequence.
Need for a Directed Acyclic Graph in Spark
The computation in MapReduce is carried out in three steps:
 The data is read from HDFS.
 Map and Reduce operations are applied.
 The computed result is written back to HDFS.
In Spark, a DAG of consecutive computation stages is formed. In this way, the execution plan is optimized, e.g. to minimize shuffling data around. In contrast, this optimization is done manually in MapReduce by tuning each MapReduce step.
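A small Scala sketch of such a DAG (the input path is illustrative): the chained transformations below form one operator graph, and toDebugString prints the recorded lineage, where the shuffle introduced by reduceByKey marks a stage boundary.

val wordCounts = sc.textFile("data/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Print the lineage / DAG that Spark has recorded for this RDD.
println(wordCounts.toDebugString)

// Only the action below makes Spark hand the operator graph to the DAG scheduler and run it.
wordCounts.collect().take(10).foreach(println)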
How does the DAG work in Spark?
 Using the Scala interpreter, Spark interprets the code with some modifications.
 Spark creates an operator graph as you enter your code in the Spark console.
 At a high level, when an Action is called on a Spark RDD, Spark submits the operator graph to the DAG Scheduler.
 Operators are divided into stages of tasks in the DAG Scheduler. A stage contains tasks based on the partitions of the input data. The DAG scheduler pipelines operators together; for example, map operators are scheduled in a single stage.
 The stages are passed on to the Task Scheduler, which launches the tasks through the cluster manager. The dependencies between stages are unknown to the task scheduler.
 The workers execute the tasks on the slave nodes.
Advantages of DAG in Spark
 A lost RDD can be recovered using the lineage recorded in the DAG.
 MapReduce has just two steps, map and reduce, whereas a DAG can have multiple levels, so executing SQL-style queries is more flexible.
 The DAG helps achieve fault tolerance; thus, lost data can be recomputed.
 Better optimization than a system like Hadoop MapReduce.
Spark Streaming
 A data stream is an unbounded sequence of data arriving continuously.
 Streaming divides the continuously flowing input data into discrete units for further processing.
 Stream processing is the low-latency processing and analysis of streaming data.
Why Streaming in Spark?
 Batch processing systems like Apache Hadoop have high latency, which is not suitable for near-real-time processing requirements.
 Storm guarantees that each record is processed: records that fail are replayed, but this can lead to inconsistency because a record may be processed more than once. The state is also lost if a node running Storm goes down.
 In most environments, Hadoop is used for batch processing while Storm is used for stream processing. Maintaining two systems increases code size, the number of bugs to fix, and the development effort, introduces a learning curve, and causes other issues.
Contd…
 Spark Streaming helps fix these issues and provides a system that is
 scalable,
 efficient,
 resilient, and
 integrated with batch processing.
Discretized Stream Processing - DStream
 Spark DStream (Discretized Stream) is the basic abstraction of Spark Streaming.
 A DStream represents a stream of data divided into small batches.
 DStreams are built on Spark RDDs, Spark's core data abstraction.
Discretized Stream Processing
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
 Spark Streaming chops up the live stream into batches of X seconds.
 Spark treats each batch of data as an RDD and processes it using RDD operations.
 Finally, the processed results of the RDD operations are returned in batches.
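A minimal Scala sketch of this flow using Spark Streaming's DStream API (the host, port, and 5-second batch interval are illustrative): each batch of the live stream becomes an RDD, which is processed with ordinary RDD-style operations.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]") // at least 2 threads: receiver + processing
val ssc  = new StreamingContext(conf, Seconds(5))                               // batches of 5 seconds

// Live data stream from a TCP socket; each 5-second batch is handled as an RDD.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()               // processed results, printed batch by batch

ssc.start()                  // start receiving and processing the stream
ssc.awaitTermination()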
DEMO
Limitations of Apache Spark Programming
Apache Spark use cases
Summary
 Spark Streaming is a stream processing framework that
- Scales to large clusters
- Achieves second-scale latencies
- Has a simple programming model
- Integrates with batch & interactive workloads
- Ensures efficient fault tolerance in stateful computations
Editor's Notes
 (Need for a DAG in Spark) Each MapReduce operation is independent of the others, and Hadoop has no idea which MapReduce job will come next. For some iterative computations, it is wasteful to read and write the intermediate result between two MapReduce jobs; in such cases, the memory in stable storage (HDFS) or on disk is wasted.