APACHE SPARK
What is Spark?
 Spark is an open-source distributed computing engine.
 It is used for processing and analyzing large amounts of data.
 It is purpose-built for fast computation in the Big Data world.
 Like Hadoop MapReduce, it distributes data and computation across a cluster.
 The most sparkling feature of Apache Spark is that it offers in-memory cluster computing, which greatly enhances the processing speed of an application.
Big Data
 Any voluminous amount of structured, semi-structured, or unstructured data that has the potential to be mined for information.
 It is a collection of large datasets that cannot be processed using traditional computing techniques.
3 Vs of Big Data: Volume, Velocity, and Variety
We have… Then why Spark?
 Batch processing: Hadoop MapReduce
 Stream processing: Apache Storm
 Interactive processing: Apache Impala / Apache Tez
 Graph processing: Neo4j / Apache Giraph
Why Spark?
Criteria            | Spark                                    | Hadoop MapReduce
Processing location | In-memory                                | Persists on disk after map and reduce functions
Ease of use         | Easy, as it is based on Scala            | Difficult, as it is based on Java
Speed               | Up to 100 times faster than MapReduce    | Slower
Latency             | Lower                                    | Higher
Computation         | Iterative computation possible           | Only single-pass computation possible
Task scheduling     | Schedules tasks itself                   | Requires external schedulers
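To illustrate the "iterative computation" row above, here is a minimal Scala sketch (the data, values, and the convergence rule are purely illustrative): the working dataset is cached in memory once and reused across iterations, instead of being written back to disk between steps as in MapReduce.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

// Keep the working set in memory so every iteration reuses it without re-reading from disk.
val values = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

var threshold = 0.0
for (_ <- 1 to 10) {              // each pass reuses the cached RDD
  threshold = values.filter(_ > threshold).mean()
}
println(s"threshold after 10 iterations = $threshold")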
Features of Apache Spark
Apache Spark Components
Spark Core Module
 Spark uses a master/slave architecture, i.e. one central coordinator and many distributed workers. The central coordinator is called the driver.
 The driver runs in its own Java process.
 The driver communicates with a potentially large number of distributed workers called executors.
 Each executor is a separate Java process.
Abstractions on which the Spark architecture is based
 Resilient Distributed Datasets (RDD)
 Directed Acyclic Graph (DAG)
RDD - Resilient Distributed Datasets
 Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph (DAG), Spark can recompute missing or damaged partitions caused by node failures.
 Distributed, since the data resides on multiple nodes.
 Dataset represents the records of the data you work with. The user can load the dataset externally from, for example, a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.
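As a concrete illustration of the points above, the following minimal Scala sketch creates the driver-side SparkContext and loads datasets into RDDs both from an in-memory collection and from external storage (the file path is illustrative; the partitions themselves are computed by the executors):

import org.apache.spark.{SparkConf, SparkContext}

// This program is the driver; executors are separate JVM processes on the workers.
val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")  // use a cluster URL in practice
val sc   = new SparkContext(conf)

// RDD from an in-memory collection, split into partitions across the cluster.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

// RDD from external storage, e.g. a text file read line by line (path is illustrative).
val lines = sc.textFile("data/input.txt")

println(numbers.getNumPartitions)   // each partition is computed on a node of the cluster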
What is an RDD?
 The fundamental data structure of Apache Spark.
 An RDD is a resilient and distributed collection of records spread over one or many partitions.
 Spark RDDs are immutable in nature.
 It is an immutable collection of objects that is computed on different nodes of the cluster.
 Each and every dataset in a Spark RDD is logically partitioned.
 It supports in-memory computation over the Spark cluster.
Features of RDD in Spark
Spark RDD Operations
Transformations
 Spark RDD transformations are functions that take an RDD as input and produce one or many RDDs as output.
 They do not change the input RDD but always produce one or more new RDDs by applying the computations they represent, e.g. map(), filter(), reduceByKey().
 Transformations are lazy operations on an RDD.
 A transformation creates one or many new RDDs, which are executed only when an Action occurs. Hence, a transformation creates a new dataset from an existing one.
 Transformations are of 2 types: narrow and wide (see the sketch after the descriptions below).
Narrow transformation
• It is the result of map, filter, and similar operations where the data comes from a single partition only, i.e. it is self-sufficient.
• Each partition of the output RDD has records that originate from a single partition in the parent RDD.
Wide transformation
• The data required to compute the records in a single partition may live in many partitions of the parent RDD.
• Wide transformations are also known as shuffle transformations because they typically require shuffling data across partitions.
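A minimal Scala sketch of both kinds of transformation (the sample data is illustrative, and sc is a SparkContext created as shown earlier); note that nothing executes yet, because transformations are lazy:

// Narrow transformations: each output partition depends on a single parent partition.
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "storm"))
val pairs = words.map(word => (word, 1))          // narrow: map
val kept  = pairs.filter(_._1.nonEmpty)           // narrow: filter

// Wide transformation: records with the same key may live in many parent partitions,
// so the data has to be shuffled across the cluster.
val counts = kept.reduceByKey(_ + _)

// No job has run yet; transformations stay lazy until an action (e.g. collect()) is called.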
Actions
 An Action in Spark returns the final result of the RDD computations.
 An action triggers execution using the lineage graph: Spark loads the data into the original RDD, carries out all intermediate transformations, and returns the final result to the driver program.
 E.g. first(), take(), reduce(), collect(), count()
[Lineage graph example: raw data → cars data → American cars data → sum / average]
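A minimal Scala sketch mirroring the lineage example above (the file path, record layout, and column positions are illustrative): the transformations only describe the lineage, and the actions at the end trigger the actual computation and return results to the driver.

val raw      = sc.textFile("data/cars.csv")                 // transformation: raw data
val cars     = raw.map(_.split(","))                        // transformation: cars data
val american = cars.filter(fields => fields(1) == "USA")    // transformation: American cars
val prices   = american.map(fields => fields(2).toDouble)

// Actions: each one walks the lineage above, computes the RDDs, and returns a result.
val count   = prices.count()
val sum     = prices.reduce(_ + _)
val average = sum / count
println(s"first American car: ${american.first().mkString(",")}, sum = $sum, avg = $average")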
Spark In-Memory Computing
 Data is kept in random access memory (RAM) instead of slow disk drives and is processed in parallel.
 This approach has become popular as the cost of memory has come down, making in-memory processing economical for applications.
 The two main pillars of in-memory computation are:
 RAM storage
 Parallel distributed processing
RDD Persistence and Caching Mechanism
 Spark RDD persistence is an optimization technique that saves the result of an RDD evaluation. Using it, we save the intermediate result so that we can reuse it if required. It reduces the computation overhead.
 We can persist an RDD through the cache() and persist() methods.
 When we use the cache() method, the RDD is stored in memory.
 We can persist the RDD in memory and reuse it efficiently across parallel operations.
What's the difference between cache() and persist()?
Cache
 Stores the RDD in memory.
 The default storage level is MEMORY_ONLY.
Persistence
 Persists the RDD in memory (or on disk) and reuses it efficiently across parallel operations.
 Allows choosing one of the following storage levels:
 MEMORY_ONLY
 MEMORY_AND_DISK
 MEMORY_ONLY_SER
 MEMORY_AND_DISK_SER
 DISK_ONLY
 MEMORY_ONLY_2 and MEMORY_AND_DISK_2
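A minimal Scala sketch of both methods (the log file path and filter are illustrative): cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), while persist() lets you pick any of the storage levels listed above.

import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("data/events.log")
val errors = logs.filter(_.contains("ERROR"))

errors.cache()                                    // equivalent to persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK)  // or choose an explicit storage level instead

println(errors.count())                 // first action materialises the persisted partitions
println(errors.take(5).mkString("\n"))  // later actions reuse them without recomputation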
Benefits of RDD
 Time efficient
 Cost efficient
 Lessens execution time
Limitations of Spark RDD
DAG - Directed Acyclic Graph
 Directed: the edges point from one node to another, which creates a sequence.
 Acyclic: there is no cycle or loop in the graph.
 Graph: a combination of vertices and edges, with all the connections in a sequence.
Need for a Directed Acyclic Graph in Spark
The computation in MapReduce is carried out in three steps:
 The data is read from HDFS.
 Map and Reduce operations are applied.
 The computed result is written back to HDFS.
In Spark, a DAG of consecutive computation stages is formed. In this way, the execution plan is optimized, e.g. to minimize shuffling data around. In contrast, this optimization is done manually in MapReduce by tuning each MapReduce step.
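A small Scala sketch of such a DAG (the input path is illustrative): the chained transformations below form one operator graph, and toDebugString prints the recorded lineage, where the shuffle introduced by reduceByKey marks a stage boundary.

val wordCounts = sc.textFile("data/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Print the lineage / DAG that Spark has recorded for this RDD.
println(wordCounts.toDebugString)

// Only the action below makes Spark hand the operator graph to the DAG scheduler and run it.
wordCounts.collect().take(10).foreach(println)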
How does the DAG work in Spark?
 Using the Scala interpreter, Spark interprets the code with some modifications.
 Spark creates an operator graph as you enter your code in the Spark console.
 At a high level, when an Action is called on a Spark RDD, Spark submits the operator graph to the DAG Scheduler.
 Operators are divided into stages of tasks in the DAG Scheduler. A stage contains tasks based on the partitions of the input data. The DAG scheduler pipelines operators together; for example, map operators are scheduled in a single stage.
 The stages are passed on to the Task Scheduler, which launches the tasks through the cluster manager. The dependencies between stages are unknown to the task scheduler.
 The workers execute the tasks on the slave nodes.
Advantages of DAG in Spark
 A lost RDD can be recovered using the lineage recorded in the DAG.
 MapReduce has just two steps, map and reduce, whereas a DAG can have multiple levels, so executing SQL-style queries is more flexible.
 The DAG helps achieve fault tolerance; thus, lost data can be recomputed.
 Better optimization than a system like Hadoop MapReduce.
Spark Streaming
 A data stream is an unbounded sequence of data arriving continuously.
 Streaming divides the continuously flowing input data into discrete units for further processing.
 Stream processing is the low-latency processing and analysis of streaming data.
Why Streaming in Spark?
 Batch processing systems like Apache Hadoop have high latency, which is not suitable for near-real-time processing requirements.
 Storm guarantees that each record is processed: records that fail are replayed, but this can lead to inconsistency because a record may be processed more than once. The state is also lost if a node running Storm goes down.
 In most environments, Hadoop is used for batch processing while Storm is used for stream processing. Maintaining two systems increases code size, the number of bugs to fix, and the development effort, introduces a learning curve, and causes other issues.
Contd…
 Spark Streaming helps fix these issues and provides a system that is
 scalable,
 efficient,
 resilient, and
 integrated with batch processing.
Discretized Stream Processing - DStream
 Spark DStream (Discretized Stream) is the basic abstraction of Spark Streaming.
 A DStream represents a stream of data divided into small batches.
 DStreams are built on Spark RDDs, Spark's core data abstraction.
Discretized Stream Processing
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
 Spark Streaming chops up the live stream into batches of X seconds.
 Spark treats each batch of data as an RDD and processes it using RDD operations.
 Finally, the processed results of the RDD operations are returned in batches.
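A minimal Scala sketch of this flow using Spark Streaming's DStream API (the host, port, and 5-second batch interval are illustrative): each batch of the live stream becomes an RDD, which is processed with ordinary RDD-style operations.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]") // at least 2 threads: receiver + processing
val ssc  = new StreamingContext(conf, Seconds(5))                               // batches of 5 seconds

// Live data stream from a TCP socket; each 5-second batch is handled as an RDD.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()               // processed results, printed batch by batch

ssc.start()                  // start receiving and processing the stream
ssc.awaitTermination()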
DEMO
Limitations of Apache Spark Programming
Apache Spark use cases
Summary
 Spark Streaming is a stream processing framework that
- Scales to large clusters
- Achieves second-scale latencies
- Has a simple programming model
- Integrates with batch & interactive workloads
- Ensures efficient fault tolerance in stateful computations
Editor's Notes
 (Need for a DAG in Spark) Each MapReduce operation is independent of the others, and Hadoop has no idea which MapReduce job will come next. For some iterative computations, it is wasteful to read and write the intermediate result between two MapReduce jobs; in such cases, the memory in stable storage (HDFS) or on disk is wasted.