Apache Spark for Beginners

Available Distributed
Programming models
• MapReduce
• Storm
• Flink
• Spark

Major limitations with available
distributed models
1. Difficulty in programming directly in MapReduce
2. No support for in-memory computation in
MapReduce
3. MapReduce supports only batch processing, which does not fit every
use case.
4. Flink is not ready for production level projects.
5. Flink primarily works on streaming data.
6. Storm is slower than Spark.

Hadoop Ecosystem
Note: This is just an illustrative figure, not all components shown may be production ready

What is Spark?
• Spark is the open standard for flexible in-
memory data processing for batch, real-time,
and advanced analytics.
• Powerful open source processing engine built
around speed, ease of use, and sophisticated
analytics.
• First high-level programming framework for fast,
distributed data processing.

Some key points about Spark
• Handles batch, interactive, and real-time
within a single framework
(by contrast, MapReduce covers only batch and Flink primarily streaming)
• Native integration with Java, Python, Scala
and R
• More general: map/reduce is just one set of
supported constructs

How does Spark Work?
• Often used in tandem with a distributed storage system to
write the data processed and a cluster manager to manage
the distribution of the application across the cluster.
• Spark currently supports three kinds of cluster managers:
1. The manager included in Spark, called the Standalone Cluster Manager,
which requires Spark to be installed in each node of a cluster.
2. Apache Mesos
3. Hadoop YARN.

Spark Data Processing Ecosystem
Figure: Components of
Spark Architecture Model

Spark Cluster
Figure: Spark Cluster Mode Overview

Spark Ecosystem
Figure: Spark Core API (R, SQL, Python, Scala, Java) with the Spark SQL,
Streaming, MLlib, and GraphX libraries built on top of it
Programming languages used in Spark Source:

Spark Core
• Spark Core is the main data processing framework in the Spark ecosystem.
• Spark Core is the underlying general execution engine for the Spark
platform that all other functionality is built on top of.
• It provides in-memory computing capabilities to deliver speed, a
generalized execution model to support a wide variety of applications, and
Java, Scala, and Python APIs for ease of development.
• In addition to Spark Core, the Spark ecosystem includes a number of other
first-party components for more specific data processing tasks, including
Spark SQL, Spark MLlib, Spark ML, and GraphX.
• These components have many of the same generic performance
considerations as the core. However, some of them have unique
considerations - like SQL’s different optimizer.

Spark SQL
• Spark SQL is a Spark module for structured data processing.
• Provides a programming abstraction called DataFrames and can also act as
a distributed SQL query engine.
• Defines an interface for a semi-structured data type
called DataFrame and a typed version called Dataset.
• Very important component for Spark performance, and almost all that can
be accomplished with Spark core can be applied to Spark SQL.
• DataFrames and Datasets interfaces are the future of Spark performance,
with more efficient storage options, advanced optimizer, and direct
operations on serialized data.
• Datasets were introduced in Spark 1.6, DataFrames in Spark 1.3, and the
SQL engine in Spark 1.0.

• Spark SQL supports structured queries in batch and streaming
modes (with the latter as a separate module of Spark SQL
called Structured Streaming).
• As of Spark 2.0, Spark SQL is now de facto the primary and
feature-rich interface to Spark’s underlying in-memory
distributed platform (hiding Spark Core’s RDDs behind higher-
level abstractions).

Spark SQL’s different APIs
• Dataset API (formerly DataFrame API) with a strongly-typed
LINQ-like Query DSL that Scala programmers will likely find
very appealing to use.
• Structured Streaming API (aka Streaming Datasets) for
continuous incremental execution of structured queries.
• Non-programmers will likely use SQL as their query language
through direct integration with Hive.
• JDBC/ODBC fans can use JDBC interface (through Thrift
JDBC/ODBC Server) and connect their tools to Spark’s
distributed query engine.

Machine Learning
• Spark has two machine learning packages, ML and MLlib.
• Spark ML is still in the early stages, but since Spark 1.2, it provides a
higher-level API than MLlib that helps users create practical machine
learning pipelines more easily.
• Spark MLlib is built on top of RDDs, whereas ML is built on top
of Spark SQL DataFrames.
• The Spark community plans to move over to ML and deprecate MLlib.
• Spark ML and MLlib have some unique performance considerations,
especially when working with large data sizes and caching.

Spark Streaming
• Running on top of Spark, Spark Streaming enables powerful interactive
and analytical applications across both streaming and historical data,
while inheriting Spark’s ease of use and fault tolerance characteristics.
• Uses the scheduling of Spark Core to run streaming analytics on
mini-batches of data.
• Has a number of unique considerations, such as the window sizes used for
batches.
• Readily integrates with a wide variety of popular data sources, including
HDFS, Flume, Kafka, and Twitter.
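
The mini-batch and window ideas above can be illustrated with a small plain-Python sketch. This is a toy model, not the Spark Streaming API: the function names `to_micro_batches` and `windowed_sums`, the event tuples, and the one-second batch interval are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def to_micro_batches(events: List[Tuple[float, int]],
                     batch_interval: float) -> Dict[int, List[int]]:
    """Group (timestamp, value) events into batches of fixed duration."""
    batches: Dict[int, List[int]] = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    return dict(sorted(batches.items()))


def windowed_sums(batches: Dict[int, List[int]],
                  window_length: int) -> Dict[int, int]:
    """For each batch, sum the values of that batch and the
    (window_length - 1) batch slots before it. Empty slots count as 0."""
    result: Dict[int, int] = {}
    for index in batches:
        first = index - window_length + 1
        result[index] = sum(
            sum(batches.get(j, [])) for j in range(first, index + 1)
        )
    return result


# Hypothetical stream: (arrival time in seconds, value)
events = [(0.2, 1), (0.9, 2), (1.1, 3), (2.5, 4), (2.7, 5), (4.0, 6)]
batches = to_micro_batches(events, batch_interval=1.0)
sums = windowed_sums(batches, window_length=2)
```

A longer window smooths the result over more batches but makes each output depend on more history, which is why window size is a tuning consideration.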

GraphX
• GraphX is a graph computation engine built on top of Spark that enables
users to interactively build, transform and reason about graph structured
data at scale.
• Comes complete with a library of common algorithms.
• One of the least mature components of Spark.
• Typed graph functionality will start to be introduced on top of the Dataset
API in upcoming versions.

Spark Model of Parallel Computing:
RDDs
• Spark revolves around the concept of a resilient distributed dataset (RDD),
which is a fault-tolerant collection of elements partitioned across
machines, that can be operated on in parallel.
• Each RDD is split into multiple partitions, which may be computed on
different nodes of the cluster.
• RDDs are distributed data-sets that can stay in-memory or fall back to disk
gracefully.
• RDDs are resilient because they track their lineage. Whenever there's a
failure in the system, they can recompute themselves from the
information recorded in that lineage.
• RDDs are a representation of lazily evaluated statically typed distributed
collections.

• Spark stores data in RDDs on different partitions. They help with
rearranging the computations and optimizing the data processing.
• RDDs are immutable. We can modify an RDD with a transformation but
the transformation returns a new RDD whereas the original RDD remains
the same.
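
The partitioning and immutability described above can be sketched in plain Python. This is a single-machine toy, not the Spark API: `partition` and `map_partitions` are hypothetical helpers, and tuples stand in for immutable partitions.

```python
from typing import Callable, List, Tuple

Partitions = Tuple[Tuple[int, ...], ...]


def partition(data: List[int], num_partitions: int) -> Partitions:
    """Split data into contiguous, immutable chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return tuple(tuple(data[i:i + size]) for i in range(0, len(data), size))


def map_partitions(parts: Partitions, fn: Callable[[int], int]) -> Partitions:
    """Apply fn to every element and return NEW partitions.
    The input partitions are never modified."""
    return tuple(tuple(fn(x) for x in part) for part in parts)


original = partition(list(range(1, 9)), num_partitions=4)
doubled = map_partitions(original, lambda x: x * 2)
```

After the transformation, `original` is unchanged and `doubled` is a separate collection with the same partition layout. In Spark, each partition could additionally be processed on a different node.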

RDD Operations
• RDDs support two types of operations:
– Transformation: A transformation does not return a single value;
it returns a new RDD. Nothing is evaluated when a
transformation function is called; it just takes an RDD and returns
a new RDD.
A few of the transformation functions are map, filter, flatMap,
groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
– Action: An action evaluates the RDD and returns a value.
When an action function is called on an RDD object, all the pending data
processing is computed at that time and the result
value is returned.
A few of the actions are reduce, collect, count, first, take,
countByKey, and foreach.
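
To make the split between transformations and actions concrete, here is a minimal plain-Python sketch. `ToyRDD` is a hypothetical single-machine stand-in, not the Spark API; it borrows the method names map, filter, collect, count, and reduce from the lists above. The `calls` list exists only to show when the mapping function actually runs.

```python
from functools import reduce as _reduce
from typing import Any, Callable, Iterable, List


class ToyRDD:
    """Toy stand-in for an RDD: it stores a recipe, not the data."""

    def __init__(self, source: Callable[[], Iterable[Any]]):
        self._source = source  # calling this produces the elements

    @classmethod
    def parallelize(cls, items: List[Any]) -> "ToyRDD":
        snapshot = tuple(items)  # immutable copy of the input
        return cls(lambda: iter(snapshot))

    # Transformations: build a new ToyRDD, compute nothing yet.
    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the whole chain and return a plain value.
    def collect(self) -> List[Any]:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())

    def reduce(self, fn: Callable[[Any, Any], Any]) -> Any:
        return _reduce(fn, self._source())


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)  # record that the function really ran
    return x * x


numbers = ToyRDD.parallelize([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # no action yet, so nothing has run
result = even_squares.collect()    # the action triggers the computation
calls_after_action = len(calls)
```

Defining `even_squares` does no work. `square` is only invoked once `collect()` is called, and `numbers` itself is never changed.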

Lazy Evaluation
• Evaluation of RDDs is completely lazy.
• Spark does not begin computing the partitions until an
action is called.
• Actions trigger the scheduler, which builds a directed acyclic
graph (called the DAG), based on the dependencies between
RDD transformations.
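
As a rough illustration of how a dependency graph yields an execution order, the sketch below uses Python's standard `graphlib` module (Python 3.9+). It is not Spark's scheduler: the dataset names in `deps` and the helpers `ancestors` and `execution_plan` are hypothetical. Each entry maps a dataset to the datasets it is derived from.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical lineage: each dataset lists the datasets it depends on.
deps: Dict[str, Set[str]] = {
    "lines": set(),
    "words": {"lines"},
    "pairs": {"words"},
    "counts": {"pairs"},
    "debug_sample": {"lines"},  # defined, but never used by an action
}


def ancestors(target: str, graph: Dict[str, Set[str]]) -> Set[str]:
    """Collect the target and everything it transitively depends on."""
    needed: Set[str] = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(graph[node])
    return needed


def execution_plan(target: str, graph: Dict[str, Set[str]]) -> List[str]:
    """Order only the datasets the action needs, dependencies first."""
    needed = ancestors(target, graph)
    subgraph = {node: graph[node] for node in needed}
    return list(TopologicalSorter(subgraph).static_order())


# Calling an action on "counts" only requires its own ancestors.
plan = execution_plan("counts", deps)
```

Because the plan is built only when an action is requested, work that no action depends on (here `debug_sample`) is never scheduled.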

PERFORMANCE & USABILITY
ADVANTAGES OF LAZY EVALUATION
• Allows Spark to chain together operations that don’t
require communication with the driver to avoid doing
multiple passes through the data.
• Because each RDD carries the dependency information needed to
recompute any of its partitions, Spark is fault-tolerant.
• If a partition is lost in a failure, the RDD has enough information
about its lineage to recompute it, and that computation can be
parallelized to make recovery faster.
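
Lineage-based recovery can be sketched in a few lines of plain Python. This is a toy model, not Spark's implementation: `LineagePartition`, the `cache` dictionary, and the two example steps are assumptions for illustration. Each partition keeps its source data and the ordered list of transformations applied to it.

```python
from typing import Callable, Dict, List, Sequence


class LineagePartition:
    """A partition that remembers how to rebuild itself."""

    def __init__(self, source: Sequence[int],
                 steps: List[Callable[[int], int]]):
        self.source = tuple(source)  # stable input for this partition
        self.steps = list(steps)     # lineage: transformations, in order

    def compute(self) -> List[int]:
        values = list(self.source)
        for step in self.steps:
            values = [step(v) for v in values]
        return values


steps = [lambda x: x + 1, lambda x: x * 10]
partitions = [LineagePartition(chunk, steps)
              for chunk in ([1, 2], [3, 4], [5, 6])]

# Computed results, as if held in memory on three different nodes.
cache: Dict[int, List[int]] = {
    i: p.compute() for i, p in enumerate(partitions)
}

del cache[1]                        # simulate losing one node's result
recovered = partitions[1].compute()  # rebuild only the lost partition
cache[1] = recovered
```

Only the lost partition is recomputed; the other two results stay as they were. With several lost partitions, each could be rebuilt independently, which is what allows recovery to be parallelized.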

IN-MEMORY STORAGE & MEMORY MANAGEMENT
• Spark can keep data loaded in memory on the worker nodes, so it
performs very well on iterative computations compared to MapReduce.
• Spark offers three options for memory management:
1. In memory as deserialized Java objects: the fastest option, but not
memory efficient, since the data must be held as objects.
2. As serialized data: slower, since serialized data is more CPU-intensive to
read, but often more memory efficient, since it allows the user to choose a
more efficient representation for data than Java objects.
3. On disk: slower for repeated computations, but can be more
fault-tolerant for long chains of transformations and may be the only
feasible option for enormous computations.
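
The three storage options can be mimicked with Python's standard library to show where the extra cost comes from. This is only an analogy, not Spark's storage layer: `pickle` stands in for Spark's serializers, and the `records` data and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical partition contents.
records = [{"id": i, "score": i * 0.5} for i in range(1000)]

# 1. Deserialized in memory: objects are ready to use, no decoding step.
in_memory = records

# 2. Serialized in memory: one bytes buffer, decoded on every read.
serialized = pickle.dumps(records)
from_serialized = pickle.loads(serialized)  # CPU cost paid here

# 3. On disk: the same bytes, plus file I/O on every read.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "partition.bin")
    with open(path, "wb") as handle:
        handle.write(serialized)
    with open(path, "rb") as handle:
        from_disk = pickle.loads(handle.read())
```

All three routes return the same data. They differ in how much work each read requires and in where the bytes live.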

IMMUTABILITY AND THE RDD INTERFACE
• Spark defines an RDD interface whose properties every type of
RDD implements.
• These properties include the RDD's dependencies and information about
data locality, which the execution engine needs to
compute that RDD.
• RDDs can be created in two ways:
(1) by transforming an existing RDD, or
(2) from a SparkContext (by passing a list or reading files).

What are the benefits of Spark?
• Speed: Engineered from the bottom up for performance, Spark can run
programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster
on disk, by exploiting in-memory computing and other optimizations. Spark
has also set a world record for large-scale on-disk sorting.
• Ease of Use: Spark has easy-to-use APIs for operating on large datasets.
This includes a collection of over 100 operators for transforming data and
familiar data frame APIs for manipulating semi-structured data. Write
applications quickly in Java, Scala, Python, or R.
• A Unified Engine: Spark comes packaged with higher-level libraries,
including support for SQL queries, streaming data, machine learning, and
graph processing. These standard libraries increase developer productivity
and can be seamlessly combined to create complex workflows.

When to use Spark?
• Faster Batch Applications: You can deploy batch applications that run
10-100x faster in production environments, with the added benefit of easy code
maintenance.
• Complex ETL Data Pipelines: You can leverage the complete Spark stack to
build complex ETL pipelines that merge streaming, machine learning, and
SQL operations in one program.
• Real-time Operational Analytics: You can leverage MapR-DB/HBase and/or
Spark Streaming functionality to build real-time operational dashboards or
time-series analytics over data ingested at high speeds.
Examples:
• Credit Card Fraud Detection
• Network Security
• Genomic Sequencing

When Not to use Spark?
• Spark was not designed as a multi-user environment. Spark users are
required to know whether the memory they have access to is sufficient for
a dataset. Adding more users further complicates this since the users will
have to coordinate memory usage to run projects concurrently. Due to
this, users will want to consider an alternate engine, such as Apache Hive,
for large, batch projects.

Questions?
Thank You
Anirudh Menon(animenon@mail.com)
Aman Kaushik(amanthekaushik@gmail.com)
