5 Things One Must Know About Spark!

www.edureka.co/apache-spark-scala-training

What will you learn today?
• Spark In-Memory Processing
• Streaming Support
• Machine Learning and Graph
• Spark DataFrame API
• Spark's Integration with Hadoop

Spark In-Memory Processing

Spark Cuts Down Read/Write I/O to Disk
Spark tries to keep data in the memory of its distributed workers, which allows significantly
faster, lower-latency computations, whereas MapReduce keeps shuffling data in and out of disk.

Spark is blazingly Fast

Isn’t Spark In-Memory Only?
"But I have heard Spark is good only for in-memory processing?"

Spark: Best of Both Worlds
It’s a common misconception that Spark is only for in-memory processing. From its inception, Spark
was designed to be a general execution engine that works both in-memory and on-disk.
Almost all Spark operators perform external (disk-backed) operations when data does not fit in memory.
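
To make the idea of an "external operation" concrete, here is a minimal plain-Python sketch of an external sort: sorted runs are spilled to temporary files whenever an in-memory buffer fills up, then merged in a streaming fashion. This is not Spark code and not how Spark implements its operators; the function name `external_sort` and the `max_in_memory` parameter are assumptions made up for this illustration.

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_memory=3):
    """Sort ints while buffering at most max_in_memory items before spilling."""
    run_paths = []
    buffer = []

    def spill(chunk):
        # Write one sorted run to a temporary file on disk.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            for v in sorted(chunk):
                f.write(f"{v}\n")
        run_paths.append(path)

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            spill(buffer)
            buffer = []
    if buffer:
        spill(buffer)

    def read_run(path):
        # Stream one run back from disk, one value at a time.
        with open(path) as f:
            for line in f:
                yield int(line)

    try:
        # heapq.merge lazily merges the already-sorted runs.
        # The result is collected into a list only to keep the demo simple.
        return list(heapq.merge(*(read_run(p) for p in run_paths)))
    finally:
        for p in run_paths:
            os.remove(p)
```

Calling `external_sort([5, 1, 4, 2, 3, 9, 0])` spills three runs to disk and merges them back into one sorted list.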

Streaming Support

Spark Streaming
• Used for processing real-time streaming data.
• Uses the DStream, which is a series of RDDs, to process continuous real-time data.
• The Spark Streaming API closely matches that of Spark Core.
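
The "series of RDDs" idea can be sketched in plain Python: a stream is cut into small batches, and the same per-batch function that would run on a static dataset is applied to each one. This is an illustration only, not the Spark Streaming API. The names `micro_batches`, `word_count`, and `process_stream` are hypothetical, and batches here are cut by record count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter


def micro_batches(records, batch_size):
    """Group a stream of records into consecutive fixed-size batches."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def word_count(batch):
    """Per-batch computation: the same logic one would run on a static dataset."""
    return Counter(word for line in batch for word in line.split())


def process_stream(records, batch_size=2):
    """Apply word_count to every micro-batch, yielding one result per batch."""
    return [word_count(b) for b in micro_batches(records, batch_size)]
```

Because `word_count` knows nothing about streaming, the batch and streaming code paths share the same logic, which is the point of the third bullet above.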

Machine Learning and Graph
Implementation with DAG

Machine Learning
MLlib, a machine learning library, covers:
• Classification
• Regression
• Clustering
• Collaborative filtering
Some of the algorithms, such as linear regression using ordinary least squares or k-means
clustering, also work with streaming data.
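
As a rough idea of how k-means can work on streaming data, the sketch below folds one batch of points at a time into running centroids, weighting the old centroid by the number of points it has already absorbed. It is plain Python written for illustration, not MLlib's implementation; `nearest` and `update_centroids` are hypothetical names, and no decay or forgetting of old data is modelled.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def update_centroids(centroids, counts, batch):
    """Fold one batch of points into the running centroids and counts."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    sizes = [0] * len(centroids)
    # Assign each new point to its nearest current centroid.
    for point in batch:
        i = nearest(point, centroids)
        sizes[i] += 1
        for d in range(dims):
            sums[i][d] += point[d]
    new_centroids, new_counts = [], []
    for i, c in enumerate(centroids):
        total = counts[i] + sizes[i]
        if sizes[i] == 0:
            # No new points for this cluster: keep the centroid unchanged.
            new_centroids.append(list(c))
        else:
            # Weighted mean of the old centroid and the newly assigned points.
            new_centroids.append(
                [(c[d] * counts[i] + sums[i][d]) / total for d in range(dims)]
            )
        new_counts.append(total)
    return new_centroids, new_counts
```

Each call consumes one batch and returns the updated state, so the model can be refreshed as new batches arrive without revisiting old data.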

Cyclic Data Flows
• Every job in Spark comprises a series of operators and runs on a set of data.
• All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
• The DAG is optimized by rearranging and combining operators where possible.
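
The "combining operators" step can be illustrated with a toy optimizer in plain Python. The sketch handles only a linear chain of operators (the simplest kind of DAG) and merges adjacent operators of the same kind, so the data is traversed fewer times. It is not Spark's optimizer; the `("map", fn)` / `("filter", fn)` representation and the names `fuse` and `run` are assumptions made for this example.

```python
def fuse(operators):
    """Combine adjacent operators of the same kind into a single operator."""
    fused = []
    for kind, fn in operators:
        if fused and fused[-1][0] == kind:
            prev = fused[-1][1]
            if kind == "map":
                # Two maps become one map applying both functions in order.
                combined = lambda x, f=prev, g=fn: g(f(x))
            else:
                # Two filters become one filter requiring both predicates.
                combined = lambda x, f=prev, g=fn: f(x) and g(x)
            fused[-1] = (kind, combined)
        else:
            fused.append((kind, fn))
    return fused


def run(operators, data):
    """Execute the operator chain; each operator makes one pass over the data."""
    for kind, fn in operators:
        if kind == "map":
            data = [fn(x) for x in data]
        else:
            data = [x for x in data if fn(x)]
    return data
```

A chain of two maps followed by two filters fuses into two operators and produces the same result as the unfused chain, with half as many passes over the data.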

GraphX
• Component for graphs and graph-parallel computation.
• Extends the Spark RDD by introducing a new Graph abstraction.
• Graph algorithms include PageRank, Connected Components, and Triangle Counting.
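
To show what one of these graph algorithms computes, here is a small single-machine sketch of connected components in plain Python. It repeatedly propagates the smallest vertex id along each edge until nothing changes, so every vertex ends up labelled with the lowest id in its component. This is not GraphX code and says nothing about how GraphX distributes the work; `connected_components` and its arguments are hypothetical names for this illustration.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    label = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            # Edges are treated as undirected: both ends take the lower label.
            low = min(label[a], label[b])
            if label[a] != low or label[b] != low:
                label[a] = label[b] = low
                changed = True
    return label
```

For vertices 1 to 5 with edges (3, 2), (2, 1), and (5, 4), vertices 1, 2, and 3 share one label and vertices 4 and 5 share another.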

Support for DataFrames

DataFrame
Inspired by DataFrames in R and Python (Pandas).
The DataFrame API is designed to make big data processing on tabular data easier.
A DataFrame is a distributed collection of data organized into named columns.
Provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.
Can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.
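
The filter, group, and aggregate operations mentioned above can be pictured on a tiny in-memory table of rows with named columns. The sketch is plain Python and deliberately avoids the Spark API; the column names, the sample rows, and the helpers `filter_rows` and `group_avg` are all made up for illustration. A real DataFrame distributes the rows across a cluster.

```python
from collections import defaultdict

# Hypothetical sample data: each row is a dict keyed by column name.
ROWS = [
    {"name": "Ann", "dept": "eng", "salary": 120},
    {"name": "Bob", "dept": "eng", "salary": 100},
    {"name": "Cid", "dept": "ops", "salary": 80},
    {"name": "Dee", "dept": "ops", "salary": 40},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [r for r in rows if predicate(r)]


def group_avg(rows, key, value):
    """Group rows by the key column and average the value column."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for r in rows:
        totals[r[key]][0] += r[value]
        totals[r[key]][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


def average_salary_by_dept(rows, min_salary):
    """Filter on one named column, then group and aggregate on others."""
    kept = filter_rows(rows, lambda r: r["salary"] >= min_salary)
    return group_avg(kept, "dept", "salary")
```

Operations refer to columns by name rather than by position, which is what makes this style of API convenient for tabular data.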

DataFrame features
Ability to scale from KBs to PBs
Support for a wide array of data formats and storage systems
State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
Seamless integration with big data tooling and infrastructure via Spark
APIs for Python, Java, Scala, and R

Spark’s Integration with Hadoop

Spark Execution Platforms
• Spark can leverage YARN, the resource negotiator of the Hadoop framework.
• Spark workloads can make use of Symphony scheduling policies and execute via YARN.
Spark execution modes: Standalone, Mesos, YARN (on Hadoop, alongside HDFS storage)

Spark in one Snapshot

Spark Use Cases
Different companies are using Spark to solve various problems, e.g. recommendation systems,
business intelligence, and fraud detection.

Who is using Spark?
A complete list of companies using Spark can be found here: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
