4. Spark Context
SparkContext is the entry point to any Spark functionality.
As soon as you run a Spark application, a driver program
starts; it runs the main function and initializes the
SparkContext.
The driver program then runs the operations inside the
executors on the worker nodes.
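A minimal sketch of such a driver program, assuming PySpark is installed (the app name and local master below are example values, not from the deck):
from pyspark import SparkConf, SparkContext
# Driver program: its main function creates the SparkContext.
conf = SparkConf().setAppName("ExampleApp").setMaster("local[*]")
sc = SparkContext(conf=conf)
# Operations issued here are executed by executors on the worker nodes
# (or local threads when running with a local master).
print(sc.parallelize(range(100)).count())   # 100
sc.stop()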
5. Spark Context
SparkContext uses Py4J to launch a JVM and create a
JavaSparkContext.
Spark supports Scala, Java and Python. PySpark is the
library to install in order to write Spark code in Python.
The PySpark shell provides a default SparkContext (sc).
This can be used to read a local file from the system and
process it with Spark.
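For example, inside the pyspark shell the pre-built sc can read and inspect a local file directly (the path is a placeholder):
lines = sc.textFile("file:///tmp/sample.txt")   # hypothetical local file
print(lines.count())                            # number of lines in the file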
8. SparkShell
A simple interactive REPL (Read-Eval-Print Loop).
Provides a simple way to connect to Spark and analyze
data interactively.
Can be started with the pyspark or spark-shell command
in a terminal; the former supports Python-based programs
and the latter supports Scala-based programs. A short
example session is sketched below.
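A quick interactive session in the pyspark shell might look like this (sc is already defined by the shell; the numbers are arbitrary):
rdd = sc.parallelize(range(10))              # distribute a small list
rdd.filter(lambda x: x % 2 == 0).collect()   # evaluated in the REPL: [0, 2, 4, 6, 8]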
10. Features
Runs programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk.
DAG engine – a directed acyclic graph is created that
optimizes workflows.
Big players such as Amazon, eBay and NASA's Deep Space
Network use Spark.
Built around one main concept: the Resilient Distributed
Dataset (RDD).
12. RDD – Resilient Distributed Datasets
This is the core object around which Spark revolves,
including Spark SQL, MLlib, etc.
Conceptually similar to a pandas DataFrame, but
distributed across a cluster.
RDDs can be processed on a standalone system or on a cluster.
They are created by the SparkContext object.
13. Creating RDDs
nums = sc.parallelize([1, 2, 3, 4])
lines = sc.textFile("file:///users/....txt")
Or from s3n:// or hdfs:// paths.
hiveCtx = HiveContext(sc)
Can also be created from
JDBC, HBase, JSON, CSV, etc.
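As a rough sketch of the Hive route (the table name is a placeholder, and HiveContext follows the Spark 1.x API used in this deck):
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext("local[*]", "CreateRDDs")
hiveCtx = HiveContext(sc)
df = hiveCtx.sql("SELECT * FROM some_table")   # hypothetical Hive table
rdd = df.rdd                                   # underlying RDD of Row objects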
15. RDD actions
collect
count
countByValue
reduce
etc.
Nothing actually happens in the driver program until an
action is called – lazy evaluation.
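A tiny sketch of lazy evaluation (arbitrary numbers): the map below is only recorded; the computation runs when the reduce action is called.
rdd = sc.parallelize([1, 2, 3, 4])
squared = rdd.map(lambda x: x * x)            # transformation: nothing executes yet
total = squared.reduce(lambda a, b: a + b)    # action: triggers the actual computation
print(total)                                  # 30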