4. Spark Context
SparkContext is the entry point to any Spark functionality.
As soon as you run a Spark application, a driver program
starts; it runs the main function and initializes the
SparkContext.
The driver program then runs the operations inside the
executors on the worker nodes.
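A minimal sketch of such a driver program, assuming PySpark is installed (the app name and local master below are example values, not from the deck):
from pyspark import SparkConf, SparkContext
# Driver program: its main function creates the SparkContext.
conf = SparkConf().setAppName("ExampleApp").setMaster("local[*]")
sc = SparkContext(conf=conf)
# Operations issued here are executed by executors on the worker nodes
# (or local threads when running with a local master).
print(sc.parallelize(range(100)).count())   # 100
sc.stop()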
5. Spark Context
SparkContext uses Py4J to launch a JVM and create a
JavaSparkContext.
Spark supports Scala, Java and Python. PySpark is the
library to install in order to write Spark code in Python.
The PySpark shell provides a default SparkContext (sc).
This can be used to read a local file from the system and
process it with Spark.
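For example, inside the pyspark shell the pre-built sc can read and inspect a local file directly (the path is a placeholder):
lines = sc.textFile("file:///tmp/sample.txt")   # hypothetical local file
print(lines.count())                            # number of lines in the file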
8. SparkShell
A simple interactive REPL (Read-Eval-Print Loop).
Provides a simple way to connect to Spark and analyze
data interactively.
Can be started with the pyspark or spark-shell command
in a terminal; the former supports Python-based programs
and the latter supports Scala-based programs. A short
example session is sketched below.
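A quick interactive session in the pyspark shell might look like this (sc is already defined by the shell; the numbers are arbitrary):
rdd = sc.parallelize(range(10))              # distribute a small list
rdd.filter(lambda x: x % 2 == 0).collect()   # evaluated in the REPL: [0, 2, 4, 6, 8]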
10. Features
Runs programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk.
DAG engine – a directed acyclic graph is created that
optimizes workflows.
Big players such as Amazon, eBay and NASA's Deep Space
Network use Spark.
Built around one main concept: the Resilient Distributed
Dataset (RDD).
12. RDD – Resilient Distributed Datasets
This is the core object around which Spark revolves,
including Spark SQL, MLlib, etc.
Conceptually similar to a pandas DataFrame, but
distributed across a cluster.
RDDs can be processed on a standalone system or on a cluster.
They are created by the SparkContext object.
13. Creating RDDs
nums = sc.parallelize([1, 2, 3, 4])
lines = sc.textFile("file:///users/....txt")
Or from s3n:// or hdfs:// paths.
hiveCtx = HiveContext(sc)
Can also be created from
JDBC, HBase, JSON, CSV, etc.
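As a rough sketch of the Hive route (the table name is a placeholder, and HiveContext follows the Spark 1.x API used in this deck):
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext("local[*]", "CreateRDDs")
hiveCtx = HiveContext(sc)
df = hiveCtx.sql("SELECT * FROM some_table")   # hypothetical Hive table
rdd = df.rdd                                   # underlying RDD of Row objects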
15. RDD actions
collect
count
countByValue
reduce
etc.
Nothing actually happens in the driver program until an
action is called – lazy evaluation.
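A tiny sketch of lazy evaluation (arbitrary numbers): the map below is only recorded; the computation runs when the reduce action is called.
rdd = sc.parallelize([1, 2, 3, 4])
squared = rdd.map(lambda x: x * x)            # transformation: nothing executes yet
total = squared.reduce(lambda a, b: a + b)    # action: triggers the actual computation
print(total)                                  # 30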