Introduction to Apache Spark
Spark - Intro
 A fast and general engine for large-scale data processing.
 Scalable architecture
 Works with a cluster manager (such as YARN)
Spark Context
 SparkContext is the entry point to any Spark functionality.
 As soon as you run a Spark application, a driver program starts; it runs the main function, and the SparkContext is initiated there.
 The driver program runs operations inside the executors on the worker nodes.
Spark Context
 SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
 Spark supports Scala, Java and Python. PySpark is the library to install to write Spark programs in Python.
 The PySpark shell provides a default SparkContext. This makes it easy to read a local file from the system and process it using Spark.
Sample program
SparkShell
 Simple interactive REPL (Read-Eval-Print-Loop).
 Provides a simple way to connect and analyze data
interactively.
 Can be started using the pyspark or spark-shell command in the terminal. The former supports Python-based programs and the latter supports Scala-based programs.
Features
 Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
 DAG engine: a directed acyclic graph is created that optimizes workflows.
 Big players such as Amazon, eBay and the NASA Deep Space Network use Spark.
 Built around one main concept: Resilient Distributed
Dataset (RDD).
Components of Spark
RDD – Resilient Distributed Datasets
 This is the core object around which Spark revolves, including Spark SQL, MLlib, etc.
 Similar to pandas DataFrames.
 RDDs can run on standalone systems or on a cluster.
 Created by the SparkContext object.
Creating RDDs
 nums = sc.parallelize([1, 2, 3, 4])
 sc.textFile("file:///users/....txt")
 Or from s3n:// or hdfs://
 hiveCtx = HiveContext(sc)
 Can also be created from
 JDBC, HBase, JSON, CSV, etc.
Operations on RDDs
 Map
 Filter
 Distinct
 Sample
 Union, intersection, subtract, cartesian
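As a rough plain-Python analogy (an illustration of the semantics only, not PySpark code), the transformations above behave like these list operations:

```python
# Plain-Python analogues of the RDD transformations listed above.
a = [1, 2, 2, 3]
b = [3, 4]

mapped = [x * 2 for x in a]                  # rdd.map(lambda x: x * 2)
filtered = [x for x in a if x % 2 == 0]      # rdd.filter(lambda x: x % 2 == 0)
distinct = list(set(a))                      # rdd.distinct()
union = a + b                                # rdd1.union(rdd2) keeps duplicates
intersection = list(set(a) & set(b))         # rdd1.intersection(rdd2)
subtracted = [x for x in a if x not in b]    # rdd1.subtract(rdd2)
cartesian = [(x, y) for x in a for y in b]   # rdd1.cartesian(rdd2): all pairs
```

Each of these returns a new collection, just as every RDD transformation returns a new RDD rather than mutating the original.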
RDD actions
 collect
 count
 countByValue
 reduce
 Etc...
 Nothing actually happens in the driver program until an action is called: lazy evaluation.
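The lazy-evaluation idea can be sketched in plain Python (a toy model, not Spark's actual implementation): transformations only record the work to be done, and an action triggers it.

```python
# Toy model of lazy evaluation: transformations are recorded, not executed,
# until an action (collect/count) forces the pipeline to run.
class LazyRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # pending transformations, not yet applied

    def map(self, f):                 # transformation: returns a new lazy RDD
        return LazyRDD(self.data, self.ops + [("map", f)])

    def filter(self, f):              # transformation: also lazy
        return LazyRDD(self.data, self.ops + [("filter", f)])

    def collect(self):                # action: now the work actually happens
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):                  # action built on top of collect
        return len(self.collect())

rdd = LazyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())   # the pipeline runs here, not at map/filter time
```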
