Introduction to Apache Spark
Contents
1 Introduction to Spark
2 Resilient Distributed Datasets
3 RDD Operations
4 Workshop
1. Introduction
What is Apache Spark?
● Extends MapReduce
● Cluster computing platform
● Runs in memory
Why Spark
Fast
❏ Up to 10x faster than Hadoop MapReduce on disk
❏ Up to 100x faster in memory
Ease of development
❏ Easy code
❏ Interactive shell
Unified Stack
❏ Batch
❏ Streaming
Multi Language Support
❏ Scala, Python, Java, R
Deployment Flexibility
❏ Deployment: Mesos, YARN, standalone, local
❏ Storage: HDFS, S3, local FS
Rise of the data center
Huge amounts of data spread out
across many commodity servers
MapReduce
lots of data → scale out
Data Processing Requirements
Network bottleneck → Distributed Computing
Hardware failure → Fault Tolerance
Abstraction to organize parallelizable tasks
MapReduce
Input → Split → Map → [Combine] → Shuffle & Sort → Reduce → Output
Input:
AA BB AA
AA CC DD
AA EE DD
BB FF AA
Split (one line per split):
AA BB AA
AA CC DD
AA EE DD
BB FF AA
Map:
(AA, 1) (BB, 1) (AA, 1)
(AA, 1) (CC, 1) (DD, 1)
(AA, 1) (EE, 1) (DD, 1)
(BB, 1) (FF, 1) (AA, 1)
[Combine]:
(AA, 2) (BB, 1)
(AA, 1) (CC, 1) (DD, 1)
(AA, 1) (EE, 1) (DD, 1)
(BB, 1) (FF, 1) (AA, 1)
Shuffle & Sort:
(AA, 2) (AA, 1) (AA, 1) (AA, 1)
(BB, 1) (BB, 1)
(CC, 1)
(DD, 1) (DD, 1)
(EE, 1)
(FF, 1)
Reduce:
(AA, 5) (BB, 2) (CC, 1) (DD, 2) (EE, 1) (FF, 1)
Output:
AA, 5
BB, 2
CC, 1
DD, 2
EE, 1
FF, 1
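The word count stages above can be sketched with plain Scala collections. This is only an illustration of the data flow, not Hadoop or Spark code: the object and function names (WordCountSketch, mapPhase, sumByKey, wordCount) are made up for this example, and everything runs in a single process.

```scala
object WordCountSketch {
  // The four input splits from the slide.
  val splits: List[String] = List("AA BB AA", "AA CC DD", "AA EE DD", "BB FF AA")

  // Map: emit (word, 1) for every word in a split.
  def mapPhase(split: String): List[(String, Int)] =
    split.split(" ").toList.map(word => (word, 1))

  // Sum the counts per word. In word count the same function can serve
  // as the combiner (per split) and as the reducer (over all splits).
  def sumByKey(pairs: List[(String, Int)]): List[(String, Int)] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }.toList

  def wordCount(inputs: List[String]): List[(String, Int)] = {
    // Map + combine: each split is pre-aggregated on its own.
    val combined = inputs.flatMap(split => sumByKey(mapPhase(split)))
    // Shuffle & sort + reduce: group all partial counts by word, sum, sort by key.
    sumByKey(combined).sortBy(_._1)
  }
}
```

Calling WordCountSketch.wordCount(WordCountSketch.splits) yields the same six pairs as the Output column of the slide.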
Spark Components
Driver Program
  SparkContext
Cluster Manager
Worker Node
  Executor
    Task, Task
Worker Node
  Executor
    Task, Task
Spark Components
SparkContext
● Main entry point for Spark functionality
● Represents the connection to a Spark cluster
● Tells Spark how and where to access a cluster
● Can be used to create RDDs, accumulators and
broadcast variables on that cluster
Driver program
● Process that runs the application's main program and creates the
SparkContext object, which coordinates the application's processes on the cluster
● Lets you configure the Spark application with specific parameters
● Spark actions are triggered from the driver, which receives their results;
the tasks themselves run on the executors
● Example: spark-shell (the interactive shell acts as a driver program)
● Application → driver program + executors
Spark Components
Cluster Manager
● External service for acquiring resources on the cluster
● Variety of cluster managers
○ Local
○ Standalone
○ YARN
○ Mesos
● Deploy mode:
○ Cluster → framework launches the driver inside of the cluster
○ Client → submitter launches the driver outside of the cluster
Spark Components
Worker Node
● Any node that can run application code in the cluster
● Key Terms
○ Executor: A process launched for an application on a worker node, that runs tasks and
keeps data in memory or disk storage across them. Each application has its own executors.
○ Task: Unit of work that will be sent to one executor
○ Job: A parallel computation consisting of multiple tasks that gets spawned in response to a
Spark action (e.g. save, collect)
○ Stage: Each job is divided into smaller sets of tasks, called stages, that depend on each other
2. Resilient Distributed Datasets
RDD
Resilient Distributed Datasets
● Collection of objects that is distributed across
nodes in a cluster
● Data operations are performed on RDDs
● Once created, RDDs are immutable
● RDDs can be persisted in memory or on disk
● Fault tolerant
Example: numbers = RDD[1,2,3,4,5,6,7,8,9,10], distributed across three executors
Worker Node, Executor: [1,5,6,9]
Worker Node, Executor: [2,7,8]
Worker Node, Executor: [3,4,10]
RDD
Main Concepts
● Lazy Evaluation
● Operation: Transformation / Action
● Lineage
● Base RDD
● Partition
● Task
● Level of Parallelism
RDD
Internally, each RDD is characterized by five main properties:
● A list of partitions
● A function for computing each split
● A list of dependencies on other RDDs
● Optionally, a Partitioner for key-value RDDs
● Optionally, a list of preferred locations to compute each split on

Method             Location   Input       Output
getPartitions()    Driver     -           [Partition]
compute()          Worker     Partition   Iterable
getDependencies()  Driver     -           [Dependency]
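To make the three mandatory properties more concrete, here is a toy model in plain Scala. It is a sketch under loud assumptions: ToyRDD, ToyPartition, ToyCollectionRDD and ToyRddDemo are hypothetical names invented for this example, they are not Spark's classes, and everything runs in one JVM instead of on a driver and workers.

```scala
// Toy model only: these are NOT Spark's classes.
final case class ToyPartition(index: Int)

abstract class ToyRDD[T] {
  def getPartitions: Seq[ToyPartition]          // would be evaluated on the driver
  def getDependencies: Seq[ToyRDD[_]]           // parent RDDs, i.e. the lineage
  def compute(split: ToyPartition): Iterator[T] // would run on a worker, once per partition

  // A transformation: keeps the parent's partitions and remembers the parent.
  def map[U](f: T => U): ToyRDD[U] = {
    val parent = this
    new ToyRDD[U] {
      def getPartitions: Seq[ToyPartition] = parent.getPartitions
      def getDependencies: Seq[ToyRDD[_]] = Seq(parent)
      def compute(split: ToyPartition): Iterator[U] = parent.compute(split).map(f)
    }
  }

  // An action: computes every partition and gathers the results.
  def collect(): List[T] = getPartitions.toList.flatMap(p => compute(p).toList)
}

// A base RDD backed by an in-memory collection that is already sliced into partitions.
class ToyCollectionRDD[T](data: Seq[Seq[T]]) extends ToyRDD[T] {
  def getPartitions: Seq[ToyPartition] = data.indices.map(i => ToyPartition(i))
  def getDependencies: Seq[ToyRDD[_]] = Seq.empty
  def compute(split: ToyPartition): Iterator[T] = data(split.index).iterator
}

object ToyRddDemo {
  // Same three partitions as in the "numbers" example.
  val base = new ToyCollectionRDD(Seq(Seq(1, 5, 6, 9), Seq(2, 7, 8), Seq(3, 4, 10)))
  val doubled = base.map(_ * 2)
}
```

In this model, a lost partition of doubled could be rebuilt by calling compute again on that partition alone, because the dependency on base is recorded. That is the idea behind lineage-based fault tolerance, shown here in a much simplified form.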
RDD
Creating RDDs
● Text file: a file or set of files (local or distributed)
val textFile = sc.textFile("README.md")
● Collection: data already in memory
val input = sc.parallelize(List(1, 2, 3, 4))
● Database: Spark loads and writes data with a database
val casRdd = sc.newAPIHadoopRDD(
  job.getConfiguration(),
  classOf[ColumnFamilyInputFormat],
  classOf[ByteBuffer],
  classOf[SortedMap[ByteBuffer, IColumn]])
● Transformation: from another RDD
val input = rddFather.map(value => value.toString)
RDD
Data Operations
RDD → (Transformations) → RDD → (Action) → Value
3. RDD Operations
Data Operations
Transformations
❏ Create a new dataset from an existing one
❏ Lazily evaluated (a transformed RDD is
computed only when an action runs on it)
❏ Examples: filter(), map(), flatMap()
Actions
❏ Return a value to the driver program after a
computation on the dataset
❏ Examples: count(), reduce(), take(), collect()
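Lazy evaluation can be imitated with a Scala collection view, which also records operations without running them. This is an analogy rather than Spark code: LazyEvaluationSketch, evaluations and runAction are names made up for the sketch, and a view is local to one process.

```scala
object LazyEvaluationSketch {
  // Counts how many times the mapping function actually runs.
  var evaluations = 0

  val source: List[Int] = List(1, 2, 3, 3)

  // "Transformations": building the view records the recipe but computes nothing yet.
  val transformed = source.view
    .map { x => evaluations += 1; x + 1 }
    .filter(_ != 2)

  // "Action": forcing the view into a List finally runs the whole pipeline.
  def runAction(): List[Int] = transformed.toList
}
```

Before runAction() is called, evaluations is still 0; afterwards it equals the number of source elements. Each further call runs the pipeline again, much like an RDD that has not been persisted is recomputed for every action.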
Transformations
Commonly Used Transformations
● map(func): Return a new distributed dataset formed by passing each
element of the source through a function func
● filter(func): Return a new dataset formed by selecting those elements of the
source on which func returns true
● flatMap(func): Similar to map, but each input item can be mapped to 0 or
more output items (so func should return a Seq rather than a single item)
● distinct(): Return a new dataset that contains the distinct elements of the
source dataset
Map(func)
rdd = {1, 2, 3, 3}
rdd.map(x => x + 1) → {2, 3, 4, 4}
Filter(func)
rdd = {1, 2, 3, 3}
rdd.filter(x => x != 1) → {2, 3, 3}
flatMap(func)
rdd = {1, 2, 3, 3}
rdd.flatMap(x => x.to(3)) → {1, 2, 3, 2, 3, 3, 3}
Distinct
rdd = {1, 2, 3, 3}
rdd.distinct() → {1, 2, 3}
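The four transformations can be tried without a cluster, because Scala's List offers methods with the same names. The sketch below uses a local List as a stand-in for the RDD, so it shows the element-level semantics only; TransformationsSketch and its value names are assumptions made for this example. In spark-shell, the equivalent starting point would be sc.parallelize(List(1, 2, 3, 3)).

```scala
object TransformationsSketch {
  // Local stand-in for the RDD {1, 2, 3, 3}.
  val rdd: List[Int] = List(1, 2, 3, 3)

  // map: exactly one output element per input element.
  val mapped: List[Int] = rdd.map(x => x + 1)

  // filter: keep the elements for which the predicate is true.
  val filtered: List[Int] = rdd.filter(x => x != 1)

  // flatMap: each element expands to the range x to 3, and the ranges are flattened.
  val flatMapped: List[Int] = rdd.flatMap(x => x.to(3))

  // distinct: drop duplicates (a List keeps first-seen order; an RDD gives no order guarantee).
  val distinctValues: List[Int] = rdd.distinct
}
```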
Operations of mathematical sets
● union(otherRDD): Return a new RDD that contains the union of the elements in
the source dataset and the argument
● intersection(otherRDD): Return a new RDD that contains the intersection of elements in
the source dataset and the argument
Union
rdd = {1, 2, 3}
other = {3, 4, 5}
rdd.union(other) → {1, 2, 3, 3, 4, 5}
Intersection
rdd = {1, 2, 3}
other = {3, 4, 5}
rdd.intersection(other) → {3}
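A local sketch of the two set-like operations, again with plain Scala Lists standing in for RDDs (SetOperationsSketch and its value names are assumptions made for this example). Note that the union of RDDs keeps duplicates, which is why 3 appears twice in the result, while the intersection of RDDs contains no duplicates.

```scala
object SetOperationsSketch {
  // Local stand-ins for the two RDDs.
  val rdd: List[Int] = List(1, 2, 3)
  val other: List[Int] = List(3, 4, 5)

  // Like RDD.union: plain concatenation, duplicates are kept.
  val unioned: List[Int] = rdd ++ other

  // Like RDD.intersection: common elements only, without duplicates.
  val intersected: List[Int] = rdd.intersect(other).distinct
}
```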
Actions
Commonly Used Actions
● count(): Returns the number of elements in the dataset
● reduce(func): Aggregate the elements of the dataset using a function func
(which takes two arguments and returns one). The function
should be commutative and associative so that it can be
computed correctly in parallel
● collect(): Return all the elements of the dataset as an array at the driver
program. This is usually useful after a filter or other operation
that returns a sufficiently small subset of the data
● take(n): Returns an array with the first n elements of the dataset
● first(): Returns the first element of the dataset
● takeOrdered(n, [ordering]): Returns the first n elements of the RDD using either
their natural order or a custom ordering
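Why reduce needs a commutative and associative function can be seen with a small simulation. The sketch assumes a hypothetical split of {1, 2, 3, 3} into two partitions and reduces each partition separately before merging the partial results; ReduceSketch and parallelReduce are names invented here, not Spark API.

```scala
object ReduceSketch {
  // Hypothetical partitioning of {1, 2, 3, 3} across two executors.
  val partitions: List[List[Int]] = List(List(1, 2), List(3, 3))

  // Reduce every partition on its own, then merge the partial results
  // with the same function.
  def parallelReduce(parts: List[List[Int]])(f: (Int, Int) => Int): Int =
    parts.map(_.reduce(f)).reduce(f)

  // Addition is commutative and associative: the partitioning does not matter.
  val sequentialSum: Int = partitions.flatten.reduce(_ + _)  // ((1 + 2) + 3) + 3
  val parallelSum: Int = parallelReduce(partitions)(_ + _)   // (1 + 2) + (3 + 3)

  // Subtraction is neither: the result depends on how the data is split.
  val sequentialDiff: Int = partitions.flatten.reduce(_ - _) // ((1 - 2) - 3) - 3
  val parallelDiff: Int = parallelReduce(partitions)(_ - _)  // (1 - 2) - (3 - 3)
}
```

Both sums agree, but the two differences do not. With a function like subtraction, the result of a distributed reduce would depend on the partitioning.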
Count()
rdd = {1, 2, 3, 3}
rdd.count() → 4
Reduce(func)
rdd = {1, 2, 3, 3}
rdd.reduce((x, y) => x + y) → 9
Collect()
rdd = {1, 2, 3, 3}
rdd.collect() → {1, 2, 3, 3}
Take(n)
rdd = {1, 2, 3, 3}
rdd.take(2) → {1, 2}
first()
rdd = {1, 2, 3, 3}
rdd.first() → 1
takeOrdered(n, [ordering])
rdd = {1, 2, 3, 3}
rdd.takeOrdered(2)(myOrdering) → {3, 3}
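The action examples can be reproduced locally with a Scala List standing in for the RDD. This is a sketch of the results only, since nothing is distributed. ActionsSketch and its value names are assumptions, and myOrdering is assumed to be a descending ordering because that is what the {3, 3} result on the slide implies.

```scala
object ActionsSketch {
  // Local stand-in for the RDD {1, 2, 3, 3}.
  val rdd: List[Int] = List(1, 2, 3, 3)

  val count: Int = rdd.size                       // like rdd.count()
  val total: Int = rdd.reduce((x, y) => x + y)    // like rdd.reduce(...)
  val collected: Array[Int] = rdd.toArray         // like rdd.collect()
  val firstTwo: List[Int] = rdd.take(2)           // like rdd.take(2)
  val first: Int = rdd.head                       // like rdd.first()

  // Assumed custom ordering: largest values first.
  val myOrdering: Ordering[Int] = Ordering[Int].reverse

  // Like rdd.takeOrdered(2)(myOrdering): sort with the ordering, keep the first two.
  val topTwo: List[Int] = rdd.sorted(myOrdering).take(2)
}
```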
4. Workshop
To practice the main concepts, please complete the exercises
proposed in our GitHub repository by following this link:
○ Homework
THANKS!
Any questions?
@datiobd
datio-big-data
Special thanks to Stratio for its theoretical contribution
academy@datiobd.com
