Apache Spark for Beginners
Basic Guide
PREPARED BY BORCELLE
BY RAHUL BAGUL ( AZURE DATA ENGINEER )
TABLE OF CONTENT

1. THE GENESIS OF SPARK

2. HADOOP
Introduction
Basic overview
Hive

3. UNDERSTANDING DATA LAKE LANDSCAPE
Basic details about data warehouse
Data Lake Architecture

4. APACHE SPARK & ITS ECO-SYSTEM
History of Spark
Key Features
Spark Eco-system & its components : Storage & cluster manager, Spark Core, set of libraries

5. SPARK ARCHITECTURE & EXECUTION MODEL
Resilient Distributed Dataset (RDD) : Transformations, Actions, Types of transformations (Narrow & Wide), Lazy Evaluation
Directed Acyclic Graph (DAG)
Components of Spark Application Architecture : Spark Application, SparkSession/SparkContext, Job, Stage, Task, Driver, Executor, Cluster Manager, types of Cluster Manager
Execution of Spark Application
Spark Execution Modes : Local, Client, Cluster

6. SPARK DATABASES, TABLES & VIEWS
Tables in Spark : Managed, Unmanaged
Views in Spark : Global Temporary View, Temporary View
Chapter- 1
The Genesis of Spark

BACKGROUND

Increased consumer traffic, a variety of new forms of data and greater computational demands created the need for more storage and better performance. Traditional data storage methods, including relational database management systems (RDBMSs) and imperative programming techniques, were unable to handle the enormous amounts of data and their processing.

Any solution had to cover four capabilities -
Data collection and ingestion
Data storage and management
Data processing and transformation
Data access and retrieval

Google was the first to overcome these problems. It published a sequence of white papers to solve these issues -
1st paper : Google File System (GFS) - 2003, solving the data storage and management problem
2nd paper : MapReduce (MR) - 2004, solving the data processing and transformation problem

( Fig. 1 : Google White papers )
Chapter- 2
Hadoop

The Google white papers were highly appreciated by the open source community and served as the inspiration for the design and development of a comparable open source implementation, called Hadoop.

( Fig. 2 : Hadoop System - HDFS (Hadoop Distributed File System) as the storage layer and Hadoop MapReduce as the compute engine )
Hadoop is an open-source software framework for storing and processing large amounts of data in a distributed computing environment.
It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.
Its framework is based on Java programming.
It lets you start with small clusters and expand the cluster size as you grow.
It combines the storage capacity of hundreds to thousands of computers and uses it as a unified storage system.
Basic Overview :

( Fig. 3 : Hadoop basic working )

HDFS : It allowed us to form a cluster of computers and use their combined capacity for storing our data.
MR : It solved the compute capacity problem by implementing a distributed computing framework over HDFS.
Hadoop allowed us to break data processing jobs into smaller tasks and use the cluster to finish the individual tasks.
It then combines the outputs of the different tasks and produces a single consolidated output.
Hive :

Many solutions have been developed on top of the Hadoop platform by various organizations.
Some of the widely adopted systems were Hive, Pig & HBase.
Apache Hive is the most widely adopted component of Hadoop.

Hive offered the following core capabilities on the Hadoop platform -
1. Create Databases, Tables and Views
2. Run SQL Queries

Bringing it all together, Hadoop as a platform and Hive as a database became very popular. But we still had other problems -

Performance - Hive SQL queries perform slower than RDBMS SQL queries
Ease of Development - writing MapReduce programs was difficult
Language Support - MapReduce was only available in Java
Storage - more expensive than cloud storage
Resource Management - only YARN containers were supported; other resource managers such as Mesos, Docker and Kubernetes could not be used

The point is, Hadoop left a lot of scope for improvement, and as a result Apache Spark came into existence...!
Chapter- 3
Understanding Data Lake Landscape

Before HDFS & MapReduce, we had data warehouses (like Teradata and Exadata) where data was brought in from many OLTP/OLAP systems.

( Fig. 4 : Basic DW flow - Source (Ingest) -> Data warehouse (DW) -> Destination (Consume) )
The challenges faced by data warehouses were as follows -
Vertical Scaling - adding more DW capacity was expensive
Large capital investment
Storage - not scalable
Support for structured data only

To overcome the above challenges, the Data Lake came into the picture with the following features -
Horizontal Scaling - adding more cheap servers to the clusters
Low capital investment
Storage - scalable (cloud storage)
Support for structured, unstructured and semi-structured data
The core capability of a data lake was storage, but with time it developed 4 important capabilities -
Data collection and ingestion (Ingest)
Data storage and management (Storage)
Data processing and transformation (Process)
Data access and retrieval (Consume)

Data Lake Architecture :

Let's analyze the below Data Lake architecture to understand the different layers.

( Fig. 5 : Data Lake Architecture - Ingest, Storage, Process and Consume layers )
Chapter- 4
Apache Spark & its Eco-system

BACKGROUND

Apache Spark is a unified analytics engine for large-scale distributed data processing and machine learning.
It is an open-source cluster computing framework which handles both batch data & streaming data.
Spark was designed as an improvement over the Hadoop MapReduce model.
Spark provides in-memory storage for intermediate computations, whereas alternative approaches like Hadoop's MapReduce write data to and from disk. As a result, Spark processes data much faster than these alternatives.

History of Spark :

Spark was initiated by Matei Zaharia at UC Berkeley's AMPLab in 2009 and was open sourced in 2010 under a BSD license. In 2013, the project was donated to the Apache Software Foundation, and in 2014 Spark emerged as a Top-Level Apache Project.

Key Features :

( Fig. 6 : Spark key Features )
Spark eco-system and its components :

The Spark project is made up of a variety of closely coupled components. At its core, Spark is a computational engine that can distribute, schedule, and monitor several applications.

( Fig. 7 : Spark Eco-system )

The Apache Spark ecosystem may be divided into three tiers, as indicated in the above diagram -
Storage and Cluster Manager
Spark Core
Set of Libraries
1) Storage and Cluster Manager :

Apache Spark is a distributed processing engine. However, it doesn't come with an inbuilt cluster resource manager or a distributed storage system.
There is a good reason behind that design decision. Apache Spark decoupled the functionality of the cluster resource manager, distributed storage and the distributed computing engine from the beginning.
This design allows us to use Apache Spark with any compatible cluster manager and storage solution. Hence, the storage and the cluster manager are part of the ecosystem, but they are not part of Apache Spark itself.
You can plug in a cluster manager and a storage system of your choice. There are multiple alternatives: you can use Apache YARN, Mesos, and even Kubernetes as a cluster manager for Apache Spark. Similarly, for the storage system, you can use HDFS, Amazon S3, Azure Data Lake, Google Cloud Storage, the Cassandra File System and many others.

2) Spark Core :

The Spark Core includes the computation engine for Spark. Basic functions such as memory management, job scheduling, fault recovery and, most crucially, communication with the cluster manager and storage system are provided by this compute engine.

Apache Spark Core contains two main components -
1) Spark Compute Engine
2) Spark Core APIs

Spark Compute Engine -
In order to give the user a smooth experience, the Spark compute engine manages and executes our Spark jobs. Simply submit your job to Spark, and the core of Spark does the rest.
Spark Core APIs -
The second part of Spark Core is the core API. Spark Core consists of two types of APIs -
a. Structured APIs
b. Unstructured APIs

The Structured APIs consist of DataFrames and Datasets. They are designed and optimized to work with structured data.
The Unstructured APIs are the lower-level APIs, including RDDs, Accumulators and Broadcast variables. These core APIs are available in Scala, Python, Java, and R.
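To see the two API styles side by side, here is a minimal PySpark sketch (not taken from this guide; the data and names are illustrative) that builds a DataFrame with the Structured API and an RDD with the lower-level API -

Example (PySpark) :

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("core-apis-demo").getOrCreate()

# Structured API : a DataFrame with a schema that Spark can optimize
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.filter(df.id > 1).show()

# Unstructured API : an RDD of plain Python objects, handled functionally
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())   # [2, 4, 6, 8]

spark.stop()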
3) Set of Libraries :

Apache Spark has a set of libraries and packages that make it a powerful big data processing framework. The set of libraries includes Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX.
These libraries provide different functionalities for data processing, analysis, and machine learning tasks.

Spark SQL - allows you to use SQL queries for structured data processing.
Spark Streaming - helps you to consume and process continuous data streams.
MLlib - a machine learning library that delivers high-quality algorithms.
GraphX - comes with a library of typical graph algorithms.

These libraries offer APIs, DSLs, and algorithms in multiple languages. They directly depend on the Spark Core APIs to achieve distributed processing.
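As a small illustration of the Spark SQL library (a sketch only; the table and column names are made up), a DataFrame can be registered as a temporary view and queried with plain SQL -

Example (PySpark) :

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

sales = spark.createDataFrame(
    [("book", 12.50), ("pen", 1.20), ("book", 8.75)], ["item", "price"])
sales.createOrReplaceTempView("sales")

# SQL and the DataFrame API share the same optimizer underneath
spark.sql("SELECT item, SUM(price) AS total FROM sales GROUP BY item").show()

spark.stop()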
Chapter- 5
Spark Architecture & Execution Model

BACKGROUND

Apache Spark works in a master-slave architecture, where the master is called the "Driver" and the slaves are called "Workers".
The master manages, maintains, and monitors the slaves, while the slaves are the actual workers that perform the processing tasks.
You tell the master what needs to be done, and the master takes care of the rest. It completes the task using its slaves.

Apache Spark Architecture is based on two main abstractions :
Resilient Distributed Dataset (RDD)
Directed Acyclic Graph (DAG)
Resilient Distributed Dataset (RDD) :

RDDs are the building blocks of any Spark application. RDD stands for -
Resilient : fault tolerant and capable of rebuilding data on failure
Distributed : data is distributed among the multiple nodes in a cluster
Dataset : a collection of partitioned data with values

When you load data into a Spark application, it creates an RDD which stores the loaded data.
An RDD is immutable, meaning that it cannot be modified once created, but it can be transformed at any time into a new RDD.
Every dataset in an RDD is divided into multiple logical partitions, which may be computed on different nodes of the cluster. This distribution is done by Spark, so users don't have to worry about computing the right distribution.
The data in an RDD is split into chunks based on a key, and those chunks of data are distributed across the cluster.

( Fig. 8 : RDD Partitioning )

With RDDs, you can perform two types of operations :

1) Transformations :
They are the operations that are applied to create a new RDD. A transformation produces a new RDD without altering the original data.
e.g. groupBy, sum, join, union, sort, etc.

2) Actions :
They are applied on an RDD to instruct Apache Spark to apply the computation and send the result back to the driver.
e.g. collect, count, reduce, write, show, etc.

( Fig. 9 : Workflow of RDD - Create RDD -> Transformations produce new RDDs -> Actions produce Results )
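The sketch below (illustrative data only) shows the two kinds of operations in PySpark: transformations build new RDDs, and actions bring results back to the driver -

Example (PySpark) :

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-ops-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5, 6])      # create an RDD

evens = numbers.filter(lambda x: x % 2 == 0)      # transformation -> new RDD
doubled = evens.map(lambda x: x * 2)              # transformation -> new RDD

print(doubled.collect())                          # action -> [4, 8, 12]
print(doubled.count())                            # action -> 3

spark.stop()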
Types of transformations :

1) Narrow transformations -
Narrow transformations are those for which each input partition contributes to only one output partition.
These transformations are performed on the individual partitions of an RDD in parallel.
Since they do not require shuffling of data between partitions, they perform more efficiently than wide transformations.
Narrow transformations can be performed on various data formats, such as unstructured, structured, and semi-structured data.

2) Wide transformations -
Wide transformations have input partitions contributing to many output partitions.
These transformations require shuffling data between partitions.
They are typically more complex and require more resources to perform.
Wide transformations typically require structured data in order to perform the necessary operations.

( Fig. 10 : Narrow & Wide Transformation )
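As a rough illustration (data and key names are made up), map is a narrow transformation while reduceByKey is a wide one, because values with the same key must be shuffled together -

Example (PySpark) :

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], numSlices=2)

# Narrow : each input partition feeds exactly one output partition
upper = pairs.map(lambda kv: (kv[0].upper(), kv[1]))

# Wide : same-key records are shuffled between partitions before summing
totals = upper.reduceByKey(lambda x, y: x + y)

print(totals.collect())   # e.g. [('A', 4), ('B', 6)] - order may vary

spark.stop()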
Lazy Evaluation :

Lazy Evaluation in Spark means Spark will not start executing the process until an ACTION is called.
Spark is not too concerned as long as we are only performing transformations on an RDD, DataFrame or Dataset.
Once Spark notices an ACTION being called, it looks at all the transformations, creates a DAG and executes it.

( Fig. 11 : Lazy Evaluation - transformations (T) produce new RDDs and are only recorded; they are finally executed once an action (A) is triggered )
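A quick way to see lazy evaluation in practice (a sketch, with made-up data): the transformations below are only recorded, and nothing runs until the action at the end -

Example (PySpark) :

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000000))
filtered = rdd.filter(lambda x: x % 7 == 0)   # recorded, not executed
mapped = filtered.map(lambda x: x * x)        # recorded, not executed

# only this ACTION triggers execution of the recorded transformations
print(mapped.take(3))   # [0, 49, 196]

spark.stop()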
Directed Acyclic Graph (DAG) :

A Directed Acyclic Graph is a finite directed graph that represents a sequence of computations performed on data.
Being acyclic, the data flow in a Spark DAG contains no cycles. Every Spark job creates a DAG of task stages that will be executed on the cluster.
Spark DAGs can contain many stages, unlike Hadoop MapReduce, which has only two predefined stages (map and reduce).
In a Spark DAG, the consecutive computation stages are used to optimize the execution plan.
Spark also uses the DAG (lineage) to achieve fault tolerance, since lost partitions can be recomputed from it.

( Fig. 12 : DAG visualization - a job split into stages built from operations such as parallelize, filter, map and join )
Components of Spark Application Architecture :

Spark Application :
A Spark application is a program built with the Spark APIs that runs in a Spark-compatible cluster/environment. It can be a PySpark script, a Java application, a Scala application, a SparkSession started by the spark-shell or spark-sql command, etc.
It consists of a driver process and executors.

SparkSession / SparkContext :
An object that provides the point of entry to interact with the underlying Spark functionality and allows programming Spark with its APIs.
It represents the connection to the Spark cluster. This class is how you communicate with some of Spark's lower-level APIs, such as RDDs.

Job :
A parallel computation consisting of multiple tasks that gets produced in response to a Spark action (e.g., save(), collect()).
Each job gets divided into smaller sets of tasks called stages that depend on each other.

Stage :
Stages in Spark represent groups of tasks that can be executed together to compute the same operation on multiple machines.

Task :
A single unit of work or execution that runs in a Spark executor. Each stage contains one or multiple tasks.
Each task maps to a single core and works on a single partition of data.
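A minimal sketch of these pieces in code (illustrative names only): the SparkSession is the entry point, the SparkContext sits underneath it, and every action submits a job that Spark breaks into stages and tasks (visible in the Spark UI) -

Example (PySpark) :

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("app-components-demo").getOrCreate()
sc = spark.sparkContext              # lower-level entry point (RDDs, etc.)

df = spark.range(0, 100)             # a simple DataFrame with an 'id' column
print(df.count())                    # this ACTION triggers one Spark job

print(sc.applicationId)              # identifies this Spark application
spark.stop()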
The Driver is the main program that runs on the master node and is
responsible for coordinating the entire Spark application. The Driver is
responsible for several tasks, including :
Managing the SparkContext: The Driver is responsible for creating
and managing the SparkContext, which is the main entry point for
a Spark application.
Breaking down the application into tasks: The Driver is responsible
for breaking down the application into a set of tasks that can be
executed in parallel on the worker nodes.
Scheduling tasks: The Driver is responsible for scheduling tasks to
worker nodes, based on the available resources and the
requirements of tasks.
Monitoring tasks: The Driver is responsible for monitoring the tasks
and making sure that they are executing correctly. If a task fails,
the Driver can reschedule it on a different node.
Gathering results: The Driver is responsible for gathering results of
the tasks and combining them to produce the final result of the
application.
The Driver is the central component of a Spark application, and it plays
a critical role in ensuring that the application runs correctly and
efficiently.
Driver :
( Fig. 13 : Jobs, stages & tasks distribution )
The Executor is a program that runs on the worker nodes and is
responsible for executing the tasks assigned by the Driver. The
Executor is responsible for several tasks, including:
Running tasks: The Executor is responsible for running the tasks
assigned by the Driver. Each Executor runs a set of tasks, and it can
run multiple tasks in parallel, based on the number of cores
available on the node.
Reporting status: The Executor is responsible for reporting the
status of the tasks to the Driver. The Driver uses this information
to monitor the progress of the application and make sure that it is
executing correctly.
Storing intermediate data: The Executor can store intermediate
data in memory, which can be used by other tasks. This allows
Spark to avoid shuffling data between nodes, which can be a
performance bottleneck.
Releasing resources: The Executor is responsible for releasing
resources when the tasks are completed, which allows Spark to
make the most efficient use of the available resources.
The Executor is a critical component of a Spark application, and it plays
a crucial role in executing the tasks and producing the final result.
Executor :
( Fig. 14 : Spark Application Diagram )
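Executor resources are normally requested when the application starts. The sketch below uses standard Spark configuration keys, but the values are purely illustrative and only take effect when running against a cluster manager -

Example (PySpark) :

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-config-demo")
    .config("spark.executor.instances", "4")   # how many executors to request
    .config("spark.executor.cores", "2")       # parallel tasks per executor
    .config("spark.executor.memory", "4g")     # memory available to each executor
    .getOrCreate()
)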
Cluster Manager :

The cluster manager is responsible for managing the resources required for a Spark application, including CPU, memory, and network resources. Its primary functions include :

Resource allocation: The cluster manager receives resource requests from the Spark driver and allocates the necessary resources, such as CPU cores and memory, to the application.
Executor management: The cluster manager launches and manages Spark executors on worker nodes, which are responsible for executing tasks and storing data.
Fault tolerance: The cluster manager monitors the health of the worker nodes and detects failures, ensuring the smooth execution of the application by reallocating resources and restarting failed tasks.
Node management: The cluster manager keeps track of the worker nodes' status and manages their lifecycle, handling node registration, de-registration, and decommissioning.

Types of cluster managers :

Standalone - a simple cluster manager included with Spark that makes it easy to set up a cluster.
Apache Mesos - a general cluster manager that can also run Hadoop MapReduce and service applications.
Hadoop YARN - the resource manager in Hadoop 2.
Kubernetes - an open-source system for automating the deployment, scaling, and management of containerized applications.
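The cluster manager is selected through the master URL when the session (or spark-submit) is configured. The host names and ports below are placeholders, shown only to illustrate the URL formats -

Example (PySpark) :

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("cluster-manager-demo")

# Pick one master depending on where the application should run :
#   .master("local[*]")                         # no cluster manager, single JVM
#   .master("spark://master-host:7077")         # Spark Standalone
#   .master("yarn")                             # Hadoop YARN
#   .master("mesos://mesos-host:5050")          # Apache Mesos
#   .master("k8s://https://k8s-api-host:6443")  # Kubernetes

spark = builder.master("local[*]").getOrCreate()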
Execution of Spark Application :

When the Driver Program in the Apache Spark architecture executes, it calls the real program of the application and creates a SparkContext. The SparkContext contains all of the basic functions. The Spark Driver also includes several other components, including a DAG Scheduler, Task Scheduler, Backend Scheduler, and Block Manager, all of which are responsible for translating user-written code into jobs that are actually executed on the cluster.

The Cluster Manager manages the execution of the various jobs in the cluster. The Spark Driver works in conjunction with the Cluster Manager to control the execution of various other jobs. The Cluster Manager does the task of allocating resources for the job. Once the job has been broken down into smaller jobs, which are then distributed to worker nodes, the Spark Driver controls the execution.

( Fig. 15 : Execution of Spark Application - the Driver Program with its SparkContext on the master node, the Cluster Manager, and Executors with caches running tasks on the worker (slave) nodes )

Whenever an RDD is created in the SparkContext, it can be distributed across many worker nodes and can also be cached there.
Worker nodes execute the tasks assigned by the Cluster Manager and return the results back to the SparkContext.
The executor is responsible for the execution of these tasks. The lifespan of an executor is the same as that of the Spark application. If we want to improve the performance of the system, we can increase the number of workers so that jobs can be divided into more coherent parts.
Spark Execution Modes :

There are 3 types of execution modes -
1. Local Mode
2. Client Mode
3. Cluster Mode

1) Local Mode :

This is similar to executing a program in a single JVM on someone's laptop or desktop.
It could be a program written in any supported language, such as Java, Scala or Python.
However, you should have defined and used a SparkContext object in these apps, imported the Spark libraries and processed data from your local file system.
This is called local mode because everything is done locally; there is no concept of a node, and nothing is done in a distributed manner.
A single JVM process is used to run both the driver and the executor.
For example, launching a spark-shell on your laptop is an example of the local mode of execution.
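For instance, the following sketch (values are illustrative) starts a local-mode session where the driver and executor share one JVM; "local[2]" asks for two worker threads, and "local[*]" would use all available cores -

Example (PySpark) :

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-mode-demo")
    .master("local[2]")
    .getOrCreate()
)

print(spark.sparkContext.master)   # -> local[2]
print(spark.range(10).count())     # -> 10, computed entirely in-process
spark.stop()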
2) Client Mode :

In client mode, the driver runs on the client machine (laptop/desktop), i.e., the driver is not part of the cluster. The executors, on the other hand, run within the cluster.
The driver connects to the cluster manager, starts all the executors on the cluster for interactive queries and receives the results back on the client.
In case of a problem with the client (local) machine, or if you log off, the driver goes down and subsequently all executors shut down on the cluster.
So the point is, in this mode, the entire program is dependent on the client (local) machine, since the driver is located there.
This mode is unsuitable for production environments and long-running queries. It remains useful for debugging and testing purposes.

( Fig. 16 : Client Mode Execution - the Driver on the client machine talks to the Cluster Manager (YARN) and the Executors inside the Spark cluster )
3) Cluster Mode :

In cluster mode, the driver and the executors both run inside the cluster.
The Spark job is submitted from your local machine to a cluster machine within the cluster. Such machines are usually called edge nodes.
In case of a problem with the client (local) machine, or if you log off, the driver is not impacted, as it is running on the cluster.
This means that the cluster manager is responsible for maintaining all Spark Application related processes.
This mode is useful for long-running queries and production environments.

( Fig. 17 : Cluster Mode Execution - both the Driver and the Executors run inside the Spark cluster, managed by the Cluster Manager (YARN) )
Chapter- 6
Spark Databases, Tables & Views

Apache Spark is not only a set of APIs and a processing engine; it can also act as a database in itself.
You can create databases in Spark. Once you have a database, you can create tables and views.

( Fig. 18 : Spark Database - a Spark database holds 1) Tables and 2) Views; table data (e.g. parquet, avro) lives in the Spark Warehouse, while table metadata (e.g. schema, datatype, location, partitions) lives in the catalog metastore )

Tables in Spark :

There are 2 types of SQL tables in Spark -
1. Managed (Internal) tables
2. Unmanaged (External) tables
1) Managed (Internal) Tables :

For Managed tables, Spark manages both the data and the metadata.
The table data is stored in the Spark SQL warehouse directory, which is the default storage for managed tables.
The metadata gets stored in a metastore of relational entities (including databases, Spark tables, and temporary views).
If we drop a managed table, Spark deletes both the data and the metadata. After dropping the table, we can neither query the table directly nor retrieve data from it.

Syntax :
CREATE TABLE internal_table
(id INT, FirstName String, LastName String) ;

( Fig. 19 : Spark Managed Tables - creating the table registers the metadata (schema, datatype, location, partitions) in the catalog metastore, and saving the table writes the data (e.g. parquet, avro) into the Spark Warehouse )
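A managed table can also be created from a DataFrame. The sketch below is illustrative (the DataFrame, names and data are made up); saveAsTable writes the data into the Spark SQL warehouse directory and registers the metadata in the metastore -

Example (PySpark) :

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("managed-table-demo")
    .enableHiveSupport()      # optional : use a persistent Hive metastore if available
    .getOrCreate()
)

people = spark.createDataFrame(
    [(1, "Amit", "Shah"), (2, "Neha", "Rao")], ["id", "FirstName", "LastName"])
people.write.mode("overwrite").saveAsTable("internal_table")

spark.sql("SELECT * FROM internal_table").show()
# DROP TABLE internal_table would remove both the data and the metadata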
2) Unmanaged (External) Tables :

For External tables, Spark manages only the metadata, and we have the flexibility to store the table data at our preferred location.
We need to specify the exact location where we wish to store the table or, alternatively, the source directory from which data will be pulled to create the table.
The metadata gets stored in a metastore of relational entities (including databases, Spark tables, and temporary views).
If we drop an external table, Spark deletes only the metadata; the underlying data remains as it is in its directory.

Syntax :
CREATE TABLE external_table
(id INT, FirstName String, LastName String)
LOCATION '/user/tmp/external_table' ;

( Fig. 20 : Spark Unmanaged Tables - the catalog metastore still holds the table metadata (schema, datatype, location, partitions), but the table data stays in external files at the specified location )
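An unmanaged table can likewise be created from a DataFrame by supplying an explicit path. This is a sketch only; the table, columns and location are placeholders -

Example (PySpark) :

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("external-table-demo").getOrCreate()

people = spark.createDataFrame(
    [(1, "Amit", "Shah"), (2, "Neha", "Rao")], ["id", "FirstName", "LastName"])

people.write.mode("overwrite") \
    .option("path", "/user/tmp/external_table") \
    .saveAsTable("external_table")

# dropping the table removes only the metadata; the files under
# /user/tmp/external_table are left as they are
spark.sql("DROP TABLE external_table")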
Views in Spark :

In the context of Apache Spark, "views" typically refer to SQL views or temporary tables that allow you to organize and work with your data using SQL queries.
Views provide a convenient way to abstract and manipulate your data without modifying the original data source. The common types of views in Apache Spark are :

1) Global Temporary View (GlobalTempView) :

A global temporary view is available to all Spark sessions and persists until the Spark application terminates.
It can be accessed from different Spark sessions running on the same cluster.
To create a global temporary view, you can use the createOrReplaceGlobalTempView method on a DataFrame or Dataset.

2) Temporary View (TempView) :

A temporary view is specific to the Spark session that creates it and exists only for the duration of that session.
It cannot be accessed from other Spark sessions.
To create a temporary view, you can use the createOrReplaceTempView method on a DataFrame or Dataset.
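A short sketch of both view types (illustrative data; note that global temporary views are queried through the reserved global_temp database) -

Example (PySpark) :

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("views-demo").getOrCreate()

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

df.createOrReplaceTempView("people_tmp")            # visible to this session only
spark.sql("SELECT * FROM people_tmp").show()

df.createOrReplaceGlobalTempView("people_global")   # visible to all sessions in this application
spark.sql("SELECT * FROM global_temp.people_global").show()

spark.stop()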
LET'S TAKE A PAUSE...!

By now, you have -
Learned the genesis of Spark
Gained some knowledge of Hadoop
Understood the data lake concept
Learned Apache Spark basics and its eco-system
Grasped the Spark architecture & execution model
Captured some details about Spark databases, tables & views

Let's master all of the above subjects, and soon we'll meet in the next workbook, "Apache Spark for Intermediate."

Till then... "Keep Learning and Keep Exploring...!"