Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn
What’s in it for you?
What is Spark SQL?
Spark SQL Features
Spark SQL Architecture
Spark SQL – DataFrame API
Spark SQL – Data Source API
Spark SQL – Catalyst Optimizer
Running SQL Queries
Spark SQL Demo
What is Spark SQL?
Spark SQL is Apache Spark’s module for working with structured and semi-structured data.
It originated to overcome the limitations of Apache Hive.
Limitations of Hive:
Hive lags in performance because it uses MapReduce jobs to execute ad-hoc queries.
Hive does not let you resume a job if it fails partway through.
Spark performs better than Hive in most scenarios.
[Figure: Hive vs. Spark comparison. Source: https://engineering.fb.com/]
Spark SQL Features
Below are some essential features of Spark SQL that make it a compelling framework for data processing and analysis:
Integrated: You can mix SQL queries with Spark programs and query structured data inside them.
High compatibility: You can run unmodified Hive queries on existing warehouses in Spark SQL. It offers full compatibility with existing Hive data, queries, and UDFs.
Scalability: Spark SQL builds on the RDD model, which supports large jobs and mid-query fault tolerance. It uses the same engine for both interactive and long-running queries.
Standard connectivity: You can connect to Spark SQL through JDBC or ODBC, both of which are industry norms for business intelligence tools.
Spark SQL Architecture
[Figure: Spark SQL architecture. Spark SQL and HQL queries and the DataFrame DSL sit on top of the DataFrame API, which sits on top of the Data Source API (CSV, JSON, JDBC).]
Spark SQL has three main layers:
Language API: Spark SQL can be used from Python, Scala, and Java, and it accepts HiveQL queries.
SchemaRDD: Because Spark SQL works on schemas, tables, and records, you can use a SchemaRDD (renamed DataFrame in Spark 1.3) as a temporary table.
Data Sources: Spark SQL supports multiple data sources, such as JSON, Cassandra databases, and Hive tables.
Spark SQL – DataFrame API
A DataFrame is a distributed collection of data organized into named columns. The DataFrame API provides a domain-specific language (DSL) for working with structured and semi-structured data, i.e., datasets with a schema.
The DataFrame API in Spark was designed taking inspiration from data frames in R and pandas in Python.
DataFrame features:
Can process data ranging from kilobytes on a single node to petabytes on a large cluster
Can be easily integrated with Big Data tools and frameworks via Spark Core
Provides APIs for Python, Java, Scala, and R
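To make "datasets with a schema" concrete, here is a minimal sketch that can be pasted into spark-shell. The `Person` case class, the sample rows, and the local master setting are assumptions made for illustration only; they do not come from the slides.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

// Assumption: a local session; inside spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .appName("DataFrameSchemaSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// The column names and types of the DataFrame are derived from the case class fields.
val people = Seq(Person("Ann", 34), Person("Bob", 19)).toDF()

// Prints the schema as a tree of named, typed columns.
people.printSchema()

// The column names are also available programmatically.
val columnNames = people.columns.toSeq
```

Because the schema is known up front, Spark SQL can check column references and plan queries before any data is processed.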
Spark SQL – Data Source API
Spark SQL supports operating on a variety of data sources through the DataFrame interface.
It supports different sources and file formats, such as CSV, Hive tables, Avro, JSON, and Parquet.
It is lazily evaluated, like Apache Spark transformations, and can be accessed through SQLContext and HiveContext.
It can be easily integrated with Big Data tools and frameworks via Spark Core.
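The sketch below illustrates reading and writing two of the formats named above through the same DataFrame interface, and shows where evaluation actually happens. The temporary directory, the sample rows, and the local master setting are assumptions for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .appName("DataSourceSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical scratch location and sample data.
val tmpDir = Files.createTempDirectory("spark-sql-sources").toString
val df = Seq(("Ann", 34), ("Bob", 19)).toDF("name", "age")

// Write the same DataFrame in two different formats.
df.write.mode("overwrite").parquet(s"$tmpDir/people.parquet")
df.write.mode("overwrite").json(s"$tmpDir/people.json")

// Read both back through the same DataFrame interface.
val fromParquet = spark.read.parquet(s"$tmpDir/people.parquet")
val fromJson = spark.read.format("json").load(s"$tmpDir/people.json")

// filter is a transformation: it only extends the query plan.
val adults = fromParquet.filter($"age" >= 21)

// count is an action: this is where the filter actually runs.
val adultCount = adults.count()
```

Only the format name or reader method changes between sources; the resulting DataFrames are queried in exactly the same way.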
Spark SQL – Catalyst Optimizer
The Catalyst optimizer leverages advanced programming language features (such as Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.
It works in 4 phases:
1 Analyzing a logical plan to resolve references
2 Logical plan optimization
3 Physical planning
4 Code generation to compile parts of the query to Java bytecode
[Figure: Catalyst pipeline. SQL Query → Unresolved Logical Plan → (Analysis, using the Catalog) → Logical Plan → (Logical Optimization) → Optimized Logical Plan → (Physical Planning) → Physical Plans → (Cost Model) → Selected Physical Plan → (Code Generation) → RDDs]
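To give a feel for how pattern matching supports rule-based optimization, here is a toy sketch in plain Scala. It is not Catalyst's actual code or API: the `Expr` tree, its node types, and `foldConstants` are hypothetical names invented for this illustration of a constant-folding rule.

```scala
// A tiny, hypothetical expression tree (not Catalyst's real classes).
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// A constant-folding rule applied bottom-up: children are simplified first,
// then pattern matching decides how to rewrite the current node.
def foldConstants(e: Expr): Expr = e match {
  case Add(l, r) =>
    (foldConstants(l), foldConstants(r)) match {
      case (Literal(a), Literal(b)) => Literal(a + b) // 1 + 2 becomes 3
      case (Literal(0), other)      => other          // 0 + x becomes x
      case (other, Literal(0))      => other          // x + 0 becomes x
      case (fl, fr)                 => Add(fl, fr)    // nothing to simplify
    }
  case other => other // literals and attributes are already minimal
}

// x + (1 + 2) is rewritten to x + 3.
val expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))
val optimized = foldConstants(expr)
println(optimized)
```

Each `case` is one small rewrite rule, so adding an optimization means adding a pattern rather than restructuring the optimizer, which is the sense in which this style is extensible.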
Spark SQLContext
SQLContext is the class used to initialize the functionality of Spark SQL.
A SparkContext object (sc) is required to initialize a SQLContext object.
The following command starts spark-shell, which initializes a SparkContext (sc):
$ spark-shell
The following command creates a SQLContext:
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
SparkSession
SparkSession is the entry point to all functionality in Spark. Since Spark 2.0, it replaces SQLContext and HiveContext as the entry point. To create a basic SparkSession, use SparkSession.builder().
Source: https://spark.apache.org/
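A minimal sketch of creating a session with the builder might look like the following. The application name, the local master, and the configuration value are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumptions: app name, local master, and the config value are illustrative only.
val spark = SparkSession.builder()
  .appName("Spark SQL basic example")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "4")
  .getOrCreate() // returns the existing session if one is already running

// Enables implicit conversions such as Seq(...).toDF() and the $"column" syntax.
import spark.implicits._

// The legacy SQLContext is still reachable from the session if older code needs it.
val sqlContext = spark.sqlContext

println(spark.version)
```

In spark-shell a session named `spark` is already provided, so the builder call is mainly needed in standalone applications.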
Creating DataFrames
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources. For example, a DataFrame can be created from the content of a JSON file.
Source: https://spark.apache.org/
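The code shown on the original slide was not preserved, so the following is a sketch rather than the slide's exact example. To stay self-contained it parses JSON lines held in memory; with a real file you would call `spark.read.json` with the file's path instead. The sample records and the local master setting are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .appName("CreateDataFrameSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample records, one JSON object per line.
val jsonLines = Seq(
  """{"name":"Michael"}""",
  """{"name":"Andy","age":30}""",
  """{"name":"Justin","age":19}"""
).toDS()

// Spark infers the schema from the JSON fields; a missing field becomes null.
val df = spark.read.json(jsonLines)

// Displays the rows in tabular form.
df.show()
```

Schema inference orders the inferred columns by name, so this DataFrame has the columns `age` and `name`.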
DataFrame Operations
Structured data can be manipulated using the domain-specific language provided by DataFrames. Examples of structured data processing include selecting columns, filtering rows, and grouping and counting.
Source: https://spark.apache.org/
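The slide's own code examples were images and did not survive, so here is a sketch of typical DataFrame operations under stated assumptions: the sample rows, the column names, and the local master setting are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .appName("DataFrameOperationsSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; Michael's age is unknown (null).
val df = Seq[(String, Option[Int])](
  ("Michael", None),
  ("Andy", Some(30)),
  ("Justin", Some(19))
).toDF("name", "age")

// Print the schema as a tree.
df.printSchema()

// Select a single column.
val names = df.select("name")

// Select everybody, with the age incremented by 1.
val agesPlusOne = df.select($"name", ($"age" + 1).as("age_plus_one"))

// Keep only people older than 21; rows with a null age are dropped by the comparison.
val olderThan21 = df.filter($"age" > 21)

// Count people by age; the null age forms its own group.
val countsByAge = df.groupBy("age").count()

olderThan21.show()
countsByAge.show()
```

Each operation returns a new DataFrame, so calls can be chained, and nothing is computed until an action such as `show()` runs.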
Running SQL Queries
The sql function on a SparkSession allows applications to run SQL queries programmatically and returns the result as a DataFrame.
Source: https://spark.apache.org/
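A short sketch of the sql function, assuming a hypothetical `people` view built from in-memory sample rows and a local session:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .appName("RunningSqlSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data registered under the view name "people".
val people = Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age")
people.createOrReplaceTempView("people")

// The query result comes back as an ordinary DataFrame.
val teenagers = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.show()

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
teenagers.explain(true)
```

Because the result is a DataFrame, it can be refined further with DataFrame operations or registered as another view.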
Demo on Spark SQL
