Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn
What’s in it for you?
What is Spark SQL?
Spark SQL Features
Spark SQL Architecture
Spark SQL – DataFrame API
Spark SQL – Data Source API
Spark SQL – Catalyst Optimizer
Running SQL Queries
Spark SQL Demo
What is Spark SQL?
Spark SQL is Apache Spark’s module for working with structured and semi-structured data.
It originated to overcome the limitations of Apache Hive:
Hive lags in performance because it uses MapReduce jobs to execute ad-hoc queries.
Hive does not let you resume a job if it fails partway through.
Spark performs better than Hive in most scenarios.
Source: https://engineering.fb.com/
Spark SQL Features
Below are some essential features of Spark SQL that make it a compelling framework for data processing and analysis:
Integrated: You can mix SQL queries with Spark programs and query structured data inside them.
High compatibility: You can run unmodified Hive queries on existing warehouses in Spark SQL. It offers full compatibility with existing Hive data, queries, and UDFs.
Scalability: Spark SQL builds on the RDD model, which supports large jobs and mid-query fault tolerance. It uses the same engine for both interactive and long-running queries.
Standard connectivity: You can connect to Spark SQL through JDBC or ODBC, both of which are industry norms for business intelligence tools.
Spark SQL Architecture
Figure: architecture layers. Spark SQL and HQL, DataFrame DSL; DataFrame API; Data Source API; CSV, JSON, JDBC.
Spark SQL has three main layers:
Language API: Spark SQL can be used from Python, Scala, and Java, and it also accepts HiveQL.
SchemaRDD: Because Spark SQL works on schemas, tables, and records, you can use a SchemaRDD or DataFrame as a temporary table. (SchemaRDD was renamed DataFrame in Spark 1.3.)
Data Sources: Spark SQL supports multiple data sources such as JSON, Cassandra, and Hive tables.
Spark SQL – DataFrame API
The DataFrame API provides a domain-specific language (DSL) for working with structured and semi-structured data, i.e., datasets with a schema.
Spark's DataFrame API was inspired by data frames in R and by pandas in Python.
DataFrame features
Can process data ranging in size from kilobytes on a single-node cluster to petabytes on a large cluster
Can be integrated with Big Data tools and frameworks via Spark Core
Provides APIs for Python, Java, Scala, and R
Spark SQL – Data Source API
Spark SQL supports operating on a variety of data sources through the DataFrame interface.
It supports sources and file formats such as CSV, Hive, Avro, JSON, and Parquet.
It is lazily evaluated, like Apache Spark transformations, and can be accessed through SQLContext and HiveContext.
It can be integrated with Big Data tools and frameworks via Spark Core.
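As a rough illustration of the point above, the sketch below reads a CSV file and writes it back out as Parquet through the same DataFrame reader and writer interface. It assumes the spark-sql library is on the classpath and uses a local master; the object name, file names, and sample rows are made up for the example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataSourceExample {
  // Writes a tiny CSV file, converts it to Parquet, and reads the Parquet copy back.
  // All paths and rows are hypothetical sample data.
  def roundTrip(spark: SparkSession): DataFrame = {
    val dir = Files.createTempDirectory("datasource-demo")
    val csvPath = dir.resolve("people.csv")
    Files.write(csvPath, "name,age\nAndy,30\nJustin,19\n".getBytes(StandardCharsets.UTF_8))

    // Apart from header and schema inference, no rows are processed here;
    // the work happens when an action (the write below) runs.
    val csv = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(csvPath.toString)

    val parquetPath = dir.resolve("people.parquet").toString
    csv.write.mode("overwrite").parquet(parquetPath)

    // The same reader interface is used for a different format.
    spark.read.parquet(parquetPath)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataSourceExample").master("local[*]").getOrCreate()
    roundTrip(spark).show()
    spark.stop()
  }
}
```

Only the format-specific call (`csv`, `parquet`) changes between sources; the resulting DataFrame is used the same way in both cases.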
Spark SQL – Catalyst Optimizer
The Catalyst optimizer leverages advanced programming language features (such as Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.
It works in four phases:
1 Analyzing a logical plan to resolve references
2 Logical plan optimization
3 Physical planning
4 Code generation to compile parts of the query to Java bytecode
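One way to see the outcome of these phases is to ask Spark to print the plans for a query. The sketch below is only an illustration, assuming the spark-sql library is on the classpath and a local master; the object name and sample rows are made up. Calling `explain(true)` prints the parsed, analyzed, and optimized logical plans followed by the physical plan.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object QueryPlanExample {
  // Builds a small query over hypothetical in-memory rows.
  def query(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age")
      .filter(col("age") > 21)
      .select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("QueryPlanExample").master("local[*]").getOrCreate()
    // Prints the logical plans (parsed, analyzed, optimized) and the physical plan.
    query(spark).explain(true)
    spark.stop()
  }
}
```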
Figure: Catalyst query planning pipeline. SQL Query -> Unresolved Logical Plan -> Analysis (using the Catalog) -> Logical Plan -> Logical Optimization -> Optimized Logical Plan -> Physical Planning -> Physical Plans -> Cost Model -> Selected Physical Plan -> Code Generation -> RDDs
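To give a feel for how pattern matching can express an optimizer rule, here is a toy sketch in plain Scala. It is not Catalyst's actual code: the `Expr` tree, `transformUp`, and `foldConstants` are simplified stand-ins invented for this example, and it needs no Spark dependency.

```scala
object ToyOptimizer {
  // A tiny expression tree, loosely modeled on the idea of a query plan tree.
  sealed trait Expr
  case class Literal(value: Int) extends Expr
  case class Attribute(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Applies a rule bottom-up: children are rewritten first, then the node itself.
  def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = e match {
      case Add(l, r) => Add(transformUp(l)(rule), transformUp(r)(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(withNewChildren, identity[Expr])
  }

  // Constant folding written as a pattern match over tree shapes.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b)) => Literal(a + b)
    case Add(e, Literal(0))          => e
    case Add(Literal(0), e)          => e
  }

  def optimize(e: Expr): Expr = transformUp(e)(foldConstants)

  def main(args: Array[String]): Unit = {
    // x + (1 + 2) is rewritten to x + 3
    println(optimize(Add(Attribute("x"), Add(Literal(1), Literal(2)))))
  }
}
```

Each rule is a partial function, so new rewrites can be added as extra `case` clauses without touching the tree traversal, which is the kind of extensibility the slide refers to.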
Spark SQLContext
SQLContext is the class used to initialize Spark SQL functionality.
A SparkContext object (sc) is required to initialize a SQLContext object.
The following command starts spark-shell, which initializes a SparkContext:
$ spark-shell
The following command creates a SQLContext:
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
SparkSession
SparkSession is the entry point to all functionality in Spark. To create a basic SparkSession, use SparkSession.builder().
Source: https://spark.apache.org/
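A minimal sketch of the builder call, assuming the spark-sql library is on the classpath. The application name and the local master setting are illustrative choices, not values from the slides.

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionExample {
  // Builds (or reuses) a session that runs Spark locally on all cores.
  // The app name and master are assumptions made for this example.
  def createSession(): SparkSession =
    SparkSession.builder()
      .appName("Spark SQL basic example")
      .master("local[*]")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = createSession()
    println(s"Running Spark ${spark.version}")
    spark.stop()
  }
}
```

In spark-shell a session is already available as `spark`, so the builder is mainly needed in standalone applications.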
Creating DataFrames
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources. For example, a DataFrame can be created from the contents of a JSON file.
Source: https://spark.apache.org/
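The code from the original slide did not survive extraction, so the following is a stand-in sketch rather than the slide's own example. It assumes the spark-sql library is on the classpath; the file contents and object name are made up. Spark's JSON reader expects one JSON object per line by default.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CreateDataFrameExample {
  // Writes a small JSON Lines file (one object per line) and returns its path.
  // The records are hypothetical sample data.
  def writeSampleJson(): String = {
    val path = Files.createTempFile("people", ".json")
    val lines = Seq(
      """{"name":"Michael"}""",
      """{"name":"Andy","age":30}""",
      """{"name":"Justin","age":19}"""
    )
    Files.write(path, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
    path.toString
  }

  // The schema is inferred from the file; a missing field becomes null.
  def readPeople(spark: SparkSession, path: String): DataFrame =
    spark.read.json(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CreateDataFrameExample").master("local[*]").getOrCreate()
    val df = readPeople(spark, writeSampleJson())
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```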
DataFrame Operations
Structured data can be manipulated using the domain-specific language provided by DataFrames, for example to select columns, filter rows, or group and aggregate them.
Source: https://spark.apache.org/
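The slide's own examples were images and are not recoverable, so the sketch below shows a few common DataFrame operations on made-up data instead. It assumes the spark-sql library is on the classpath; the object and helper names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DataFrameOpsExample {
  // Hypothetical in-memory sample data.
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age")
  }

  // Projection with a computed column.
  def namesWithAgePlusOne(df: DataFrame): DataFrame =
    df.select(col("name"), (col("age") + 1).as("age_plus_one"))

  // Row filter.
  def olderThan(df: DataFrame, minAge: Int): DataFrame =
    df.filter(col("age") > minAge)

  // Grouping with a count per group.
  def countByAge(df: DataFrame): DataFrame =
    df.groupBy("age").count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameOpsExample").master("local[*]").getOrCreate()
    val df = people(spark)
    df.printSchema()
    namesWithAgePlusOne(df).show()
    olderThan(df, 21).show()
    countByAge(df).show()
    spark.stop()
  }
}
```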
Running SQL Queries
The sql function on a SparkSession allows applications to run SQL queries programmatically and returns the result as a DataFrame.
Source: https://spark.apache.org/
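As an illustrative sketch (not the slide's original code), the example below registers a DataFrame as a temporary view and queries it with `spark.sql`. It assumes the spark-sql library is on the classpath; the view name, sample rows, and object name are made up.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlQueryExample {
  // Registers hypothetical sample rows as a temporary view and queries it with SQL.
  def adults(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val people = Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age")
    people.createOrReplaceTempView("people")
    // The result of spark.sql is itself a DataFrame.
    spark.sql("SELECT name FROM people WHERE age >= 21 ORDER BY name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SqlQueryExample").master("local[*]").getOrCreate()
    adults(spark).show()
    spark.stop()
  }
}
```

Because the result is a DataFrame, it can be passed on to further DataFrame operations or registered as another view.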
Demo on Spark SQL