Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn

What’s in it for you?
What is Spark SQL?
Spark SQL Features
Spark SQL Architecture
Spark SQL – DataFrame API
Spark SQL – Data Source API
Spark SQL – Catalyst Optimizer
Running SQL Queries
Spark SQL Demo

What is Spark SQL?
Spark SQL is Apache Spark’s module for working with structured and semi-structured data

It originated to overcome the limitations of Apache Hive
Limitations:
Hive lags in performance as it uses MapReduce jobs for executing ad-hoc queries
Hive does not allow you to resume a job if processing fails midway

Spark performs better than Hive in most scenarios
[Chart: Hive vs. Spark]
Source: https://engineering.fb.com/

Spark SQL Features
Below are some essential features of Spark SQL that make it a compelling framework for data processing and analysis:
Integrated: You can mix Spark SQL queries with Spark programs and query structured data inside them
High Compatibility: You can run unmodified Hive queries on existing warehouses in Spark SQL. Spark SQL offers full compatibility with existing Hive data, queries, and UDFs

Scalability: Spark SQL leverages the RDD model to support large jobs and mid-query fault tolerance. It uses the same engine for both interactive and long-running queries
Standard Connectivity: You can easily connect to Spark SQL through JDBC or ODBC, both of which have become industry norms for connecting business intelligence tools

[Spark SQL architecture diagram: Spark SQL and HQL, DataFrame DSL; DataFrame API; Data Source API; CSV, JSON, JDBC]

Spark SQL has three main layers:
Language API: Spark SQL is compatible with languages such as Python, HiveQL, Scala, and Java
SchemaRDD: As Spark SQL works on schemas, tables, and records, you can use a SchemaRDD or DataFrame as a temporary table
Data Sources: Spark SQL supports multiple data sources such as JSON, Cassandra, and Hive tables

Spark SQL – DataFrame API
The DataFrame API provides a domain-specific language (DSL) for working with structured and semi-structured data, i.e., datasets with a schema

The DataFrame API in Spark was designed taking inspiration from DataFrame in R programming and Pandas in Python

DataFrame features:
Can process data ranging from kilobytes on a single-node cluster to petabytes on a large cluster
Can be easily integrated with all Big Data tools and frameworks via Spark-Core
Provides APIs for Python, Java, Scala, and R

Spark SQL – Data Source API
Spark SQL supports operating on a variety of data sources through the DataFrame interface

DataFrame interface:
It supports different file formats and sources such as CSV, Hive, Avro, JSON, and Parquet
It is lazily evaluated, like Apache Spark transformations, and can be accessed through SQLContext and HiveContext
It can be easily integrated with all Big Data tools and frameworks via Spark-Core
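
The points above can be sketched in Scala as follows. This is a minimal illustration, not part of the original slides: the object name `DataSourceSketch`, the sample rows, the temporary directory, and the `local[*]` master are all assumptions. It writes a small DataFrame as Parquet or JSON, reads it back through the same reader API, and shows that a transformation does no work until an action is called.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataSourceSketch {
  // Writes a small DataFrame in the given format ("parquet" or "json"),
  // then reads it back through the same Data Source API.
  def roundTrip(spark: SparkSession, format: String): DataFrame = {
    import spark.implicits._
    // Hypothetical sample rows, used only for illustration.
    val people = Seq(("Ann", 34), ("Bob", 19), ("Cy", 45)).toDF("name", "age")
    // Hypothetical scratch location; Spark creates the leaf directory itself.
    val dir = Files.createTempDirectory("datasource-sketch").resolve(format).toString
    people.write.format(format).save(dir)
    spark.read.format(format).load(dir)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("datasource-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    val fromParquet = roundTrip(spark, "parquet")
    // filter is a transformation: it only extends the query plan.
    val under40 = fromParquet.filter($"age" < 40)
    // count is an action: this is where the filter actually runs.
    println(under40.count())

    // Same API, different source format.
    roundTrip(spark, "json").show()
    spark.stop()
  }
}
```

Only the format string changes between sources; the rest of the program stays the same, which is the main convenience of the Data Source API.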

Spark SQL – Catalyst Optimizer
Catalyst optimizer leverages advanced programming language features (such as Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer
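
To give a feel for how pattern matching can express an optimizer rule, here is a toy sketch in plain Scala. It does not use Catalyst's real classes: `Expr`, `Literal`, `Attribute`, `Add`, and `ConstantFolding` are hypothetical names invented for this illustration, and the rule shown (constant folding) is just one simple example of a tree rewrite.

```scala
// Toy expression tree; illustrative only, not Catalyst's actual classes.
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ConstantFolding {
  // Rewrites children first (bottom-up), then tries the rule on the rebuilt node.
  def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = e match {
      case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
      case leaf      => leaf
    }
    // If the rule does not match, keep the node unchanged.
    rule.applyOrElse(withNewChildren, (x: Expr) => x)
  }

  // An optimization rule written as a pattern match.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b)) => Literal(a + b) // 1 + 2 becomes 3
    case Add(e, Literal(0))          => e              // x + 0 becomes x
    case Add(Literal(0), e)          => e              // 0 + x becomes x
  }

  def optimize(e: Expr): Expr = transform(e)(foldConstants)
}
```

For example, `ConstantFolding.optimize(Add(Attribute("x"), Add(Literal(1), Literal(2))))` returns `Add(Attribute("x"), Literal(3))`. New rules can be added as extra `case` lines without touching the traversal, which is the kind of extensibility the slide refers to.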

It works in 4 phases:
1 Analyzing a logical plan to resolve references
2 Logical plan optimization
3 Physical planning
4 Code generation to compile parts of the query to Java bytecode
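
One way to look at these phases for a real query is to ask Spark to print its plans. The sketch below is an assumption-laden illustration: the object name `ExplainSketch`, the sample rows, and the `local[*]` master are made up for the example. `explain(true)` and `queryExecution` are existing Spark APIs, though `queryExecution` is a developer-facing interface whose output format can change between versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplainSketch {
  // Builds a small query whose plans can be inspected.
  def adultNames(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical sample rows, used only for illustration.
    val people = Seq(("Ann", 34), ("Bob", 19), ("Cy", 45)).toDF("name", "age")
    people.filter($"age" > 21).select($"name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explain-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    val query = adultNames(spark)

    // Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
    query.explain(true)

    // The same plans are available as objects through queryExecution.
    val qe = query.queryExecution
    println(qe.analyzed)      // after phase 1: references resolved
    println(qe.optimizedPlan) // after phase 2: logical optimization
    println(qe.executedPlan)  // after phase 3: selected physical plan

    spark.stop()
  }
}
```

Comparing the analyzed plan with the optimized plan is a quick way to see which rewrites the optimizer applied to a given query.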

[Diagram: SQL Query → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]

Spark SQLContext
SQLContext is a class used for initializing the functionalities of Spark SQL
A SparkContext class object (sc) is required for initializing a SQLContext class object
The following command initializes SparkContext through spark-shell:
$ spark-shell
The following command creates a SQLContext:
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)

SparkSession
SparkSession is the entry point to all functionality in Spark. To create a basic SparkSession, use SparkSession.builder()
Source: https://spark.apache.org/
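
A minimal sketch of building a session is shown here. The object name `SessionSketch`, the application name, and the `local[*]` master are assumptions chosen for a local run; a real deployment would usually get its master from the cluster configuration instead.

```scala
import org.apache.spark.sql.SparkSession

object SessionSketch {
  // Returns the active session if one exists, otherwise creates one.
  def buildSession(): SparkSession =
    SparkSession.builder()
      .appName("Spark SQL basic example") // hypothetical application name
      .master("local[*]")                 // assumption: run locally on all cores
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession()
    println(spark.version)
    spark.stop()
  }
}
```

Inside spark-shell this step is not needed, since the shell already provides a session as `spark`.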

Creating DataFrames
Applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources
For example, a DataFrame can be created based on the content of a JSON file
Source: https://spark.apache.org/
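
The code from the original slide is not part of this transcript, so the following is a stand-in sketch rather than the presenter's example. The object name `JsonDataFrameSketch`, the sample records, and the temporary file are assumptions; the file is written first only so that the example is self-contained.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonDataFrameSketch {
  // Hypothetical sample data: one JSON object per line,
  // which is the layout spark.read.json expects by default.
  val sampleJson: String =
    """{"name":"Michael"}
      |{"name":"Andy","age":30}
      |{"name":"Justin","age":19}""".stripMargin

  // Writes the sample to a temporary file and loads it as a DataFrame.
  def loadPeople(spark: SparkSession): DataFrame = {
    val path = Files.createTempFile("people", ".json")
    Files.write(path, sampleJson.getBytes(StandardCharsets.UTF_8))
    spark.read.json(path.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-dataframe-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    val df = loadPeople(spark)
    df.printSchema() // schema is inferred from the JSON records
    df.show()

    spark.stop()
  }
}
```

With an existing file, the whole example reduces to `spark.read.json(pathToFile)`, where `pathToFile` is whatever location holds the data.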

DataFrame Operations
Structured data can be manipulated using the domain-specific language provided by DataFrames
Typical structured data processing operations include selecting columns, filtering rows, and grouping and counting
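
The slide's own code is not in this transcript, so here is a hedged sketch of a few common DataFrame operations. The object name `DataFrameOpsSketch`, the sample rows, and the `local[*]` master are assumptions made for the illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameOpsSketch {
  // Hypothetical sample data; one row has no age to show null handling.
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(String, Option[Int])](("Michael", None), ("Andy", Some(30)), ("Justin", Some(19)))
      .toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-ops-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._ // enables the $"column" syntax

    val df = people(spark)

    df.printSchema()                      // print the schema as a tree
    df.select("name").show()              // select a single column
    df.select($"name", $"age" + 1).show() // compute a derived column
    df.filter($"age" > 21).show()         // keep rows matching a condition
    df.groupBy("age").count().show()      // count rows per age value

    spark.stop()
  }
}
```

Each call returns a new DataFrame, so operations can be chained; nothing runs until an action such as `show()` is reached.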

Running SQL Queries
The sql function on a SparkSession allows applications to run SQL queries programmatically and returns the result in the form of a DataFrame
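
As a rough sketch of this, the example below registers a DataFrame as a temporary view and queries it with SQL. The object name `SqlQuerySketch`, the view name `people`, the sample rows, and the `local[*]` master are assumptions for the illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlQuerySketch {
  // Registers sample data as a temporary view and queries it with SQL.
  def teenagers(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical sample rows, used only for illustration.
    val people = Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age")
    // The view lives only for the lifetime of this session.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-query-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    // The result is an ordinary DataFrame, so DataFrame operations still apply.
    teenagers(spark).show()

    spark.stop()
  }
}
```

Because the result of `spark.sql` is a DataFrame, SQL and the DataFrame DSL can be mixed freely in the same program.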
