Spark SQL under the hood
Mikołaj Kromka, VirtusLab
mkromka@virtuslab.com
DataKRK meetup
Kraków, 06.09.2017
Bio
● Software engineer at VirtusLab and Spark trainer at Virtusity
● Focused mostly on the Scala ecosystem
● Currently developing a new Analytics Platform for Tesco
Brief (and selective) history of structuring data
● Codd's relational model (1969 - 50th anniversary in two years!)
● SQL
○ originally developed at IBM in the early 1970s (as SEQUEL, for the System R prototype)
○ first commercial SQL-based RDBMS released by Relational Software, Inc. (now Oracle Corporation) in the late 1970s
● Apache Hive bringing SQL-like capabilities to the Big Data world (open sourced 2008)
● Shark
● Spark SQL (2014)
Apache Spark: why the fuss?
● General engine for large-scale data processing
● Resilient Distributed Datasets
● Generating graph of computations automatically
● Scala, Java, Python and R APIs
● A lot of libraries on top of it (SQL, ML, GraphX, Streaming)
● One of the most active open source projects
source https://spark.apache.org/docs/latest/cluster-overview.html
Do we need anything else?
YES
● Data is usually structured, but RDDs contain arbitrary Java/Python objects and RDD transformations contain arbitrary code
● Analysts know SQL/Hive
● Large SQL/HiveQL codebases that we would like to reuse
● Connecting to different data sources with (semi-)structured datasets
● Applying advanced and complex algorithms (such as ML)
Spark SQL to the rescue
source https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Catalyst Optimizer
source: https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Analysis
Resolves attribute references (matches them to columns of an input table and assigns them types)
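A toy illustration of what resolving a reference means, in plain Python. This is a sketch, not Catalyst's actual classes: `CATALOG`, `Unresolved`, `Resolved` and `resolve` are hypothetical names made up for this example.

```python
from dataclasses import dataclass

# Hypothetical catalog: table name -> {column name: type name}.
CATALOG = {"people": {"name": "string", "age": "int"}}


@dataclass(frozen=True)
class Unresolved:
    """A column name as written in the query; nothing is known about it yet."""
    name: str


@dataclass(frozen=True)
class Resolved:
    """The same column after analysis: bound to a table and given a type."""
    table: str
    name: str
    dtype: str


def resolve(attr, table):
    """Look the attribute up in the catalog, failing if it does not exist."""
    columns = CATALOG[table]
    if attr.name not in columns:
        raise LookupError(f"cannot resolve '{attr.name}' in table '{table}'")
    return Resolved(table, attr.name, columns[attr.name])


resolved_age = resolve(Unresolved("age"), "people")
```

A misspelled or missing column fails at this stage, before any data is read.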
Logical Optimization
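Logical optimization applies rewrite rules to the plan tree. Constant folding is one such rule; the sketch below shows the idea on a tiny expression tree in plain Python. The `Lit`, `Col`, `Add` classes and `fold_constants` are hypothetical names for this example, not Spark's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(expr):
    """Rewrite the tree bottom-up, replacing Add(Lit, Lit) with a single Lit."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr


# x + (1 + 2) is rewritten to x + 3 once, instead of adding 1 + 2 for every row.
optimized = fold_constants(Add(Col("x"), Add(Lit(1), Lit(2))))
```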
Physical Planning
source http://henning.kropponline.de/2016/12/11/broadcast-join-with-spark/
BroadcastHashJoin
source http://www.waitingforcode.com/apache-spark-sql/sort-merge-join-spark-sql/read
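The two join strategies referenced on these slides can be sketched in a single process with plain Python lists of (key, value) pairs. In Spark the small side of a broadcast hash join is shipped to every executor, and a sort-merge join works on partitions that were shuffled by key; neither step is modelled here. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def broadcast_hash_join(large, small):
    """Build a hash table from the small side, then stream the large side past it."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    return [(key, lv, sv) for key, lv in large for sv in table.get(key, [])]


def sort_merge_join(left, right):
    """Sort both sides by key, then advance two cursors in step."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair the current left row with the whole run of equal right keys.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
customers = [(1, "Ann"), (2, "Bob")]
hash_result = broadcast_hash_join(orders, customers)
merge_result = sort_merge_join(orders, customers)
```

Both functions return the same rows, possibly in a different order. The hash variant avoids sorting but needs the small side to fit in memory.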
Code generation
● Why do we need it?
○ without it, simple expressions such as (x + y) + 1 would be interpreted from scratch for every row in the dataset
● Newer versions of Spark SQL support whole-stage code generation (not only expressions)
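The difference between interpreting and generating code can be shown with a small sketch for the expression (x + y) + 1. Spark generates Java source and compiles it at runtime; this example uses Python source and `eval` only to illustrate the idea. The tuple-based tree encoding and the names `interpret`, `to_source` and `compile_expr` are assumptions made for this example.

```python
def interpret(node, row):
    """Walk the expression tree for one row; this dispatch repeats for every row."""
    kind = node[0]
    if kind == "col":
        return row[node[1]]
    if kind == "lit":
        return node[1]
    if kind == "add":
        return interpret(node[1], row) + interpret(node[2], row)
    raise ValueError(f"unknown node: {kind}")


def to_source(node):
    """Turn the tree into a Python expression string, once per query."""
    kind = node[0]
    if kind == "col":
        return f"row[{node[1]!r}]"
    if kind == "lit":
        return repr(node[1])
    if kind == "add":
        return f"({to_source(node[1])} + {to_source(node[2])})"
    raise ValueError(f"unknown node: {kind}")


def compile_expr(node):
    """Compile the generated source into a plain function of one row."""
    return eval(f"lambda row: {to_source(node)}")


# (x + y) + 1
expr = ("add", ("add", ("col", "x"), ("col", "y")), ("lit", 1))
rows = [{"x": 1, "y": 2}, {"x": 10, "y": 20}]

compiled = compile_expr(expr)
interpreted_results = [interpret(expr, r) for r in rows]
compiled_results = [compiled(r) for r in rows]
```

The tree is walked once in `to_source`; after that every row runs a single flat function with no per-node dispatch.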
Spark UI
Vectorization
no vectorization (json source)
...
[cropped source code]
vectorization (parquet source)
Some advice
● Don't stick to the Dataset API blindly: some operations cannot be inlined during codegen and will be slower
● Don't assume that Spark SQL has all the features of a traditional RDBMS; if you don't handle large amounts of data, Postgres will be enough
● If possible, don't create DataFrames from RDDs with the .toDF() method; use a specific DataFrameReader instead
● Analyse the plans generated by Catalyst to see whether some optimizations were missed or there is room for improvement
● Spark UI is always useful
questions?
