Spark SQL under the hood - DataKRK meetup

The document discusses the evolution of data structuring from Codd's relational model to Apache Spark SQL. It highlights Spark SQL's capabilities to handle large-scale data processing and its advantages for analysts familiar with SQL or Hive. Key features such as the Catalyst optimizer and the importance of efficient plan generation are emphasized, along with best practices for using the Spark SQL API.

Spark SQL under the hood
Mikołaj Kromka, VirtusLab
mkromka@virtuslab.com
DataKRK meetup
Kraków, 06.09.2017

Bio
● Software engineer at VirtusLab and Spark trainer at Virtusity
● Focused mostly on the Scala ecosystem
● Currently developing a new Analytics Platform for Tesco

Brief (and selective) history of structuring data
● Codd's relational model (1969 - 50th anniversary in two years!)
● SQL
○ first developed at IBM in the early 1970s (originally as SEQUEL)
○ first commercial SQL-based RDBMS released by Relational Software, Inc. (now Oracle Corporation) in the late 1970s
● Apache Hive bringing SQL-like capabilities to the Big Data world (open sourced 2008)
● Shark
● Spark SQL (2014)

Apache Spark: why the fuss?
● General engine for large-scale data processing
● Resilient Distributed Datasets
● Automatically builds a graph (DAG) of computations from the transformations you write
● Scala, Java, Python and R APIs
● A lot of libraries on top of it (SQL, ML, GraphX, Streaming)
● One of the most active open source projects
source: https://spark.apache.org/docs/latest/cluster-overview.html
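
The "graph of computations" idea can be sketched with a toy model in plain Python. This is not Spark code: the class `LazyDataset` and its methods are hypothetical names made up for illustration. Transformations only record what should happen, and nothing runs until an action (`collect` here) is called.

```python
from typing import Callable, Iterable, List, Tuple


class LazyDataset:
    """Toy stand-in for an RDD: records transformations instead of running them."""

    def __init__(self, source: Iterable, lineage: Tuple = ()):
        self._source = source
        self._lineage = lineage  # tuple of (operation name, function)

    def map(self, fn: Callable) -> "LazyDataset":
        # Returns a new dataset with one more recorded step; fn is not called yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def describe(self) -> str:
        """Human-readable view of the recorded computation chain."""
        return " -> ".join(["source"] + [name for name, _ in self._lineage])

    def collect(self) -> List:
        """Action: only now is the recorded chain actually executed."""
        data = iter(self._source)
        for name, fn in self._lineage:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return list(data)


calls = []


def double(x):
    calls.append(x)  # lets us observe when the function really runs
    return x * 2


ds = LazyDataset([1, 2, 3]).map(double).filter(lambda x: x > 2)
calls_before_action = list(calls)  # still empty: nothing has executed
result = ds.collect()
```

In Spark the recorded lineage serves a second purpose as well: a lost partition can be recomputed from it, which is where the "resilient" in RDD comes from.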

Do we need anything else?
YES
● Data is usually structured, but RDDs contain arbitrary Java/Python objects and transformations of RDDs contain arbitrary code, so Spark cannot see that structure or optimize around it
● Analysts know SQL/Hive
● Large SQL/HiveQL codebases that we would like to reuse
● Connecting to different data sources with (semi-)structured datasets
● Applying advanced and complex algorithms (such as ML)
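
The first point can be illustrated with a toy model in plain Python (not Spark code; `Predicate`, `scan_with_lambda`, `scan_with_predicate` and the `PARTITIONS` layout are hypothetical names made up for this sketch). A filter written as arbitrary code is a black box to the engine, while a filter described as data can be inspected and, for example, used to skip whole partitions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]

# Hypothetical data source: rows stored in partitions keyed by year.
PARTITIONS: Dict[int, List[Row]] = {
    2016: [{"year": 2016, "amount": 10}, {"year": 2016, "amount": 20}],
    2017: [{"year": 2017, "amount": 30}],
}


@dataclass(frozen=True)
class Predicate:
    """A filter described as data: column == value."""
    column: str
    value: int


def scan_with_lambda(fn: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Opaque code: every partition must be read and fn called on each row."""
    rows_read, result = 0, []
    for rows in PARTITIONS.values():
        for row in rows:
            rows_read += 1
            if fn(row):
                result.append(row)
    return result, rows_read


def scan_with_predicate(pred: Predicate) -> Tuple[List[Row], int]:
    """Structured filter: the engine can inspect it and prune partitions."""
    rows_read, result = 0, []
    for year, rows in PARTITIONS.items():
        if pred.column == "year" and pred.value != year:
            continue  # partition pruning: these rows are never read
        for row in rows:
            rows_read += 1
            if row[pred.column] == pred.value:
                result.append(row)
    return result, rows_read


opaque_result, opaque_reads = scan_with_lambda(lambda r: r["year"] == 2017)
structured_result, structured_reads = scan_with_predicate(Predicate("year", 2017))
```

Both calls return the same row, but the structured version reads one row instead of three.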

Spark SQL to the rescue
source: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Catalyst Optimizer
source: https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
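
Catalyst represents queries as trees and optimizes them by applying rewrite rules to those trees. A minimal sketch of that idea in plain Python follows; it is a toy model, not Catalyst's implementation, and the names `Literal`, `Attribute`, `Add`, `transform_up` and `fold_constants` are chosen here only for illustration. The rule shown is constant folding.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]


def transform_up(expr: Expr, rule: Callable[[Expr], Expr]) -> Expr:
    """Apply rule to every node of the tree, children first (bottom-up)."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def fold_constants(expr: Expr) -> Expr:
    """Rule: Add(Literal(a), Literal(b)) is replaced by Literal(a + b)."""
    if (
        isinstance(expr, Add)
        and isinstance(expr.left, Literal)
        and isinstance(expr.right, Literal)
    ):
        return Literal(expr.left.value + expr.right.value)
    return expr  # rule does not match: leave the node unchanged


# x + (1 + 2) is rewritten to x + 3 before any row is processed
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = transform_up(tree, fold_constants)
```

Because the rule runs once on the plan rather than once per row, the saved work scales with the size of the dataset.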

Analysis
Resolves attribute references: matches each one to a column of an input table and assigns it a type
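
A minimal sketch of what resolution means, as a toy model in plain Python. The `CATALOG` contents, the class names and the error message are assumptions made for illustration, not Spark's actual classes or messages.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical catalog: table name -> {column name -> type name}
CATALOG: Dict[str, Dict[str, str]] = {
    "people": {"name": "string", "age": "int"},
}


@dataclass(frozen=True)
class UnresolvedAttribute:
    """What the parser produces: just a name, nothing known about it yet."""
    name: str


@dataclass(frozen=True)
class ResolvedAttribute:
    """What analysis produces: the name bound to a table and a type."""
    name: str
    data_type: str
    table: str


class AnalysisError(Exception):
    """Raised when a reference cannot be matched to the input table."""


def analyze(columns: List[UnresolvedAttribute], table: str) -> List[ResolvedAttribute]:
    if table not in CATALOG:
        raise AnalysisError(f"table not found: {table}")
    schema = CATALOG[table]
    resolved = []
    for col in columns:
        if col.name not in schema:
            raise AnalysisError(
                f"cannot resolve '{col.name}' given input columns {sorted(schema)}"
            )
        resolved.append(ResolvedAttribute(col.name, schema[col.name], table))
    return resolved


# Roughly: SELECT name, age FROM people
resolved = analyze([UnresolvedAttribute("name"), UnresolvedAttribute("age")], "people")
```

A misspelled column fails here, at analysis time, before any data is read.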

Physical Planning
BroadcastHashJoin (source: http://henning.kropponline.de/2016/12/11/broadcast-join-with-spark/)
SortMergeJoin (source: http://www.waitingforcode.com/apache-spark-sql/sort-merge-join-spark-sql/read)
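
The two join strategies referenced on this slide, broadcast hash join and sort-merge join, can be sketched on a single machine in plain Python. This is a toy model: `plan_join` and the row-count threshold are assumptions made for illustration, whereas Spark decides based on estimated size in bytes (`spark.sql.autoBroadcastJoinThreshold`) and runs the join across executors.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (join key, payload)

# Hypothetical threshold, expressed in rows to keep the sketch simple.
BROADCAST_THRESHOLD_ROWS = 2


def broadcast_hash_join(big: List[Row], small: List[Row]):
    """Build a hash table from the small side, then stream the big side through it."""
    table: Dict[int, List[str]] = {}
    for key, payload in small:
        table.setdefault(key, []).append(payload)
    return [(key, left, right) for key, left in big for right in table.get(key, [])]


def sort_merge_join(left: List[Row], right: List[Row]):
    """Sort both sides by key, then advance two cursors in lockstep."""
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side and emit their cross product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lp in left[i:i_end]:
                for _, rp in right[j:j_end]:
                    out.append((lk, lp, rp))
            i, j = i_end, j_end
    return out


def plan_join(left: List[Row], right: List[Row]):
    """Pick a physical strategy: broadcast the right side only if it is small."""
    if len(right) <= BROADCAST_THRESHOLD_ROWS:
        return "BroadcastHashJoin", broadcast_hash_join(left, right)
    return "SortMergeJoin", sort_merge_join(left, right)


orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
customers = [(1, "Ann"), (2, "Bob")]

strategy, joined = plan_join(orders, customers)
```

Both strategies produce the same rows (possibly in a different order). The planner's job is to pick the cheaper one for the data at hand.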

Code generation
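● Why do we need it?
○ without it, even a simple expression such as (x + y) + 1 is interpreted by walking its expression tree from scratch for every row in the dataset
● Newer versions of Spark SQL (2.0 and later) support Whole-Stage Code Generation, which compiles a whole pipeline of operators into a single function, not only individual expressions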
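
The difference between interpreting and generating code for (x + y) + 1 can be sketched in plain Python. This is a toy model under stated assumptions: Spark generates Java source and compiles it at runtime, while this sketch uses Python's built-in `compile`/`eval` as a stand-in, and the names `interpret`, `to_source` and `compile_expr` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]


def interpret(expr: Expr, row: Dict[str, int]) -> int:
    """Walk the tree for every row: type checks and recursion each time."""
    if isinstance(expr, Literal):
        return expr.value
    if isinstance(expr, Attribute):
        return row[expr.name]
    return interpret(expr.left, row) + interpret(expr.right, row)


def to_source(expr: Expr) -> str:
    """Turn the expression tree into a source-code string."""
    if isinstance(expr, Literal):
        return repr(expr.value)
    if isinstance(expr, Attribute):
        return f"row[{expr.name!r}]"
    return f"({to_source(expr.left)} + {to_source(expr.right)})"


def compile_expr(expr: Expr) -> Callable[[Dict[str, int]], int]:
    """Generate source once, compile it once, reuse the function for all rows."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


# (x + y) + 1
expr = Add(Add(Attribute("x"), Attribute("y")), Literal(1))
rows = [{"x": 1, "y": 2}, {"x": 10, "y": 20}]

generated = compile_expr(expr)
interpreted_results = [interpret(expr, r) for r in rows]
generated_results = [generated(r) for r in rows]
```

The tree is walked once in `to_source`. After that, each row costs a single call to straight-line code instead of a recursive traversal.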

Vectorization
no vectorization (JSON source)
...
[cropped source code]
vectorization (Parquet source)
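
The code shown on this slide is cropped and is not reproduced here. As a separate illustration, the sketch below contrasts row-at-a-time processing with processing column batches, in plain Python. It is a toy model: the data and function names are hypothetical, and Python lists only show the access pattern, not the actual speedup a columnar reader gets.

```python
from typing import Dict, List

# Row-oriented layout: one record per element (how a JSON source naturally arrives).
rows: List[Dict[str, int]] = [
    {"id": 1, "price": 10, "qty": 2},
    {"id": 2, "price": 5, "qty": 4},
    {"id": 3, "price": 7, "qty": 1},
]

# Column-oriented layout: one array per column (how Parquet stores data).
columns: Dict[str, List[int]] = {
    "id": [1, 2, 3],
    "price": [10, 5, 7],
    "qty": [2, 4, 1],
}


def total_row_at_a_time(rows: List[Dict[str, int]]) -> int:
    """One record per iteration: every row is materialized and looked up by field."""
    total = 0
    for row in rows:
        total += row["price"] * row["qty"]
    return total


def total_columnar(columns: Dict[str, List[int]], batch_size: int = 2) -> int:
    """One batch per iteration: only the needed columns are touched, in tight loops."""
    price, qty = columns["price"], columns["qty"]  # the "id" column is never read
    total = 0
    for start in range(0, len(price), batch_size):
        p = price[start:start + batch_size]
        q = qty[start:start + batch_size]
        total += sum(a * b for a, b in zip(p, q))
    return total


row_total = total_row_at_a_time(rows)
columnar_total = total_columnar(columns)
```

Both functions compute the same total. The columnar version never touches columns the query does not need and works on contiguous arrays, which is what makes batch-oriented readers efficient.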

Some advice
● Don't stick to the Dataset API blindly - some operations (typically lambdas passed to typed transformations such as map or filter) cannot be inlined during codegen and will be slower
● Don't expect Spark SQL to have all the features of a traditional RDBMS - if you don't handle large amounts of data, Postgres will be enough
● If possible, don't create DataFrames from RDDs with the .toDF() method - use the appropriate DataFrameReader instead
● Analyse the plans generated by Catalyst to see whether some optimizations were missed or there is room for improvement
● Spark UI is always useful
