Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)

The document outlines a live coding session focused on extending Spark SQL 2.4 with new data sources, emphasizing the importance of developing new data sources to enhance data accessibility for external systems. It details the goals, frameworks, and APIs involved in loading and saving data, and it introduces the DataFrameReader as a key component for data flow description. Additionally, it touches on the evolution from Data Source API v1 to v2, offering insights for developers and data engineers on the capabilities of Spark SQL.

Extending Spark SQL 2.4
with New Data Sources
Live Coding Session
© Jacek Laskowski / @JacekLaskowski / jacek@japila.pl / Spark+AI Summit 2019

Jacek Laskowski
● Freelance IT consultant
● Specializing in Spark, Kafka, Kafka Streams, Scala
● Development | Consulting | Training | Speaking
● "The Internals Of" online books
● Among contributors to Apache Spark
● Among Confluent Community Catalysts (Class of 2019 - 2020)
● Contact me at jacek@japila.pl
● Follow @JacekLaskowski on Twitter for more #ApacheSpark #ApacheKafka #KafkaStreams

Friendly reminder
Pictures...take a lot of pictures! 📷

Why Should You Care?
1. Why would you ever consider developing a new data source for Spark SQL?
2. Let structured queries access data in external systems (e.g. Splice Machine, Google Cloud Spanner)
3. Make the loading or writing process self-contained
a. Hidden from developers, who'd focus on what to do with the data, not how to make the data available in a proper format

Data Source / Data Provider
1. Data Source is a pluggable “abstraction” in Spark SQL for loading and saving data
a. Abstraction in a loose meaning
b. Also known as Data Provider or Data Format or Relation Provider
2. Built-In Data Sources: parquet, kafka, avro, json, etc.
3. All available for developers, data engineers, and data scientists
a. Scala, Java, Python, SQL
4. Allows for new data sources
5. Source or Reader for loading data
6. Sink or Writer for saving data
7. Read up on Data Sources in the official documentation
The goal of the session! 🎯
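
The source/sink split on this slide can be tried with a built-in format before writing a new one. The following is a minimal sketch, assuming Spark SQL 2.4 is on the classpath and a local master is acceptable; the object name `LoadSaveRoundTrip`, the application name and the temporary directory are illustrative choices, not part of the talk.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadSaveRoundTrip {

  // Saves a DataFrame with the built-in parquet writer (the "sink" side),
  // then loads it back with the built-in parquet reader (the "source" side).
  def roundTrip(spark: SparkSession, df: DataFrame, dir: String): DataFrame = {
    df.write.format("parquet").mode("overwrite").save(dir)
    spark.read.format("parquet").load(dir)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")           // assumption: local mode for the demo
      .appName("load-save-demo")    // illustrative name
      .getOrCreate()

    // Illustrative scratch location; any writable directory would do.
    val dir = Files.createTempDirectory("load-save-demo").toString

    val loaded = roundTrip(spark, spark.range(3).toDF("id"), dir)
    loaded.show()

    spark.stop()
  }
}
```

A custom data source plugs into the same `format(...)` calls; only the name passed to `format` changes.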

Before Developing New Data Source
1. What Apache Spark version?
2. Data Source API V1 vs Data Source API V2?
3. Loading and/or Saving Data?
4. Spark SQL only?
5. Spark Structured Streaming?
a. Micro-Batch Stream Processing?
b. Continuous Stream Processing?

DataFrameReader (1 of 2)
1. SparkSession.read to start describing a data flow
a. Creates a DataFrameReader
2. DataFrameReader is a fluent interface to describe the input data source
3. Used to “load” data from external storage systems (e.g. file systems, key-value stores, etc.)
a. No physical data movement yet
b. Metadata of an input node in a data flow (graph)
4. DataFrameReader.load to finish describing the input
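
The read-describe-load sequence above can be sketched as follows. This assumes Spark SQL 2.4 on the classpath; the object name `DataFrameReaderDemo`, the column names and the temporary file are made up for illustration. A schema is supplied up front so that `load` does not have to scan the file to infer one; as far as I understand, Spark still checks that the path exists at that point.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, DataFrameReader, SparkSession}

object DataFrameReaderDemo {

  // Describes a JSON input; no rows are read until an action runs.
  def describeInput(spark: SparkSession, path: String): DataFrame = {
    val reader: DataFrameReader = spark.read   // start describing the data flow
    reader
      .format("json")                          // which data source to use
      .schema("id LONG, name STRING")          // illustrative schema (DDL string)
      .load(path)                              // finish describing the input
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")                      // assumption: local mode
      .appName("reader-demo")                  // illustrative name
      .getOrCreate()

    // Illustrative one-record input file.
    val file = Files.createTempFile("reader-demo", ".json")
    Files.write(file, """{"id": 1, "name": "spark"}""".getBytes(StandardCharsets.UTF_8))

    val input = describeInput(spark, file.toString)
    input.printSchema()      // metadata only
    println(input.count())   // an action: this is where data is actually read

    spark.stop()
  }
}
```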

DataFrameReader (2 of 2)
Worth noticing:
1. DataSource.lookupDataSource
2. DataSourceV2
3. ReadSupport
4. DataSourceV2Relation
5. loadV1Source

loadV1Source = DataSource.resolveRelation
1. loadV1Source loads a DataSource API V1 data source
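
One way to read the names listed on the two slides above is as a dispatch inside `DataFrameReader.load`: look up the data source class, take the V2 path when it is a readable `DataSourceV2`, and fall back to `loadV1Source` otherwise. The sketch below is a simplified model of that reading in plain Scala, not Spark's actual code. Every trait, object and class in it (`DataSourceV2Like`, `ReadSupportLike`, `LoadDispatch` and the sample classes) is a stand-in invented for illustration.

```scala
// Stand-ins for Spark's marker interfaces; NOT the real Spark types.
trait DataSourceV2Like
trait ReadSupportLike

// The two outcomes the slide names: a V2 relation or the V1 fallback.
sealed trait LoadPath
case object V2Relation extends LoadPath   // models creating a DataSourceV2Relation
case object V1Source extends LoadPath     // models loadV1Source / DataSource.resolveRelation

object LoadDispatch {
  // Chooses the load path for an already looked-up data source class.
  def choose(cls: Class[_]): LoadPath = {
    val isV2 = classOf[DataSourceV2Like].isAssignableFrom(cls)
    val canRead = classOf[ReadSupportLike].isAssignableFrom(cls)
    if (isV2 && canRead) V2Relation else V1Source
  }
}

// Hypothetical data source classes used only to exercise the model.
class ReadableV2Source extends DataSourceV2Like with ReadSupportLike
class WriteOnlyV2Source extends DataSourceV2Like
class PlainV1Source

object LoadDispatchDemo {
  def main(args: Array[String]): Unit = {
    println(LoadDispatch.choose(classOf[ReadableV2Source]))
    println(LoadDispatch.choose(classOf[WriteOnlyV2Source]))
    println(LoadDispatch.choose(classOf[PlainV1Source]))
  }
}
```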

Data Source API
1. DataSourceRegister
2. 👉 Data Source API V1
3. 👉 Data Source API V2

Friendly reminders
1. Pictures...take a lot of pictures! 📷
2. It should be a live coding session, shouldn’t it? 🤔

Data Source API V1
1. DataSourceRegister
a. SchemaRelationProvider
b. RelationProvider
c. FileFormat
d. CreatableRelationProvider
2. BaseRelation
a. PrunedFilteredScan
b. InsertableRelation
c. PrunedScan
d. TableScan
e. CatalystScan
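
A read-only V1 data source needs only a few of the types listed above. The following is a minimal sketch, assuming Spark SQL 2.4 on the classpath; it is not the code written during the session. The names `NumbersSource`, `NumbersRelation`, the `numbers-demo` alias and the `rows` option are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, SparkSession}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical V1 data source that produces the numbers 0 until `rows`.
class NumbersSource extends RelationProvider with DataSourceRegister {

  // Alias for format(...); it only resolves once the class is registered (see note below).
  override def shortName(): String = "numbers-demo"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // "rows" is an illustrative option; it defaults to 3.
    val rows = parameters.getOrElse("rows", "3").toLong
    new NumbersRelation(sqlContext, rows)
  }
}

// BaseRelation carries the schema; TableScan adds a full scan with no pruning or filtering.
class NumbersRelation(override val sqlContext: SQLContext, rows: Long)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("n", LongType, nullable = false)))

  override def buildScan(): RDD[Row] = {
    val end = rows // local copy so the closure below does not reference the relation
    sqlContext.sparkContext.range(0L, end).map(i => Row(i))
  }
}

object NumbersSourceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local mode
      .appName("numbers-v1-demo")    // illustrative name
      .getOrCreate()

    // The fully qualified class name works without any registration.
    val numbers = spark.read
      .format(classOf[NumbersSource].getName)
      .option("rows", "5")
      .load()
    numbers.show()

    spark.stop()
  }
}
```

To load the source by its short name instead, the class name has to be listed in a `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` file on the classpath.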

Data Source API V2
1. DataSourceRegister
2. DataSourceV2
3. ReadSupport
4. WriteSupport
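
For comparison, here is a minimal sketch of a read-only V2 data source against the Spark 2.4 interfaces (the V2 API changed between releases, so it is unlikely to compile on other versions). It is an illustration under those assumptions, not the code from the session. The names `NumbersSourceV2`, `NumbersReader`, `NumbersPartition` and the `rows` option are hypothetical.

```scala
import java.util.{Collections, List => JList}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical V2 data source that produces the numbers 0 until `rows`.
class NumbersSourceV2 extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new NumbersReader(options.getLong("rows", 3L)) // "rows" is an illustrative option
}

// Driver side: reports the schema and plans the partitions to read.
class NumbersReader(rows: Long) extends DataSourceReader {

  override def readSchema(): StructType =
    StructType(Seq(StructField("n", LongType, nullable = false)))

  // A single partition keeps the sketch short.
  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    Collections.singletonList[InputPartition[InternalRow]](new NumbersPartition(rows))
}

// Serialized to executors, where it creates the reader that yields rows.
class NumbersPartition(rows: Long) extends InputPartition[InternalRow] {

  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new InputPartitionReader[InternalRow] {
      private var current = -1L

      // Advances to the next value; false once all rows are produced.
      override def next(): Boolean = {
        current += 1
        current < rows
      }

      override def get(): InternalRow = InternalRow(current)

      override def close(): Unit = ()
    }
}
```

It can be loaded the same way as a V1 source, for example `spark.read.format(classOf[NumbersSourceV2].getName).option("rows", "5").load()`.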

“The Internals Of” Online Books
1. The Internals of Spark SQL
2. The Internals of Spark Structured Streaming
3. The Internals of Apache Spark

Questions?
1. Follow @jaceklaskowski on Twitter (DMs open)
2. Upvote my questions and answers on Stack Overflow
3. Contact me at jacek@japila.pl
4. Connect with me on LinkedIn
