Extending Spark SQL 2.4
with New Data Sources
Live Coding Session
© Jacek Laskowski / @JacekLaskowski / jacek@japila.pl / Spark+AI Summit 2019
● Freelance IT consultant
● Specializing in Spark, Kafka, Kafka Streams, Scala
● Development | Consulting | Training | Speaking
● "The Internals Of" online books
● Contributor to Apache Spark
● Confluent Community Catalyst (Class of 2019 - 2020)
● Contact me at jacek@japila.pl
● Follow @JacekLaskowski on twitter for more #ApacheSpark
#ApacheKafka #KafkaStreams
Jacek Laskowski
Friendly reminder
Pictures...take a lot of pictures! 📷
Why Should You Care?
1. Why would you ever consider developing a new data
source for Spark SQL?
2. Let structured queries access data in external systems
(e.g. Splice Machine, Google Cloud Spanner)
3. Make the loading or writing process self-contained
a. Hidden from developers, who can focus on what to do with the data,
not on how to make the data available in a proper format
Data Source / Data Provider
1. Data Source is a pluggable “abstraction” in Spark SQL for loading and saving
data
a. Abstraction in a loose meaning
b. Also known as Data Provider or Data Format or Relation Provider
2. Built-In Data Sources: parquet, kafka, avro, json, etc.
3. All available for developers, data engineers, and data scientists
a. Scala, Java, Python, SQL
4. Allows for new data sources
5. Source or Reader for loading data
6. Sink or Writer for saving data
7. Read up on Data Sources in the official documentation
The goal of the session! 🎯
Before Developing New Data Source
1. What Apache Spark version?
2. Data Source API V1 vs Data Source API V2?
3. Loading and/or Saving Data?
4. Spark SQL only?
5. Spark Structured Streaming?
a. Micro-Batch Stream Processing?
b. Continuous Stream Processing?
DataFrameReader (1 of 2)
1. SparkSession.read to start describing a data flow
a. Creates a DataFrameReader
2. DataFrameReader is a fluent interface to describe the
input data source
3. Used to “load” data from external storage systems (e.g.
file systems, key-value stores, etc.)
a. No physical data movement yet
b. Metadata of an input node in a data flow (graph)
4. DataFrameReader.load to finish describing the input
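The steps above can be sketched in Scala roughly as follows. This is a minimal illustration, not code from the session: the object name, the sample file, and the schema are assumptions made up for the example, and only the built-in json data source is used.

```scala
import java.nio.file.{Files, Path}

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical helper object, used only for this illustration.
object DataFrameReaderSketch {
  // Assumed schema of the sample input written by sampleInput().
  val schema: StructType = StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType)))

  // Writes a throwaway JSON file so the example needs no external data.
  def sampleInput(): Path = {
    val path = Files.createTempFile("people", ".json")
    Files.write(path, """{"id": 1, "name": "Jacek"}""".getBytes("UTF-8"))
  }

  // Describes the input node of a data flow; rows are read only when an
  // action (show, count, collect, ...) is executed on the result.
  def describeInput(spark: SparkSession, path: Path): DataFrame =
    spark.read                     // SparkSession.read creates a DataFrameReader
      .format("json")              // alias of a built-in data source
      .schema(schema)              // explicit schema, so no schema inference pass
      .option("mode", "FAILFAST")  // options are handed over to the data source
      .load(path.toString)         // finishes describing the input
}
```

Calling `DataFrameReaderSketch.describeInput(spark, DataFrameReaderSketch.sampleInput()).show()` with a local `SparkSession` named `spark` is what finally triggers the read.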
DataFrameReader (2 of 2)
Worth noticing:
1. DataSource.lookupDataSource
2. DataSourceV2
3. ReadSupport
4. DataSourceV2Relation
5. loadV1Source
loadV1Source = DataSource.resolveRelation
1. loadV1Source loads a DataSource API V1 data source
Data Source API
1. DataSourceRegister
2. 👉 Data Source API V1
3. 👉 Data Source API V2
Friendly reminders
1. Pictures...take a lot of pictures! 📷
2. It should be a live coding, shouldn’t it? 🤔
Data Source API V1
1. DataSourceRegister
a. SchemaRelationProvider
b. RelationProvider
c. FileFormat
d. CreatableRelationProvider
2. BaseRelation
a. PrunedFilteredScan
b. InsertableRelation
c. PrunedScan
d. TableScan
e. CatalystScan
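To make the V1 contracts above concrete, here is a minimal sketch of a read-only data source built from DataSourceRegister, RelationProvider, BaseRelation and TableScan. The class names, the "numbers" alias and the "rows" option are hypothetical and exist only for this illustration; it is not the data source developed in the session.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical data source that produces the numbers 0 until `rows`.
class NumbersSource extends RelationProvider with DataSourceRegister {
  // The alias works in format("numbers") only after the class is listed in
  // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
  override def shortName(): String = "numbers"

  // Called when a query loads this data source without a user-defined schema.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val rows = parameters.getOrElse("rows", "5").toLong // "rows" is an assumed option
    new NumbersRelation(sqlContext, rows)
  }
}

// A relation with a fixed schema that supports a full table scan only.
class NumbersRelation(override val sqlContext: SQLContext, rows: Long)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("n", LongType, nullable = false)))

  // TableScan: return all rows; no column pruning or filter pushdown.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.range(0, rows).map(n => Row(n))
}
```

Without the service registration, the fully-qualified class name can be used instead of the alias, e.g. `spark.read.format(classOf[NumbersSource].getName).option("rows", "3").load()`. Mixing in PrunedScan or PrunedFilteredScan instead of TableScan would let Spark pass required columns and filters to the relation.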
Data Source API V2
1. DataSourceRegister
2. DataSourceV2
3. ReadSupport
4. WriteSupport
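The read side of the V2 contracts above could look roughly like the sketch below. It targets the Data Source API V2 as shipped in Spark 2.4 only (the API changed in later releases), and the class names, the "numbers-v2" alias and the "rows" option are hypothetical names chosen for this illustration.

```scala
import java.util.{Collections, List => JList}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Entry point: DataSourceV2 is a marker interface, ReadSupport adds reading.
class NumbersSourceV2 extends DataSourceV2 with ReadSupport with DataSourceRegister {
  override def shortName(): String = "numbers-v2"

  override def createReader(options: DataSourceOptions): DataSourceReader =
    new NumbersReader(options.getLong("rows", 5L)) // "rows" is an assumed option
}

// Driver side: reports the schema and plans the input partitions.
class NumbersReader(rows: Long) extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Seq(StructField("n", LongType, nullable = false)))

  // A single partition keeps the sketch short.
  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    Collections.singletonList[InputPartition[InternalRow]](new NumbersPartition(rows))
}

// Serialized and sent to executors, where it creates the actual reader.
class NumbersPartition(rows: Long) extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new NumbersPartitionReader(rows)
}

// Executor side: an iterator-like reader over the numbers 0 until `rows`.
class NumbersPartitionReader(rows: Long) extends InputPartitionReader[InternalRow] {
  private var current = -1L

  override def next(): Boolean = { current += 1; current < rows }
  override def get(): InternalRow = InternalRow(current)
  override def close(): Unit = ()
}
```

As with V1, `spark.read.format(classOf[NumbersSourceV2].getName).option("rows", "3").load()` loads it by class name; DataFrameReader.load notices that the class is a DataSourceV2 with ReadSupport and creates a DataSourceV2Relation instead of going through loadV1Source.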
“The Internals Of” Online Books
1. The Internals of Spark SQL
2. The Internals of Spark Structured Streaming
3. The Internals of Apache Spark
Questions?
1. Follow @jaceklaskowski on Twitter (DMs open)
2. Upvote my questions and answers on StackOverflow
3. Contact me at jacek@japila.pl
4. Connect with me on LinkedIn
