Anatomy of Data Source API
A deep dive into the Spark Data Source API
https://github.com/phatak-dev/anatomy_of_spark_datasource_api
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Data Source API
● Schema discovery
● Build Scan
● Data type inference
● Save
● Column pruning
● Filter push-down
Data source API
● Universal API for loading/saving structured data
● Built-in support for Hive, Avro, JSON, JDBC and Parquet
● Third-party integration through spark-packages
● Support for smart sources
● Third parties already supporting:
○ CSV
○ MongoDB
○ Cassandra (in the works)
etc.
Data source API
Building CSV data source
● Ability to load and save CSV data
● Automatic schema discovery
● Support for user schema override
● Automatic data type inference
● Support for column pruning
● Filter push-down
Schema discovery
Tag v0.1
CsvSchemaDiscovery Example
Default Source
● Spark looks for a class named DefaultSource in the package given to the Data Source API
● DefaultSource should extend the RelationProvider trait
● A RelationProvider is responsible for taking user parameters and turning them into a BaseRelation
● The SchemaRelationProvider trait allows the user to specify a schema
● Ex : DefaultSource.scala
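A rough sketch of such a DefaultSource against the Spark 1.x sources API (the package name and the "path" option key are illustrative; CsvRelation is the relation class built in the next step):

```scala
package com.example.csv

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

// Spark resolves the package "com.example.csv" to this class when the
// user loads data with this source name
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // Called when the user does not supply a schema
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    createRelation(sqlContext, parameters, null)

  // Called when the user supplies a schema explicitly
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    CsvRelation(path, Option(schema))(sqlContext)
  }
}
```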
Base Relation
● Represents a collection of tuples with a known schema
● Methods to be overridden:
○ def sqlContext
Returns the SQLContext used for building DataFrames
○ def schema: StructType
Returns the schema of the relation as a StructType (analogous to a Hive SerDe)
● Ex : CsvRelation.scala
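A minimal sketch of such a relation (schema discovery here is simplified: it reads the header line and treats every column as a string, which matches the state of the code at this tag):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative relation; the real CsvRelation.scala may differ
case class CsvRelation(path: String, userSchema: Option[StructType])
                      (@transient val sqlContext: SQLContext)
  extends BaseRelation {

  // Use the user-supplied schema if given; otherwise discover it
  // from the header line, treating every column as a string for now
  override def schema: StructType = userSchema.getOrElse {
    val header = sqlContext.sparkContext.textFile(path).first()
    StructType(header.split(",").map(col =>
      StructField(col, StringType, nullable = true)))
  }
}
```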
Reading Data
Tag v0.2
TableScan
● TableScan is a trait to be implemented for reading data
● It is a BaseRelation that can produce all of its tuples as an RDD of Row objects
● Method to override:
○ def buildScan(): RDD[Row]
● In the CSV example, we use sc.textFile to create the RDD and Row.fromSeq to convert each line to a Row
● Ex : CsvTableScanExample.scala
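If CsvRelation also extends TableScan, the scan could be sketched as below (quoting and escaping are ignored; this is not a full CSV parser):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Inside CsvRelation, which now also extends the TableScan trait
override def buildScan(): RDD[Row] = {
  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header)                        // drop the header line
       .map(line => Row.fromSeq(line.split(",", -1).toSeq))
}
```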
Data Type inference
Tag v0.3
Inferring data types
● So far, every value has been treated as a string
● Sample the data and infer a schema for each sampled row
● Take the inferred schema of the first row
● Update the table scan to cast values to the right data types
● Ex: CsvSchemaDiscovery.scala
● Ex : SalesSumExample.scala
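The per-value inference is plain Scala; a simple sketch that tries Int, then Double, and falls back to String (the strings here stand in for Spark's IntegerType/DoubleType/StringType):

```scala
import scala.util.Try

// Infer the narrowest type for a single CSV value:
// Int first, then Double, falling back to String
def inferType(value: String): String =
  if (Try(value.toInt).isSuccess) "IntegerType"
  else if (Try(value.toDouble).isSuccess) "DoubleType"
  else "StringType"

// Infer a row's schema by inferring each field independently
def inferRowTypes(row: Seq[String]): Seq[String] = row.map(inferType)
```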
Save As Csv
Tag v0.4
CreatableRelationProvider
● DefaultSource should implement the CreatableRelationProvider trait in order to support the save call
● Override the createRelation method to implement the save mechanism
● Convert each Row to a string and use saveAsTextFile to save
● Ex : CsvSaveExample.scala
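A sketch of the save path (SaveMode handling is omitted; CsvRelation is the illustrative relation class from earlier):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}

// Added to DefaultSource to support saving a DataFrame through this source
class DefaultSource extends CreatableRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    // Render each Row as a comma-separated line and write it as text
    data.rdd
        .map(row => row.toSeq.mkString(","))
        .saveAsTextFile(path)
    CsvRelation(path, Some(data.schema))(sqlContext)
  }
}
```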
Column Pruning
Tag v0.5
PrunedScan
● CsvRelation should implement the PrunedScan trait to optimize column access
● PrunedScan tells the data source which columns the query wants to access
● When we build the RDD[Row], we include only the needed columns
● There is no performance benefit for CSV data (it is just for demo), but it brings great performance benefits in sources like JDBC
● Ex : SalesSumExample.scala
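With PrunedScan, buildScan receives the required column names; a sketch inside CsvRelation (same simplified parsing as before):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Inside CsvRelation, now extending PrunedScan instead of TableScan.
// requiredColumns lists only the columns the query actually needs.
override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
  val fieldNames = schema.fieldNames
  // Map each requested column name to its position in the file
  val indices = requiredColumns.map(fieldNames.indexOf(_))
  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header)
       .map { line =>
         val values = line.split(",", -1)
         Row.fromSeq(indices.map(values(_)).toSeq)
       }
}
```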
Filter push-down
Tag v0.6
PrunedFilterScan
● CsvRelation should implement PrunedFilterScan trait
to optimize filtering
● PrunedFilterScan pushes filters to data source
● When we build RDD[Row] we only give rows which
satisfy the filter
● It’s an optimization. The filters will be evaluated again.
● Ex :CsvFilerExample.scala
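A sketch of a filtered scan inside CsvRelation; only EqualTo is handled here, and unhandled filters are simply ignored, which is safe because Spark re-evaluates all filters afterwards:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Inside CsvRelation, now extending PrunedFilteredScan
override def buildScan(
    requiredColumns: Array[String],
    filters: Array[Filter]): RDD[Row] = {
  val fieldNames = schema.fieldNames

  // Best-effort push-down: rows failing a handled filter are skipped,
  // anything we cannot handle passes through for Spark to re-check
  def matches(values: Array[String]): Boolean = filters.forall {
    case EqualTo(attribute, value) =>
      values(fieldNames.indexOf(attribute)) == value.toString
    case _ => true
  }

  val indices = requiredColumns.map(fieldNames.indexOf(_))
  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header)
       .map(_.split(",", -1))
       .filter(matches)
       .map(values => Row.fromSeq(indices.map(values(_)).toSeq))
}
```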
