Anatomy of Data Source API
A deep dive into the Spark Data Source API
https://github.com/phatak-dev/anatomy_of_spark_datasource_api
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Data Source API
● Schema discovery
● Build Scan
● Data type inference
● Save
● Column pruning
● Filter push-down
Data source API
● Universal API for loading/saving structured data
● Built-in support for Hive, Avro, JSON, JDBC and Parquet
● Third-party integration through spark-packages
● Support for smart sources
● Third parties already supporting:
○ CSV
○ MongoDB
○ Cassandra (in the works)
etc.
Data source API
Building CSV data source
● Ability to load and save CSV data
● Automatic schema discovery
● Support for user schema override
● Automatic data type inference
● Support for column pruning
● Filter push-down
Schema discovery
Tag v0.1
CsvSchemaDiscovery Example
Default Source
● Spark looks for a class named DefaultSource in the package given to the Data Source API
● DefaultSource should extend the RelationProvider trait
● A RelationProvider is responsible for taking user parameters and turning them into a BaseRelation
● The SchemaRelationProvider trait allows the user to specify a schema
● Ex : DefaultSource.scala
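A rough sketch of such a DefaultSource against the Spark 1.x sources API (the package name and the "path" option key are illustrative; CsvRelation is the relation class built in the next step):

```scala
package com.example.csv

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

// Spark resolves the package "com.example.csv" to this class when the
// user loads data with this source name
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // Called when the user does not supply a schema
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    createRelation(sqlContext, parameters, null)

  // Called when the user supplies a schema explicitly
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    CsvRelation(path, Option(schema))(sqlContext)
  }
}
```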
Base Relation
● Represents a collection of tuples with a known schema
● Methods to be overridden:
○ def sqlContext
Returns the SQLContext used for building DataFrames
○ def schema: StructType
Returns the schema of the relation as a StructType (analogous to a Hive SerDe)
● Ex : CsvRelation.scala
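A minimal sketch of such a relation (schema discovery here is simplified: it reads the header line and treats every column as a string, which matches the state of the code at this tag):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative relation; the real CsvRelation.scala may differ
case class CsvRelation(path: String, userSchema: Option[StructType])
                      (@transient val sqlContext: SQLContext)
  extends BaseRelation {

  // Use the user-supplied schema if given; otherwise discover it
  // from the header line, treating every column as a string for now
  override def schema: StructType = userSchema.getOrElse {
    val header = sqlContext.sparkContext.textFile(path).first()
    StructType(header.split(",").map(col =>
      StructField(col, StringType, nullable = true)))
  }
}
```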
Reading Data
Tag v0.2
TableScan
● TableScan is a trait to be implemented for reading data
● It is a BaseRelation that can produce all of its tuples as an RDD of Row objects
● Method to override:
○ def buildScan(): RDD[Row]
● In the CSV example, we use sc.textFile to create the RDD and Row.fromSeq to convert each line to a Row
● Ex : CsvTableScanExample.scala
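If CsvRelation also extends TableScan, the scan could be sketched as below (quoting and escaping are ignored; this is not a full CSV parser):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Inside CsvRelation, which now also extends the TableScan trait
override def buildScan(): RDD[Row] = {
  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header)                        // drop the header line
       .map(line => Row.fromSeq(line.split(",", -1).toSeq))
}
```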
Data Type inference
Tag v0.3
Inferring data types
● So far, every value has been treated as a string
● Sample the data and infer a schema for each sampled row
● Take the inferred schema of the first row
● Update the table scan to cast values to the right data types
● Ex: CsvSchemaDiscovery.scala
● Ex : SalesSumExample.scala
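The per-value inference is plain Scala; a simple sketch that tries Int, then Double, and falls back to String (the strings here stand in for Spark's IntegerType/DoubleType/StringType):

```scala
import scala.util.Try

// Infer the narrowest type for a single CSV value:
// Int first, then Double, falling back to String
def inferType(value: String): String =
  if (Try(value.toInt).isSuccess) "IntegerType"
  else if (Try(value.toDouble).isSuccess) "DoubleType"
  else "StringType"

// Infer a row's schema by inferring each field independently
def inferRowTypes(row: Seq[String]): Seq[String] = row.map(inferType)
```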
Save As Csv
Tag v0.4
CreatableRelationProvider
● DefaultSource should implement the CreatableRelationProvider trait in order to support the save call
● Override the createRelation method to implement the save mechanism
● Convert each Row to a string and use saveAsTextFile to save
● Ex : CsvSaveExample.scala
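A sketch of the save path (SaveMode handling is omitted; CsvRelation is the illustrative relation class from earlier):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}

// Added to DefaultSource to support saving a DataFrame through this source
class DefaultSource extends CreatableRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    // Render each Row as a comma-separated line and write it as text
    data.rdd
        .map(row => row.toSeq.mkString(","))
        .saveAsTextFile(path)
    CsvRelation(path, Some(data.schema))(sqlContext)
  }
}
```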
Column Pruning
Tag v0.5
PrunedScan
● CsvRelation should implement the PrunedScan trait to optimize column access
● PrunedScan tells the data source which columns the query wants to access
● When we build the RDD[Row], we include only the needed columns
● There is no performance benefit for CSV data (it is just for demo), but it brings great performance benefits in sources like JDBC
● Ex : SalesSumExample.scala
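With PrunedScan, buildScan receives the required column names; a sketch inside CsvRelation (same simplified parsing as before):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Inside CsvRelation, now extending PrunedScan instead of TableScan.
// requiredColumns lists only the columns the query actually needs.
override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
  val fieldNames = schema.fieldNames
  // Map each requested column name to its position in the file
  val indices = requiredColumns.map(fieldNames.indexOf(_))
  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header)
       .map { line =>
         val values = line.split(",", -1)
         Row.fromSeq(indices.map(values(_)).toSeq)
       }
}
```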
Filter push-down
Tag v0.6
PrunedFilterScan
● CsvRelation should implement PrunedFilterScan trait
to optimize filtering
● PrunedFilterScan pushes filters to data source
● When we build RDD[Row] we only give rows which
satisfy the filter
● It’s an optimization. The filters will be evaluated again.
● Ex :CsvFilerExample.scala
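A sketch of a filtered scan inside CsvRelation; only EqualTo is handled here, and unhandled filters are simply ignored, which is safe because Spark re-evaluates all filters afterwards:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Inside CsvRelation, now extending PrunedFilteredScan
override def buildScan(
    requiredColumns: Array[String],
    filters: Array[Filter]): RDD[Row] = {
  val fieldNames = schema.fieldNames

  // Best-effort push-down: rows failing a handled filter are skipped,
  // anything we cannot handle passes through for Spark to re-check
  def matches(values: Array[String]): Boolean = filters.forall {
    case EqualTo(attribute, value) =>
      values(fieldNames.indexOf(attribute)) == value.toString
    case _ => true
  }

  val indices = requiredColumns.map(fieldNames.indexOf(_))
  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header)
       .map(_.split(",", -1))
       .filter(matches)
       .map(values => Row.fromSeq(indices.map(values(_)).toSeq))
}
```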
