Harnessing Spark Catalyst for
Custom Data Payloads
GIS Raster Support in Spark DataFrames
Simeon H.K. Fitch
Co-Founder & VP of R&D, Astraea
Astraea
• Developing a machine learning platform to
make solving planetary problems easier
• With exploding population growth and finite
resources, we need to have tools to better plan
for sustainable growth
• We aim to bring earth science data to business
applications through machine learning
See the earth. As it was, as it is, as it could be.
Preface
• Assumptions:
– Basic knowledge of Spark, Resilient Distributed Datasets (RDDs), and the DataFrame
compute model
– Basic understanding of a typical ETL/ML pipeline
• Prior Art:
– The approach outlined here is derived from other work
– Fundamental raster support via Azavea’s GeoTrellis
– Spark integration cues taken from:
• CCRi’s GeoMesa
• Databricks’ Spark-Avro
• Caveat Emptor:
– As of Spark 2.1.0, this approach is not officially sanctioned;
it uses undocumented, private APIs
– Not for everyone, but for us the benefits outweigh the risks
PROBLEM STATEMENT
To efficiently and effectively build machine learning models with Earth observation data
Data Native Form
[Figure: a remote sensing data product (granule/scene/tile; GeoTIFF, HDF-EOS, GML-JPEG2000) drawn as a multiband tile (Band a, Band b, Band c, …) together with its Temporal Projected Extent (TPE) and Granule Metadata (GM).]
Granule-wide properties (example):
Key            Value
TileID         51004010
long_name      Band 32 emissivity
valid_range    1, 255
scale_factor   0.002
add_offset     0.49
…              …
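Granule-wide properties such as scale_factor, add_offset, and valid_range are what turn stored integers into physical values. A minimal sketch of that step, assuming the common linear convention (physical = raw × scale_factor + add_offset); the function name `decode_cell` is hypothetical, and the convention should be checked against the data product's documentation:

```python
from typing import Optional, Tuple


def decode_cell(raw: int, scale_factor: float, add_offset: float,
                valid_range: Tuple[int, int]) -> Optional[float]:
    """Convert a stored integer cell value to a physical value.

    Assumes physical = raw * scale_factor + add_offset (an assumption,
    not confirmed by the slides). Values outside valid_range are
    treated as no-data and returned as None.
    """
    low, high = valid_range
    if not (low <= raw <= high):
        return None
    return raw * scale_factor + add_offset


# Metadata values taken from the example table on the slide.
band32_meta = {"scale_factor": 0.002, "add_offset": 0.49, "valid_range": (1, 255)}

emissivity = decode_cell(245, **band32_meta)  # stored value 245 is inside the valid range
no_data = decode_cell(0, **band32_meta)       # 0 is outside the valid range, so None
```

Keeping this metadata next to every tile is one reason the granule metadata travels with the data into the DataFrame.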
Canonical ML Functional Form
[Figure: the multiband tile (Band a, Band b, Band c) with its Temporal Projected Extent (TPE) and Granule Metadata (GM) is flattened into Spark DataFrame rows. Each row (i.e. one ML observation) holds the band values at a single cell (a1, b1, c1, …), the projected extent of the tile plus the cell row/column [r1, c1], and the granule metadata.]
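The flattening in this form can be sketched in plain Python: one record per cell, carrying every band's value at that cell plus the tile extent and granule metadata. This is an illustration only; `explode_tile` and the field names are hypothetical, and the talk's actual implementation works on Spark DataFrames rather than Python lists.

```python
from typing import Dict, Iterator, List

Band = List[List[float]]  # one band stored as rows of cell values


def explode_tile(bands: Dict[str, Band], tile_extent: str,
                 granule_meta: Dict[str, str]) -> Iterator[dict]:
    """Yield one flat record (one ML observation) per cell of a multiband tile."""
    first = next(iter(bands.values()))
    n_rows, n_cols = len(first), len(first[0])
    for r in range(n_rows):
        for c in range(n_cols):
            record = {"extent": tile_extent, "row": r, "col": c}
            # Band values at this single cell, one column per band.
            record.update({name: band[r][c] for name, band in bands.items()})
            # Granule-wide metadata is repeated on every record.
            record.update(granule_meta)
            yield record


tile_bands = {
    "a": [[1, 2], [3, 4]],
    "b": [[5, 6], [7, 8]],
    "c": [[9, 10], [11, 12]],
}
rows = list(explode_tile(tile_bands, "tile-A", {"tile_id": "51004010"}))
```

A 2 × 2 tile with three bands yields four records with three band columns each, which is the shape most ML libraries expect.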
Delivering Imagery to ML
[Figure: pipeline diagram. Labels: Scenes/Granules (Scene 1 … Scene N, each with bands b1 … bn, acquired at times t0 … tf); SLAAW; Analytics Base Table (ABT); Data Quality Check (DQC); Exploratory Data Analysis (EDA); Feature Engineering; Base Analytics Functional Form (BAFF). Captions: World-wide data coverage; Distributed DataFrame; Distributed DataFrame; Scalable Machine Learning. Axes: time, wavelength.]
Why This is Hard: Dimensionality
• Spatial (500 m → 5 m → 30 cm)
• Temporal (refresh rates: weeks → daily → hourly)
• Spectral (4 bands → 200 bands)
• Plus metadata:
– Coordinate Reference System
– Temporal/Spatial Extent
– QA Flags
– Calibration parameters
Example providers: Landsat 8, Planet, DigitalGlobe, Planetary Resources
Why This is Hard: Data Footprint
As resolution scales, image size explodes
Data footprint for one football-field-size multiband raster (single point in time!)
• Landsat 8 (NASA): 30 m resolution, 8 bands, 0.5 GB/image
• Planet (PlanetScope Ortho): 3 m resolution, 4 bands, 16 GB/image
• DigitalGlobe: 30 cm resolution, 4 bands, 1.0 TB/image
• Planetary Resources: 10 m resolution, 200 bands (hyperspectral), 50 TB/image?
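The reason size "explodes" is that cell count grows with the square of the resolution improvement. A rough sketch of the arithmetic, assuming an uncompressed square raster and a fixed number of bytes per cell; the 100 km side length and 2 bytes per cell are illustrative assumptions, not figures from the slide, so the results will not reproduce the per-image sizes listed above:

```python
import math


def uncompressed_bytes(side_m: float, resolution_m: float, bands: int,
                       bytes_per_cell: int = 2) -> int:
    """Estimate the raw size of a square raster covering side_m x side_m metres."""
    cells_per_side = math.ceil(side_m / resolution_m)
    return cells_per_side * cells_per_side * bands * bytes_per_cell


# Same ground area and band count, 10x finer resolution.
coarse = uncompressed_bytes(side_m=100_000, resolution_m=30, bands=4)
fine = uncompressed_bytes(side_m=100_000, resolution_m=3, bands=4)
growth = fine / coarse  # close to (30 / 3) ** 2 = 100
```

Band count and revisit rate multiply on top of this, which is why all three dimensions matter for the total footprint.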
CAPABILITY DEMONSTRATION
Prototyping Spark Catalyst raster integration
Domain-Specific Data Discretization
Swath ~ Granule ~ Scene ~ Raster: n × m cells, where n, m ≳ 1200 (e.g. Landsat 8: 7600²)
⇓
Tile ~ Chip: n² cells, where n ≲ 512 (typical: 64² to 256²)
⇓
Cell ~ Pixel: 1 × 1
Each of these has one or more “bands” (e.g. Landsat 8: 11, MODIS: 36, Hyperion: 220)
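To make the scene-to-tile step concrete, here is a small sketch that cuts a raster's index space into tile windows. The helper `tile_grid` is hypothetical (GeoTrellis provides its own tiling facilities); the 7600 × 7600 scene and 256-cell tile size come from the figures above.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (row_start, col_start, height, width)


def tile_grid(n_rows: int, n_cols: int, tile_size: int) -> Iterator[Window]:
    """Yield windows of at most tile_size x tile_size cells covering the raster."""
    for r0 in range(0, n_rows, tile_size):
        for c0 in range(0, n_cols, tile_size):
            # Edge tiles are clipped to the raster bounds.
            yield r0, c0, min(tile_size, n_rows - r0), min(tile_size, n_cols - c0)


windows = list(tile_grid(7600, 7600, 256))
tile_count = len(windows)  # 30 x 30 = 900 tiles
```

Each window becomes one tile, and each tile later becomes one value in a DataFrame column.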
TileUDT and Friends
• Using the approach covered in the next section we register TileUDT
with Spark
• With UDTs come User Defined Functions (UDFs)
• Some examples:
– vectorizeTiles
– explodeTiles
– localMax
– localMin
– localStats
– localAdd
– localSubtract
– tileHistogram
– tileStatistics
– tileMean
– aggHistogram
– aggStats
See work-in-progress code and examples/tests in:
https://github.com/s22s/geotrellis-spark-sql/
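The names above suggest Map Algebra conventions: "local" operations combine tiles cell by cell, while "tile" operations reduce one tile to a summary. A plain-Python sketch of those two semantics, as an assumption about what localAdd and tileMean compute; `local_add` and `tile_mean` here are hypothetical stand-ins, not the project's functions:

```python
from typing import List

Tile = List[List[float]]  # a single-band tile as rows of cell values


def local_add(left: Tile, right: Tile) -> Tile:
    """Cell-wise sum of two tiles with identical dimensions."""
    if len(left) != len(right) or any(len(a) != len(b) for a, b in zip(left, right)):
        raise ValueError("tiles must have the same dimensions")
    return [[a + b for a, b in zip(row_l, row_r)]
            for row_l, row_r in zip(left, right)]


def tile_mean(tile: Tile) -> float:
    """Mean of all cell values in one tile."""
    cells = [value for row in tile for value in row]
    return sum(cells) / len(cells)


band_a = [[1, 2], [3, 4]]
band_b = [[10, 20], [30, 40]]
summed = local_add(band_a, band_b)  # [[11, 22], [33, 44]]
mean_of_sum = tile_mean(summed)     # 27.5
```

Registered as UDFs, operations like these run per row across the whole distributed DataFrame.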
TileUDT Notebook Demo
ZeppelinHub Version
IMPLEMENTATION
From GeoTiff to RDD[Tile] to Dataset[Tile] to DataFrame
Software Stack
• Scala
• Apache Spark
• GeoTrellis
• Accumulo
• Docker
• Apache Zeppelin
GeoTrellis
• GeoTrellis is an open source
Scala framework for efficiently
manipulating raster GIS data
• Provides facilities to ingest and
process tiles at scale
• Has powerful abstractions for
working with RDD[Tile]s.
– Mosaicing, stitching, pyramiding,
resampling, reprojecting, etc.
– Implements C. Dana Tomlin’s
“Map Algebra”
Getting From RDDs to DataFrames
• Goal: work with tiles via DataFrame APIs
– Better ergonomics
– More computationally efficient
– Required for SparkML
• Bonus: if a capability is available in
DataFrames, it’s also available in SQL!
Encoding Data with Spark Catalyst
• Catalyst is the engine behind Spark DataFrames & SQL
• Moving data from RDDs to DataFrames requires using one of two
Catalyst APIs:
– ExpressionEncoder[Tile] or
– UserDefinedType[Tile]
• Both are (currently) package private
• Both have steep learning curves
• Both are extremely powerful once harnessed
– ExpressionEncoder is ideal for simple structures
– UserDefinedType is more efficient for larger data payloads
• For our needs, UserDefinedType (UDT) is the best fit
Anatomy of a UDT
• To access the private API, the class needs to be in a subpackage of sql
• Supertype parameterized on the user type
• Name shown in schema and query plan
• Runtime class descriptor of the user type
• Schema describing how the type will be encoded within Catalyst. You have lots of flexibility here, even using other UDTs. In this example we pack the tile into an opaque blob.
• Conversion from user data type to Catalyst encoding
• Conversion from Catalyst encoding to user data type
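The parts of a UDT listed here can be mirrored in a language-neutral sketch. The talk's UDT is a Scala class extending Spark's UserDefinedType[Tile]; the Python below is not Spark's API, only an analogy showing a type name, a user class, a storage schema, and the two conversions, with the tile packed into an opaque blob. `Tile`, `TileUDTSketch`, and the blob layout are all hypothetical.

```python
import struct
from typing import List


class Tile:
    """Hypothetical stand-in for a raster tile: a grid of 32-bit floats."""

    def __init__(self, cols: int, rows: int, cells: List[float]):
        if len(cells) != cols * rows:
            raise ValueError("cell count must equal cols * rows")
        self.cols, self.rows, self.cells = cols, rows, list(cells)

    def __eq__(self, other: object) -> bool:
        return (isinstance(other, Tile)
                and (self.cols, self.rows, self.cells)
                == (other.cols, other.rows, other.cells))


class TileUDTSketch:
    type_name = "tile"   # name shown in schema and query plan
    user_class = Tile    # runtime class descriptor of the user type
    sql_type = "binary"  # storage schema: one opaque blob

    def serialize(self, tile: Tile) -> bytes:
        """User type -> storage encoding (8-byte header, then float32 cells)."""
        header = struct.pack("<II", tile.cols, tile.rows)
        body = struct.pack(f"<{len(tile.cells)}f", *tile.cells)
        return header + body

    def deserialize(self, blob: bytes) -> Tile:
        """Storage encoding -> user type."""
        cols, rows = struct.unpack_from("<II", blob, 0)
        cells = struct.unpack_from(f"<{cols * rows}f", blob, 8)
        return Tile(cols, rows, list(cells))


udt = TileUDTSketch()
original = Tile(2, 2, [1.0, 2.5, -3.0, 4.25])
blob = udt.serialize(original)
restored = udt.deserialize(blob)
```

An opaque blob keeps the encoding compact for large payloads, at the cost that the engine cannot inspect individual cells without deserializing.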
UDT Registration
• A user-defined type is registered with
Catalyst by providing a mapping between the
native type and the UDT
Spark Catalyst Toolbox
• User Defined Type (UDT)
• User Defined Function (UDF, 2 forms)
• User Defined Aggregation Function (UDAF)
• User Defined Table Function (UDTF, a.k.a.
“Generator”)
• Data Source
• Query Plan
• Optimization Rule
Future Work
• GeoTrellis Layer Store as an integrated
Spark DataSource (in progress)
• Expanding standard GeoTrellis RDD
features into efficient UDFs
• GIS Vector primitives (a la GeoMesa)
• Becoming an official module of GeoTrellis
THANK YOU!
The End


Editor's Notes

  • #2: Approach is general, not limited to GIS/EO
  • #3: A little about who we are and what we’re up to
  • #4: Explain why it matters: you can't be a data scientist if you can't get the data into the form you need for modelling
  • #5: What we're working on at Astraea: platform to allow data scientists to efficiently build and deploy models based on EO data.
  • #8: SLAAW has to happen before you can even start your experimental design. Save the data scientist's time by providing higher-level abstractions for doing the “science”. Make a really challenging data source more accessible to the data scientist. Two goals: address SLAAW; make data science steps more efficient. World-wide collections of data: need to be able to scale. Note the distinction between Python/R dataframes and Spark's distributed ones.
  • #13: These functions can be applied globally to the distributed DataFrame. Allows for SLAAW, DQC, EDA, FE.
  • #15: Get rasters into Spark. Manipulate rasters. Move rasters into a DataFrame.
  • #17: GeoTrellis gets the imagery into Spark. Map Algebra provides fundamental sets of primitives for performing analytics on GIS raster data.
  • #18: GeoTrellis alone only gets us part of the way there