Harnessing Spark Catalyst for Custom Data Payloads

GIS Raster Support in Spark DataFrames
Simeon H.K. Fitch
Co-Founder & VP of R&D, Astraea

Astraea
• Developing a machine learning platform to
make solving planetary problems easier
• With exploding population growth and finite
resources, we need to have tools to better plan
for sustainable growth
• We aim to bring earth science data to business
applications through machine learning
See the earth. As it was, as it is, as it could be.

Preface
• Assumptions:
– Basic knowledge of Spark, Resilient Distributed Datasets (RDDs), and the DataFrame
compute model
– Basic understanding of a typical ETL/ML pipeline
• Prior Art:
– The approach outlined here is derived from other work
– Fundamental raster support via Azavea’s GeoTrellis
– Spark integration cues taken from:
• CCRi’s GeoMesa
• Databricks’ Spark-Avro
• Caveat Emptor:
– As of Spark 2.1.0, approach is not officially sanctioned;
uses undocumented, private APIs
– Not for everyone, but for us, benefits outweigh the risks

PROBLEM STATEMENT
To efficiently and effectively build machine learning models with Earth observation data

Data Native Form
[Figure: a remote sensing data product (Granule/Scene/Tile; GeoTIFF, HDF-EOS, GML-JPEG2000) consists of a multiband tile (Band a, Band b, Band c, …), a Temporal Projected Extent (TPE), and Granule Metadata (GM), i.e. granule-wide properties.]
Example granule metadata (Key: Value):
• TileID: 51004010
• long_name: Band 32 emissivity
• valid_range: 1, 255
• scale_factor: 0.002
• add_offset: 0.49
• …

Canonical ML Functional Form
[Figure: a multiband tile (Band a, Band b, Band c) with its Temporal Projected Extent (TPE) and Granule Metadata (GM) is flattened into Spark DataFrame rows. Each row (i.e. one ML observation) holds the band values at a single cell (a1, b1, c1, …), the projected extent of the tile plus the cell row/column (TPE, [r1, c1]), and the granule metadata (GM).]
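The flattening described on this slide can be illustrated with a minimal sketch in plain Python (standard library only, not Spark or GeoTrellis). The function name `explode_tile`, the record field names, and the sample values are assumptions made for illustration; the slide only specifies that each row carries the band values at one cell plus the tile extent, the cell row/column, and the granule metadata.

```python
from typing import Dict, List, Tuple

Tile = List[List[float]]  # one band: a list of rows of cell values


def explode_tile(bands: Dict[str, Tile],
                 extent: Tuple[float, float, float, float],
                 metadata: Dict[str, object]) -> List[dict]:
    """Flatten a multiband tile into one record per cell (hypothetical helper)."""
    names = sorted(bands)
    first = bands[names[0]]
    n_rows, n_cols = len(first), len(first[0])
    records = []
    for r in range(n_rows):
        for c in range(n_cols):
            # Every record repeats the tile-level context (extent, metadata)
            # and adds the cell position within the tile.
            record = {"extent": extent, "row": r, "col": c, "metadata": metadata}
            for name in names:
                record[name] = bands[name][r][c]  # band value at this cell
            records.append(record)
    return records


bands = {"a": [[1, 2], [3, 4]], "b": [[10, 20], [30, 40]]}
rows = explode_tile(bands, (0.0, 0.0, 60.0, 60.0), {"TileID": 51004010})
```

A 2×2 tile with two bands yields four records, each with one value per band. This is the shape an ML library expects: one observation per row, one feature per column.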

Delivering Imagery to ML
[Figure: pipeline diagram. Labels: Scenes/Granules (Scene 1 … Scene N, acquisition times t0 … tf, bands b1 … bn; axes: time, wavelength); World-wide data coverage; SLAAW; Base Analytics Functional Form (BAFF), Distributed DataFrame; Data Quality Check (DQC); Exploratory Data Analysis (EDA); Feature Engineering; Analytics Base Table (ABT), Distributed DataFrame; Scalable Machine Learning.]
Why This is Hard: Dimensionality
• Spatial (500 m → 5 m → 30 cm)
• Temporal (refresh rates: weeks → daily → hourly)
• Spectral (4 bands → 200 bands)
• Plus metadata:
– Coordinate Reference System
– Temporal/Spatial Extent
– QA Flags
– Calibration parameters
• Example sources: Landsat 8, Planet, DigitalGlobe, Planetary Resources

Why This is Hard: Data Footprint
As resolution scales, image size explodes.
Data footprint for one football-field-size multiband raster (single point in time!):
• Landsat 8 (NASA): 30 meters, 8 band, 0.5 GB/image
• Planet (PlanetScope Ortho): 3 meters, 4 band, 16 GB/image
• DigitalGlobe: 30 centimeters, 4 band, 1.0 TB/image
• Planetary Resources: 10 m resolution, 200 band (hyper-spectral), 50 TB/image?
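The scaling behind "image size explodes" can be made concrete with a small arithmetic sketch in plain Python. The ground extent, the 2-byte sample size, and the function name `raster_bytes` are assumptions chosen for illustration. The sketch does not reproduce the per-image figures on the slide; it only shows that, for a fixed ground area, the uncompressed size grows with the inverse square of the cell size and linearly with the band count.

```python
def raster_bytes(width_m: float, height_m: float, resolution_m: float,
                 bands: int, bytes_per_sample: int = 2) -> int:
    """Uncompressed size of a raster covering a fixed ground area.

    All parameters are illustrative assumptions, not sensor specifications.
    """
    cols = round(width_m / resolution_m)   # cells across
    rows = round(height_m / resolution_m)  # cells down
    return cols * rows * bands * bytes_per_sample


# Same 30 km x 30 km area and band count, cell size reduced from 30 m to 3 m.
coarse = raster_bytes(30_000, 30_000, 30, bands=8)
fine = raster_bytes(30_000, 30_000, 3, bands=8)
growth = fine // coarse  # a 10x finer cell size gives 10 * 10 times the cells
```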

CAPABILITY DEMONSTRATION
Prototyping Spark Catalyst raster integration

Domain-Specific Data Discretization
• Swath ~ Granule ~ Scene ~ Raster: n × m where n, m ≳ 1200 (e.g. Landsat 8: 7600²)
⇓
• Tile ~ Chip: n², where n ≲ 512 (typical: 64² to 256²)
⇓
• Cell ~ Pixel: 1×1
Each of these has one or more “bands”
(e.g. Landsat 8: 11, MODIS: 36, Hyperion: 220)
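The raster → tile → cell hierarchy can be sketched in plain Python as follows. This is an illustration only: the function name `split_into_tiles` and the grid-key layout are assumptions, and in the actual stack GeoTrellis performs this tiling.

```python
from typing import Dict, List, Tuple

Raster = List[List[int]]  # single band: a list of rows of cell values


def split_into_tiles(raster: Raster, tile_size: int) -> Dict[Tuple[int, int], Raster]:
    """Cut a raster into square tiles keyed by (tile_row, tile_col).

    Tiles on the bottom or right edge are smaller when the raster
    dimensions are not multiples of tile_size.
    """
    n_rows, n_cols = len(raster), len(raster[0])
    tiles = {}
    for top in range(0, n_rows, tile_size):
        for left in range(0, n_cols, tile_size):
            tile = [row[left:left + tile_size]
                    for row in raster[top:top + tile_size]]
            tiles[(top // tile_size, left // tile_size)] = tile
    return tiles


# A 4 x 6 "scene" cut into 2 x 2 tiles gives a 2 x 3 grid of tiles.
scene = [[r * 6 + c for c in range(6)] for r in range(4)]
tiles = split_into_tiles(scene, 2)
```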

TileUDT and Friends
• Using the approach covered in the next section, we register TileUDT with Spark
• With UDTs come User Defined Functions (UDFs)
• Some examples:
– vectorizeTiles
– explodeTiles
– localMax
– localMin
– localStats
– localAdd
– localSubtract
– tileHistogram
– tileStatistics
– tileMean
– aggHistogram
– aggStats
See work-in-progress code and examples/tests in:
https://github.com/s22s/geotrellis-spark-sql/
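To give a feel for what functions like these compute, here is a plain-Python sketch of three of them. The semantics are assumed from the names and from the usual Map Algebra meaning of a "local" operation (cell-by-cell across tiles of the same shape). The Python names `local_add`, `local_max`, and `tile_mean` are hypothetical stand-ins, not the project's implementation.

```python
from typing import List

Tile = List[List[float]]  # single band: a list of rows of cell values


def local_add(left: Tile, right: Tile) -> Tile:
    """Cell-wise sum of two tiles of identical shape."""
    return [[a + b for a, b in zip(row_l, row_r)]
            for row_l, row_r in zip(left, right)]


def local_max(left: Tile, right: Tile) -> Tile:
    """Cell-wise maximum of two tiles of identical shape."""
    return [[max(a, b) for a, b in zip(row_l, row_r)]
            for row_l, row_r in zip(left, right)]


def tile_mean(tile: Tile) -> float:
    """Mean of all cell values in one tile (tile in, scalar out)."""
    cells = [value for row in tile for value in row]
    return sum(cells) / len(cells)


summed = local_add([[1, 2], [3, 4]], [[10, 20], [30, 40]])
highest = local_max([[1, 5]], [[3, 2]])
mean = tile_mean([[1, 2], [3, 4]])
```

The local functions map tiles to tiles and so can be chained, while the tile-level statistics reduce a tile to a scalar that can feed ordinary DataFrame aggregations.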

TileUDT Notebook Demo
ZeppelinHub Version

IMPLEMENTATION
From GeoTIFF to RDD[Tile] to Dataset[Tile] to DataFrame

Software Stack
• Scala
• Apache Spark
• GeoTrellis
• Accumulo
• Docker
• Apache Zeppelin

GeoTrellis
• GeoTrellis is an open source
Scala framework for efficiently
manipulating raster GIS data
• Provides facilities to ingest and
process tiles at scale
• Has powerful abstractions for
working with RDD[Tile]s.
– Mosaicing, stitching, pyramiding,
resampling, reprojecting, etc.
– Implements C. Dana Tomlin’s
“Map Algebra”

Getting From RDDs to DataFrames
• Goal: work with tiles via DataFrame APIs
– Better ergonomics
– More computationally efficient
– Required for SparkML
• Bonus: if a capability is available in
DataFrames, it’s also available in SQL!

Encoding Data with Spark Catalyst
• Catalyst is the engine behind Spark DataFrames & SQL
• Moving data from RDDs to DataFrames requires using one of two
Catalyst APIs:
– ExpressionEncoder[Tile] or
– UserDefinedType[Tile]
• Both are (currently) package private
• Both have steep learning curves
• Both are extremely powerful once harnessed
– ExpressionEncoder is ideal for simple structures
– UserDefinedType is more efficient for larger data payloads
• For our needs, UserDefinedType (UDT) is the best fit

Anatomy of a UDT
Annotations on the UDT code listing (listing not reproduced here):
• To access the private API, the class needs to live in a subpackage of sql.
• The supertype is parameterized on the user type.
• A name that is shown in the schema and query plan.
• A runtime class descriptor of the user type.
• A schema describing how the type will be encoded within Catalyst. You have lots of flexibility here, even using other UDTs. In this example we pack the tile into an opaque blob.
• A conversion from the user data type to the Catalyst encoding.
• A conversion from the Catalyst encoding to the user data type.
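The parts listed above can be mirrored in a plain-Python analogue, shown here only to make the roles concrete. It does not use Spark. The `Tile` dataclass, the attribute names `type_name`, `user_class`, and `sql_type`, and the byte layout (a small header followed by 64-bit floats) are all assumptions for illustration, not the Catalyst API or the GeoTrellis tile encoding.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass
class Tile:
    """Stand-in user type: a single-band tile stored row-major."""
    cols: int
    rows: int
    cells: List[float]


class TileUDT:
    """Analogue of a UDT: a name, a user class, a schema, and two conversions."""
    type_name = "tile"             # name shown in schema and query plan
    user_class = Tile              # runtime class descriptor of the user type
    sql_type = {"data": "binary"}  # hypothetical schema: one opaque blob field

    def serialize(self, tile: Tile) -> bytes:
        # User type -> encoded form: 2 unsigned ints, then the cell values.
        header = struct.pack("<II", tile.cols, tile.rows)
        body = struct.pack(f"<{len(tile.cells)}d", *tile.cells)
        return header + body

    def deserialize(self, blob: bytes) -> Tile:
        # Encoded form -> user type: read the header, then cols * rows doubles.
        cols, rows = struct.unpack_from("<II", blob, 0)
        cells = list(struct.unpack_from(f"<{cols * rows}d", blob, 8))
        return Tile(cols, rows, cells)


udt = TileUDT()
original = Tile(2, 2, [1.0, 2.0, 3.0, 4.0])
blob = udt.serialize(original)
restored = udt.deserialize(blob)
```

The key property is the round trip: deserializing a serialized tile must give back an equal tile, since the query engine only ever sees the encoded form.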

UDT Registration
• A user-defined type is registered with
Catalyst by providing a mapping between
the native type and the UDT
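The idea of registration as a mapping can be sketched in plain Python. This illustrates the lookup pattern only, not Spark's registration API: `UDTRegistry`, `register`, `udt_for`, and the stub `Tile` and `TileUDT` classes are all hypothetical names.

```python
from typing import Dict


class Tile:
    """Stub native type (hypothetical)."""


class TileUDT:
    """Stub UDT that would know how to encode a Tile (hypothetical)."""


class UDTRegistry:
    """Maps a native class name to the UDT class that handles it."""

    def __init__(self) -> None:
        self._by_class_name: Dict[str, type] = {}

    def register(self, user_class: type, udt_class: type) -> None:
        # Keyed by class name, so the lookup needs only the value's type.
        self._by_class_name[user_class.__qualname__] = udt_class

    def udt_for(self, value: object) -> object:
        udt_class = self._by_class_name.get(type(value).__qualname__)
        if udt_class is None:
            raise TypeError(f"no UDT registered for {type(value).__qualname__}")
        return udt_class()


registry = UDTRegistry()
registry.register(Tile, TileUDT)
handler = registry.udt_for(Tile())
```

Once the mapping exists, any code that meets a value of the native type can find its encoder without the native type needing to know about the engine.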

Spark Catalyst Toolbox
• User Defined Type (UDT)
• User Defined Function (UDF, 2 forms)
• User Defined Aggregation Function (UDAF)
• User Defined Table Function (UDTF, a.k.a.
“Generator”)
• Data Source
• Query Plan
• Optimization Rule

Future Work
• GeoTrellis Layer Store as an integrated
Spark DataSource (in progress)
• Expanding standard GeoTrellis RDD
features into efficient UDFs
• GIS Vector primitives (a la GeoMesa)
• Becoming an official module of GeoTrellis
