Make Your
PySpark Data Fly
with Apache Arrow!
Bryan Cutler
Software Engineer
@BryanCutler
DBG / May 2, 2018 / © 2019 IBM Corporation
About Bryan
@BryanCutler on Github
Software Engineer, IBM
Center for Open-Source Data & AI Technologies
(CODAIT)
Big Data Machine Learning & AI
Apache Spark committer
Apache Arrow committer
TensorFlow I/O maintainer
Center for Open Source
Data and AI Technologies
CODAIT
codait.org
CODAIT aims to make AI solutions
dramatically easier to create,
deploy, and manage in the
enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
Improving Enterprise AI Lifecycle in Open
Source
Agenda
Overview of Apache Arrow
Intro to Arrow Flight
How to talk Arrow
Flight in Action
Apache Arrow Overview
About Arrow
Apache Arrow
Standard format for in-memory columnar data
● Implementations in many languages and growing
● Built for efficient analytic operations on modern hardware
Has built-in primitives for basic exchange of Arrow data
● Zero-copy data sharing within a process
● IPC with Arrow record batch messages
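The zero-copy idea can be sketched with only the Python standard library: model each column as one contiguous typed buffer (as Arrow does) and take slices as views rather than copies. This is a toy analogy of the layout, not the pyarrow API.

```python
# Toy sketch of Arrow's layout: one contiguous, typed buffer per column,
# with zero-copy slicing via memoryview. (Illustrative only -- real Arrow
# buffers also carry validity bitmaps, metadata, and an IPC format.)
from array import array

# A "record batch": a set of equal-length column buffers.
batch = {
    "id": array("q", [1, 2, 3, 4]),             # int64-like column
    "score": array("d", [0.1, 0.2, 0.3, 0.4]),  # float64-like column
}

# memoryview exposes the underlying buffer without copying it.
ids = memoryview(batch["id"])
head = ids[:2]  # a view into the same memory, not a new buffer

print(head.tolist())            # [1, 2]
print(head.obj is batch["id"])  # True: still backed by the original buffer
```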
Why use Arrow
Apache Arrow
Arrow brings many benefits:
● Common standard with cross-language support
● Better interoperability between frameworks
● Avoids costly data serialization
Who is using Arrow
Apache Arrow
The Apache® Software Foundation Announces Apache Arrow™ Momentum
● Adopted by dozens of open-source and commercial technologies
● Exceeded 1,000,000 monthly downloads within its first three years as an Apache Top-Level Project
● Used by Apache Spark, NVIDIA RAPIDS, pandas, and Dremio, among others
https://arrow.apache.org/powered_by
Source: https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces46
Arrow Flight
Introduction
Arrow Flight
Arrow Flight is an Arrow-native RPC framework
Defines a standard protocol for data exchange
Makes it easy to efficiently move data around a network by providing [1]:
● Arrow Data as a Service
● Batch Streams
● Stream Management
Arrow Data as a Service
Arrow Flight
Extensible data service
● Clients get/put Arrow data
● List available data
● Custom actions
● Can think of it as ODBC for in-memory data
Stream Batching
Arrow Flight
An Arrow stream is a schema + record batches
A Flight is composed of multiple streams
● Streams can come from different endpoints
● Transfer data in bulk for efficiency
● Location info can be used to improve data locality
Flight = Stream 1 [Record Batch, Record Batch] + Stream 2 [Record Batch, Record Batch]
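A minimal pure-Python model of this structure (names here are illustrative, not the Flight API): a stream pairs a schema with a list of record batches, and a Flight is just a collection of streams whose batches can be consumed in bulk.

```python
# Toy model of "Flight = multiple streams of record batches".
# The schema/stream/flight shapes below are illustrative, not Arrow APIs.
schema = ("id", "score")

# Each record batch: a tuple of equal-length columns matching the schema.
stream1 = {"schema": schema, "batches": [([1, 2], [0.1, 0.2]), ([3], [0.3])]}
stream2 = {"schema": schema, "batches": [([4, 5], [0.4, 0.5])]}

# A Flight is composed of multiple streams, possibly from different endpoints.
flight = [stream1, stream2]

# Consuming in bulk: total rows across all streams of the Flight.
total_rows = sum(len(b[0]) for s in flight for b in s["batches"])
print(total_rows)  # 5
```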
Stream Management
Arrow Flight
Service manages Flights for the clients
● FlightInfo gives a list of endpoints with the location of each stream in the Flight
● Streams are referenced by a ticket
– A ticket is an opaque struct that is unique for each stream
● Flight descriptors differentiate between Flights
– Can define how a Flight is composed
– Batch size, or even a SQL query
FlightDescriptor Types
Arrow Flight
Simple path-like:
 "datasets/cats-dogs/training"
Custom proto:
 message MyDescriptor {
   string sql_query = 1;
   int32 records_per_batch = 2;
 }
 message MyTicket {
   MyDescriptor desc = 1;
   string uuid = 2;
 }
Ticket Sequence for Consumer
Flight Example
To consume an entire Flight:
● Get FlightInfo for the list of endpoints with tickets
● For each endpoint:
– Use the ticket to get the endpoint's stream
– Process each RecordBatch in the stream
Sequence: Consumer → Flight Service: GetFlightInfo(FlightDescriptor) returns FlightInfo; for each endpoint, GetStream(Ticket) returns a RecordBatch stream; for each batch in the stream, get next and process the batch
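The ticket sequence above can be sketched with a stub in-memory service (all names hypothetical, mirroring only the protocol's shape): GetFlightInfo returns endpoints carrying opaque tickets, and each ticket is redeemed for that endpoint's stream of batches.

```python
# Stub Flight service to illustrate the consumer sequence.
# This mirrors the protocol shape only; it is not the real Flight API.
class StubFlightService:
    def __init__(self):
        # ticket -> list of record batches (each batch is a list of rows)
        self._streams = {
            b"ticket-0": [[("a", 1)], [("b", 2)]],
            b"ticket-1": [[("c", 3)]],
        }

    def get_flight_info(self, descriptor):
        # FlightInfo: one endpoint per stream, each referenced by a ticket.
        return [{"ticket": t} for t in self._streams]

    def get_stream(self, ticket):
        # Yield record batches one at a time, as a consumer would see them.
        yield from self._streams[ticket]

service = StubFlightService()
rows = []
# Consume the entire Flight: FlightInfo -> endpoints -> stream per ticket.
for endpoint in service.get_flight_info("my-descriptor"):
    for batch in service.get_stream(endpoint["ticket"]):
        rows.extend(batch)  # "process each RecordBatch"

print(len(rows))  # 3
```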
Benefits
Arrow Flight
● Applications use the client interface and exchange standard record batches
● Complex communication handled internally
● Efficient: uses batches and minimal copies
● Standardized protocol
– Authentication
– Support for different transports
– Able to handle backpressure
Current Status
Arrow Flight
Common protocol defined using protocol buffers
Prototype implementations in Java, C++, Python
Still experimental, but lots of work is being done to make it production-ready
How to Talk Arrow
Arrow Flight
If a system wants to exchange Arrow Flight data, it needs to be able to produce/consume an Arrow stream
● Spark kind of does already, but it's not externalized
● See SPARK-24579 and SPARK-26413
● Can build a Scala Flight connector with a little hacking
How to Talk Arrow
Arrow Flight
TensorFlow I/O has Arrow Datasets
● Maintained by the SIG-IO community
– Also many other inputs to TF
– Many sources from legacy contrib/
● Several Arrow datasets
– ArrowStreamDataset used here
● Input ops only for now
● Install: pip install tensorflow-io
Check it out at https://github.com/tensorflow/io
Flight in Action:
Spark to TensorFlow
Define the Service
Flight Example
Simple Service backed by an in-memory data store
● Keeps streams in memory
● Flight descriptor is a string id
● This is from the Java Flight examples
Make the Clients
Flight Example
PySpark will put Arrow data
● Map partition op of DataFrame to Arrow
● Each partition sent as a stream of batches
– A ticket is roughly the partition index
TensorFlow Dataset will get Arrow data
● Requests the entire Flight, which is multiple streams
● Gets one batch at a time to process
● Op outputs tensors
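The put side can be sketched like this (a dict-backed stand-in for the Flight service; names are hypothetical, the real connector uses Flight's doPut): each DataFrame partition becomes one stream, and its ticket is derived from the partition index.

```python
# Sketch of the put side: one stream per partition, ticketed by index.
# The dict-backed "store" is a stand-in for the real Flight service.
store = {}

def put_partition(descriptor, partition_index, batches):
    # A ticket is roughly the partition index, scoped by the descriptor.
    ticket = f"{descriptor}/{partition_index}"
    store[ticket] = list(batches)
    return ticket

# Two partitions of a DataFrame, each already split into record batches.
partitions = [
    [[1, 2], [3]],   # partition 0: two batches
    [[4, 5, 6]],     # partition 1: one batch
]
tickets = [put_partition("rad-spark", i, p) for i, p in enumerate(partitions)]

print(tickets)  # ['rad-spark/0', 'rad-spark/1']
```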
Data Flow
Flight Example
Spark Worker puts Stream 1 [Record Batch, Record Batch] and Stream 2 [Record Batch, Record Batch] to the Flight Service
TensorFlow gets the Flight (= Stream 1 + Stream 2) and processes the batches one at a time
Walkthrough
Flight Example
Application code is simple
– Only a few lines
– Focus on working with data
– No need to worry about conversion, file formats, or networking
The example is in Python, but the data never needs to go through Python!
Worker JVM → Flight Service → TF C++
""" PySpark Client
"""
# Spark job to put partitions to service
SparkFlightConnector.put(
    df,           # Existing DataFrame
    host, port,   # Flight Service ip
    'rad-spark'   # Data descriptor
)

""" TensorFlow Client
"""
# Arrow tf.data.Dataset gets Flight data
dataset = ArrowFlightDataset.from_schema(
    host, port,   # Flight Service ip
    'rad-spark',  # Data descriptor
    to_arrow_schema(df.schema)  # Schema
)
# Iterate over Flight data as tensors
it = dataset.make_one_shot_iterator()
Recap
Arrow Flight
Apache Arrow – standard for in-memory data
Arrow Flight – efficiently move data around a network
● Arrow data as a service
● Stream batching
● Stream management
Simple example with PySpark + TensorFlow
● Data transfer never goes through Python
Links & References
Apache Arrow and Flight specification
https://arrow.apache.org/
https://github.com/apache/arrow/blob/master/format/Flight.proto
TensorFlow I/O
https://github.com/tensorflow/io
Related Spark JIRAs
SPARK-24579
SPARK-26413
Example Code
https://github.com/BryanCutler/SparkArrowFlight
References: Flight Overview by Arrow PMC Jacques Nadeau
[1] https://www.slideshare.net/JacquesNadeau5/apache-arrow-flight-overview
Thank you!
codait.org
http://github.com/BryanCutler
developer.ibm.com/code
FfDL
Sign up for IBM Cloud and try Watson Studio!
https://ibm.biz/BdZgcx
https://datascience.ibm.com/
MAX
Backup Slides
Spark Client
Code
Map partitions to RecordBatches
Add partition batches into a stream
Put stream to service

// Spark job to put partitions to the service
rdd.mapPartitions { it =>
   val allocator = it.allocator.newChildAllocator(
       "SparkFlightConnector", 0, Long.MaxValue)
   val client = new FlightClient(allocator, new Location(host, port))
   val desc = FlightDescriptor.path(descriptor)
   val stream = client.startPut(desc, it.root)
   // Use VectorSchemaRootIterator to convert Rows -> Vectors
   it.foreach { root =>
     // doPut on the populated VectorSchemaRoot
     stream.putNext()
   }
   stream.completed()
   stream.getResult
   client.close()
   Iterator.empty
 }.count()
