Make Your
PySpark Data Fly
with Apache Arrow!
Bryan Cutler
Software Engineer
@BryanCutler
DBG / May 2, 2018 / © 2019 IBM Corporation
About Bryan
@BryanCutler on Github
Software Engineer, IBM
Center for Open-Source Data & AI Technologies
(CODAIT)
Big Data Machine Learning & AI
Apache Spark committer
Apache Arrow committer
TensorFlow I/O maintainer
Center for Open Source
Data and AI Technologies
CODAIT
codait.org
CODAIT aims to make AI solutions
dramatically easier to create,
deploy, and manage in the
enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
Improving Enterprise AI Lifecycle in Open
Source
Agenda
Overview of Apache Arrow
Intro to Arrow Flight
How to talk Arrow
Flight in Action
Apache Arrow Overview
About Arrow
Apache Arrow
Standard format for in-memory columnar data
● Implementations in many languages and growing
● Built for efficient analytic operations on modern hardware
Has built-in primitives for basic exchange of Arrow data
● Zero-copy data sharing within a process
● IPC with Arrow record batch messages
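The zero-copy idea can be sketched with only the Python standard library: model each column as one contiguous typed buffer (as Arrow does) and take slices as views rather than copies. This is a toy analogy of the layout, not the pyarrow API.

```python
# Toy sketch of Arrow's layout: one contiguous, typed buffer per column,
# with zero-copy slicing via memoryview. (Illustrative only -- real Arrow
# buffers also carry validity bitmaps, metadata, and an IPC format.)
from array import array

# A "record batch": a set of equal-length column buffers.
batch = {
    "id": array("q", [1, 2, 3, 4]),             # int64-like column
    "score": array("d", [0.1, 0.2, 0.3, 0.4]),  # float64-like column
}

# memoryview exposes the underlying buffer without copying it.
ids = memoryview(batch["id"])
head = ids[:2]  # a view into the same memory, not a new buffer

print(head.tolist())            # [1, 2]
print(head.obj is batch["id"])  # True: still backed by the original buffer
```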
Why use Arrow
Apache Arrow
Arrow brings many benefits:
● Common standard with cross-language support
● Better interoperability between frameworks
● Avoids costly data serialization
Who is using Arrow
Apache Arrow
The Apache® Software Foundation Announces Apache Arrow™ Momentum
● Adopted by dozens of open-source and commercial technologies
● Exceeded 1,000,000 monthly downloads within its first three years as an Apache Top-Level Project
● Used by Apache Spark, NVIDIA RAPIDS, pandas, and Dremio, among others
https://arrow.apache.org/powered_by
Source: https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces46
Arrow Flight
Introduction
Arrow Flight
Arrow Flight is an Arrow-native RPC framework
Defines a standard protocol for data exchange
Makes it easy to efficiently move data around a network by providing [1]:
● Arrow Data as a Service
● Batch Streams
● Stream Management
Arrow Data as a Service
Arrow Flight
Extensible data service
● Clients get/put Arrow data
● List available data
● Custom actions
● Can think of it as ODBC for in-memory data
Stream Batching
Arrow Flight
An Arrow stream is a schema + record batches
A Flight is composed of multiple streams
● Streams can come from different endpoints
● Transfer data in bulk for efficiency
● Location info can be used to improve data locality
Flight = Stream 1 [Record Batch, Record Batch] + Stream 2 [Record Batch, Record Batch]
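A minimal pure-Python model of this structure (names here are illustrative, not the Flight API): a stream pairs a schema with a list of record batches, and a Flight is just a collection of streams whose batches can be consumed in bulk.

```python
# Toy model of "Flight = multiple streams of record batches".
# The schema/stream/flight shapes below are illustrative, not Arrow APIs.
schema = ("id", "score")

# Each record batch: a tuple of equal-length columns matching the schema.
stream1 = {"schema": schema, "batches": [([1, 2], [0.1, 0.2]), ([3], [0.3])]}
stream2 = {"schema": schema, "batches": [([4, 5], [0.4, 0.5])]}

# A Flight is composed of multiple streams, possibly from different endpoints.
flight = [stream1, stream2]

# Consuming in bulk: total rows across all streams of the Flight.
total_rows = sum(len(b[0]) for s in flight for b in s["batches"])
print(total_rows)  # 5
```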
Stream Management
Arrow Flight
Service manages Flights for the clients
● FlightInfo gives a list of endpoints with the location of each stream in the Flight
● Streams are referenced by a ticket
– A ticket is an opaque struct that is unique for each stream
● Flight descriptors differentiate between Flights
– Can define how a Flight is composed
– Batch size, or even a SQL query
FlightDescriptor Types
Arrow Flight
Simple path-like:
 "datasets/cats-dogs/training"
Custom proto:
 message MyDescriptor {
   string sql_query = 1;
   int32 records_per_batch = 2;
 }
 message MyTicket {
   MyDescriptor desc = 1;
   string uuid = 2;
 }
Ticket Sequence for Consumer
Flight Example
To consume an entire Flight:
● Get FlightInfo for the list of endpoints with tickets
● For each endpoint:
– Use the ticket to get the endpoint's stream
– Process each RecordBatch in the stream
Sequence: Consumer → Flight Service: GetFlightInfo(FlightDescriptor) returns FlightInfo; for each endpoint, GetStream(Ticket) returns a RecordBatch stream; for each batch in the stream, get next and process the batch
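The ticket sequence above can be sketched with a stub in-memory service (all names hypothetical, mirroring only the protocol's shape): GetFlightInfo returns endpoints carrying opaque tickets, and each ticket is redeemed for that endpoint's stream of batches.

```python
# Stub Flight service to illustrate the consumer sequence.
# This mirrors the protocol shape only; it is not the real Flight API.
class StubFlightService:
    def __init__(self):
        # ticket -> list of record batches (each batch is a list of rows)
        self._streams = {
            b"ticket-0": [[("a", 1)], [("b", 2)]],
            b"ticket-1": [[("c", 3)]],
        }

    def get_flight_info(self, descriptor):
        # FlightInfo: one endpoint per stream, each referenced by a ticket.
        return [{"ticket": t} for t in self._streams]

    def get_stream(self, ticket):
        # Yield record batches one at a time, as a consumer would see them.
        yield from self._streams[ticket]

service = StubFlightService()
rows = []
# Consume the entire Flight: FlightInfo -> endpoints -> stream per ticket.
for endpoint in service.get_flight_info("my-descriptor"):
    for batch in service.get_stream(endpoint["ticket"]):
        rows.extend(batch)  # "process each RecordBatch"

print(len(rows))  # 3
```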
Benefits
Arrow Flight
● Applications use the client interface and exchange standard record batches
● Complex communication handled internally
● Efficient: uses batches and minimal copies
● Standardized protocol
– Authentication
– Support for different transports
– Able to handle backpressure
Current Status
Arrow Flight
Common protocol defined using protocol buffers
Prototype implementations in Java, C++, Python
Still experimental, but lots of work is being done to make it production-ready
How to Talk Arrow
Arrow Flight
If a system wants to exchange Arrow Flight data, it needs to be able to produce/consume an Arrow stream
● Spark kind of does already, but it's not externalized
● See SPARK-24579 and SPARK-26413
● Can build a Scala Flight connector with a little hacking
How to Talk Arrow
Arrow Flight
TensorFlow I/O has Arrow Datasets
● Maintained by the SIG-IO community
– Also many other inputs to TF
– Many sources from legacy contrib/
● Several Arrow datasets
– ArrowStreamDataset used here
● Input ops only for now
● Install: pip install tensorflow-io
Check it out at https://github.com/tensorflow/io
Flight in Action:
Spark to TensorFlow
Define the Service
Flight Example
Simple Service backed by an in-memory data store
● Keeps streams in memory
● Flight descriptor is a string id
● This is from the Java Flight examples
Make the Clients
Flight Example
PySpark will put Arrow data
● Map partition op of DataFrame to Arrow
● Each partition sent as a stream of batches
– A ticket is roughly the partition index
TensorFlow Dataset will get Arrow data
● Requests the entire Flight, which is multiple streams
● Gets one batch at a time to process
● Op outputs tensors
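The put side can be sketched like this (a dict-backed stand-in for the Flight service; names are hypothetical, the real connector uses Flight's doPut): each DataFrame partition becomes one stream, and its ticket is derived from the partition index.

```python
# Sketch of the put side: one stream per partition, ticketed by index.
# The dict-backed "store" is a stand-in for the real Flight service.
store = {}

def put_partition(descriptor, partition_index, batches):
    # A ticket is roughly the partition index, scoped by the descriptor.
    ticket = f"{descriptor}/{partition_index}"
    store[ticket] = list(batches)
    return ticket

# Two partitions of a DataFrame, each already split into record batches.
partitions = [
    [[1, 2], [3]],   # partition 0: two batches
    [[4, 5, 6]],     # partition 1: one batch
]
tickets = [put_partition("rad-spark", i, p) for i, p in enumerate(partitions)]

print(tickets)  # ['rad-spark/0', 'rad-spark/1']
```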
Data Flow
Flight Example
Spark Worker puts Stream 1 [Record Batch, Record Batch] and Stream 2 [Record Batch, Record Batch] to the Flight Service
TensorFlow gets the Flight (= Stream 1 + Stream 2) and processes the batches one at a time
Walkthrough
Flight Example
Application code is simple
– Only a few lines
– Focus on working with data
– No need to worry about conversion, file formats, or networking
The example is in Python, but the data never needs to go through Python!
Worker JVM → Flight Service → TF C++
""" PySpark Client
"""
# Spark job to put partitions to service
SparkFlightConnector.put(
    df,           # Existing DataFrame
    host, port,   # Flight Service ip
    'rad-spark'   # Data descriptor
)

""" TensorFlow Client
"""
# Arrow tf.data.Dataset gets Flight data
dataset = ArrowFlightDataset.from_schema(
    host, port,   # Flight Service ip
    'rad-spark',  # Data descriptor
    to_arrow_schema(df.schema)  # Schema
)
# Iterate over Flight data as tensors
it = dataset.make_one_shot_iterator()
Recap
Arrow Flight
Apache Arrow – standard for in-memory data
Arrow Flight – efficiently move data around a network
● Arrow data as a service
● Stream batching
● Stream management
Simple example with PySpark + TensorFlow
● Data transfer never goes through Python
Links & References
Apache Arrow and Flight specification
https://arrow.apache.org/
https://github.com/apache/arrow/blob/master/format/Flight.proto
TensorFlow I/O
https://github.com/tensorflow/io
Related Spark JIRAs
SPARK-24579
SPARK-26413
Example Code
https://github.com/BryanCutler/SparkArrowFlight
References: Flight Overview by Arrow PMC Jacques Nadeau
[1] https://www.slideshare.net/JacquesNadeau5/apache-arrow-flight-overview
Thank you!
codait.org
http://github.com/BryanCutler
developer.ibm.com/code
FfDL
Sign up for IBM Cloud and try Watson Studio!
https://ibm.biz/BdZgcx
https://datascience.ibm.com/
MAX
Backup Slides
Spark Client
Code
Map partitions to RecordBatches
Add partition batches into a stream
Put stream to service

// Spark job to put partitions to the service
rdd.mapPartitions { it =>
   val allocator = it.allocator.newChildAllocator(
       "SparkFlightConnector", 0, Long.MaxValue)
   val client = new FlightClient(allocator, new Location(host, port))
   val desc = FlightDescriptor.path(descriptor)
   val stream = client.startPut(desc, it.root)
   // Use VectorSchemaRootIterator to convert Rows -> Vectors
   it.foreach { root =>
     // doPut on the populated VectorSchemaRoot
     stream.putNext()
   }
   stream.completed()
   stream.getResult
   client.close()
   Iterator.empty
 }.count()
