How Apache Arrow and Parquet boost cross-language interoperability

Uwe L. Korn
PyData Paris 14th June 2016

About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• We optimize Replenishment and Pricing for the Retail
industry with Predictive Analytics
• Contributor to Apache {Arrow, Parquet}
• Work in Python, Cython, C++11 and SQL

Agenda
The Problem
Arrow
Parquet
Outlook

Why is columnar better?
Image source: https://arrow.apache.org/img/simd.png (https://arrow.apache.org/)
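
A minimal sketch of the two layouts, using only the Python standard library. The field names and sample values are illustrative and not from the talk. Pure Python does not use SIMD, so this only shows the data layout that makes vectorised scans possible.

```python
from array import array

# Row-wise layout: each record keeps all of its fields together.
rows = [
    (1, 9.99, 3),
    (2, 4.50, 7),
    (3, 12.00, 1),
]

# Columnar layout: each field is one contiguous, typed buffer.
columns = {
    "item_id": array("i", [r[0] for r in rows]),
    "price": array("d", [r[1] for r in rows]),
    "quantity": array("i", [r[2] for r in rows]),
}

# Aggregating one field row-wise has to visit every full record ...
total_row_wise = sum(r[2] for r in rows)

# ... while the columnar version scans a single contiguous buffer.
total_columnar = sum(columns["quantity"])
```

Both totals are equal; the difference is that the columnar scan never touches `item_id` or `price`.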

Different Systems - Varying Python Support
• Various levels of Python Support
• Built in Python
• Python API
• No Python at all
• Each tool/algorithm works on
columnar data
• Separate conversion routines for
each pair
• causes overhead
• there’s no one-size-fits-all solution
Image source: https://arrow.apache.org/img/copy2.png (https://arrow.apache.org/)

Apache Arrow
• Specification for in-memory
columnar data layout
• No overhead for cross-system /
cross-language communication
• Designed for efficiency (exploit
SIMD, cache locality, ..)
• Supports nested data structures
Image source: https://arrow.apache.org/img/shared2.png (https://arrow.apache.org/)
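
A rough sketch of how a nullable integer column can be laid out in the Arrow style: a validity bitmap (one bit per slot, least-significant bit first, set bit = value present) plus a contiguous data buffer. The helper names are hypothetical and this is plain Python, not the Arrow library; the layout spec is the authoritative description.

```python
from array import array


def build_int_array(values):
    """Return (validity_bitmap, data) for a list of ints and None."""
    bitmap = bytearray((len(values) + 7) // 8)
    data = array("i")  # C int, 4 bytes on common platforms
    for i, value in enumerate(values):
        if value is None:
            data.append(0)  # slot stays allocated; its content is ignored
        else:
            bitmap[i // 8] |= 1 << (i % 8)  # mark slot i as valid
            data.append(value)
    return bytes(bitmap), data


def is_valid(bitmap, i):
    """Check the validity bit of slot i."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


bitmap, data = build_int_array([1, None, 3, 4])
```

Because both buffers are plain contiguous memory, any language that agrees on this layout can read them without a conversion step.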

Apache Arrow - The Impact
• An example: Retrieve a dataset from an MPP database
and analyze it in Pandas
• Run a query in the DB
• Pass it in columnar form to the DB driver
• The ODBC layer transforms it into row-wise form
• Pandas makes it columnar again
• Ugly real-life solution: export as CSV, bypass ODBC
• In future: Use Arrow as interface between the DB and
Pandas
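
The round trip described above can be sketched in a few lines of plain Python. The column names and values are made up for illustration; a real driver and Pandas do the same two transpositions on far larger data, which is where the overhead comes from.

```python
# Columnar result as the database holds it (sample values).
db_columns = {"store": ["A", "B", "C"], "sales": [10, 20, 30]}

# Step 1: the driver layer turns the columns into row tuples.
names = list(db_columns)
rows = list(zip(*(db_columns[name] for name in names)))

# Step 2: the consumer rebuilds columns from the rows.
rebuilt = {name: [row[i] for row in rows] for i, name in enumerate(names)}
```

`rebuilt` ends up identical to `db_columns`, so both conversions were pure overhead. A shared columnar format would let the consumer use the original buffers directly.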

Apache Arrow
• Top-level Apache project from the beginning
• Not only a specification: also includes C++ / Java /
Python / .. code.
• Arrow structures / classes
• RPC (upcoming) & IPC (alpha) support
• Conversion code for Parquet, Pandas, ..
• Combined effort from developers of over 13 major OSS
projects
• Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, ..
• Spec: https://github.com/apache/arrow/blob/master/format/Layout.md

Arrow in Action: Feather
• Language-agnostic file format for
binary data frame storage
• Read performance close to raw
disk I/O
• by Wes McKinney (Python) and
Hadley Wickham (R)
• Julia Support in progress
Diagram labels: Arrow Arrays; Feather Metadata (flatbuffers)

Apache Parquet
• Binary file format for nested columnar data
• Inspired by the Google Dremel paper
• space and query efficient
• multiple encodings
• predicate pushdown
• column-wise compression
• many tools use Parquet as the default input format
• very popular in the JVM/Hadoop-based world

The Basics
• 1 File, includes metadata
• Several row groups
• all with the same number of column chunks
• n pages per column chunk
• Benefits:
• pre-partitioned for fast distributed access
• statistics in the metadata for predicate pushdown
Blogpost by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Diagram: File > Row Group > Column Chunk > Page
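
How statistics in the metadata enable predicate pushdown can be sketched with a toy model: each row group stores the min and max of a column, and a reader skips any group whose statistics rule out a match. The dictionary layout and function names below are assumptions for illustration, not the Parquet metadata format.

```python
def make_row_groups(values, rows_per_group):
    """Split one column into row groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and the number of groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove no row can match: skip the group
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


groups = make_row_groups([1, 2, 3, 10, 11, 12, 20, 21, 22], rows_per_group=3)
matches, groups_read = scan_greater_than(groups, 15)
```

Only one of the three row groups is read here; the other two are rejected from their statistics alone.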

Using Parquet in Python
• You can use it already today with Python:
• sqlContext.read.parquet("..").toPandas()
• Needs to pass through Spark, very slow
• Native Python support on its way:
• Parquet I/O to Arrow
• Arrow provides NumPy conversion

State of Arrow & Parquet
Arrow
in-memory spec for columnar data
• Java (beta)
• C++ (in progress)
• Python (in progress)
• Planned:
• Julia
• R
Parquet
columnar on-disk storage
• Java (mature)
• C++ (in progress)
• Python (in progress)
• Planned:
• Julia
• R

Upcoming
• Parquet <-Arrow-> Pandas
• IPC on its way
• alpha implementation using memory mapped files
• JVM <-> native with shared reference counting
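
The idea behind IPC over memory-mapped files can be sketched with the standard library: one side writes a contiguous buffer to a file, the other maps it and reads the values in place. This is not Arrow's IPC implementation; the file name and the bare "raw ints" format are assumptions for illustration.

```python
import mmap
import os
import tempfile
from array import array

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")  # hypothetical file name

    # "Producer": write a contiguous buffer of C ints to a file.
    values = array("i", [10, 20, 30, 40])
    with open(path, "wb") as f:
        f.write(values.tobytes())

    # "Consumer": map the file and view the bytes as ints without copying.
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        raw = memoryview(mapped)
        view = raw.cast("i")
        total = sum(view)          # reads directly from the mapping
        received = view.tolist()   # copy out only for inspection
        view.release()             # release views before closing the mapping
        raw.release()
        mapped.close()
```

Both processes must agree on the byte layout for this to work, which is the role a shared in-memory specification plays.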

Get Involved!
• dev@arrow.apache.org & dev@parquet.apache.org
• https://apachearrowslackin.herokuapp.com/
• https://arrow.apache.org/
• https://parquet.apache.org/
• @ApacheArrow & @ApacheParquet
