Next-generation Python Big Data Tools, powered by Apache Arrow

© Cloudera, Inc. All rights reserved.

Wes McKinney (@wesmckinn)
SF Big Analytics Meetup, 2016-04-05

Me

•  Data Science Tools at Cloudera, formerly DataPad CEO/founder
•  Serial creator of structured data tools / user interfaces
•  Wrote bestseller Python for Data Analysis (2012)
•  Open source projects
    • Python {pandas, Ibis, statsmodels}
    • Apache {Arrow, Parquet, Kudu (incubating)}
•  Mostly work in Python and Cython/C/C++

In process: Python for Data Analysis: 2nd Edition

Coming late 2016 / early 2017

Python + Big Data: The State of things

•  See “Python and Apache Hadoop: A State of the Union” from February 17
•  Areas where much more work needed
    • Binary file format read/write support (e.g. Parquet files)
    • File system libraries (HDFS, S3, etc.)
    • Client drivers (Spark, Hive, Impala, Kudu)
    • Compute system integration (Spark, Impala, etc.)

Apache Arrow

Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow

Arrow in a Slide

•  New Top-level Apache Software Foundation project
•  Announced Feb 17, 2016
•  Focused on Columnar In-Memory Analytics
    1.  10-100x speedup on many workloads
    2.  Common data layer enables companies to choose best of breed systems
    3.  Designed to work with any programming language
    4.  Support for both relational and complex data as-is
•  Developers from 13+ major open source projects involved
•  A significant % of the world’s data will be processed through Arrow!

Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R

Apache Arrow: What is it?

•  http://arrow.apache.org
•  Not a piece of software, exactly!
•  A standardized in-memory representation for columnar data
•  Enables
    • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
    • Cheap data interchange amongst systems, little or no serialization
    • Flexible support for complex JSON-like data
•  Targets: Impala, Kudu, Parquet, Spark
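
One way to picture the "complex JSON-like data" point is a column of variable-length lists stored as one flat values buffer plus an offsets buffer. The sketch below uses only the Python standard library. The helper names `encode_list_column` and `get_list` are hypothetical, and the layout is a simplified illustration (no null handling), not Arrow's actual implementation.

```python
from array import array


def encode_list_column(rows):
    """Flatten a column of integer lists into (offsets, values) buffers."""
    offsets = array("i", [0])   # offsets[i]..offsets[i + 1] delimit row i
    values = array("q")         # all list elements, stored contiguously
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return offsets, values


def get_list(offsets, values, i):
    """Return row i by slicing the flat values buffer."""
    return list(values[offsets[i]:offsets[i + 1]])


# Hypothetical JSON-like column: each row is a list of integers.
rows = [[1, 2, 3], [], [4, 5]]
offsets, values = encode_list_column(rows)

print(list(offsets))                 # [0, 3, 3, 5]
print(list(values))                  # [1, 2, 3, 4, 5]
print(get_list(offsets, values, 2))  # [4, 5]
```

Nested data stays in flat, contiguous buffers, so a scan over all elements never has to chase per-row objects.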

Focus on CPU Efficiency

session_id    timestamp          source_ip
1331246660    3/8/2012 2:44PM    99.155.155.225
1331246351    3/8/2012 2:38PM    65.87.165.114
1331244570    3/8/2012 2:09PM    71.10.106.181
1331261196    3/8/2012 6:46PM    76.102.156.138

[Figure: the same four rows laid out row by row (Row 1 to Row 4) in a Traditional Memory Buffer, and column by column (session_id, timestamp, source_ip) in an Arrow Memory Buffer.]

•  Cache Locality
•  Super-scalar & vectorized operation
•  Minimal Structure Overhead
•  Constant value access
•  With minimal structure overhead
•  Operate directly on columnar compressed data
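
The difference between the two buffers can be sketched with the standard library alone. This is an illustration of row-wise versus column-wise layout using the sample values from the slide, not Arrow code. The variable names are assumptions, and Python lists of strings stand in for what would be packed string buffers in a real columnar format.

```python
from array import array

# The four sample rows from the slide: (session_id, timestamp, source_ip).
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Row-oriented ("traditional") access: every tuple is visited, and the
# unneeded fields sit next to the one value we want.
max_row_wise = max(row[0] for row in rows)

# Column-oriented layout: each field lives in its own sequence.
session_id = array("q", (row[0] for row in rows))  # packed 64-bit ints
timestamp = [row[1] for row in rows]
source_ip = [row[2] for row in rows]

# A scan over one column touches only that column's contiguous memory.
max_column_wise = max(session_id)

print(max_row_wise == max_column_wise)        # True
print(session_id.itemsize * len(session_id))  # 32 bytes for the whole column
```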

High Performance Sharing & Interchange

Today
•  Each system has its own internal memory format
•  70-80% CPU wasted on serialization and deserialization
•  Similar functionality implemented in multiple projects

With Arrow
•  All systems utilize the same memory format
•  No overhead for cross-system communication
•  Projects can share functionality (e.g., Parquet-to-Arrow reader)

[Figure: today, Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark each "Copy & Convert" data to exchange it; with Arrow, the same systems share a common "Arrow Memory" layer.]
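
The "Copy & Convert" versus shared-memory contrast can be imitated in a single process. In this sketch the two "systems" are just functions with hypothetical names, JSON stands in for an arbitrary interchange format, and a `memoryview` stands in for a shared in-memory representation. It illustrates the idea only and does not use Arrow.

```python
import json
from array import array

column = array("q", range(5))  # producer's in-memory column of 64-bit ints


def consume_via_copy(col):
    """'Today': serialize to an interchange format, then parse it back."""
    wire = json.dumps(col.tolist())      # convert + copy
    return array("q", json.loads(wire))  # parse + copy again


def consume_via_shared_buffer(col):
    """'With Arrow' (in spirit): both sides read the same bytes."""
    return memoryview(col)               # no copy, no conversion


copied = consume_via_copy(column)
shared = consume_via_shared_buffer(column)

column[0] = 99  # the producer updates its buffer in place

print(copied[0])  # 0: the copy is a separate, now stale buffer
print(shared[0])  # 99: the view reads the producer's memory directly
```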

Big Data Systems: Poor Python IO performance

http://wesmckinney.com/blog/pandas-and-apache-arrow/

Real World Example: Feather File Format for Python and R

•  Problem: fast, language-agnostic binary data frame file format
•  Written by Wes McKinney (Python) and Hadley Wickham (R)
•  Read speeds close to disk IO performance

[Figure: a Feather file consists of Arrow array 0, Arrow array 1, …, Arrow array n (Apache Arrow memory), followed by Feather metadata (Google flatbuffers).]
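
The layout in the figure (column buffers followed by metadata) can be mimicked with a toy file format. This is not the Feather format. JSON stands in for the flatbuffers metadata, every column is assumed to hold 64-bit integers, and `write_toy_file` and `read_toy_column` are hypothetical names used only for illustration.

```python
import io
import json
import struct
from array import array


def write_toy_file(columns):
    """Write column buffers back to back, then a metadata footer."""
    out = io.BytesIO()
    meta = []
    for name, col in columns.items():
        meta.append({"name": name, "offset": out.tell(), "length": len(col)})
        out.write(col.tobytes())
    footer = json.dumps(meta).encode("utf-8")
    out.write(footer)
    out.write(struct.pack("<I", len(footer)))  # footer size goes last
    return out.getvalue()


def read_toy_column(data, name):
    """Locate one column through the footer and read only its bytes."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    meta = json.loads(data[-4 - footer_len:-4].decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    col = array("q")
    start = entry["offset"]
    col.frombytes(data[start:start + 8 * entry["length"]])
    return col


data = write_toy_file({"a": array("q", [1, 2, 3]), "b": array("q", [7, 8])})
print(list(read_toy_column(data, "b")))  # [7, 8]
```

Because the metadata records where each column starts, a reader can pull a single column without parsing the rest of the file.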

Real World Example: Feather File Format for Python and R

R

    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)

Python

    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)

Apache Parquet: Binary columnar storage format

•  I just became a Parquet committer!
•  github.com/apache/parquet-cpp
•  Python users will soon be able to read Parquet files via PyArrow
•  parquet-cpp <-> PyArrow <-> pandas

Language Bindings

•  Target Languages
    • Java (beta)
    • CPP (underway)
    • Python & Pandas (underway)
    • R
    • Julia
•  Initial Focus
    • Read a structure
    • Write a structure
    • Manage Memory

pandas and Arrow in context

RPC & IPC: Moving Data Between Systems

RPC
•  Avoid Serialization & Deserialization
•  Layer TBD: Focused on supporting vectored IO
    • Scatter/gather reads/writes against socket

IPC
•  Alpha implementation using memory mapped files
    • Moving data between Python and Drill
•  Working on shared allocation approach
    • Shared reference counting and well-defined ownership semantics
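
The memory-mapped-file idea can be sketched with the standard library: one side writes a column as raw bytes, the other maps the file and reads typed values in place. This is a rough illustration, not the alpha implementation mentioned above. The file name is hypothetical, both sides run in one process here, and both are assumed to use the same native byte order.

```python
import mmap
import os
import tempfile
from array import array

# Hypothetical file name; any path both processes can open would do.
path = os.path.join(tempfile.mkdtemp(), "column.bin")

# "Writer" side: dump a column of 64-bit integers as raw bytes.
with open(path, "wb") as f:
    array("q", [10, 20, 30, 40]).tofile(f)

# "Reader" side: map the file and reinterpret the bytes without parsing.
# The context managers release the views before the mapping is closed.
with open(path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped, \
        memoryview(mapped) as raw, \
        raw.cast("q") as view:
    total = sum(view)
    first, last = view[0], view[-1]

print(total)        # 100
print(first, last)  # 10 40
```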

Executing data science languages in the compute layer

UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase
Python, R, Julia, …?

Real World Example: Python With Spark, Drill, Impala

[Figure: a SQL Engine supplies in partition 0 … in partition n - 1; each is the input to a Python function (user-supplied Python code); each function's output becomes out partition 0 … out partition n - 1, which go back to the SQL Engine.]
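
The partition flow in the figure can be sketched as a plain function applied to each input partition. Everything here is hypothetical: `run_on_partitions` is a stand-in for the engine, `user_function` is example user code, and the partitions are small in-process arrays rather than data handed over by Spark, Drill, or Impala.

```python
from array import array


def user_function(values):
    """Hypothetical user-supplied code: operates on a whole column batch."""
    return array("d", (v * 2.0 for v in values))


def run_on_partitions(in_partitions, func):
    """Stand-in for the engine: apply func to each input partition."""
    return [func(partition) for partition in in_partitions]


# Hypothetical input: n = 3 partitions of one float64 column.
in_partitions = [
    array("d", [1.0, 2.0]),
    array("d", [3.0]),
    array("d", [4.0, 5.0, 6.0]),
]

out_partitions = run_on_partitions(in_partitions, user_function)

print([list(p) for p in out_partitions])
# [[2.0, 4.0], [6.0], [8.0, 10.0, 12.0]]
```

Passing whole column batches to the function, rather than one row at a time, is what lets a shared columnar format cut the per-value overhead of user-defined functions.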

What’s Next

•  Parquet for Python & C++
    • Using Arrow as intermediary
•  Available IPC Implementation
•  Spark, Drill Integration
    • Faster UDFs, Storage interfaces
