Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney

1
©
Cloudera,
Inc.
All
rights
reserved.

Ibis:
Scaling
Python
Analy=cs

on
Hadoop
and
Impala

Wes
McKinney,
SF
Data
Mining
Meetup
2015-‐10-‐22

@wesmckinn

2
©
Cloudera,
Inc.
All
rights
reserved.

Me

•  R&D
at
Cloudera

•  Serial
creator
of
structured
data
tools
/
user
interfaces

•  Mathema=cian
—
MIT
‘07

•  “Professional
SQL
programmer”
2007-‐2010
(@
AQR)

•  Created
pandas
(Python
library)
in
2008

•  Wrote
bestseller
Python
for
Data
Analysis
2012

•  Founder
of
DataPad

3
©
Cloudera,
Inc.
All
rights
reserved.

Python
is
popular…

•  Python
has
become
a
standard
language
of
data
science

•  Why
is
it
popular?

• Maximizes
produc=vity
for
data
engineers
and
data
scien=sts

• Build
robust
sobware
and
do
interac=ve
data
analysis
with
100%
Python
code

• Easy-‐to-‐learn
and
makes
happy
and
produc=ve
data
teams

• Large,
diverse
open
source
development
community

• Comprehensive
libraries:
data
wrangling,
ML,
visualiza=on,
etc.

•  Main
use
case:
data
science
&
engineering
swiss
army
knife
on
small-‐to-‐medium

size
data

4
©
Cloudera,
Inc.
All
rights
reserved.

…but
Python
does
not
scale
today

•  Python
ecosystem
conﬁned
to
single-‐node
analysis

• Great
for
smaller
data
sets

• Requires
sampling
or
aggrega=ons
for
larger
data

• Distributed
tools
compromise
in
various
ways

•  Extrac=ng
samples
or
aggrega=ons
for
larger
data
means:

• “Scales”
by
losing
more
ﬁdelity

• Addi=onal
ETL
overhead
to
extract
samples/aggrega=ons

• Loss
of
produc=vity
with
mul=ple
languages,
tools,
etc

• Blocks
certain
analysis
and
use
cases

5
©
Cloudera,
Inc.
All
rights
reserved.

Industry
Analy=cs
Scien=ﬁc
Compu=ng

Heterogeneous
data

Flat
tables
and
JSON

Spark
/
MapReduce

SQL

DFS-‐friendly
/
streaming
data
formats

More
physical
machines

Homogeneous
data

Mul=dimensional
arrays

HPC
tools

Linear
algebra

Scien=ﬁc
data
formats

Fewer
physical
machines

Some
simplis=c
generaliza=ons

6
©
Cloudera,
Inc.
All
rights
reserved.

Industry
Analy=cs
Scien=ﬁc
Compu=ng

Heterogeneous
data

Flat
tables
and
JSON

Spark
/
MapReduce

SQL

DFS-‐friendly
/
streaming
data
formats

More
physical
machines

Homogeneous
data

Mul=dimensional
arrays

HPC
tools

Linear
algebra

Scien=ﬁc
data
formats
(e.g.
HDF5)

Fewer
physical
machines

Some
simplis=c
generaliza=ons

Python:
heavy
investment,

generally

Python:
light
investment,

generally

7
©
Cloudera,
Inc.
All
rights
reserved.

pandas

•  Hugely
popular
Python
table
/
“data
frame”
library

• Labeled
table,
array,
and
=me
series
data
structures

•  Popular
for
data
prepara=on,
ETL,
and
in-‐memory
analy=cs

•  Built
using
Python’s
scien=fic
compu=ng
stack

• User
API
/
domain
specific
language

• Bespoke
in-‐memory
analy=cs
/
rela=onal
algebra
engine

• IO
interfaces
(CSV,
SQL,
etc.)

• Expanded
data
type
system
(beyond
NumPy)

•  Supports
flat
data
only
(or
semi-‐structured
data
that
can
be
flaqened)

8
©
Cloudera,
Inc.
All
rights
reserved.

Many
SQL
engines

…
and
more

9
©
Cloudera,
Inc.
All
rights
reserved.

The
“Great
Decoupling”
for
Big
Data

UI
Ibis, SQL, Spark API, …
Compute
Analytic SQL, Spark, MapReduce
Storage
HDFS, Kudu, HBase

10
©
Cloudera,
Inc.
All
rights
reserved.

A
sample
big
data
architecture

Kafka
Kafka
Kafka
Kafka
Application data
HDFS
JSON Spark/MapReduce
Columnar
storage
Analytic SQL Engine
User
SQL

11
©
Cloudera,
Inc.
All
rights
reserved.

Nested
/
Complex
types
support

•  Arrays,
structs,
maps,
and
unions
as
ﬁrst-‐class
value
types

•  Analyze
JSON-‐like
data
directly
without
ﬂaqening
or
normaliza=on

•  Most
new
SQL
engines
have
some
level
of
support

• Impala

• Presto

• Drill

• BigQuery

• Spark
SQL

• Hive

• …

12
©
Cloudera,
Inc.
All
rights
reserved.

Ibis
in
a
nutshell

•  For
Python
programmers
doing
analy=cs
in
industry

•  Project
Blog:
hqp://blog.ibis-‐project.org

•  Joint
project
with
Impala
team
@
Cloudera

•  Apache-‐licensed,
open
source
hqp://github.com/cloudera/ibis

•  Crabing
a
compelling
Python-‐on-‐Hadoop
user
experience

• Remove
SQL
coding
from
user
workﬂows

• Develop
high
performance
Python
extension
APIs

13
©
Cloudera,
Inc.
All
rights
reserved.

Ibis
in
a
nutshell,
cont’d

•  Composable
Python
DSL
(“Ibis
expressions”)
makes
hand-‐coding
SQL
SELECT

statements
unnecessary

•  Ibis
for
SQL
Programmers:
hqp://docs.ibis-‐project.org/sql.html

•  Development
roadmap
targets
Impala
(C++
/
LLVM)
query
engine

• …
but
SQL
compiler
toolchain
is
general
purpose

•  Current
supports
Impala
and
SQLite,
but
soon
other
dialects

• We
welcome
external
contributors
for
other
Analy=c
SQL
engines

14
©
Cloudera,
Inc.
All
rights
reserved.

15
©
Cloudera,
Inc.
All
rights
reserved.

Benefits
of
Ibis

•  Maximize
developer
produc=vity

• Mirrors
single-‐node
Python
experience

• Solve
big
data
problems
without
leaving
Python

• Leverage
Python
skills,
ecosystem,
and
tools

•  Python
as
first-‐class
language
for
Hadoop

• Full-‐fidelity
analysis
without
extrac=ons

• Python
analysis
at
any
scale

• Na=ve
hardware
speeds
for
a
broad
set
of
use
cases

16
©
Cloudera,
Inc.
All
rights
reserved.

Brief
interac=ve
demo

17
©
Cloudera,
Inc.
All
rights
reserved.

Ibis/Impala
Joint
Roadmap

•  More
natural
data
modeling

• Complex
types
support

•  Integra=on
with
full
Python
data
ecosystem

• Advanced
analy=cs
+
machine
learning

• Enable
use
of
performance
compu=ng
tools

•  User
extensibility
with
na=ve
performance

• In-‐memory
columnar
format

• Python-‐to-‐LLVM
IR
compila=on

•  Workﬂow
and
usability
tools

18
©
Cloudera,
Inc.
All
rights
reserved.

Execu=ng
data
science
languages
in
the
compute
layer

UI
Ibis, SQL, Spark API, …
Compute
Analytic SQL, Spark, MapReduce
Storage
HDFS, Kudu, HBase
Python,
R, Julia, …?

19
©
Cloudera,
Inc.
All
rights
reserved.

Enabling
interoperability
with
big
data
systems

•  Distributed
/
MPP
query
engines:
implemented
in
a
host
language

• Typically
C/C++
or
Java/Scala

•  User-‐deﬁned
func=ons
(UDFs)
through
various
means

• Implement
in
host
language

• Implement
in
user
language
through
some
external
language
protocol
(oben

RPC-‐based)

•  External
UDFs
are
usually
very
slow
(cf:
PL/Python,
PySpark,
etc.)

20
©
Cloudera,
Inc.
All
rights
reserved.

What
are
UDFs
good
for?

•  Note:
industry
data
scien=sts
have
libraries
containing
100s
of
UDFs
for
Hive
or

other
distributed
query
engines

•  Custom
data
transforma=ons

•  Custom
domain
logic
(date
/
=me
/
data
types)

•  Custom
data
types

•  Custom
aggrega=ons
(incl.
machine
learning
/
sta=s=cs
expressible
as
reduc=ons)

24
©
Cloudera,
Inc.
All
rights
reserved.

How
to
make
them
fast?

•  Common
run=me
memory
representa=on
for
tabular
data

•  Share-‐memory
(zero-‐copy
or
memcpy-‐only)
external
UDF
protocol

•  Vectorized
UDF
interface
(for
interpreted
languages)

•  Impala
is
uniquely
posi=oned
to
play
well
with
Ibis

• Best-‐in-‐class
performance
and
scalability

• C++
and
LLVM-‐based
(JIT
compiler)
run=me

• Uniﬁed,
eﬃcient
data
interchange
amongst
Ibis,
Impala,
and
Kudu
will
enable

high
performance
real
=me
analy=cs
from
Python

25
©
Cloudera,
Inc.
All
rights
reserved.

Memory
representa=on

•  Many
query
engines
are
standardizing
on
in-‐memory
columnar
rep’n
of

materialized
transient
data

• Impala:

hqp://blog.cloudera.com/blog/2015/07/whats-‐next-‐for-‐impala-‐more-‐
reliability-‐usability-‐and-‐performance-‐at-‐even-‐greater-‐scale/

• Apache
Drill:
hqps://drill.apache.org/faq/

•  Industry-‐standard
serializa=on
format:
Apache
Parquet

• hqps://parquet.apache.org/

26
©
Cloudera,
Inc.
All
rights
reserved.

Serializa=on
vs
In-‐memory

•  Serializa=on
formats
(e.g.
Parquet)

• Op=mize
for
IO
/
DFS
throughput
at
expense
of
CPU/memory
bus
throughput

• Do
not
consider
random
access
or
in-‐memory
analy=cs
as
a
goal

•  No
standardized
in-‐memory
containers
for
materialized
data
from
ﬁle
/
RPC

protocols
(Parquet,
Thrib,
protobuf,
Avro,
etc.)

27
©
Cloudera,
Inc.
All
rights
reserved.

Standardized
in-‐memory
columnar
(IMC)

•  Compact
in-‐memory
representa=on
for
semistructured
data

•  Part
of
Impala’s
upcoming
dev
roadmap

•  Some
prior
IMC-‐for-‐SQL
work:
Apache
Drill

•  Standardized
memory
representa=on
means
data
can
be
shared
without

serializa=on

•  Create
a
canonical
C/C++
implementa=on
for
use
in
Python
/
R
/
Julia

28
©
Cloudera,
Inc.
All
rights
reserved.

Ibis’s
Vision

•  Uncompromised
Python
experience

• 100%
Python
end-‐to-‐end
user
workﬂows

• Enable
integra=on
with
the
exis=ng
Python
data
ecosystem
(pandas,
scikit-‐
learn,
NumPy,
etc)

•  Interac=ve
at
big
data
scale

• Full-‐ﬁdelity
analysis
without
extrac=ons

• Scalability
for
big
data

• Na=ve
hardware
speeds
for
a
broad
set
of
use
cases

Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney (20)

More from Hakka Labs (20)

Recently uploaded (20)

Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney