HUG_Ireland_Apache_Arrow_Tomer_Shiran

DREMIO
Apache Arrow
A New Era of Columnar In-Memory Analytics
Tomer Shiran, Co-Founder & CEO at Dremio
tshiran@dremio.com | @tshiran
Hadoop Meetup Ireland 2016
April 12, 2016

DREMIO
Company Background
Jacques Nadeau
Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE);
aQuantive (MSFT)
Tomer Shiran
Founder & CEO
• MapR (VP Product); Microsoft; IBM
Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem
Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data
Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs• Stealth data analytics startup
• Founded in 2015
• Led by experts in Big Data and open source

DREMIO
Arrow in a Slide
• New Top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best of breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
– A significant % of the world’s data will be processed through Arrow!
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R

DREMIO
Agenda
• Purpose
• Memory Representation
• Language Bindings
• IPC & RPC
• Example Integrations

DREMIO
Overview
• A high speed in-memory representation
• Well-documented and cross language
compatible
• Designed to take advantage of modern
CPU characteristics
• Embeddable in execution engines, storage
layers, etc.

DREMIO
Focus on CPU Efficiency
Traditional
Memory Buffer
Arrow
Memory Buffer
• Cache Locality
• Super-scalar & vectorized
operation
• Minimal Structure
Overhead
• Constant value access
– With minimal structure
overhead
• Operate directly on
columnar compressed data

DREMIO
High Performance Sharing & Interchange
Today With Arrow
• Each system has its own internal
memory format
• 70-80% CPU wasted on serialization
and deserialization
• Similar functionality implemented in
multiple projects
• All systems utilize the same memory
format
• No overhead for cross-system
communication
• Projects can share functionality (eg,
Parquet-to-Arrow reader)

DREMIO
Shared Need -> Open Source Opportunity
• Columnar is complex
• Shredded Columnar is even
more complex
• We all need to go to same place
• Take advantage of Open
Source approach
• Once we pick a shared
solution, we get interchange
for “free”
“We are also considering switching to a
columnar canonical in-memory format for
data that needs to be materialized during
query processing, in order to take advantage
of SIMD instructions” - Impala Team
“A large fraction of the CPU time is spent
waiting for data to be fetched from main
memory…we are designing cache-friendly
algorithms and data structures so Spark
applications will spend less time waiting to
fetch data from memory and more time doing
useful work - Spark Team

DREMIO
IN MEMORY REPRESENTATION

DREMIO
persons = [
{
name: 'wes',
iq: 180,
addresses: [
{number: 2, street 'a'},
{number: 3, street 'bb'}
]
},
{
name: 'joe',
iq: 100,
addresses: [
{number: 4, street 'ccc'},
{number: 5, street 'dddd'},
{number: 2, street 'f'}
]
}
]

DREMIO
Simple Example: persons.iq
person.iq
180
100

DREMIO
Simple Example: persons.addresses.number
person.addresses
0
2
5
person.addresses.number
2
3
4
5
6
offset

DREMIO
Columnar data
person.addresses.street
person.addresses
0
2
5
offset
0
1
3
6
10
a
b
b
c
c
c
d
d
d
d
f
person.addresses.number
2
3
4
5
6
offset

DREMIO
Language Bindings
• Target Languages
– Java (beta)
– C++ (underway)
– Python & Pandas (underway)
– R
– Julia
• Initial Focus
– Read a structure
– Write a structure
– Manage memory

DREMIO
Java: Creating Dynamic Off-Heap Structures
FieldWriter w= getWriter();
w.varChar("name").write("Wes");
w.integer("iq").write(180);
ListWriter list = writer.list("addresses");
list.startList();
MapWriter map = list.map();
map.start();
map.integer("number").writeInt(2);
map.varChar("street").write("a");
map.end();
map.start();
map.integer("number").writeInt(3);
map.varChar("street").write("bb");
map.end();
list.endList();
{
name: 'wes',
iq: 180,
addresses: [
{number: 2, street 'a'},
{number: 3, street 'bb'}
]
}
JSON Representation Programmatic Construction

DREMIO
Java: Memory Management (& NVMe)
• Chunk-based managed allocator
– Built on top of Netty’s JEMalloc implementation
• Create a tree of allocators
– Limit and transfer semantics across allocators
– Leak detection and location accounting
• Wrap native memory from other applications
• New support for integration with Intel’s Persistent
Memory library via Apache Mnemonic

DREMIO
Common Message Pattern
• Schema negotiation
– Logical description of structure
– Identification of dictionary
encoded nodes
• Dictionary batch
– Dictionary ID, values
• Record batch
– Batches of records up to 64K
– Leaf nodes up to 2B values
Schema
Negotiation
Dictionary
Batch
Record
Batch
Record
Batch
Record
Batch
1..N
Batches
0..N
Batches

DREMIO
Record Batch Construction
Schema
Negotiation
Dictionary
Batch
Record
Batch
Record
Batch
Record
Batch
name (offset)
name (data)
iq (data)
addresses (list offset)
addresses.number
addresses.street (offset) addresses.street (data)
data header (describes offsets into data)
name (bitmap)
iq (bitmap)
addresses (bitmap)
addresses.number (bitmap)
addresses.street (bitmap)
{
name: 'wes',
iq: 180,
addresses: [
{ number: 2,
street 'a'},
{ number: 3,
street 'bb'}
]
}
Each box is
contiguous memory,
entirely contiguous
on wire

DREMIO
RPC & IPC: Moving Data Between Systems
RPC
• Avoid Serialization & Deserialization
• Layer TBD: Focused on supporting vectored io
– Scatter/gather reads/writes against socket
IPC
• Alpha implementation using memory mapped files
– Moving data between Python and Drill
• Working on shared allocation approach
– Shared reference counting and well-defined ownership semantics

DREMIO
Real World Example: Python With Drill
in partition 0
…
in partition
n - 1
SQL Engine
Python
function
input
Python
function
input
User-supplied
Python code
output
output
out partition 0
…
out partition
n - 1
SQL Engine

DREMIO
Real World Example: Feather File Format for
Python and R
• Problem: fast, language-
agnostic binary data
frame file format
• Written by Wes
McKinney (Python)
Hadley Wickham (R)
• Read speeds close to disk
IO performance
Arrow array 0
Arrow array 1
…
Arrow array n
Feather
metadata
Feather ﬁle
Apache Arrow
memory
Google
ﬂatbuffers

DREMIO
Real World Example: Feather File Format for
Python and R
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
R Python

DREMIO
What’s Next
• Parquet for Python & C++
– Using Arrow Representation
• Available IPC Implementation
• Spark, Drill Integration
– Faster UDFs, Storage interfaces

DREMIO
Get Involved
• Join the community
– dev@arrow.apache.org
– Slack: https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– @ApacheArrow
• Hadoop Summit Talks
– Tomorrow: The Heterogeneous Data Lake
– Thursday: Planning with Polyalgebra

HUG_Ireland_Apache_Arrow_Tomer_Shiran

More Related Content

What's hot (20)

Viewers also liked (14)

Similar to HUG_Ireland_Apache_Arrow_Tomer_Shiran (20)

More from John Mulhall (12)

Recently uploaded (20)

HUG_Ireland_Apache_Arrow_Tomer_Shiran