DREMIO
Apache Arrow
A New Era of Columnar In-Memory Analytics
Tomer Shiran, Co-Founder & CEO at Dremio
tshiran@dremio.com | @tshiran
Hadoop Meetup Ireland 2016
April 12, 2016
Company Background
Jacques Nadeau
Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran
Founder & CEO
• MapR (VP Product); Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem
Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs
• Stealth data analytics startup
• Founded in 2015
• Led by experts in Big Data and open source
Arrow in a Slide
• New Top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
– A significant % of the world’s data will be processed through Arrow!
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
Agenda
• Purpose
• Memory Representation
• Language Bindings
• IPC & RPC
• Example Integrations
PURPOSE
Overview
• A high speed in-memory representation
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPU characteristics
• Embeddable in execution engines, storage layers, etc.
Focus on CPU Efficiency
[Figure: Traditional Memory Buffer vs. Arrow Memory Buffer]
• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
– With minimal structure overhead
• Operate directly on columnar compressed data
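A rough illustration of the cache-locality and constant-access points, using only the Python standard library. The variable names are made up for this sketch and are not part of Arrow; the point is only that a columnar buffer keeps one field's values back to back, so value i can be found by arithmetic on its position.

```python
import sys
from array import array

# Row-oriented: each record is its own object, so scanning one field
# hops from object to object.
rows = [{"name": "wes", "iq": 180}, {"name": "joe", "iq": 100}]
row_total = sum(r["iq"] for r in rows)

# Column-oriented: all iq values sit back to back in one typed buffer.
iq_column = array("q", [180, 100])  # "q" = signed 64-bit integers
col_total = sum(iq_column)

# Fixed-width values give constant-time access: value i starts at
# byte i * itemsize in the raw buffer.
raw = iq_column.tobytes()
width = iq_column.itemsize
second = int.from_bytes(raw[1 * width:2 * width], sys.byteorder, signed=True)
```

Both layouts produce the same total; the columnar one simply reads a single contiguous region to get there.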
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
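The contrast between the two columns can be sketched in a few lines of standard-library Python. This is an analogy rather than Arrow itself: JSON stands in for any system-specific wire encoding, and a `memoryview` stands in for a consumer that understands the producer's memory layout.

```python
import json
from array import array

iq_column = array("q", [180, 100])  # producer's in-memory column

# Today: the producer encodes, the consumer decodes, and the data is
# copied into a second buffer along the way.
wire = json.dumps(list(iq_column))
decoded = array("q", json.loads(wire))

# With a shared layout: the consumer reads the producer's buffer directly.
shared = memoryview(iq_column)
iq_column[1] = 101  # an in-place update by the producer...
# ...is visible through `shared`, because no copy was ever made.
```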
Shared Need -> Open Source Opportunity
• Columnar is complex
• Shredded Columnar is even more complex
• We all need to go to the same place
• Take advantage of Open Source approach
• Once we pick a shared solution, we get interchange for “free”
“We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions” - Impala Team
“A large fraction of the CPU time is spent waiting for data to be fetched from main memory…we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work” - Spark Team
IN MEMORY REPRESENTATION
persons = [
  {
    name: 'wes',
    iq: 180,
    addresses: [
      {number: 2, street: 'a'},
      {number: 3, street: 'bb'}
    ]
  },
  {
    name: 'joe',
    iq: 100,
    addresses: [
      {number: 4, street: 'ccc'},
      {number: 5, street: 'dddd'},
      {number: 2, street: 'f'}
    ]
  }
]
Simple Example: persons.iq
persons.iq (values): 180, 100
Simple Example: persons.addresses.number
persons.addresses (list offsets): 0, 2, 5
persons.addresses.number (values): 2, 3, 4, 5, 6
Columnar data
persons.addresses (list offsets): 0, 2, 5
persons.addresses.street (string offsets): 0, 1, 3, 6, 10
persons.addresses.street (data): a b b c c c d d d d f
persons.addresses.number (values): 2, 3, 4, 5, 6
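The offset columns on this slide can be reproduced with a short sketch that shreds the persons records into flat buffers. The function and variable names are illustrative only. Two details are choices made here rather than taken from the slide: the sketch appends a closing offset (so n strings produce n + 1 offsets), and it follows the persons record as written, where the last address has number 2, while the slide's column lists 6.

```python
persons = [
    {"name": "wes", "iq": 180,
     "addresses": [{"number": 2, "street": "a"},
                   {"number": 3, "street": "bb"}]},
    {"name": "joe", "iq": 100,
     "addresses": [{"number": 4, "street": "ccc"},
                   {"number": 5, "street": "dddd"},
                   {"number": 2, "street": "f"}]},
]

def shred(records):
    """Flatten the nested addresses into offset and value buffers."""
    list_offsets = [0]    # where each person's addresses begin and end
    numbers = []          # addresses.number values, back to back
    street_offsets = [0]  # where each street begins and ends in street_data
    street_data = ""      # all street characters concatenated
    for person in records:
        for address in person["addresses"]:
            numbers.append(address["number"])
            street_data += address["street"]
            street_offsets.append(len(street_data))
        list_offsets.append(len(numbers))
    return list_offsets, numbers, street_offsets, street_data

list_offsets, numbers, street_offsets, street_data = shred(persons)

def addresses_of(i):
    """Rebuild person i's addresses using only the flat buffers."""
    start, end = list_offsets[i], list_offsets[i + 1]
    return [(numbers[j], street_data[street_offsets[j]:street_offsets[j + 1]])
            for j in range(start, end)]
```

Reading a person's addresses back needs only two offset lookups and a slice, which is what makes nested data usable without per-record objects.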
LANGUAGE BINDINGS
Language Bindings
• Target Languages
– Java (beta)
– C++ (underway)
– Python & Pandas (underway)
– R
– Julia
• Initial Focus
– Read a structure
– Write a structure
– Manage memory
Java: Creating Dynamic Off-Heap Structures
Programmatic Construction
// Writer calls as shown on the slide (simplified, not a complete program)
FieldWriter w = getWriter();
w.varChar("name").write("Wes");
w.integer("iq").write(180);
ListWriter list = w.list("addresses");
list.startList();
MapWriter map = list.map();
// first address
map.start();
map.integer("number").writeInt(2);
map.varChar("street").write("a");
map.end();
// second address
map.start();
map.integer("number").writeInt(3);
map.varChar("street").write("bb");
map.end();
list.endList();
JSON Representation
{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}
Java: Memory Management (& NVMe)
• Chunk-based managed allocator
– Built on top of Netty’s JEMalloc implementation
• Create a tree of allocators
– Limit and transfer semantics across allocators
– Leak detection and location accounting
• Wrap native memory from other applications
• New support for integration with Intel’s Persistent
Memory library via Apache Mnemonic
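The allocator-tree idea (limits, transfer, leak detection) can be sketched as pure accounting. The `Allocator` class below is hypothetical and tracks byte counts only; it is not the Arrow Java allocator API and allocates no real memory.

```python
class Allocator:
    """Hypothetical accounting-only allocator (tracks bytes, not memory)."""

    def __init__(self, name, limit, parent=None):
        self.name = name
        self.limit = limit
        self.parent = parent
        self.used = 0

    def _chain(self):
        # Yield this allocator and every ancestor up to the root.
        node = self
        while node is not None:
            yield node
            node = node.parent

    def allocate(self, size):
        # Check every level first so a failure leaves the counters untouched.
        for node in self._chain():
            if node.used + size > node.limit:
                raise MemoryError(f"{node.name}: limit {node.limit} exceeded")
        for node in self._chain():
            node.used += size

    def release(self, size):
        for node in self._chain():
            node.used -= size

    def transfer(self, size, target):
        """Move ownership of `size` bytes to another allocator."""
        self.release(size)
        try:
            target.allocate(size)
        except MemoryError:
            self.allocate(size)  # roll back so nothing is lost
            raise

    def close(self):
        # Leak detection: anything still accounted for was never released.
        if self.used:
            raise RuntimeError(f"{self.name}: leaked {self.used} bytes")


root = Allocator("root", 1024)
query = Allocator("query", 512, parent=root)
scan = Allocator("scan", 256, parent=query)
sort = Allocator("sort", 256, parent=query)

scan.allocate(200)
scan.transfer(200, sort)  # the sort operator now owns the buffer
```

After the transfer, `scan` can be closed cleanly, while closing `sort` before releasing its 200 bytes would report a leak.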
RPC & IPC
Common Message Pattern
• Schema negotiation
– Logical description of structure
– Identification of dictionary
encoded nodes
• Dictionary batch
– Dictionary ID, values
• Record batch
– Batches of records up to 64K
– Leaf nodes up to 2B values
[Figure: message stream of Schema Negotiation, then Dictionary Batch, then Record Batches; the batch groups are labelled "1..N Batches" and "0..N Batches"]
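To make the dictionary batch idea concrete, here is a minimal sketch of dictionary encoding. The dict-based "batches" and their field names are placeholders chosen for this example, not the Arrow message format: the dictionary batch carries an ID and the distinct values, and the record batch carries only small integer indices into it.

```python
def dictionary_encode(values):
    """Split a column into distinct values and per-row indices."""
    dictionary = []  # distinct values, in first-seen order
    index_of = {}    # value -> position in `dictionary`
    indices = []     # one small integer per row
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices

streets = ["a", "bb", "a", "a", "bb"]
dictionary, indices = dictionary_encode(streets)

# Sent once, ahead of the record batches that refer to it.
dictionary_batch = {"id": 0, "values": dictionary}
# Sent per batch: indices only, plus which dictionary they point into.
record_batch = {"dictionary_id": 0, "indices": indices}

decoded = [dictionary_batch["values"][i] for i in record_batch["indices"]]
```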
Record Batch Construction
[Figure: the message stream (Schema Negotiation, Dictionary Batch, Record Batches) with one record batch expanded into its buffers]
data header (describes offsets into data)
name (bitmap), name (offset), name (data)
iq (bitmap), iq (data)
addresses (bitmap), addresses (list offset)
addresses.number (bitmap), addresses.number
addresses.street (bitmap), addresses.street (offset), addresses.street (data)
Example record:
{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}
Each box is contiguous memory, entirely contiguous on wire
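As a sketch of how the three buffers of a string field such as name fit together, the code below builds a validity bitmap, an offset buffer, and a data buffer for a column containing a null. The helper names are invented for this example. The bit order (slot i maps to bit i % 8 of byte i // 8, least significant bit first) follows the published Arrow format; the slide itself does not state it.

```python
def build_string_column(values):
    """Return (validity bitmap, offsets, data) for a list of str or None."""
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per slot
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # mark slot i as valid
            data += value.encode("utf-8")
        offsets.append(len(data))  # null slots repeat the previous offset
    return bytes(bitmap), offsets, bytes(data)

def read_string(column, i):
    """Read slot i back, returning None when its validity bit is unset."""
    bitmap, offsets, data = column
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

names = build_string_column(["wes", None, "joe"])
```

Each of the three buffers is a single contiguous run of bytes, which is what allows a record batch to be written to a socket or file without restructuring.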
RPC & IPC: Moving Data Between Systems
RPC
• Avoid Serialization & Deserialization
• Layer TBD: Focused on supporting vectored I/O
– Scatter/gather reads/writes against socket
IPC
• Alpha implementation using memory-mapped files
– Moving data between Python and Drill
• Working on shared allocation approach
– Shared reference counting and well-defined ownership semantics
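The memory-mapped-file approach can be sketched with the standard library alone. The file layout here (an 8-byte count followed by native-endian int64 values) and the file name are invented for this example and are not the Arrow IPC format; the point is that the reader views the mapped bytes as a typed column without decoding them.

```python
import mmap
import os
import struct
import tempfile

values = [180, 100]
path = os.path.join(tempfile.mkdtemp(), "iq.column")  # hypothetical file name

# Producer: write a count header, then the raw int64 values.
with open(path, "wb") as f:
    f.write(struct.pack("=q", len(values)))
    f.write(struct.pack(f"={len(values)}q", *values))

# Consumer: map the file and view the bytes in place as int64 values.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (count,) = struct.unpack_from("=q", mapped, 0)
    view = memoryview(mapped)
    column = view[8:8 + 8 * count].cast("q")  # typed view, no copy
    total = sum(column)
    first = column[0]
    # Release the views before closing the mapping they point into.
    column.release()
    view.release()
    mapped.close()
```

In a real two-process setup the producer and consumer would be separate programs agreeing on the path and layout; the ownership question raised in the last bullet is exactly who may unmap or overwrite the file, and when.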
REAL-WORLD EXAMPLES
Real World Example: Python With Drill
[Figure: the SQL Engine's input partitions 0 … n - 1 each feed a Python function running user-supplied Python code; each function's output goes to the matching output partition 0 … n - 1 back in the SQL Engine]
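The data flow in the figure amounts to applying one user-supplied function to each partition independently. A minimal sketch, with `run_udf` and the sample partitions invented for illustration (it says nothing about how Drill actually invokes Python):

```python
def run_udf(partitions, udf):
    """Apply `udf` to every value, producing one output partition per input."""
    return [[udf(value) for value in partition] for partition in partitions]

def add_one(iq):
    # Stand-in for the user-supplied Python code.
    return iq + 1

in_partitions = [[180, 100], [95], [120, 130]]
out_partitions = run_udf(in_partitions, add_one)
```

Because partitions do not depend on each other, an engine is free to run them in parallel and to hand each one to Python as a shared columnar buffer instead of serializing row by row.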
Real World Example: Feather File Format for
Python and R
• Problem: fast, language-agnostic binary data frame file format
• Written by Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
[Figure: a Feather file holds Arrow arrays 0 … n (Apache Arrow memory) followed by Feather metadata (Google flatbuffers)]
Real World Example: Feather File Format for
Python and R
R
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)
Python
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
What’s Next
• Parquet for Python & C++
– Using Arrow Representation
• Available IPC Implementation
• Spark, Drill Integration
– Faster UDFs, Storage interfaces
Get Involved
• Join the community
– dev@arrow.apache.org
– Slack: https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– @ApacheArrow
• Hadoop Summit Talks
– Tomorrow: The Heterogeneous Data Lake
– Thursday: Planning with Polyalgebra
