Next-generation Python Big Data Tools, powered by Apache Arrow

Wes McKinney @wesmckinn
SF Big Analytics Meetup, 2016-04-05

© Cloudera, Inc. All rights reserved.
  
Me

•  Data Science Tools at Cloudera, formerly DataPad CEO/founder
•  Serial creator of structured data tools / user interfaces
•  Wrote bestseller Python for Data Analysis (2012)
•  Open source projects
   • Python {pandas, Ibis, statsmodels}
   • Apache {Arrow, Parquet, Kudu (incubating)}
•  Mostly work in Python and Cython/C/C++
  
In process:

Python for Data Analysis: 2nd Edition
Coming late 2016 / early 2017
  
Python + Big Data: The State of things

•  See “Python and Apache Hadoop: A State of the Union” from February 17
•  Areas where much more work needed
   • Binary file format read/write support (e.g. Parquet files)
   • File system libraries (HDFS, S3, etc.)
   • Client drivers (Spark, Hive, Impala, Kudu)
   • Compute system integration (Spark, Impala, etc.)
  
Apache Arrow

Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow
  
Arrow in a Slide

•  New Top-level Apache Software Foundation project
•  Announced Feb 17, 2016
•  Focused on Columnar In-Memory Analytics
   1.  10-100x speedup on many workloads
   2.  Common data layer enables companies to choose best of breed systems
   3.  Designed to work with any programming language
   4.  Support for both relational and complex data as-is
•  Developers from 13+ major open source projects involved
•  A significant % of the world’s data will be processed through Arrow!

Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R

Apache Arrow: What is it?

•  http://arrow.apache.org
•  Not a piece of software, exactly!
•  A standardized in-memory representation for columnar data
•  Enables
   • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
   • Cheap data interchange amongst systems, little or no serialization
   • Flexible support for complex JSON-like data
•  Targets: Impala, Kudu, Parquet, Spark
  
Focus on CPU Efficiency

Example table:

         session_id   timestamp         source_ip
Row 1    1331246660   3/8/2012 2:44PM   99.155.155.225
Row 2    1331246351   3/8/2012 2:38PM   65.87.165.114
Row 3    1331244570   3/8/2012 2:09PM   71.10.106.181
Row 4    1331261196   3/8/2012 6:46PM   76.102.156.138

Traditional Memory Buffer: values laid out row by row (Row 1, Row 2, Row 3, Row 4)
Arrow Memory Buffer: values laid out column by column (session_id, timestamp, source_ip)

•  Cache Locality
•  Super-scalar & vectorized operation
•  Minimal Structure Overhead
•  Constant value access
•  With minimal structure overhead
•  Operate directly on columnar compressed data
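
To make the layout difference concrete, here is a small Python sketch using only the standard library. It stores the slide's four example records once as a list of row tuples and once as one buffer per column. The variable names and the use of the `array` module are illustrative assumptions; this is not Arrow's actual implementation.

```python
from array import array

# The four example records from the slide: (session_id, timestamp, source_ip).
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Row-oriented ("traditional") layout: each record's fields sit together,
# so scanning one field means stepping over every other field too.
row_max = max(record[0] for record in rows)

# Column-oriented layout: one buffer per column. The integer column is a
# single contiguous block of 8-byte values (typecode "q" = signed 64-bit).
columns = {
    "session_id": array("q", (record[0] for record in rows)),
    "timestamp": [record[1] for record in rows],
    "source_ip": [record[2] for record in rows],
}
col_max = max(columns["session_id"])

# Both layouts hold the same data; only the memory arrangement differs.
assert row_max == col_max

# Constant-time value access: element i of a fixed-width column lives at
# byte offset i * itemsize from the start of the buffer.
session_ids = columns["session_id"]
offset_of_row_3 = 2 * session_ids.itemsize
```

A scan over `session_id` in the columnar form touches only that column's bytes, which is the cache-locality point made in the bullets above.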
  
High Performance Sharing & Interchange

Today
•  Each system has its own internal memory format
•  70-80% CPU wasted on serialization and deserialization
•  Similar functionality implemented in multiple projects

With Arrow
•  All systems utilize the same memory format
•  No overhead for cross-system communication
•  Projects can share functionality (e.g., Parquet-to-Arrow reader)

[Diagram: today, Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark exchange data through "Copy & Convert" steps; with Arrow, the same systems share Arrow memory.]

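
The difference between "copy & convert" and a shared memory format can be illustrated with a small standard-library sketch. It is not Arrow code: `pickle` stands in for a system-specific serialization step, and `memoryview` stands in for a consumer reading a buffer whose layout both sides already agree on. The producer/consumer names are hypothetical.

```python
import pickle
from array import array

# A column of 64-bit integers owned by a hypothetical "producer" system.
producer_column = array("q", range(1_000))

# "Copy & convert": the producer serializes, the consumer deserializes.
# The consumer ends up with its own, separate copy of the data.
wire_bytes = pickle.dumps(producer_column)
consumer_copy = pickle.loads(wire_bytes)

# Shared format: the consumer gets a view over the producer's buffer.
# No bytes are copied and no conversion step runs.
consumer_view = memoryview(producer_column)

# Show the difference: a change made by the producer is visible through
# the shared view but not in the deserialized copy.
producer_column[0] = 42
```

After the last line, `consumer_view[0]` reflects the producer's write while `consumer_copy[0]` still holds the old value, which is what "separate copy" versus "same memory" means in practice.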
Big Data Systems: Poor Python IO performance

http://wesmckinney.com/blog/pandas-and-apache-arrow/
  
Real World Example: Feather File Format for Python and R

• Problem: fast, language-agnostic binary data frame file format
• Written by Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance

[Diagram: a Feather file holds Arrow array 0, Arrow array 1, …, Arrow array n (Apache Arrow memory) plus Feather metadata (Google flatbuffers).]

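
The general idea in the diagram (raw column buffers plus a small block of metadata describing them) can be sketched with a toy file format. This is not the Feather format: JSON stands in for the flatbuffers metadata, the footer layout is invented for illustration, and `write_toy_file` and `read_toy_column` are hypothetical helpers.

```python
import json
import os
import struct
import tempfile
from array import array


def write_toy_file(path, columns):
    """Write int64 columns back to back, then a JSON metadata footer."""
    metadata = []
    with open(path, "wb") as f:
        for name, values in columns.items():
            buf = array("q", values).tobytes()
            metadata.append({"name": name, "offset": f.tell(), "length": len(values)})
            f.write(buf)
        footer = json.dumps(metadata).encode("utf-8")
        f.write(footer)
        # The last 4 bytes record the footer size so a reader can find it.
        f.write(struct.pack("<I", len(footer)))


def read_toy_column(path, name):
    """Read one column by seeking straight to its bytes."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        (footer_size,) = struct.unpack("<I", f.read(4))
        f.seek(-4 - footer_size, os.SEEK_END)
        metadata = json.loads(f.read(footer_size))
        entry = next(m for m in metadata if m["name"] == name)
        f.seek(entry["offset"])
        values = array("q")
        values.frombytes(f.read(entry["length"] * values.itemsize))
        return values


path = os.path.join(tempfile.mkdtemp(), "toy.bin")
write_toy_file(path, {"a": [1, 2, 3], "b": [10, 20, 30]})
column_b = read_toy_column(path, "b")
```

Because the column bytes on disk already match the in-memory layout, reading a column is a seek plus a bulk read with no per-value parsing, which is one plausible reason a format of this shape can approach disk IO speed.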
Real World Example: Feather File Format for Python and R

R

    library(feather)

    # df is an existing data frame
    path <- "my_data.feather"
    write_feather(df, path)

    df <- read_feather(path)

Python

    import feather

    # df is an existing pandas DataFrame
    path = 'my_data.feather'

    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)
  
Apache Parquet: Binary columnar storage format

•  I just became a Parquet committer!
•  github.com/apache/parquet-cpp
•  Python users will soon be able to read Parquet files via PyArrow
•  parquet-cpp <-> PyArrow <-> pandas
  
Language Bindings

•  Target Languages
   • Java (beta)
   • CPP (underway)
   • Python & Pandas (underway)
   • R
   • Julia
•  Initial Focus
   • Read a structure
   • Write a structure
   • Manage Memory
  
pandas and Arrow in context
  
RPC & IPC: Moving Data Between Systems

RPC
•  Avoid Serialization & Deserialization
•  Layer TBD: Focused on supporting vectored IO
   • Scatter/gather reads/writes against socket

IPC
•  Alpha implementation using memory mapped files
   • Moving data between Python and Drill
•  Working on shared allocation approach
   • Shared reference counting and well-defined ownership semantics
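
As a rough illustration of the memory-mapped IPC idea, the sketch below writes a column of 64-bit integers to a file and then maps it back as a typed view, without parsing the values or copying them into a new buffer. It uses only Python's standard `mmap` module. The file name and the writer/reader roles are assumptions for illustration; this is not the Arrow IPC implementation the slide refers to.

```python
import mmap
import os
import tempfile
from array import array

path = os.path.join(tempfile.mkdtemp(), "column.bin")

# "Writer" side: dump the raw column buffer to a file.
with open(path, "wb") as f:
    f.write(array("q", [10, 20, 30, 40]).tobytes())

# "Reader" side: map the file into memory instead of reading and parsing it.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# cast("q") reinterprets the mapped bytes as signed 64-bit integers in
# place; nothing is deserialized and no second buffer is allocated.
raw = memoryview(mapped)
view = raw.cast("q")
total = sum(view)
third_value = view[2]

# Release the views before closing the mapping they point into.
view.release()
raw.release()
mapped.close()
```

In a real two-process setup the writer and reader would be separate programs agreeing on the buffer layout in advance, which is the role a standardized format plays.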
  
Executing data science languages in the compute layer

UI:       Ibis, SQL, Spark API, …
Compute:  Analytic SQL, Spark, MapReduce
Storage:  HDFS, Kudu, HBase

Python, R, Julia, …?

Real World Example: Python With Spark, Drill, Impala

[Diagram: a SQL engine passes input partitions (in partition 0 … in partition n - 1) as input to Python functions running user-supplied Python code; each function's output becomes an output partition (out partition 0 … out partition n - 1) handed back to the SQL engine.]

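
The data flow in the diagram can be sketched in plain Python. This is a single-process stand-in, not Spark, Drill, or Impala code: `run_udf_over_partitions` is a hypothetical helper playing the engine's role, and each partition is modeled as a list of values.

```python
from typing import Callable, List, Sequence


def run_udf_over_partitions(
    in_partitions: Sequence[Sequence[float]],
    udf: Callable[[Sequence[float]], List[float]],
) -> List[List[float]]:
    """Apply a user-supplied function to each input partition.

    The function receives a whole partition (a batch of values) rather than
    one value at a time, and returns the matching output partition.
    """
    return [udf(partition) for partition in in_partitions]


def demean(values: Sequence[float]) -> List[float]:
    """Example user-supplied code: subtract the partition mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


in_partitions = [[1.0, 2.0, 3.0], [10.0, 20.0]]
out_partitions = run_udf_over_partitions(in_partitions, demean)
```

Handing the function a whole partition at once is the part that matters here: if the engine and Python shared one columnar memory format, each batch could in principle be passed without a per-value conversion step.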
What’s Next

•  Parquet for Python & C++
   • Using Arrow as intermediary
•  Available IPC Implementation
•  Spark, Drill Integration
   • Faster UDFs, Storage interfaces
  
Apache Arrow in practice
  
Get Involved

•  Join the community
   • dev@arrow.apache.org
   • Slack: https://apachearrowslackin.herokuapp.com/
   • http://arrow.apache.org
   • @ApacheArrow
  
Thank you

Wes McKinney @wesmckinn
Views are my own
  
