© Cloudera, Inc. All rights reserved.

Enabling Python to be a Better Big Data Citizen

Wes McKinney @wesmckinn
NYC Python Meetup 2016-02-17
  
Me

•  R&D at Cloudera, formerly DataPad CEO/founder
•  Serial creator of structured data tools / user interfaces
•  Wrote the bestseller Python for Data Analysis (2012)
•  Open source projects
   • Python {pandas, Ibis, statsmodels}
   • Apache {Arrow, Parquet, Kudu (incubating)}
•  Mostly work in Python and Cython/C/C++
  
	
  
Some simplistic generalizations

Industry Analytics
•  Heterogeneous data: flat tables and JSON
•  Spark / MapReduce
•  SQL
•  DFS-friendly / streaming data formats
•  More physical machines
•  Python: light investment, generally

Scientific Computing
•  Homogeneous data: multidimensional arrays
•  HPC tools
•  Linear algebra
•  Scientific data formats (e.g. HDF5)
•  Fewer physical machines
•  Python: heavy investment, generally
  
A sample big data architecture

[Diagram components: application data, Kafka brokers, HDFS (JSON), Spark/MapReduce, columnar storage, analytic SQL engine, user issuing SQL]

pandas

•  Hugely popular Python table / “data frame” library
   • Labeled table, array, and time series data structures
•  Popular for data preparation, ETL, and in-memory analytics
•  Built using Python’s scientific computing stack
   • User API / domain specific language
   • Bespoke in-memory analytics / relational algebra engine
   • IO interfaces (CSV, SQL, etc.)
   • Expanded data type system (beyond NumPy)
•  Supports flat data only (or semi-structured data that can be flattened)
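To make "semi-structured data that can be flattened" concrete, here is a minimal sketch using only the standard library. The `orders` records and the `flatten_orders` helper are hypothetical and are not taken from the talk. Each nested item becomes one flat row, with the parent fields repeated on every row.

```python
# Hypothetical nested records: each order holds a list of line items.
orders = [
    {"order_id": 1, "customer": "ann", "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ]},
    {"order_id": 2, "customer": "bob", "items": [
        {"sku": "C3", "qty": 5},
    ]},
]


def flatten_orders(records):
    """Return one flat dict per nested item, repeating the parent fields."""
    rows = []
    for record in records:
        for item in record["items"]:
            rows.append({
                "order_id": record["order_id"],
                "customer": record["customer"],
                "sku": item["sku"],
                "qty": item["qty"],
            })
    return rows


rows = flatten_orders(orders)
for row in rows:
    print(row)
```

A list of flat dicts like `rows` is a shape that a table library can load directly. The cost is that parent values are duplicated, and the nesting itself is no longer represented.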
  
2016 Python Data Trends

•  Improved Python interoperability with the Apache Hadoop ecosystem
   • I’m working with {Arrow, Kudu, Impala, Parquet, Spark}
•  Support for big data file formats like Apache Parquet
•  Native in-memory Python support for nested / JSON-like data
  
Ibis in a nutshell

•  For Python programmers doing analytics in industry
•  Project Blog: http://blog.ibis-project.org
•  Cross-team project @ Cloudera
•  Apache-licensed, open source http://github.com/cloudera/ibis
•  Crafting a compelling Python-on-Hadoop user experience
   • Remove SQL coding from user workflows
   • Develop high performance extensions in Python
  
Enabling interoperability with big data systems

•  Distributed / MPP query engines: implemented in a host language
   • Typically C/C++ or Java/Scala
•  User-defined functions (UDFs) through various means
   • Implement in host language
   • Implement in user language through some external language protocol (often RPC-based)
•  External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)
  
Executing data science languages in the compute layer

UI:       Ibis, SQL, Spark API, …
Compute:  Analytic SQL, Spark, MapReduce (Python, R, Julia, …?)
Storage:  HDFS, Kudu, HBase

Python interoperability challenges

•  Problem 1: Serialization / deserialization overhead

[Diagram: the big data system passes input partitions 0 … n - 1 as input to a Python function running user-supplied Python code; the output returns to the big data system as output partitions 0 … n - 1]

Data movement can be extremely costly

[Diagram: input partition 0 passed as input to a Python function]

Questions
•  How to represent “data in-flight” (RPC)?
•  Cost of conversion between in-memory data structures and RPC representation
•  How to communicate schemas / metadata?
  
Data movement can be extremely costly

[Diagram: input partition 0 passed as input to a Python function]

Slow data movement / conversion can largely undermine the performance benefits of Python’s high performance in-memory data tools
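The cost described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not how any particular big data system moves data. Here `pickle` stands in for an RPC encoding, an `array` of doubles stands in for one partition, and `memoryview` stands in for sharing a buffer without conversion.

```python
import pickle
import time
from array import array

# One hypothetical "partition": a million float64 values in a contiguous buffer.
partition = array("d", range(1_000_000))

# Path 1: serialize and deserialize, as an RPC-style protocol has to.
start = time.perf_counter()
payload = pickle.dumps(partition)
received = pickle.loads(payload)
copy_seconds = time.perf_counter() - start

# Path 2: expose the same memory to the consumer without copying it.
start = time.perf_counter()
view = memoryview(partition)
view_seconds = time.perf_counter() - start

# A write to the original is visible through the view but not in the copy,
# which shows that the view shares memory while the round trip duplicated it.
partition[0] = 42.0

print(f"serialize + deserialize: {copy_seconds:.6f} s for {len(payload)} bytes")
print(f"zero-copy view:          {view_seconds:.6f} s for {view.nbytes} bytes")
print(f"view[0] = {view[0]}, received[0] = {received[0]}")
```

The round trip scales with the size of the data, while creating the view does not. That difference is the motivation for agreeing on a shared in-memory representation.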
  
Python interoperability challenges

•  Problem 2: Scalar vs vectorized computations

SCALAR
    result = np.empty(n)
    for i in range(n):
        result[i] = f(a[i], b[i])

VECTORIZED (often 100-1000x faster)
    result = f(a, b)

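The two styles above can be run side by side. The sketch below assumes NumPy is installed and uses a hypothetical function `f` that works on both scalars and arrays. The measured ratio depends on the machine and on `f`, so it is printed rather than assumed.

```python
import time

import numpy as np


def f(x, y):
    """Hypothetical computation; works element-wise on scalars or arrays."""
    return x * y + 1.0


n = 100_000
rng = np.random.default_rng(0)
a = rng.random(n)
b = rng.random(n)

# Scalar style: one Python-level call to f per element.
start = time.perf_counter()
scalar_result = np.empty(n)
for i in range(n):
    scalar_result[i] = f(a[i], b[i])
scalar_seconds = time.perf_counter() - start

# Vectorized style: a single call; the loop runs inside NumPy's compiled code.
start = time.perf_counter()
vectorized_result = f(a, b)
vectorized_seconds = time.perf_counter() - start

print(f"scalar:     {scalar_seconds:.4f} s")
print(f"vectorized: {vectorized_seconds:.4f} s")
print(f"ratio:      {scalar_seconds / vectorized_seconds:.0f}x")
```

Both paths compute the same values. Only the number of interpreter-level function calls differs, which is why a UDF interface that hands Python one row at a time gives up most of this advantage.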
Apache Arrow: What is it?

•  http://arrow.apache.org
•  Not a piece of software, exactly!
•  A standardized in-memory representation for columnar data
•  Enables
   • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
   • Cheap data interchange amongst systems, little or no serialization
   • Flexible support for complex JSON-like data
•  Targets: Impala, Kudu, Parquet, Spark
  
Columnar data

persons = [
  {
    name: 'wes',
    addresses: [
      {number: 2, street: 'a'},
      {number: 3, street: 'bb'},
    ]
  },
  {
    name: 'mark',
    addresses: [
      {number: 4, street: 'ccc'},
      {number: 5, street: 'dddd'},
      {number: 6, street: 'f'},
    ]
  },
]
Columnar data

person.addresses
    offset:  0 2 5

person.addresses.street
    offset:  0 1 3 6 10
    values:  a b b c c c d d d d f

person.addresses.number
    values:  2 3 4 5 6
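The offsets-plus-values layout on this slide can be derived from the nested `persons` records with a few lines of plain Python. This is an illustrative sketch, not Arrow's implementation. The helper names `build_offsets` and `streets_of` are hypothetical, and each offsets list here includes a closing offset (the final 11 for streets), which the slide does not show.

```python
def build_offsets(lengths):
    """Turn a sequence of lengths into cumulative offsets, starting at 0."""
    offsets = [0]
    for length in lengths:
        offsets.append(offsets[-1] + length)
    return offsets


persons = [
    {"name": "wes", "addresses": [
        {"number": 2, "street": "a"},
        {"number": 3, "street": "bb"},
    ]},
    {"name": "mark", "addresses": [
        {"number": 4, "street": "ccc"},
        {"number": 5, "street": "dddd"},
        {"number": 6, "street": "f"},
    ]},
]

# Flatten the nested lists into one run of addresses; the offsets record
# where each person's addresses begin and end within that run.
addresses = [addr for person in persons for addr in person["addresses"]]
address_offsets = build_offsets(len(p["addresses"]) for p in persons)

# Strings get the same treatment: one run of characters plus offsets.
street_offsets = build_offsets(len(addr["street"]) for addr in addresses)
street_chars = "".join(addr["street"] for addr in addresses)

# Fixed-width values need no offsets of their own.
numbers = [addr["number"] for addr in addresses]


def streets_of(person_index):
    """Recover one person's street names from the columnar buffers."""
    start = address_offsets[person_index]
    stop = address_offsets[person_index + 1]
    return [street_chars[street_offsets[i]:street_offsets[i + 1]]
            for i in range(start, stop)]


print(address_offsets)
print(street_offsets)
print(street_chars)
print(numbers)
print(streets_of(1))
```

Reading any one person's nested values needs only slices of contiguous buffers, so no per-record objects have to be rebuilt.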
Apache Arrow in practice
  
Thank you

Wes McKinney @wesmckinn
Views are my own
  
