Pandas & Cloudera: Scaling the Python Data Experience

  • 1. Ibis: Scaling the Python Data Experience
      Wes McKinney, Marcel Kornacker, Justin Erickson, Silvius Rus
      © Cloudera, Inc. All rights reserved.
  • 2. Wes McKinney
      • A key person in building today’s open source Python data community
      • Creator of pandas, a standard Python data wrangling and analytics toolkit used by data scientists
      • Author of the best-selling canonical text Python for Data Analysis (2012)
      • Formerly Founder/CEO of DataPad (acquired by Cloudera in 2014)
  • 3. Python is popular…
      • Python has become a standard language of data science
      • Why is it popular?
          • Maximizes productivity for data engineers and data scientists
          • Build robust software and do interactive data analysis with 100% Python code
          • Easy to learn; makes for happy, productive data teams
          • Large, diverse open source development community
          • Comprehensive libraries: data wrangling, ML, visualization, etc.
      • Main use case: a data science and engineering Swiss Army knife for small-to-medium size data
  • 4. …but Python does not scale today
      • Python ecosystem confined to single-node analysis
          • Great for smaller data sets
          • Requires sampling or aggregations for larger data
          • Distributed tools compromise in various ways
      • Extracting samples or aggregations for larger data means:
          • “Scales” by losing more fidelity
          • Additional ETL overhead to extract samples/aggregations
          • Loss of productivity with multiple languages, tools, etc.
          • Blocks certain analysis and use cases
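To make the fidelity point concrete, here is a minimal sketch in plain Python. The synthetic event log and the `every_nth` helper are invented for illustration and do not come from the slides. A systematic 1% sample of the log misses a rare error code entirely, while a full scan finds every occurrence.

```python
from collections import Counter


def every_nth(rows, n):
    """Systematic sample: keep rows 0, n, 2n, ... (hypothetical helper)."""
    return rows[::n]


# Synthetic log (made-up data): 100,000 "ok" events with
# three rare "disk_failure" events mixed in.
rows = ["ok"] * 100_000
for position in (12_345, 54_321, 98_765):
    rows[position] = "disk_failure"

full_counts = Counter(rows)                    # full-fidelity scan
sample_counts = Counter(every_nth(rows, 100))  # 1% extract

print("full scan:", full_counts["disk_failure"])
print("1% sample:", sample_counts["disk_failure"])
```

None of the three failure positions is a multiple of 100, so the extract contains no failures at all. Any analysis that depends on rare events is blocked unless the full data set is queried.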
  • 5. Ibis: Same Python, now at scale
      • Target user:
          • Data scientists and data engineers (“Python data users”)
      • Goals:
          • Mirrors single-node Python experience
          • Scales to any node and data size
          • No compromise in functionality or usability
          • Interactive experience at native hardware speeds
  • 6. What’s announced?
      • First public release of Ibis
          • http://ibis-project.org
      • Beta release to Cloudera Labs
      • Inviting usage and community development
      • Apache-licensed open-source
  • 7. Ibis’s Vision
      • Uncompromised Python experience
          • 100% Python end-to-end user workflows
          • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
      • Interactive at big data scale
          • Full-fidelity analysis without extractions
          • Scalability for big data
          • Native hardware speeds for a broad set of use cases
  • 9. Advantages of our approach
      • Analyze big data 100% in Python, with the same ease as small/medium data on the local filesystem
      • Full-fidelity data access
      • Familiar Python experience and integration with existing Python data libraries
      • Provide a means for Python high performance computing tools to be leveraged at Hadoop-scale
  • 10. Beta 0.3 release
      • High level Python API for describing analytics and ETL that can be executed by Impala
          • Familiar API for users of pandas
          • Comprehensive coverage of operations expressible as relational data flows
      • Integrated tools for managing data in HDFS
      • Simple workflows to query data files in several formats (Parquet, Avro, Text)
      • pandas data interchange
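The idea of describing a relational data flow in Python and letting a SQL engine execute it can be sketched as follows. This is an illustration only: `TableExpr` and its methods are hypothetical names, not the Ibis API, and the standard-library `sqlite3` module stands in for Impala so the sketch runs anywhere. The `trips` table and its rows are made-up data.

```python
import sqlite3


class TableExpr:
    """Hypothetical deferred table expression; NOT the Ibis API."""

    def __init__(self, name, predicates=(), keys=(), metrics=()):
        self.name = name
        self.predicates = tuple(predicates)
        self.keys = tuple(keys)
        self.metrics = tuple(metrics)

    def filter(self, predicate):
        # Returns a new expression; nothing is executed yet.
        # Predicates are raw SQL strings, acceptable only in a toy sketch.
        return TableExpr(self.name, self.predicates + (predicate,),
                         self.keys, self.metrics)

    def group_by(self, *keys):
        return TableExpr(self.name, self.predicates,
                         self.keys + keys, self.metrics)

    def aggregate(self, **metrics):
        # Maps output column name -> SQL aggregate expression.
        new = tuple(f"{expr} AS {alias}" for alias, expr in metrics.items())
        return TableExpr(self.name, self.predicates,
                         self.keys, self.metrics + new)

    def compile(self):
        """Translate the accumulated description into one SQL statement."""
        select = ", ".join(self.keys + self.metrics) or "*"
        sql = f"SELECT {select} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        if self.keys:
            sql += " GROUP BY " + ", ".join(self.keys)
        return sql

    def execute(self, connection):
        # Only here does the engine touch the data.
        return connection.execute(self.compile()).fetchall()


# Stand-in engine with a small made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("NYC", 10.0), ("NYC", 30.0), ("SF", 8.0), ("SF", 2.0)],
)

expr = (
    TableExpr("trips")
    .filter("fare > 5")
    .group_by("city")
    .aggregate(total_fare="SUM(fare)")
)

sql = expr.compile()
result = sorted(expr.execute(conn))
print(sql)
print(result)
```

Because each method only returns a new description, the whole pipeline reaches the engine as a single query. The data stays where it lives and only the small aggregated result comes back to Python.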
  • 11. Ibis/Impala Joint Roadmap
      • More natural data modeling
          • Complex types support
      • Integration with full Python data ecosystem
          • Advanced analytics + machine learning
          • Enable use of performance computing tools
      • User extensibility with native performance
          • In-memory columnar format
          • Python-to-LLVM IR compilation
      • Workflow and usability tools
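The slides do not specify the in-memory columnar format, so the following is only a generic sketch of the idea, using the standard-library `array` module. The schema, the rows, and the `to_columns` helper are assumptions made for illustration. Each column is stored as one typed, contiguous buffer, so an aggregate over a single column scans only that buffer.

```python
from array import array

# Row-oriented input (made-up data): one dict per record.
rows = [
    {"user_id": 1, "fare": 10.0},
    {"user_id": 2, "fare": 30.0},
    {"user_id": 3, "fare": 8.0},
]


def to_columns(rows):
    """Pivot row dicts into one typed, contiguous array per column."""
    return {
        "user_id": array("q", (r["user_id"] for r in rows)),  # 64-bit ints
        "fare": array("d", (r["fare"] for r in rows)),        # 64-bit floats
    }


columns = to_columns(rows)

# The aggregate reads only the "fare" buffer and never touches "user_id".
total_fare = sum(columns["fare"])
print(total_fare)
```

A fixed-width, contiguous layout like this is also what makes it practical to hand buffers to compiled code without converting them value by value.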
  • 12. Benefits of Ibis
      • Maximize developer productivity
          • Mirrors single-node Python experience
          • Solve big data problems without leaving Python
          • Leverage Python skills, ecosystem, and tools
      • Python as first-class language for Hadoop
          • Full-fidelity analysis without extractions
          • Python analysis at any scale
          • Native hardware speeds for a broad set of use cases
  • 13. Thank you
      wes@cloudera.com