Ibis: Scaling Python Analytics on Hadoop and Impala

Wes McKinney, SF Data Mining Meetup, 2015-10-22
@wesmckinn

© Cloudera, Inc. All rights reserved.
  
Me

• R&D at Cloudera
• Serial creator of structured data tools / user interfaces
• Mathematician, MIT ‘07
• “Professional SQL programmer” 2007-2010 (@ AQR)
• Created pandas (Python library) in 2008
• Wrote bestseller Python for Data Analysis (2012)
• Founder of DataPad
  
Python is popular…

• Python has become a standard language of data science
• Why is it popular?
    • Maximizes productivity for data engineers and data scientists
    • Build robust software and do interactive data analysis with 100% Python code
    • Easy to learn, and makes for happy and productive data teams
    • Large, diverse open source development community
    • Comprehensive libraries: data wrangling, ML, visualization, etc.
• Main use case: data science & engineering Swiss Army knife on small-to-medium size data
  
…but Python does not scale today

• Python ecosystem confined to single-node analysis
    • Great for smaller data sets
    • Requires sampling or aggregations for larger data
    • Distributed tools compromise in various ways
• Extracting samples or aggregations for larger data means:
    • “Scales” by losing more fidelity
    • Additional ETL overhead to extract samples/aggregations
    • Loss of productivity with multiple languages, tools, etc.
    • Blocks certain analysis and use cases
  
Some simplistic generalizations

Industry Analytics
• Heterogeneous data
    • Flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines

Scientific Computing
• Homogeneous data
    • Multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats (e.g. HDF5)
• Fewer physical machines

Python: heavy investment, generally
Python: light investment, generally
  
pandas

• Hugely popular Python table / “data frame” library
    • Labeled table, array, and time series data structures
• Popular for data preparation, ETL, and in-memory analytics
• Built using Python’s scientific computing stack
    • User API / domain specific language
    • Bespoke in-memory analytics / relational algebra engine
    • IO interfaces (CSV, SQL, etc.)
    • Expanded data type system (beyond NumPy)
• Supports flat data only (or semi-structured data that can be flattened)
  
Many SQL engines

… and more
  
The “Great Decoupling” for Big Data

• UI: Ibis, SQL, Spark API, …
• Compute: Analytic SQL, Spark, MapReduce
• Storage: HDFS, Kudu, HBase
  
A sample big data architecture

Diagram components: Kafka (four instances), application data, HDFS, JSON, Spark/MapReduce, columnar storage, analytic SQL engine, user, SQL.
  
Nested / Complex types support

• Arrays, structs, maps, and unions as first-class value types
• Analyze JSON-like data directly without flattening or normalization
• Most new SQL engines have some level of support
    • Impala
    • Presto
    • Drill
    • BigQuery
    • Spark SQL
    • Hive
    • …
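To make the "without flattening" point concrete, here is a minimal pure-Python sketch. The record layout and field names are made up for illustration, and it says nothing about how any of the listed engines implement complex types. It contrasts exploding an array of structs into flat rows with aggregating over the array as a first-class value.

```python
import json

# Hypothetical JSON-like records: each order carries a struct and an array of structs.
orders = [json.loads(doc) for doc in (
    '{"id": 1, "customer": {"name": "ann"}, "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}]}',
    '{"id": 2, "customer": {"name": "bob"}, "items": []}',
)]


def flatten(order):
    """Flattening: one output row per array element, parent fields repeated."""
    return [
        {"id": order["id"], "customer_name": order["customer"]["name"],
         "sku": item["sku"], "qty": item["qty"]}
        for item in order["items"]
    ]


flat_rows = [row for order in orders for row in flatten(order)]

# Direct analysis: the array stays a value of the row, so a per-order
# aggregate needs no regrouping and orders with empty arrays are kept.
qty_per_order = {o["id"]: sum(i["qty"] for i in o["items"]) for o in orders}
```

In this sketch the flattened form silently drops order 2, because it has no items to explode, while the direct form still reports it with a quantity of zero.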
  
Ibis in a nutshell

• For Python programmers doing analytics in industry
• Project Blog: http://blog.ibis-project.org
• Joint project with Impala team @ Cloudera
• Apache-licensed, open source http://github.com/cloudera/ibis
• Crafting a compelling Python-on-Hadoop user experience
    • Remove SQL coding from user workflows
    • Develop high performance Python extension APIs
  
Ibis in a nutshell, cont’d

• Composable Python DSL (“Ibis expressions”) makes hand-coding SQL SELECT statements unnecessary
• Ibis for SQL Programmers: http://docs.ibis-project.org/sql.html
• Development roadmap targets Impala (C++ / LLVM) query engine
    • … but SQL compiler toolchain is general purpose
• Currently supports Impala and SQLite, but soon other dialects
    • We welcome external contributors for other analytic SQL engines
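As a rough illustration of what a composable expression DSL buys over hand-written SELECT statements, here is a toy sketch. It is not the Ibis API: the `Table` and `Col` classes, their methods, and the `trips` table are all hypothetical, and SQLite is used only because it ships with Python. The idea it shows is that expressions are immutable values that can be reused and extended, with SQL text generated only at the end.

```python
import sqlite3


def _sql(value):
    """Render a Col or a Python literal as SQL text (toy quoting only)."""
    if isinstance(value, Col):
        return value.sql
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return repr(value)


class Col:
    """Hypothetical column expression; operators build SQL text lazily."""

    def __init__(self, sql):
        self.sql = sql

    def __gt__(self, other):
        return Col(f"({self.sql} > {_sql(other)})")

    def sum(self):
        return Col(f"SUM({self.sql})")


class Table:
    """Hypothetical immutable table expression; every method returns a new Table."""

    def __init__(self, name, where=(), keys=(), metrics=()):
        self.name, self.where, self.keys, self.metrics = name, where, keys, metrics

    def __getitem__(self, column):
        return Col(column)

    def filter(self, predicate):
        return Table(self.name, self.where + (predicate,), self.keys, self.metrics)

    def group_by(self, *keys):
        return Table(self.name, self.where, self.keys + keys, self.metrics)

    def aggregate(self, **metrics):
        return Table(self.name, self.where, self.keys,
                     self.metrics + tuple(metrics.items()))

    def to_sql(self):
        cols = list(self.keys) + [f"{expr.sql} AS {alias}" for alias, expr in self.metrics]
        sql = f"SELECT {', '.join(cols) or '*'} FROM {self.name}"
        if self.where:
            sql += " WHERE " + " AND ".join(p.sql for p in self.where)
        if self.keys:
            sql += " GROUP BY " + ", ".join(self.keys)
        return sql


# Demo data in an in-memory SQLite database (table and column names are made up).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (city TEXT, fare REAL)")
con.executemany("INSERT INTO trips VALUES (?, ?)",
                [("SF", 12.0), ("SF", 30.0), ("NYC", 8.0), ("NYC", 25.0), ("NYC", 40.0)])

trips = Table("trips")
long_trips = trips.filter(trips["fare"] > 10)   # reusable building block
by_city = long_trips.group_by("city").aggregate(total=trips["fare"].sum())

query = by_city.to_sql()
rows = sorted(con.execute(query).fetchall())
```

Because `long_trips` is an ordinary value, it can feed several downstream expressions without copying and editing SQL strings by hand.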
  
Benefits of Ibis

• Maximize developer productivity
    • Mirrors single-node Python experience
    • Solve big data problems without leaving Python
    • Leverage Python skills, ecosystem, and tools
• Python as first-class language for Hadoop
    • Full-fidelity analysis without extractions
    • Python analysis at any scale
    • Native hardware speeds for a broad set of use cases
  
Brief interactive demo
  
Ibis/Impala Joint Roadmap

• More natural data modeling
    • Complex types support
• Integration with full Python data ecosystem
    • Advanced analytics + machine learning
    • Enable use of performance computing tools
• User extensibility with native performance
    • In-memory columnar format
    • Python-to-LLVM IR compilation
• Workflow and usability tools
  
Executing data science languages in the compute layer

• UI: Ibis, SQL, Spark API, …
• Compute: Analytic SQL, Spark, MapReduce
• Storage: HDFS, Kudu, HBase

Python, R, Julia, …?
  
Enabling interoperability with big data systems

• Distributed / MPP query engines: implemented in a host language
    • Typically C/C++ or Java/Scala
• User-defined functions (UDFs) through various means
    • Implement in host language
    • Implement in user language through some external language protocol (often RPC-based)
• External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)
  
What are UDFs good for?

• Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
• Custom data transformations
• Custom domain logic (date / time / data types)
• Custom data types
• Custom aggregations (incl. machine learning / statistics expressible as reductions)
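The phrase "expressible as reductions" can be sketched with a user-defined aggregate split into update, merge, and finalize steps, which is the shape a distributed engine needs in order to combine partial results from several workers. The function names and the two-partition split below are assumptions for illustration, not any engine's UDF interface. The arithmetic is Welford's online update plus the standard pairwise merge of partial variance states.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class MeanVarState:
    """Partial state: count, running mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def update(state, x):
    """Fold one value into a partial state (Welford's online update)."""
    n = state.n + 1
    delta = x - state.mean
    mean = state.mean + delta / n
    return MeanVarState(n, mean, state.m2 + delta * (x - mean))


def merge(a, b):
    """Combine two partial states, e.g. produced by two different workers."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return MeanVarState(n, mean, m2)


def finalize(state):
    """Return (mean, sample variance); assumes at least two values were seen."""
    return state.mean, state.m2 / (state.n - 1)


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
partitions = [values[:3], values[3:]]   # stand-in for data held by two workers
partials = [reduce(update, part, MeanVarState()) for part in partitions]
mean, var = finalize(reduce(merge, partials))
```

Any statistic that admits an associative `merge` like this one can be computed in parallel over partitions and combined at the end.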
  
Why are external UDFs slow?

• Serialization / deserialization overhead
• Scalar vs vectorized computations
• RPC overhead
  
Example: Vectorization for interpreted languages
  
SUM(CASE WHEN x > y
THEN x
ELSE x + y
END)
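The SUM(CASE …) expression above can be evaluated one row at a time or one batch of column values at a time. The sketch below is a pure-Python illustration under stated assumptions: the function names and the batch size of 1024 are made up, and a real vectorized UDF would typically run an array kernel rather than a list comprehension. It counts how many times the engine has to cross into the UDF, which is the per-call cost that vectorization amortizes.

```python
def scalar_udf(x, y):
    """Row-at-a-time CASE WHEN x > y THEN x ELSE x + y END."""
    return x if x > y else x + y


def vectorized_udf(xs, ys):
    """Same expression over whole column batches; one call handles many rows."""
    return [x if x > y else x + y for x, y in zip(xs, ys)]


def sum_scalar(xs, ys):
    """Call the UDF once per row and count the calls."""
    total, calls = 0, 0
    for x, y in zip(xs, ys):
        total += scalar_udf(x, y)
        calls += 1
    return total, calls


def sum_vectorized(xs, ys, batch_size=1024):
    """Call the UDF once per batch of rows and count the calls."""
    total, calls = 0, 0
    for start in range(0, len(xs), batch_size):
        stop = start + batch_size
        total += sum(vectorized_udf(xs[start:stop], ys[start:stop]))
        calls += 1
    return total, calls


xs = list(range(10_000))
ys = [5_000] * 10_000
scalar_total, scalar_calls = sum_scalar(xs, ys)
vector_total, vector_calls = sum_vectorized(xs, ys)
```

Both paths return the same sum, but the scalar path makes 10,000 UDF calls while the batched path makes 10.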
  
Vectorized vs Interpreted perf
  
How to make them fast?

• Common runtime memory representation for tabular data
• Shared-memory (zero-copy or memcpy-only) external UDF protocol
• Vectorized UDF interface (for interpreted languages)
• Impala is uniquely positioned to play well with Ibis
    • Best-in-class performance and scalability
    • C++ and LLVM-based (JIT compiler) runtime
    • Unified, efficient data interchange amongst Ibis, Impala, and Kudu will enable high performance real time analytics from Python
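The difference between a serialization-based hand-off and a zero-copy one can be shown inside a single Python process. This is only an analogy under stated assumptions: `pickle` stands in for whatever wire format an external UDF protocol might use, and a `memoryview` over an `array` stands in for a shared buffer. It is not a description of the Impala protocol.

```python
import pickle
from array import array

column = array("d", [1.0, 2.0, 3.0, 4.0])   # contiguous float64 buffer

# Serialization path: the consumer receives an independent copy of the data.
copied = pickle.loads(pickle.dumps(column))
copied[0] = 99.0

# Zero-copy path: a memoryview exposes the same buffer without copying it.
shared = memoryview(column)
shared[1] = 42.0

shares_memory = column[1] == 42.0      # the write through the view is visible
copy_is_detached = column[0] == 1.0    # the write to the copy is not
```

The copy costs an encode and a decode pass over every value, while the view costs nothing per value; that per-value cost is what a shared-memory protocol aims to avoid.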
  
Memory representation

• Many query engines are standardizing on in-memory columnar representation of materialized transient data
    • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
    • Apache Drill: https://drill.apache.org/faq/
• Industry-standard serialization format: Apache Parquet
    • https://parquet.apache.org/
  
Serialization vs In-memory

• Serialization formats (e.g. Parquet)
    • Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
    • Do not consider random access or in-memory analytics as a goal
• No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
  
Standardized in-memory columnar (IMC)

• Compact in-memory representation for semistructured data
• Part of Impala’s upcoming dev roadmap
• Some prior IMC-for-SQL work: Apache Drill
• Standardized memory representation means data can be shared without serialization
• Create a canonical C/C++ implementation for use in Python / R / Julia
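For readers unfamiliar with columnar layouts, the following sketch contrasts row-oriented records with per-column typed buffers. It is a simplified illustration, not the format on Impala's roadmap: the field names are made up, only flat columns are shown, and nulls are tracked with a plain Python list where a real format would likely use a compact bitmap.

```python
from array import array

# Row-oriented: one Python object per record (field names are made up).
rows = [
    {"user_id": 1, "score": 0.5},
    {"user_id": 2, "score": None},
    {"user_id": 3, "score": 2.5},
]

# Column-oriented: one contiguous typed buffer per field, plus a validity
# list marking which slots hold real values (nulls get a placeholder).
columns = {
    "user_id": array("q", (r["user_id"] for r in rows)),
    "score": array("d", (r["score"] if r["score"] is not None else 0.0 for r in rows)),
}
valid = {"score": [r["score"] is not None for r in rows]}

# An aggregate reads only the one column it needs, skipping null slots.
score_sum = sum(v for v, ok in zip(columns["score"], valid["score"]) if ok)
bytes_per_score = columns["score"].itemsize
```

Because each column is a fixed-width buffer with a known element type, a consumer written in another language could read it directly if both sides agreed on the layout, which is the motivation for standardizing it.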
  
Ibis’s Vision

• Uncompromised Python experience
    • 100% Python end-to-end user workflows
    • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
• Interactive at big data scale
    • Full-fidelity analysis without extractions
    • Scalability for big data
    • Native hardware speeds for a broad set of use cases
  
Thank you

Wes McKinney @wesmckinn
Views are my own
  

Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney

  • 1. 1  ©  Cloudera,  Inc.  All  rights  reserved.   Ibis:  Scaling  Python  Analy=cs   on  Hadoop  and  Impala   Wes  McKinney,  SF  Data  Mining  Meetup  2015-­‐10-­‐22   @wesmckinn  
  • 2. 2  ©  Cloudera,  Inc.  All  rights  reserved.   Me   •  R&D  at  Cloudera   •  Serial  creator  of  structured  data  tools  /  user  interfaces   •  Mathema=cian  —  MIT  ‘07   •  “Professional  SQL  programmer”  2007-­‐2010  (@  AQR)   •  Created  pandas  (Python  library)  in  2008   •  Wrote  bestseller  Python  for  Data  Analysis  2012   •  Founder  of  DataPad    
  • 3. 3  ©  Cloudera,  Inc.  All  rights  reserved.   Python  is  popular…   •  Python  has  become  a  standard  language  of  data  science   •  Why  is  it  popular?   • Maximizes  produc=vity  for  data  engineers  and  data  scien=sts   • Build  robust  sobware  and  do  interac=ve  data  analysis  with  100%  Python  code     • Easy-­‐to-­‐learn  and  makes  happy  and  produc=ve  data  teams     • Large,  diverse  open  source  development  community   • Comprehensive  libraries:  data  wrangling,  ML,  visualiza=on,  etc.   •  Main  use  case:  data  science  &  engineering  swiss  army  knife  on  small-­‐to-­‐medium   size  data  
  • 4. 4  ©  Cloudera,  Inc.  All  rights  reserved.   …but  Python  does  not  scale  today   •  Python  ecosystem  confined  to  single-­‐node  analysis   • Great  for  smaller  data  sets   • Requires  sampling  or  aggrega=ons  for  larger  data   • Distributed  tools  compromise  in  various  ways   •  Extrac=ng  samples  or  aggrega=ons  for  larger  data  means:   • “Scales”  by  losing  more  fidelity   • Addi=onal  ETL  overhead  to  extract  samples/aggrega=ons   • Loss  of  produc=vity  with  mul=ple  languages,  tools,  etc   • Blocks  certain  analysis  and  use  cases  
  • 5. 5  ©  Cloudera,  Inc.  All  rights  reserved.   Industry  Analy=cs   Scien=fic  Compu=ng   Heterogeneous  data          Flat  tables  and  JSON   Spark  /  MapReduce   SQL   DFS-­‐friendly  /  streaming  data  formats   More  physical  machines   Homogeneous  data          Mul=dimensional  arrays   HPC  tools   Linear  algebra   Scien=fic  data  formats   Fewer  physical  machines   Some  simplis=c  generaliza=ons  
  • 6. 6  ©  Cloudera,  Inc.  All  rights  reserved.   Industry  Analy=cs   Scien=fic  Compu=ng   Heterogeneous  data          Flat  tables  and  JSON   Spark  /  MapReduce   SQL   DFS-­‐friendly  /  streaming  data  formats   More  physical  machines   Homogeneous  data          Mul=dimensional  arrays   HPC  tools   Linear  algebra   Scien=fic  data  formats  (e.g.  HDF5)   Fewer  physical  machines   Some  simplis=c  generaliza=ons   Python:  heavy  investment,     generally   Python:  light  investment,   generally  
  • 7. 7  ©  Cloudera,  Inc.  All  rights  reserved.   pandas   •  Hugely  popular  Python  table  /  “data  frame”  library   • Labeled  table,  array,  and  =me  series  data  structures   •  Popular  for  data  prepara=on,  ETL,  and  in-­‐memory  analy=cs   •  Built  using  Python’s  scien=fic  compu=ng  stack   • User  API  /  domain  specific  language   • Bespoke  in-­‐memory  analy=cs  /  rela=onal  algebra  engine   • IO  interfaces  (CSV,  SQL,  etc.)   • Expanded  data  type  system  (beyond  NumPy)   •  Supports  flat  data  only  (or  semi-­‐structured  data  that  can  be  flaqened)  
  • 8. 8  ©  Cloudera,  Inc.  All  rights  reserved.   Many  SQL  engines   …  and  more  
  • 9. 9  ©  Cloudera,  Inc.  All  rights  reserved.   The  “Great  Decoupling”  for  Big  Data   UI Ibis, SQL, Spark API, … Compute Analytic SQL, Spark, MapReduce Storage HDFS, Kudu, HBase
  • 10. 10  ©  Cloudera,  Inc.  All  rights  reserved.   A  sample  big  data  architecture   Kafka Kafka Kafka Kafka Application data HDFS JSON Spark/MapReduce Columnar storage Analytic SQL Engine User SQL
  • 11. 11  ©  Cloudera,  Inc.  All  rights  reserved.   Nested  /  Complex  types  support   •  Arrays,  structs,  maps,  and  unions  as  first-­‐class  value  types   •  Analyze  JSON-­‐like  data  directly  without  flaqening  or  normaliza=on   •  Most  new  SQL  engines  have  some  level  of  support   • Impala   • Presto   • Drill   • BigQuery   • Spark  SQL   • Hive   • …  
  • 12. 12  ©  Cloudera,  Inc.  All  rights  reserved.   Ibis  in  a  nutshell   •  For  Python  programmers  doing  analy=cs  in  industry   •  Project  Blog:  hqp://blog.ibis-­‐project.org   •  Joint  project  with  Impala  team  @  Cloudera   •  Apache-­‐licensed,  open  source  hqp://github.com/cloudera/ibis     •  Crabing  a  compelling  Python-­‐on-­‐Hadoop  user  experience   • Remove  SQL  coding  from  user  workflows   • Develop  high  performance  Python  extension  APIs  
  • 13. Ibis in a nutshell, cont’d
      • Composable Python DSL (“Ibis expressions”) makes hand-coding SQL SELECT statements unnecessary
      • Ibis for SQL Programmers: http://docs.ibis-project.org/sql.html
      • Development roadmap targets Impala (C++ / LLVM) query engine
          • … but SQL compiler toolchain is general purpose
      • Currently supports Impala and SQLite, but soon other dialects
          • We welcome external contributors for other Analytic SQL engines
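The idea of composing Python expressions that compile to a SQL SELECT can be sketched with a toy example. The `Expr` and `Table` classes below are hypothetical and are deliberately not the Ibis API; they only illustrate the expression-to-SQL approach, with the standard library's `sqlite3` standing in as the SQL engine.

```python
import sqlite3


def _lit(value):
    """Render a Python value or an Expr as a SQL fragment (numbers only here)."""
    return value.sql if isinstance(value, Expr) else repr(value)


class Expr:
    """Tiny, hypothetical expression node; NOT the Ibis API."""

    def __init__(self, sql):
        self.sql = sql

    def __gt__(self, other):
        return Expr(f"({self.sql} > {_lit(other)})")

    def sum(self):
        return Expr(f"SUM({self.sql})")


class Table:
    """Hypothetical table reference that accumulates filter predicates."""

    def __init__(self, name, predicates=()):
        self.name = name
        self.predicates = tuple(predicates)

    def __getitem__(self, column):
        return Expr(column)

    def filter(self, predicate):
        return Table(self.name, self.predicates + (predicate,))

    def aggregate(self, **metrics):
        """Compile the accumulated expression into a SQL SELECT string."""
        cols = ", ".join(f"{expr.sql} AS {name}" for name, expr in metrics.items())
        sql = f"SELECT {cols} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(p.sql for p in self.predicates)
        return sql


t = Table("trades")
query = t.filter(t["qty"] > 10).aggregate(total=t["price"].sum())

# Run the generated SQL against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)", [(5, 1.0), (20, 2.5), (30, 4.0)]
)
result = conn.execute(query).fetchone()[0]
```

The user writes only Python; the SQL string is an implementation detail that a different backend could render in its own dialect.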
  • 15. Benefits of Ibis
      • Maximize developer productivity
          • Mirrors single-node Python experience
          • Solve big data problems without leaving Python
          • Leverage Python skills, ecosystem, and tools
      • Python as first-class language for Hadoop
          • Full-fidelity analysis without extractions
          • Python analysis at any scale
          • Native hardware speeds for a broad set of use cases
  • 16. Brief interactive demo
  • 17. Ibis/Impala Joint Roadmap
      • More natural data modeling
          • Complex types support
      • Integration with full Python data ecosystem
          • Advanced analytics + machine learning
          • Enable use of performance computing tools
      • User extensibility with native performance
          • In-memory columnar format
          • Python-to-LLVM IR compilation
      • Workflow and usability tools
  • 18. Executing data science languages in the compute layer
      • UI: Ibis, SQL, Spark API, …
      • Compute: Analytic SQL, Spark, MapReduce
      • Storage: HDFS, Kudu, HBase
      • Python, R, Julia, …?
  • 19. Enabling interoperability with big data systems
      • Distributed / MPP query engines: implemented in a host language
          • Typically C/C++ or Java/Scala
      • User-defined functions (UDFs) through various means
          • Implement in host language
          • Implement in user language through some external language protocol (often RPC-based)
      • External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)
  • 20. What are UDFs good for?
      • Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
      • Custom data transformations
      • Custom domain logic (date / time / data types)
      • Custom data types
      • Custom aggregations (incl. machine learning / statistics expressible as reductions)
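As a small illustration of the two UDF kinds listed above, the sketch below registers a scalar UDF and an aggregate UDF with the standard library's `sqlite3` module. SQLite is only a stand-in here; Hive and Impala register UDFs through their own mechanisms. The `to_bucket` function, the `SumOfSquares` class, and the `prices` table are hypothetical.

```python
import sqlite3


def to_bucket(price):
    """Scalar UDF: custom domain logic applied to one value at a time."""
    return "high" if price >= 100 else "low"


class SumOfSquares:
    """Aggregate UDF: a statistic expressed as a reduction over rows."""

    def __init__(self):
        self.total = 0.0

    def step(self, value):
        # Called once per input row in the group.
        self.total += value * value

    def finalize(self):
        # Called once per group to produce the result.
        return self.total


conn = sqlite3.connect(":memory:")
conn.create_function("to_bucket", 1, to_bucket)
conn.create_aggregate("sum_sq", 1, SumOfSquares)

conn.execute("CREATE TABLE prices (price REAL)")
conn.executemany("INSERT INTO prices VALUES (?)", [(50.0,), (150.0,), (200.0,)])

rows = conn.execute(
    "SELECT to_bucket(price), sum_sq(price) FROM prices "
    "GROUP BY to_bucket(price) ORDER BY 1"
).fetchall()
```

Note that the engine calls back into Python once per row for both functions, which is the scalar calling pattern the later slides identify as a performance problem.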
  • 21. Why are external UDFs slow?
      • Serialization / deserialization overhead
      • Scalar vs vectorized computations
      • RPC overhead
  • 22. Example: Vectorization for interpreted languages
      • SUM(CASE WHEN x > y THEN x ELSE x + y END)
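The expression above can be evaluated through two different UDF calling conventions. The sketch below is pure Python and shows only the shape of the two interfaces: one call per row versus one call per batch of column values. The function names and sample data are hypothetical, and a real vectorized engine would run the inner loop in compiled code rather than in the interpreter.

```python
def case_udf(x, y):
    """Scalar interface: the engine makes one interpreter-level call per row."""
    return x if x > y else x + y


def sum_scalar(xs, ys):
    """Row-at-a-time evaluation: call overhead is paid for every row."""
    total = 0
    for x, y in zip(xs, ys):
        total += case_udf(x, y)
    return total


def sum_batched(xs, ys):
    """Batched interface: one call receives whole columns of values."""
    return sum(x if x > y else x + y for x, y in zip(xs, ys))


xs = [5, 1, 7, 2]
ys = [3, 4, 7, 9]

scalar_result = sum_scalar(xs, ys)
batched_result = sum_batched(xs, ys)
```

Both functions compute the same value; the difference is how many times control crosses the boundary between the engine and the user's code.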
  • 23. Vectorized vs Interpreted perf
  • 24. How to make them fast?
      • Common runtime memory representation for tabular data
      • Shared-memory (zero-copy or memcpy-only) external UDF protocol
      • Vectorized UDF interface (for interpreted languages)
      • Impala is uniquely positioned to play well with Ibis
          • Best-in-class performance and scalability
          • C++ and LLVM-based (JIT compiler) runtime
          • Unified, efficient data interchange amongst Ibis, Impala, and Kudu will enable high performance real time analytics from Python
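The difference between a serialization-based hand-off and a zero-copy one can be seen with standard library tools alone. This is a minimal sketch, not the protocol the slides propose: `pickle` stands in for any serialize/deserialize step, and `memoryview` stands in for a shared buffer.

```python
import pickle
from array import array

# A column of 64-bit floats stored in one contiguous buffer.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# Serialization-based hand-off: the consumer receives an independent copy.
copied = pickle.loads(pickle.dumps(column))

# Zero-copy hand-off: the memoryview exposes the producer's own buffer.
shared = memoryview(column)

# A write by the producer is visible through the view but not in the copy.
column[0] = 99.0
copy_first = copied[0]
shared_first = shared[0]
```

Slicing the view (for example `shared[1:3]`) also avoids copying, which is what makes passing batches of a column to a UDF cheap.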
  • 25. Memory representation
      • Many query engines are standardizing on in-memory columnar representation of materialized transient data
          • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
          • Apache Drill: https://drill.apache.org/faq/
      • Industry-standard serialization format: Apache Parquet
          • https://parquet.apache.org/
  • 26. Serialization vs In-memory
      • Serialization formats (e.g. Parquet)
          • Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
          • Do not consider random access or in-memory analytics as a goal
      • No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
  • 27. Standardized in-memory columnar (IMC)
      • Compact in-memory representation for semistructured data
      • Part of Impala’s upcoming dev roadmap
      • Some prior IMC-for-SQL work: Apache Drill
      • Standardized memory representation means data can be shared without serialization
      • Create a canonical C/C++ implementation for use in Python / R / Julia
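One common way to hold semistructured data compactly in columnar form is a contiguous values buffer plus an offsets buffer. The sketch below assumes that layout for illustration; the slides do not specify the actual format, and the sample rows and the `get_row` helper are hypothetical.

```python
from array import array

# Hypothetical rows, each holding a variable-length list of integers.
rows = [[1, 2, 3], [], [4, 5]]

# Columnar encoding: all values back to back, plus one offset per row boundary.
values = array("q")
offsets = array("q", [0])
for row in rows:
    values.extend(row)
    offsets.append(len(values))


def get_row(i):
    """Recover row i as a slice of the shared values buffer."""
    return values[offsets[i]:offsets[i + 1]].tolist()


# Column-wise work scans a single contiguous buffer.
total = sum(values)
```

Because both buffers are plain typed arrays, they can be handed to another runtime through a buffer interface without a serialization step.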
  • 28. Ibis’s Vision
      • Uncompromised Python experience
          • 100% Python end-to-end user workflows
          • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
      • Interactive at big data scale
          • Full-fidelity analysis without extractions
          • Scalability for big data
          • Native hardware speeds for a broad set of use cases
  • 29. Thank you
      • Wes McKinney @wesmckinn
      • Views are my own