Introduction to Apache Drill
February 27, 2013
Tomer Shiran
  
Who Am I?
•  Tomer Shiran
•  tshiran@maprtech.com
•  Founding Member and Committer, Apache Drill
•  Director of Product Management, MapR Technologies
  
	
  
Agenda
•  Apache Drill overview
•  Key features
•  Status and progress
•  How to get involved
  
Big Data Workloads
•  ETL
•  Data mining
•  Index and model generation
•  Clustering, anomaly detection and classification
•  Blob store
•  Lightweight OLTP on large datasets
•  Web crawling
•  Stream processing
•  Interactive analysis
  
Interactive Queries and Hadoop

Existing Solutions
•  Compile SQL (HiveQL) to MapReduce
•  Export MapReduce results to RDBMS and query RDBMS

Emerging Technologies
•  [logo]: Real-time interactive analysis
•  [logo]: PostgreSQL-based Hadoop analytics
•  [logo]: PostgreSQL-based Hadoop analytics
•  Stinger/Tez: Hive performance improvements
•  Impala: Real-time interactive analysis
•  HAWQ: PostgreSQL-based Hadoop analytics
•  Phoenix: SQL layer for HBase
•  [logo]: SQL layer for HBase
•  Cascading Lingual: Compile ANSI SQL to MapReduce
  
Example Problem
•  Jane works as an analyst at an e-commerce company
•  How does she figure out good targeting segments for the next marketing campaign?
•  She has some ideas and lots of data

[Diagram: transaction information, user profiles, access logs]
  
Solving the Problem with Traditional Systems
•  Use an RDBMS
     –  ETL the data from MongoDB and Hadoop into the RDBMS
            •  MongoDB data must be flattened, schematized, filtered and aggregated
            •  Hadoop data must be filtered and aggregated
     –  Query the data using any SQL-based tool
•  Use MapReduce
     –  ETL the data from Oracle and MongoDB into Hadoop
     –  Work with the MapReduce team to generate the desired analyses
•  Use Hive
     –  ETL the data from Oracle and MongoDB into Hadoop
            •  MongoDB data must be flattened and schematized
     –  But HiveQL is limited, queries take too long and BI tool support is limited
  
WWGD

               Distributed File System   NoSQL      Interactive analysis   Batch processing
Google         GFS                       BigTable   Dremel                 MapReduce
Open source    HDFS                      HBase      ???                    Hadoop MapReduce

Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
  
Apache Drill Overview
•  Interactive analysis of Big Data using standard SQL
•  Fast
      –  Low latency
      –  Columnar execution
             •  Inspired by Google Dremel/BigQuery
      –  Complement native interfaces and MapReduce/Hive/Pig
•  Open
      –  Community driven open source project
      –  Under Apache Software Foundation
•  Modern
      –  Standard ANSI SQL:2003 (select/into)
      –  Nested/hierarchical data support
      –  Schema is optional
      –  Supports RDBMS, Hadoop and NoSQL

[Diagram]
Apache Drill: interactive queries; data analyst, reporting; 100 ms-20 min
MapReduce, Hive, Pig: data mining, modeling, large ETL; 20 min-20 hr
  
How Does It Work?
•  Drillbits run on each node, designed to maximize data locality
•  Processing is done outside MapReduce paradigm (YARN is supported)
•  Queries can be fed to any Drillbit
•  Coordination, query planning, optimization, scheduling, and execution are distributed

SELECT * FROM
  oracle.transactions,
  mongo.users,
  hdfs.events
LIMIT 1
  
Key Features
•  Full SQL (ANSI SQL:2003)
•  Nested data
•  Schema is optional
•  Flexible and extensible architecture
  
Full SQL (ANSI SQL:2003)
•  Drill supports standard ANSI SQL:2003
     –  Correlated subqueries, analytic functions, …
     –  SQL-like is not enough
•  Use any SQL-based tool with Apache Drill
     –  Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, …
     –  Standard ODBC and JDBC drivers

[Diagram: client tools (Tableau, MicroStrategy, Excel, SAP Crystal Reports) connect through the Drill ODBC driver to a Drillbit, whose SQL query parser and query planner hand work to Drill workers on the Drillbits]
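To make "correlated subqueries" concrete, here is a minimal sketch of one. It uses Python's built-in sqlite3 module as a stand-in SQL engine rather than Drill itself, and the `orders` table, its columns, and the sample rows are invented for illustration.

```python
import sqlite3


def orders_above_customer_average(rows):
    """Return (customer, amount) pairs whose amount exceeds that customer's own average."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    # The inner query refers to o.customer from the outer query,
    # which is what makes the subquery "correlated".
    query = """
        SELECT o.customer, o.amount
        FROM orders AS o
        WHERE o.amount > (
            SELECT AVG(i.amount) FROM orders AS i WHERE i.customer = o.customer
        )
        ORDER BY o.customer, o.amount
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical sample data.
SAMPLE_ORDERS = [
    ("jane", 10.0), ("jane", 30.0),
    ("homer", 5.0), ("homer", 5.0), ("homer", 20.0),
]

if __name__ == "__main__":
    print(orders_above_customer_average(SAMPLE_ORDERS))
```

A dialect that is only "SQL-like" may reject the inner reference to the outer row, which is the kind of gap the slide is pointing at.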
Nested Data
•  Nested data is becoming prevalent
     –  JSON, BSON, XML, Protocol Buffers, Avro, …
     –  The data source may or may not be aware
           •  MongoDB supports nested data natively
           •  A single HBase value could be a JSON document (compound nested type)
     –  Google Dremel's innovation was efficient columnar storage and querying of nested data
•  Flattening nested data is error-prone and often impossible
     –  Think about repeated and optional fields at every level…
•  Apache Drill supports nested data
     –  Extensions to ANSI SQL:2003

JSON
{
  "name": "Homer",
  "gender": "Male",
  "followers": 100,
  "children": [
    {"name": "Bart"},
    {"name": "Lisa"}
  ]
}

Avro
enum Gender {
  MALE, FEMALE
}

record User {
  string name;
  Gender gender;
  long followers;
}
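The claim that flattening is error-prone can be illustrated with a short sketch. It flattens the Homer record into one row per child and shows how a naive aggregate over the flattened rows double-counts a parent field. The `flatten_children` helper and the `NED` record are hypothetical and are not part of Drill.

```python
import json


def flatten_children(user):
    """Flatten one nested user record into rows, one per child.

    Parent fields are repeated on every row. A user with no children
    yields a single row with child_name set to None.
    """
    children = user.get("children") or [{}]
    return [
        {
            "name": user["name"],
            "followers": user.get("followers"),
            "child_name": child.get("name"),
        }
        for child in children
    ]


HOMER = json.loads(
    '{"name": "Homer", "gender": "Male", "followers": 100, '
    '"children": [{"name": "Bart"}, {"name": "Lisa"}]}'
)
NED = {"name": "Ned", "followers": 7}  # hypothetical record without the repeated field

rows = flatten_children(HOMER) + flatten_children(NED)

# Summing followers over flattened rows counts Homer once per child.
naive_total = sum(row["followers"] for row in rows)
true_total = HOMER["followers"] + NED["followers"]

if __name__ == "__main__":
    print(naive_total, true_total)
```

With repeated or optional fields at several levels, every such choice (duplicate the parent, pad with nulls, or drop the row) has to be made again per level, which is why querying the nested form directly is attractive.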
  
Schema is Optional
•  Many data sources do not have rigid schemas
         –  Schemas change rapidly
         –  Each record may have a different schema
                 •  Sparse and wide rows in HBase and Cassandra, MongoDB
•  Apache Drill supports querying against unknown schemas
         –  Query any HBase, Cassandra or MongoDB table
•  User can define the schema or let the system discover it automatically
         –  System of record may already have schema information
                 •  Why manage it in a separate system?
         –  No need to manage schema evolution

Row Key             CF contents                  CF anchor
"com.cnn.www"       contents:html = "<html>…"    anchor:my.look.ca = "CNN.com"
                                                 anchor:cnnsi.com = "CNN"
"com.foxnews.www"   contents:html = "<html>…"    anchor:en.wikipedia.org = "Fox News"
…                   …                            …
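One way to picture "let the system discover it automatically" is a pass over the rows that records every column seen and the value types observed. This is only a sketch over the two example rows from the table; `discover_schema` is a hypothetical helper and does not describe how Drill itself discovers schemas.

```python
def discover_schema(rows):
    """Infer column names and observed value type names from schema-less rows."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            schema.setdefault(column, set()).add(type(value).__name__)
    return schema


# The two sparse rows from the table, keyed by "family:qualifier".
ROWS = [
    {
        "contents:html": "<html>…",
        "anchor:my.look.ca": "CNN.com",
        "anchor:cnnsi.com": "CNN",
    },
    {
        "contents:html": "<html>…",
        "anchor:en.wikipedia.org": "Fox News",
    },
]

schema = discover_schema(ROWS)

if __name__ == "__main__":
    for column in sorted(schema):
        print(column, sorted(schema[column]))
```

Each row contributes different columns, so the discovered schema is the union of what was seen. A fixed, predeclared schema would have to anticipate every anchor qualifier in advance.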
  
Flexible and Extensible Architecture
•  Apache Drill is designed for extensibility
      –  Well-documented APIs and interfaces
•  Data sources and file formats
      –  Implement a custom scanner to support a new data source or file format
•  Query languages
      –  SQL:2003 is the primary language
      –  Implement a custom Parser to support a Domain Specific Language
      –  UDFs and UDTFs
•  Optimizers
      –  Drill will have a cost-based optimizer
      –  Clear surrounding APIs support easy optimizer exploration
•  Operators
      –  Custom operators can be implemented
             •  Special operators for Mahout (k-means) being designed
      –  Operator push-down to data source (RDBMS)
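The idea of a pluggable scanner can be sketched as a small interface that turns a data source into a stream of records. The `Scanner` and `CsvScanner` classes below are hypothetical illustrations of the pattern, not Drill's actual scanner API.

```python
import csv
import io
from abc import ABC, abstractmethod


class Scanner(ABC):
    """Hypothetical scanner contract: yield records as dictionaries."""

    @abstractmethod
    def scan(self):
        """Yield one dict per record in the underlying source."""


class CsvScanner(Scanner):
    """Example plug-in that exposes CSV text as records."""

    def __init__(self, text):
        self._text = text

    def scan(self):
        # DictReader uses the first line as the field names.
        yield from csv.DictReader(io.StringIO(self._text))


def count_records(scanner):
    """Engine-side code depends only on the Scanner contract."""
    return sum(1 for _ in scanner.scan())


if __name__ == "__main__":
    source = CsvScanner("name,followers\nHomer,100\nNed,7\n")
    print(count_records(source))
```

Supporting a new file format then means adding another `Scanner` subclass, while code such as `count_records` stays unchanged.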
  
What About Other SQL-on-Hadoop Systems?
•  Strengths
    –  Code is more mature than Apache Drill
           •  Already in beta (~2 quarters ahead)
    –  Faster than Hive on some queries
•  Weaknesses
    –  Proprietary or semi-open source
    –  Query results must fit in memory (no spooling)
    –  Early row materialization (no columnar execution)
    –  Some are SQL-like (not SQL)
    –  No support for nested data
    –  Rigid schema is required
    –  Limited flexibility and extensibility
    –  Only support Hadoop and HBase (and no other NoSQL or RDBMS)
  
Status: In Progress
•  Heavy active development
      –  6-7 companies are contributing
•  Available
      –  Logical plan syntax and interpreter
      –  Reference execution engine
•  In progress
      –  SQL interpreter
      –  Storage engine implementations for Accumulo, Cassandra, HBase and various file formats
•  Significant community momentum
      –  Over 250 people on the Drill mailing list
      –  Over 250 members of the Bay Area Drill User Group
      –  Drill meetups across the US and Europe
      –  OpenDremel team joined Apache Drill
•  Anticipated schedule:
      –  Prototype: Q1
      –  Alpha: Q2
      –  Beta: Q3
  
Why Apache Drill Will Be Successful

Resources
•  Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho

Community
•  Development done in the open
•  Active contributors from multiple companies
•  Rapidly growing

Architecture
•  Full SQL
•  New data support
•  Extensible APIs
•  Full Columnar Execution
•  Beyond Hadoop
  
Interested in Apache Drill?
•  Many options to contribute
   –  Become a full-time Drill engineer @ MapR
        •  Email tshiran@maprtech.com
   –  Join the Drill mailing list and start contributing
        •  JIRAs, code, unit tests, documentation, …
   –  Shoot me an email and we can discuss
        •  Email tshiran@maprtech.com
  
QUESTIONS?	
  
Why Not Leverage MapReduce?
•  Scheduling Model
    –  Coarse resource model reduces hardware utilization
    –  Acquisition of resources typically takes hundreds of milliseconds to seconds
•  Barriers
    –  Map completion required before shuffle/reduce commencement
    –  All maps must complete before reduce can start
    –  In chained jobs, one job must finish entirely before the next one can start
•  Persistence and Recoverability
    –  Data is persisted to disk between each barrier
    –  Serialization and deserialization are required between execution phases
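The cost of barriers and per-stage persistence can be shown with a toy latency model. It is a back-of-the-envelope sketch, not a measurement: the stage durations, startup cost and persistence cost are made-up numbers, and the pipelined figure is an idealized lower bound that assumes stages overlap completely.

```python
def chained_job_latency(stage_seconds, startup_seconds, persist_seconds):
    """Latency when every stage is a separate job behind a barrier.

    Each stage pays resource acquisition, runs to completion, and
    persists its output before the next stage may start.
    """
    return sum(startup_seconds + stage + persist_seconds for stage in stage_seconds)


def pipelined_lower_bound(stage_seconds, startup_seconds):
    """Idealized bound when stages stream into each other.

    Resources are acquired once and the slowest stage dominates.
    """
    return startup_seconds + max(stage_seconds)


# Hypothetical numbers, in seconds.
STAGES = [2.0, 3.0, 1.0]
STARTUP = 0.5
PERSIST = 1.0

if __name__ == "__main__":
    print(chained_job_latency(STAGES, STARTUP, PERSIST))
    print(pipelined_lower_bound(STAGES, STARTUP))
```

Even in this crude model the fixed per-stage costs add up across chained jobs, which matters most for short interactive queries where those costs are comparable to the useful work.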
  

An introduction to apache drill presentation

  • 1. Introduc)on  to  Apache  Drill   February  27,  2013   Tomer  Shiran  
  • 2. Who  Am  I?   •  Tomer  Shiran   •  [email protected]   •  Founding  Member  and  CommiGer,  Apache  Drill   •  Director  of  Product  Management,  MapR   Technologies    
  • 3. Agenda   •  Apache  Drill  overview   •  Key  features   •  Status  and  progress   •  How  to  get  involved  
  • 4. Big  Data  Workloads   •  ETL   •  Data  mining   •  Index  and  model  genera)on   •  Clustering,  anomaly  detec)on  and  classifica)on   •  Blob  store   •  Lightweight  OLTP  on  large  datasets   •  Web  crawling   •  Stream  processing   •  Interac)ve  analysis  
  • 5. Interac)ve  Queries  and  Hadoop   Exis)ng  Solu)ons   Compile  SQL  (HiveQL)  to   Export  MapReduce  results  to   MapReduce   RDBMS  and  query  RDBMS   Emerging  Technologies   Stinger/Tez Impala Real-­‐)me     PostgreSQL-­‐based   PostgreSQL-­‐based   Hive  performance   Real-­‐)me     interac)ve  analysis   Hadoop  analy)cs   Hadoop  analy)cs   improvements   interac)ve  analysis   HAWQ Phoenix   Cascading Lingual PostgreSQL-­‐based   Compile  ANSI  SQL  to   Hadoop  analy)cs   SQL  layer  for  HBase   SQL  layer  for  HBase   MapReduce  
  • 6. Example  Problem   •  Jane  works  as  an   Transac.on   analyst  at  an  e-­‐ informa.on   commerce  company   •  How  does  she  figure   User     out  good  targe)ng   profiles   segments  for  the  next   marke)ng  campaign?   •  She  has  some  ideas   Access   and  lots  of  data   logs  
  • 7. Solving  the  Problem  with  Tradi)onal  Systems   •  Use  an  RDBMS   –  ETL  the  data  from  MongoDB  and  Hadoop  into  the  RDBMS   •  MongoDB  data  must  be  flaGened,  schema)zed,  filtered  and  aggregated   •  Hadoop  data  must  be  filtered  and  aggregated   –  Query  the  data  using  any  SQL-­‐based  tool   •  Use  MapReduce   –  ETL  the  data  from  Oracle  and  MongoDB  into  Hadoop   –  Work  with  the  MapReduce  team  to  generate  the  desired  analyses   •  Use  Hive   –  ETL  the  data  from  Oracle  and  MongoDB  into  Hadoop   •  MongoDB  data  must  be  flaGened  and  schema)zed   –  But  HiveQL  is  limited,  queries  take  too  long  and  BI  tool  support  is   limited  
  • 8. WWGD • Columns: Distributed File System, NoSQL, Interactive analysis, Batch processing • Google: GFS, BigTable, Dremel, MapReduce • Hadoop: HDFS, HBase, ???, MapReduce • Build Apache Drill to provide a true open source solution to interactive analysis of Big Data
  • 9. Apache Drill Overview • Interactive analysis of Big Data using standard SQL • Fast – Low latency – Columnar execution • Inspired by Google Dremel/BigQuery – Complement native interfaces and MapReduce/Hive/Pig • Open – Community driven open source project – Under Apache Software Foundation • Modern – Standard ANSI SQL:2003 (select/into) – Nested/hierarchical data support – Schema is optional – Supports RDBMS, Hadoop and NoSQL • Diagram: Apache Drill covers interactive queries, data analyst and reporting workloads (100 ms-20 min); MapReduce, Hive and Pig cover data mining, modeling and large ETL (20 min-20 hr)
  • 10. How Does It Work? • Drillbits run on each node, designed to maximize data locality • Processing is done outside MapReduce paradigm (YARN is supported) • Queries can be fed to any Drillbit • Coordination, query planning, optimization, scheduling, and execution are distributed • Example query: SELECT * FROM oracle.transactions, mongo.users, hdfs.events LIMIT 1
  • 11. Key Features • Full SQL (ANSI SQL:2003) • Nested data • Schema is optional • Flexible and extensible architecture
  • 12. Full SQL (ANSI SQL:2003) • Drill supports standard ANSI SQL:2003 – Correlated subqueries, analytic functions, … – SQL-like is not enough • Use any SQL-based tool with Apache Drill – Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, … – Standard ODBC and JDBC drivers • Diagram: clients (Tableau, MicroStrategy, Excel, SAP Crystal Reports) use the Drill ODBC Driver to reach a Drillbit (SQL Query Parser, Query Planner), with Drill Workers running on the Drillbits
  • 13. Nested Data • Nested data is becoming prevalent – JSON, BSON, XML, Protocol Buffers, Avro, … – The data source may or may not be aware • MongoDB supports nested data natively • A single HBase value could be a JSON document (compound nested type) – Google Dremel's innovation was efficient columnar storage and querying of nested data • Flattening nested data is error-prone and often impossible – Think about repeated and optional fields at every level… • Apache Drill supports nested data – Extensions to ANSI SQL:2003 • JSON example: {"name": "Homer", "gender": "Male", "followers": 100, "children": [{"name": "Bart"}, {"name": "Lisa"}]} • Avro example: enum Gender { MALE, FEMALE } record User { string name; Gender gender; long followers; }
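The point about flattening being error-prone can be made concrete with a small sketch. This is plain Python, not Drill code; the helper `flatten_children` and the second record (`ned`) are hypothetical and exist only for illustration. It flattens the Homer record from the slide into one row per child, then shows how the repeated field duplicates parent values and how a record without the optional `children` field needs a placeholder row.

```python
# Illustrative sketch only: plain Python, not part of Apache Drill.
# flatten_children and the "ned" record are hypothetical.

def flatten_children(record):
    """Flatten one nested record into relational-style rows, one per child."""
    parent = {k: v for k, v in record.items() if k != "children"}
    children = record.get("children") or []  # optional field: may be absent
    if not children:
        # Nothing to repeat: keep the parent with a NULL-like child column.
        return [{**parent, "child_name": None}]
    return [{**parent, "child_name": child.get("name")} for child in children]


homer = {
    "name": "Homer",
    "gender": "Male",
    "followers": 100,
    "children": [{"name": "Bart"}, {"name": "Lisa"}],
}
ned = {"name": "Ned", "gender": "Male", "followers": 7}  # no "children" field

rows = flatten_children(homer) + flatten_children(ned)

# Homer's followers now appear once per child, so a naive SUM over the
# flattened rows counts them twice.
naive_total = sum(row["followers"] for row in rows)
true_total = homer["followers"] + ned["followers"]
```

Every additional repeated or optional field at another nesting level multiplies these cases, which is the motivation the slide gives for querying nested data directly.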
  • 14. Schema is Optional • Many data sources do not have rigid schemas – Schemas change rapidly – Each record may have a different schema • Sparse and wide rows in HBase and Cassandra, MongoDB • Apache Drill supports querying against unknown schemas – Query any HBase, Cassandra or MongoDB table • User can define the schema or let the system discover it automatically – System of record may already have schema information • Why manage it in a separate system? – No need to manage schema evolution • Example HBase table (Row Key | CF contents | CF anchor): "com.cnn.www" | contents:html = "<html>…" | anchor:my.look.ca = "CNN.com", anchor:cnnsi.com = "CNN"; "com.foxnews.www" | contents:html = "<html>…" | anchor:en.wikipedia.org = "Fox News"; … | … | …
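As a rough illustration of what "let the system discover it automatically" could mean, the sketch below scans records shaped like the HBase rows on the slide and reports which fields occur and which are missing from some records. It is plain Python under stated assumptions: `discover_schema` and `optional_fields` are hypothetical helpers, and nothing here reflects how Drill itself discovers schemas.

```python
# Illustrative sketch only: hypothetical helpers, not Apache Drill APIs.

def discover_schema(records):
    """Map each field name to the sorted type names observed for it."""
    seen = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def optional_fields(records):
    """Return the fields that are missing from at least one record."""
    all_fields = set().union(*(record.keys() for record in records))
    return sorted(f for f in all_fields if any(f not in r for r in records))


# Sparse, wide rows modeled on the slide's table (column names as keys).
table = [
    {"row_key": "com.cnn.www", "contents:html": "<html>…",
     "anchor:cnnsi.com": "CNN"},
    {"row_key": "com.foxnews.www", "contents:html": "<html>…",
     "anchor:en.wikipedia.org": "Fox News"},
]

schema = discover_schema(table)
sparse = optional_fields(table)
```

Here each row contributes its own anchor column, so the discovered schema is the union of the columns seen and the per-row differences show up as optional fields rather than as schema changes to manage.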
  • 15. Flexible and Extensible Architecture • Apache Drill is designed for extensibility – Well-documented APIs and interfaces • Data sources and file formats – Implement a custom scanner to support a new data source or file format • Query languages – SQL:2003 is the primary language – Implement a custom Parser to support a Domain Specific Language – UDFs and UDTFs • Optimizers – Drill will have a cost-based optimizer – Clear surrounding APIs support easy optimizer exploration • Operators – Custom operators can be implemented • Special operators for Mahout (k-means) being designed – Operator push-down to data source (RDBMS)
  • 16. What About Other SQL-on-Hadoop Systems? • Strengths – Code is more mature than Apache Drill • Already in beta (~2 quarters ahead) – Faster than Hive on some queries • Weaknesses – Proprietary or semi-open source – Query results must fit in memory (no spooling) – Early row materialization (no columnar execution) – Some are SQL-like (not SQL) – No support for nested data – Rigid schema is required – Limited flexibility and extensibility – Only support Hadoop and HBase (and no other NoSQL or RDBMS)
  • 17. Status: In Progress • Heavy active development – 6-7 companies are contributing • Available – Logical plan syntax and interpreter – Reference execution engine • In progress – SQL interpreter – Storage engine implementations for Accumulo, Cassandra, HBase and various file formats • Significant community momentum – Over 250 people on the Drill mailing list – Over 250 members of the Bay Area Drill User Group – Drill meetups across the US and Europe – OpenDremel team joined Apache Drill • Anticipated schedule: – Prototype: Q1 – Alpha: Q2 – Beta: Q3
  • 18. Why Apache Drill Will Be Successful • Resources – Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho • Community – Development done in the open – Active contributors from multiple companies – Rapidly growing • Architecture – Full SQL – New data support – Extensible APIs – Full Columnar Execution – Beyond Hadoop
  • 19. Interested in Apache Drill? • Many options to contribute – Become a full-time Drill engineer @ MapR • Email [email protected] – Join the Drill mailing list and start contributing • JIRAs, code, unit tests, documentation, … – Shoot me an email and we can discuss • Email [email protected]
  • 21. Why Not Leverage MapReduce? • Scheduling Model – Coarse resource model reduces hardware utilization – Acquisition of resources typically takes hundreds of milliseconds to seconds • Barriers – Map completion required before shuffle/reduce commencement – All maps must complete before reduce can start – In chained jobs, one job must finish entirely before the next one can start • Persistence and Recoverability – Data is persisted to disk between each barrier – Serialization and deserialization are required between execution phase