A New Generation of Data Transfer Tools for Hadoop: Sqoop 2

Bilung Lee (blee at cloudera dot com)
Kathleen Ting (kathleen at cloudera dot com)

Hadoop Summit 2012. 6/13/12 Apache Sqoop
Copyright 2012 The Apache Software Foundation
  
Who Are We?

•  Bilung Lee
    –  Apache Sqoop Committer
    –  Software Engineer, Cloudera
•  Kathleen Ting
    –  Apache Sqoop Committer
    –  Support Manager, Cloudera
  
What is Sqoop?

•  Bulk data transfer tool
    –  Import/Export from/to relational databases, enterprise data warehouses, and NoSQL systems
    –  Populate tables in HDFS, Hive, and HBase
    –  Integrate with Oozie as an action
    –  Support plugins via connector based architecture

May '09       First version (HADOOP-5815)
March '10     Moved to GitHub
August '11    Moved to Apache
April '12     Apache Top Level Project
  
Sqoop 1 Architecture

[Diagram labels: Sqoop command, Hadoop, Map Task, HDFS/HBase/Hive, Enterprise Data Warehouse, Document Based Systems, Relational Database]
  
Sqoop 1 Challenges

•  Cryptic, contextual command line arguments
•  Tight coupling between data transfer and output format
•  Security concerns with openly shared credentials
•  Not easy to manage installation/configuration
•  Connectors are forced to follow JDBC model
  
Sqoop 2 Architecture

[Figure: architecture diagram]
  
Sqoop 2 Themes

•  Ease of Use
•  Ease of Extension
•  Security
  
Ease of Use

Sqoop 1                              Sqoop 2
Client-only Architecture             Client/Server Architecture
CLI based                            CLI + Web based
Client access to Hive, HBase         Server access to Hive, HBase
Oozie and Sqoop tightly coupled      Oozie finds REST API
  
Sqoop 1: Client-side Tool

•  Client-side installation + configuration
    –  Connectors are installed/configured locally
    –  Local installation requires root privileges
    –  JDBC drivers are needed locally
    –  Database connectivity is needed locally
  
Sqoop 2: Sqoop as a Service

•  Server-side installation + configuration
    –  Connectors are installed/configured in one place
    –  Managed by administrator and run by operator
    –  JDBC drivers are needed in one place
    –  Database connectivity is needed on the server
  
Client Interface

•  Sqoop 1 client interface:
    –  Command line interface (CLI) based
    –  Can be automated via scripting
•  Sqoop 2 client interface:
    –  CLI based (in either interactive or script mode)
    –  Web based (remotely accessible)
    –  REST API is exposed for external tool integration
  
Sqoop 1: Service Level Integration

•  Hive, HBase
    –  Require local installation
•  Oozie
    –  von Neumann(esque) integration:
        •  Package Sqoop as an action
        •  Then run Sqoop from node machines, causing one MR job to be dependent on another MR job
        •  Error-prone, difficult to debug
  
Sqoop 2: Service Level Integration

•  Hive, HBase
    –  Server-side integration
•  Oozie
    –  REST API integration
  
Ease of Extension

Sqoop 1                                    Sqoop 2
Connector forced to follow JDBC model      Connector given free rein
Connectors must implement functionality    Connectors benefit from common framework of functionality
Connector selection is implicit            Connector selection is explicit
  
Sqoop 1: Implementing Connectors

•  Connectors are forced to follow JDBC model
    –  Connectors are limited/required to use common JDBC vocabulary (URL, database, table, etc.)
•  Connectors must implement all Sqoop functionality they want to support
    –  New functionality may not be available for previously implemented connectors
  
Sqoop 2: Implementing Connectors

•  Connectors are not restricted to JDBC model
    –  Connectors can define their own domain
•  Common functionality is abstracted out of connectors
    –  Connectors are only responsible for data transfer
    –  Common Reduce phase implements data transformation and system integration
    –  Connectors can benefit from future development of common functionality
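
The split between connector and common framework can be sketched in a few lines. This is a toy illustration, not Sqoop 2 code: `in_memory_connector`, `run_import`, and the two formatters are hypothetical names invented here. The connector only yields records; the output formats live in shared code that every connector gets for free.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def in_memory_connector(rows: List[Record]) -> Iterator[Record]:
    """Hypothetical connector: its only job is to hand over records."""
    for row in rows:
        yield dict(row)


def to_csv(records: Iterable[Record]) -> str:
    """Shared formatter: header taken from the first record's keys."""
    records = list(records)
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json_lines(records: Iterable[Record]) -> str:
    """Shared formatter: one JSON object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


# Common functionality is registered once, outside any connector.
FORMATTERS: Dict[str, Callable[[Iterable[Record]], str]] = {
    "csv": to_csv,
    "jsonl": to_json_lines,
}


def run_import(connector: Iterable[Record], output_format: str) -> str:
    """Framework entry point: the connector transfers, the framework formats."""
    return FORMATTERS[output_format](connector)


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
csv_out = run_import(in_memory_connector(rows), "csv")
jsonl_out = run_import(in_memory_connector(rows), "jsonl")
```

Adding a new entry to `FORMATTERS` makes the format available to every connector without touching connector code, which is the benefit the slide describes.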
  
Different Options, Different Results

Which is running MySQL?

$ sqoop import --connect jdbc:mysql://localhost/db \
    --username foo --table TEST

$ sqoop import --connect jdbc:mysql://localhost/db \
    --driver com.mysql.jdbc.Driver --username foo --table TEST

•  Different options may lead to unpredictable results
    –  Sqoop 2 requires explicit selection of a connector, thus disambiguating the process
  
Sqoop 1: Using Connectors

•  Choice of connector is implicit
    –  In a simple case, based on the URL in --connect string to access the database
    –  Specification of different options can lead to different connector selection
    –  Error-prone but good for power users
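
The implicit choice can be pictured as a lookup driven by the options the user happens to pass. The sketch below is a toy model, not Sqoop's selection logic: the connector names are made up, and the rule that naming a driver class bypasses the URL lookup is an assumption made for illustration. `pick_explicitly` shows the alternative, where the user names the connector and nothing is guessed.

```python
from typing import Dict, Optional

# Illustrative names only; these are not Sqoop's real connector identifiers.
URL_PREFIX_TO_CONNECTOR: Dict[str, str] = {
    "jdbc:mysql:": "mysql",
    "jdbc:postgresql:": "postgresql",
}
GENERIC = "generic-jdbc"


def pick_implicitly(connect: str, driver: Optional[str] = None) -> str:
    """Guess a connector from whatever options were supplied."""
    if driver is not None:
        # Assumption for this sketch: an explicit driver class skips the
        # URL-based lookup and falls back to a generic connector.
        return GENERIC
    for prefix, name in URL_PREFIX_TO_CONNECTOR.items():
        if connect.startswith(prefix):
            return name
    return GENERIC


def pick_explicitly(name: str) -> str:
    """Explicit selection: the user names the connector, nothing is inferred."""
    known = set(URL_PREFIX_TO_CONNECTOR.values()) | {GENERIC}
    if name not in known:
        raise ValueError(f"unknown connector: {name}")
    return name


first = pick_implicitly("jdbc:mysql://localhost/db")
second = pick_implicitly(
    "jdbc:mysql://localhost/db", driver="com.mysql.jdbc.Driver"
)
```

Under these assumptions the same URL resolves to two different connectors depending on an unrelated-looking option, which is the kind of surprise explicit selection avoids.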
  
Sqoop 1: Using Connectors

•  Require knowledge of database idiosyncrasies
    –  e.g. Couchbase has no need for a table name, yet one is required, so --table gets overloaded as a backfill or dump operation
    –  e.g. --null-string representation is not supported by all connectors
•  Functionality is limited to what the implicitly chosen connector supports
  
Sqoop 2: Using Connectors

•  Users make explicit connector choice
    –  Less error-prone, more predictable
•  Users need not be aware of the functionality of all connectors
    –  Couchbase users need not care that other connectors use tables
  
Sqoop 2: Using Connectors

•  Common functionality is available to all connectors
    –  Connectors need not worry about common downstream functionality, such as transformation into various formats and integration with other systems
  
Security

Sqoop 1                                            Sqoop 2
Support only for Hadoop security                   Support for Hadoop security and role-based access control to external systems
High risk of abusing access to external systems    Reduced risk of abusing access to external systems
No resource management policy                      Resource management policy
  
Sqoop 1: Security

•  Inherit/Propagate Kerberos principal for the jobs it launches
•  Access to files on HDFS can be controlled via HDFS security
•  Limited support (user/password) for secure access to external systems
  
Sqoop 2: Security

•  Inherit/Propagate Kerberos principal for the jobs it launches
•  Access to files on HDFS can be controlled via HDFS security
•  Support for secure access to external systems via role-based access to connection objects
    –  Administrators create/edit/delete connections
    –  Operators use connections
  
Sqoop 1: External System Access

•  Every invocation requires necessary credentials to access external systems (e.g. relational database)
    –  Workaround: create a user with limited access in lieu of giving out password
        •  Does not scale
        •  Permission granularity is hard to obtain
•  Hard to prevent misuse once credentials are given
  
Sqoop 2: External System Access

•  Connections are enabled as first-class objects
    –  Connections encompass credentials
    –  Connections are created once and then used many times for various import/export jobs
    –  Connections are created by administrator and used by operator
        •  Safeguard credential access from end users
•  Connections can be restricted in scope based on operation (import/export)
    –  Operators cannot abuse credentials
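
A minimal sketch of connections as first-class objects with role-based access follows. Everything in it is hypothetical: the `Connection` and `ConnectionStore` classes, the role strings, and the `submit_job` method are invented for illustration and do not reflect the Sqoop 2 server's actual model. The points it tries to show are that credentials live inside the connection, only administrators create connections, and a connection can be scoped to import or export.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Connection:
    name: str
    url: str
    username: str
    password: str = field(repr=False)  # kept out of repr so it is not echoed
    allowed_operations: FrozenSet[str] = frozenset({"import", "export"})


class ConnectionStore:
    """Hypothetical server-side registry of connection objects."""

    def __init__(self) -> None:
        self._connections: Dict[str, Connection] = {}

    def create(self, role: str, connection: Connection) -> None:
        if role != "administrator":
            raise PermissionError("only administrators may create connections")
        self._connections[connection.name] = connection

    def submit_job(self, role: str, connection_name: str, operation: str) -> str:
        if role not in ("administrator", "operator"):
            raise PermissionError(f"role {role!r} may not run jobs")
        conn = self._connections[connection_name]
        if operation not in conn.allowed_operations:
            raise PermissionError(
                f"{operation} is not permitted on connection {conn.name!r}"
            )
        # A real server would use conn.password here; the caller never sees it.
        return f"{operation} submitted on {conn.name} as {conn.username}"


store = ConnectionStore()
store.create(
    "administrator",
    Connection(
        name="warehouse",
        url="jdbc:mysql://localhost/db",
        username="foo",
        password="secret",
        allowed_operations=frozenset({"import"}),
    ),
)
result = store.submit_job("operator", "warehouse", "import")
```

The operator refers to the connection by name only, so the password never has to be handed out, and an import-only connection cannot be used to write back to the database.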
  
Sqoop 1: Resource Management

•  No explicit resource management policy
    –  Users specify the number of map jobs to run
    –  Cannot throttle load on external systems
  
Sqoop 2: Resource Management

•  Connections allow specification of resource management policy
    –  Administrators can limit the total number of physical connections open at one time
    –  Connections can also be disabled
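
One way to picture such a policy is a cap on concurrently open physical connections plus an on/off switch. The sketch below is an assumption-laden illustration, not Sqoop 2 code: `ThrottledConnection`, `ConnectionUnavailable`, and the fail-fast behaviour when the cap is reached are all invented here.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class ConnectionUnavailable(Exception):
    """Raised when the policy refuses to open another physical connection."""


class ThrottledConnection:
    """Hypothetical connection object carrying a resource policy."""

    def __init__(self, name: str, max_physical_connections: int) -> None:
        self.name = name
        self.enabled = True  # an administrator could flip this to False
        self._slots = threading.BoundedSemaphore(max_physical_connections)

    @contextmanager
    def open(self) -> Iterator["ThrottledConnection"]:
        if not self.enabled:
            raise ConnectionUnavailable(f"{self.name} is disabled")
        # Fail fast instead of queueing; a real policy might wait instead.
        if not self._slots.acquire(blocking=False):
            raise ConnectionUnavailable(f"{self.name} reached its limit")
        try:
            yield self
        finally:
            self._slots.release()


warehouse = ThrottledConnection("warehouse", max_physical_connections=2)
```

Each `with warehouse.open():` block holds one slot for its duration, so the external database never sees more than the configured number of sessions from this connection, however many map tasks ask for one.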
  
Demo Screenshots
  
Takeaway

Sqoop 2 Highlights:
    –  Ease of Use: Sqoop as a Service
    –  Ease of Extension: Connectors benefit from shared functionality
    –  Security: Connections as first-class objects and role-based security
  
Current Status: work-in-progress

•  Sqoop2 Development:
    http://issues.apache.org/jira/browse/SQOOP-365
•  Sqoop2 Blog Post:
    http://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
•  Sqoop2 Design:
    http://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2
  
Current Status: work-in-progress

•  Sqoop2 Quickstart:
    http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Quickstart
•  Sqoop2 Resource Layout:
    http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+-+Resource+Layout
•  Sqoop2 Feature Requests:
    http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Feature+Requests
  
Reception
Reception takes place in the Community Showcase (Hall 2)




Sqoop & Flume Meetups Tonight

6:30pm (Sqoop)
7:45pm (Flume)
at Hilton San Jose (San Carlos Conf Rm)
  

More Related Content

What's hot (20)

PDF
SQOOP PPT
Dushhyant Kumar
 
PDF
Sqoop tutorial
Ashoka Vanjare
 
PDF
Apache Sqoop: A Data Transfer Tool for Hadoop
Cloudera, Inc.
 
PDF
Apache sqoop
megrhi haikel
 
PDF
Habits of Effective Sqoop Users
Kathleen Ting
 
PDF
SQOOP - RDBMS to Hadoop
Sofian Hadiwijaya
 
PPTX
Streamline Hadoop DevOps with Apache Ambari
DataWorks Summit/Hadoop Summit
 
PPTX
Flexible and Real-Time Stream Processing with Apache Flink
DataWorks Summit
 
PDF
Big data: Loading your data with flume and sqoop
Christophe Marchal
 
PDF
Large-Scale Stream Processing in the Hadoop Ecosystem
DataWorks Summit/Hadoop Summit
 
PPTX
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
DataWorks Summit/Hadoop Summit
 
PPTX
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
Yahoo Developer Network
 
PDF
Productionizing Spark and the Spark Job Server
Evan Chan
 
PDF
SQL on Hadoop in Taiwan
Treasure Data, Inc.
 
PDF
Hortonworks.Cluster Config Guide
Douglas Bernardini
 
PPTX
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
DataWorks Summit/Hadoop Summit
 
PDF
Building large scale transactional data lake using apache hudi
Bill Liu
 
PPTX
Big data components - Introduction to Flume, Pig and Sqoop
Jeyamariappan Guru
 
PPTX
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
PDF
Low latency high throughput streaming using Apache Apex and Apache Kudu
DataWorks Summit
 
SQOOP PPT
Dushhyant Kumar
 
Sqoop tutorial
Ashoka Vanjare
 
Apache Sqoop: A Data Transfer Tool for Hadoop
Cloudera, Inc.
 
Apache sqoop
megrhi haikel
 
Habits of Effective Sqoop Users
Kathleen Ting
 
SQOOP - RDBMS to Hadoop
Sofian Hadiwijaya
 
Streamline Hadoop DevOps with Apache Ambari
DataWorks Summit/Hadoop Summit
 
Flexible and Real-Time Stream Processing with Apache Flink
DataWorks Summit
 
Big data: Loading your data with flume and sqoop
Christophe Marchal
 
Large-Scale Stream Processing in the Hadoop Ecosystem
DataWorks Summit/Hadoop Summit
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
DataWorks Summit/Hadoop Summit
 
January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...
Yahoo Developer Network
 
Productionizing Spark and the Spark Job Server
Evan Chan
 
SQL on Hadoop in Taiwan
Treasure Data, Inc.
 
Hortonworks.Cluster Config Guide
Douglas Bernardini
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
DataWorks Summit/Hadoop Summit
 
Building large scale transactional data lake using apache hudi
Bill Liu
 
Big data components - Introduction to Flume, Pig and Sqoop
Jeyamariappan Guru
 
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
DataWorks Summit
 

Viewers also liked (13)

PDF
Big data processing using Hadoop with Cloudera Quickstart
IMC Institute
 
PDF
สมุดกิจกรรม Code for Kids
IMC Institute
 
PPTX
Apache sqoop with an use case
Davin Abraham
 
PDF
Big Data Analytics using Mahout
IMC Institute
 
PDF
Introduction to Apache Sqoop
Avkash Chauhan
 
PDF
Thai Software & Software Market Survey 2015
IMC Institute
 
PPT
ITSS Overview
IMC Institute
 
PDF
Mobile User and App Analytics in China
IMC Institute
 
PPTX
Advanced Sqoop
Yogesh Kulkarni
 
PDF
Install Apache Hadoop for Development/Production
IMC Institute
 
PDF
Machine Learning using Apache Spark MLlib
IMC Institute
 
PDF
Kanban boards step by step
Giulio Roggero
 
PPTX
Flume vs. kafka
Omid Vahdaty
 
Big data processing using Hadoop with Cloudera Quickstart
IMC Institute
 
สมุดกิจกรรม Code for Kids
IMC Institute
 
Apache sqoop with an use case
Davin Abraham
 
Big Data Analytics using Mahout
IMC Institute
 
Introduction to Apache Sqoop
Avkash Chauhan
 
Thai Software & Software Market Survey 2015
IMC Institute
 
ITSS Overview
IMC Institute
 
Mobile User and App Analytics in China
IMC Institute
 
Advanced Sqoop
Yogesh Kulkarni
 
Install Apache Hadoop for Development/Production
IMC Institute
 
Machine Learning using Apache Spark MLlib
IMC Institute
 
Kanban boards step by step
Giulio Roggero
 
Flume vs. kafka
Omid Vahdaty
 
New Data Transfer Tools for Hadoop: Sqoop 2

  • 1. A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
      Bilung Lee (blee at cloudera dot com)
      Kathleen Ting (kathleen at cloudera dot com)
      Hadoop Summit 2012. 6/13/12 Apache Sqoop
      Copyright 2012 The Apache Software Foundation
  • 2. Who Are We?
      •  Bilung Lee
          –  Apache Sqoop Committer
          –  Software Engineer, Cloudera
      •  Kathleen Ting
          –  Apache Sqoop Committer
          –  Support Manager, Cloudera
  • 3. What is Sqoop?
      •  Bulk data transfer tool
          –  Import/Export from/to relational databases, enterprise data warehouses, and NoSQL systems
          –  Populate tables in HDFS, Hive, and HBase
          –  Integrate with Oozie as an action
          –  Support plugins via connector based architecture
      Timeline:
          May '09       First version (HADOOP-5815)
          March '10     Moved to GitHub
          August '11    Moved to Apache
          April '12     Apache Top Level Project
  • 4. Sqoop 1 Architecture
      [Architecture diagram; labels: command, Sqoop, Hadoop, Map Task, HDFS/HBase/Hive, Relational Database, Enterprise Data Warehouse, Document Based Systems]
  • 5. Sqoop 1 Challenges
      •  Cryptic, contextual command line arguments
      •  Tight coupling between data transfer and output format
      •  Security concerns with openly shared credentials
      •  Not easy to manage installation/configuration
      •  Connectors are forced to follow JDBC model
  • 6. Sqoop 2 Architecture
  • 7. Sqoop 2 Themes
      •  Ease of Use
      •  Ease of Extension
      •  Security
  • 8. Sqoop 2 Themes
      •  Ease of Use
      •  Ease of Extension
      •  Security
  • 9. Ease of Use
      Sqoop 1                            Sqoop 2
      Client-only Architecture           Client/Server Architecture
      CLI based                          CLI + Web based
      Client access to Hive, HBase       Server access to Hive, HBase
      Oozie and Sqoop tightly coupled    Oozie finds REST API
  • 10. Sqoop 1: Client-side Tool
      •  Client-side installation + configuration
          –  Connectors are installed/configured locally
          –  Local installation requires root privileges
          –  JDBC drivers are needed locally
          –  Database connectivity is needed locally
  • 11. Sqoop 2: Sqoop as a Service
      •  Server-side installation + configuration
          –  Connectors are installed/configured in one place
          –  Managed by administrator and run by operator
          –  JDBC drivers are needed in one place
          –  Database connectivity is needed on the server
  • 12. Client Interface
      •  Sqoop 1 client interface:
          –  Command line interface (CLI) based
          –  Can be automated via scripting
      •  Sqoop 2 client interface:
          –  CLI based (in either interactive or script mode)
          –  Web based (remotely accessible)
          –  REST API is exposed for external tool integration
  • 13. Sqoop 1: Service Level Integration
      •  Hive, HBase
          –  Require local installation
      •  Oozie
          –  von Neumann(esque) integration:
              •  Package Sqoop as an action
              •  Then run Sqoop from node machines, causing one MR job to be dependent on another MR job
              •  Error-prone, difficult to debug
  • 14. Sqoop 2: Service Level Integration
      •  Hive, HBase
          –  Server-side integration
      •  Oozie
          –  REST API integration
  • 15. Ease of Use
      Sqoop 1                            Sqoop 2
      Client-only Architecture           Client/Server Architecture
      CLI based                          CLI + Web based
      Client access to Hive, HBase       Server access to Hive, HBase
      Oozie and Sqoop tightly coupled    Oozie finds REST API
  • 16. Sqoop 2 Themes
      •  Ease of Use
      •  Ease of Extension
      •  Security
  • 17. Ease of Extension
      Sqoop 1                                    Sqoop 2
      Connector forced to follow JDBC model      Connector given free rein
      Connectors must implement functionality    Connectors benefit from common framework of functionality
      Connector selection is implicit            Connector selection is explicit
  • 18. Sqoop 1: Implementing Connectors
      •  Connectors are forced to follow JDBC model
          –  Connectors are limited/required to use common JDBC vocabulary (URL, database, table, etc.)
      •  Connectors must implement all Sqoop functionality they want to support
          –  New functionality may not be available for previously implemented connectors
  • 19. Sqoop 2: Implementing Connectors
      •  Connectors are not restricted to JDBC model
          –  Connectors can define their own domain
      •  Common functionality is abstracted out of connectors
          –  Connectors are only responsible for data transfer
          –  Common Reduce phase implements data transformation and system integration
          –  Connectors can benefit from future development of common functionality
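The split described on slide 19 can be pictured with a small sketch. This is not Sqoop 2's actual connector API: the class and function names (`Connector`, `InMemoryConnector`, `run_import`, `to_csv`) are hypothetical, and the sketch only assumes that a connector yields records while a shared framework step handles output formatting.

```python
from typing import Callable, Iterable, Iterator, List

Record = List[str]  # one extracted row, as a list of column values


class Connector:
    """Hypothetical connector: its only job is to move records."""

    def extract(self) -> Iterator[Record]:
        raise NotImplementedError


class InMemoryConnector(Connector):
    """Toy connector that reads from a list instead of a real database."""

    def __init__(self, rows: Iterable[Record]) -> None:
        self._rows = list(rows)

    def extract(self) -> Iterator[Record]:
        yield from self._rows


def to_csv(record: Record) -> str:
    """Shared output formatting, written once on the framework side."""
    return ",".join(record)


def run_import(connector: Connector, fmt: Callable[[Record], str]) -> List[str]:
    """Framework-side step: format whatever the connector extracted."""
    return [fmt(record) for record in connector.extract()]


lines = run_import(InMemoryConnector([["1", "alice"], ["2", "bob"]]), to_csv)
print(lines)  # ['1,alice', '2,bob']
```

In this arrangement a new output format would be added once, as another function passed to `run_import`, and every connector would pick it up without changes.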
  • 20. Different Options, Different Results
      Which is running MySQL?
          $ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
          $ sqoop import --connect jdbc:mysql://localhost/db --driver com.mysql.jdbc.Driver --username foo --table TEST
      •  Different options may lead to unpredictable results
          –  Sqoop 2 requires explicit selection of a connector, thus disambiguating the process
  • 21. Sqoop 1: Using Connectors
      •  Choice of connector is implicit
          –  In a simple case, based on the URL in --connect string to access the database
          –  Specification of different options can lead to different connector selection
          –  Error-prone but good for power users
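A toy model may help show why the two commands on slide 20 can behave differently. It assumes, as the Sqoop 1 user documentation describes, that supplying --driver makes Sqoop fall back to its generic JDBC connector instead of the MySQL-specific one. The function `pick_connector` and the connector labels are hypothetical and are not Sqoop's real selection code.

```python
from typing import Optional


def pick_connector(connect_url: str, driver: Optional[str] = None) -> str:
    """Toy model of implicit connector selection (labels are hypothetical)."""
    # Assumption: an explicit driver overrides URL-based detection.
    if driver is not None:
        return "generic-jdbc"
    # Otherwise the JDBC URL prefix decides which connector is used.
    if connect_url.startswith("jdbc:mysql:"):
        return "mysql"
    return "generic-jdbc"


first = pick_connector("jdbc:mysql://localhost/db")
second = pick_connector("jdbc:mysql://localhost/db", driver="com.mysql.jdbc.Driver")
print(first, second)  # mysql generic-jdbc
```

Both calls target the same database, yet the extra option changes which connector runs. That ambiguity is what explicit connector selection in Sqoop 2 is meant to remove.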
  • 22. Sqoop 1: Using Connectors
      •  Require knowledge of database idiosyncrasies
          –  e.g. Couchbase has no need for a table name, yet one is required, so --table gets overloaded as a backfill or dump operation
          –  e.g. --null-string representation is not supported by all connectors
      •  Functionality is limited to what the implicitly chosen connector supports
  • 23. Sqoop 2: Using Connectors
      •  Users make explicit connector choice
          –  Less error-prone, more predictable
      •  Users need not be aware of the functionality of all connectors
          –  Couchbase users need not care that other connectors use tables
  • 24. Sqoop 2: Using Connectors
      •  Common functionality is available to all connectors
          –  Connectors need not worry about common downstream functionality, such as transformation into various formats and integration with other systems
  • 25. Ease of Extension
      Sqoop 1                                    Sqoop 2
      Connector forced to follow JDBC model      Connector given free rein
      Connectors must implement functionality    Connectors benefit from common framework of functionality
      Connector selection is implicit            Connector selection is explicit
  • 26. Sqoop 2 Themes
      •  Ease of Use
      •  Ease of Extension
      •  Security
  • 27. Security
      Sqoop 1                                            Sqoop 2
      Support only for Hadoop security                   Support for Hadoop security and role-based access control to external systems
      High risk of abusing access to external systems    Reduced risk of abusing access to external systems
      No resource management policy                      Resource management policy
  • 28. Sqoop 1: Security
      •  Inherit/Propagate Kerberos principal for the jobs it launches
      •  Access to files on HDFS can be controlled via HDFS security
      •  Limited support (user/password) for secure access to external systems
  • 29. Sqoop 2: Security
      •  Inherit/Propagate Kerberos principal for the jobs it launches
      •  Access to files on HDFS can be controlled via HDFS security
      •  Support for secure access to external systems via role-based access to connection objects
          –  Administrators create/edit/delete connections
          –  Operators use connections
  • 30. Sqoop 1: External System Access
      •  Every invocation requires necessary credentials to access external systems (e.g. relational database)
          –  Workaround: create a user with limited access in lieu of giving out password
              •  Does not scale
              •  Permission granularity is hard to obtain
      •  Hard to prevent misuse once credentials are given
  • 31. Sqoop 2: External System Access
      •  Connections are enabled as first-class objects
          –  Connections encompass credentials
          –  Connections are created once and then used many times for various import/export jobs
          –  Connections are created by administrator and used by operator
              •  Safeguard credential access from end users
      •  Connections can be restricted in scope based on operation (import/export)
          –  Operators cannot abuse credentials
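The connection model on slide 31 could look roughly like the sketch below. All names here (`Connection`, `ConnectionRegistry`, the role strings, the sample credentials) are hypothetical illustrations and do not come from Sqoop 2. The sketch only assumes that administrators create connections, operators refer to them by name, and each connection carries an allowed set of operations.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Connection:
    """Hypothetical connection object that encompasses the credentials."""
    name: str
    url: str
    username: str
    password: str = field(repr=False)  # keep the secret out of printed output
    allowed_ops: FrozenSet[str] = frozenset({"import", "export"})


class ConnectionRegistry:
    """Hypothetical server-side store of connections."""

    def __init__(self) -> None:
        self._connections: Dict[str, Connection] = {}

    def create(self, role: str, conn: Connection) -> None:
        # Only administrators may create connections.
        if role != "admin":
            raise PermissionError("only administrators may create connections")
        self._connections[conn.name] = conn

    def submit_job(self, role: str, conn_name: str, op: str) -> str:
        # Operators (and administrators) use a connection by name only.
        if role not in ("admin", "operator"):
            raise PermissionError(f"role {role!r} may not run jobs")
        conn = self._connections[conn_name]
        # Scope check: the connection may be limited to import or export.
        if op not in conn.allowed_ops:
            raise PermissionError(f"{conn_name} does not allow {op}")
        # The caller gets back a job label, never the credentials.
        return f"{op} job submitted on {conn_name}"


registry = ConnectionRegistry()
registry.create(
    "admin",
    Connection("sales-db", "jdbc:mysql://localhost/db", "foo", "s3cret",
               frozenset({"import"})),
)
result = registry.submit_job("operator", "sales-db", "import")
print(result)  # import job submitted on sales-db
```

An operator calling `submit_job("operator", "sales-db", "export")` would get a `PermissionError` here, since this connection was scoped to imports only.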
  • 32. Sqoop 1: Resource Management
      •  No explicit resource management policy
          –  Users specify the number of map jobs to run
          –  Cannot throttle load on external systems
  • 33. Sqoop 2: Resource Management
      •  Connections allow specification of resource management policy
          –  Administrators can limit the total number of physical connections open at one time
          –  Connections can also be disabled
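One way to picture the policy on slide 33 is a counter that caps how many physical connections are open at once. The slides do not say how Sqoop 2 enforces the limit, so the semaphore-based `ConnectionLimiter` below is purely an assumed illustration with hypothetical names.

```python
import threading


class ConnectionLimiter:
    """Hypothetical cap on simultaneously open physical connections."""

    def __init__(self, max_open: int) -> None:
        self._slots = threading.BoundedSemaphore(max_open)
        self.enabled = True  # an administrator could set this to False

    def try_open(self) -> bool:
        # A disabled connection refuses all new work.
        if not self.enabled:
            return False
        # Non-blocking acquire: returns False once the cap is reached.
        return self._slots.acquire(blocking=False)

    def close(self) -> None:
        # Give the slot back when a physical connection is closed.
        self._slots.release()


limiter = ConnectionLimiter(max_open=2)
granted = [limiter.try_open() for _ in range(3)]
print(granted)  # [True, True, False]
```

With a cap like this, the load placed on the external database is bounded regardless of how many map tasks a user asks for.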
  • 34. Security
      Sqoop 1                                            Sqoop 2
      Support only for Hadoop security                   Support for Hadoop security and role-based access control to external systems
      High risk of abusing access to external systems    Reduced risk of abusing access to external systems
      No resource management policy                      Resource management policy
  • 35. Demo Screenshots
  • 36. Demo Screenshots
  • 37. Demo Screenshots
  • 38. Demo Screenshots
  • 39. Demo Screenshots
  • 40. Takeaway
      Sqoop 2 Highlights:
          –  Ease of Use: Sqoop as a Service
          –  Ease of Extension: Connectors benefit from shared functionality
          –  Security: Connections as first-class objects and role-based security
  • 41. Current Status: work-in-progress
      •  Sqoop2 Development: http://issues.apache.org/jira/browse/SQOOP-365
      •  Sqoop2 Blog Post: http://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
      •  Sqoop2 Design: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2
  • 42. Current Status: work-in-progress
      •  Sqoop2 Quickstart: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Quickstart
      •  Sqoop2 Resource Layout: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+-+Resource+Layout
      •  Sqoop2 Feature Requests: http://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Feature+Requests
  • 44. Reception
      Reception takes place in the Community Showcase (Hall 2)
  • 45. Sqoop & Flume Meetups Tonight
      6:30pm (Sqoop)
      7:45pm (Flume)
      at Hilton San Jose (San Carlos Conf Rm)