Sqooping 50 Million Rows a Day from MySQL

Eric Hernandez
Database Administrator
App Servers write into Parent_YearMonth_Merge, which is split across ten child tables, Child_YearMonth_0 through Child_YearMonth_9.

+50 Million New Rows a Day
+1.5 Billion Rows a Month
3 Month Rotational Life Cycle

MySQL Active Writer Instance: Current Month, One Month Ago, Two Months Ago
MySQL Archive Long-Term Storage Instance: Two Months Ago, Three Months Ago, and so on
Problem: data analysts have to pull data from two different sources.
One of the goals of our project is to create a single data source for analysts to mine.

MySQL Active Writer Instance: Current Month, One Month Ago
MySQL Archive Long-Term Storage Instance: Two Months Ago, Three Months Ago, and so on
Data analysts with Hadoop only have to pull from one data source.

Hadoop Cluster running Hive, with all data, current to the last 24 hours.
MySQL Active Writer Instance: Current Month, One Month Ago
Attempt 1.0: Sqooping in Data from MySQL

Sqoop the entire table into Hive every day at 00:30.

9-node Hadoop cluster, 4 TB available storage. Source: Parent_201108_Merge, split across Child_201108_0 through Child_201108_9, loaded into a single Hive table.

2011-08-01: 5 million rows per table, 2 minutes Sqoop time per table, 20 minutes total time. Total 50 million rows into the Hive table.

2011-08-02: 10 million rows per table, 4 minutes Sqoop time per table, 40 minutes total time. Total 100 million rows into the Hive table.

2011-08-10: 50 million rows per table, 20 minutes Sqoop time per table, 200 minutes total time. Total 500 million rows into the Hive table.
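The deck does not show the Attempt 1.0 command itself; the following is a minimal sketch of a full daily re-import, in which the JDBC URL, credentials, and Hive table name are assumptions (only the Child_201108_0 through Child_201108_9 table names come from the slides).

#!/bin/bash
# Attempt 1.0 (sketch): re-import every child table in full, once a day at 00:30.
# The JDBC URL, credentials, and Hive table name are assumptions, not from the deck;
# how the previous day's copy was replaced is also not shown in the slides.
for i in $(seq 0 9); do
  sqoop import \
    --connect jdbc:mysql://mysql-active/appdb \
    --username sqoop --password-file /user/sqoop/.pw \
    --table Child_201108_${i} \
    --hive-import --hive-table event_logs
done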
  
Attempt 2.0: Incremental Sqoop of Data from MySQL

Child_YearMonth schema: ID BIGINT (auto increment) | MISC column | MISC column | MISC column | Date_Created (timestamp)

Source: Parent_201108_Merge, split across Child_201108_0 through Child_201108_9.

sqoop import --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'"
Attempt 2.0: Incremental Sqoop of Data from MySQL

9-node Hadoop cluster, 4 TB available storage. Sqoop pulls only the last 24 hours, using a --where clause on Date_Created, from Child_201108_0 through Child_201108_9 into the Hive table.

2011-08-01: 5 million rows per table, 2 minutes Sqoop time per table, 10 minutes total time. Total 50 million rows into the Hive table.

2011-08-02: 5 million rows per table, 2 minutes Sqoop time per table, 10 minutes total time. Total 50 million rows into the Hive table.

2011-08-10: 5 million rows per table, 2 minutes Sqoop time per table, 10 minutes total time. Total 50 million rows into the Hive table.

Consistent run times for Sqoop jobs achieved.
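The slides do not show how the daily incremental job is scheduled; here is a minimal scheduling sketch, in which the script name, path, and log file are assumptions (the 00:30 start time is taken from Attempt 1.0).

# Hypothetical crontab entry: run the incremental import at 00:30, once the
# previous day's rows have stopped arriving.
#   30 0 * * *  /opt/etl/sqoop_incremental.sh >> /var/log/sqoop_incremental.log 2>&1

# Inside the script, the one-day window fed to the --where clause:
DATE=$(date +%Y-%m-%d -d "1 day ago")
WHERE="date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'"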
  
 	
  
After our 2.0 incremental process we had achieved consistent run times; however, two new problems surfaced.

1)  Each day, 10 new parts would be added to the Hive table, which caused 10 more map tasks per Hive query.
2)  Space consumption on the Hadoop cluster.
Too many parts and map tasks per query.

Each daily Sqoop of Child_201108_0 through Child_201108_9 adds ten more parts to the Hive table:

2011-08-01: Part-0 through Part-9
2011-08-02: Part-10 through Part-19
2011-08-03: Part-20 through Part-29

For 3 days of data, 30 map tasks must be processed for any Hive query.
For 30 days of data, 300 map tasks must be processed for any Hive query.
Each day's Sqoop from Child_201108_0 through Child_201108_9 now lands in its own Hive partition (dt=2011-08-01, dt=2011-08-02, dt=2011-08-03, ...), each holding Part-0 through Part-9.

To sqoop 10 tables into one partition, I chose to dynamically create a partition based on date and Sqoop the data into the partition directory with an append:

# Set date to yesterday
DATE=`date +%Y-%m-%d -d "1 day ago"`

# Partition directory inside the Hive warehouse
TABLE_DIR=/user/hive/warehouse/${TABLE}
PARTITION_DIR=$TABLE_DIR/${DATE}

# Create partition
echo "ALTER TABLE ${TABLE} ADD IF NOT EXISTS PARTITION (dt='${DATE}') location '${PARTITION_DIR}'; exit;" | /usr/bin/hive

# Sqoop in event_logs
sqoop import --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" --target-dir $PARTITION_DIR --append
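The snippet above elides the Hive table name, the MySQL connection details, and the loop over the ten child tables; below is a fuller sketch with assumed values for those pieces (event_logs is taken from the slide's own comment, everything else is invented).

#!/bin/bash
# Daily partition-append job (sketch). The connection details and credentials are assumptions.
set -e

TABLE=event_logs
DATE=$(date +%Y-%m-%d -d "1 day ago")
YEARMONTH=$(date +%Y%m -d "1 day ago")
TABLE_DIR=/user/hive/warehouse/${TABLE}
PARTITION_DIR=${TABLE_DIR}/${DATE}

# Register the partition so queries can see the new files as soon as they land.
echo "ALTER TABLE ${TABLE} ADD IF NOT EXISTS PARTITION (dt='${DATE}') location '${PARTITION_DIR}'; exit;" | /usr/bin/hive

# Append yesterday's rows from each MySQL child table into the partition directory.
for i in $(seq 0 9); do
  sqoop import \
    --connect jdbc:mysql://mysql-active/appdb \
    --username sqoop --password-file /user/sqoop/.pw \
    --table Child_${YEARMONTH}_${i} \
    --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" \
    --target-dir "${PARTITION_DIR}" --append
done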
  
Hive table with one partition per day (dt=2011-08-01, dt=2011-08-02, dt=2011-08-03, ...), each holding Part-0 through Part-9 sqooped from Child_201108_0 through Child_201108_9.

As a result of sqooping into Hive partitions, only a minimal number of map tasks have to be processed:

1 Day = 10 Map Tasks
2 Days = 20 Map Tasks
...
30 Days = 300 Map Tasks
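This works because Hive only reads the partitions a query's predicate selects; a minimal illustration follows (the table name event_logs and the query itself are hypothetical).

# Hypothetical query restricted to one day's partition: only that partition's
# 10 parts are read, so roughly 10 map tasks instead of one per part in the whole table.
hive -e "SELECT COUNT(*) FROM event_logs WHERE dt = '2011-08-01';"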
  
Space Consumption

One month of data from Parent_201108_Merge (Child_201108_0 through Child_201108_9) is about 30 GB in Hadoop, with an HDFS replication factor of 3:

1 Replica  = 30 GB
3 Replicas = 90 GB in HDFS

1 Year of Data: 12 x 90 GB = 1.08 TB in HDFS at 3 replicas.
Sqooping with Snappy

sqoop import --compression-codec org.apache.hadoop.io.compress.SnappyCodec -z

With roughly 5:1 Snappy compression, one month of data drops from 30 GB to about 6 GB, with an HDFS replication factor of 3:

1 Replica  = 6 GB
3 Replicas = 18 GB in HDFS

1 Year of Data: 12 x 18 GB = 216 GB in HDFS at 3 replicas, with 5:1 Snappy compression.
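The compression flag composes with the incremental, partition-append import shown earlier; here is a sketch of the combined command, reusing the same assumed connection details and variables from the earlier script.

# Same daily per-child-table import as before, now writing Snappy-compressed files
# into the partition directory. Connection details remain assumptions.
sqoop import \
  --connect jdbc:mysql://mysql-active/appdb \
  --username sqoop --password-file /user/sqoop/.pw \
  --table Child_${YEARMONTH}_${i} \
  --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" \
  --target-dir "${PARTITION_DIR}" --append \
  -z --compression-codec org.apache.hadoop.io.compress.SnappyCodec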
  
Summary

1)  Develop some kind of incremental import when sqooping in large, active tables. If you do not, your Sqoop jobs will take longer and longer as the data in the RDBMS grows.

2)  Limit the number of parts stored in HDFS; extra parts translate into time-consuming map tasks. Use partitioning if possible.

3)  Compress data in HDFS. You will save space, since your replication factor makes multiple copies of your data. You may also benefit in processing, as your Map/Reduce jobs have less data to transfer and Hadoop becomes less I/O bound.
                                                                      	
  
?	
  
 	
  
