Introduction to Hadoop
Zak Stone <zak@eecs.harvard.edu>
PhD candidate, Harvard School of Engineering and Applied Sciences
Advisor: Todd Zickler (Computer Vision)
Hadoop distributes data and computation across a large number of computers.
Outline

  1. Why should you care about Hadoop?
  2. What exactly is Hadoop?
  3. An overview of Hadoop Map-Reduce
  4. The Hadoop Distributed File System (HDFS)
  5. Hadoop advantages and disadvantages
  6. Getting started with Hadoop
  7. Useful resources
Why should you care? - Lots of Data

   LOTS OF DATA EVERYWHERE
Why should you care? - Even Grocery Stores Care
Why Hadoop for big data?

• Most credible open-source toolset for large-scale, general-purpose computing

  • Backed by [company logos]

  • Used by [company logos], many others

  • Increasing support from [company logo] web services

  • Hadoop closely imitates infrastructure developed by [company logo]

  • Hadoop processes petabytes daily, right now
DISCLAIMER
   • Don’t use Hadoop if your data and computation fit on one machine

   • Getting easier to use, but still complicated

http://www.wired.com/gadgetlab/2008/07/patent-crazines/
What exactly is Hadoop?

• Actually a growing collection of subprojects; focus on two right now
An overview of Hadoop Map-Reduce

   Traditional computing: one computer
   Hadoop: many computers

   (Actually more like this: many computers, little communication,
    stragglers and failures)
Map-Reduce: Three phases

   1. Map
   2. Sort
   3. Reduce
Map-Reduce: Map phase

   Only specify operations on key-value pairs!

   INPUT PAIR        OUTPUT PAIRS
   (key, value)  ->  (key, value)
                     (key, value)
                     (key, value)
                     (zero or more output pairs)

   (each “elephant” works on an input pair;
    doesn’t know other elephants exist)
Map-Reduce: Map phase, word-count example

   (line1, “Hello there.”)  ->  (“hello”, 1)
                                (“there”, 1)

   (line2, “Why, hello.”)   ->  (“why”, 1)
                                (“hello”, 1)
Map-Reduce: Sort phase

   (key1, value289)
   (key1, value43)
   (key1, value3)
   ...
   (key2, value512)
   (key2, value11)
   (key2, value67)
   ...
Map-Reduce: Sort phase, word-count example

   (“hello”, 1)
   (“hello”, 1)

   (“there”, 1)

   (“why”, 1)
Map-Reduce: Reduce phase

   (key1, value289)
   (key1, value43)   ->  (key1, output1)
   (key1, value3)
   ...
Map-Reduce: Reduce phase, word-count example

   (“hello”, 1)
   (“hello”, 1)   ->  (“hello”, 2)

   (“there”, 1)   ->  (“there”, 1)

   (“why”, 1)     ->  (“why”, 1)
Map-Reduce: Code for word-count

     def mapper(key, value):
         # key: line identifier; value: one line of text
         for word in value.split():
             yield word, 1

     def reducer(key, values):
         # values: every count emitted for this word
         yield key, sum(values)
Seems like too much work for a word-count!
Map-Reduce: Imagine word-count on the Web
Map-Reduce: The main advantage

With Hadoop, this very same code could run on the entire Web!
(In theory, at least)

     def mapper(key, value):
         # key: line identifier; value: one line of text
         for word in value.split():
             yield word, 1

     def reducer(key, values):
         # values: every count emitted for this word
         yield key, sum(values)
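
To try the three phases without a cluster, a minimal single-process sketch might look like the following. The `run_local` helper and the sample records are hypothetical and only for illustration; Hadoop itself performs the sort across many machines, which this sketch does not attempt to imitate.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated token in the line.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # Sum every count that was emitted for this word.
    yield key, sum(values)


def run_local(records, mapper, reducer):
    """Hypothetical helper: run map, sort, and reduce in one process."""
    # Map phase: each input pair is handled independently.
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # Sort phase: bring pairs with the same key next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


# Tiny made-up input, already lowercased and free of punctuation.
records = [("line1", "hello there"), ("line2", "why hello")]
result = run_local(records, mapper, reducer)
print(result)  # [('hello', 2), ('there', 1), ('why', 1)]
```

The mapper and reducer are unchanged from the slide; only the driver differs between a laptop and a cluster.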
HDFS: Hadoop Distributed File System

   Data  ->  chunks of data on computers
             (each chunk replicated more than once for reliability)
HDFS: Hadoop Distributed File System

   (key1, value1)        (key1, value1)
   (key2, value2)   ...  (key2, value2)   ...
         ...                   ...

   Computation is local to the data
   Key-value pairs processed independently in parallel
HDFS: Inspired by the Google File System
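
The chunk-and-replicate idea can be sketched in a few lines. This is a toy model under stated assumptions: the round-robin placement, the node names, and the chunk size are all made up for illustration and are not HDFS's actual placement policy.

```python
def split_into_chunks(data, chunk_size):
    # Cut the data into fixed-size chunks; the last one may be shorter.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, nodes, replication=2):
    """Hypothetical round-robin placement: chunk i goes to `replication`
    consecutive nodes, so each chunk lives on distinct machines."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


def surviving_replicas(placement, failed_node):
    # Replicas that remain readable after one node disappears.
    return {
        chunk_id: [node for node in holders if node != failed_node]
        for chunk_id, holders in placement.items()
    }


nodes = ["node-a", "node-b", "node-c", "node-d"]  # made-up node names
chunks = split_into_chunks(b"abcdefghij", chunk_size=4)
placement = place_replicas(len(chunks), nodes, replication=2)
after_failure = surviving_replicas(placement, "node-b")
print(placement)
print(after_failure)
```

Because every chunk has a copy on more than one node, losing a single node leaves each chunk readable, and a map task can be scheduled on whichever surviving node already holds the chunk.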
Hadoop Map-Reduce and HDFS: Advantages
• Distribute data and computation

   • Computation local to data avoids network overload

• Tasks are independent

   • Easy to handle partial failures - entire nodes can fail and restart

   • Avoid crawling horrors of failure-tolerant synchronous distributed systems

   • Speculative execution to work around stragglers

• Linear scaling in the ideal case

   • Designed for cheap, commodity hardware

• Simple programming model

   • The “end-user” programmer only writes map-reduce tasks
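
Why independent tasks make partial failures easy to handle can be illustrated with a small sketch. This is not Hadoop's scheduler: `run_with_retries`, the task names, and the simulated failure are hypothetical, and the only point is that a failed task can be rerun without touching any other task.

```python
def run_with_retries(tasks, run_task, max_attempts=3):
    """Hypothetical driver: rerun a failed task on its own, since no
    other task depends on its intermediate state."""
    results = {}
    for task_id, payload in tasks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                results[task_id] = run_task(task_id, payload, attempt)
                break  # this task is done; move on to the next one
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up only after the last attempt
    return results


def flaky_square(task_id, payload, attempt):
    # Simulate a node that fails the first time it runs "task-2".
    if task_id == "task-2" and attempt == 1:
        raise RuntimeError("simulated node failure")
    return payload * payload


tasks = {"task-1": 2, "task-2": 3, "task-3": 4}
results = run_with_retries(tasks, flaky_square)
print(results)  # {'task-1': 4, 'task-2': 9, 'task-3': 16}
```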
Hadoop Map-Reduce and HDFS: Disadvantages
• Still rough - software under active development

   • e.g. HDFS only recently added support for append operations

• Programming model is very restrictive

   • Lack of central data can be frustrating

• “Joins” of multiple datasets are tricky and slow

   • No indices! Often, entire dataset gets copied in the process

• Cluster management is hard (debugging, distributing software, collecting logs...)

• Still single master, which requires care and may limit scaling

• Managing job flow isn’t trivial when intermediate data should be kept

• Optimal configuration of nodes is not obvious (number of mappers, number of reducers, memory limits)
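
To see why joins are awkward, here is a minimal sketch of one common pattern, a reduce-side join, run in a single process. The datasets, the tag names, and the `run_join` driver are assumptions made for illustration. Note that every record of both datasets has to pass through the sort phase, which is the copying the slide complains about.

```python
from collections import defaultdict


def map_users(user_id, name):
    # Tag each record with its source so the reducer can tell them apart.
    yield user_id, ("user", name)


def map_orders(user_id, item):
    yield user_id, ("order", item)


def join_reducer(key, values):
    values = list(values)  # two passes are needed, so materialize once
    names = [v for tag, v in values if tag == "user"]
    items = [v for tag, v in values if tag == "order"]
    # Emit one joined pair per (name, item) combination for this key.
    for name in names:
        for item in items:
            yield key, (name, item)


def run_join(users, orders):
    """Hypothetical local driver standing in for the sort phase."""
    grouped = defaultdict(list)
    for user_id, name in users:
        for key, tagged in map_users(user_id, name):
            grouped[key].append(tagged)
    for user_id, item in orders:
        for key, tagged in map_orders(user_id, item):
            grouped[key].append(tagged)
    output = []
    for key in sorted(grouped):
        output.extend(join_reducer(key, grouped[key]))
    return output


users = [(1, "ada"), (2, "grace")]                 # made-up sample data
orders = [(1, "book"), (2, "laptop"), (1, "pen")]  # made-up sample data
joined = run_join(users, orders)
print(joined)
```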
Getting started: Installation options

• Cloudera virtual machine

• Your own virtual machine (install Ubuntu in VirtualBox, which is free)

• Elastic MapReduce on EC2

• StarCluster with Hadoop on EC2

• Cloudera’s distribution of Hadoop on EC2

• Install Cloudera’s distribution of Hadoop on your own machine

   • Available for RPM and Debian deployments

• Or download Hadoop directly from http://hadoop.apache.org/
Getting started: Language choices

• Hadoop is written in Java

• However, Hadoop Streaming allows mappers and reducers in any language!

• Binary data is a little tricky with Hadoop Streaming

   • Could use base64 encoding, but TypedBytes are much better

• For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo

   • The Python word-count example and others come with Dumbo

   • Dumbo makes binary data with TypedBytes easy

• Also consider Hadoopy: https://github.com/bwhite/hadoopy
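
As a rough sketch of what a Hadoop Streaming job looks like in Python: Streaming passes records as lines of text, conventionally `key<TAB>value`, and the reducer receives its lines sorted by key. The function names below are hypothetical, and they take iterables of lines so they can be tested locally. The base64 helpers show the simple (and less efficient) way to carry binary values; TypedBytes is not shown.

```python
import base64
from itertools import groupby


def stream_mapper(lines):
    # One output line per word, formatted as "word<TAB>1".
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def stream_reducer(sorted_lines):
    # Input lines arrive sorted by key, so equal words are adjacent.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def encode_binary_value(key, blob):
    # base64 keeps tabs and newlines inside the blob from breaking the line format.
    return f"{key}\t{base64.b64encode(blob).decode('ascii')}"


def decode_binary_value(line):
    key, encoded = line.rstrip("\n").split("\t", 1)
    return key, base64.b64decode(encoded)


mapped = list(stream_mapper(["hello there", "why hello"]))
reduced = list(stream_reducer(sorted(mapped)))  # sorted() stands in for the sort phase
print(reduced)  # ['hello\t2', 'there\t1', 'why\t1']

blob = b"\x00\t\n\xff"
round_trip = decode_binary_value(encode_binary_value("image-1", blob))
```

In an actual Streaming script, the mapper and reducer would each read from `sys.stdin` and print their output lines; the local `sorted()` call stands in for the sort that Hadoop performs between the two.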
Useful resources and tips

• The Hadoop homepage: http://hadoop.apache.org/

• Cloudera: http://cloudera.com/

• Dumbo: http://wiki.github.com/klbostee/dumbo

• Hadoopy: https://github.com/bwhite/hadoopy

• Amazon Elastic Compute Cloud Getting Started Guide:
  http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/


• Always test locally on a tiny dataset before running on a cluster!
Thanks for your attention!

[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
