Apache Hadoop - Big Data Engineering
Prepared by:
● Islam Elbanna
● Mahmoud Hanafy
Presented by:
● Ahmed Mahran
Outlines
1. Introduction
2. History
3. Assumptions
4. Architecture
a. Case Study
b. MapReduce Design
c. Code Example
d. Main Modules
e. Access Procedure
5. Hadoop Modes
6. MapReduce 1 VS MapReduce 2 (YARN)
7. Questions
Introduction
What is Hadoop?
"Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters of
commodity computers using a simple programming model.
It is designed to scale up from single servers to thousands
of machines, each providing computation and storage"
Open-source software + commodity hardware = IT cost reduction
Introduction - Cont.
Why Hadoop?
● Performance
● Storage
● Scalability
● Fault tolerance
● Cost efficiency (Commodity Machines)
Introduction - Cont.
What is Hadoop used for?
● Searching
● Log processing
● Recommendation systems
● Analytics
● Video and image analysis
Introduction - Cont.
Who uses Hadoop?
● Amazon
● Facebook
● Google
● IBM
● New York Times
● Yahoo
● Twitter
● LinkedIn
● …
Introduction - Cont.
Hadoop vs RDBMS
Hadoop                               RDBMS
Unstructured/structured data         Structured data
Scale out                            Scale up
Procedural/functional programming    Declarative queries
Offline batch processing             Online/batch transactions
Petabytes                            Gigabytes
Key-value pairs                      Predefined fields
Introduction - Cont.
Problem:
20+ billion web pages x 20KB = 400+ terabytes
One computer can read 30-35 MB/sec from disk
~ Four months to read the web (Time).
~1,000 hard drives just to store the web (Storage).
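The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units and the upper end of the quoted read rate; the 400 GB drive capacity is a hypothetical value chosen only because it makes the "~1,000 hard drives" figure work out.

```python
# Back-of-envelope check of the slide's numbers (decimal units assumed).
PAGES = 20e9                    # 20 billion web pages
PAGE_SIZE_BYTES = 20e3          # 20 KB per page
READ_RATE_BYTES_PER_S = 35e6    # upper end of the 30-35 MB/sec range

total_bytes = PAGES * PAGE_SIZE_BYTES
total_terabytes = total_bytes / 1e12

# Time for a single machine to read everything sequentially.
seconds_one_machine = total_bytes / READ_RATE_BYTES_PER_S
days_one_machine = seconds_one_machine / 86_400

# Hypothetical drive capacity (assumption, not stated on the slide).
ASSUMED_DRIVE_BYTES = 400e9
drives_needed = total_bytes / ASSUMED_DRIVE_BYTES

print(f"data volume:   {total_terabytes:.0f} TB")
print(f"one machine:   {days_one_machine:.0f} days")
print(f"drives needed: {drives_needed:.0f}")
```

At the lower end of the read-rate range the single-machine time grows further, so "about four months" is the right order of magnitude either way.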
Introduction - Cont.
Solution: with 1,000 machines, the same problem takes < 3 hours
But we need:
● Communication and coordination
● Recovering from machine failure
● Status reporting
● Debugging
● Optimization
Distributed System
Introduction - Cont.
Distributed systems
● Cluster of machines
● Distributed Storage
● Distributed Computing
History
● 2002-2004: Started as a sub-project of Apache Nutch.
● 2003-2004: Google published the Google File System (GFS) and MapReduce papers.
● 2004: Doug Cutting and Mike Cafarella implemented Google’s frameworks in Nutch.
● 2006: Yahoo hired Doug Cutting to work on Hadoop with a dedicated team.
● 2008: Hadoop became an Apache top-level project.
Assumptions
● Hardware Failure
● Streaming Data Access
● Large Data Sets
● Simple Coherency Model
● Moving Computation is Cheaper than Moving Data
● Software Platform Portability
Architecture
Hadoop is designed and built on two independent frameworks:
Hadoop = HDFS + MapReduce
HDFS: a reliable distributed file system that provides high-throughput access to data.
● Files are divided into 64 MB blocks (default)
● Each block is replicated 3 times (default)
MapReduce: is a framework for performing high
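As a rough illustration of the two HDFS defaults above, the sketch below estimates how many blocks and replicas a file produces. The function name and the 1,000 MB file are hypothetical; the block size and replication factor are the defaults quoted on this slide.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide

def hdfs_footprint(file_size_mb: float) -> tuple[int, int, float]:
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    # The last block only occupies as much space as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_mb

# Hypothetical 1,000 MB file.
blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```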
Case Study: Word Count
Problem: We need to calculate word frequencies in billions of web pages
● Input: Files with one document per record
● Output: List of words and their frequencies across all documents
Case Study: Solution
Architecture - Cont.
MapReduce Design
● Map
● Reduce
● Shuffle & Sort
Case Study: Map Phase
● Specify a map function that takes a key/value pair:
    key = document URL
    value = document contents
● The output of the map function is a set of key/value pairs.
    In our case, output (word, “1”) once per word in the document
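A minimal sketch of such a map function, in plain Python rather than the Hadoop API. The function name, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the slide emits the string "1", while this sketch emits the integer 1 so the values can be summed directly later.

```python
from typing import Iterator

def map_word_count(url: str, contents: str) -> Iterator[tuple[str, int]]:
    """Emit (word, 1) once per word occurrence in the document."""
    # The key (document URL) is not needed for counting words.
    for word in contents.lower().split():
        yield word, 1

# Hypothetical document used only to exercise the function.
pairs = list(map_word_count("http://example.com/a", "the cat saw the dog"))
print(pairs)
```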
Case Study: Reduce Phase
● MapReduce library gathers together all pairs with the same key
(shuffle/sort)
● The reduce function combines the values for a key
In our case, compute the sum
● The output of reduce is one (word, total count) pair per distinct word
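The grouping and summing steps can be sketched in plain Python as follows. This is a single-process stand-in, not the Hadoop API: the helper names are hypothetical, and sorting followed by `itertools.groupby` only imitates what the framework does between the two phases.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(pairs):
    """Group values by key, imitating the framework's shuffle/sort step."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_word_count(word, counts):
    """Combine all values for one key; here, sum them."""
    return word, sum(counts)

# Hypothetical mapper output for the text "the cat saw the dog".
mapped = [("the", 1), ("cat", 1), ("saw", 1), ("the", 1), ("dog", 1)]
result = [reduce_word_count(k, vs) for k, vs in shuffle_and_sort(mapped)]
print(result)
```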
Architecture - Cont.
MapReduce Design
● Map: extract something you care about from each record.
Architecture - Cont.
MapReduce Design
● Reduce: aggregate, summarize, filter, or transform mapper output
Architecture - Cont.
MapReduce Design
Overall View:
Architecture - Cont.
MapReduce Design
● Shuffle & Sort: redirect the mapper output to the right reducer
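One common way to pick "the right reducer" is to hash the key and take the result modulo the number of reducers, so that equal keys always meet at the same reducer. The sketch below illustrates that idea under stated assumptions: `partition` and `route` are hypothetical names, and `zlib.crc32` is only a stand-in for a stable hash function.

```python
import zlib
from collections import defaultdict

def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index for a key; equal keys always get the same index."""
    # crc32 is a stand-in for a stable hash. Python's built-in hash() of a
    # string changes between interpreter runs, so it is avoided here.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def route(pairs, num_reducers):
    """Send each (key, value) pair to the bucket of its reducer."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

# Hypothetical mapper output routed to two reducers.
buckets = route([("the", 1), ("cat", 1), ("the", 1), ("dog", 1)], num_reducers=2)
for index, items in sorted(buckets.items()):
    print(index, items)
```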
Case Study: Overall View
Architecture - Cont.
MapReduce
Programmer specifies two primary methods:
map(k1, v1) → <k2, v2>
reduce(k2, list<v2>) → <k3, v3>
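To make the two signatures concrete, here is a small single-process driver written with type hints. It is a sketch, not Hadoop's API: `run_mapreduce` is a hypothetical helper, the map function returns a list of pairs, and all grouping happens in memory.

```python
from collections import defaultdict
from typing import Callable, Iterable, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")

def run_mapreduce(
    records: Iterable[tuple[K1, V1]],
    map_fn: Callable[[K1, V1], Iterable[tuple[K2, V2]]],
    reduce_fn: Callable[[K2, list[V2]], tuple[K3, V3]],
) -> list[tuple[K3, V3]]:
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    # Keys are visited in sorted order, as reducers see them after the sort.
    return [reduce_fn(k2, grouped[k2]) for k2 in sorted(grouped)]

# Hypothetical two-document input for the word count case study.
docs = [("doc1", "a b a"), ("doc2", "b c")]
counts = run_mapreduce(
    docs,
    map_fn=lambda url, text: [(word, 1) for word in text.split()],
    reduce_fn=lambda word, ones: (word, sum(ones)),
)
print(counts)
```

The programmer supplies only `map_fn` and `reduce_fn`; everything else in the driver corresponds to work the framework does.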
Case Study : Code Example
Map Function
Case Study : Code Example
Reduce Function
Hadoop is not limited to Java (Hadoop Streaming)
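With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below shows word count in that line-oriented style. To keep it testable, the functions take and return iterables of lines; in a real streaming job they would be wired to `sys.stdin` and `print`. The local pipeline at the end is an assumption that only imitates the sort the framework performs between the two stages.

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style mapper: raw text lines in, 'word<TAB>1' lines out."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming-style reducer: expects 'word<TAB>count' lines sorted by word."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

# Local imitation of map -> sort -> reduce on two hypothetical input lines.
local_run = list(reducer(sorted(mapper(["the cat", "the dog"]))))
print(local_run)
```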
Architecture - Cont.
Main Modules
● File System (HDFS)
⚪ Name Node
⚪ Secondary Name Node
⚪ Data Node
● MapReduce Framework
⚪ Job Tracker
⚪ Task Tracker
Architecture - Cont.
Access Procedure
● Read From HDFS
● Write to HDFS
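The read path can be pictured with a toy model: the client asks the Name Node which blocks make up a file and where their replicas live, then fetches the block contents from Data Nodes. The classes, method names, and block ids below are hypothetical and greatly simplified; they are not the HDFS client API.

```python
class DataNode:
    """Stores block contents keyed by block id."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

class NameNode:
    """Keeps metadata only: which blocks form a file and where they live."""
    def __init__(self):
        self.file_blocks = {}       # path -> ordered list of block ids
        self.block_locations = {}   # block id -> Data Nodes holding a replica

    def get_block_locations(self, path):
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]

def read_file(namenode, path):
    """Client side: look up block locations, then read from Data Nodes."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        data += replicas[0].blocks[block_id]  # toy choice: first listed replica
    return data

# Hypothetical two-node cluster holding a two-block file.
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"world"
nn = NameNode()
nn.file_blocks["/demo.txt"] = ["blk_1", "blk_2"]
nn.block_locations = {"blk_1": [dn1, dn2], "blk_2": [dn2]}
content = read_file(nn, "/demo.txt")
print(content)
```

Note that file data never passes through the Name Node in this model; it only answers the metadata lookup.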
Architecture - Cont.
Task distribution procedure:
The JobTracker chooses the nodes that execute the tasks so as to achieve data locality
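The data locality idea can be sketched as a simple preference rule: run a task on a free node that already stores its input block, and fall back to another node only when none is available. The function and node names are hypothetical, and the sketch ignores intermediate options such as preferring a node in the same rack.

```python
def pick_node(block_replica_nodes, free_nodes):
    """Prefer a free node that already stores the task's input block."""
    for node in free_nodes:
        if node in block_replica_nodes:
            return node, "data-local"
    # Otherwise use any free node; the block must then travel over the network.
    return free_nodes[0], "remote"

# Hypothetical cluster state: the block lives on n2 and n3.
local_choice = pick_node({"n2", "n3"}, ["n1", "n2"])
remote_choice = pick_node({"n3"}, ["n1", "n2"])
print(local_choice, remote_choice)
```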
Hadoop Modes
● Standalone
● Pseudo-Distributed
● Fully-Distributed
MapReduce 1 vs MapReduce 2 (YARN)
Questions
Thanks
