Apache Hadoop - Big Data Engineering

Prepared by:
● Islam Elbanna
● Mahmoud Hanafy
Presented by:
● Ahmed Mahran

Outline
1. Introduction
2. History
3. Assumptions
4. Architecture
a. Case Study
b. MapReduce Design
c. Code Example
d. Main Modules
e. Access Procedure
5. Hadoop Modes
6. MapReduce 1 vs. MapReduce 2 (YARN)
7. Questions

Introduction
What is Hadoop?
"Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters of
commodity computers using a simple programming model.
It is designed to scale up from single servers to thousands
of machines, each providing computation and storage"
Open source software + commodity hardware = IT cost reduction

Introduction - Cont.
Why Hadoop?
● Performance
● Storage
● Scalability
● Fault tolerance
● Cost efficiency (Commodity Machines)

What is Hadoop used for?
● Searching
● Log processing
● Recommendation systems
● Analytics
● Video and Image analysis

Who uses Hadoop?
● Amazon
● Facebook
● Google
● IBM
● New York Times
● Yahoo
● Twitter
● LinkedIn
● …

Hadoop vs. RDBMS

Hadoop                             | RDBMS
Unstructured/structured data       | Structured data
Scale out                          | Scale up
Procedural/functional programming  | Declarative queries
Offline batch processing           | Online/batch transactions
Petabytes                          | Gigabytes
Key-value pairs                    | Predefined fields

Problem:
20+ billion web pages x 20KB = 400+ terabytes
One computer can read 30-35 MB/sec from disk
~ Four months to read the web (Time).
~1,000 hard drives just to store the web (Storage).

Solution: the same problem on 1,000 machines takes about 3 hours
But we need:
● Communication and coordination
● Recovering from machine failure
● Status reporting
● Debugging
● Optimization
In short, we need a distributed system.
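
The back-of-the-envelope numbers above can be checked with a few lines of Python. This is a rough sketch that assumes decimal units (1 KB = 1,000 bytes), the upper end of the quoted read rate (35 MB/sec), and perfectly parallel reads across the 1,000 machines with no coordination overhead; the constant names are illustrative only.

```python
# Rough check of the slide's estimate (decimal units assumed).
PAGES = 20e9                    # 20 billion web pages
PAGE_SIZE_BYTES = 20e3          # 20 KB per page
READ_RATE_BYTES_PER_S = 35e6    # upper end of the 30-35 MB/sec range
MACHINES = 1000                 # idealized: perfectly parallel, no overhead

total_bytes = PAGES * PAGE_SIZE_BYTES
single_machine_seconds = total_bytes / READ_RATE_BYTES_PER_S
single_machine_days = single_machine_seconds / 86_400
cluster_hours = single_machine_seconds / MACHINES / 3_600

print(f"Total data:   {total_bytes / 1e12:.0f} TB")
print(f"One machine:  {single_machine_days:.0f} days")
print(f"{MACHINES} machines: {cluster_hours:.1f} hours")
```

At the lower rate of 30 MB/sec both times grow by about one sixth, so the figures should be read as orders of magnitude rather than exact values.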

Distributed systems
● Cluster of machines
● Distributed Storage
● Distributed Computing

History
● 2002-2004: Started as a sub-project of Apache Nutch.
● 2003-2004: Google published the Google File System (GFS) and MapReduce papers.
● 2004: Doug Cutting and Mike Cafarella implemented Google’s frameworks in Nutch.
● 2006: Yahoo hired Doug Cutting to work on Hadoop with a dedicated team.
● 2008: Hadoop became an Apache top-level project.

Assumptions
● Hardware Failure
● Streaming Data Access
● Large Data Sets
● Simple Coherency Model
● Moving Computation is Cheaper than Moving Data
● Software Platform Portability

Architecture
Hadoop is designed and built on two independent frameworks:
Hadoop = HDFS + MapReduce
HDFS: a reliable distributed file system that provides
high-throughput access to data.
● Each file is divided into 64 MB blocks (default in Hadoop 1.x; 128 MB in Hadoop 2.x)
● Each block is replicated 3 times (default)
MapReduce: is a framework for performing high
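
To make the HDFS block and replication defaults above concrete, the following sketch computes how many blocks a file occupies and how much raw disk space it consumes across the cluster. The function name `hdfs_footprint` is hypothetical, and the calculation assumes the 64 MB block size and replication factor of 3 quoted on the slide.

```python
from typing import Tuple

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB block size quoted on the slide
REPLICATION = 3                 # default replication factor quoted on the slide

def hdfs_footprint(file_size_bytes: int) -> Tuple[int, int]:
    """Return (number of blocks, raw bytes stored across the cluster)."""
    # Ceiling division: a partially filled last block still counts as a block.
    blocks = (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE
    # Every byte of the file is stored REPLICATION times.
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, raw_bytes

one_gb = 1024 ** 3
blocks, raw = hdfs_footprint(one_gb)
print(blocks, "blocks,", blocks * REPLICATION, "block replicas,", raw // one_gb, "GB raw")
```

A 1 GB file therefore splits into 16 blocks, stored as 48 block replicas that together take 3 GB of raw disk.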

Case Study: Word Count
Problem: We need to calculate word frequencies in billions of web pages
● Input: Files with one document per record
● Output: List of words and their frequencies across all documents

Architecture - Cont.
MapReduce Design
● Map
● Reduce
● Shuffle & Sort

Case Study: Map Phase
● Specify a map function that takes a key/value pair
key = document URL
value = document contents
● Output of map function is key/value pairs.
In our case, output(word, “1”) once per word in the document
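
The map phase described above can be sketched in plain Python. This is not the Hadoop API: `map_word_count` is a hypothetical name, words are split on whitespace and lowercased as a simplifying assumption, and the count is emitted as the integer 1 rather than the string “1”.

```python
from typing import Iterator, Tuple

def map_word_count(url: str, contents: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) once per word occurrence in the document."""
    # The key (document URL) is not needed for counting, only the contents.
    for word in contents.lower().split():
        yield word, 1

pairs = list(map_word_count("http://example.com/a",
                            "the quick fox jumps over the lazy dog"))
print(pairs)
```

Note that the mapper does no counting itself: a word that appears twice simply produces two separate (word, 1) pairs.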

Case Study: Reduce Phase
● The MapReduce library gathers together all pairs with the same key
(shuffle/sort)
● The reduce function combines the values for a key
In our case, compute the sum
● The output of reduce is a list of (word, total count) pairs
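
The two steps above can be sketched as follows. In a real job the framework performs the grouping; here `shuffle_and_sort` is a hypothetical in-memory stand-in for it, and `reduce_word_count` is the only part a programmer would actually write.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Group values by key and return the groups sorted by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_word_count(word: str, counts: List[int]) -> Tuple[str, int]:
    """Combine all values for one key; for word count, sum them."""
    return word, sum(counts)

mapper_output = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1)]
result = [reduce_word_count(w, c) for w, c in shuffle_and_sort(mapper_output)]
print(result)
```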

MapReduce Design
● Map: extract something you care about from each record.

MapReduce Design
● Reduce: aggregate, summarize, filter, or transform the mapper output.

MapReduce Design
Overall View:

MapReduce Design
● Shuffle & Sort: redirect the mapper output to the right reducer.
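
One common way to decide which reducer is “the right one” is hash partitioning: hash the key and take it modulo the number of reducers, so every pair with the same key lands on the same reducer. The sketch below illustrates the idea; `zlib.crc32` is used only as a stable stand-in hash, and the function name `partition` is hypothetical.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Pick the reducer index for a key: stable hash modulo reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

num_reducers = 3
mapper_output = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1)]

# Route every pair to the bucket of its reducer.
buckets = {r: [] for r in range(num_reducers)}
for key, value in mapper_output:
    buckets[partition(key, num_reducers)].append((key, value))

# Each reducer then sees its pairs sorted by key.
for r in buckets:
    buckets[r].sort()
print(buckets)
```

Because the reducer index depends only on the key, both ("the", 1) pairs always end up in the same bucket, whichever mapper produced them.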

MapReduce
Programmer specifies two primary methods:
map(k1, v1) → list(<k2, v2>)
reduce(k2, list<v2>) → list(<k3, v3>)
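
The two signatures can be written down as type hints around a tiny in-memory driver. This is a sketch of the programming model only, not of Hadoop's implementation; `run_mapreduce` and the log-level job used to exercise it are hypothetical examples (log processing being one of the uses listed earlier).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1, V1 = TypeVar("K1"), TypeVar("V1")
K2, V2 = TypeVar("K2"), TypeVar("V2")
K3, V3 = TypeVar("K3"), TypeVar("V3")

def run_mapreduce(
    records: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    """Apply mapper to every record, group by k2, then apply reducer per key."""
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    output: List[Tuple[K3, V3]] = []
    for k2 in sorted(groups):          # keys reach the reducer in sorted order
        output.extend(reducer(k2, groups[k2]))
    return output

# Example job: count log lines per severity level.
def level_mapper(line_no: int, line: str):
    yield line.split()[0], 1           # first token is assumed to be the level

def level_reducer(level: str, ones: List[int]):
    yield level, sum(ones)

logs = [(0, "INFO start"), (1, "ERROR disk failed"), (2, "INFO done")]
print(run_mapreduce(logs, level_mapper, level_reducer))
```

Only `level_mapper` and `level_reducer` are specific to the job; the driver stays the same for any other pair of functions with matching types.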

Case Study : Code Example
Map Function

Case Study : Code Example
Reduce Function

Hadoop is not only Java (Hadoop Streaming)
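
Hadoop Streaming lets any executable act as mapper or reducer: it reads text lines on standard input and writes tab-separated key/value lines to standard output. The sketch below shows the word count logic in that line-oriented style. The function names are hypothetical, and the last lines simulate the framework locally with `sorted`; in a real job each function would sit in its own script looping over `sys.stdin`.

```python
from itertools import groupby
from typing import Iterable, Iterator

def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Turn each input line into 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per word; relies on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation: sorting stands in for the shuffle & sort phase.
lines = ["the quick fox", "the lazy dog"]
mapped = sorted(mapper(lines))
reduced = list(reducer(mapped))
print(reduced)
```

Unlike the Java API, the streaming reducer receives a flat, sorted stream of lines rather than one list of values per key, which is why it has to detect key boundaries itself (here with `groupby`).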

Main Modules
● File System (HDFS)
⚪ Name Node
⚪ Secondary Name Node
⚪ Data Node
● MapReduce Framework
⚪ Job Tracker
⚪ Task Tracker

Access Procedure
● Read From HDFS
● Write to HDFS
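
The division of labour behind reads and writes can be illustrated with a toy in-memory model: the Name Node keeps only metadata (which blocks form a file and where the replicas live), while the Data Nodes hold the bytes. Everything below is a simplified assumption for illustration: the class and function names, the 4-byte block size, and the round-robin replica placement are not how HDFS is implemented.

```python
from typing import Dict, List

BLOCK_SIZE = 4      # tiny block size so the example stays readable
REPLICATION = 3

class DataNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}               # block id -> data

class NameNode:
    """Metadata only: file -> block ids, block id -> replica locations."""
    def __init__(self, datanodes: List[DataNode]) -> None:
        self.datanodes = datanodes
        self.files: Dict[str, List[str]] = {}
        self.locations: Dict[str, List[DataNode]] = {}
        self._next = 0

    def allocate_block(self, path: str) -> str:
        block_id = f"{path}#blk{len(self.files.setdefault(path, []))}"
        # Round-robin placement (assumption; real HDFS is rack-aware).
        n = len(self.datanodes)
        targets = [self.datanodes[(self._next + i) % n] for i in range(REPLICATION)]
        self._next += 1
        self.files[path].append(block_id)
        self.locations[block_id] = targets
        return block_id

def write_file(nn: NameNode, path: str, data: bytes) -> None:
    """Client asks the Name Node where to put each block, then sends the data."""
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = nn.allocate_block(path)
        for dn in nn.locations[block_id]:
            dn.blocks[block_id] = data[offset:offset + BLOCK_SIZE]

def read_file(nn: NameNode, path: str) -> bytes:
    """Client asks the Name Node for block locations, then reads from Data Nodes."""
    return b"".join(nn.locations[b][0].blocks[b] for b in nn.files[path])

nodes = [DataNode(f"dn{i}") for i in range(4)]
namenode = NameNode(nodes)
write_file(namenode, "/docs/a.txt", b"hello hadoop")
print(read_file(namenode, "/docs/a.txt"))
```

The point of the sketch is that file contents never pass through the Name Node: it only answers "which blocks, on which nodes", and the client talks to the Data Nodes for the data itself.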

Tasks distribution Procedure:
The JobTracker chooses the nodes that execute each task so that,
where possible, a task runs on a node that already stores its
input data (the data locality principle)
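
The data locality idea can be sketched as a small placement rule: prefer a node that already holds a replica of the task's input block, and fall back to any node with a free slot. The function `pick_node` and the slot bookkeeping are hypothetical simplifications; a real scheduler also weighs factors such as rack locality, which are left out here.

```python
from typing import Dict, List, Optional

def pick_node(block_replicas: List[str], free_slots: Dict[str, int]) -> Optional[str]:
    """Choose a node for a task, preferring nodes that store its input block."""
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            return node                # data-local: no block transfer needed
    for node, slots in free_slots.items():
        if slots > 0:
            return node                # remote: the block must be fetched
    return None                        # cluster is full; the task has to wait

free_slots = {"node1": 0, "node2": 1, "node3": 1}
print(pick_node(["node1", "node2"], free_slots))   # a replica holder is free
print(pick_node(["node1"], free_slots))            # falls back to a remote node
```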

Hadoop Modes
● Standalone
● Pseudo-Distributed
● Fully-Distributed

MapReduce 1 vs. MapReduce 2 (YARN)

