B.MONICA II M.SC COMPUTER SCIENCE

HADOOP FOUNDATION FOR ANALYTICS
BY
B.MONICA
II M.SC COMPUTER SCIENCE
BON SECOURS COLLEGE FOR WOMEN

HADOOP
• It is an open-source software framework licensed under the Apache v2 license.
• It includes:
- MapReduce: offline (batch) computing engine
- HDFS: Hadoop Distributed File System

HADOOP GOALS
• Scalable: it can reliably store and process petabytes of data.
• Economical: it distributes the data and processing across clusters of commodity computers.
• Efficient: it can process the data in parallel on the nodes where the data is located.
• Reliable: it automatically maintains multiple copies of data and redeploys computing tasks after failures.

USES FOR HADOOP
• Data-intensive text processing
• Assembly of large genomes
• Graph mining
• Machine learning and data mining
• Large-scale social network analysis

HADOOP: ASSUMPTIONS
• Hardware will fail.
• Applications need a write-once-read-many access model.
• EXAMPLE
Facebook:
- Uses Hadoop to store copies of internal log and dimension data sources
- Uses it as a source for reporting/analytics and machine learning
- 320-machine cluster with 2,560 cores and about 1.3 PB raw storage

HADOOP CONFIGURATION
conf/hdfs-site.xml:
<configuration>
  <property>
    <!-- number of copies HDFS keeps of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
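
As a rough illustration of how such a file is read, the sketch below parses a cleaned-up copy of the configuration above (with the property name written as dfs.replication) using Python's standard library. The helper names read_properties and usable_capacity are made up for this example and are not part of Hadoop; the capacity figure is only a first approximation that ignores all other overheads.

```python
import xml.etree.ElementTree as ET

# Cleaned-up copy of the hdfs-site.xml content from the slide.
HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
"""

def read_properties(xml_text):
    """Return the <name>/<value> pairs of a Hadoop-style config as a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        props[name] = value
    return props

def usable_capacity(raw_capacity, replication):
    """Approximate usable space when every block is stored `replication` times."""
    return raw_capacity / replication

props = read_properties(HDFS_SITE)
replication = int(props["dfs.replication"])
print("replication factor:", replication)
print("usable share of 300 units at factor 3:", usable_capacity(300, 3))
```

A replication factor of 1 keeps a single copy of each block, which is typical only for a single-node test setup; larger values trade raw storage for reliability.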

HISTORY OF HADOOP
• Hadoop was started by Doug Cutting to support two of his other well-known projects, Lucene and Nutch.
• Hadoop was inspired by the Google File System (GFS), which was described in a paper released by Google in 2003.
• Hadoop, originally called the Nutch Distributed File System (NDFS), split from Nutch in 2006 to become a sub-project of Lucene. At this point it was renamed Hadoop.

• EXAMPLE
Google search engine
• 2013: Hadoop 1.1.2 and Hadoop 2.0.3 alpha were released.
- Ambari, Cassandra and Mahout have been added.

• Hadoop is in use at most organizations that
handle big data:
o Yahoo!
o Facebook
o Amazon
o Netflix

APACHE MAP REDUCE
• A software framework for distributed processing of large data sets.
• The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks.
• It splits the input data set into independent chunks, which the map tasks process in parallel.
• The MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks.
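
The split, map, sort and reduce steps described above can be imitated in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop's API: the function names and the sample lines are invented for illustration, and a real cluster would run the map calls on different nodes.

```python
from collections import defaultdict

def split_input(lines, chunk_size):
    """Split the input data set into independent chunks."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

def map_chunk(chunk):
    """Map task: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def reduce_word(word, counts):
    """Reduce task: sum all counts emitted for one word."""
    return word, sum(counts)

def word_count(lines, chunk_size=2):
    chunks = split_input(lines, chunk_size)
    # Each chunk is mapped independently; a cluster would do this in parallel.
    map_outputs = [pair for chunk in chunks for pair in map_chunk(chunk)]
    # Sort the map outputs by key so equal words end up next to each other.
    map_outputs.sort(key=lambda pair: pair[0])
    grouped = defaultdict(list)
    for word, count in map_outputs:
        grouped[word].append(count)
    return dict(reduce_word(w, c) for w, c in grouped.items())

# Made-up sample input.
lines = ["big data big cluster", "data node", "name node"]
print(word_count(lines))
```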

MAP REDUCE DATAFLOW
• An input reader
• A Map function
• A partition function
• A compare function
• A Reduce function
• An output writer
EXAMPLE:
JOB TRACKER
TASK TRACKER
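
To make the six dataflow components listed above more concrete, here is a small single-process sketch with one Python function per component. Everything in it is an assumption made for illustration: the function names, the "city,temperature" record format and the sample values do not come from Hadoop or from the slides.

```python
import zlib

def input_reader(text):
    """Input reader: turn raw text into (line_number, line) records."""
    return list(enumerate(text.strip().splitlines()))

def map_fn(_line_no, line):
    """Map: parse 'city,temperature' and emit (city, temperature)."""
    city, temp = line.split(",")
    yield city.strip(), int(temp)

def partition_fn(key, num_reducers):
    """Partition: choose the reducer for a key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def compare_key(pair):
    """Compare: sort key that orders pairs inside each partition."""
    return pair[0]

def reduce_fn(key, values):
    """Reduce: keep the highest temperature seen for a city."""
    return key, max(values)

def output_writer(results):
    """Output writer: format the final pairs as tab-separated lines."""
    return [f"{key}\t{value}" for key, value in results]

def run_job(text, num_reducers=2):
    partitions = [[] for _ in range(num_reducers)]
    for line_no, line in input_reader(text):
        for key, value in map_fn(line_no, line):
            partitions[partition_fn(key, num_reducers)].append((key, value))
    results = []
    for part in partitions:
        part.sort(key=compare_key)  # group equal keys together
        grouped = {}
        for key, value in part:
            grouped.setdefault(key, []).append(value)
        results.extend(reduce_fn(k, v) for k, v in grouped.items())
    return output_writer(sorted(results))

# Made-up sample records.
SAMPLE = """
chennai,34
madurai,36
chennai,38
trichy,35
madurai,33
"""
print("\n".join(run_job(SAMPLE)))
```

Because the partition function always sends the same key to the same reducer, the final result does not depend on how many reducers are used.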

MAP REDUCE-FAULT TOLERANCE
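• Worker failure: the master pings every worker periodically. A worker that does not respond within a set time is marked as failed and its tasks are rescheduled on other workers.
• Master failure: the master can write periodic checkpoints of its data structures, so that a new master can be started from the last checkpoint.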
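
The two failure cases above can be sketched as a toy master that records ping replies and can save its state. This is a simplified model under stated assumptions, not Hadoop code: the class, its method names and the timeout value are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
import json

class Master:
    """Toy master that tracks when each worker last answered a ping."""

    def __init__(self, workers, timeout):
        self.timeout = timeout
        self.last_seen = {worker: 0.0 for worker in workers}

    def record_ping_reply(self, worker, now):
        """Remember the time at which a worker last responded."""
        self.last_seen[worker] = now

    def failed_workers(self, now):
        """Workers whose last reply is older than the timeout."""
        return sorted(
            worker for worker, seen in self.last_seen.items()
            if now - seen > self.timeout
        )

    def checkpoint(self):
        """Serialise the master's data structures to a string."""
        return json.dumps({"timeout": self.timeout, "last_seen": self.last_seen})

    @classmethod
    def restore(cls, blob):
        """Start a new master from the last checkpoint."""
        state = json.loads(blob)
        master = cls(state["last_seen"].keys(), state["timeout"])
        master.last_seen = dict(state["last_seen"])
        return master

master = Master(["worker1", "worker2"], timeout=10.0)
master.record_ping_reply("worker1", now=12.0)
print(master.failed_workers(now=15.0))   # worker2 never replied

saved = master.checkpoint()
new_master = Master.restore(saved)
print(new_master.failed_workers(now=15.0))
```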

JOB TRACKER
• The JobTracker tracks MapReduce jobs in Hadoop.
• The JobTracker performs the following actions in Hadoop:
- It accepts MapReduce jobs from client applications
- Talks to the NameNode to determine the data location
- Locates available TaskTracker nodes
- Submits the work to the chosen TaskTracker nodes
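
The JobTracker steps listed above can be imitated with plain dictionaries. The sketch below is a simplification under stated assumptions: the NameNode is modelled as a dictionary from file path to block locations, each TaskTracker as a count of free slots, and all node names, block ids and function names are hypothetical. It also assumes that a node already holding the data is preferred when it has a free slot.

```python
def locate_blocks(name_node, input_file):
    """Ask the simulated NameNode which nodes hold each block of the file."""
    return name_node[input_file]

def choose_task_tracker(block_hosts, free_slots):
    """Prefer a TaskTracker with a free slot on a node that holds the data."""
    for host in block_hosts:
        if free_slots.get(host, 0) > 0:
            return host
    # Otherwise fall back to any node that still has a free slot.
    for host, slots in sorted(free_slots.items()):
        if slots > 0:
            return host
    return None

def submit_job(name_node, free_slots, input_file):
    """Assign one task per block and return (block_id, tracker) pairs."""
    assignments = []
    for block_id, hosts in locate_blocks(name_node, input_file):
        tracker = choose_task_tracker(hosts, free_slots)
        if tracker is None:
            raise RuntimeError("no TaskTracker has a free slot")
        free_slots[tracker] -= 1  # the chosen tracker is now busy
        assignments.append((block_id, tracker))
    return assignments

# Made-up cluster state.
name_node = {
    "/logs/day1": [
        ("blk_1", ["node1", "node2"]),
        ("blk_2", ["node2", "node3"]),
        ("blk_3", ["node1"]),
    ]
}
free_slots = {"node1": 1, "node2": 1, "node3": 1, "node4": 1}
assignments = submit_job(name_node, free_slots, "/logs/day1")
print(assignments)
```

In this run the third block cannot be placed on node1, which is already busy, so it falls back to another node with a free slot.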

OTHER TOOLS
• Hive
- Hadoop processing with SQL
• Pig
- Hadoop processing with scripting
• Cascading
- Pipe and Filter processing model
• HBase
- Database model built on top of Hadoop
• Flume
- Designed for large-scale data movement
