Introduction to Apache Hadoop

Agenda
 Need for a new processing platform (BigData)
 Origin of Hadoop
 What Hadoop is & what it is not
 Hadoop architecture
 Hadoop components
(Common/HDFS/MapReduce)
 Hadoop ecosystem
 When should we go for Hadoop?
 Real world use cases
 Questions

Need for a new processing
platform (Big Data)
 What is Big Data?
- Twitter (over ~7 TB/day)
- Facebook (over ~10 TB/day)
- Google (over ~20 PB/day)
 Where does it come from?
 Why take so much pain?
 - Information is everywhere, but where is the
 knowledge?
 Existing systems (vertical scalability)
 Why Hadoop (horizontal scalability)?
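To put vertical versus horizontal scalability into numbers, here is a rough back-of-the-envelope sketch. It uses the 7 TB/day figure from this slide; the 100 MB/s per-node read rate and the class and method names are assumptions made for illustration, not measurements.

```java
// Back-of-the-envelope sketch: how long does one full scan of a day's data take?
// The 100 MB/s per-node read rate is an assumed round number, not a measurement.
final class ScanTimeEstimate {
    static final long MB_PER_TB = 1_000_000L; // decimal units, for simplicity

    /** Seconds to read `terabytes` of data when the work is split evenly over `nodes` machines. */
    static double scanSeconds(double terabytes, int nodes, double mbPerSecondPerNode) {
        if (nodes <= 0 || mbPerSecondPerNode <= 0) {
            throw new IllegalArgumentException("nodes and throughput must be positive");
        }
        return terabytes * MB_PER_TB / (nodes * mbPerSecondPerNode);
    }

    public static void main(String[] args) {
        double dailyTb = 7.0;        // figure quoted on the slide
        double nodeMbPerSec = 100.0; // assumption
        System.out.printf("1 node:    %.1f hours%n", scanSeconds(dailyTb, 1, nodeMbPerSec) / 3600);
        System.out.printf("100 nodes: %.1f minutes%n", scanSeconds(dailyTb, 100, nodeMbPerSec) / 60);
    }
}
```

Under these assumptions a single machine needs roughly 19 hours just to read one day of data, while 100 machines reading in parallel need under 12 minutes. Real clusters do not scale perfectly linearly, so treat this as an upper bound on the benefit.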

Origin of Hadoop
 Seminal whitepapers by Google in 2004
on a new programming paradigm to
handle data at internet scale
 Hadoop started as a part of the Nutch
project.
 In Jan 2006 Doug Cutting started working
on Hadoop at Yahoo
 Factored out of Nutch in Feb 2006
 First release of Apache Hadoop in
September 2007
 Jan 2008: Hadoop became a top-level
Apache project

Hadoop distributions

 Amazon
 Cloudera
 MapR
 Hortonworks
 Microsoft Windows Azure
 IBM InfoSphere BigInsights
 Datameer
 EMC Greenplum HD Hadoop distribution
 Hadapt

What is Hadoop ?
 Flexible infrastructure for large-scale
computation & data processing on a
network of commodity hardware
 Completely written in Java
 Open source & distributed under
Apache license
 Hadoop Common, HDFS &
MapReduce

What Hadoop is not

 A replacement for existing data
warehouse systems
 A file system
 An online transaction
processing (OLTP) system
 Replacement of all
programming logic
 A database

Hadoop architecture
 High-level view: NameNode (NN), DataNode (DN),
JobTracker (JT), TaskTracker (TT)

HDFS (Hadoop Distributed File
System)
 Hadoop distributed file system
 Default storage for the Hadoop cluster
 NameNode/DataNode
 The file system namespace (similar to our local
file system)
 Master/slave architecture (1 master, n slaves)
 Virtual, not physical
 Provides configurable replication (set by the user)
 Data is stored as blocks (64 MB by default, but
configurable) spread across the nodes
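The block and replication bullets above can be made concrete with a little arithmetic. This is a minimal sketch in plain Java with no Hadoop dependency. The 64 MB block size comes from the slide; the replication factor of 3 is an assumed value for illustration, and the class and method names are hypothetical.

```java
// Sketch of HDFS storage arithmetic; not part of any Hadoop API.
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;
    static final long DEFAULT_BLOCK_BYTES = 64 * MB; // default quoted on the slide
    static final int ASSUMED_REPLICATION = 3;        // assumption for this example

    /** Number of blocks a file occupies; the last block may be only partly filled. */
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    /** Raw bytes used across the cluster when every block is stored `replication` times. */
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long fileBytes = 1024 * MB; // a 1 GB file
        long blocks = blockCount(fileBytes, DEFAULT_BLOCK_BYTES);
        System.out.println("blocks: " + blocks);                               // blocks: 16
        System.out.println("block replicas: " + blocks * ASSUMED_REPLICATION); // block replicas: 48
        System.out.println("raw MB: " + rawBytes(fileBytes, ASSUMED_REPLICATION) / MB); // raw MB: 3072
    }
}
```

So a 1 GB file becomes 16 blocks, and with three copies of each block the cluster stores 48 block replicas, or about 3 GB of raw data.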

Rack awareness

Large Hadoop clusters are typically arranged in racks, and
network traffic between nodes within the same rack is much
more desirable than network traffic across racks. In addition,
the NameNode tries to place replicas of a block on multiple
racks for improved fault tolerance. A default installation
assumes all the nodes belong to the same rack.
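The idea of spreading replicas over more than one rack can be sketched as follows. This is a deliberately simplified, hypothetical placement rule written for illustration; it is not the NameNode's actual policy, and the class, method, and node names are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of rack-aware replica placement (not Hadoop's real algorithm).
final class RackAwarePlacementSketch {
    /**
     * Picks up to `replicas` distinct nodes for one block.
     * nodeToRack maps a node name to its rack id; `writer` is the node writing the block.
     */
    static List<String> chooseNodes(Map<String, String> nodeToRack, String writer, int replicas) {
        if (replicas < 1 || !nodeToRack.containsKey(writer)) {
            throw new IllegalArgumentException("need replicas >= 1 and a known writer node");
        }
        List<String> chosen = new ArrayList<>();
        chosen.add(writer); // first copy stays local to avoid network traffic
        String writerRack = nodeToRack.get(writer);
        if (replicas > 1) {
            // Second copy: the first node found on a different rack, if any exists.
            for (Map.Entry<String, String> e : nodeToRack.entrySet()) {
                if (!e.getValue().equals(writerRack)) {
                    chosen.add(e.getKey());
                    break;
                }
            }
        }
        // Remaining copies: any unused nodes, in iteration order.
        for (String node : nodeToRack.keySet()) {
            if (chosen.size() >= replicas) break;
            if (!chosen.contains(node)) chosen.add(node);
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("dn1", "rackA");
        cluster.put("dn2", "rackA");
        cluster.put("dn3", "rackB");
        cluster.put("dn4", "rackB");
        System.out.println(chooseNodes(cluster, "dn1", 3)); // [dn1, dn3, dn2]
    }
}
```

If every node reports the same rack, as in a default installation, the rule falls back to picking any distinct nodes, so the block survives a node failure but not the loss of the whole rack.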

MapReduce
 Framework provided by Hadoop to process
large amounts of data across a cluster of
machines in parallel
 Comprises three classes:
Mapper class
Reducer class
Driver class
 TaskTracker/JobTracker
 The reduce phase starts only after all mappers
are done
 Takes (k,v) pairs and emits (k,v) pairs

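// Word-count mapper: emits (word, 1) for every token of each input line.
// Nested inside the job's outer class. Needs java.io.IOException,
// java.util.StringTokenizer, org.apache.hadoop.io.IntWritable,
// org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text and
// org.apache.hadoop.mapreduce.Mapper.
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one); // one (word, 1) pair per token
        }
    }
}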
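To see how the mapper's (k,v) output reaches the reducer, the whole flow can be imitated in memory. The following is a plain-Java sketch with no Hadoop dependency. It only mimics the map, shuffle, and reduce steps on one machine; the class and method names are hypothetical and are not Hadoop APIs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory imitation of a word-count job: map -> shuffle (group by key) -> reduce.
final class WordCountSimulation {
    /** Map step: one input line in, a list of (word, 1) pairs out. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    /** Reduce step: one key plus all of its values in, one total out. */
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Shuffle: collect every mapped value under its key (TreeMap keeps keys sorted).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        // Reduce runs only after every line has been mapped and grouped.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not", "to be"))); // {be=2, not=1, or=1, to=2}
    }
}
```

In a real job the framework performs the grouping step across the cluster, and the driver class wires the mapper and reducer together. Here the grouping is a single sorted map so the data flow is easy to follow.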

Modes of operation

 Standalone mode

 Pseudo-distributed mode

 Fully-distributed mode

When should we go for
Hadoop?
 Data volume is very large
 Processes are independent
 Online analytical processing
(OLAP)
 Better scalability
 Parallelism

 Unstructured data

Real world use cases

Clickstream analysis
Sentiment analysis
Recommendation engines
Ad Targeting
Search Quality

 What I have been doing…
 Seismic Data Management & Processing
 WITSML Server & Drilling Analytics
 Orchestra Permission Map management for
Search
 SDIS (just started)
 Next steps: Get your hands dirty with
code in a workshop on …
 Hadoop Configuration
 HDFS Data loading
 Map Reduce programming
 HBase

 Hive & Pig
