Lecture 2 part 3
 What is Hadoop, what Hadoop is not, and Hadoop assumptions
 What is Rack, Cluster, Nodes and Commodity Hardware?
 HDFS - Hadoop Distributed File System
 Using HDFS commands
 MapReduce
 Higher-level languages over Hadoop: Pig and Hive
 HBase – Overview
 HCatalog
 What is Hadoop and its components?
 What is the commodity server/Hardware?
 Why HDFS?
 What is the responsibility of NameNode in HDFS?
 What is Fault Tolerance?
 What is the default replication factor in HDFS?
 What is the heartbeat in HDFS?
 What are JobTracker and TaskTracker?
 Why MapReduce programming model?
 Where do we have Data Locality in MapReduce?
 Why do we need to use Pig and Hive?
 What is the difference between HBase and HCatalog?
• At Google:
• Index building for Google Search
• Article clustering for Google News
• Statistical machine translation
• At Yahoo!:
• Index building for Yahoo! Search
• Spam detection for Yahoo! Mail
• At Facebook:
• Data mining
• Ad optimization
• Spam detection
The MapReduce algorithm contains two important tasks (Map and Reduce tasks)
• The Map task: converts a set of input data into intermediate key-value pairs.
• The Reduce task: combines those intermediate pairs into a smaller, aggregated set of key-value pairs.
Map input (set of data): "The quick Brown fox The fox ate"
Map output (key-value pairs): (The, 1) (quick, 1) (Brown, 1) (Fox, 1) (The, 1) (Fox, 1) (Ate, 1)
Sorted intermediate pairs (Reduce input): (Ate, 1) (Brown, 1) (Fox, 1) (Fox, 1) (quick, 1) (The, 1) (The, 1)
Reduce output: (Ate, 1) (Brown, 1) (Fox, 2) (quick, 1) (The, 2)
• Data type: key-value records
• Map function: (K_in, V_in) → list(K_inter, V_inter)
• Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)
def mapper(line):
    foreach word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
Word count data flow: Input → Map → Shuffle & Sort → Reduce → Output
Input (three splits): "the quick brown fox" / "the fox ate the mouse" / "how now brown cow"
Each split is handled by one Map task; the intermediate pairs are shuffled and sorted by key across two Reduce tasks.
Map outputs: (the, 1) (quick, 1) (brown, 1) (fox, 1) | (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1) | (how, 1) (now, 1) (brown, 1) (cow, 1)
Reduce outputs: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3) and (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
• Single master controls job execution on multiple slaves
• Mappers preferentially placed on same node or same rack as their
input block
• Minimizes network usage
• Mappers save outputs to local disk before serving them to
reducers
• Allows recovery if a reducer crashes
• Allows having more reducers than nodes
• A combiner is a local aggregation function for repeated keys produced by the same map
• Works for associative functions like sum, count, max
• Decreases the size of intermediate data
• Example: map-side aggregation for Word Count:
def combiner(key, values):
    output(key, sum(values))
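For instance, here is a small plain-Python sketch (the data is made up to match the running example) of what the combiner does to the output of the map task that processed "the fox ate the mouse", before anything is shuffled over the network:

map_output = [("the", 1), ("fox", 1), ("ate", 1), ("the", 1), ("mouse", 1)]
combined = {}
for key, value in map_output:
    combined[key] = combined.get(key, 0) + value  # local per-key sum
# combined == {"the": 2, "fox": 1, "ate": 1, "mouse": 1}
# four intermediate pairs instead of five are sent to the reducers

This is why, in the diagram below, the second map task emits (the, 2) instead of two separate (the, 1) pairs.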
Word count with a combiner: Input → Map & Combine → Shuffle & Sort → Reduce → Output
Input (three splits): "the quick brown fox" / "the fox ate the mouse" / "how now brown cow"
Map & Combine outputs: (the, 1) (quick, 1) (brown, 1) (fox, 1) | (the, 2) (fox, 1) (ate, 1) (mouse, 1) | (how, 1) (now, 1) (brown, 1) (cow, 1)
Reduce outputs: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3) and (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
Input Phase − Here we have a Record Reader that
translates each record in an input file and sends the
parsed data to the mapper in the form of key-value pairs.
Map Phase − Map is a user-defined function, which takes
a series of key-value pairs and processes each one of them
to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by
the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that
groups similar data from the map phase into identifiable
sets. It takes the intermediate keys from the mapper as
input and applies a user-defined code to aggregate the
values in a small scope of one mapper. It is not a part of
the main MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the
Shuffle and Sort step. It downloads the grouped key-value
pairs onto the local machine, where the Reducer is
running. The individual key-value pairs are sorted by key
into a larger data list. The data list groups the equivalent
keys together so that their values can be iterated easily in
the Reducer task.
Reducer − The Reducer takes the grouped key-value
paired data as input and runs a Reducer function on each
one of them. Here, the data can be aggregated, filtered,
and combined in a number of ways, which can require a
wide range of processing. Once the execution is over, it
gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output
formatter that translates the final key-value pairs from
the Reducer function and writes them onto a file using a
record writer.
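To make these phases concrete, here is a minimal single-machine Python sketch of the same pipeline: record reader, mapper, combiner, shuffle and sort, and reducer, applied to the word-count example. All function and variable names are illustrative; a real Hadoop job implements these hooks through the MapReduce API rather than as plain functions.

from itertools import groupby
from operator import itemgetter

def record_reader(text):
    # Input phase: turn the raw input into (key, value) records, here (line offset, line)
    for offset, line in enumerate(text.splitlines()):
        yield offset, line

def mapper(key, value):
    # Map phase: emit zero or more intermediate key-value pairs per record
    for word in value.split():
        yield word.lower(), 1

def combiner(key, values):
    # Optional local aggregation of repeated keys from a single mapper
    yield key, sum(values)

def reducer(key, values):
    # Reduce phase: aggregate all values that share a key
    yield key, sum(values)

def run_job(text):
    intermediate = []
    for key, value in record_reader(text):
        pairs = sorted(mapper(key, value))            # map one record
        for k, group in groupby(pairs, key=itemgetter(0)):
            intermediate.extend(combiner(k, [v for _, v in group]))
    intermediate.sort(key=itemgetter(0))              # shuffle and sort by key
    output = []
    for k, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(k, [v for _, v in group]))
    return output                                      # output phase: final key-value pairs

print(run_job("the quick brown fox\nthe fox ate the mouse\nhow now brown cow"))
# [('ate', 1), ('brown', 2), ('cow', 1), ('fox', 2), ('how', 1), ('mouse', 1), ('now', 1), ('quick', 1), ('the', 3)]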
Word Count in Java
public class MapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable ONE = new IntWritable(1);
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> out,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
out.collect(new Text(itr.nextToken()), ONE);
}
}
}
public class ReduceClass extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> out,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
out.collect(key, new IntWritable(sum));
}
}
Word Count in Java
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setMapperClass(MapClass.class);
conf.setCombinerClass(ReduceClass.class);
conf.setReducerClass(ReduceClass.class);
FileInputFormat.setInputPaths(conf, args[0]);
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setOutputKeyClass(Text.class); // output keys are words (strings)
conf.setOutputValueClass(IntWritable.class); // output values are counts
JobClient.runJob(conf);
}
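Packaged into a jar, a driver like this is typically launched with the hadoop command; the jar name, class name, and paths below are placeholders, not taken from the original slides:
hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output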
# Hadoop Streaming mapper: reads raw lines from stdin, emits word<TAB>1
import sys
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t" + "1")

# Hadoop Streaming reducer: reads word<TAB>count lines from stdin, sums counts per word
import sys
counts = {}
for line in sys.stdin:
    word, count = line.split("\t")
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print(word + "\t" + str(count))
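Because streaming scripts simply read stdin and write stdout, a quick local sanity check is possible with an ordinary shell pipeline before submitting the job to Hadoop (the file names here are placeholders):
cat input.txt | python mapper.py | sort | python reducer.py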
A real-world example helps convey the power of MapReduce.
Twitter receives around 500 million tweets per day, which is roughly
6,000 tweets per second, and it manages this stream of tweets with
the help of MapReduce.
 Many parallel algorithms can be expressed by a
series of MapReduce jobs
 But MapReduce is fairly low-level: must think
about keys, values, partitioning, etc
 Can we capture common “job building blocks”?
 Started at Yahoo! Research
 Runs about 30% of Yahoo!’s jobs
 Features:
• Expresses sequences of MapReduce jobs
• Data model: nested “bags” of items
• Provides relational (SQL) operators (JOIN, GROUP BY, etc)
• Easy to plug in Java functions
• Pig Pen development environment for Eclipse
• Suppose you have user data in
one file, page view data in
another, and you need to find
the top 5 most visited pages by
users aged 18 - 25.
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
In MapReduce
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Notice how naturally the components of the job translate into Pig Latin: each step in the dataflow (Load Users, Load Pages, Filter by age, Join on name, Group on url, Count clicks, Order by clicks, Take top 5) corresponds to one statement in the script (Users = load …, Pages = load …, Filtered = filter …, Joined = join …, Grouped = group …, Summed = … count() …, Sorted = order …, Top5 = limit …), and Pig compiles the whole script into a short series of MapReduce jobs (Job 1 … Job 3 in the original diagram).
 Developed at Facebook
 Used for majority of Facebook jobs
 “Relational database” built on Hadoop
 Maintains list of table schemas
 SQL-like query language (HQL)
 Can call Hadoop Streaming scripts from HQL
 Supports table partitioning, clustering, complex
data types, some optimizations
• Find the top 5 pages visited by users aged 18-25 (see the HiveQL sketch below)
• Filter page views through a Python script (second query in the sketch below)
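The slide does not reproduce the queries themselves; the following HiveQL sketch shows plausible versions of both. The table names, column names, and script name are assumptions for illustration, not taken from the original.

-- Top 5 pages visited by users aged 18-25
SELECT p.url, COUNT(1) AS clicks
FROM users u JOIN page_views p ON (u.name = p.user)
WHERE u.age >= 18 AND u.age <= 25
GROUP BY p.url
ORDER BY clicks DESC
LIMIT 5;

-- Filter page views through a Python script (registered beforehand with ADD FILE)
ADD FILE filter_views.py;
SELECT TRANSFORM(p.user, p.url)
USING 'python filter_views.py'
AS (user, url)
FROM page_views p;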
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed
only in a sequential manner. That means one has to search the entire
dataset even for the simplest of jobs. A different kind of system is
needed when an application must reach any point in the data in
constant time (random access).
What is HBase?
HBase is a distributed column-oriented database built on top of the
Hadoop file system. It is designed to provide quick random access to
huge amounts of structured data. It leverages the fault tolerance
provided by the Hadoop File System (HDFS).
Logical layout of an HBase table: each row is identified by a Rowid (1, 2, 3, …) and spans several column families, and each column family holds its own set of columns (col1, col2, col3).
Features of HBase
• HBase is linearly scalable.
• It has automatic failure
support.
• It provides consistent reads and writes.
• It integrates with Hadoop,
both as a source and a
destination.
• It has an easy-to-use Java API for clients.
• It provides data replication
across clusters.
Where to Use HBase
• Apache HBase is used to have
random, real-time read/write
access to Big Data.
• It hosts very large tables on top of
clusters of commodity hardware.
• Apache HBase is a non-relational
database modeled after Google's
Bigtable. Just as Bigtable works on
top of the Google File System,
Apache HBase works on top of
Hadoop and HDFS.
Applications of HBase
• It is used for write-heavy
applications.
• HBase is used whenever
we need to provide fast
random access to
available data.
• Companies such as
Facebook, Twitter,
Yahoo, and Adobe use
HBase internally.
HDFS vs. HBase: HDFS is a distributed file system suited to sequential, batch access to very large files, whereas HBase is a database layered on top of HDFS that adds fast random reads and writes to individual rows.
HBase vs. RDBMS:
• HBase is schema-less; it has no fixed-column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
• HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
• HBase has no transactions. An RDBMS is transactional.
• HBase holds de-normalized data. An RDBMS holds normalized data.
• HBase is good for semi-structured as well as structured data. An RDBMS is good for structured data.
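As a concrete illustration of random reads and writes, here is a minimal sketch using the third-party happybase Python client, which talks to HBase through its Thrift server. The host, table name, column family, and row key are assumptions for the example, not part of the original material.

import happybase

# Connect to the HBase Thrift server (host and port are placeholders)
connection = happybase.Connection('localhost', port=9090)

# Assume a table 'users' with a column family 'info' already exists
table = connection.table('users')

# Random write: store a single row keyed by user id
table.put(b'user-001', {b'info:name': b'Alice', b'info:age': b'23'})

# Random read: fetch that one row directly by key, without scanning the dataset
row = table.row(b'user-001')
print(row[b'info:name'])

The same operations are available through HBase's native Java client; the point is that individual rows can be read and written in place, which HDFS alone does not offer.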
HCatalog provides a relational table
abstraction layer over HDFS. Using the
HCatalog abstraction layer allows query tools
such as Pig and Hive to treat the data in a
familiar relational architecture. It also permits
easier exchange of data between HDFS
storage and the client tools used to present the
data for analysis, using familiar data exchange
application programming interfaces (APIs) such
as Java Database Connectivity (JDBC) and
Open Database Connectivity (ODBC).