Chaining
Simple idea: sequential calls through JobClient 
• Not every problem can be solved with a MapReduce 
program, but fewer still are those which can be solved 
with a single MapReduce job. 
• Many problems can be solved with MapReduce by 
writing several MapReduce steps that run in series to 
accomplish a goal: 
– Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3... 
• You can easily chain jobs together in this fashion by 
writing multiple driver methods, one for each job.
JobClient in action 
• Call the first driver method, which 
uses JobClient.runJob() to run the job and wait for it to 
complete. 
• When that job has completed, call the next driver 
method, which creates a new JobConf object referring 
to different Mapper and Reducer classes, etc. 
• The first job in the chain should write its output to a 
path which is then used as the input path for the 
second job. 
• This process can be repeated for as many jobs as are 
necessary to arrive at a complete solution to the 
problem.
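
The driver-per-job pattern above can be sketched in plain Java. This is a model only, not Hadoop code: the in-memory FS map stands in for HDFS paths, and the names runWordCountJob, runGroupByCountJob and the paths are hypothetical. What it shows is the one rule that matters for chaining: the second driver runs only after the first returns, and it reads the path the first one wrote.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of sequential job chaining; NOT Hadoop code.
class SequentialChainSketch {
    // Hypothetical in-memory "file system": path -> lines stored at that path.
    static final Map<String, List<String>> FS = new HashMap<>();

    // Driver for job 1: counts words under inputPath, writes "word<TAB>count" lines.
    static void runWordCountJob(String inputPath, String outputPath) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : FS.get(inputPath)) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        List<String> out = new ArrayList<>();
        counts.forEach((word, count) -> out.add(word + "\t" + count));
        FS.put(outputPath, out);
    }

    // Driver for job 2: groups words by count, writes "count<TAB>word,word" lines.
    static void runGroupByCountJob(String inputPath, String outputPath) {
        Map<Integer, List<String>> byCount = new TreeMap<>();
        for (String line : FS.get(inputPath)) {
            String[] parts = line.split("\t");
            byCount.computeIfAbsent(Integer.parseInt(parts[1]), k -> new ArrayList<>())
                   .add(parts[0]);
        }
        List<String> out = new ArrayList<>();
        byCount.forEach((count, words) -> out.add(count + "\t" + String.join(",", words)));
        FS.put(outputPath, out);
    }

    // Runs the two drivers in series; job 1's output path is job 2's input path.
    static List<String> runChain(List<String> input) {
        FS.put("/in", input);
        runWordCountJob("/in", "/tmp/job1-out");
        runGroupByCountJob("/tmp/job1-out", "/out");
        return FS.get("/out");
    }

    public static void main(String[] args) {
        System.out.println(runChain(Arrays.asList("a b a", "c b a")));
    }
}
```

In a real chain each driver would build its own JobConf and block in JobClient.runJob(); the intermediate path plays the same role as /tmp/job1-out here.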
JobControl 
• Hadoop provides another mechanism for managing batches of 
jobs with dependencies between jobs. 
• Rather than submitting a JobConf to the JobClient's 
runJob() or submitJob() methods, create an 
org.apache.hadoop.mapred.jobcontrol.Job object to 
represent each job. 
• A Job takes a JobConf object as its constructor argument. 
• Jobs can depend on one another through the use of 
the addDependingJob() method. 
• The code x.addDependingJob(y) says that Job x cannot 
start until y has successfully completed.
Creating dependencies using JobControl 
• Dependency information cannot be added to a 
job after it has already been started. 
• Given a set of jobs, these can be passed to an 
instance of the JobControl class. 
• JobControl can receive individual jobs via 
the addJob() method, or a collection of jobs 
via addJobs(). 
• JobControl implements Runnable; the client runs it in 
a separate thread, which launches the jobs.
JobControl special features 
• Individual jobs will be launched when their 
dependencies have all successfully completed 
and when the MapReduce system as a whole 
has resources to execute the jobs. 
• The JobControl interface allows you to query it 
to retrieve the state of individual jobs, as well 
as the list of jobs waiting, ready, running, and 
finished. 
• The job submission process does not begin until 
the run() method of the JobControl object is 
called.
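
The dependency rules above can be modeled in a few lines of plain Java. This is a sketch of the idea, not the Hadoop JobControl API: SketchJob, the State enum and launchOrder are hypothetical names, jobs run one at a time in the calling thread, and a job whose dependency fails is left waiting. It shows that a job is launched only once every job it depends on has succeeded, that dependencies cannot be added after a job has started, and that nothing is launched before run() is called.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of dependency-ordered launching; NOT the Hadoop JobControl API.
class JobControlSketch {
    enum State { WAITING, SUCCESS, FAILED }

    static final class SketchJob {
        final String name;
        final Runnable work;
        final List<SketchJob> dependsOn = new ArrayList<>();
        State state = State.WAITING;
        boolean started = false;

        SketchJob(String name, Runnable work) {
            this.name = name;
            this.work = work;
        }

        // x.addDependingJob(y): x cannot start until y has succeeded.
        // Returns false if this job has already been started.
        boolean addDependingJob(SketchJob y) {
            if (started) {
                return false;
            }
            return dependsOn.add(y);
        }

        // Ready means: not yet run, and every dependency has succeeded.
        boolean isReady() {
            if (state != State.WAITING) {
                return false;
            }
            for (SketchJob dependency : dependsOn) {
                if (dependency.state != State.SUCCESS) {
                    return false;
                }
            }
            return true;
        }
    }

    final List<SketchJob> jobs = new ArrayList<>();
    final List<String> launchOrder = new ArrayList<>();

    void addJob(SketchJob job) {
        jobs.add(job);
    }

    // Keeps sweeping the job list until a full pass launches nothing.
    void run() {
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (SketchJob job : jobs) {
                if (!job.isReady()) {
                    continue;
                }
                job.started = true;
                launchOrder.add(job.name);
                try {
                    job.work.run();
                    job.state = State.SUCCESS;
                } catch (RuntimeException e) {
                    job.state = State.FAILED;
                }
                progressed = true;
            }
        }
    }

    // Three jobs added in reverse order; dependencies decide the launch order.
    static List<String> demo() {
        SketchJob extract = new SketchJob("extract", () -> { });
        SketchJob clean = new SketchJob("clean", () -> { });
        SketchJob report = new SketchJob("report", () -> { });
        clean.addDependingJob(extract);
        report.addDependingJob(clean);

        JobControlSketch control = new JobControlSketch();
        control.addJob(report);
        control.addJob(clean);
        control.addJob(extract);
        control.run();
        return control.launchOrder;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The jobs are registered as report, clean, extract, but the recorded launch order follows the dependency chain instead of the registration order.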
ChainMapper 
• The ChainMapper class allows the use of multiple Mapper 
classes within a single Map task. 
• The Mapper classes are invoked in a chained (or piped) 
fashion: the output of the first becomes the input of 
the second, and so on until the last Mapper, whose 
output is written to the task's output. 
• The key functionality of this feature is that the 
Mappers in the chain do not need to be aware that 
they are executed in a chain. 
• This enables having reusable specialized Mappers that 
can be combined to perform composite operations 
within a single task.
ChainMapper 
• Special care has to be taken when creating chains so that 
the key/values output by a Mapper are valid for the 
following Mapper in the chain. 
• It is assumed that all Mappers and the Reducer in the chain 
use matching output and input key and value classes, 
as no conversion is done by the chaining code. 
• Using the ChainMapper and the ChainReducer classes, it 
is possible to compose Map/Reduce jobs that look 
like [MAP+ / REDUCE MAP*]. 
• An immediate benefit of this pattern is a dramatic 
reduction in disk IO compared with running each step 
as a separate job.
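
The piping behaviour described above can be modeled in plain Java. This is a sketch under stated assumptions, not Hadoop's ChainMapper: SketchMapper, chain and the three example mappers are hypothetical, and records are single values, not key/value pairs. Each mapper is written without knowing it is part of a chain, and records pass from one mapper to the next in memory. Here the Java compiler checks that one mapper's output type matches the next mapper's input type; in Hadoop that match is the programmer's responsibility.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Plain-Java model of chained mappers inside one map task; NOT Hadoop's ChainMapper.
class MapperChainSketch {
    // A mapper receives one record and emits zero or more records.
    interface SketchMapper<I, O> {
        void map(I record, Consumer<O> emit);
    }

    // Pipes every record emitted by 'first' straight into 'second', in memory.
    static <A, B, C> SketchMapper<A, C> chain(SketchMapper<A, B> first,
                                              SketchMapper<B, C> second) {
        return (record, emit) -> first.map(record, mid -> second.map(mid, emit));
    }

    // Mapper A: splits a line into words.
    static final SketchMapper<String, String> TOKENIZE = (line, emit) -> {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emit.accept(word);
            }
        }
    };

    // Mapper B: lower-cases each word.
    static final SketchMapper<String, String> LOWERCASE =
            (word, emit) -> emit.accept(word.toLowerCase(Locale.ROOT));

    // Mapper C: drops words of two characters or fewer.
    static final SketchMapper<String, String> DROP_SHORT = (word, emit) -> {
        if (word.length() > 2) {
            emit.accept(word);
        }
    };

    // One "map task": every input line flows through the whole chain.
    static List<String> runMapTask(List<String> lines) {
        SketchMapper<String, String> pipeline =
                chain(chain(TOKENIZE, LOWERCASE), DROP_SHORT);
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            pipeline.map(line, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runMapTask(Arrays.asList("The Hadoop job", "is a Chain")));
    }
}
```

Because each small mapper does one thing, the same three could be recombined in a different order or reused in another task without modification.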
ChainMapper usage pattern 
... 
Job job = new Job(conf); 
// AMap: (LongWritable, Text) -> (Text, Text) 
Configuration mapAConf = new Configuration(false); 
... 
ChainMapper.addMapper(job, AMap.class, LongWritable.class, Text.class, 
Text.class, Text.class, mapAConf); 
// BMap: (Text, Text) -> (LongWritable, Text); input types match AMap's output 
Configuration mapBConf = new Configuration(false); 
... 
ChainMapper.addMapper(job, BMap.class, Text.class, Text.class, 
LongWritable.class, Text.class, mapBConf); 
... 
job.waitForCompletion(true); 
...
End of session 
Day – 2: Chaining

Hadoop job chaining
