Processing Data with Map Reduce



Allahbaksh Mohammedali Asadullah
Infosys Labs, Infosys Technologies
Contents
Map Function
Reduce Function
Why Hadoop
HDFS
Map Reduce – Hadoop
Some Questions
What is a Map Function?
Map is a classic primitive of functional programming.
Map means: apply a function to a list of elements and return the modified list.
    function List Map(Function func, List elements){
        List newElements;
        foreach element in elements{
            newElements.put(apply(func, element))
        }
        return newElements
    }
Example Map Function
    function double increaseSalary(double salary){
        return salary * (1 + 0.15);
    }

    function List<Employee> Map(Function increaseSalary, List<Employee> employees){
        List<Employee> newList;
        foreach employee in employees{
            Employee tempEmployee = new Employee(employee);
            tempEmployee.income = increaseSalary(tempEmployee.income);
            newList.add(tempEmployee);
        }
        return newList;
    }
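The same idea in real, runnable code: a minimal Java streams sketch of the salary example above. The Employee class here is a hypothetical stand-in for whatever record type you use.

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    class Employee {
        double income;
        Employee(double income) { this.income = income; }
    }

    public class MapExample {
        public static void main(String[] args) {
            List<Employee> employees = Arrays.asList(new Employee(1000), new Employee(2000));
            // Map: apply the 15% raise to every element, producing a new list.
            List<Employee> raised = employees.stream()
                    .map(e -> new Employee(e.income * 1.15))
                    .collect(Collectors.toList());
            raised.forEach(e -> System.out.println(e.income));
        }
    }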
Fold or Reduce Function
Fold/Reduce reduces a list of values to one.
Fold means: apply a function to a list of elements and return a single resulting element.
    function Element Reduce(Function func, List elements){
        Element earlierResult;
        foreach element in elements{
            earlierResult = func(element, earlierResult)
        }
        return earlierResult;
    }
Example Reduce Function
function double add(double number1, double number2){
    return number1 + number2;
}

function double Reduce(Function add, List<Employee> employees){
    double totalAmount = 0.0;
    foreach employee in employees{
        totalAmount = add(totalAmount, employee.income);
    }
    return totalAmount;
}
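Again as a runnable Java sketch: the streams API expresses the same fold directly via reduce. The income values are illustrative.

    import java.util.Arrays;
    import java.util.List;

    public class ReduceExample {
        public static void main(String[] args) {
            List<Double> incomes = Arrays.asList(1000.0, 2000.0, 3000.0);
            // Reduce: fold the list into a single value with a binary function,
            // starting from the identity element 0.0.
            double total = incomes.stream().reduce(0.0, (a, b) -> a + b);
            System.out.println(total); // 6000.0
        }
    }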
I know Map and Reduce, How do I use it?
I will use some library or framework.
Why some framework?
Lazy to write boilerplate code
For modularity
Code reusability
What is the best choice?
Why Hadoop?
Programming Language Support
(Logos slide: Hadoop's native API is Java; C++ is supported via Hadoop Pipes, and other languages via Hadoop Streaming.)
Who uses it
Strong Community
Image courtesy http://goo.gl/15Nu3
Commercial Support
Hadoop
Hadoop HDFS
Hadoop Distributed File System
Large distributed file system on commodity hardware
4K nodes, thousands of files, petabytes of data
Files are replicated so that hard disk failure can be handled easily
One NameNode and many DataNodes
Hadoop Distributed File System
(HDFS architecture diagram: a client issues metadata operations to the NameNode, which holds the metadata (name, replicas, ...); the client reads and writes blocks directly on the DataNodes, which replicate blocks across racks (Rack 1, Rack 2) under the NameNode's block operations.)
NameNode
Meta-data in RAM
    The entire metadata is in main memory.
Metadata consists of
    • List of files
    • List of blocks for each file
    • List of DataNodes for each block
    • File attributes, e.g. creation time
    • Transaction log
NameNode uses heartbeats to detect DataNode failure
Data Node
DataNode stores the data in its file system
Stores meta-data of a block
Serves data and meta-data to clients
Pipelining of data, i.e. forwards data to other specified DataNodes
DataNodes send a heartbeat to the NameNode every three seconds
HDFS Commands
Accessing HDFS
    hadoop dfs -mkdir myDirectory
    hadoop dfs -cat myFirstFile.txt
Web Interface
    http://host:port/dfshealth.jsp
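HDFS can also be driven programmatically. Below is a minimal sketch using the org.apache.hadoop.fs.FileSystem API; the paths are hypothetical and the Configuration is assumed to pick up your cluster settings from core-site.xml on the classpath.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsAccess {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration(); // reads core-site.xml from the classpath
            FileSystem fs = FileSystem.get(conf);

            // Equivalent of: hadoop dfs -mkdir myDirectory
            fs.mkdirs(new Path("myDirectory"));

            // Equivalent of: hadoop dfs -cat myFirstFile.txt
            FSDataInputStream in = fs.open(new Path("myFirstFile.txt"));
            try {
                IOUtils.copyBytes(in, System.out, conf, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }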
Hadoop MapReduce
Map Reduce Diagrammatically
(Diagram: input files are divided into Input Splits 0-5; each split is consumed by a Mapper; the intermediate file is divided into R partitions by the partitioning function, and each partition goes to a Reducer, which writes the output files.)
Input Format
InputFormat describes the input specification of a MapReduce job, i.e. how the data is to be read from the file system.
It splits the input file into logical InputSplits, each of which is assigned to a Mapper.
It provides the RecordReader implementation used to collect input records from a logical InputSplit for processing by the Mapper.
The RecordReader typically converts the byte-oriented view of the input, provided by the InputSplit, into a record-oriented view for the Mapper and Reducer tasks. It thus takes on the responsibility of handling record boundaries and presenting the tasks with keys and values.
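As a hedged sketch (assuming the old .mapred API used throughout this deck), a minimal custom InputFormat can inherit split computation from FileInputFormat and hand each split to the stock line-oriented RecordReader:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Splits come from FileInputFormat; each split is read line by line.
    public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {
        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            reporter.setStatus(split.toString());
            return new LineRecordReader(job, (FileSplit) split);
        }
    }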
Creating your Mapper
The mapper should implement the .mapred.Mapper interface (the newer .mapreduce API extends the .mapreduce.Mapper class instead).
Extend the .mapred.MapReduceBase class, which provides default implementations of the close and configure methods.
The main method is map(WritableComparable key, Writable value, OutputCollector<K2,V2> output, Reporter reporter).
One instance of your Mapper is initialized per task. It exists in a separate process from all other instances of Mapper, with no data sharing, so static variables will be different for different map tasks.
Writable: Hadoop defines an interface called Writable, which is serializable. Examples: IntWritable, LongWritable, Text, etc.
WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
InverseMapper swaps the key and value.
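As a concrete illustration, here is a minimal word-count mapper in the old .mapred API, matching the Map class wired up in the driver example later in this deck (a sketch, not necessarily the author's exact code):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // Called once per input record: key is the byte offset, value is the line.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE); // emit (word, 1)
            }
        }
    }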
Combiner
Combiners are used to optimize/minimize the number of key-value pairs that will be shuffled across the network between mappers and reducers.
A combiner is a sort of mini-reducer that may be applied, potentially several times, during the map phase before the new set of key/value pairs is sent to the reducer(s).
Combiners should be used when the function you want to apply is both commutative and associative.
Example: WordCount qualifies (summing counts is both), whereas a naive mean-value computation does not: mean(1, 2, 6) = 3, but mean(mean(1, 2), 6) = 3.75.
Reference: http://goo.gl/iU5kR
Partitioner
Partitioner controls the partitioning of the keys of the intermediate map-outputs.
The key (or a subset of the key) is used to derive the partition, typically by a hash function.
The total number of partitions is the same as the number of reduce tasks for the job.
Some Partitioners are BinaryPartitioner, HashPartitioner, KeyFieldBasedPartitioner and TotalOrderPartitioner.
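A hedged sketch of a custom partitioner in the old .mapred API, mirroring what HashPartitioner does for Text keys (the class name is illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class WordPartitioner implements Partitioner<Text, IntWritable> {
        // No per-job configuration needed for this simple partitioner.
        public void configure(JobConf job) { }

        // Same scheme HashPartitioner uses: hash the key, mask the sign bit,
        // then take the remainder modulo the number of reduce tasks.
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }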
Creating your Reducer
The reducer should implement the .mapred.Reducer interface (the newer .mapreduce API extends the .mapreduce.Reducer class instead).
Extend the .mapred.MapReduceBase class, which provides default implementations of the close and configure methods.
The main method is reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter).
Keys and values sent to one partition all go to the same reduce task.
Iterator.next() always returns the same object, with different data.
HashPartitioner partitions based on the hash function written.
IdentityReducer is the default implementation of the Reducer.
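And the matching word-count reducer for the mapper sketched earlier, again in the old .mapred API:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Called once per key with all of that key's values from every mapper.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get(); // read the value before calling next() again
            }
            output.collect(key, new IntWritable(sum));
        }
    }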
Output Format
OutputFormat is similar to InputFormat.
Different types of output formats are:
    TextOutputFormat
    SequenceFileOutputFormat
    NullOutputFormat
Mechanics of the whole process
Configure the Input and Output
Configure the Mapper and Reducer
Specify other parameters like the number of map tasks, the number of reduce tasks, etc.
Submit the job through the JobClient
Example
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf); // blocks until completion; JobClient.submitJob(conf) submits asynchronously
Job Tracker & Task Tracker
(Diagram: the master node runs the Job Tracker; each slave node runs a Task Tracker, which in turn manages multiple Tasks.)
Job Launch Process
JobClient determines the proper division of input into InputSplits.
It sends the job data to the master JobTracker server.
It saves the jar and JobConf (serialized to XML) in a shared location and posts the job into a queue.
Job Launch Process Contd..
TaskTrackers running on slave nodes periodically query the JobTracker for work.
They fetch the job jar from the master node to the data node.
They launch the main class in a separate JVM: TaskTracker.Child.main()
Small File Problem
What should I do if I have lots of small files?
One-word answer: SequenceFile.
SequenceFile layout: a flat sequence of key-value records
    Key = file name, Value = file content
Tar to SequenceFile: http://goo.gl/mKGC7
Consolidator: http://goo.gl/EVvi7
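A hedged sketch of packing small files into a SequenceFile with the classic SequenceFile.Writer API, using the file name as key and the file content as value, as the layout above suggests (the paths are hypothetical):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("packed.seq"), Text.class, BytesWritable.class);
            try {
                for (FileStatus status : fs.listStatus(new Path("smallFilesDir"))) {
                    // Read the whole small file into memory (fine for small files).
                    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                    IOUtils.copyBytes(fs.open(status.getPath()), bytes, conf, true);
                    // Key = file name, Value = file content.
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(bytes.toByteArray()));
                }
            } finally {
                writer.close();
            }
        }
    }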
Problem of Large File
What if I have a single big file of 20 GB?
One-word answer: there is no problem with large files.
HDFS splits the file into blocks and MapReduce splits it into InputSplits, so many mappers process it in parallel.
SQL Data
What is the way to access SQL data?
One-word answer: DBInputFormat.
DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database.
Database Access with Hadoop: http://goo.gl/CNOBc
JobConf conf = new JobConf(getConf(), MyDriver.class);

conf.setInputFormat(DBInputFormat.class);

DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
    "jdbc:mysql://localhost:port/dbName");

String[] fields = { "employee_id", "name" };

DBInputFormat.setInput(conf, MyRow.class, "employees",
    null /* conditions */, "employee_id", fields);
public class MyRow implements Writable, DBWritable {
    private int employeeNumber;
    private String employeeName;

    public void write(DataOutput out) throws IOException {
        out.writeInt(employeeNumber);
        out.writeUTF(employeeName); // writeUTF pairs with readUTF below
    }

    public void readFields(DataInput in) throws IOException {
        employeeNumber = in.readInt();
        employeeName = in.readUTF();
    }

    public void write(PreparedStatement statement) throws SQLException {
        statement.setInt(1, employeeNumber);
        statement.setString(2, employeeName);
    }

    public void readFields(ResultSet resultSet) throws SQLException {
        employeeNumber = resultSet.getInt(1);
        employeeName = resultSet.getString(2);
    }
}
Question & Answer
Thank You
