Introduction To Hadoop

                                    Kenneth Heafield

                                          Google Inc


                                    January 14, 2008



Example code from Hadoop 0.13.1 used under the Apache License Version 2.0
and modified for presentation. Except as otherwise noted, the content of this
presentation is licensed under the Creative Commons Attribution 2.5 License.


 Kenneth Heafield (Google Inc)         Introduction To Hadoop   January 14, 2008   1 / 12
Outline


1    Word Count Code
       Mapper
       Reducer
       Main

2    How it Works
       Serialization
       Data Flow

3    Lab




Word Count Code   Mapper


Mapper

< “wikipedia.org”, “The Free” > → < “The”, 1 >, < “Free”, 1 >

// Called once per input record; emits <word, 1> for every token in the value.
public void map(WritableComparable key,
    Writable value, OutputCollector output,
    Reporter reporter) throws IOException {
  String line = ((Text)value).toString();
  // StringTokenizer splits on whitespace by default.
  StringTokenizer itr = new StringTokenizer(line);
  Text word = new Text();
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    output.collect(word, new IntWritable(1));
  }
}
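
The map step can also be tried without a cluster. The following is a minimal plain-Java sketch, not the Hadoop API: the class name MapSketch and the use of Map.Entry in place of Text and IntWritable are assumptions made so the example compiles with only the JDK.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the map step; no Hadoop classes are used.
final class MapSketch {
  // Emits one <token, 1> pair per whitespace-separated token in the line.
  static List<Map.Entry<String, Integer>> map(String line) {
    List<Map.Entry<String, Integer>> output = new ArrayList<>();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      output.add(new SimpleEntry<>(itr.nextToken(), 1));
    }
    return output;
  }

  public static void main(String[] args) {
    // Prints <The, 1> and then <Free, 1>, one pair per line.
    for (Map.Entry<String, Integer> pair : map("The Free")) {
      System.out.println("<" + pair.getKey() + ", " + pair.getValue() + ">");
    }
  }
}
```

The input key ("wikipedia.org" on the slide) is ignored here, just as the word count mapper ignores it.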


Word Count Code   Reducer


Reducer

< “The”, 1 >, < “The”, 1 > → < “The”, 2 >

// Called once per key with all values for that key; emits <key, sum>.
public void reduce(WritableComparable key,
                   Iterator values,
                   OutputCollector output,
                   Reporter reporter)
                   throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    // Values arrive untyped, so each one is cast back to IntWritable.
    sum += ((IntWritable) values.next()).get();
  }
  output.collect(key, new IntWritable(sum));
}
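
The same summation can be sketched in plain Java to check the arithmetic locally. ReduceSketch is a hypothetical name, and Integer stands in for IntWritable; this is an illustration under those assumptions, not Hadoop code.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for the reduce step; no Hadoop classes are used.
final class ReduceSketch {
  // Sums every count that arrived for one key.
  static int reduce(String key, Iterator<Integer> values) {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next();
    }
    return sum;
  }

  public static void main(String[] args) {
    int total = reduce("The", Arrays.asList(1, 1).iterator());
    System.out.println("<The, " + total + ">");  // prints <The, 2>
  }
}
```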


Word Count Code   Main


Main

public static void main(String[] args)
    throws IOException {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setMapperClass(MapClass.class);
  // The reducer doubles as the combiner because it only adds counts.
  conf.setCombinerClass(ReduceClass.class);
  conf.setReducerClass(ReduceClass.class);
  conf.setNumMapTasks(40);
  conf.setNumReduceTasks(30);
  conf.setInputPath(new Path("/shared/wikipedia_small"));
  conf.setOutputPath(new Path("/user/kheafield/word_count"));
  // Types of the <key, value> pairs the job writes.
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  // Submits the job and waits for it to finish.
  JobClient.runJob(conf);
}
How it Works   Serialization


Types

Purpose
Simple serialization for keys, values, and other data

Interface Writable
     Read and write binary format
     Convert to String for text formats
     WritableComparable adds sorting order for keys

Example Implementations
     ArrayWritable is only Writable, not WritableComparable
     BooleanWritable
     IntWritable sorts in increasing order
     Text holds a String

How it Works   Serialization


A Writable

public class IntPairWritable implements Writable {
  public int first;
  public int second;
  // Binary format: two 4-byte ints, first then second.
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
  }
  // Must read the fields in the same order write() wrote them.
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
  }
  public int hashCode() { return first + second; }
  // Text format: "first,second".
  public String toString() {
    return Integer.toString(first) + "," +
           Integer.toString(second);
  }
}
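
To see what write and readFields do, the pair can be pushed through an in-memory byte buffer. This is a hedged sketch: IntPair mirrors the slide's methods but deliberately does not implement Hadoop's Writable so that it compiles with only the JDK, and the names IntPairRoundTrip, serialize, and deserialize are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class IntPairRoundTrip {
  // Assumption: same write/readFields shape as the slide, minus the Hadoop interface.
  static final class IntPair {
    int first;
    int second;
    void write(DataOutput out) throws IOException {
      out.writeInt(first);
      out.writeInt(second);
    }
    void readFields(DataInput in) throws IOException {
      first = in.readInt();
      second = in.readInt();
    }
    @Override public String toString() { return first + "," + second; }
  }

  // Writes the pair into a byte array using its own write method.
  static byte[] serialize(IntPair pair) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      pair.write(out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Rebuilds a pair from bytes produced by serialize.
  static IntPair deserialize(byte[] data) {
    IntPair pair = new IntPair();
    try {
      pair.readFields(new DataInputStream(new ByteArrayInputStream(data)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return pair;
  }

  public static void main(String[] args) {
    IntPair original = new IntPair();
    original.first = 3;
    original.second = 7;
    byte[] data = serialize(original);
    System.out.println(data.length + " bytes -> " + deserialize(data));  // prints 8 bytes -> 3,7
  }
}
```

Each int occupies four bytes, so the binary form is always eight bytes, while the text form from toString varies in length.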
How it Works   Serialization


WritableComparable Method



public int compareTo(Object other) {
  IntPairWritable o = (IntPairWritable)other;
  if (first < o.first) return -1;
  if (first > o.first) return 1;
  if (second < o.second) return -1;
  if (second > o.second) return 1;
  return 0;
}
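
The ordering this method defines is lexicographic: compare first, then break ties on second. A small plain-Java sketch shows the resulting sort order; IntPairSortSketch is a hypothetical name and Comparable stands in for WritableComparable, so this is an illustration rather than Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class IntPairSortSketch {
  // Assumption: java.lang.Comparable replaces Hadoop's WritableComparable here.
  static final class IntPair implements Comparable<IntPair> {
    final int first;
    final int second;
    IntPair(int first, int second) {
      this.first = first;
      this.second = second;
    }
    // Same comparison logic as the slide: first decides, second breaks ties.
    @Override public int compareTo(IntPair o) {
      if (first < o.first) return -1;
      if (first > o.first) return 1;
      if (second < o.second) return -1;
      if (second > o.second) return 1;
      return 0;
    }
    @Override public String toString() { return "(" + first + "," + second + ")"; }
  }

  // Returns a sorted copy, leaving the input list untouched.
  static List<IntPair> sorted(List<IntPair> keys) {
    List<IntPair> copy = new ArrayList<>(keys);
    Collections.sort(copy);
    return copy;
  }

  public static void main(String[] args) {
    List<IntPair> keys = Arrays.asList(
        new IntPair(2, 1), new IntPair(1, 9), new IntPair(1, 3));
    System.out.println(sorted(keys));  // prints [(1,3), (1,9), (2,1)]
  }
}
```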




How it Works   Data Flow


Data Flow

Default Flow
 1   Mappers read from HDFS
 2   Map output is partitioned by key and sent to Reducers
 3   Reducers sort input by key
 4   Reduce output is written to HDFS

[Figure: two parallel pipelines, each Input → Map → Sort → Reduce → Output. Stage labels: 1 HDFS (Input), 2 Mapper (Map), 3 Reducer (Sort, Reduce), 4 HDFS (Output).]
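
The four steps of the default flow can be imitated in a single process. The sketch below is an illustration only: lists stand in for HDFS, and the partition function (hash of the key modulo the number of reducers) is an assumption chosen for the example, as are the names DataFlowSketch, partition, and run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class DataFlowSketch {
  // Step 2: pick a reducer for a key. Assumed rule: non-negative hash modulo reducer count.
  static int partition(String key, int numReducers) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
  }

  // Runs word count over the lines and returns one sorted output per reducer.
  static List<TreeMap<String, Integer>> run(List<String> lines, int numReducers) {
    // One bucket per reducer; TreeMap keeps each reducer's keys sorted (step 3).
    List<TreeMap<String, List<Integer>>> buckets = new ArrayList<>();
    for (int r = 0; r < numReducers; r++) {
      buckets.add(new TreeMap<>());
    }
    // Steps 1 and 2: map each input line and route every <word, 1> by key.
    for (String line : lines) {
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        buckets.get(partition(word, numReducers))
            .computeIfAbsent(word, k -> new ArrayList<>())
            .add(1);
      }
    }
    // Steps 3 and 4: each reducer walks its keys in sorted order and sums the values.
    List<TreeMap<String, Integer>> outputs = new ArrayList<>();
    for (TreeMap<String, List<Integer>> bucket : buckets) {
      TreeMap<String, Integer> out = new TreeMap<>();
      for (Map.Entry<String, List<Integer>> entry : bucket.entrySet()) {
        int sum = 0;
        for (int value : entry.getValue()) {
          sum += value;
        }
        out.put(entry.getKey(), sum);
      }
      outputs.add(out);
    }
    return outputs;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("The Free", "The Encyclopedia");
    List<TreeMap<String, Integer>> outputs = run(input, 3);
    for (int r = 0; r < outputs.size(); r++) {
      System.out.println("reducer " + r + ": " + outputs.get(r));
    }
  }
}
```

Because every occurrence of a key is routed to the same reducer, each word appears in exactly one output, which is why the per-reducer sums are complete.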


How it Works   Data Flow


Combiners

Concept
     Add counts at Mapper before sending to Reducer.
     Word count takes 6 minutes with combiners and 14 minutes without.

Implementation
     Mapper caches output and periodically calls Combiner
     Input to Combine may be from Map or Combine
     Combiner uses the same interface as Reducer
[Figure: Input → Mapper (Map → Cache → Combine) → Sort → Reduce → Output, with two parallel Sort → Reduce → Output pipelines.]
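
The saving from a combiner comes from sending fewer records to the reducers. A rough plain-Java sketch of that effect follows; CombinerSketch and its method names are hypothetical, and a TreeMap stands in for the mapper-side cache.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class CombinerSketch {
  // Without a combiner, every token becomes one <word, 1> record sent to a reducer.
  static int recordsWithoutCombiner(String mapperInput) {
    return new StringTokenizer(mapperInput).countTokens();
  }

  // With a combiner, counts for the same word are added on the mapper side first,
  // so only one record per distinct word leaves the mapper.
  static Map<String, Integer> combine(String mapperInput) {
    Map<String, Integer> combined = new TreeMap<>();
    StringTokenizer itr = new StringTokenizer(mapperInput);
    while (itr.hasMoreTokens()) {
      combined.merge(itr.nextToken(), 1, Integer::sum);
    }
    return combined;
  }

  // Combine output can be combined again, because addition does not care
  // whether its inputs are raw 1s or partial sums.
  static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
    Map<String, Integer> result = new TreeMap<>(a);
    for (Map.Entry<String, Integer> entry : b.entrySet()) {
      result.merge(entry.getKey(), entry.getValue(), Integer::sum);
    }
    return result;
  }

  public static void main(String[] args) {
    String input = "the cat and the hat and the bat";
    System.out.println("records without combiner: " + recordsWithoutCombiner(input));  // 8
    System.out.println("records with combiner: " + combine(input).size());             // 5
  }
}
```

The merge method illustrates why input to Combine may come from either Map or Combine: the operation must give the same totals no matter how the partial sums are grouped.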
Lab


Exercises


Recommended: Word Count
Get word count running.

Bigrams
Count bigrams and unigrams efficiently.

Capitalization
With what probability is a word capitalized?

Indexer
In what documents does each word appear? Where in the documents?



Lab


Instructions




  1   Log in to the cluster successfully (and set your password).
  2   Install Eclipse so you can build Java code.
  3   Install the Hadoop plugin for Eclipse so you can deploy jobs to the
      cluster.
  4   Set up your Eclipse workspace from a template that we provide.
  5   Run the word counter example over the Wikipedia data set.



