Next Generation of Apache Hadoop
          MapReduce
                       Owen O’Malley
                       oom@yahoo-inc.com
                          @owen_omalley
What is Hadoop?
   A framework for storing and processing big data on
    lots of commodity machines.
     - Up to 4,000 machines in a cluster
     - Up to 20 PB in a cluster
   Open Source Apache project
   High reliability done in software
     - Automated failover for data and computation
   Implemented in Java
   Primary data analysis platform at Yahoo!
     - 40,000+ machines running Hadoop
What is Hadoop?
   HDFS – Distributed File System
     - Combines cluster’s local storage into a single namespace.
     - All data is replicated to multiple machines.
     - Provides locality information to clients
   MapReduce
     -   Batch computation framework
     -   Tasks re-executed on failure
     -   User code wrapped around a distributed sort
     -   Optimizes for data locality of input
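
A minimal single-process sketch of "user code wrapped around a distributed sort", assuming a word-count job. The names map_words, reduce_counts and run_job are made up for this illustration and are not Hadoop APIs; an in-memory sort stands in for the distributed one.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """User map function: emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """User reduce function: sum the counts collected for one word."""
    return word, sum(counts)


def run_job(records, mapper, reducer):
    """Framework part: map every record, sort by key, reduce each key group."""
    pairs = [pair for record in records for pair in mapper(record)]
    pairs.sort(key=itemgetter(0))  # stands in for the distributed sort
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


result = run_job(["big data", "big clusters"], map_words, reduce_counts)
print(result)
```

Only the two small functions are user code; the sort and grouping in run_job are the part the framework provides.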
Case Study: Yahoo Front Page
   Personalized for each visitor
   Result: twice the engagement
     - Recommended links: +79% clicks vs. randomly selected
     - News Interests: +160% clicks vs. one size fits all
     - Top Searches: +43% clicks vs. editor selected
Hadoop MapReduce Today
   JobTracker
    - Manages cluster resources
      and job scheduling
   TaskTracker
    - Per-node agent
    - Manages tasks
Current Limitations
   Scalability
    - Maximum Cluster size – 4,000 nodes
    - Maximum concurrent tasks – 40,000
    - Coarse synchronization in JobTracker
   Single point of failure
    - Failure kills all queued and running jobs
    - Jobs need to be re-submitted by users
   Restart is very tricky due to complex state
   Hard partition of resources into map and reduce
    slots
Current Limitations

   Lacks support for alternate paradigms
    - Iterative applications implemented using MapReduce
      are 10x slower.
    - Users use MapReduce to run arbitrary code
    - Example: K-Means, PageRank
   Lack of wire-compatible protocols
    - Client and cluster must be of same version
    - Applications and workflows cannot migrate to
      different clusters
MapReduce Requirements for 2011
   Reliability
   Availability
   Scalability - Clusters of 6,000 machines
    - Each machine with 16 cores, 48 GB RAM, 24 TB of disk
    - 100,000 concurrent tasks
    - 10,000 concurrent jobs
   Wire Compatibility
   Agility & Evolution – Ability for customers to
    control upgrades to the grid software stack.
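
A quick worked calculation of what these targets imply per machine, compared with the current limits quoted earlier (4,000 nodes, 40,000 concurrent tasks). It assumes tasks are spread evenly across machines, which is a simplification; tasks_per_machine is a helper name invented here.

```python
def tasks_per_machine(concurrent_tasks, machines):
    """Average number of concurrent tasks each machine must host."""
    return concurrent_tasks / machines


current = tasks_per_machine(40_000, 4_000)    # today's stated limits
target = tasks_per_machine(100_000, 6_000)    # the 2011 requirement

print(f"current: {current:.1f} tasks per machine")
print(f"target:  {target:.1f} tasks per machine")
```

Under that assumption the target works out to roughly one concurrent task per core on a 16-core machine.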
MapReduce – Design Focus
   Split up the two major functions of JobTracker
    - Cluster resource management
    - Application life-cycle management
   MapReduce becomes user-land library
Architecture
   Resource Manager
    - Global resource scheduler
    - Hierarchical queues
   Node Manager
    - Per-machine agent
    - Manages the life-cycle of containers
    - Container resource monitoring
   Application Master
    - Per-application
    - Manages application scheduling and task execution
    - E.g. MapReduce Application Master
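
A toy sketch of how the three roles divide the work, assuming memory is the only resource and placement is first-fit. The class and method names are illustrative only and do not mirror the real Hadoop interfaces; queues, heartbeats and container completion are left out.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent that tracks free memory and running containers."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, mem_mb):
        self.free_mb -= mem_mb
        self.containers.append((app_id, mem_mb))


class ResourceManager:
    """Global scheduler: it hands out containers and runs no tasks itself."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, mem_mb):
        # First-fit placement; a real scheduler would also honour queues.
        for node in self.nodes:
            if node.free_mb >= mem_mb:
                node.launch(app_id, mem_mb)
                return node.name
        return None  # nothing free right now


class ApplicationMaster:
    """Per-application: decides what to request and tracks its own tasks."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.placements = {}

    def run_tasks(self, tasks, mem_mb):
        for task in tasks:
            self.placements[task] = self.rm.allocate(self.app_id, mem_mb)
        return self.placements


rm = ResourceManager([NodeManager("node1", 2048), NodeManager("node2", 1024)])
am = ApplicationMaster("app-1", rm)
placements = am.run_tasks(["map-0", "map-1", "map-2", "reduce-0"], 1024)
print(placements)
```

The Resource Manager never learns what a "map" or "reduce" is; that knowledge lives entirely in the Application Master, which is the point of the split.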
Improvements vis-à-vis current MapReduce
     Scalability
      - Application life-cycle management is very
        expensive
      - Partition resource management and application
        life-cycle management
      - Application management is distributed
      - Hardware trends
          • Machines are getting bigger and faster
          • Moving toward 12 x 2 TB disks instead of 4 x 1 TB disks
          • Enables more tasks per machine
Improvements vis-à-vis current MapReduce
     Availability
      - Application Master
          • Optional failover via application-specific checkpoint
          • MapReduce applications pick up where they left off
      - Resource Manager
          • No single point of failure - failover via ZooKeeper
          • Application Masters are restarted automatically
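
A small sketch of the "pick up where they left off" idea, assuming the Application Master records each finished task in a checkpoint file. The file format, the fail_after switch and the function name are assumptions made for this example, not the actual recovery mechanism.

```python
import json
import os
import tempfile


def run_with_checkpoint(tasks, checkpoint_path, fail_after=None):
    """Run tasks in order, recording finished ones so a restart skips them."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    executed = []
    for task in tasks:
        if task in done:
            continue  # already finished before the restart
        if fail_after is not None and len(executed) == fail_after:
            raise RuntimeError("simulated Application Master crash")
        executed.append(task)
        done.append(task)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return executed


tasks = ["map-0", "map-1", "reduce-0"]
path = os.path.join(tempfile.mkdtemp(), "am-checkpoint.json")
try:
    run_with_checkpoint(tasks, path, fail_after=2)
except RuntimeError:
    pass  # the first attempt dies after finishing two tasks
resumed = run_with_checkpoint(tasks, path)
print(resumed)
```

The restarted attempt executes only the task that had not completed, rather than the whole job.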
Improvements vis-à-vis current MapReduce
     Wire Compatibility
      - Protocols are wire-compatible
      - Old clients can talk to new servers
      - Evolution toward rolling upgrades
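
One way to picture "old clients can talk to new servers" is a decoder that fills defaults for fields an older client never sends and ignores fields it does not recognize. The slides do not say which wire format is used, so JSON is only a stand-in here, and the field names are hypothetical.

```python
import json

# Fields the hypothetical "new" server understands, with defaults for
# anything an older client does not send.
DEFAULTS = {"job_name": "", "priority": "NORMAL", "queue": "default"}


def decode_request(raw):
    """Decode a request, tolerating missing and unknown fields."""
    message = json.loads(raw)
    request = dict(DEFAULTS)
    for key, value in message.items():
        if key in DEFAULTS:  # skip fields this version does not know
            request[key] = value
    return request


old_client = json.dumps({"job_name": "wordcount"})
new_client = json.dumps({"job_name": "wordcount", "queue": "research"})
old_request = decode_request(old_client)
new_request = decode_request(new_client)
print(old_request)
print(new_request)
```

Because neither side rejects a message over a missing or extra field, servers and clients can be upgraded one at a time, which is what a rolling upgrade needs.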
Improvements vis-à-vis current MapReduce
     Innovation and Agility
      - MapReduce now becomes a user-land library
      - Multiple versions of MapReduce can run in the
        same cluster (a la Apache Pig)
         • Faster deployment cycles for improvements
      - Customers upgrade MapReduce versions on their
        schedule
      - Users can use customized MapReduce versions
        without affecting everyone!
Improvements vis-à-vis current MapReduce
     Utilization
      - Generic resource model
          •   Memory
          •   CPU
          •   Disk b/w
          •   Network b/w
      - Remove fixed partition of map and reduce slots
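
A rough illustration of why removing the fixed map/reduce partition can raise utilization. It models memory only, although the generic resource model above also lists CPU, disk and network bandwidth, and the node size and slot counts are assumed numbers chosen for the example.

```python
def run_with_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Fixed partition: a map task can only use a map slot, and vice versa."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def run_with_containers(task_memory_mb, node_memory_mb):
    """Generic model: start tasks in order while their memory still fits."""
    running, used = 0, 0
    for need in task_memory_mb:
        if used + need <= node_memory_mb:
            running += 1
            used += need
    return running


# A node sized as 4 map slots + 4 reduce slots of 1 GB each, during a
# map-heavy phase with 8 pending map tasks and no reduce tasks.
slot_tasks = run_with_slots(8, 0, map_slots=4, reduce_slots=4)
container_tasks = run_with_containers([1024] * 8, node_memory_mb=8192)
print(slot_tasks, container_tasks)
```

With fixed slots half of the node sits idle during the map phase; with generic containers the same memory can host all eight tasks.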
Improvements vis-à-vis current MapReduce
     Support for programming paradigms other
      than MapReduce
      -   MPI
      -   Master-Worker
      -   Machine Learning and Iterative processing
      -   Enabled by paradigm-specific Application Master
      -   All can run on the same Hadoop cluster
Summary
   Takes Hadoop to the next level
    -   Scale-out even further
    -   High availability
    -   Cluster Utilization
    -   Support for paradigms other than MapReduce
Questions?
     http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
     http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/
