Overview of Hadoop Distributed Computing
Raghu Juluri, Senior Member of Technical Staff, Oracle India Development Center
2/7/2011
Dealing with Lots of Data
20 billion web pages × 20 KB each = 400 TB, or roughly 1,000 hard disks just to store the web. One computer reading ~50 MB/sec from disk would need about 3 months to scan it all.
Solution: spread the work over many machines, in both hardware and software. The software must handle communication and coordination, recovery from failure, status reporting, and debugging, and every application would otherwise need to implement this functionality itself (Google search indexing, page ranking, Trends, Picasa, ...). In 2003 Google came up with the MapReduce runtime library.
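A quick back-of-the-envelope check of the numbers above:

```python
web_size_bytes = 20e9 * 20e3           # 20 billion pages * 20 KB each
read_rate = 50e6                        # one disk reads ~50 MB/sec
seconds = web_size_bytes / read_rate    # time for a single machine to scan it all
days = seconds / 86_400                 # seconds per day

print(f"{web_size_bytes / 1e12:.0f} TB, {days:.0f} days (~{days / 30:.0f} months)")
# → 400 TB, 93 days (~3 months)
```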
Standard Model
Hadoop Ecosystem
Hadoop: Why?
Need to process multi-petabyte datasets. It is expensive to build reliability into each application. Nodes fail every day:
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
Need common infrastructure:
– Efficient, reliable, open source (Apache License).
These goals are the same as Condor's, but the workloads are I/O-bound, not CPU-bound.
HDFS splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.
Goals of HDFS
Very large distributed file system
– 10K nodes, 100 million files, 10 PB
Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Optimized for batch processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth
Runs in user space on heterogeneous operating systems
HDFS Architecture
Read path: (1) the client sends a filename to the NameNode; (2) the NameNode replies with the block IDs and the DataNodes holding each block; (3) the client reads the data directly from the DataNodes. DataNodes report cluster membership to the NameNode.
– NameNode: maps a file to a file ID and a list of blocks on DataNodes
– DataNode: maps a block ID to a physical location on disk
– SecondaryNameNode: periodically merges the transaction log
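The three-step read path can be sketched as a toy model. This is an illustrative simulation only, not the real Hadoop API; all class and method names here are hypothetical. The point is the division of labor: the NameNode holds only metadata, the DataNodes hold the block bytes, and the client reads blocks directly from DataNodes.

```python
class NameNode:
    """Holds metadata only: filename -> ordered [(block_id, [replica DataNodes])]."""
    def __init__(self):
        self.metadata = {}

    def lookup(self, filename):
        # Steps 1-2: client sends a filename, gets back block IDs + locations.
        return self.metadata[filename]

class DataNode:
    """Holds the actual block bytes: block_id -> bytes on this node's disk."""
    def __init__(self):
        self.blocks = {}

def read_file(namenode, filename):
    # Step 3: read each block directly from one of its replica DataNodes.
    data = b""
    for block_id, replicas in namenode.lookup(filename):
        data += replicas[0].blocks[block_id]   # any live replica would do
    return data

# Wire up a two-block file, with block b1 replicated on two DataNodes.
nn, d1, d2 = NameNode(), DataNode(), DataNode()
d1.blocks["b1"] = b"hello "
d2.blocks["b1"] = b"hello "                    # replica of b1
d2.blocks["b2"] = b"world"
nn.metadata["/user/demo.txt"] = [("b1", [d1, d2]), ("b2", [d2])]

print(read_file(nn, "/user/demo.txt"))         # b'hello world'
```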
MapReduce: Programming Model
Word-count example. Input: "How now Brown cow" and "How does It work now".
Map emits <How,1> <now,1> <brown,1> <cow,1> from the first line and <How,1> <does,1> <it,1> <work,1> <now,1> from the second.
The framework groups the values by key: <How,1 1> <now,1 1> <brown,1> <cow,1> <does,1> <it,1> <work,1>.
Reduce sums each list, producing the output: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.
MapReduce: Programming Model
Process data using special map() and reduce() functions:
– The map() function is called on every item in the input and emits a series of intermediate key/value pairs.
– All values associated with a given key are grouped together.
– The reduce() function is called on every unique key and its value list, and emits a value that is added to the output.
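A minimal sketch of this model in plain Python (not Hadoop's actual Java API; `run_mapreduce` is an illustrative stand-in for the framework):

```python
from collections import defaultdict

def map_fn(document):
    """Called on every input item; emits intermediate (key, value) pairs."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Called once per unique key with all of its grouped values."""
    return (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for item in inputs:                       # map phase
        for key, value in map_fn(item):
            groups[key].append(value)         # group values by key (shuffle)
    return dict(reduce_fn(k, v) for k, v in groups.items())   # reduce phase

counts = run_mapreduce(["How now Brown cow", "How does It work now"],
                       map_fn, reduce_fn)
print(counts["how"], counts["now"], counts["cow"])   # → 2 2 1
```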
MapReduce Benefits
Greatly reduces parallel programming complexity: it reduces synchronization complexity, automatically partitions data, provides failure transparency, and handles load balancing.
Practical: approximately 1,000 Google MapReduce jobs run every day.
MapReduce Examples
Word frequency: a document flows through Map, which emits <word,1> pairs; the runtime system groups them into <word,1,1,1>; Reduce sums the list and emits <word,3>.
A Brief History
MapReduce borrows from functional programming (e.g., Lisp):
– map(): applies a function to each value of a sequence
– reduce(): combines all elements of a sequence using a binary operator
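Python's built-ins show the same two operations directly:

```python
from functools import reduce

# map() applies a function to each value of a sequence.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# reduce() combines all elements using a binary operator.
total = reduce(lambda a, b: a + b, squares)          # 1 + 4 + 9 + 16 = 30

print(squares, total)
```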
MapReduce Execution Overview
The user program, via the MapReduce library, shards the input data (Shard 0 ... Shard 6). Shards are typically 16–64 MB in size.
MapReduce Execution Overview
The user program forks copies of itself across a cluster of machines. One copy becomes the "Master" and the others become workers.
MapReduce Resources
The master distributes M map tasks and R reduce tasks to idle workers, where M = the number of input shards and R = the number of parts the intermediate key space is divided into.
MapReduce Resources
Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs, buffered in RAM.
MapReduce Execution Overview
Each map worker flushes its buffered intermediate values to local disk, partitioned into R regions, and notifies the master of the disk locations.
MapReduce Execution Overview
The master passes those disk locations to an available reduce-task worker, which remotely reads all of the intermediate data associated with its partition.
MapReduce Execution Overview
Each reduce-task worker sorts its intermediate data, then calls the reduce function once per unique key, passing in the key and its associated values. The reduce function's output is appended to the reduce task's partition output file.
MapReduce Execution Overview
The master wakes up the user program when all tasks have completed. The output is contained in R output files, one per reduce partition.
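The execution steps above can be sketched end to end as a single-process simulation. This is an illustrative toy, not Hadoop itself: M map "tasks" (one per shard) write into R hash-partitioned regions, then R reduce "tasks" sort their region's keys and produce one output file each.

```python
from collections import defaultdict

def execute(shards, map_fn, reduce_fn, R):
    """Toy simulation of the MapReduce execution flow."""
    # Map phase: each map task processes one shard (M == number of shards)
    # and writes its pairs into R regions (simulating the R disk partitions).
    regions = [defaultdict(list) for _ in range(R)]
    for shard in shards:
        for key, value in map_fn(shard):
            regions[hash(key) % R][key].append(value)

    # Reduce phase: each reduce task sorts its region's keys, then appends
    # reduce output to its own partition output file.
    output_files = []
    for region in regions:
        output_files.append([reduce_fn(k, region[k]) for k in sorted(region)])
    return output_files                      # R output files

word_map = lambda doc: ((w.lower(), 1) for w in doc.split())
word_reduce = lambda k, vs: (k, sum(vs))
outputs = execute(["How now Brown cow", "How does It work now"],
                  word_map, word_reduce, R=2)
print(sum(len(f) for f in outputs))          # → 7 distinct words across R files
```

Note that which key lands in which of the R files depends on the hash function, just as in the real system the partitioning function (typically hash(key) mod R) decides which reduce task owns each key.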
Pig
– Data-flow oriented language, "Pig Latin"
– Datatypes include sets, associative arrays, tuples
– High-level language for routing data; allows easy integration of Java for complex tasks
– Developed at Yahoo!
Hive
– SQL-based data warehousing application
– Feature set is similar to Pig, but the language is more strictly SQL: supports SELECT, JOIN, GROUP BY, etc.
– Features for analyzing very large datasets: partition columns, sampling, buckets
– Developed at Facebook
HBase
Column-store database
– Based on the design of Google BigTable
– Provides interactive access to information
– Holds extremely large datasets (multi-TB)
Constrained access model
– (key, value) lookup
– Limited transactions (single row only)
ZooKeeper
Distributed consensus engine. Provides well-defined concurrent access semantics:
– Leader election
– Service discovery
– Distributed locking / mutual exclusion
– Message board / mailboxes
Some More Projects...
– Chukwa: Hadoop log aggregation
– Scribe: more general log aggregation
– Mahout: machine learning library
– Cassandra: column-store database on a P2P backend
– Dumbo: Python library for streaming
– Ganglia: distributed monitoring
Conclusions
Computing with big datasets is a fundamentally different challenge from doing "big compute" over a small dataset. New ways of thinking about problems are needed, and new tools such as MapReduce and HDFS provide the means to capture them.