Overview of Hadoop Distributed Computing
Raghu Juluri, Senior Member of Technical Staff, Oracle India Development Center
2/7/2011
Dealing with Lots of Data
20 billion web pages × 20 KB each = 400 TB, or roughly 1,000 hard disks just to store the web. One computer reading ~50 MB/sec from disk would need about 3 months to scan it all.
Solution: spread the work over many machines, in both hardware and software. The software must handle communication and coordination, recovery from failure, status reporting, and debugging, and every application would otherwise need to implement this functionality itself (Google search indexing, page ranking, Trends, Picasa, ...). In 2003 Google came up with the MapReduce runtime library.
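A quick back-of-the-envelope check of the numbers above:

```python
web_size_bytes = 20e9 * 20e3           # 20 billion pages * 20 KB each
read_rate = 50e6                        # one disk reads ~50 MB/sec
seconds = web_size_bytes / read_rate    # time for a single machine to scan it all
days = seconds / 86_400                 # seconds per day

print(f"{web_size_bytes / 1e12:.0f} TB, {days:.0f} days (~{days / 30:.0f} months)")
# → 400 TB, 93 days (~3 months)
```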
Standard Model
Hadoop Ecosystem
Hadoop: Why?
Need to process multi-petabyte datasets. It is expensive to build reliability into each application. Nodes fail every day:
– Failure is expected, rather than exceptional.
– The number of nodes in a cluster is not constant.
Need common infrastructure:
– Efficient, reliable, open source (Apache License).
These goals are the same as Condor's, but the workloads are I/O-bound, not CPU-bound.
HDFS splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.
Goals of HDFS
Very large distributed file system
– 10K nodes, 100 million files, 10 PB
Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Optimized for batch processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth
Runs in user space on heterogeneous operating systems
HDFS Architecture
Read path: (1) the client sends a filename to the NameNode; (2) the NameNode replies with the block IDs and the DataNodes holding each block; (3) the client reads the data directly from the DataNodes. DataNodes report cluster membership to the NameNode.
– NameNode: maps a file to a file ID and a list of blocks on DataNodes
– DataNode: maps a block ID to a physical location on disk
– SecondaryNameNode: periodically merges the transaction log
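The three-step read path can be sketched as a toy model. This is an illustrative simulation only, not the real Hadoop API; all class and method names here are hypothetical. The point is the division of labor: the NameNode holds only metadata, the DataNodes hold the block bytes, and the client reads blocks directly from DataNodes.

```python
class NameNode:
    """Holds metadata only: filename -> ordered [(block_id, [replica DataNodes])]."""
    def __init__(self):
        self.metadata = {}

    def lookup(self, filename):
        # Steps 1-2: client sends a filename, gets back block IDs + locations.
        return self.metadata[filename]

class DataNode:
    """Holds the actual block bytes: block_id -> bytes on this node's disk."""
    def __init__(self):
        self.blocks = {}

def read_file(namenode, filename):
    # Step 3: read each block directly from one of its replica DataNodes.
    data = b""
    for block_id, replicas in namenode.lookup(filename):
        data += replicas[0].blocks[block_id]   # any live replica would do
    return data

# Wire up a two-block file, with block b1 replicated on two DataNodes.
nn, d1, d2 = NameNode(), DataNode(), DataNode()
d1.blocks["b1"] = b"hello "
d2.blocks["b1"] = b"hello "                    # replica of b1
d2.blocks["b2"] = b"world"
nn.metadata["/user/demo.txt"] = [("b1", [d1, d2]), ("b2", [d2])]

print(read_file(nn, "/user/demo.txt"))         # b'hello world'
```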
MapReduce: Programming Model
Word-count example. Input: "How now Brown cow" and "How does It work now".
Map emits <How,1> <now,1> <brown,1> <cow,1> from the first line and <How,1> <does,1> <it,1> <work,1> <now,1> from the second.
The framework groups the values by key: <How,1 1> <now,1 1> <brown,1> <cow,1> <does,1> <it,1> <work,1>.
Reduce sums each list, producing the output: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.
MapReduce: Programming Model
Process data using special map() and reduce() functions:
– The map() function is called on every item in the input and emits a series of intermediate key/value pairs.
– All values associated with a given key are grouped together.
– The reduce() function is called on every unique key and its value list, and emits a value that is added to the output.
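A minimal sketch of this model in plain Python (not Hadoop's actual Java API; `run_mapreduce` is an illustrative stand-in for the framework):

```python
from collections import defaultdict

def map_fn(document):
    """Called on every input item; emits intermediate (key, value) pairs."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Called once per unique key with all of its grouped values."""
    return (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for item in inputs:                       # map phase
        for key, value in map_fn(item):
            groups[key].append(value)         # group values by key (shuffle)
    return dict(reduce_fn(k, v) for k, v in groups.items())   # reduce phase

counts = run_mapreduce(["How now Brown cow", "How does It work now"],
                       map_fn, reduce_fn)
print(counts["how"], counts["now"], counts["cow"])   # → 2 2 1
```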
MapReduce Benefits
Greatly reduces parallel programming complexity: it reduces synchronization complexity, automatically partitions data, provides failure transparency, and handles load balancing.
Practical: approximately 1,000 Google MapReduce jobs run every day.
MapReduce Examples
Word frequency: a document flows through Map, which emits <word,1> pairs; the runtime system groups them into <word,1,1,1>; Reduce sums the list and emits <word,3>.
A Brief History
MapReduce borrows from functional programming (e.g., Lisp):
– map(): applies a function to each value of a sequence
– reduce(): combines all elements of a sequence using a binary operator
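Python's built-ins show the same two operations directly:

```python
from functools import reduce

# map() applies a function to each value of a sequence.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# reduce() combines all elements using a binary operator.
total = reduce(lambda a, b: a + b, squares)          # 1 + 4 + 9 + 16 = 30

print(squares, total)
```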
MapReduce Execution Overview
The user program, via the MapReduce library, shards the input data (Shard 0 ... Shard 6). Shards are typically 16–64 MB in size.
MapReduce Execution Overview
The user program forks copies of itself across a cluster of machines. One copy becomes the "Master" and the others become workers.
MapReduce Resources
The master distributes M map tasks and R reduce tasks to idle workers, where M = the number of input shards and R = the number of parts the intermediate key space is divided into.
MapReduce Resources
Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs, buffered in RAM.
MapReduce Execution Overview
Each map worker flushes its buffered intermediate values to local disk, partitioned into R regions, and notifies the master of the disk locations.
MapReduce Execution Overview
The master passes those disk locations to an available reduce-task worker, which remotely reads all of the intermediate data associated with its partition.
MapReduce Execution Overview
Each reduce-task worker sorts its intermediate data, then calls the reduce function once per unique key, passing in the key and its associated values. The reduce function's output is appended to the reduce task's partition output file.
MapReduce Execution Overview
The master wakes up the user program when all tasks have completed. The output is contained in R output files, one per reduce partition.
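The execution steps above can be sketched end to end as a single-process simulation. This is an illustrative toy, not Hadoop itself: M map "tasks" (one per shard) write into R hash-partitioned regions, then R reduce "tasks" sort their region's keys and produce one output file each.

```python
from collections import defaultdict

def execute(shards, map_fn, reduce_fn, R):
    """Toy simulation of the MapReduce execution flow."""
    # Map phase: each map task processes one shard (M == number of shards)
    # and writes its pairs into R regions (simulating the R disk partitions).
    regions = [defaultdict(list) for _ in range(R)]
    for shard in shards:
        for key, value in map_fn(shard):
            regions[hash(key) % R][key].append(value)

    # Reduce phase: each reduce task sorts its region's keys, then appends
    # reduce output to its own partition output file.
    output_files = []
    for region in regions:
        output_files.append([reduce_fn(k, region[k]) for k in sorted(region)])
    return output_files                      # R output files

word_map = lambda doc: ((w.lower(), 1) for w in doc.split())
word_reduce = lambda k, vs: (k, sum(vs))
outputs = execute(["How now Brown cow", "How does It work now"],
                  word_map, word_reduce, R=2)
print(sum(len(f) for f in outputs))          # → 7 distinct words across R files
```

Note that which key lands in which of the R files depends on the hash function, just as in the real system the partitioning function (typically hash(key) mod R) decides which reduce task owns each key.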
Pig
– Data-flow oriented language, "Pig Latin"
– Datatypes include sets, associative arrays, tuples
– High-level language for routing data; allows easy integration of Java for complex tasks
– Developed at Yahoo!
Hive
– SQL-based data warehousing application
– Feature set is similar to Pig, but the language is more strictly SQL: supports SELECT, JOIN, GROUP BY, etc.
– Features for analyzing very large datasets: partition columns, sampling, buckets
– Developed at Facebook
HBase
Column-store database
– Based on the design of Google BigTable
– Provides interactive access to information
– Holds extremely large datasets (multi-TB)
Constrained access model
– (key, value) lookup
– Limited transactions (single row only)
ZooKeeper
Distributed consensus engine. Provides well-defined concurrent access semantics:
– Leader election
– Service discovery
– Distributed locking / mutual exclusion
– Message board / mailboxes
Some More Projects...
– Chukwa: Hadoop log aggregation
– Scribe: more general log aggregation
– Mahout: machine learning library
– Cassandra: column-store database on a P2P backend
– Dumbo: Python library for streaming
– Ganglia: distributed monitoring
Conclusions
Computing with big datasets is a fundamentally different challenge from doing "big compute" over a small dataset. New ways of thinking about problems are needed, and new tools such as MapReduce and HDFS provide the means to capture them.