Another Intro to Hadoop [email_address] Context Optional April 2, 2010 By Adeel Ahmad
About Me Follow me on Twitter @_adeel The AI Show podcast: www.aishow.org Artificial intelligence news every week. Senior App Genius at Context Optional We're hiring Ruby developers. Contact me!
Too much data User-generated, social networks, logging and tracking Google, Yahoo and others need to index the entire internet and return search results in milliseconds NYSE generates 1 TB data/day Facebook has 400 terabytes of stored data and ingests 20 terabytes of new data per day. Hosts approx. 10 billion photos, 1 petabyte (2009)
Can't scale Challenge to both store and analyze datasets Slow to process Unreliable machines (CPUs and disks can go down) Not affordable (faster, more reliable machines are expensive)
Solve it through software Split up the data Run jobs in parallel Sort and combine to get the answer Schedule across arbitrarily-sized cluster Handle fault-tolerance Since even the best systems break down, use cheap commodity computers
Enter Hadoop Open-source Apache project written in Java MapReduce implementation for parallelizing applications Distributed filesystem for redundant data Many other sub-projects Meant for cheap, heterogeneous hardware Scale out by simply adding more cheap hardware
History Open-source Apache project Grew out of Apache Nutch project, an open-source search engine Two Google papers MapReduce (2004): programming model for parallel processing Google File System (2003): fault-tolerant storage of large amounts of data
MapReduce Operates exclusively on <key, value> pairs Split the input data into independent chunks Processed by the map tasks in parallel Sort the outputs of the maps Send to the reduce tasks Write to output files
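The pipeline above can be sketched in plain Ruby as a toy, in-memory simulation (no Hadoop involved; the input lines are made up):

```ruby
# Toy simulation of the MapReduce flow over <key, value> pairs.
input = ["the cat sat", "the cat ran"]

# Map: each line becomes a set of <word, 1> pairs, processed independently.
pairs = input.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle & sort: the framework groups map output by key between phases.
grouped = pairs.group_by { |key, _| key }.sort

# Reduce: sum the values for each key and write the result.
result = grouped.map { |key, kvs| [key, kvs.sum { |_, v| v }] }
p result  # [["cat", 2], ["ran", 1], ["sat", 1], ["the", 2]]
```

In real Hadoop each phase runs on different machines; the sketch only shows the data flow.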
MapReduce (diagram slides)
HDFS Hadoop Distributed File System Files split into large blocks Designed for streaming reads and appending writes, not random access 3 replicas for each block by default Data can be stored in encoded/archived formats
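From the client side this looks like an ordinary filesystem shell (the paths are hypothetical and a running HDFS is assumed):

```
# Copy a local file into HDFS; it is split into blocks and replicated 3x.
hadoop fs -put access.log /user/adeel/logs/

# List and stream it back; streaming reads, not random access.
hadoop fs -ls /user/adeel/logs/
hadoop fs -cat /user/adeel/logs/access.log
```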
Self-managing and self-healing Bring the computation as physically close to the data as possible for best bandwidth, instead of copying data Tries to use same node, then same rack, then same data center Auto-replication if data lost Auto-kill and restart of tasks on another node if taking too long or flaky
Hadoop Streaming Don't need to write mappers and reducers in Java Text-based API that exposes stdin and stdout Use any language Ruby gems: Wukong, Mandy
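A typical Streaming invocation looks roughly like this (the jar location and HDFS paths are assumptions; both vary by Hadoop version and distribution):

```
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -input /user/adeel/input \
  -output /user/adeel/output \
  -mapper mapper.rb \
  -reducer reducer.rb \
  -file mapper.rb -file reducer.rb
```

The -file options ship the scripts to every node in the cluster.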
Example: Word count

    # mapper.rb
    STDIN.each_line do |line|
      word_count = {}
      line.split.each do |word|
        word_count[word] ||= 0
        word_count[word] += 1
      end
      word_count.each do |k, v|
        puts "#{k}\t#{v}"
      end
    end

    # reducer.rb
    # Input arrives sorted by word, so counts for a word are contiguous.
    word = nil
    count = 0
    STDIN.each_line do |line|
      wordx, countx = line.strip.split
      if wordx != word
        puts "#{word}\t#{count}" unless word.nil?
        word = wordx
        count = 0
      end
      count += countx.to_i
    end
    puts "#{word}\t#{count}" unless word.nil?
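Because Streaming scripts just read stdin and write stdout, they can be sanity-checked locally before touching a cluster, with sort standing in for the shuffle phase (input.txt is a made-up sample file):

```
cat input.txt | ruby mapper.rb | sort | ruby reducer.rb
```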
Who Uses Hadoop? Yahoo Facebook Netflix eHarmony LinkedIn NY Times Digg Flightcaster RapLeaf Trulia Last.fm Ning CNET Lots more...
Developing With Hadoop Don't need a whole cluster to start Standalone: non-distributed, single Java process Pseudo-distributed: just like fully distributed, but each component runs as a separate process on one machine Fully distributed: now you need a real cluster
How to Run Hadoop Linux, OSX, Windows, Solaris Just need Java, SSH access to nodes XML config files Download core Hadoop Can do everything we mentioned Still needs user to play with config files and create scripts
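As a sketch of what those config files look like, a minimal pseudo-distributed setup (0.20-era property names; the file layout varies across versions) points the default filesystem at a local HDFS in conf/core-site.xml:

```
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```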
How to Run Hadoop Cloudera Inc. provides its own distribution plus enterprise support and training for Hadoop Core Hadoop plus patches Bundled with command-line scripts, Hive, Pig Publishes AMIs and scripts for EC2 Best option for your own cluster
How to Run Hadoop Amazon Elastic MapReduce (EMR) GUI or command-line cluster management Supports Streaming, Hive, Pig Grabs data and MapReduce code from S3 buckets and puts it into HDFS Auto-shutdown EC2 instances Cloudera now has scripts for EMR Easiest option
Pig High-level scripting language developed by Yahoo Describes multi-step jobs Translated into MapReduce tasks Grunt command-line interface Ex: Find top 5 most visited pages by users aged 18 to 25

    Users = LOAD 'users' AS (name, age);
    Filtered = FILTER Users BY age >= 18 AND age <= 25;
    Pages = LOAD 'pages' AS (user, url);
    Joined = JOIN Filtered BY name, Pages BY user;
    Grouped = GROUP Joined BY url;
    Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
    Sorted = ORDER Summed BY clicks DESC;
    Top5 = LIMIT Sorted 5;
Hive High-level interface created by Facebook Gives db-like structure to data HiveQL declarative language for querying Queries get turned into MapReduce jobs Command-line interface ex.

    CREATE TABLE raw_daily_stats_table (dates STRING, ..., pageviews STRING);
    LOAD DATA INPATH 'finaloutput' INTO TABLE raw_daily_stats_table;
    SELECT … FROM … JOIN ...
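To make the elided query concrete, a hypothetical follow-up against the same table might look like this (the column names are the assumed ones from the CREATE TABLE above):

```
-- Total pageviews per day; Hive compiles this into MapReduce jobs.
SELECT dates, SUM(CAST(pageviews AS INT)) AS total_views
FROM raw_daily_stats_table
GROUP BY dates;
```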
Mahout Machine-learning libraries for Hadoop Collaborative filtering Clustering Frequent pattern recognition Genetic algorithms Applications Product/friend recommendation Classify content into defined groups Find associations, patterns, behaviors Identify important topics in conversations
More stuff HBase – database based on Google's Bigtable Sqoop – database import tool ZooKeeper – coordination service for distributed apps to keep track of servers, like a filesystem Avro – data serialization system Scribe – logging system developed by Facebook

Editor's Notes

  • #4: - There is a flood of data and content being produced (User generated content, social networks, sharing, logging and tracking) - Google, Yahoo and others need to index the entire internet and return search results in milliseconds - NYSE generates 1 TB data/day - Facebook uses Hadoop to manage 400 terabytes of stored data and ingest 20 terabytes of new data per day. Hosts approx. 10 billion photos, 1 petabyte (2009)
  • #5: - Challenge to both store and analyze this data - reliably (computers break down, storage crashes) - affordably (fast, reliable systems expensive) - and quickly (lots of data takes time)
  • #6: - split up the data - run jobs in parallel - recombine to get the answer - schedule across arbitrarily-sized cluster - handle fault-tolerance - since even the best systems break down, use cheap commodity computers
  • #8: - open-source Apache project - grew out of Apache Nutch project: open-source search engine - Two Google papers: - MapReduce (2003): programming model for parallel processing - distributed filesystem for fault-tolerant data processing
  • #9: A Map/Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. UNIX: cat input | grep | sort | uniq -c | cat > output Input | Map | Shuffle & Sort | Reduce | Output
  • #10: The Map/Reduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
  • #12: - Files split into large blocks - designed for streaming reads and appending writes, not random access - 3 replicas for each piece of data by default - data can be encoded/archived
  • #13: - Hadoop brings the computation as physically close to the data for best bandwidth, instead of copying data - tries to use same node, then same rack, then same data center - auto-replication if data lost - auto-kill and restart of tasks on another node if taking too long or flaky
  • #15: - simplest example - most Hadoop jobs are a series of jobs that prepare the data first by filtering, cleaning, formatting
  • #16: Yahoo! - More than 100,000 CPUs in >25,000 computers running Hadoop - Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) - Used to support research for Ad Systems and Web Search Facebook - all their stats, daily and hourly reports on user growth, page views, avg. time spent on page, ad campaign performance, suggest friends and applications, ad hoc jobs on historical data for product and executive teams to compare performance of new features Netflix - movie recommendation. Run jobs every hour to parse and analyze logs. eHarmony - writes MapReduce in Ruby to match 20 million people and improve algorithms NYTimes - used it to process 4 TB of scanned archives and convert them to PDF in 24 hours on 100 machines on EC2 Last.fm - hundreds of daily jobs, analyze logs, evaluate A/B testing, generating charts
  • #17: Difference between standalone and pseudo?
  • #18: - how to set up your own cluster? - Cloudera's distribution runs on your own cluster - They have scripts to launch and manage EC2 clusters