Another Intro to Hadoop [email_address] Context Optional April 2, 2010 By Adeel Ahmad
About Me Follow me on Twitter @_adeel The AI Show podcast: www.aishow.org Artificial intelligence news every week. Senior App Genius at Context Optional We're hiring Ruby developers. Contact me!
Too much data User-generated, social networks, logging and tracking Google, Yahoo and others need to index the entire internet and return search results in milliseconds NYSE generates 1 TB data/day Facebook has 400 terabytes of stored data and ingests 20 terabytes of new data per day. Hosts approx. 10 billion photos, 1 petabyte (2009)
Can't scale Challenge to both store and analyze datasets Slow to process Unreliable machines (CPUs and disks can go down) Not affordable (faster, more reliable machines are expensive)
Solve it through software Split up the data Run jobs in parallel Sort and combine to get the answer Schedule across arbitrarily-sized cluster Handle fault-tolerance Since even the best systems break down, use cheap commodity computers
Enter Hadoop Open-source Apache project written in Java MapReduce implementation for parallelizing applications Distributed filesystem for redundant data Many other sub-projects Meant for cheap, heterogeneous hardware Scale out by simply adding more cheap hardware
History Open-source Apache project Grew out of Apache Nutch project, an open-source search engine Two Google papers MapReduce (2004): programming model for parallel processing Google File System (2003): fault-tolerant storage of large amounts of data
MapReduce Operates exclusively on <key, value> pairs Split the input data into independent chunks Processed by the map tasks in parallel Sort the outputs of the maps Send to the reduce tasks Write to output files
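The pipeline above can be sketched in plain Ruby as a toy, in-memory simulation (no Hadoop involved; the input lines are made up):

```ruby
# Toy simulation of the MapReduce flow over <key, value> pairs.
input = ["the cat sat", "the cat ran"]

# Map: each line becomes a set of <word, 1> pairs, processed independently.
pairs = input.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle & sort: the framework groups map output by key between phases.
grouped = pairs.group_by { |key, _| key }.sort

# Reduce: sum the values for each key and write the result.
result = grouped.map { |key, kvs| [key, kvs.sum { |_, v| v }] }
p result  # [["cat", 2], ["ran", 1], ["sat", 1], ["the", 2]]
```

In real Hadoop each phase runs on different machines; the sketch only shows the data flow.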
MapReduce (diagram slides)
HDFS Hadoop Distributed File System Files split into large blocks Designed for streaming reads and appending writes, not random access 3 replicas for each block by default Data can be stored in encoded/archived formats
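From the client side this looks like an ordinary filesystem shell (the paths are hypothetical and a running HDFS is assumed):

```
# Copy a local file into HDFS; it is split into blocks and replicated 3x.
hadoop fs -put access.log /user/adeel/logs/

# List and stream it back; streaming reads, not random access.
hadoop fs -ls /user/adeel/logs/
hadoop fs -cat /user/adeel/logs/access.log
```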
Self-managing and self-healing Bring the computation as physically close to the data as possible for best bandwidth, instead of copying data Tries to use same node, then same rack, then same data center Auto-replication if data lost Auto-kill and restart of tasks on another node if taking too long or flaky
Hadoop Streaming Don't need to write mappers and reducers in Java Text-based API that exposes stdin and stdout Use any language Ruby gems: Wukong, Mandy
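A typical Streaming invocation looks roughly like this (the jar location and HDFS paths are assumptions; both vary by Hadoop version and distribution):

```
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -input /user/adeel/input \
  -output /user/adeel/output \
  -mapper mapper.rb \
  -reducer reducer.rb \
  -file mapper.rb -file reducer.rb
```

The -file options ship the scripts to every node in the cluster.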
Example: Word count

    # mapper.rb
    STDIN.each_line do |line|
      word_count = {}
      line.split.each do |word|
        word_count[word] ||= 0
        word_count[word] += 1
      end
      word_count.each do |k, v|
        puts "#{k}\t#{v}"
      end
    end

    # reducer.rb
    # Input arrives sorted by word, so counts for a word are contiguous.
    word = nil
    count = 0
    STDIN.each_line do |line|
      wordx, countx = line.strip.split
      if wordx != word
        puts "#{word}\t#{count}" unless word.nil?
        word = wordx
        count = 0
      end
      count += countx.to_i
    end
    puts "#{word}\t#{count}" unless word.nil?
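Because Streaming scripts just read stdin and write stdout, they can be sanity-checked locally before touching a cluster, with sort standing in for the shuffle phase (input.txt is a made-up sample file):

```
cat input.txt | ruby mapper.rb | sort | ruby reducer.rb
```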
Who Uses Hadoop? Yahoo Facebook Netflix eHarmony LinkedIn NY Times Digg Flightcaster RapLeaf Trulia Last.fm Ning CNET Lots more...
Developing With Hadoop Don't need a whole cluster to start Standalone: non-distributed, single Java process Pseudo-distributed: just like fully distributed, but each component runs as a separate process on one machine Fully distributed: now you need a real cluster
How to Run Hadoop Linux, OSX, Windows, Solaris Just need Java, SSH access to nodes XML config files Download core Hadoop Can do everything we mentioned Still needs user to play with config files and create scripts
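As a sketch of what those config files look like, a minimal pseudo-distributed setup (0.20-era property names; the file layout varies across versions) points the default filesystem at a local HDFS in conf/core-site.xml:

```
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```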
How to Run Hadoop Cloudera Inc. provides its own distribution plus enterprise support and training for Hadoop Core Hadoop plus patches Bundled with command-line scripts, Hive, Pig Publishes AMIs and scripts for EC2 Best option for your own cluster
How to Run Hadoop Amazon Elastic MapReduce (EMR) GUI or command-line cluster management Supports Streaming, Hive, Pig Grabs data and MapReduce code from S3 buckets and puts it into HDFS Auto-shutdown EC2 instances Cloudera now has scripts for EMR Easiest option
Pig High-level scripting language developed by Yahoo Describes multi-step jobs Translated into MapReduce tasks Grunt command-line interface Ex: Find top 5 most visited pages by users aged 18 to 25

    Users = LOAD 'users' AS (name, age);
    Filtered = FILTER Users BY age >= 18 AND age <= 25;
    Pages = LOAD 'pages' AS (user, url);
    Joined = JOIN Filtered BY name, Pages BY user;
    Grouped = GROUP Joined BY url;
    Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
    Sorted = ORDER Summed BY clicks DESC;
    Top5 = LIMIT Sorted 5;
Hive High-level interface created by Facebook Gives db-like structure to data HiveQL declarative language for querying Queries get turned into MapReduce jobs Command-line interface ex.

    CREATE TABLE raw_daily_stats_table (dates STRING, ..., pageviews STRING);
    LOAD DATA INPATH 'finaloutput' INTO TABLE raw_daily_stats_table;
    SELECT … FROM … JOIN ...
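To make the elided query concrete, a hypothetical follow-up against the same table might look like this (the column names are the assumed ones from the CREATE TABLE above):

```
-- Total pageviews per day; Hive compiles this into MapReduce jobs.
SELECT dates, SUM(CAST(pageviews AS INT)) AS total_views
FROM raw_daily_stats_table
GROUP BY dates;
```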
Mahout Machine-learning libraries for Hadoop Collaborative filtering Clustering Frequent pattern recognition Genetic algorithms Applications Product/friend recommendation Classify content into defined groups Find associations, patterns, behaviors Identify important topics in conversations
More stuff HBase – database based on Google's Bigtable Sqoop – database import tool ZooKeeper – coordination service for distributed apps to keep track of servers, like a filesystem Avro – data serialization system Scribe – logging system developed by Facebook

Editor's Notes

  • #4: - There is a flood of data and content being produced (User generated content, social networks, sharing, logging and tracking) - Google, Yahoo and others need to index the entire internet and return search results in milliseconds - NYSE generates 1 TB data/day - Facebook uses Hadoop to manage 400 terabytes of stored data and ingest 20 terabytes of new data per day. Hosts approx. 10 billion photos, 1 petabyte (2009)
  • #5: - Challenge to both store and analyze this data - reliably (computers break down, storage crashes) - affordably (fast, reliable systems expensive) - and quickly (lots of data takes time)
  • #6: - split up the data - run jobs in parallel - recombine to get the answer - schedule across arbitrarily-sized cluster - handle fault-tolerance - since even the best systems break down, use cheap commodity computers
  • #8: - open-source Apache project - grew out of Apache Nutch project: open-source search engine - Two Google papers: - MapReduce (2003): programming model for parallel processing - distributed filesystem for fault-tolerant data processing
  • #9: A Map/Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. UNIX: cat input | grep | sort | uniq -c | cat > output Input | Map | Shuffle & Sort | Reduce | Output
  • #10: The Map/Reduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
  • #12: - Files split into large blocks - designed for streaming reads and appending writes, not random access - 3 replicas for each piece of data by default - data can be encoded/archived
  • #13: - Hadoop brings the computation as physically close to the data for best bandwidth, instead of copying data - tries to use same node, then same rack, then same data center - auto-replication if data lost - auto-kill and restart of tasks on another node if taking too long or flaky
  • #15: - simplest example - most Hadoop jobs are a series of jobs that prepare the data first by filtering, cleaning, formatting
  • #16: Yahoo! - More than 100,000 CPUs in >25,000 computers running Hadoop - Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) - Used to support research for Ad Systems and Web Search Facebook - all their stats, daily and hourly reports on user growth, page views, avg. time spent on page, ad campaign performance, suggest friends and applications, ad hoc jobs on historical data for product and executive teams to compare performance of new features Netflix - movie recommendation. Run jobs every hour to parse and analyze logs. eHarmony - writes MapReduce in Ruby to match 20 million people and improve algorithms NYTimes - used it to process 4 TB of scanned archives and convert them to PDF in 24 hours on 100 machines on EC2 Last.fm - hundreds of daily jobs, analyze logs, evaluate A/B testing, generating charts
  • #17: Difference between standalone and pseudo?
  • #18: - how to set up your own cluster? - Cloudera's distribution runs on your own cluster - They have scripts to launch and manage EC2 clusters