Hadoop Explained

Sunitha Raghurajan
Data…Data….Data….
• We live in a data world!
• Total Facebook users: 835,525,280 (March 31st, 2012)
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.

http://www.internetworldstats.com/facebook.htm
Data…is growing!




From Gantz et al., “The Diverse and Exploding Digital Universe,” March 2008 (http://www.emc.com/
collateral/analyst-reports/diverse-exploding-digital-universe.pdf).
Problem??
• How do we store and analyze the data?
• With one-terabyte drives and a transfer speed of around 100 MB/s, it takes more than two and a half hours to read all the data off a single disk. Writing is even slower.
• What if we had 100 drives, each holding one hundredth of the data?
• Reliability issues (hard drive failures)
• How do we combine data from 100 drives?
• Existing tools are inadequate for processing large data sets.
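The read-time arithmetic above can be checked directly; a quick sketch (decimal units, as in the text):

```python
# Time to scan 1 TB sequentially at ~100 MB/s
terabyte_mb = 1_000_000        # 1 TB = 1,000,000 MB
transfer_mb_per_s = 100        # typical sequential read speed

seconds = terabyte_mb / transfer_mb_per_s
hours = seconds / 3600
print(f"{hours:.2f} hours")    # more than two and a half hours

# Split across 100 drives read in parallel, the same scan takes minutes
parallel_minutes = (seconds / 100) / 60
print(f"{parallel_minutes:.1f} minutes")
```

This is exactly the argument for the 100-drive layout: parallel reads turn hours into minutes, at the cost of the reliability and combining problems listed above.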
Why can’t we use an RDBMS?
• An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data. Reading a large fraction of the data takes much longer.

[Figure: memory hierarchy — CPU, Memory, Disk]
Hadoop is the answer!!!!!
• Hadoop is an open source project licensed under the Apache v2 license: http://hadoop.apache.org/

• Used for processing large datasets in parallel on clusters of low-cost commodity machines.

• Hadoop is built on two main parts: a special file system called the Hadoop Distributed File System (HDFS) and the MapReduce framework.
Hadoop History
• Hadoop was created by Doug Cutting, who named it after his son's toy elephant.
• 2002-2004: Nutch, an open source, web-scale, crawler-based search engine
• 2004-2006: Google File System & MapReduce papers published; DFS & MapReduce implementations added to Nutch
• 2006-2008: Yahoo! hired Doug Cutting
• On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application.
• The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is used in every Yahoo! Web search query.
Who uses Hadoop?
Amazon, American Airlines, AOL, Apple, eBay, Federal Reserve Board of Governors, Foursquare, Fox Interactive Media, Facebook, StumbleUpon, Gemvara, Hewlett-Packard, IBM, Microsoft, Twitter, NYTimes, Netflix, LinkedIn
Why Hadoop?
• Reliable: The software is fault tolerant; it expects and handles hardware and software failures.
• Scalable: Designed for massive scale of processors, memory, and locally attached storage.
• Distributed: Handles replication and offers a massively parallel programming model, MapReduce.
What is MapReduce?

 – A programming model used by Google

 – A combination of the Map and Reduce models with an associated implementation

 – Used for processing and generating large data sets
MapReduce Explained
• The basic idea is that you divide the job into two parts: a Map and a Reduce.
• Map takes the problem, splits it into sub-parts, and sends the sub-parts to different machines, so all the pieces run at the same time.
• Reduce takes the results from the sub-parts and combines them back together to get a single answer.
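As an illustration (not from the slides), the classic word-count example can be sketched in plain Python; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical stand-ins for the user-supplied functions and the framework:

```python
from collections import defaultdict

def map_fn(_, line):
    """Map: emit an intermediate (word, 1) pair for each word in the line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: combine all counts for one word into a single total."""
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    # Shuffle phase: group intermediate values by key
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    # Reduce phase: one call per distinct key
    out = {}
    for k, vs in sorted(groups.items()):
        for rk, rv in reducer(k, vs):
            out[rk] = rv
    return out

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In real Hadoop the mapper calls run on different machines and the shuffle moves data over the network; this single-process sketch only shows the division of labor between the two phases.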
Distributed Grep

[Figure: very big data is split into pieces; each split is searched by grep in parallel, and the per-split matches are concatenated (cat) into all matches]
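The same two-phase pattern fits grep: the map side filters each split, and the reduce side just concatenates. A minimal single-process sketch (illustrative only; in a real cluster each `grep_map` call would run on a different machine):

```python
import re

def grep_map(split_id, lines, pattern):
    """Map: scan one split, emit every line matching the pattern."""
    return [line for line in lines if re.search(pattern, line)]

def cat_reduce(per_split_matches):
    """Reduce: concatenate the per-split match lists into one result."""
    out = []
    for matches in per_split_matches:
        out.extend(matches)
    return out

# Four "splits" of a very big input, searched independently
splits = [
    ["error: disk full", "ok"],
    ["ok", "ok"],
    ["error: timeout"],
    ["warning", "error: retry"],
]
matches = cat_reduce([grep_map(i, s, r"error") for i, s in enumerate(splits)])
print(matches)  # ['error: disk full', 'error: timeout', 'error: retry']
```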
MapReduce Architecture
How Map and Reduce Work Together

[Figure: very big data flows through the Map phase; a partitioning function routes each intermediate pair to the Reduce phase, which produces the result]
• Map:
   – Accepts an input key/value pair
   – Emits intermediate key/value pairs
• Reduce:
   – Accepts an intermediate key/value* pair (a key with the list of all its values)
   – Emits output key/value pairs
http://ayende.com/blog/4435/map-reduce-a-visual-explanation
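The partitioning function in the figure decides which reducer receives each intermediate key; Hadoop's default is a hash partitioner, which can be sketched roughly as:

```python
def hash_partition(key, num_reducers):
    """Route an intermediate key to a reducer by hashing it.
    All pairs with the same key land on the same reducer, which is
    what lets Reduce see the complete value list for that key."""
    # A stable toy hash for illustration (Python's built-in hash() of
    # str is randomized per process, so it is avoided here).
    h = sum(ord(c) for c in key)
    return h % num_reducers

# Every occurrence of a given word goes to the same partition:
assert hash_partition("hadoop", 4) == hash_partition("hadoop", 4)
```

This routing step is why a reducer can emit a final total for its keys without consulting any other machine.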
RDBMS compared to MapReduce

              RDBMS                       MapReduce
Data size     Gigabytes                   Petabytes
Access        Interactive and batch       Batch
Updates       Read and write many times   Write once, read many times
Integrity     High                        Low
Scaling       Nonlinear                   Linear
Structure     Static schema               Dynamic schema
Hadoop Family

Pig         A platform for manipulating large data sets        Scripting
Mahout      Machine learning algorithms                        Machine learning
HBase       Bigtable-like structured storage on Hadoop HDFS    Non-relational database
Hive        Data warehouse system                              Non-relational database
HDFS        Distributes and replicates data among machines     Hadoop Common
MapReduce   Distributes and monitors tasks                     Hadoop Common
ZooKeeper   Distributed coordination service
When to use Hadoop?
•   Complex information processing is needed
•   Unstructured data needs to be turned into structured data
•   Queries can’t be reasonably expressed using SQL
•   Heavily recursive algorithms
•   Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing
•   Machine learning
•   Data sets are too large to fit into database RAM or disks, or require too many cores (tens of TB up to PB)
•   Data value does not justify the expense of constant real-time availability, such as archives or special-interest info, which can be moved to Hadoop and remain available at lower cost
•   Results are not needed in real time
•   Fault tolerance is critical
•   Significant custom coding would be required to handle job scheduling

•   Reference: http://timoelliott.com/blog/2011/09/hadoop-big-data-and-enterprise-business-intelligence.html
Building Blocks of Hadoop
• Hadoop runs a set of daemons on different servers on the network:
  • NameNode
  • DataNode
  • Secondary NameNode
  • JobTracker
  • TaskTracker
• Questions?
References
• Hadoop in Action by Chuck Lam
• Hadoop: The Definitive Guide by Tom White
• http://hadoop.apache.org/
