Hadoop
An Introduction
Mohanasundaram Ponnusamy
What is Big Data?
• Big Data is the term used to describe collections of data that are very large and that grow very quickly.
• More than 80% of this massive quantity of data is unstructured or semi-structured.
• Such data is difficult to manage with regular database or statistical tools.
• The relevant Big Data solutions are based on Hadoop and other analytics software.
What is Hadoop?
• An open-source project of the Apache Software Foundation
• A framework written in Java, originally developed by Doug Cutting
• Based on two Google white papers:
  • 2003: description of the Google File System (GFS)
    • A method for storing data in a distributed, reliable fashion
  • 2004: description of distributed MapReduce
    • A method for processing data in a parallel fashion
• Optimized to:
  • Handle massive quantities of data through parallelism
  • Handle a variety of data: structured, unstructured, or semi-structured
  • Run on relatively inexpensive commodity hardware
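The MapReduce model mentioned above can be illustrated with a minimal, single-process Python sketch of a word count. This is not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and exist only to show the three stages. In a real cluster the map and reduce calls would run in parallel on many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, the step a framework performs between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a small in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Because each `map_phase` call looks at one line only and each `reduce_phase` call looks at one key only, both stages can be spread across machines without shared state, which is what makes the model parallel.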
What is Hadoop? (cont.)
• Massively parallel processing delivers high performance.
• Reliability through replication:
  • Data is replicated across multiple computers. If one goes down, the data is processed on one of the others.
• As of November 2014, the current Hadoop version is 2.6.
• Files can be appended to, but not updated in place.
• Find more info at http://hadoop.apache.org/
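The replication idea above can be sketched as a toy Python model. The round-robin placement, the replication factor of 3, and the names `place_blocks` and `readable_blocks` are assumptions made for illustration; real HDFS placement is rack-aware and more involved.

```python
from typing import Dict, List, Set

REPLICATION_FACTOR = 3  # assumed value for this illustration


def place_blocks(
    blocks: List[str], nodes: List[str], replicas: int = REPLICATION_FACTOR
) -> Dict[str, List[str]]:
    """Assign each block to `replicas` distinct nodes using simple round-robin."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement: Dict[str, List[str]] = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement


def readable_blocks(placement: Dict[str, List[str]], failed: Set[str]) -> Dict[str, str]:
    """For each block, pick the first surviving node that still holds a copy."""
    result: Dict[str, str] = {}
    for block, holders in placement.items():
        alive = [node for node in holders if node not in failed]
        if alive:
            result[block] = alive[0]
    return result


if __name__ == "__main__":
    placement = place_blocks(["b0", "b1", "b2", "b3"], ["n1", "n2", "n3", "n4"])
    # Node n1 goes down; every block is still served by another node.
    print(readable_blocks(placement, failed={"n1"}))
```

With three copies per block, the loss of any single node leaves at least two copies of every block, so work scheduled for the failed node can be rerun where a surviving copy lives.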
Who uses?
Why Hadoop?
• Stores and processes large data sets (terabytes to petabytes)
• Fault tolerant: failure is treated as common
• Highly scalable, horizontally, on commodity hardware
• Moves computation to the storage instead of moving data to the processors
• Distributed computing (MapReduce)
• Not Only SQL
Why Hadoop? (cont.)
• We can process data very quickly, but we can only read and write it very slowly.
• Solution: parallel reads
  • 1 HDD = 75 MB/sec
  • 1,000 HDDs = 75 GB/sec
  • Far more acceptable
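The arithmetic above can be checked with a short Python sketch. The 1 TB data set, the decimal units, and the perfectly linear scaling across disks are assumptions for illustration; a real cluster loses some throughput to network and coordination overhead.

```python
MB = 1_000_000  # decimal megabyte, assumed for simplicity


def aggregate_throughput(disks: int, mb_per_sec_per_disk: float = 75.0) -> float:
    """Combined read rate in bytes/sec, assuming reads scale linearly with disk count."""
    return disks * mb_per_sec_per_disk * MB


def read_time_seconds(data_bytes: float, disks: int) -> float:
    """Time to scan `data_bytes` when the data is spread evenly over `disks` drives."""
    return data_bytes / aggregate_throughput(disks)


if __name__ == "__main__":
    one_tb = 1_000_000_000_000  # hypothetical 1 TB data set
    for disks in (1, 1_000):
        seconds = read_time_seconds(one_tb, disks)
        print(f"{disks:>5} disk(s): {seconds:,.1f} s")
```

Under these assumptions, scanning 1 TB takes about 3.7 hours on one disk and about 13 seconds on 1,000 disks read in parallel.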
Hadoop Components
• HDFS
  • A distributed file system that provides high-throughput access to application data.
• MapReduce (MR)
  • A distributed data processing model and execution environment that runs on large clusters of commodity machines.
• YARN
  • A framework for job scheduling and cluster resource management.
• Common Utilities
  • The set of utilities that supports the other Hadoop subprojects.
  • Includes file system, RPC, and serialization libraries, among others.
Hadoop 1 & 2
Hadoop Ecosystem Projects
• Pig (Yahoo)
  • A data-flow scripting language (Pig Latin) and execution environment for exploring very large datasets.
  • Pig runs on HDFS and MapReduce clusters.
• Hive (Facebook)
  • A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (HiveQL) for querying the data.
• HBase
  • A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
• ZooKeeper
  • A distributed, highly available coordination service.
  • ZooKeeper provides primitives, such as distributed locks, that can be used for building distributed applications.
Hadoop Ecosystem Projects (cont.)
• Sqoop
  • A tool for efficiently moving data between relational databases and HDFS.
• Flume
  • A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
• Oozie
  • A workflow scheduler system for managing Apache Hadoop jobs.
• Cassandra
  • A scalable multi-master database with no single point of failure.
• Mahout
  • A scalable machine learning and data mining library.
Hadoop Ecosystem Projects (cont.)
• Spark
  • A fast analytics and stream-processing engine for Hadoop data.
  • Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
• Tez
  • A generalized data-flow programming framework built on Hadoop YARN.
  • A powerful and flexible engine that executes an arbitrary DAG of tasks to process data for both batch and interactive use cases.
  • Tez is being adopted by Hive™, Pig™, and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
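To make "an arbitrary DAG of tasks" concrete, here is a minimal single-process Python sketch that runs tasks in dependency order with the standard library's `graphlib` (Python 3.9+). It is not the Tez API, which is Java and runs on YARN; `run_dag` and the task names in the usage example are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_dag(tasks: Dict[str, Callable[[], None]], deps: Dict[str, Set[str]]) -> List[str]:
    """Run every task after all of its dependencies and return the execution order.

    `deps` maps a task name to the names it depends on. graphlib raises
    CycleError if the dependencies contain a cycle, so the graph must be a DAG.
    """
    graph = {name: deps.get(name, set()) for name in tasks}
    order: List[str] = []
    for name in TopologicalSorter(graph).static_order():
        tasks[name]()  # a real engine would dispatch this to a worker
        order.append(name)
    return order


if __name__ == "__main__":
    # Hypothetical pipeline: two branches read the same input, then merge.
    tasks = {
        name: (lambda name=name: print(f"running {name}"))
        for name in ("load", "filter", "join", "report")
    }
    deps = {"filter": {"load"}, "join": {"load"}, "report": {"filter", "join"}}
    print(run_dag(tasks, deps))
```

A MapReduce job is the special case of a two-stage graph (map, then reduce). A general DAG lets several stages be chained without writing intermediate results back to HDFS between each pair.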
High Level Architecture
Hortonworks
Cloudera
Cloudera Manager
