Big Data and Hadoop
Data Facts:-
• The New York Stock Exchange generates about 1 TB of trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
• Twitter generates about 8 TB of data per day.
• The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20
terabytes per month.
• The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of
data per year.
Big Data:-
Big data is commonly summarized by the 3Vs of data, although a fourth V is equally
important. They are as follows:-
Volume: - The total size of the data, which may run to terabytes, petabytes or zettabytes
and is often semi-structured or multi-structured.
Variety: - Generated data is mostly messy because diverse data sources do not provide the
static structure that a traditional RDBMS needs to manage data in a timely way.
Velocity: - The speed at which data is collected, i.e. the rate at which it becomes
available to the organization, together with the need to analyze streaming data so that
decisions can be made within a very short time frame.
Veracity: - The uncertainty about the genuineness of the huge volume of data being
generated.
Pic: - Different levels of data generation
Market trends are raising a new set of questions, such as:-
Social and Web Analytics:-
• What is the social sentiment about my brand or products?
• How effective is our online campaign?
• How can I optimize my traffic to reach the target audience?
Live data feeds:-
• How can we optimize the fleet based on weather and traffic patterns?
Advanced Analytics:-
• How can we better predict our future outcomes?
Hadoop:-
• A Big Data processing platform.
• Uses the “MapReduce” processing paradigm.
• Characteristics:
i. Highly scalable (scales out).
ii. Based on commodity hardware.
iii. Open source, so acquisition and storage costs are very low.
Hadoop consists of two parts: the Hadoop Distributed File System (HDFS) and the
MapReduce framework.
Hadoop Eco-System:-
HDFS Architecture:-
In HDFS, the NameNode is the node that receives all client requests for the file system and
manages all the DataNodes in the cluster (DataNodes are the commodity machines that store
the data and, in Hadoop 1.x, usually also run the computation on it). When a file is written,
it is split into multiple blocks, and the NameNode decides which DataNodes hold each block so
that the blocks are spread across the cluster; the block data itself is sent to the DataNodes
rather than through the NameNode. Each block is replicated for high availability according to
the replication policy (the default is 3), i.e. every block is copied N times and the copies
are stored on different DataNodes.
The Secondary NameNode keeps a periodic checkpoint of the primary NameNode's metadata, so if
the primary goes down, that checkpoint can be used to bring up a replacement. Automatic
failover is not supported, so the recovery has to be done manually, and changes made after
the last checkpoint may be lost.
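The block splitting and replication described above can be sketched in a few lines of Python. This is a simplified illustration, not the HDFS implementation: the function names, the DataNode names and the round-robin placement are assumptions made for the example (real HDFS placement is also rack-aware), and sizes are given in MB only for readability.

```python
import itertools


def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Hypothetical placement policy for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    ring = itertools.cycle(datanodes)
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive entries of the ring are distinct nodes because
        # replication <= len(datanodes).
        placement[block_id] = [next(ring) for _ in range(replication)]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(200, 64)  # a 200 MB file, 64 MB blocks
    print(blocks)
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

With four DataNodes and a replication factor of 3, losing any single machine still leaves at least two copies of every block, which is the availability property the replication policy is meant to provide.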
MapReduce Framework:-
A MapReduce job consists of several functions that are applied in sequence to arrive at the
final result set. The diagram below depicts this flow:-
Pic:- Flow of MapReduce
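As a rough illustration of that flow, the classic word-count example can be written as three plain Python functions for the map, shuffle and reduce stages. This is a single-process sketch of the paradigm, not the Hadoop API; all function names here are chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

In a real cluster the map calls run in parallel on the machines holding the input blocks, and the shuffle moves data across the network so that all values for one key reach the same reducer.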
Hadoop 1.x- In Summary:-
Limitations of Hadoop 1.x:-
• No horizontal scalability of the NameNode:-
Challenges:-
i. Metadata is stored in NameNode memory, i.e. RAM.
ii. Bottleneck beyond ~4000 nodes.
iii. Results in cascading failures of DataNodes.
• No support for NameNode High Availability:-
Challenges:-
i. The Secondary NameNode is not a hot standby for the NameNode.
• Overburdened JobTracker:-
Challenges:-
i. The CPU spends a very significant portion of its time and effort managing the life cycle of
applications.
ii. A single network listener thread communicates with thousands of Map and Reduce
tasks.
• Not possible to run non-MapReduce Big Data applications on HDFS:-
Challenges:-
i. Only MapReduce processing is possible.
ii. Alternate data storage is needed for other kinds of processing, such as real-time and graph
analysis.
• No support for multi-tenancy.
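To see why keeping all metadata in NameNode RAM limits scalability, a back-of-the-envelope estimate helps. The sketch below assumes a commonly quoted rule of thumb of roughly 150 bytes of heap per namespace object (file or block); that figure and the function name are assumptions for illustration and do not come from these slides.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value


def namenode_heap_bytes(num_files, blocks_per_file=1,
                        bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode heap use: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


if __name__ == "__main__":
    for files in (1_000_000, 100_000_000):
        gigabytes = namenode_heap_bytes(files) / 1e9
        print(f"{files:>11,} files -> about {gigabytes:.1f} GB of heap")
```

Under this assumption the memory of a single machine caps the number of files the cluster can hold, no matter how many DataNodes are added.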
Hadoop 2.x:- The enhanced features are as follows:-
• HDFS Federation.
• Support for NameNode High Availability.
• YARN (Yet Another Resource Negotiator):
i. Better processing control.
ii. Support for non-MapReduce types of processing.
iii. Support for multi-tenancy.
Hadoop 2.x- In Summary:-
Pic:- Structure of Hadoop 2.x
Yet Another Resource Negotiator (YARN):- It enables multiple types of workloads to run on the cluster.
Multi-tenancy - Capacity Scheduler:-
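The idea behind capacity-based multi-tenancy is that each tenant's queue is guaranteed a percentage of the cluster. The Python sketch below shows only that arithmetic; the queue names and the function are hypothetical, and the real Capacity Scheduler adds features such as hierarchical queues and letting a queue borrow idle capacity.

```python
def queue_capacities(total_containers, queue_percent):
    """Split cluster containers among queues by guaranteed percentage.

    queue_percent maps a (hypothetical) queue name to its share in percent.
    """
    if sum(queue_percent.values()) != 100:
        raise ValueError("queue percentages must sum to 100")
    return {
        queue: total_containers * percent // 100
        for queue, percent in queue_percent.items()
    }


if __name__ == "__main__":
    shares = {"analytics": 50, "etl": 30, "adhoc": 20}
    print(queue_capacities(200, shares))
```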
Structural differences between Hadoop 1.x and 2.x:-
