BIG DATA
A PROBLEM
http://www.allsoftsolutions.in
BIG DATA
 Big data is a term for data sets that are so large or complex that traditional data-processing application software is inadequate to deal with them.
 Challenges include capture, storage, analysis, search, sharing, transfer, visualization, querying, updating, and information privacy.
 The term "big data" often refers simply to the use of predictive analytics, user-behavior analytics, or other advanced data-analytics methods that extract value from data, and seldom to a particular size of data set.
WHY BIG DATA
 Over 2.5 exabytes (2.5 billion gigabytes) of data are generated every day. The following are some of the sources of this huge volume of data:
 A typical large stock exchange captures more than 1 TB of data every day.
 There are around 5 billion mobile phones (including 1.75 billion smartphones) in the world.
 YouTube users upload more than 48 hours of video every minute.
 Large social networks such as Twitter and Facebook capture more than 10 TB of data daily.
 There are more than 30 million networked sensors in the world.
4V’s BY IBM
 Volume
 Velocity
 Variety
 Veracity
IBM’s definition
 Volume: Big data is always large in volume. It doesn't actually have to be a certain number of petabytes to qualify. If your store of old data and new incoming data has grown so large that you have difficulty handling it, that's big data. Remember that it's going to keep getting bigger.
IBM’s definition
 Velocity: Velocity, or speed, refers to how fast the data is coming in, but also to how fast we need to be able to analyze and utilize it. If we have one or more business processes that require real-time data analysis, we have a velocity challenge. Solving this issue might mean expanding our private cloud to a hybrid model that allows bursting for additional compute power as needed for data analysis.
IBM’s definition
 Variety: Variety points to the number of sources or incoming vectors leading to databases. That might be embedded sensor data, phone conversations, documents, video uploads or feeds, social media, and much more. Variety in data means variety in databases: we will almost certainly need to add a non-relational database if we haven't already done so.
IBM’s definition
 Veracity: Veracity is probably the toughest nut to crack. If we can't trust the data itself, the source of the data, or the processes we are using to identify which data points are important, we have a veracity problem. One of the biggest problems with big data is the tendency for errors to snowball. User-entry errors, redundancy, and corruption all affect the value of data. We must clean our existing data and put processes in place to reduce the accumulation of dirty data going forward.
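A minimal Python sketch of this kind of cleanup. The record format and validity rule here are illustrative assumptions, not part of IBM's definition:

```python
# Data-cleaning sketch: drop records with missing fields (entry errors /
# corruption) and duplicate records (redundancy) before they snowball.
def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec.get("id"), rec.get("value"))
        if None in key:      # user-entry error or corruption: field missing
            continue
        if key in seen:      # redundancy: exact duplicate already kept
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

raw = [
    {"id": 1, "value": 10},
    {"id": 1, "value": 10},  # duplicate
    {"id": 2},               # missing field
    {"id": 3, "value": 30},
]
print(clean(raw))  # [{'id': 1, 'value': 10}, {'id': 3, 'value': 30}]
```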
Types of Data
Structured data:
 Data represented in a tabular format
 E.g.: Databases
Semi-structured data:
 Data that does not conform to a formal data model but contains markers that describe its structure
 E.g.: XML files
Unstructured data:
 Data with no pre-defined data model or organization
 E.g.: Text files
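The three types can be contrasted with a small Python sketch; the sample data is invented for illustration:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: tabular rows with a fixed schema, like a database table or CSV.
table = list(csv.DictReader(io.StringIO("id,name\n1,alice\n2,bob\n")))

# Semi-structured: no rigid schema, but tags mark the semantic elements.
doc = ET.fromstring("<users><user id='1'>alice</user><user id='2'>bob</user></users>")
names = [u.text for u in doc.findall("user")]

# Unstructured: free text with no pre-defined model; any structure must
# be inferred by the program reading it.
text = "Alice emailed Bob on Monday about the quarterly report."

print(table[0]["name"], names, len(text.split()))
```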
Structured Data
 Structured data refers to kinds of data with a
high level of organization, such as
information in a relational database.
 When information is highly structured and
predictable, search engines can more easily
organize and display it in creative ways.
 Structured data markup is a text-based
organization of data that is included in a file
and served from the web.
Semi-structured data
 It is a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as a self-describing structure.
 In semi-structured data, entities belonging to the same class may have different attributes even though they are grouped together.
Unstructured data
 It refers to information that either does not have a
pre-defined data model or is not organized in a pre-
defined manner.
 It is typically text-heavy, but may contain data such
as dates, numbers, and facts as well.
 This results in irregularities and ambiguities that
make it difficult to understand using traditional
programs as compared to data stored in fielded form
in databases or annotated (semantically tagged) in
documents.
WHAT IS HADOOP
 Hadoop is an open source, Java-based programming
framework that supports the processing and storage of
extremely large data sets in a distributed computing
environment. It is part of the Apache project sponsored
by the Apache Software Foundation.
 It consists of computer clusters built from commodity
hardware.
 All the modules in Hadoop are designed with a
fundamental assumption that hardware failures are
common occurrences and should be automatically
handled by the framework.
 The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model.
WHAT IS HADOOP
 Hadoop splits files into large blocks and distributes
them across nodes in a cluster.
 It then transfers packaged code into nodes to
process the data in parallel. This approach takes
advantage of data locality – nodes manipulating the
data they have access to – to allow the dataset to
be processed faster and more efficiently than it
would be in a more conventional supercomputer
architecture that relies on a parallel file
system where computation and data are distributed
via high-speed networking.
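The split-then-process idea above can be sketched in plain Python. This is not the HDFS API; the block size and the per-block computation are toy assumptions:

```python
# Sketch of splitting data into fixed-size blocks and processing each
# block independently, mirroring how Hadoop ships code to the blocks.
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8  # bytes here; real HDFS blocks are 64 MB or larger

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def process_block(block):
    # Stand-in for any per-block computation run where the block lives.
    return block.count(b"a")

data = b"aardvarks and armadillos amble around"
blocks = split_into_blocks(data)
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(process_block, blocks))
print(len(blocks), total)  # 5 blocks, 8 occurrences of 'a'
```

Because each block is self-contained, the per-block results can be computed in parallel and combined at the end, which is the essence of the data-locality argument.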
Core components of Hadoop
 Hadoop Distributed File System (HDFS) – a
distributed file-system that stores data on commodity
machines, providing very high aggregate bandwidth
across the cluster.
 Hadoop MapReduce – an implementation of
the MapReduce programming model for large scale
data processing.
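The MapReduce model itself can be illustrated with a plain-Python word count. This is a sketch of the programming model, not the Hadoop API:

```python
# MapReduce-style word count: map emits (word, 1) pairs, the shuffle
# groups pairs by key, and reduce sums each group.
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data is data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

In real Hadoop the map and reduce functions run on different nodes and the shuffle moves data over the network; the structure of the computation is the same.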
HISTORY OF HADOOP
 The genesis of Hadoop came from the Google File
System paper that was published in October 2003.
 This paper spawned another research paper from
Google – MapReduce: Simplified Data Processing
on Large Clusters.
 Development started on the Apache Nutch project,
but was moved to the new Hadoop subproject in
January 2006.
 Doug Cutting, who was working at Yahoo! at the
time, named it after his son's toy elephant.
HADOOP-ECOSYSTEM
CORE COMPONENTS - HDFS
 The Hadoop Distributed File System (HDFS) was developed using a distributed file system design.
 It runs on commodity hardware.
 Unlike some other distributed systems, HDFS is highly fault tolerant and designed for low-cost hardware.
 HDFS holds very large amounts of data and provides easy access.
 To store such huge data, files are stored across multiple machines.
 HDFS also makes applications available for parallel processing.
Features of HDFS
 It is suitable for distributed storage and processing.
 Hadoop provides a command-line interface to interact with HDFS.
 The built-in servers of the namenode and datanodes help users easily check the status of the cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.
HDFS Architecture
NAMENODE
HDFS follows a master-slave architecture and has the following elements.
 Namenode
 The namenode is the machine that runs an operating system and the namenode software.
 The namenode software can run on inexpensive commodity hardware.
NAMENODE
 The system hosting the namenode acts as the master server and performs the following tasks:
 Manages the file system namespace.
 Regulates clients’ access to files.
 Executes file system operations such as renaming, closing, and opening files and directories.
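A toy sketch of the namespace metadata a namenode maintains. This is illustrative only; the class and method names are invented, and the real namenode tracks far more:

```python
# Minimal model of a file system namespace: each path maps to the ids
# of the blocks that hold its data, and namespace operations such as
# create and rename manipulate that mapping.
class TinyNamespace:
    def __init__(self):
        self.files = {}  # path -> list of block ids

    def create(self, path, block_ids):
        self.files[path] = list(block_ids)

    def rename(self, old, new):
        self.files[new] = self.files.pop(old)

    def blocks_of(self, path):
        return self.files[path]

ns = TinyNamespace()
ns.create("/logs/app.log", ["blk_1", "blk_2"])
ns.rename("/logs/app.log", "/logs/app.old")
print(ns.blocks_of("/logs/app.old"))  # ['blk_1', 'blk_2']
```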
DATANODE
 A datanode is a commodity machine running an operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.
 Datanodes perform read-write operations on the file system, as per client requests.
 They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
BLOCK
 User data is stored in the files of HDFS. A file in the file system is divided into one or more segments that are stored in individual datanodes. These file segments are called blocks. In other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB, but it can be changed as needed in the HDFS configuration.
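The block arithmetic can be checked with a short calculation, assuming the 64 MB default block size:

```python
# How a file maps onto fixed-size blocks: count the full blocks and
# the size of the final, partial block.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default

def block_layout(file_size):
    full_blocks, remainder = divmod(file_size, BLOCK_SIZE)
    total_blocks = full_blocks + (1 if remainder else 0)
    return total_blocks, remainder

# A 200 MB file needs four blocks: three full 64 MB blocks plus one 8 MB block.
blocks, last = block_layout(200 * 1024 * 1024)
print(blocks, last // (1024 * 1024))  # 4 8
```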
