BIG DATA
A PROBLEM
http://www.allsoftsolutions.in
BIG DATA
 Big data is a term for data sets that are so large or complex that traditional data-processing application software is inadequate to deal with them.
 Challenges include capture, storage, analysis, search, sharing, transfer, visualization, querying, updating, and information privacy.
 The term "big data" often refers simply to the use of predictive analytics, user-behavior analytics, or other advanced data-analytics methods that extract value from data, and seldom to a particular size of data set.
WHY BIG DATA
 Over 2.5 exabytes (2.5 billion gigabytes) of data are generated every day. The following are some of the sources of this huge volume of data:
 A typical large stock exchange captures more than 1 TB of data every day.
 There are around 5 billion mobile phones (including 1.75 billion smartphones) in the world.
 YouTube users upload more than 48 hours of video every minute.
 Large social networks such as Twitter and Facebook capture more than 10 TB of data daily.
 There are more than 30 million networked sensors in the world.
4V’s BY IBM
 Volume
 Velocity
 Variety
 Veracity
IBM’s definition
 Volume: Big data is always large in volume. It doesn't actually have to be a certain number of petabytes to qualify. If your store of old data and new incoming data has grown so large that you have difficulty handling it, that's big data. Remember that it's going to keep getting bigger.
IBM’s definition
 Velocity: Velocity, or speed, refers to how fast the data is coming in, but also to how fast we need to be able to analyze and utilize it. If we have one or more business processes that require real-time data analysis, we have a velocity challenge. Solving this issue might mean expanding our private cloud to a hybrid model that allows bursting for additional compute power as needed for data analysis.
IBM’s definition
 Variety: Variety points to the number of sources or incoming vectors leading to databases. That might be embedded sensor data, phone conversations, documents, video uploads or feeds, social media, and much more. Variety in data means variety in databases: we will almost certainly need to add a non-relational database if we haven't already done so.
IBM’s definition
 Veracity: Veracity is probably the toughest nut to crack. If we can't trust the data itself, the source of the data, or the processes we are using to identify which data points are important, we have a veracity problem. One of the biggest problems with big data is the tendency for errors to snowball. User-entry errors, redundancy, and corruption all affect the value of data. We must clean our existing data and put processes in place to reduce the accumulation of dirty data going forward.
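A minimal Python sketch of this kind of cleanup. The record format and validity rule here are illustrative assumptions, not part of IBM's definition:

```python
# Data-cleaning sketch: drop records with missing fields (entry errors /
# corruption) and duplicate records (redundancy) before they snowball.
def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec.get("id"), rec.get("value"))
        if None in key:      # user-entry error or corruption: field missing
            continue
        if key in seen:      # redundancy: exact duplicate already kept
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

raw = [
    {"id": 1, "value": 10},
    {"id": 1, "value": 10},  # duplicate
    {"id": 2},               # missing field
    {"id": 3, "value": 30},
]
print(clean(raw))  # [{'id': 1, 'value': 10}, {'id': 3, 'value': 30}]
```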
Types of Data
Structured data:
 Data represented in a tabular format
 E.g.: Databases
Semi-structured data:
 Data that does not conform to a formal data model but contains markers that describe its structure
 E.g.: XML files
Unstructured data:
 Data with no pre-defined data model or organization
 E.g.: Text files
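The three types can be contrasted with a small Python sketch; the sample data is invented for illustration:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: tabular rows with a fixed schema, like a database table or CSV.
table = list(csv.DictReader(io.StringIO("id,name\n1,alice\n2,bob\n")))

# Semi-structured: no rigid schema, but tags mark the semantic elements.
doc = ET.fromstring("<users><user id='1'>alice</user><user id='2'>bob</user></users>")
names = [u.text for u in doc.findall("user")]

# Unstructured: free text with no pre-defined model; any structure must
# be inferred by the program reading it.
text = "Alice emailed Bob on Monday about the quarterly report."

print(table[0]["name"], names, len(text.split()))
```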
Structured Data
 Structured data refers to kinds of data with a
high level of organization, such as
information in a relational database.
 When information is highly structured and
predictable, search engines can more easily
organize and display it in creative ways.
 Structured data markup is a text-based
organization of data that is included in a file
and served from the web.
Semi-structured data
 It is a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as a self-describing structure.
 In semi-structured data, entities belonging to the same class may have different attributes even though they are grouped together.
Unstructured data
 It refers to information that either does not have a
pre-defined data model or is not organized in a pre-
defined manner.
 It is typically text-heavy, but may contain data such
as dates, numbers, and facts as well.
 This results in irregularities and ambiguities that
make it difficult to understand using traditional
programs as compared to data stored in fielded form
in databases or annotated (semantically tagged) in
documents.
WHAT IS HADOOP
 Hadoop is an open source, Java-based programming
framework that supports the processing and storage of
extremely large data sets in a distributed computing
environment. It is part of the Apache project sponsored
by the Apache Software Foundation.
 It consists of computer clusters built from commodity
hardware.
 All the modules in Hadoop are designed with a
fundamental assumption that hardware failures are
common occurrences and should be automatically
handled by the framework.
 The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model.
WHAT IS HADOOP
 Hadoop splits files into large blocks and distributes
them across nodes in a cluster.
 It then transfers packaged code into nodes to
process the data in parallel. This approach takes
advantage of data locality – nodes manipulating the
data they have access to – to allow the dataset to
be processed faster and more efficiently than it
would be in a more conventional supercomputer
architecture that relies on a parallel file
system where computation and data are distributed
via high-speed networking.
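The split-then-process idea above can be sketched in plain Python. This is not the HDFS API; the block size and the per-block computation are toy assumptions:

```python
# Sketch of splitting data into fixed-size blocks and processing each
# block independently, mirroring how Hadoop ships code to the blocks.
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8  # bytes here; real HDFS blocks are 64 MB or larger

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def process_block(block):
    # Stand-in for any per-block computation run where the block lives.
    return block.count(b"a")

data = b"aardvarks and armadillos amble around"
blocks = split_into_blocks(data)
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(process_block, blocks))
print(len(blocks), total)  # 5 blocks, 8 occurrences of 'a'
```

Because each block is self-contained, the per-block results can be computed in parallel and combined at the end, which is the essence of the data-locality argument.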
Core components of Hadoop
 Hadoop Distributed File System (HDFS) – a
distributed file-system that stores data on commodity
machines, providing very high aggregate bandwidth
across the cluster.
 Hadoop MapReduce – an implementation of
the MapReduce programming model for large scale
data processing.
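The MapReduce model itself can be illustrated with a plain-Python word count. This is a sketch of the programming model, not the Hadoop API:

```python
# MapReduce-style word count: map emits (word, 1) pairs, the shuffle
# groups pairs by key, and reduce sums each group.
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data is data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

In real Hadoop the map and reduce functions run on different nodes and the shuffle moves data over the network; the structure of the computation is the same.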
HISTORY OF HADOOP
 The genesis of Hadoop came from the Google File
System paper that was published in October 2003.
 This paper spawned another research paper from
Google – MapReduce: Simplified Data Processing
on Large Clusters.
 Development started on the Apache Nutch project,
but was moved to the new Hadoop subproject in
January 2006.
 Doug Cutting, who was working at Yahoo! at the
time, named it after his son's toy elephant.
HADOOP-ECOSYSTEM
CORE COMPONENTS - HDFS
 The Hadoop Distributed File System (HDFS) was developed using a distributed file system design.
 It runs on commodity hardware.
 Unlike some other distributed systems, HDFS is highly fault tolerant and designed for low-cost hardware.
 HDFS holds very large amounts of data and provides easy access.
 To store such huge data, files are stored across multiple machines.
 HDFS also makes applications available for parallel processing.
Features of HDFS
 It is suitable for distributed storage and processing.
 Hadoop provides a command-line interface to interact with HDFS.
 The built-in servers of the namenode and datanodes help users easily check the status of the cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.
HDFS Architecture
NAMENODE
HDFS follows a master-slave architecture and has the following elements.
 Namenode
 The namenode is the machine that runs an operating system and the namenode software.
 The namenode software can run on inexpensive commodity hardware.
NAMENODE
 The system hosting the namenode acts as the master server and performs the following tasks:
 Manages the file system namespace.
 Regulates clients’ access to files.
 Executes file system operations such as renaming, closing, and opening files and directories.
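A toy sketch of the namespace metadata a namenode maintains. This is illustrative only; the class and method names are invented, and the real namenode tracks far more:

```python
# Minimal model of a file system namespace: each path maps to the ids
# of the blocks that hold its data, and namespace operations such as
# create and rename manipulate that mapping.
class TinyNamespace:
    def __init__(self):
        self.files = {}  # path -> list of block ids

    def create(self, path, block_ids):
        self.files[path] = list(block_ids)

    def rename(self, old, new):
        self.files[new] = self.files.pop(old)

    def blocks_of(self, path):
        return self.files[path]

ns = TinyNamespace()
ns.create("/logs/app.log", ["blk_1", "blk_2"])
ns.rename("/logs/app.log", "/logs/app.old")
print(ns.blocks_of("/logs/app.old"))  # ['blk_1', 'blk_2']
```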
DATANODE
 A datanode is a commodity machine running an operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.
 Datanodes perform read-write operations on the file system, as per client requests.
 They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
BLOCK
 User data is stored in the files of HDFS. A file in the file system is divided into one or more segments that are stored in individual datanodes. These file segments are called blocks. In other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB, but it can be changed as needed in the HDFS configuration.
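The block arithmetic can be checked with a short calculation, assuming the 64 MB default block size:

```python
# How a file maps onto fixed-size blocks: count the full blocks and
# the size of the final, partial block.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default

def block_layout(file_size):
    full_blocks, remainder = divmod(file_size, BLOCK_SIZE)
    total_blocks = full_blocks + (1 if remainder else 0)
    return total_blocks, remainder

# A 200 MB file needs four blocks: three full 64 MB blocks plus one 8 MB block.
blocks, last = block_layout(200 * 1024 * 1024)
print(blocks, last // (1024 * 1024))  # 4 8
```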
