PRESENTATION FOR
BIG DATA & HADOOP
BY SONAL TIWARI
UNDERSTANDING BIG DATA
Big data involves the data produced by different devices and
applications. Some of the fields that come under Big Data are:
WHAT IS BIG DATA?
01 Black Box Data − It is a component of helicopters, airplanes, jets, etc. It
captures the voices of the flight crew, recordings of microphones and
earphones, and the performance information of the aircraft.
02 Social Media Data − Social media such as Facebook and Twitter hold
information and the views posted by millions of people across the globe.
03 Stock Exchange Data − The stock exchange data holds information
about the ‘buy’ and ‘sell’ decisions made by customers on the shares of
different companies.
UNDERSTANDING BIG DATA
04 Transport Data − Transport data includes the model, capacity,
distance and availability of a vehicle.
05 Search Engine Data − Search engines retrieve lots of data from
different databases.
06 Power Grid Data − The power grid data holds information about the
power consumed by a particular node with respect to a base station.
UNDERSTANDING BIG DATA
 Big data is a collection of large datasets that cannot be processed
using traditional computing techniques.
 The 4 V’s that define the data sets in Big Data are:
o Volume
o Velocity
o Variety
o Veracity
DEFINITION OF BIG DATA?
UNDERSTANDING BIG DATA
4V’S OF BIG DATA
 VOLUME − Refers to the vast amount of data generated every second.
 VARIETY − Refers to the different types of data, such as messages, audio
and video recordings, and images.
 VELOCITY − Refers to the speed at which new data is generated and the
speed at which it moves around.
 VERACITY − Refers to the messiness and trustworthiness of the data.
UNDERSTANDING BIG DATA
DEFINITION OF BIG DATA?
Big Data challenges include: capturing data, curation, storage, searching,
sharing, transfer, and analysis.
UNDERSTANDING BIG DATA
 The enterprise stores and processes Big Data in a centralised
computer/database from vendors such as Oracle, IBM, etc.
 The user interacts with the application, which in turn handles the
part of data storage and analysis.
TRADITIONAL APPROACH OF BIG DATA PROCESSING AND LIMITATIONS
(Diagram: User → Centralised System → Relational Database)
 This approach works fine for applications that process less
voluminous data, i.e. data that can be accommodated by standard
database servers or handled within the limits of the processor; it
breaks down once the data grows beyond that scale.
LIMITATIONS
UNDERSTANDING BIG DATA
 Google solved the limitations of traditional methods using an
algorithm called MapReduce.
This algorithm divides the task into small parts, assigns them to
many computers, and collects the results from them, which, when
integrated, form the result dataset.
LATEST APPROACH: GOOGLE SOLUTION
(Diagram: User → Centralised System → multiple Commodity Hardware nodes)
UNDERSTANDING HADOOP & ITS COMPONENTS
 Using the solution provided by Google, Doug Cutting and his
team developed an Open Source Project called HADOOP.
 Hadoop runs applications using the MapReduce algorithm, where
the data is processed in parallel on different nodes.
 Hadoop is used to develop applications that could perform
complete statistical analysis on huge amounts of data.
 Hadoop is an Apache open source framework written in Java
that allows distributed processing of large datasets across
clusters of computers using simple programming models.
 The Hadoop framework application works in an environment
that provides distributed storage and computation across
clusters of computers.
 Hadoop is designed to scale up from a single server to thousands
of machines, each offering local computation and storage.
INTRODUCTION TO HADOOP
UNDERSTANDING HADOOP & ITS COMPONENTS
INTRODUCTION TO HADOOP
UNDERSTANDING HADOOP & ITS COMPONENTS
LAYERS OF HADOOP
UNDERSTANDING HADOOP & ITS COMPONENTS
LAYERS OF HADOOP
 MapReduce : MapReduce is a parallel programming model for writing
distributed applications devised at Google for efficient processing of
large amounts of data (multi-terabyte data-sets), on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner. The MapReduce program runs on Hadoop which is an Apache
open-source framework.
 Hadoop Distributed File System: The Hadoop Distributed File System
(HDFS) is based on the Google File System (GFS) and provides a
distributed file system that is designed to run on commodity hardware.
It is highly fault-tolerant and is designed to be deployed on low-cost
hardware. It provides high throughput access to application data and is
suitable for applications having large datasets.
 Hadoop has two major layers namely −
 Processing/Computation layer (MapReduce), and
 Storage layer (Hadoop Distributed File System).
UNDERSTANDING HADOOP & ITS COMPONENTS
LAYERS OF HADOOP
 Apart from the above-mentioned two core components, Hadoop
framework also includes the following two modules −
 Hadoop Common − These are Java libraries and utilities
required by other Hadoop modules.
 Hadoop YARN − This is a framework for job scheduling and
cluster resource management.
UNDERSTANDING HADOOP & ITS COMPONENTS
ADVANTAGES OF HADOOP
 Hadoop framework allows the user to quickly write and test
distributed systems. It is efficient, and it automatically distributes
the data and work across the machines and, in turn, utilizes the
underlying parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault-tolerance and
high availability (FTHA), rather Hadoop library itself has been
designed to detect and handle failures at the application layer.
 Servers can be added or removed from the cluster dynamically
and Hadoop continues to operate without interruption.
 Hadoop is open source and, being Java based, is compatible with
all platforms.
COMPONENTS OF HADOOP ECOSYSTEM
 HDFS: Hadoop Distributed File System
 MapReduce: Programming based Data Processing
 YARN: Yet Another Resource Negotiator
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
COMPONENTS OF HADOOP ECOSYSTEM
 Data Storage: HDFS (File System), HBase (Column DB Storage)
 Data Processing: MapReduce (Distributed Processing), YARN (Cluster &
Resource Management)
 Data Access: Hive (SQL), Pig (Dataflow), Mahout (Machine Learning),
Avro (RPC), Sqoop (RDBMS Connector)
 Data Management: Oozie (Workflow Monitoring), Chukwa (Monitoring),
Flume (Monitoring), ZooKeeper (Management)
DATA STORAGE COMPONENT OF HADOOP
HDFS
 Hadoop File System was developed using distributed file system
design.
 It is run on commodity hardware.
 Unlike other distributed systems, HDFS is highly fault-tolerant and
designed using low-cost hardware.
 HDFS holds a very large amount of data and provides easier access.
To store such huge data, the files are stored across multiple
machines.
 The files are stored in redundant fashion to rescue the system
from possible data losses in case of failure.
 HDFS also makes applications available for parallel processing.
DATA STORAGE COMPONENT OF HADOOP
HDFS - ARCHITECTURE
DATA STORAGE COMPONENT OF HADOOP
HDFS ARCHITECTURE
HDFS follows the master-slave architecture and it has the following
elements.
 Namenode : The namenode is commodity hardware that
contains the GNU/Linux operating system and the namenode
software. The system running the namenode acts as the master
server and performs the following tasks:
 Manages the file system namespace.
 Regulates client’s access to files.
 It also executes file system operations such as renaming,
closing, and opening files and directories.
DATA STORAGE COMPONENT OF HADOOP
HDFS ARCHITECTURE
 Datanode: The datanode is commodity hardware running the
GNU/Linux operating system and the datanode software. For every
node (commodity hardware/system) in a cluster, there will be a
datanode. These nodes manage the data storage of their system.
 Datanodes perform read-write operations on the file systems,
as per client request.
 They perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
 Block: The user data is stored in the files of HDFS. A file in the
file system is divided into one or more segments, which are
stored on individual datanodes. These file segments are called
blocks; in other words, a block is the minimum amount of data that
HDFS can read or write. The default block size is 64 MB in early
Hadoop releases (128 MB in Hadoop 2.x and later).
DATA STORAGE COMPONENT OF HADOOP
HBASE
 It’s a NoSQL database which supports all kinds of data and is thus
capable of handling any data stored in a Hadoop cluster.
 It provides the capabilities of Google’s BigTable, and is thus able to work
on Big Data sets effectively.
 At times when we need to search for or retrieve a small piece of data
in a huge database, the request must be processed within a short
span of time. At such times, HBase comes in handy, as it gives a
fault-tolerant way of storing and quickly retrieving such data.
DATA STORAGE COMPONENT OF HADOOP
HBASE- COMPONENTS
 HBase master: It is not part of the actual data storage, but it
manages load balancing activities across all Region Servers.
 It controls the failovers.
 Performs administration activities which provide an interface
for creating, updating and deleting tables.
 Handles DDL operations.
 It maintains and monitors the Hadoop cluster.
 Region server: It is a worker node which handles read, write, and delete
requests from clients. A region server runs on every node of the Hadoop
cluster, i.e. on the HDFS datanodes.
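A minimal sketch of interacting with HBase through its shell (assuming a running cluster; the 'users' table and the 'info' column family are hypothetical names):

    hbase shell
    create 'users', 'info'                      # table with one column family
    put 'users', 'row1', 'info:name', 'Sonal'   # write a single cell
    get 'users', 'row1'                         # read one row
    scan 'users'                                # scan the whole table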
DATA PROCESSING COMPONENT OF HADOOP
MAP REDUCE
 By making use of distributed and parallel algorithms,
MapReduce carries the processing logic over to the data
and helps to write applications which transform big data sets into
manageable ones.
 MapReduce makes use of two functions, i.e. Map() and
Reduce(), whose tasks are:
 Map() performs sorting and filtering of data, thereby
organizing it in the form of groups. Map generates a key-
value pair based result which is later processed by the
Reduce() method.
 Reduce(), as the name suggests, does the summarization by
aggregating the mapped data. In simple terms, Reduce() takes the
output generated by Map() as input and combines those tuples
into a smaller set of tuples.
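To make the Map()/Reduce() division concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (the WordCount class name and the input/output paths are illustrative, not part of the original slides):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map(): tokenizes each input line and emits a (word, 1) key-value pair per token.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);           // handed to the framework for grouping by key
          }
        }
      }

      // Reduce(): receives all values for one word and combines them into a single count.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);            // (word, total count)
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional map-side aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, such a job is typically launched with: hadoop jar wordcount.jar WordCount /input /output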
DATA PROCESSING COMPONENT OF HADOOP
MAP REDUCE- FEATURES
Features of MapReduce:
 Simplicity − jobs are easy to run
 Scalability − can process petabytes of data
 Speed − parallel processing improves speed
 Fault tolerance − takes care of failures
DATA PROCESSING COMPONENT OF HADOOP
YARN
 YARN (Yet Another Resource Negotiator), as the name implies, is the
component that helps to manage the resources across the clusters. It
performs scheduling and resource allocation for the Hadoop
system.
 It consists of three major components, i.e.
 Resource Manager
 Node Manager
 Application Manager
 The Resource Manager has the privilege of allocating resources for the
applications in the system, whereas Node Managers handle the
allocation of resources such as CPU, memory and bandwidth per
machine and later report back to the Resource Manager. The
Application Manager works as an interface between the Resource
Manager and Node Managers and performs negotiations as per the
requirement of the two.
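A minimal sketch of how an administrator can observe this from the standard yarn command-line client (assuming a running cluster; the application id is a placeholder):

    yarn node -list                              # Node Managers registered with the Resource Manager
    yarn application -list                       # applications currently tracked by the Resource Manager
    yarn application -status <application_id>    # details of a single application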
DATA PROCESSING COMPONENT OF HADOOP
YARN- KEY BENEFITS
Key benefits of YARN:
 Improved cluster utilization
 Highly scalable
 Beyond Java
 Novel programming models & services
 Agility
DATA ACCESS COMPONENT OF HADOOP
HIVE
 With the help of an SQL-like methodology and interface, HIVE performs
reading and writing of large data sets. Its query language is called
HQL (Hive Query Language).
 It is highly scalable, as it allows both real-time processing and batch
processing. Also, all the SQL datatypes are supported by
Hive, thus making query processing easier.
 HIVE comes with two components: JDBC Drivers and the HIVE
Command Line.
 JDBC, along with ODBC drivers, works on establishing the data
storage permissions and connection, whereas the HIVE command line
helps in the processing of queries.
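A minimal HQL sketch of this SQL-like interface (the table name, columns and file path are hypothetical):

    -- define a table over delimited files and load data into it
    CREATE TABLE page_views (user_id STRING, url STRING, view_time STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    LOAD DATA INPATH '/data/page_views.csv' INTO TABLE page_views;

    -- the ten most-viewed URLs; Hive compiles this into distributed jobs behind the scenes
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;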
DATA ACCESS COMPONENT OF HADOOP
HIVE
DATA ACCESS COMPONENT OF HADOOP
PIG
 Pig was developed by Yahoo. It works on the Pig Latin language,
which is a query-based language similar to SQL.
 It is a platform for structuring the data flow, and for processing and
analyzing huge data sets.
 Pig does the work of executing commands, and in the background
all the activities of MapReduce are taken care of. After the
processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and
runs on the Pig Runtime, just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and
hence is a major segment of the Hadoop Ecosystem.
DATA ACCESS COMPONENT OF HADOOP
PIG
(Diagram: a Pig script, written in Pig Latin to express data flows and registering
UDFs from the local file system, is compiled by the Pig Latin compiler in the Pig
execution environment; it produces MapReduce jobs that run internally (or run
locally in a single JVM), reading input files from HDFS and storing output files
back in HDFS.)
DATA ACCESS COMPONENT OF HADOOP
MAHOUT
 Mahout brings machine learnability to a system or application.
 Machine learning helps a system to develop itself based on
patterns, user/environmental interaction, or on the basis of
algorithms.
 It provides various libraries or functionalities such as
collaborative filtering, clustering, and classification which are
nothing but concepts of Machine learning.
 It allows invoking algorithms as per our need with the help of its
own libraries.
DATA ACCESS COMPONENT OF HADOOP
AVRO
 Apache Avro works as a data serialization system. It helps Hadoop
in data serialization and data exchange.
 Avro enables the exchange of big data between programs written in
different languages. It serializes data into files or messages.
 Avro Schema: A schema helps Avro in the serialization and
deserialization process without code generation. Avro needs a
schema to read and write data.
 Dynamic typing: it means serializing and deserializing data
without generating any code. Code generation remains available as an
optional optimization, mainly worthwhile for statically typed
languages.
DATA ACCESS COMPONENT OF HADOOP
SQOOP
 Sqoop works as a front-end loader of Big Data.
 Sqoop is a front-end interface that enables moving bulk data
between Hadoop and relational databases or variously structured
data marts.
 Sqoop replaces hand-written scripts for importing and exporting
data. It mainly helps in moving data from an enterprise database
to the Hadoop cluster to perform the ETL process.
 Sqoop fulfills the growing need to transfer data from the
mainframe to HDFS.
 Sqoop helps in achieving improved compression and light-weight
indexing for advanced query performance.
DATA ACCESS COMPONENT OF HADOOP
SQOOP
 It transfers data in parallel for effective performance and optimal
system utilization.
 Sqoop creates fast data copies from an external source into
Hadoop.
 It acts as a load balancer by offloading extra storage and
processing loads to other devices.
DATA ACCESS COMPONENT OF HADOOP
SQOOP: AS AN ETL
(Diagram: Sqoop sits between an RDBMS (MySQL, Oracle, etc.) and the Hadoop
file system (HDFS, Hive, etc.); Import moves data from the RDBMS into Hadoop,
Export moves it back out.)
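A minimal sketch of both directions from the Sqoop command line (the connection string, credentials, table names and directories are all hypothetical):

    # import a relational table into HDFS, using 4 parallel map tasks
    sqoop import \
        --connect jdbc:mysql://dbhost/sales \
        --username etl_user -P \
        --table orders \
        --target-dir /data/orders \
        --num-mappers 4

    # export processed results from HDFS back into the relational database
    sqoop export \
        --connect jdbc:mysql://dbhost/sales \
        --username etl_user -P \
        --table order_summary \
        --export-dir /output/order_summary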
DATA MANAGEMENT COMPONENT OF HADOOP
OOZIE
 Apache Oozie is a tool in which all sorts of programs can be
pipelined in a required manner to work in Hadoop's distributed
environment.
 Oozie works as a scheduler system to run and manage Hadoop
jobs.
 Oozie allows combining multiple complex jobs to be run in a
sequential order to achieve the desired output.
 It is strongly integrated with the Hadoop stack, supporting various jobs
like Pig, Hive and Sqoop, as well as system-specific jobs like Java and
shell.
 Oozie is an open source Java web application.
DATA MANAGEMENT COMPONENT OF HADOOP
OOZIE
Oozie supports two kinds of jobs:
 Oozie workflow: It is a collection of actions arranged to perform
the jobs one after another. It is just like a relay race, where each
runner starts right after the previous one finishes, to complete the race.
 Oozie Coordinator: It runs workflow jobs based on the
availability of data and predefined schedules.
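A minimal sketch of a workflow definition (Oozie workflows are written in XML; the action names, connection string, paths and the ${jobTracker}/${nameNode} parameters are illustrative):

    <workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
      <start to="import-orders"/>
      <action name="import-orders">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <command>import --connect jdbc:mysql://dbhost/sales --table orders --target-dir /data/orders</command>
        </sqoop>
        <ok to="aggregate"/>
        <error to="fail"/>
      </action>
      <action name="aggregate">
        <hive xmlns="uri:oozie:hive-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>aggregate.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>ETL pipeline failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>

A coordinator definition would then point at this workflow and trigger it on a schedule or when the input data becomes available.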
DATA MANAGEMENT COMPONENT OF HADOOP
FLUME
 Apache Flume is a tool/service/data-ingestion mechanism for
collecting, aggregating and transporting large amounts of
streaming data, such as log files and events, from various
sources to a centralized data store.
 Flume is a highly reliable, distributed, and configurable tool. It is
principally designed to copy streaming data (log data) from
various web servers to HDFS.
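A minimal sketch of a Flume agent configuration for exactly that pattern (the agent, source, channel and sink names, and the log path, are hypothetical):

    # agent1 tails a web-server log and writes the events to HDFS
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/httpd/access_log
    agent1.sources.src1.channels = ch1

    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = /flume/weblogs/%Y-%m-%d
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.channel = ch1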
DATA MANAGEMENT COMPONENT OF HADOOP
ZOOKEEPER
 Apache Zookeeper is an open source project designed to
coordinate multiple services in the Hadoop ecosystem.
 Zookeeper performs tasks like synchronization, inter-component
communication, grouping, and maintenance.
 Features of Zookeeper:
 Zookeeper acts fast enough with workloads where reads to
data are more common than writes.
 Zookeeper maintains a record of all transactions.
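A minimal sketch of working with ZooKeeper's hierarchical znodes through its command-line shell (the server address and znode paths are hypothetical):

    zkCli.sh -server zk1:2181
    ls /                        # list the root znodes
    create /app ""              # create a parent znode
    create /app/config "v1"     # store a small piece of coordination data
    get /app/config             # read it back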
OTHER IMPORTANT COMPONENTS OF HADOOP
 Solr, Lucene: These are two services that perform the tasks of
searching and indexing with the help of Java libraries.
Lucene is a Java library that also provides a spell-check
mechanism; Solr is built on top of Lucene.
 Spark
 It’s a platform that handles all the processing-intensive tasks
like batch processing, interactive or iterative real-time
processing, graph conversions, and visualization, etc.
 It uses in-memory resources, and is hence faster than
MapReduce in terms of optimization.
 Spark is best suited for real-time data whereas Hadoop is best
suited for structured data or batch processing, hence both are
used interchangeably in most companies.
BIG DATA & HADOOP SECURITY
 Knox provides a framework for managing security and supports
security implementations on Hadoop clusters.
 Knox is a REST API gateway developed within the Apache
community to support monitoring, authorization management,
auditing, and policy enforcement on Hadoop clusters.
 Knox provides a single access point for all REST interactions with
clusters.
 Through Knox, system administrators can manage authentication
via LDAP and Active Directory, conduct HTTP header-based
federated identity management, and audit hardware on the
clusters.
 Knox supports enhanced security because it can integrate with
enterprise identity management solutions and is Kerberos
compatible.
HADOOP SECURITY MANAGEMENT TOOL: KNOX
BIG DATA & HADOOP SECURITY
 Ranger provides a centralized framework that can be used to
manage policies at the resource level, such as files, folders,
databases, and even specific rows and columns within
databases.
 Ranger helps administrators implement access policies by group,
data type, etc.
 Ranger has different authorization functionality for different
Hadoop components such as YARN, HBase, Hive, etc.
HADOOP SECURITY MANAGEMENT TOOL: RANGER
BIG DATA & HADOOP SECURITY
 In core Hadoop technology, HDFS has directories called
encryption zones. When data is written to Hadoop it is
automatically encrypted (with a user-selected algorithm) and
assigned to an encryption zone.
 Encryption is file specific, not zone specific. That means each file
within the zone is encrypted with its own unique data encryption
key (DEK).
 Clients reading or writing data in HDFS receive an encrypted data
encryption key (EDEK), have it decrypted into a DEK, and then use
the DEK to read and write data.
 Encryption zones and DEK encryption occur between the file
system and database levels of the architecture.
HADOOP ENCRYPTION
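A minimal sketch of setting up an encryption zone from the command line (assuming a Hadoop KMS is configured; the key name and path are hypothetical):

    # create a key in the key management server and an encryption zone backed by it
    hadoop key create logskey
    hdfs dfs -mkdir -p /secure/logs
    hdfs crypto -createZone -keyName logskey -path /secure/logs
    hdfs crypto -listZones

    # files written under the zone are now transparently encrypted with per-file DEKs
    hdfs dfs -put access_log /secure/logs/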