PRESENTATION FOR
BIG DATA & HADOOP
BY SONAL TIWARI
UNDERSTANDING BIG DATA
Big data involves the data produced by different devices and
applications. Some of the fields that come under Big Data are:
WHAT IS BIG DATA?
01 Black Box Data − It is a component of helicopters, airplanes, jets, etc. It
captures the voices of the flight crew, recordings of microphones and
earphones, and the performance information of the aircraft.
02 Social Media Data − Social media such as Facebook and Twitter hold
information and the views posted by millions of people across the globe.
03 Stock Exchange Data − The stock exchange data holds information
about the ‘buy’ and ‘sell’ decisions made by customers on the shares of
different companies.
UNDERSTANDING BIG DATA
04 Transport Data − Transport data includes the model, capacity,
distance and availability of a vehicle.
05 Search Engine Data − Search engines retrieve lots of data from
different databases.
06 Power Grid Data − The power grid data holds information about the
power consumed by a particular node with respect to a base station.
UNDERSTANDING BIG DATA
 Big data is a collection of large datasets that cannot be processed
using traditional computing techniques.
 The 4 V’s that define the data sets in Big Data are:
o Volume
o Velocity
o Variety
o Veracity
DEFINITION OF BIG DATA?
UNDERSTANDING BIG DATA
4V’S OF BIG DATA
 VOLUME − Refers to the vast amount of data generated every second.
 VARIETY − Refers to the different types of data, such as messages, audio
and video recordings, and images.
 VELOCITY − Refers to the speed at which new data is generated and the
speed at which it moves around.
 VERACITY − Refers to the messiness and trustworthiness of the data.
UNDERSTANDING BIG DATA
DEFINITION OF BIG DATA?
Big Data challenges include: capturing data, curation, storage, searching,
sharing, transfer, and analysis.
UNDERSTANDING BIG DATA
 The enterprise stores and processes Big Data in a centralised
computer/database from vendors such as Oracle, IBM, etc.
 The user interacts with the application, which in turn handles the
part of data storage and analysis.
TRADITIONAL APPROACH OF BIG DATA PROCESSING AND LIMITATIONS
(Diagram: User → Centralised System → Relational Database)
 This approach works fine for applications that process less
voluminous data, i.e. data that can be accommodated by standard
database servers or handled within the limits of the processor; it
breaks down once the data grows beyond that scale.
LIMITATIONS
UNDERSTANDING BIG DATA
 Google solved the limitations of traditional methods using an
algorithm called MapReduce.
This algorithm divides the task into small parts, assigns them to
many computers, and collects the results from them, which, when
integrated, form the result dataset.
LATEST APPROACH: GOOGLE SOLUTION
(Diagram: User → Centralised System → multiple Commodity Hardware nodes)
UNDERSTANDING HADOOP & ITS COMPONENTS
 Using the solution provided by Google, Doug Cutting and his
team developed an Open Source Project called HADOOP.
 Hadoop runs applications using the MapReduce algorithm, where
the data is processed in parallel on different nodes.
 Hadoop is used to develop applications that could perform
complete statistical analysis on huge amounts of data.
 Hadoop is an Apache open source framework written in Java
that allows distributed processing of large datasets across
clusters of computers using simple programming models.
 The Hadoop framework application works in an environment
that provides distributed storage and computation across
clusters of computers.
 Hadoop is designed to scale up from a single server to thousands
of machines, each offering local computation and storage.
INTRODUCTION TO HADOOP
UNDERSTANDING HADOOP & ITS COMPONENTS
INTRODUCTION TO HADOOP
UNDERSTANDING HADOOP & ITS COMPONENTS
LAYERS OF HADOOP
UNDERSTANDING HADOOP & ITS COMPONENTS
LAYERS OF HADOOP
 MapReduce : MapReduce is a parallel programming model for writing
distributed applications devised at Google for efficient processing of
large amounts of data (multi-terabyte data-sets), on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner. The MapReduce program runs on Hadoop which is an Apache
open-source framework.
 Hadoop Distributed File System: The Hadoop Distributed File System
(HDFS) is based on the Google File System (GFS) and provides a
distributed file system that is designed to run on commodity hardware.
It is highly fault-tolerant and is designed to be deployed on low-cost
hardware. It provides high throughput access to application data and is
suitable for applications having large datasets.
 Hadoop has two major layers namely −
 Processing/Computation layer (MapReduce), and
 Storage layer (Hadoop Distributed File System).
UNDERSTANDING HADOOP & ITS COMPONENTS
LAYERS OF HADOOP
 Apart from the above-mentioned two core components, Hadoop
framework also includes the following two modules −
 Hadoop Common − These are Java libraries and utilities
required by other Hadoop modules.
 Hadoop YARN − This is a framework for job scheduling and
cluster resource management.
UNDERSTANDING HADOOP & ITS COMPONENTS
ADVANTAGES OF HADOOP
 Hadoop framework allows the user to quickly write and test
distributed systems. It is efficient, and it automatically distributes
the data and work across the machines and, in turn, utilizes the
underlying parallelism of the CPU cores.
 Hadoop does not rely on hardware to provide fault-tolerance and
high availability (FTHA), rather Hadoop library itself has been
designed to detect and handle failures at the application layer.
 Servers can be added or removed from the cluster dynamically
and Hadoop continues to operate without interruption.
 Hadoop is open source and, being Java based, is compatible with
all platforms.
COMPONENTS OF HADOOP ECOSYSTEM
 HDFS: Hadoop Distributed File System
 MapReduce: Programming based Data Processing
 YARN: Yet Another Resource Negotiator
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
COMPONENTS OF HADOOP ECOSYSTEM
 Data Storage: HDFS (File System), HBase (Column DB Storage)
 Data Processing: MapReduce (Distributed Processing), YARN (Cluster &
Resource Management)
 Data Access: Hive (SQL), Pig (Dataflow), Mahout (Machine Learning),
Avro (RPC), Sqoop (RDBMS Connector)
 Data Management: Oozie (Workflow Monitoring), Chukwa (Monitoring),
Flume (Monitoring), ZooKeeper (Management)
DATA STORAGE COMPONENT OF HADOOP
HDFS
 Hadoop File System was developed using distributed file system
design.
 It is run on commodity hardware.
 Unlike other distributed systems, HDFS is highly fault-tolerant and
designed using low-cost hardware.
 HDFS holds a very large amount of data and provides easier access.
To store such huge data, the files are stored across multiple
machines.
 The files are stored in redundant fashion to rescue the system
from possible data losses in case of failure.
 HDFS also makes applications available for parallel processing.
DATA STORAGE COMPONENT OF HADOOP
HDFS - ARCHITECTURE
DATA STORAGE COMPONENT OF HADOOP
HDFS ARCHITECTURE
HDFS follows the master-slave architecture and it has the following
elements.
 Namenode : The namenode is commodity hardware that
contains the GNU/Linux operating system and the namenode
software. The system running the namenode acts as the master
server and performs the following tasks:
 Manages the file system namespace.
 Regulates client’s access to files.
 It also executes file system operations such as renaming,
closing, and opening files and directories.
DATA STORAGE COMPONENT OF HADOOP
HDFS ARCHITECTURE
 Datanode: The datanode is commodity hardware running the
GNU/Linux operating system and the datanode software. For every
node (commodity hardware/system) in a cluster, there will be a
datanode. These nodes manage the data storage of their system.
 Datanodes perform read-write operations on the file systems,
as per client request.
 They perform operations such as block creation, deletion, and
replication according to the instructions of the namenode.
 Block: The user data is stored in the files of HDFS. A file in the
file system is divided into one or more segments, which are
stored on individual datanodes. These file segments are called
blocks; in other words, a block is the minimum amount of data that
HDFS can read or write. The default block size is 64 MB in early
Hadoop releases (128 MB in Hadoop 2.x and later).
DATA STORAGE COMPONENT OF HADOOP
HBASE
 It’s a NoSQL database which supports all kinds of data and is thus
capable of handling any data stored in a Hadoop cluster.
 It provides the capabilities of Google’s BigTable, and is thus able to work
on Big Data sets effectively.
 At times when we need to search for or retrieve a small piece of data
in a huge database, the request must be processed within a short
span of time. At such times, HBase comes in handy, as it gives a
fault-tolerant way of storing and quickly retrieving such data.
DATA STORAGE COMPONENT OF HADOOP
HBASE- COMPONENTS
 HBase master: It is not part of the actual data storage, but it
manages load balancing activities across all Region Servers.
 It controls the failovers.
 Performs administration activities which provide an interface
for creating, updating and deleting tables.
 Handles DDL operations.
 It maintains and monitors the Hadoop cluster.
 Region server: It is a worker node which handles read, write, and delete
requests from clients. A region server runs on every node of the Hadoop
cluster, i.e. on the HDFS datanodes.
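A minimal sketch of interacting with HBase through its shell (assuming a running cluster; the 'users' table and the 'info' column family are hypothetical names):

    hbase shell
    create 'users', 'info'                      # table with one column family
    put 'users', 'row1', 'info:name', 'Sonal'   # write a single cell
    get 'users', 'row1'                         # read one row
    scan 'users'                                # scan the whole table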
DATA PROCESSING COMPONENT OF HADOOP
MAP REDUCE
 By making use of distributed and parallel algorithms,
MapReduce carries the processing logic over to the data
and helps to write applications which transform big data sets into
manageable ones.
 MapReduce makes use of two functions, i.e. Map() and
Reduce(), whose tasks are:
 Map() performs sorting and filtering of data, thereby
organizing it in the form of groups. Map generates a key-
value pair based result which is later processed by the
Reduce() method.
 Reduce(), as the name suggests, does the summarization by
aggregating the mapped data. In simple terms, Reduce() takes the
output generated by Map() as input and combines those tuples
into a smaller set of tuples.
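To make the Map()/Reduce() division concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (the WordCount class name and the input/output paths are illustrative, not part of the original slides):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map(): tokenizes each input line and emits a (word, 1) key-value pair per token.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);           // handed to the framework for grouping by key
          }
        }
      }

      // Reduce(): receives all values for one word and combines them into a single count.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);            // (word, total count)
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional map-side aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, such a job is typically launched with: hadoop jar wordcount.jar WordCount /input /output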
DATA PROCESSING COMPONENT OF HADOOP
MAP REDUCE- FEATURES
Features of MapReduce:
 Simplicity − jobs are easy to run
 Scalability − can process petabytes of data
 Speed − parallel processing improves speed
 Fault tolerance − takes care of failures
DATA PROCESSING COMPONENT OF HADOOP
YARN
 YARN (Yet Another Resource Negotiator), as the name implies, is the
component that helps to manage the resources across the clusters. It
performs scheduling and resource allocation for the Hadoop
system.
 It consists of three major components, i.e.
 Resource Manager
 Node Manager
 Application Manager
 The Resource Manager has the privilege of allocating resources for the
applications in the system, whereas Node Managers handle the
allocation of resources such as CPU, memory and bandwidth per
machine and later report back to the Resource Manager. The
Application Manager works as an interface between the Resource
Manager and Node Managers and performs negotiations as per the
requirement of the two.
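A minimal sketch of how an administrator can observe this from the standard yarn command-line client (assuming a running cluster; the application id is a placeholder):

    yarn node -list                              # Node Managers registered with the Resource Manager
    yarn application -list                       # applications currently tracked by the Resource Manager
    yarn application -status <application_id>    # details of a single application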
DATA PROCESSING COMPONENT OF HADOOP
YARN- KEY BENEFITS
Key benefits of YARN:
 Improved cluster utilization
 Highly scalable
 Beyond Java
 Novel programming models & services
 Agility
DATA ACCESS COMPONENT OF HADOOP
HIVE
 With the help of an SQL-like methodology and interface, HIVE performs
reading and writing of large data sets. Its query language is called
HQL (Hive Query Language).
 It is highly scalable, as it allows both real-time processing and batch
processing. Also, all the SQL datatypes are supported by
Hive, thus making query processing easier.
 HIVE comes with two components: JDBC Drivers and the HIVE
Command Line.
 JDBC, along with ODBC drivers, works on establishing the data
storage permissions and connection, whereas the HIVE command line
helps in the processing of queries.
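A minimal HQL sketch of this SQL-like interface (the table name, columns and file path are hypothetical):

    -- define a table over delimited files and load data into it
    CREATE TABLE page_views (user_id STRING, url STRING, view_time STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    LOAD DATA INPATH '/data/page_views.csv' INTO TABLE page_views;

    -- the ten most-viewed URLs; Hive compiles this into distributed jobs behind the scenes
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;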
DATA ACCESS COMPONENT OF HADOOP
HIVE
DATA ACCESS COMPONENT OF HADOOP
PIG
 Pig was developed by Yahoo. It works on the Pig Latin language,
which is a query-based language similar to SQL.
 It is a platform for structuring the data flow, and for processing and
analyzing huge data sets.
 Pig does the work of executing commands, and in the background
all the activities of MapReduce are taken care of. After the
processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and
runs on the Pig Runtime, just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and
hence is a major segment of the Hadoop Ecosystem.
DATA ACCESS COMPONENT OF HADOOP
PIG
(Diagram: a Pig script, written in Pig Latin to express data flows and registering
UDFs from the local file system, is compiled by the Pig Latin compiler in the Pig
execution environment; it produces MapReduce jobs that run internally (or run
locally in a single JVM), reading input files from HDFS and storing output files
back in HDFS.)
DATA ACCESS COMPONENT OF HADOOP
MAHOUT
 Mahout brings machine learnability to a system or application.
 Machine learning helps a system to develop itself based on
patterns, user/environmental interaction, or on the basis of
algorithms.
 It provides various libraries or functionalities such as
collaborative filtering, clustering, and classification which are
nothing but concepts of Machine learning.
 It allows invoking algorithms as per our need with the help of its
own libraries.
DATA ACCESS COMPONENT OF HADOOP
AVRO
 Apache Avro works as a data serialization system. It helps Hadoop
in data serialization and data exchange.
 Avro enables the exchange of big data between programs written in
different languages. It serializes data into files or messages.
 Avro Schema: A schema helps Avro in the serialization and
deserialization process without code generation. Avro needs a
schema to read and write data.
 Dynamic typing: it means serializing and deserializing data
without generating any code. Code generation remains available as an
optional optimization, mainly worthwhile for statically typed
languages.
DATA ACCESS COMPONENT OF HADOOP
SQOOP
 Sqoop works as a front-end loader of Big Data.
 Sqoop is a front-end interface that enables moving bulk data
between Hadoop and relational databases or variously structured
data marts.
 Sqoop replaces hand-written scripts for importing and exporting
data. It mainly helps in moving data from an enterprise database
to the Hadoop cluster to perform the ETL process.
 Sqoop fulfills the growing need to transfer data from the
mainframe to HDFS.
 Sqoop helps in achieving improved compression and light-weight
indexing for advanced query performance.
DATA ACCESS COMPONENT OF HADOOP
SQOOP
 It transfers data in parallel for effective performance and optimal
system utilization.
 Sqoop creates fast data copies from an external source into
Hadoop.
 It acts as a load balancer by offloading extra storage and
processing loads to other devices.
DATA ACCESS COMPONENT OF HADOOP
SQOOP: AS AN ETL
(Diagram: Sqoop sits between an RDBMS (MySQL, Oracle, etc.) and the Hadoop
file system (HDFS, Hive, etc.); Import moves data from the RDBMS into Hadoop,
Export moves it back out.)
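A minimal sketch of both directions from the Sqoop command line (the connection string, credentials, table names and directories are all hypothetical):

    # import a relational table into HDFS, using 4 parallel map tasks
    sqoop import \
        --connect jdbc:mysql://dbhost/sales \
        --username etl_user -P \
        --table orders \
        --target-dir /data/orders \
        --num-mappers 4

    # export processed results from HDFS back into the relational database
    sqoop export \
        --connect jdbc:mysql://dbhost/sales \
        --username etl_user -P \
        --table order_summary \
        --export-dir /output/order_summary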
DATA MANAGEMENT COMPONENT OF HADOOP
OOZIE
 Apache Oozie is a tool in which all sorts of programs can be
pipelined in a required manner to work in Hadoop's distributed
environment.
 Oozie works as a scheduler system to run and manage Hadoop
jobs.
 Oozie allows combining multiple complex jobs to be run in a
sequential order to achieve the desired output.
 It is strongly integrated with the Hadoop stack, supporting various jobs
like Pig, Hive and Sqoop, as well as system-specific jobs like Java and
shell.
 Oozie is an open source Java web application.
DATA MANAGEMENT COMPONENT OF HADOOP
OOZIE
Oozie supports two kinds of jobs:
 Oozie workflow: It is a collection of actions arranged to perform
the jobs one after another. It is just like a relay race, where each
runner starts right after the previous one finishes, to complete the race.
 Oozie Coordinator: It runs workflow jobs based on the
availability of data and predefined schedules.
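A minimal sketch of a workflow definition (Oozie workflows are written in XML; the action names, connection string, paths and the ${jobTracker}/${nameNode} parameters are illustrative):

    <workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
      <start to="import-orders"/>
      <action name="import-orders">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <command>import --connect jdbc:mysql://dbhost/sales --table orders --target-dir /data/orders</command>
        </sqoop>
        <ok to="aggregate"/>
        <error to="fail"/>
      </action>
      <action name="aggregate">
        <hive xmlns="uri:oozie:hive-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>aggregate.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>ETL pipeline failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>

A coordinator definition would then point at this workflow and trigger it on a schedule or when the input data becomes available.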
DATA MANAGEMENT COMPONENT OF HADOOP
FLUME
 Apache Flume is a tool/service/data-ingestion mechanism for
collecting, aggregating and transporting large amounts of
streaming data, such as log files and events, from various
sources to a centralized data store.
 Flume is a highly reliable, distributed, and configurable tool. It is
principally designed to copy streaming data (log data) from
various web servers to HDFS.
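A minimal sketch of a Flume agent configuration for exactly that pattern (the agent, source, channel and sink names, and the log path, are hypothetical):

    # agent1 tails a web-server log and writes the events to HDFS
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/httpd/access_log
    agent1.sources.src1.channels = ch1

    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = /flume/weblogs/%Y-%m-%d
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.channel = ch1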
DATA MANAGEMENT COMPONENT OF HADOOP
ZOOKEEPER
 Apache Zookeeper is an open source project designed to
coordinate multiple services in the Hadoop ecosystem.
 Zookeeper performs tasks like synchronization, inter-component
communication, grouping, and maintenance.
 Features of Zookeeper:
 Zookeeper acts fast enough with workloads where reads to
data are more common than writes.
 Zookeeper maintains a record of all transactions.
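A minimal sketch of working with ZooKeeper's hierarchical znodes through its command-line shell (the server address and znode paths are hypothetical):

    zkCli.sh -server zk1:2181
    ls /                        # list the root znodes
    create /app ""              # create a parent znode
    create /app/config "v1"     # store a small piece of coordination data
    get /app/config             # read it back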
OTHER IMPORTANT COMPONENTS OF HADOOP
 Solr, Lucene: These are two services that perform the tasks of
searching and indexing with the help of Java libraries.
Lucene is a Java library that also provides a spell-check
mechanism; Solr is built on top of Lucene.
 Spark
 It’s a platform that handles all the processing-intensive tasks
like batch processing, interactive or iterative real-time
processing, graph conversions, and visualization, etc.
 It uses in-memory resources, and is hence faster than
MapReduce in terms of optimization.
 Spark is best suited for real-time data whereas Hadoop is best
suited for structured data or batch processing, hence both are
used interchangeably in most companies.
BIG DATA & HADOOP SECURITY
 Knox provides a framework for managing security and supports
security implementations on Hadoop clusters.
 Knox is a REST API gateway developed within the Apache
community to support monitoring, authorization management,
auditing, and policy enforcement on Hadoop clusters.
 Knox provides a single access point for all REST interactions with
clusters.
 Through Knox, system administrators can manage authentication
via LDAP and Active Directory, conduct HTTP header-based
federated identity management, and audit hardware on the
clusters.
 Knox supports enhanced security because it can integrate with
enterprise identity management solutions and is Kerberos
compatible.
HADOOP SECURITY MANAGEMENT TOOL: KNOX
BIG DATA & HADOOP SECURITY
 Ranger provides a centralized framework that can be used to
manage policies at the resource level, such as files, folders,
databases, and even specific rows and columns within
databases.
 Ranger helps administrators implement access policies by group,
data type, etc.
 Ranger has different authorization functionality for different
Hadoop components such as YARN, HBase, Hive, etc.
HADOOP SECURITY MANAGEMENT TOOL: RANGER
BIG DATA & HADOOP SECURITY
 In core Hadoop technology, HDFS has directories called
encryption zones. When data is written to Hadoop it is
automatically encrypted (with a user-selected algorithm) and
assigned to an encryption zone.
 Encryption is file specific, not zone specific. That means each file
within the zone is encrypted with its own unique data encryption
key (DEK).
 Clients reading or writing data in HDFS receive an encrypted data
encryption key (EDEK), have it decrypted into a DEK, and then use
the DEK to read and write data.
 Encryption zones and DEK encryption occur between the file
system and database levels of the architecture.
HADOOP ENCRYPTION
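A minimal sketch of setting up an encryption zone from the command line (assuming a Hadoop KMS is configured; the key name and path are hypothetical):

    # create a key in the key management server and an encryption zone backed by it
    hadoop key create logskey
    hdfs dfs -mkdir -p /secure/logs
    hdfs crypto -createZone -keyName logskey -path /secure/logs
    hdfs crypto -listZones

    # files written under the zone are now transparently encrypted with per-file DEKs
    hdfs dfs -put access_log /secure/logs/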