Presentation on
Big Data/Hadoop
Submitted to:
Department of CSE
AITS, Udaipur
Submitted by:
1. Laxmi Rauth
2. Anand Mohan
B.Tech (4th year)
Big Data
"Big Data” is a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing
applications.
In simple terms, "Big Data" consists of very large volumes of
heterogeneous data that is being generated, often, at high
speeds.
Big Data requires the use of a new set of tools, applications
and frameworks to process and manage the data.
Characteristics of Big Data:
The characteristics of Big Data are popularly known as the three V's of Big Data.
Volume: the sheer size of the data being generated and stored is referred to as Volume in the Big Data world.
Velocity: the speed at which data is generated is referred to as Velocity in the Big Data world.
Variety: the range of varied data formats is referred to as Variety in the Big Data world.
Sources of Big Data can be
broadly classified into six
different categories:
1. Enterprise Data
2. Transactional Data
3. Social Media
4. Activity Generated
5. Public Data
6. Archives
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. It manages data processing and storage for big data applications running on clustered systems.
History of Hadoop
The history of Hadoop started in 2002 with the Apache Nutch project. Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library, together with Mike Cafarella.
According to Hadoop's creator Doug Cutting, "The name Hadoop was given by my kid to a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere."
Characteristics of Hadoop
Hadoop provides reliable shared storage (HDFS) and an analysis system (MapReduce).
Hadoop is highly scalable: because it scales linearly, a Hadoop cluster can contain tens, hundreds, or even thousands of servers.
Hadoop is highly flexible and can process both structured and unstructured data.
Hadoop has built-in fault tolerance.
Hadoop works on the principle of write once, read multiple times.
Hadoop is optimized for large and very large data sets.
Hadoop is very cost effective, as it can work with commodity hardware and does not require expensive high-end hardware.
Hadoop works in a master-worker / master-slave
fashion.
Hadoop has two core components: HDFS and
MapReduce.
HDFS (Hadoop Distributed File System) offers highly reliable, distributed storage, ensuring reliability by storing the data across multiple nodes.
MapReduce offers an analysis system that can perform complex computations on large datasets. This component is responsible for performing all the computations and works by breaking down a large, complex computation into multiple tasks and assigning those to individual worker/slave nodes.
The master contains the Namenode and Job Tracker
components.
Namenode holds the information about all the other
nodes in the Hadoop Cluster.
Job Tracker keeps track of the individual tasks/jobs
assigned to each of the nodes and coordinates the
exchange of information and results.
Each worker/slave node contains Task Tracker and Datanode components.
Task Tracker is responsible for running the task /
computation assigned to it.
Datanode is responsible for holding the data.
Hadoop Distributions
Cloudera was the first company
to be formed to build enterprise
solutions based on Hadoop.
Cloudera has a Hadoop
distribution known as
Cloudera's Distribution for
Hadoop (CDH).
MapR is another major
distribution available in the
market.
MapR is available in the cloud
through some of the leading
cloud providers: Amazon Web
Services (AWS), Google
Compute Engine, CenturyLink
Technology Solutions, and
OpenStack.
Amazon Web Services (AWS)
Elastic MapReduce (EMR) was
among the first Hadoop
offerings available in the market.
Azure HDInsight is Microsoft's
distribution of Hadoop.
Hortonworks has a Hadoop
distribution known as
Hortonworks Data Platform
(HDP).
The major Hadoop distributions: Cloudera, Hortonworks, Amazon Elastic MapReduce (EMR), MapR, and Azure HDInsight.
Hadoop Ecosystem
1. Apache Pig is a software framework which offers a run-time
environment for execution of MapReduce jobs on a Hadoop Cluster via
a high-level scripting language called Pig Latin.
2. Apache Hive is a data warehouse framework that facilitates the querying and
management of large datasets residing in a distributed store/file system
like the Hadoop Distributed File System (HDFS); a small query sketch appears at the end of this section.
3. Apache Mahout is a scalable machine learning and data mining library.
4. Apache HBase is a distributed, versioned, column-oriented, scalable
big data store on top of Hadoop/HDFS.
5. Apache Sqoop is a tool designed for efficiently transferring the data
between Hadoop and Relational Databases (RDBMS).
6. Apache Oozie is a job workflow scheduling and coordination manager
for managing the jobs executed on Hadoop.
7. Apache ZooKeeper is an open source coordination service for
distributed applications.
8. Apache Ambari is an open source software framework for provisioning,
managing, and monitoring Hadoop clusters.
HDFS and MapReduce are the two core components of the Hadoop ecosystem and are at the heart of the Hadoop framework.
The other Apache projects listed above are built around the Hadoop framework and are also part of the Hadoop ecosystem.
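To illustrate the kind of querying Apache Hive enables (point 2 above), here is a minimal, hedged Java sketch that runs a HiveQL query through the Hive JDBC driver. It assumes a HiveServer2 instance listening on localhost:10000 and the hive-jdbc dependency on the classpath; the table name sales is purely hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a HiveServer2 instance on the default port 10000
        // and the hive-jdbc driver on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            // "sales" is a hypothetical table used only for illustration.
            try (ResultSet rs = stmt.executeQuery(
                         "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}

Hive translates such queries into jobs that run over data stored in HDFS, which is why it is described above as a data warehouse layer on top of Hadoop.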
Hadoop Core Components
YARN
YARN (Yet Another Resource Negotiator) is a resource manager that knows how to allocate distributed compute resources to the various applications running on a Hadoop cluster.
MapReduce
MapReduce is a framework that enables running MapReduce jobs on the Hadoop cluster powered by YARN. It provides a high-level API for implementing custom map and reduce functions in various languages, as well as the code infrastructure needed to submit, run, and monitor MapReduce jobs.
HDFS
HDFS (Hadoop Distributed File System) is designed for storing large files, of the magnitude of hundreds of megabytes or gigabytes, and provides high-throughput streaming data access to them.
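To make the HDFS description concrete, below is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem Java API. The NameNode address hdfs://localhost:9000 matches the core-site.xml used later in the installation; the file path /demo/sample.txt is just an illustrative name.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode configured in core-site.xml.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS is optimized for large, write-once files,
        // but the API is the same.
        Path file = new Path("/demo/sample.txt");   // illustrative path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Stream the file back.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}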
Hadoop Versions
2.7.x: 2.7.7, released 31 May 2018
2.8.x: 2.8.5, released 15 September 2018
2.9.x: 2.9.2, released 9 November 2018
3.1.x: 3.1.2, released 6 February 2019
3.2.x: 3.2.0, released 16 January 2019
Hadoop 2.8.0 installation
1. Download Hadoop and Java
2. Install Java
3. Extract the Hadoop file
4. Set environment variables
5. Set path
6. Edit configuration files
7. Replace the bin file
8. Format the name node
9. Testing
Download Hadoop 2.8.0 (Link: http://www-eu.apache.org/dist/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz OR http://archive.apache.org/dist/hadoop/core//hadoop-2.8.0/hadoop-2.8.0.tar.gz).
Install Java JDK 1.8.0 under "C:\Java" (Link: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html).
Use "javac -version" to check the version of Java installed on your system.
Extract the file hadoop-2.8.0.tar.gz or hadoop-2.8.0.zip and place it under "C:\Hadoop-2.8.0".
Set the HADOOP_HOME environment variable:
Environment Variables -> New ->
Variable name: HADOOP_HOME
Variable value: C:\hadoop-2.8.0\bin
-> OK
Set the JAVA_HOME environment variable:
Environment Variables -> New ->
Variable name: JAVA_HOME
Variable value: C:\java\bin
-> OK
Add the Hadoop bin directory path and the Java bin directory path to the system Path:
Environment Variables -> System variables ->
Path -> Edit -> New -> C:\hadoop-2.8.0\bin -> New
-> C:\java\bin -> OK
Edit the configuration files listed below (all under C:/Hadoop-2.8.0/etc/hadoop/), paste the XML shown for each file, and save the files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and hadoop-env.cmd.
file C:/Hadoop-2.8.0/etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
file C:/Hadoop-2.8.0/etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
</configuration>
file C:/Hadoop-2.8.0/etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Create folder "data" under "C:\Hadoop-2.8.0".
Create folder "datanode" under "C:\Hadoop-2.8.0\data".
Create folder "namenode" under "C:\Hadoop-2.8.0\data".
file C:/Hadoop-2.8.0/etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Edit the file C:/Hadoop-2.8.0/etc/hadoop/hadoop-env.cmd by replacing the line set "JAVA_HOME=%JAVA_HOME%" with set "JAVA_HOME=C:\Java" (where C:\Java is the path to your JDK 1.8.0 installation).
Replace the bin file
Download the file Hadoop Configuration.zip (Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/Hadoop%20Configuration.zip).
Delete the bin folder at C:\Hadoop-2.8.0\bin and replace it with the bin folder from the downloaded Hadoop Configuration.zip.
Format the name node
Open cmd and type the command "hdfs namenode -format".
Testing
Open cmd, change directory to "C:\Hadoop-2.8.0\sbin", and type "start-all.cmd" to start the Hadoop daemons.
Make sure these apps are running:
1. Hadoop NameNode
2. Hadoop DataNode
3. YARN ResourceManager
4. YARN NodeManager
Open: http://localhost:50070 (NameNode web UI)
Open: http://localhost:8088 (YARN ResourceManager web UI)
MAPREDUCE
MapReduce is a processing technique and a programming model for distributed computing based on Java. It is a framework with which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The reduce task takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
The major advantage of
MapReduce is that it is easy to
scale data processing over
multiple computing nodes.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Stages of a MapReduce program
Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
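To tie the map and reduce stages together, here is the classic word-count example written against the Hadoop MapReduce Java API. It is a minimal sketch: the input and output HDFS directories are taken from the command line, and the class names follow the conventional standard example.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is split into words, and each word is
    // emitted as a (word, 1) key/value pair.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: after the shuffle groups the pairs by word, the counts
    // for each word are summed into a single (word, total) pair.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job could be launched with something like hadoop jar wordcount.jar WordCount /input /output, after which the per-word counts appear as files in the /output directory in HDFS.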
Thank you