A
STUDY
on
HADOOP FRAMEWORK
towards partial fulfillment of the requirement for the
award of degree of
Master of Computer Applications
from
SIR CHHOTU RAM INSTITUTE OF ENGINEERING & TECHNOLOGY
CHAUDHARY CHARAN SINGH UNIVERSITY, MEERUT
Academic Session 2018 – 21
Submitted by: JONY KUMAR
Roll No.: 100180702
MCA 3rd year (6th semester)
Under the Guidance of: Mr. VIKASH JAIN
ACKNOWLEDGEMENT
The achievement associated with the successful completion of this colloquium
would be incomplete without mentioning the people whose endless cooperation made it
possible. I would like to convey my regards to our college, SIR CHHOTU RAM INSTITUTE
OF TECHNOLOGY, MEERUT, and our respected Head of DCA for giving us such a nice
opportunity to enhance our skills in this domain. I take this opportunity to express our
deep gratitude towards our colloquium supervisor, MR. VIKASH JAIN, for giving us such
valuable suggestions, guidance, and encouragement during the development of this
project work. Last but not the least, we are grateful to all the faculty members of SIR
CHHOTU RAM INSTITUTE OF TECHNOLOGY, MEERUT, for their support.
(JONY KUMAR)
ABSTRACT
My topic is Hadoop, a cluster computing framework. Apache Hadoop is a
software framework that supports data-intensive distributed applications under a free
license. Hadoop was inspired by Google's MapReduce and Google File System (GFS)
papers. Hadoop, however, was designed to solve a different problem: the fast, reliable
analysis of both structured data and complex data. As a result, many enterprises deploy
Hadoop alongside their legacy IT systems, which allows them to combine old data and
new data sets in powerful new ways. The Hadoop framework is used by major players
including Google, Yahoo and IBM, largely for applications involving search engines and
advertising. I am going to present the history, development, and current situation of
this technology. The technology is now maintained under the Apache Software Foundation,
and commercial distributions are available from vendors such as Cloudera.
Contents:
Chapter 1: Introduction to HADOOP
Chapter 2: History of HADOOP
Top 10 Features of Big Data Hadoop
Chapter 3: Key Technology
3.1 MapReduce
3.2 Programming Model
3.3 Map
3.4 Reduce
3.5 HDFS
3.6 HDFS Concepts
3.7 Hadoop Archives
Chapter 4: Other Projects on HADOOP
4.1 Avro
4.2 Chukwa
4.3 HBase
4.4 Hive
4.5 Pig
4.6 ZooKeeper
Chapter 5: Applications of Hadoop
Chapter 6: Conclusion
Chapter 7: References
1. INTRODUCTION
Today, we’re surrounded by data. People upload videos, take pictures on their cell phones, text friends,
update their Facebook status, leave comments around the web, click on ads, and so forth. Machines,
too, are generating and keeping more and more data. The exponential growth of data first presented
challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft. They needed to
go through terabytes and petabytes of data to figure out which websites were popular, what books
were in demand, and what kinds of ads appealed to people. Existing tools were becoming inadequate
to process such large data sets. Google was the first to publicize MapReduce—a system they had used
to scale their data processing needs. This system aroused a lot of interest because many other
businesses were facing similar scaling challenges, and it wasn’t feasible for everyone to reinvent their
own proprietary tool. Doug Cutting saw an opportunity and led the charge to develop an open source
version of this MapReduce system called Hadoop. Soon after, Yahoo and others rallied around to
support this effort. Today, Hadoop is a core part of the computing infrastructure for many web
companies, such as Yahoo, Facebook, LinkedIn, and Twitter. Hadoop is an open source framework for
writing and running distributed applications that process large amounts of data. Distributed computing
is a wide and varied field, but the key distinctions of Hadoop are that it is
■ Accessible—Hadoop runs on large clusters of commodity machines or on cloud
computing services such as Amazon’s Elastic Compute Cloud (EC2).
■ Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the
assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
■ Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the
cluster.
■ Simple—Hadoop allows users to quickly write efficient parallel code.
Hadoop’s accessibility and simplicity give it an edge over writing and running large
distributed programs. Even college students can quickly and cheaply create their own Hadoop
cluster. On the other hand, its robustness and scalability make it suitable for even the most
demanding jobs at Yahoo and Facebook. These features make Hadoop popular in both
academia and industry.
2. History of HADOOP
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search
library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the
Lucene project.
2.1 The Origin of the Name "Hadoop":
The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting,
explains how the name came about: The name my kid gave a stuffed yellow elephant. Short, relatively
easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids
are good at generating such. Googol is a kid’s term. Subprojects and “contrib” modules in Hadoop also
tend to have names that are unrelated to their function, often with an elephant or other animal theme
(“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane)
names. This is a good principle, as it means you can generally work out what something does from its
name. For example, the jobtracker keeps track of MapReduce jobs. Building a web search engine from
scratch was an ambitious goal, for not only is the software required to crawl and index websites
complex to write, but it is also a challenge to run without a dedicated operations team, since there are
so many moving parts. It’s expensive too: Mike Cafarella and Doug Cutting estimated a system
supporting a 1-billion-page index would cost around half a million dollars in hardware, with a monthly
running cost of $30,000. Nevertheless, they believed it was a worthy goal, as it would open up and
ultimately democratize search engine algorithms. Nutch was started in 2002, and a working crawler and
search system quickly emerged. However, they realized that their architecture wouldn’t scale to the
billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described
the architecture of Google’s distributed file system, called GFS, which was being used in production at
Google. GFS, or something like it, would solve their storage needs for the very large files generated as
a part of the web crawl and indexing process. In particular, GFS would free up time being spent on
administrative tasks such as managing storage nodes. In 2004, they set about writing an open source
implementation, the Nutch Distributed File System (NDFS).
In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the Nutch
developers had a working MapReduce implementation in Nutch, and by the middle of that year all the
major Nutch algorithms had been ported to run using MapReduce and NDFS. NDFS and the MapReduce
implementation in Nutch were applicable beyond the realm of search, and in February 2006 they
moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same
time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop
into a system that ran at web scale (see sidebar). This was demonstrated in February 2008 when Yahoo!
announced that its production search index was being generated by a 10,000-core Hadoop cluster.
In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its
diverse, active community. By this time Hadoop was being used by many other companies besides
Yahoo!, such as Last.fm, Facebook, and the New York Times.
Top 10 Features of Big Data Hadoop
a. Open source
Hadoop is an open source, Java-based programming framework. Open source means it is freely available,
and its source code can even be changed as per our requirements.
b. Fault Tolerance
Hadoop handles faults through replica creation. When a client stores a file in HDFS, the Hadoop
framework divides the file into blocks, and the data blocks are distributed across different machines
in the HDFS cluster.
A replica of each block is then created on other machines in the cluster; by default, HDFS keeps
3 copies of every block. If any machine in the cluster goes down or fails due to unfavorable conditions,
the user can still easily access that data from the other machines.
c. Distributed Processing
Hadoop stores huge amounts of data in a distributed manner in HDFS and processes the data in parallel
on a cluster of nodes.
d. Scalability
Hadoop is an extremely scalable platform: new nodes can be added easily without any downtime. Hadoop
provides horizontal scalability, so new nodes can be added to the system on the fly. Apache Hadoop
applications can run on clusters of thousands of nodes.
e. Reliability
Because data is replicated, it is stored reliably on the cluster of machines despite machine failures.
Even if some nodes fail, the data remains safely stored.
f. High Availability
Due to multiple copies of the data, it remains highly available and accessible despite hardware
failures: if a machine goes down, the data can be retrieved from another replica.
g. Economic
Hadoop is not very expensive, as it runs on a cluster of commodity hardware. Because low-cost
commodity hardware is used, scaling out a Hadoop cluster does not require spending a huge amount
of money.
h. Flexibility
Hadoop is very flexible in its ability to deal with all kinds of data, whether structured,
semi-structured, or unstructured.
i. Easy to use
The client does not need to deal with distributed computing itself; the framework takes care of all
of it, so Hadoop is easy to use.
j. Data locality
Data locality refers to moving the computation close to where the actual data resides on a node,
instead of moving the data to the computation. This minimizes network congestion and increases the
overall throughput of the system.
3. Key Technology
The key technologies behind Hadoop are the MapReduce programming model and the Hadoop Distributed File
System (HDFS). Operating on very large data sets is not practical in a serial programming paradigm;
MapReduce performs tasks in parallel to accomplish the work in less time, which is the main aim of this
technology. MapReduce requires a special file system, because in real scenarios the data is on the order
of petabytes. To store and maintain this much data on distributed commodity hardware, the Hadoop
Distributed File System was developed. It is basically inspired by the Google File System.
3.1 MapReduce
MapReduce is a framework for processing highly distributable problems across huge datasets using a
large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same
hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data
stored either in a filesystem (unstructured) or in a database (structured).
Figure: MapReduce Programming Model
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and
distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree
structure. The worker node processes the smaller problem, and passes the answer back to its master
node. "Reduce" step: The master node then collects the answers to all the sub-problems and combines
them in some way to form the output – the answer to the problem it was originally trying to solve.
MapReduce allows for distributed processing of the map and reduction operations. Provided each
mapping operation is independent of the others, all maps can be performed in parallel – though in
practice it is limited by the number of independent data sources and/or the number of CPUs near each
source. Similarly, a set of 'reducers' can perform the reduction phase - provided all outputs of the map
operation that share the same key are presented to the same reducer at the same time. While this
process can often appear inefficient compared to algorithms that are more sequential, MapReduce can
be applied to significantly larger datasets than "commodity" servers can handle – a large server farm
can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some
possibility of recovering from partial failure of servers or storage during the operation: if one mapper or
reducer fails, the work can be rescheduled – assuming the input data is still available.
MapReduce is a programming model and an associated implementation for processing and
generating large data sets. Users specify a map function that processes a key/value pair to generate a
set of intermediate key/value pairs, and a reduce function that merges all intermediate values
associated with the same intermediate key. Many real world tasks are expressible in this model.
3.2 PROGRAMMING MODEL
The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The
user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map,
written by the user, takes an input pair and produces a set of intermediate key/value pairs. The
MapReduce library groups together all intermediate values associated with the same intermediate key I
and passes them to the Reduce function. The Reduce function, also written by the user, accepts an
intermediate key I and a set of values for that key. It merges together these values to form a possibly
smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The
intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle
lists of values that are too large to fit in memory.
3.3 MAP
map (in_key, in_value) -> (out_key, intermediate_value) list
Example: Upper-case Mapper
let map(k, v) = emit(k.toUpper(), v.toUpper())
(“foo”, “bar”) --> (“FOO”, “BAR”)
(“Foo”, “other”) -->(“FOO”, “OTHER”)
(“key2”, “data”) --> (“KEY2”, “DATA”)
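In Hadoop's Java API, the same upper-case mapper could be written roughly as below. This is a minimal illustrative sketch, assuming an input format (such as KeyValueTextInputFormat) that delivers (Text, Text) pairs; it is not code taken from any particular project.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical Java version of the upper-case mapper shown above.
// Assumes the input format delivers (Text, Text) key/value pairs.
public class UpperCaseMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the same pair with both key and value upper-cased.
        context.write(new Text(key.toString().toUpperCase()),
                      new Text(value.toString().toUpperCase()));
    }
}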
3.4 REDUCE
reduce (out_key, intermediate_value list) -> out_value list
Example: Sum Reducer
let reduce(k, vals)
sum = 0
foreach int v in vals:
sum += v
emit(k, sum)
(“A”, [42, 100, 312]) --> (“A”, 454)
(“B”, [12, 6, -2]) --> (“B”, 16)
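A corresponding Java sketch of the sum reducer, using Hadoop's org.apache.hadoop.mapreduce API; the IntWritable value type is an assumption made for this illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical Java version of the sum reducer shown above.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // The framework supplies all values for one key through this iterator,
        // so lists that do not fit in memory can still be summed.
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}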
Hadoop Map-Reduce is a software framework for easily writing applications which process vast
amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
A Map-Reduce job usually splits the input data-set into independent chunks which are processed by the
map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are
then input to the reduce tasks. Typically both the input and the output of the job are stored in a
file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed
tasks.
A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data,
the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks,
of which there are two types: map tasks and reduce tasks. There are two types of nodes that control
the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the
jobs run on the system by scheduling tasks to run on tasktrackers.
Figure: Hadoop MapReduce
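To make the preceding description concrete, the sketch below shows a hypothetical driver program that configures a job and submits it to the cluster, reusing the SumReducer sketched in section 3.4 and defining a small tokenizing mapper inline. Class names and paths are illustrative assumptions, not part of Hadoop itself.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumJobDriver {

    // Minimal mapper for illustration: emits (token, 1) for every whitespace-separated token.
    public static class TokenCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(new Text(token), ONE);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "token count");
        job.setJarByClass(SumJobDriver.class);
        job.setMapperClass(TokenCountMapper.class);
        job.setReducerClass(SumReducer.class);      // the reducer sketched in section 3.4
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
        // The framework schedules map and reduce tasks across the cluster and re-runs failed ones.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}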
3.5 HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity
hardware. It has many similarities with existing distributed file systems. However, the differences from
other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be
deployed on low-cost hardware. HDFS provides high throughput access to application data and is
suitable for applications that have large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was
originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an
Apache Hadoop subproject.
Figure: HDFS Architecture
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server
that manages the file system namespace and regulates access to files by clients. In addition, there are a
number of DataNodes, usually one per node in the cluster, which manage storage attached to the
nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in
files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.
The NameNode executes file system namespace operations like opening, closing, and renaming files
and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible
for serving read and write requests from the file system’s clients. The DataNodes also perform block
creation, deletion, and replication upon instruction from the NameNode.
The NameNode and DataNode are pieces of software designed to run on commodity machines. These
machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any
machine that supports Java can run the NameNode or the DataNode software. Usage of the highly
portable Java language means that HDFS can be deployed on a wide range of machines. A typical
deployment has a dedicated machine that runs only the NameNode software. Each of the other
machines in the cluster runs one instance of the DataNode software. The architecture does not
preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the
case.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The
NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way
that user data never flows through the NameNode.
Filesystems that manage the storage across a network of machines are called distributed file systems.
Since they are network-based, all the complications of network programming kick in, thus making
distributed file systems more complex than regular disk filesystems. For example, one of the biggest
challenges is making the filesystem tolerate node failure without suffering data loss. Hadoop comes
with a distributed file system called HDFS, which stands for Hadoop Distributed File System.
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large
amounts of data (terabytes or even petabytes), and provide high throughput access to this information.
Files are stored in a redundant fashion across multiple machines to ensure their durability in the face
of failures and high availability to highly parallel applications.
3.6 HDFS CONCEPTS:
a) Blocks
A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a
single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block
size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. This
is generally transparent to the filesystem user who is simply reading or writing a file—of whatever
length. However, there are tools to do with filesystem maintenance, such as df and fsck, that operate on
the filesystem block level. HDFS too has the concept of a block, but it is a much larger unit—64 MB by
default. Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are
stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a
single block does not occupy a full block’s worth of underlying storage. When unqualified, the term
"block" in this report refers to a block in HDFS.
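For illustration, the block layout of a stored file can be inspected through the FileSystem API. The sketch below is a hypothetical example (the file path is made up), not part of any required workflow.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoExample {
    public static void main(String[] args) throws Exception {
        // Connection details (fs.defaultFS) are expected to come from the cluster configuration.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/large-file.log");  // hypothetical HDFS path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize() + " bytes");

        // Each BlockLocation describes one block: its offset, length and the hosts storing replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block);
        }
    }
}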
b) Namenodes and Datanodes
An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master)
and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains
the filesystem tree and the metadata for all the files and directories in the tree. This information is
stored persistently on the local disk in the form of two files: the namespace image and the edit log. The
namenode also knows the datanodes on which all the blocks for a given file are located, however, it
does not store block locations persistently, since this information is reconstructed from datanodes
when the system starts. A client accesses the filesystem on behalf of the user by communicating with
the namenode and datanodes.
Figure: HDFS Architecture
The client presents a POSIX-like file system interface, so the user code does not need to know about the
namenode and datanode to function. Datanodes are the workhorses of the file system. They store and
retrieve blocks when they are told to (by clients or the namenode), and they report back to the
namenode periodically with lists of blocks that they are storing. Without the namenode, the filesystem
cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the
filesystem would be lost since there would be no way of knowing how to reconstruct the files from the
blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure.
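As a rough illustration of that client interface, the sketch below writes and then reads a small file through the Java FileSystem API; the namenode address and file path are hypothetical placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally taken from core-site.xml; hard-coded here only for illustration.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");  // hypothetical path
        // Write: the client asks the namenode where to place blocks and streams data to datanodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
        }
        // Read: blocks are fetched from the datanodes that hold them.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}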
c) The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create directories
and store files inside these directories. The file system namespace hierarchy is similar to most other
existing file systems; one can create and remove files, move a file from one directory to another, or
rename a file. HDFS does not yet implement user quotas or access permissions. HDFS does not support
hard links or soft links. However, the HDFS architecture does not preclude implementing these features.
d) Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as
a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are
replicated for fault tolerance. The block size and replication factor are configurable per file. An
application can specify the number of replicas of a file. The replication factor can be specified at file
creation time and can be changed later.
Figure: Data Replication
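A hedged sketch of how the replication settings described above can be controlled from client code; dfs.replication is the standard HDFS configuration property, while the file path and factor values here are just examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created by this client (usually set in hdfs-site.xml).
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        // The replication factor of an existing file can also be changed later.
        Path file = new Path("/user/demo/important-data.csv");  // hypothetical path
        boolean requested = fs.setReplication(file, (short) 5);
        System.out.println("Replication change requested: " + requested);
    }
}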
3.7 HADOOP ARCHIVES
HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is held in
memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the
namenode. (Note, however, that small files do not take up any more disk space than is required to store
the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of
disk space, not 128 MB.) Hadoop Archives, or HAR files, are a file archiving facility that packs files into
HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing
transparent access to files. In particular, Hadoop Archives can be used as input to MapReduce.
4. Other Projects on HADOOP
4.1 Avro
Apache Avro is a data serialization system.
Avro provides:
1. Rich data structures.
2. A compact, fast, binary data format.
3. A container file, to store persistent data.
4. Simple integration with dynamic languages. Code generation is not required to read or write data
files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth
implementing for statically typed languages. A minimal usage sketch follows below.
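As a minimal illustration of Avro's generic (no code generation) Java API, the sketch below writes one record to an Avro container file and reads it back; the schema and file name are invented for the example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // A hypothetical record schema, defined inline as JSON.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "jony");
        user.put("age", 23);

        // Write to an Avro container file, then read it back.
        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}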
4.2 Chukwa
Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on
top of the Hadoop distributed file system (HDFS) and MapReduce framework and inherits Hadoop’s
scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying,
monitoring and analyzing results, in order to make the best use of this collected data.
4.3 HBase
Just as Google's Bigtable leverages the distributed data storage provided by the Google File System,
HBase provides Bigtable-like capabilities on top of Hadoop Core.
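A small illustrative sketch of the HBase Java client storing and reading back one cell; the ZooKeeper quorum, table name, and column family are hypothetical, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");  // hypothetical ZooKeeper quorum
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {  // hypothetical table
            // Store one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("jony"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}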
4.4 Hive
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries,
and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism
to project structure onto this data and query the data using a SQL-like language called HiveQL. At the
same time this language also allows traditional map/reduce programmers to plug in their custom
mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
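For illustration, Hive queries can be submitted from Java through the JDBC driver exposed by HiveServer2; the host, database, and table below are hypothetical, and the query string shows the SQL-like flavour of HiveQL.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port and database are placeholders.
        String url = "jdbc:hive2://hive-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Hive compiles this HiveQL into distributed jobs over the data stored in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}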
4.5 Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these programs. The
salient property of Pig programs is that their structure is amenable to substantial parallelization, which
in turn enables them to handle very large data sets.
4.6 ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing
distributed synchronization, and providing group services. All of these kinds of services are used in
some form or another by distributed applications. Each time they are implemented there is a lot of
work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of
implementing these kinds of services, applications initially usually skimp on them, which makes them
brittle in the presence of change and difficult to manage. Even when done correctly, different
implementations of these services lead to management complexity when the applications are
deployed.
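As a minimal illustration, the sketch below uses the ZooKeeper Java client to publish one piece of shared configuration as a znode and read it back; the ensemble address and znode path are hypothetical.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble and wait until the session is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        String path = "/demo-config";  // hypothetical znode
        // Create the znode once; OPEN_ACL_UNSAFE grants access to everyone (fine for a demo).
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch-size=128".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // Any process connected to the same ensemble now sees the same value.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}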
5. Applications of Hadoop
Let us look at the top real-time Hadoop use cases in various sectors.
5.1. Finance sectors
Financial organizations use Hadoop for fraud detection and prevention. They use Apache Hadoop to
reduce risk, identify rogue traders, and analyze fraud patterns. Hadoop also helps them precisely
target their marketing campaigns on the basis of customer segmentation.
Hadoop helps financial agencies improve customer satisfaction, and credit card companies use
Apache Hadoop to identify the right customers for their products.
5.2. Security and Law Enforcement
The US National Security Agency (NSA) uses Hadoop in order to prevent terrorist attacks and to detect and
prevent cyber-attacks. Big Data tools are used by the Police forces for catching criminals and even
predicting criminal activity. Hadoop is used by different public sector fields such as defense,
intelligence, research, cybersecurity, etc.
5.3. Companies use Hadoop for understanding customers' requirements
The most important application of Hadoop is understanding customers' requirements.
Companies in different sectors, such as finance and telecom, use Hadoop to find out their customers'
requirements by examining large amounts of data and discovering useful information from those vast data sets.
By understanding customers' behavior, organizations can improve their sales.
5.4. Hadoop Applications in Retail industry
Retailers both online and offline use Hadoop for improving their sales. Many e-commerce companies
use Hadoop for keeping track of the products bought together by the customers. On the basis of this,
they provide suggestions to the customer to buy the other product when the customer is trying to buy
one of the relevant products from that group.
For example, when a customer tries to buy a mobile phone, the site suggests related items such as a
mobile back cover or a screen guard.
Also, Hadoop helps retailers to customize their stocks based on the predictions that came from
different sources such as Google search, social media websites, etc. Based on these predictions retailers
can make the best decision which helps them to improve their business and maximize their profits.
5.5. Real-time analysis of customers data
Hadoop can analyze customer data in real time and is well suited to storing and processing high volumes
of clickstream data. When a visitor comes to a website, Hadoop can capture information such as where the
visitor originated before reaching that website and the search terms used to land on it.
Hadoop can also capture data about the other web pages in which the visitor shows interest, the time
spent by the visitor on each page, and so on. This supports analysis of website performance and user
engagement. Enterprises of all types implement Hadoop to perform clickstream analysis for optimizing the
user path, predicting the next product a customer will buy, carrying out market basket analysis, and so on.
5.6. Uses of Hadoop in Government sectors
Governments use Hadoop for the development of countries, states, and cities by analyzing vast
amounts of data.
For example, they use Hadoop for managing traffic in the streets, for the development of smart cities, or
for improving transportation in the city.
6. CONCLUSION:
• Hadoop is a technology of the future. It might not yet be an integral part of every curriculum, but it
is, and will remain, an integral part of the workings of many industries; e-commerce, finance, insurance,
IT, and healthcare are some of the starting points.
• From the above description we can understand the growing need for Big Data in the future, and Hadoop
can be one of the best options for maintaining and efficiently processing such large data.
• This technology has a bright future because the need for data increases day by day, and security is
also a major concern. Nowadays many multinational organisations prefer Hadoop over an RDBMS for
large-scale data.
• Major companies like Facebook, Amazon, Yahoo, LinkedIn, etc. are adopting Hadoop, and in the future
many more names may join the list.
• Hence, Hadoop is a highly appropriate approach for handling large data in a smart way, and its future
is bright.
7. REFERENCES:
https://www.cloudera.com/products/open-source/apache-hadoop/hdfs-mapreduce-yarn.html
https://data-flair.training/blogs/hadoop-tutorial/
https://techvidvan.com/tutorials/top-features-of-big-data-hadoop
https://en.wikipedia.org/wiki/Apache_Hadoop
https://hadoop.apache.org/
http://grut-computing.com/HadoopBook.pdf
More Related Content

What's hot (17)

PPTX
Hadoop for beginners free course ppt
Njain85
 
PDF
Twitter word frequency count using hadoop components 150331221753
pradip patel
 
PDF
Survey on Performance of Hadoop Map reduce Optimization Methods
paperpublications3
 
PDF
Hadoop, MapReduce and R = RHadoop
Victoria López
 
DOCX
Big data processing using - Hadoop Technology
Shital Kat
 
PDF
[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축
Kwang Woo NAM
 
PDF
Big Data with Modern R & Spark
Xavier de Pedro
 
PDF
field_guide_to_hadoop_pentaho
Martin Ferguson
 
DOCX
10 Popular Hadoop Technical Interview Questions
ZaranTech LLC
 
PDF
Big data and hadoop
AshishRathore72
 
PDF
D04501036040
ijceronline
 
PPT
Bigdata processing with Spark - part II
Arjen de Vries
 
PPT
Bigdata processing with Spark
Arjen de Vries
 
PDF
Survey Paper on Big Data and Hadoop
IRJET Journal
 
PDF
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
Puneet Kansal
 
PDF
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
IJTET Journal
 
Hadoop for beginners free course ppt
Njain85
 
Twitter word frequency count using hadoop components 150331221753
pradip patel
 
Survey on Performance of Hadoop Map reduce Optimization Methods
paperpublications3
 
Hadoop, MapReduce and R = RHadoop
Victoria López
 
Big data processing using - Hadoop Technology
Shital Kat
 
[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축
Kwang Woo NAM
 
Big Data with Modern R & Spark
Xavier de Pedro
 
field_guide_to_hadoop_pentaho
Martin Ferguson
 
10 Popular Hadoop Technical Interview Questions
ZaranTech LLC
 
Big data and hadoop
AshishRathore72
 
D04501036040
ijceronline
 
Bigdata processing with Spark - part II
Arjen de Vries
 
Bigdata processing with Spark
Arjen de Vries
 
Survey Paper on Big Data and Hadoop
IRJET Journal
 
OPTIMIZATION OF MULTIPLE CORRELATED QUERIES BY DETECTING SIMILAR DATA SOURCE ...
Puneet Kansal
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
IJTET Journal
 

Similar to Hadoop framework thesis (3) (20)

DOCX
Hadoop Report
Nishant Gandhi
 
DOCX
Hadoop Seminar Report
Bhushan Kulkarni
 
PPTX
Cap 10 ingles
ElianaSalinas4
 
PPTX
Cap 10 ingles
ElianaSalinas4
 
ODP
Hadoop seminar
KrishnenduKrishh
 
DOCX
Hadoop Seminar Report
Atul Kushwaha
 
PPTX
Hadoo its a good pdf to read some notes p.pptx
helloworldw793
 
PPTX
INTRODUCTION TO BIG DATA HADOOP
Krishna Sujeer
 
PPTX
Introduction to Apache Hadoop Ecosystem
Mahabubur Rahaman
 
PDF
Hadoop Primer
Steve Staso
 
PPTX
Hadoop training
TIB Academy
 
PDF
Big data and hadoop overvew
Kunal Khanna
 
PPTX
Hadoop
Zubair Arshad
 
PPSX
Hadoop
Nishant Gandhi
 
PPTX
Hadoop info
Nikita Sure
 
PDF
00 hadoop welcome_transcript
Guru Janbheshver University, Hisar
 
PPTX
Big data and hadoop anupama
Anupama Prabhudesai
 
DOC
Hadoop
Himanshu Soni
 
PPTX
introduction to hadoop
ASIT
 
Hadoop Report
Nishant Gandhi
 
Hadoop Seminar Report
Bhushan Kulkarni
 
Cap 10 ingles
ElianaSalinas4
 
Cap 10 ingles
ElianaSalinas4
 
Hadoop seminar
KrishnenduKrishh
 
Hadoop Seminar Report
Atul Kushwaha
 
Hadoo its a good pdf to read some notes p.pptx
helloworldw793
 
INTRODUCTION TO BIG DATA HADOOP
Krishna Sujeer
 
Introduction to Apache Hadoop Ecosystem
Mahabubur Rahaman
 
Hadoop Primer
Steve Staso
 
Hadoop training
TIB Academy
 
Big data and hadoop overvew
Kunal Khanna
 
Hadoop info
Nikita Sure
 
00 hadoop welcome_transcript
Guru Janbheshver University, Hisar
 
Big data and hadoop anupama
Anupama Prabhudesai
 
introduction to hadoop
ASIT
 
Ad

Recently uploaded (20)

PDF
Design Thinking basics for Engineers.pdf
CMR University
 
PDF
Viol_Alessandro_Presentazione_prelaurea.pdf
dsecqyvhbowrzxshhf
 
PDF
MAD Unit - 2 Activity and Fragment Management in Android (Diploma IT)
JappanMavani
 
DOCX
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
PDF
Introduction to Productivity and Quality
মোঃ ফুরকান উদ্দিন জুয়েল
 
DOCX
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 
PPTX
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 
PPTX
Damage of stability of a ship and how its change .pptx
ehamadulhaque
 
PDF
MAD Unit - 1 Introduction of Android IT Department
JappanMavani
 
PDF
International Journal of Information Technology Convergence and services (IJI...
ijitcsjournal4
 
PPTX
fatigue in aircraft structures-221113192308-0ad6dc8c.pptx
aviatecofficial
 
PPTX
MATLAB : Introduction , Features , Display Windows, Syntax, Operators, Graph...
Amity University, Patna
 
PPTX
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
PDF
Zilliz Cloud Demo for performance and scale
Zilliz
 
PPTX
The Role of Information Technology in Environmental Protectio....pptx
nallamillisriram
 
PPTX
What is Shot Peening | Shot Peening is a Surface Treatment Process
Vibra Finish
 
PPTX
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
PPTX
2025 CGI Congres - Surviving agile v05.pptx
Derk-Jan de Grood
 
PPTX
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
PPTX
GitOps_Without_K8s_Training_detailed git repository
DanialHabibi2
 
Design Thinking basics for Engineers.pdf
CMR University
 
Viol_Alessandro_Presentazione_prelaurea.pdf
dsecqyvhbowrzxshhf
 
MAD Unit - 2 Activity and Fragment Management in Android (Diploma IT)
JappanMavani
 
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
Introduction to Productivity and Quality
মোঃ ফুরকান উদ্দিন জুয়েল
 
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 
Damage of stability of a ship and how its change .pptx
ehamadulhaque
 
MAD Unit - 1 Introduction of Android IT Department
JappanMavani
 
International Journal of Information Technology Convergence and services (IJI...
ijitcsjournal4
 
fatigue in aircraft structures-221113192308-0ad6dc8c.pptx
aviatecofficial
 
MATLAB : Introduction , Features , Display Windows, Syntax, Operators, Graph...
Amity University, Patna
 
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
Zilliz Cloud Demo for performance and scale
Zilliz
 
The Role of Information Technology in Environmental Protectio....pptx
nallamillisriram
 
What is Shot Peening | Shot Peening is a Surface Treatment Process
Vibra Finish
 
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
2025 CGI Congres - Surviving agile v05.pptx
Derk-Jan de Grood
 
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
GitOps_Without_K8s_Training_detailed git repository
DanialHabibi2
 
Ad

Hadoop framework thesis (3)

  • 1. A STUDY on HADOOP FRAMEWORK towards partial fulfillment of the requirement for the award of degree of Master of Computer Applications from SIR CHHOTU RAM INSTITUTE OF ENGINEERING & TECHNOLOGY CHAUDHARY CHARAN SINGH UNIVERSITY, MEERUT Academic Session 2018 – 21 Submitted by: Under Guidance of: JONY KUMAR Mr. VIKASH JAIN Roll no.:-100180702 MCA 3rd year (6th semester) 1
  • 2. ACKNOWLEDGEMENT The achievement that is associated with the successful completion of this colloquium would be incomplete without mentioning the names whose endless cooperation made it possible. I would like to convey my regards to our college SIR CHHOTU RAM INSTITUTE OF TECHNOLOGY MEERUT and our respected Head of DCA for giving us such a nice opportunity to enhance our skills in this domain. I take this opportunity to express our deep gratitude towards our colloquium Supervisor MR. VIKASH JAIN for giving us such valuable suggestions, guidance and encouragement during the development of this project work. Last but not the least we are grateful to all the faculty members of SIR CHHOTU RAM INSTITUTE OF TECHNOLOGY MEERUT for their support. (JONY KUMAR) 2
  • 3. ABSTRACT My topic is Hadoop which is a cluster computing framework. Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers. Hadoop, however, was designed to solve a different problem: the fast, reliable analysis of both structured data and complex data. As a result, many enterprises deploy Hadoop alongside their legacy IT systems, which allows them to combine old data and new data sets in powerful new ways. The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising. I am going to represent the History, Development and Current Situation of this Technology. This technology is now under the Apache Software Foundation via Cloudera. 3
  • 4. Contents: Chapter 1: Introduction to HADOOP............................... 5 Chapter 2: History of HADOOP ......................................... 6 Top 10 Features of Big Data Hadoop......... 8 Chapter 3: Key Technology...................................................10 3.1 MapReduce.......................................................10 3.2 Programming Model .....................................12 3.3 Map ....................................................................12 3.4 Reduce ..............................................................12 3.5 HDFS ..................................................................14 3.6 HDFS Concepts ............................................15 Chapter 4: Other Projects on HADOOP ...........................19 4.1 Avro .................................................................. 19 4.2 Chukwa............................................................. 19 4.3 HBase ................................................................20 4.4 Hive .................................................................. 20 4.5 Pig ......................................................................21 4.6 ZooKeeper .......................................................22 Chapter 5: Applications of Hadoop ................................23 Chapter 6: Conclusion .......................................................... 25 Chapter 7: References ........................................................ 26 4
  • 5. 1.INTRODUCTION Today, we’re surrounded by data. People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth. Machines, too, are generating and keeping more and more data. The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft. They needed to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Existing tools were becoming inadequate to process such large data sets. Google was the first to publicize MapReduce—a system they had used to scale their data processing needs. This system aroused a lot of interest because many other businesses were facing similar scaling challenges, and it wasn’t feasible for everyone to reinvent their own proprietary tool. Doug Cutting saw an opportunity and led the charge to develop an open source version of this MapReduce system called Hadoop. Soon after, Yahoo and others rallied around to support this effort. Today, Hadoop is a core part of the computing infrastructure for many web companies, such as Yahoo, Facebook, LinkedIn, and Twitter. Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is ■ Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2 ). ■ Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures. ■ Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster. ■ Simple—Hadoop allows users to quickly write efficient parallel code. Hadoop’s accessibility and simplicity give it an edge over writing and running large distributed programs. Even college students can quickly and cheaply create their own Hadoop cluster. On the other hand, its robustness and scalability make it suitable for even the most demanding jobs at Yahoo and Facebook. These features make Hadoop popular in both academia and industry. 5
  • 6. 2.History of HADOOP Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project. 2.1 The Origin of the Name ``Hadoop”: The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term. Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs. Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It’s expensive too: Mike Cafarella and Doug Cutting estimated a system supporting a 1- billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000. Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms. Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn’t scale to the billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed file system, called GFS, which was being used in production at Google.# GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004, they set about writing an open source implementation, the Nutch Distributed File System (NDFS). In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS. NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop 6
  • 7. into a system that ran at web scale (see sidebar). This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster. In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. By this time Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. 7
  • 8. Top 10 Features of Big Data Hadoop a. Open source It is an open source Java-based programming framework. Open source means it is freely available and even we can change its source code as per your requirements. b. Fault Tolerance Hadoop control faults by the process of replica creation. When a client stores a file in HDFS, Hadoop framework divides the file into blocks. Then the client distributes data blocks across different machines present in the HDFS cluster. And, then create the replica of each block on other machines present in the cluster. HDFS, by default, creates 3 copies of a block on other machines present in the cluster. If any machine in the cluster goes down or fails due to unfavorable conditions. Then also, the user can easily access that data from other machines. c. Distributed Processing Hadoop stores huge amounts of data in a distributed manner in HDFS. Process the data in parallel on a cluster of nodes. d. Scalability Hadoop is an open-source platform. This makes it an extremely scalable platform. So, new nodes can be easily added without any downtime. Hadoop provides horizontal scalability so new nodes are added on the fly model to the system. In Apache Hadoop, applications run on more than thousands of nodes. e. Reliability Data is reliably stored on the cluster of machines despite machine failure due to replication of data. So, if any of the nodes fails, then also we can store data reliably. f. High Availability 8
  • 9. Due to multiple copies of data, data is highly available and accessible despite hardware failure. So, any machine that goes down data can be retrieved from the other path. Learn Hadoop High Availability features in detail. g. Economic Hadoop is not very expensive as it runs on the cluster of commodity hardware. As we are using low-cost commodity hardware, we don’t need to spend a huge amount of money for scaling out your Hadoop cluster. i. Flexibility Hadoop is very flexible in terms of ability to deal with all kinds of data. It deals with structured, semi-structured or unstructured. j. Easy to use No need for a client to deal with distributed computing, the framework takes care of all the things. So it is easy to use. k. Data locality It refers to the ability to move the computation close to where actual data resides on the node. Instead of moving data to computation. This minimizes network congestion and increases the over throughput of the system. Learn more about Data Locality. 9
  • 10. 3.Key Technology The key technology for Hadoop is the MapReduce programming model and Hadoop Distributed File System. The operation on large data is not possible in the serial programming paradigm. MapReduce does tasks parallel to accomplish work in less time which is the main aim of this technology. MapReduce requires a special file system. In the real scenario , the data which are in terms on perabyte. To store and maintain this much data on distributed commodity hardware, Hadoop Distributed File System was invented. It is basically inspired by the Google File System. 3.1 MapReduce MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). 10
  • 11. Figure. MapReduce Programming Model "Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. “Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice it is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase - provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available. MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model. 11
  • 12. 3.2 PROGRAMMING MODEL The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values.Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory 3.3 MAP map (in_key, in_value) -> (out_key, intermediate_value) list Example: Upper-case Mapper let map(k, v) = emit(k.toUpper(), v.toUpper()) (“foo”, “bar”) --> (“FOO”, “BAR”) (“Foo”, “other”) -->(“FOO”, “OTHER”) (“key2”, “data”) --> (“KEY2”, “DATA”) 3.4 REDUCE reduce (out_key, intermediate_value list) ->out_value list Example: Sum Reducer let reduce(k, vals) sum = 0 foreachint v in vals: sum += v 12
  • 13. emit(k, sum) (“A”, [42, 100, 312]) --> (“A”, 454) (“B”, [12, 6, -2]) --> (“B”, 16) Hadoop Map-Reduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A Map-Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks. A MapReducejob is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks. There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Figure : HadoopMapReduce 13
  • 14. 3.5 HDFS (Hadoop Distributed File System) The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. Figure: HDFS Architecture 14
  • 15. HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode. Filesystems that manage the storage across a network of machines are called distributed file systems. Since they are network-based, all the complications of network programming kick in, thus making distributed file systems more complex than regular disk filesystems. For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss. Hadoop comes with a distributed file system called HDFS, which stands for HadoopDistributed File System. HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications. 3.6 HDFS CONCEPTS: a) Blocks A disk has a block size, which is the minimum amount of data that it can read or write.Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. This is generally transparent to the filesystem user who is simply reading or writing a file—of whatever length. However, there are tools to do with filesystem maintenance, such as dfand fsck, that operate on 15
HDFS too has the concept of a block, but it is a much larger unit: 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, however, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage. When unqualified, the term "block" here refers to a block in HDFS.

b) Namenodes and Datanodes

An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from the datanodes when the system starts. A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes.

Figure: HDFS Architecture
The client presents a POSIX-like filesystem interface, so the user code does not need to know about the namenode and datanodes to function. Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing. Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure.

c) The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions, and it does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

d) Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file: an application can specify the number of replicas of a file, and the replication factor can be specified at file creation time and changed later.

Figure: Data Replication
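To make the block and replication concepts above concrete, the following is a minimal sketch of a client program talking to HDFS through Hadoop's Java FileSystem API. The file path, block hosts output, and replication value are illustrative only; the cluster address is taken from whatever core-site.xml and hdfs-site.xml are on the classpath.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS and other settings from the Hadoop configuration files on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; the namenode records the metadata, the datanodes store the blocks.
            Path file = new Path("/user/hadoop/example.txt");   // hypothetical path
            try (OutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Ask the namenode which datanodes hold each block of the file.
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("block hosts: " + String.join(",", block.getHosts()));
            }

            // Change the replication factor of this one file (the default comes from dfs.replication).
            fs.setReplication(file, (short) 3);

            // Read the file back; the client streams block data directly from the datanodes.
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
            fs.close();
        }
    }

Note how the read and write paths move data between the client and the datanodes, while the namenode is consulted only for metadata, matching the architecture described above.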
3.7 HADOOP ARCHIVES

HDFS stores small files inefficiently, since each file is stored in a block and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode. (Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.) Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files. In particular, Hadoop Archives can be used as input to MapReduce.
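As an illustration of the "transparent access" mentioned above: an archive is created with the hadoop archive command line tool (for example, hadoop archive -archiveName files.har -p /user/hadoop input /user/hadoop/archives), and its contents can then be read through the har filesystem scheme like any other path. A minimal sketch, assuming such an archive exists and using hypothetical paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HarListingSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Assumes an archive was created beforehand with something like:
            //   hadoop archive -archiveName files.har -p /user/hadoop input /user/hadoop/archives
            // The har:/// scheme exposes the archive's contents as an ordinary (read-only) filesystem
            // layered on top of the cluster's default filesystem.
            Path archive = new Path("har:///user/hadoop/archives/files.har");

            FileSystem harFs = archive.getFileSystem(conf);
            for (FileStatus status : harFs.listStatus(archive)) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }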
4. Other Projects on HADOOP

4.1 Avro

Apache Avro is a data serialization system. Avro provides:
1. Rich data structures.
2. A compact, fast, binary data format.
3. A container file, to store persistent data.
4. Simple integration with dynamic languages.

Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.
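To give a flavour of these points, the sketch below defines a small Avro record schema at runtime and writes one record to an Avro container file using the generic Java API, so no code generation is involved. The schema, field names, and output file name are illustrative.

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroSketch {
        public static void main(String[] args) throws Exception {
            // A self-describing record schema declared as JSON (no generated classes needed).
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("age", 30);

            // Write the record into an Avro container file: compact binary data plus the embedded schema.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("users.avro"));
                writer.append(user);
            }
        }
    }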
4.2 Chukwa

Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework, and it inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results, in order to make the best use of the collected data.

4.3 HBase

Just as Google's Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop Core.
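For example, a Bigtable-style row insert and lookup through the HBase Java client API might look roughly like the following; the table name and column family are hypothetical and would have to exist on the cluster already, and the cluster location comes from hbase-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            // Picks up the ZooKeeper quorum and other settings from hbase-site.xml on the classpath.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("webtable"))) {

                // Store one cell: row key -> column family "contents", qualifier "html".
                Put put = new Put(Bytes.toBytes("com.example/index"));
                put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                              Bytes.toBytes("<html>...</html>"));
                table.put(put);

                // Read the cell back by row key.
                Result result = table.get(new Get(Bytes.toBytes("com.example/index")));
                byte[] value = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
                System.out.println(Bytes.toString(value));
            }
        }
    }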
4.4 Hive

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. (A small HiveQL example is sketched below, after the Pig overview.)

4.5 Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
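Returning to Hive (section 4.4), applications commonly submit HiveQL through the HiveServer2 JDBC driver. The sketch below is only meant to show the SQL-like flavour of HiveQL; the host, credentials, and the pageviews table are placeholders, and the Hive JDBC driver jar is assumed to be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlSketch {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver (harmless on modern JDBC versions that auto-register).
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // HiveServer2 endpoint; host, port, database, and user are placeholders.
            String url = "jdbc:hive2://hiveserver.example.com:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
                 Statement stmt = conn.createStatement()) {

                // A typical HiveQL aggregation over a (hypothetical) table whose data lives in HDFS;
                // behind the scenes Hive compiles this into MapReduce-style jobs.
                ResultSet rs = stmt.executeQuery(
                    "SELECT country, COUNT(*) AS views " +
                    "FROM pageviews WHERE dt = '2021-01-01' " +
                    "GROUP BY country ORDER BY views DESC LIMIT 10");

                while (rs.next()) {
                    System.out.println(rs.getString("country") + "\t" + rs.getLong("views"));
                }
            }
        }
    }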
4.6 ZooKeeper

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially often skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
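As a small illustration of these coordination primitives, the sketch below connects to a ZooKeeper ensemble, stores a shared configuration value in a znode, and reads it back. The connection string, timeout, znode path, and value are placeholders.

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperSketch {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);

            // Connect to a (placeholder) ZooKeeper ensemble; the watcher signals when the session is up.
            ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 15000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();

            // Store a piece of shared configuration in a persistent znode (created once).
            String path = "/demo-config";
            if (zk.exists(path, false) == null) {
                zk.create(path, "max-workers=42".getBytes(StandardCharsets.UTF_8),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Any process in the cluster can now read (and, with a watch, track) the same value.
            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data, StandardCharsets.UTF_8));

            zk.close();
        }
    }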
5. Top 6 Real-Time Big Data Hadoop Applications

Let us see Hadoop use cases in various sectors.

5.1. Finance sectors

Financial organizations use Hadoop for fraud detection and prevention. They use Apache Hadoop for reducing risk, identifying rogue traders, and analyzing fraud patterns. Hadoop helps them to precisely target their marketing campaigns on the basis of customer segmentation, and it helps financial agencies to improve customer satisfaction. Credit card companies also use Apache Hadoop to find the right customers for their products.

5.2. Security and Law Enforcement

The US National Security Agency uses Hadoop to prevent terrorist attacks and to detect and prevent cyber-attacks. Big Data tools are used by police forces for catching criminals and even predicting criminal activity. Hadoop is used in different public sector fields such as defense, intelligence, research, cybersecurity, etc.

5.3. Companies use Hadoop for understanding customers' requirements

The most important application of Hadoop is understanding customers' requirements. Companies in sectors such as finance and telecom use Hadoop to find out customers' requirements by examining large amounts of data and discovering useful information from it. By understanding customers' behavior, organizations can improve their sales.

5.4. Hadoop Applications in the Retail industry

Retailers, both online and offline, use Hadoop for improving their sales. Many e-commerce companies use Hadoop to keep track of the products bought together by customers. On this basis, they suggest that the customer buy the other products in that group when the customer is trying to buy one of them.
For example, when a customer tries to buy a mobile phone, the site suggests related items such as a back cover or a screen guard. Hadoop also helps retailers to customize their stocks based on predictions derived from different sources such as Google searches, social media websites, etc. Based on these predictions, retailers can make better decisions, which helps them to improve their business and maximize their profits.

5.5. Real-time analysis of customer data

Hadoop can analyze customer data in real time. It can track clickstream data, as it is well suited to storing and processing high volumes of clickstream data. When a visitor visits a website, Hadoop can capture information such as where the visitor originated before reaching that website and the search used for landing on it. Hadoop can also capture data about the other web pages in which the visitor shows interest, the time spent by the visitor on each page, and so on. This enables the analysis of website performance and user engagement. By implementing Hadoop, enterprises of all types perform clickstream analysis for optimizing the user path, predicting the next product to buy, carrying out market basket analysis, etc.

5.6. Uses of Hadoop in Government sectors

Governments use Hadoop for the development of countries, states, and cities by analyzing vast amounts of data. For example, they use Hadoop for managing traffic in the streets, for the development of smart cities, or for improving transportation in the city.
6. CONCLUSION

• Hadoop is a technology of the future. It may not yet be an integral part of every curriculum, but it is and will be an integral part of the workings of many industries; e-commerce, finance, insurance, IT, and healthcare are some of the starting points.
• From the above description we can understand the future need for Big Data, and Hadoop can be one of the best tools for the maintenance and efficient processing of large data.
• This technology has a bright future because the amount of data increases day by day and security issues are also a major concern. Nowadays many multinational organisations prefer Hadoop over an RDBMS.
• Major companies like Facebook, Amazon, Yahoo, and LinkedIn are adopting Hadoop, and in future there can be many more names on the list.
• Hence Hadoop is the most appropriate approach for handling large data in a smart way, and its future is bright.