2. History of Hadoop
Early Foundations (2003–2004)
• In 2003, Google published a paper on the Google File System (GFS), which
described a scalable distributed file system to store and manage large datasets
across multiple machines.
• In 2004, Google introduced MapReduce, a programming model designed for
processing large-scale datasets efficiently.
• Hadoop was originally created by Doug Cutting and Mike Cafarella in 2005, when
they were working on the Nutch Search Engine project. Nutch is a highly
extensible and scalable web crawler used to search and index web pages.
Searching and indexing the billions of pages that exist on the internet requires
vast computing resources. Instead of relying on (centralized) hardware to deliver
high availability, Cutting and Cafarella developed software (now known as Hadoop)
that was able to distribute the workload across clusters of computers.
3. Birth of Hadoop (2005–2006)
• Inspired by Google’s work, Doug Cutting and Mike Cafarella
were developing an open-source web search engine called Nutch.
• They realized the need for a robust storage and processing system
like GFS and MapReduce.
• In 2005, they started implementing a distributed computing
system as part of the Apache Nutch project.
• In 2006, Hadoop became an independent project under the
Apache Software Foundation (ASF), with Doug Cutting naming it
after his son’s toy elephant.
4. Growth and Adoption (2007–2011)
• Yahoo! became the primary contributor to Hadoop and built the world’s largest Hadoop cluster at the time.
• In 2008, Hadoop became a top-level project of the Apache Software Foundation.
• Hadoop's ecosystem expanded with HBase (2008), Hive (2009), Pig (2008), and ZooKeeper (2010), adding more
functionality.
• By 2011, Hadoop was widely adopted across industries for handling big data.
5. Modern Hadoop and Industry Adoption (2012–Present)
• Hadoop 2.0 (2013) introduced YARN (Yet Another Resource
Negotiator), which improved cluster resource management.
• Companies like Facebook, Twitter, Amazon, and Microsoft
integrated Hadoop into their data architectures.
• Hadoop 3.0 (2017) introduced support for erasure coding,
improved scalability, and containerization.
• Although newer engines such as Apache Spark and cloud services
such as Google BigQuery are gaining popularity, Hadoop remains a
key player in big data processing.
7. Apache Hadoop
• Apache Hadoop is an open-source software framework, created by
Doug Cutting (who later developed it at Yahoo!), that provides highly
reliable, distributed processing of large data sets using simple
programming models.
• Hadoop is an open-source software framework that is used for storing
and processing large amounts of data in a distributed computing
environment. It is designed to handle big data and is based on the
MapReduce programming model, which allows for the parallel
processing of large datasets.
• Its framework is based on Java programming with some native code in
C and shell scripts.
8. • The goal is to evaluate word distribution in one million
German books, which is too large for a single computer
to handle.
• The Map-Reduce algorithm breaks the large task into
smaller, manageable subprocesses.
• Each book is processed individually to calculate word
occurrence and distribution, with results stored in
separate tables for each book.
• A master node manages and coordinates the entire
process but doesn’t perform any calculations. Slave
nodes receive tasks from the master node, read books,
calculate word frequencies, and store the results.
9. Hadoop Distributed File System
• Hadoop has a distributed file system
known as HDFS, which splits files into
blocks and distributes them across the
various nodes of large clusters.
• Even in case of a node failure, the
system keeps operating, and data
transfer between the nodes is
facilitated by HDFS.
10. HDFS Architecture
1.Nodes: A master node and slave nodes typically form the HDFS cluster.
2.NameNode(MasterNode):
1. Manages all the slave nodes and assigns work to them.
2. It executes filesystem namespace operations like opening, closing, and renaming files and directories.
3. It should be deployed on reliable, high-end hardware, not on commodity hardware.
3.DataNode(SlaveNode):
1. The actual worker nodes, which do the actual work such as reading, writing, and processing data.
2. They also perform creation, deletion, and replication upon instruction from the master.
3. They can be deployed on commodity hardware.
11. HDFS Architecture
• HDFS daemons: Daemons are the processes running in the background.
• Namenodes:
• Run on the master node.
• Store metadata (data about data) like the file path, the number of blocks, block IDs, etc.
• Require a high amount of RAM.
• Store metadata in RAM for fast retrieval, i.e. to reduce seek time, though a persistent
copy of it is kept on disk.
• DataNodes:
• Run on slave nodes.
• Require large disk capacity, as the actual data is stored here.
13. Basic HDFS Commands
• List files in a directory:
hdfs dfs -ls /path/to/directory
• Create a directory:
hdfs dfs -mkdir /new_directory
• Upload a file to HDFS:
hdfs dfs -put local_file.txt /hdfs_directory/
• Download a file from HDFS:
hdfs dfs -get /hdfs_directory/file.txt /local_directory/
• Remove a file or directory:
hdfs dfs -rm /path/to/file
hdfs dfs -rm -r /path/to/directory
14. Components of Hadoop
Hadoop can be broken down into four main components. These four
components (which are in fact independent software modules) together compose
a Hadoop cluster. In other words, when people are talking about ‘Hadoop,’ they
actually mean that they are using (at least) these four components:
1.Hadoop Common – Contains libraries and utilities needed by all the other Hadoop
Modules;
2.Hadoop Distributed File System – A distributed file-system that stores data on
commodity machines, providing high aggregate bandwidth across a cluster;
3.Hadoop MapReduce – A programming model for large scale data processing;
4.Hadoop YARN – A resource-management platform responsible for managing
compute resources in clusters and using them for scheduling of users’
applications.
16. Components of Hadoop
Hadoop Common
• Hadoop is a Java-based solution. Hadoop Common provides the tools (in Java) for a
user’s computer so that it can read the data that is stored in a Hadoop file system. No
matter what type of operating system a user has (Windows, Linux, etc.), Hadoop
Common ensures that data can be correctly interpreted by the machine.
Hadoop Distributed File System (HDFS)
• The Hadoop Distributed File System (HDFS) is a filesystem designed for storing very
large files with streaming data access patterns, running on clusters of commodity
hardware.[1] This means that Hadoop stores files that are typically many terabytes up
to petabytes of data. The streaming nature of HDFS means that HDFS stores data
under the assumption that it will need to be read multiple times and that the speed
with which the data can be read is most important. Lastly, HDFS is designed to run on
commodity hardware, which is inexpensive hardware that can be sourced from
different vendors.
17. Components of Hadoop
• Hadoop MapReduce
MapReduce is a processing technique and program model that enables
distributed processing of large quantities of data, in parallel, on large clusters of
commodity hardware. Similar to the way that HDFS stores blocks in a distributed
manner, MapReduce processes data in a distributed manner. In other words,
MapReduce uses processing power in local nodes within the cluster, instead of
centralized processing.
• Hadoop YARN
Hadoop YARN (Yet Another Resource Negotiator) is responsible for allocating
system resources to the various applications running in a Hadoop cluster and
scheduling tasks to be executed on different cluster nodes. It was developed
because in very large clusters (with more than 4000 nodes), the MapReduce
system begins to hit scalability bottlenecks.
18. Data Format in Hadoop
• In Hadoop, data can be stored and processed in various formats depending on the
requirements for efficiency, compatibility, and performance. Here are the common
data formats used in Hadoop:
• Hadoop data formats are broadly classified into:
1.Text-Based Formats
2.Binary Row-Based Formats
3.Columnar Formats
4.Key-Value Formats
19. 1. Text-Based Formats
Text-based formats store data in a human-readable form but are less efficient in terms of
storage and performance.
a) CSV (Comma-Separated Values)
• Stores tabular data with values separated by commas.
• Simple and widely used but lacks data types and schema enforcement.
• Consumes more space compared to binary formats.
b) JSON (JavaScript Object Notation)
• Stores structured or semi-structured data in key-value pairs.
• Supports nested data, making it useful for NoSQL and unstructured data.
• More readable but larger in size compared to binary formats.
c) XML (Extensible Markup Language)
• Uses tags to define data, making it highly structured.
• Used in older applications but is verbose and slow to process.
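As a small illustration of the structural difference between these formats (a hypothetical two-record dataset, using only Python's standard library), the same data can be written as CSV and as JSON:
import csv, json

records = [{"id": 1, "name": "Alice", "city": "Pune"},
           {"id": 2, "name": "Bob", "city": "Delhi"}]

# CSV: flat rows separated by commas, no data types or nesting
with open("people.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["id", "name", "city"])
    w.writeheader()
    w.writerows(records)

# JSON: key-value pairs, supports nested/semi-structured data
with open("people.json", "w") as f:
    json.dump(records, f, indent=2)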
20. 2. Binary Row-Based Formats
Binary formats store data in compressed binary form, making them faster and
more space-efficient than text-based formats.
a) Avro
• A row-based format developed under the Apache Avro project.
• Stores schema separately in JSON format while data is stored in binary.
• Supports schema evolution, making it useful for streaming applications like
Kafka.
b) SequenceFile
• A Hadoop-native format that stores key-value pairs in binary form.
• Optimized for Hadoop MapReduce and supports compression.
• Efficient for storing intermediate data in Hadoop workflows.
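A rough sketch of the Avro pattern described above (schema defined in JSON, data stored in binary), assuming the third-party fastavro package is available; the schema, field names, and file name are made up for illustration:
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "int"},
               {"name": "name", "type": "string"}]
})
records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# Write: records are serialized in compact binary form; the schema travels with the file
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Read: the stored schema is used to deserialize the records
with open("users.avro", "rb") as fo:
    for rec in reader(fo):
        print(rec)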
22. 3. Columnar Formats
Columnar formats store data column-wise instead of row-wise, improving query
performance for analytical workloads.
a) Parquet
• A highly compressed columnar storage format.
• Optimized for big data analytics in tools like Apache Spark, Hive, and Presto.
• Supports schema evolution and efficient queries on large datasets.
b) ORC (Optimized Row Columnar)
• Designed specifically for Apache Hive.
• Provides higher compression and better performance than Parquet in some
cases.
• Used in data warehousing and OLAP (Online Analytical Processing).
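A minimal sketch of why columnar storage helps analytics, assuming the pyarrow package (the table and file names are illustrative): only the requested column needs to be read back from the Parquet file.
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it in the compressed, column-wise Parquet layout
table = pa.table({"user_id": [1, 2, 3],
                  "amount": [10.5, 7.25, 3.0]})
pq.write_table(table, "sales.parquet")

# Column pruning: read only the 'amount' column for an analytical query
amounts = pq.read_table("sales.parquet", columns=["amount"])
print(amounts.to_pydict())   # {'amount': [10.5, 7.25, 3.0]}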
23. 4. Key-Value Formats
These formats store data in key-value pairs, making them
ideal for NoSQL databases and real-time applications.
a) HBase Table Format
• Used in HBase, a NoSQL database built on Hadoop.
• Supports random reads and writes, making it suitable for
real-time processing.
b) SequenceFile
• Used for storing key-value data in Hadoop’s MapReduce
framework.
• Supports compression and splitting, making it scalable for
large datasets.
24. Analyzing Data With Hadoop
Analyzing data with Hadoop involves using various components and tools within the Hadoop ecosystem
to process, transform, and gain insights from large datasets. Here are the steps and considerations for
analyzing data with Hadoop:
1. Data Ingestion:
• Start by ingesting data into the Hadoop cluster. You can use tools like Apache Flume, Apache Kafka, or
Hadoop’s HDFS for batch data ingestion.
• Ensure that your data is stored in a structured format in HDFS or another suitable storage system.
2. Data Preparation:
• Preprocess and clean the data as needed. This may involve tasks such as data deduplication, data
normalization, and handling missing values.
• Transform the data into a format suitable for analysis, which could include data enrichment and feature
engineering.
26. 3. Choose a Processing Framework:
Select the appropriate data processing framework based on your requirements. Common
choices include:
• MapReduce: Ideal for batch processing and simple transformations.
• Apache Spark: Suitable for batch, real-time, and iterative data processing. It offers a
wide range of libraries for machine learning, graph processing, and more.
• Apache Hive: If you prefer SQL-like querying, you can use Hive for data analysis.
• Apache Pig: A high-level data flow language for ETL and data analysis tasks.
• Custom Code: You can write custom Java, Scala, or Python code using Hadoop APIs if
necessary.
4. Data Analysis:
• Write the code or queries needed to perform the desired analysis. Depending on your choice
of framework, this may involve writing MapReduce jobs, Spark applications, HiveQL
queries, or Pig scripts.
• Implement data aggregation, filtering, grouping, and any other required transformations.
27. 5. Scaling:
• Hadoop is designed for horizontal scalability. As your data and processing needs grow, you
can add more nodes to your cluster to handle larger workloads.
6. Optimization:
• Optimize your code and queries for performance. Tune the configuration parameters of your
Hadoop cluster, such as memory settings and resource allocation.
• Consider using data partitioning and bucketing techniques to improve query performance.
7. Data Visualization:
• Once you have obtained results from your analysis, you can use data visualization tools like
Apache Zeppelin, Apache Superset, or external tools like Tableau and Power BI to create
meaningful visualizations and reports.
8. Iteration:
• Data analysis is often an iterative process. You may need to refine your analysis based on
initial findings or additional questions that arise.
28. 9. Data Security and Governance:
• Ensure that data access and processing adhere to security and governance policies. Use tools
like Apache Ranger or Apache Sentry for access control and auditing.
10. Results Interpretation:
• Interpret the results of your analysis and draw meaningful insights from the data.
• Document and share your findings with relevant stakeholders.
11. Automation:
• Consider automating your data analysis pipeline to ensure that new data is continuously
ingested, processed, and analyzed as it arrives.
12. Performance Monitoring:
• Implement monitoring and logging to keep track of the health and performance of your
Hadoop cluster and data analysis jobs.
29. Scaling
• Scaling can be defined as the process of expanding the existing configuration (servers/computers) to
handle a larger number of user requests or to manage the amount of load on the server. This
property is called scalability. It can be achieved either by increasing the current system
configuration (adding RAM, storage) or by adding more servers. Scalability plays a vital role in
the design of a system, as it helps the system respond to a large number of user requests more
effectively and quickly.
• There are two ways to do this:
1.Vertical Scaling
2.Horizontal Scaling
30. Vertical Scaling
• Vertical scaling is defined as the process of increasing the capacity of a single machine by adding more resources
such as memory, storage, etc. to increase the throughput of the system. No new machine is added;
rather, the capacity of the existing machine is increased. Vertical Scaling is also called the
Scale-up approach.
• Example: MySQL
• Advantages of Vertical Scaling
It is easy to implement
Reduced software costs, as no new machines are added
Less effort is required to maintain a single system
Disadvantages of Vertical Scaling
Single point of failure
When the single server fails, downtime is high because there is no other server to fall back on
High risk of hardware failure
31. Example
When traffic increases, the server
degrades in performance. The first
possible solution that everyone has is to
increase the power of their system. For
instance, if a server earlier had 8 GB of
RAM and a 128 GB hard drive, its
performance suffers as traffic grows. A
possible solution is to increase the
existing RAM or hard drive storage,
e.g. to 16 GB of RAM and a 500 GB
hard drive, but this is not an ultimate
solution: after a point, these capacities
reach saturation.
32. Horizontal Scaling
• It is defined as the process of adding more instances of the same type to the existing pool
of resources, rather than increasing the capacity of existing resources as in vertical scaling.
This kind of scaling also helps in decreasing the load on the server. This is called
Horizontal Scaling; it is also known as the Scale-out approach.
• In this process, the number of servers is increased and not the individual capacity of the
server. This is done with the help of a Load Balancer which basically routes the user
requests to different servers according to the availability of the server. Thereby, increasing
the overall performance of the system. In this way, the entire process is distributed among
all servers rather than just depending on a single server.
Example: NoSQL databases such as Cassandra and MongoDB
33. Advantages of Horizontal Scaling
1.Fault Tolerance means that there is no single point of failure in this kind of scaling, because there are, say, 5
servers instead of 1 powerful server. So if any one of the servers fails, there will be other
servers for backup. Whereas in Vertical Scaling there is a single point of failure: if the server fails,
the whole service is stopped.
2.Low Latency: latency refers to how late or delayed a request is before it is processed.
3.Built-in backup
Disadvantages of Horizontal Scaling
1.Not easy to implement, as there are a number of components in this kind of scaling
2.Cost is high
3.Networking components such as routers and load balancers are required
34. EXAMPLE
If there exists a system with a capacity of
8 GB of RAM and in the future there is a
requirement of 16 GB of RAM, then,
rather than increasing the capacity of the
8 GB machine to 16 GB, similar
instances of 8 GB RAM can be added
to meet the requirement.
35. Hadoop Streaming
• Hadoop Streaming uses UNIX standard streams as the interface
between Hadoop and your program, so you can write a MapReduce
program in any language that can read standard input and write to
standard output. Hadoop offers several mechanisms to support
non-Java development.
• The primary mechanisms are Hadoop Pipes which gives a native C++
interface to Hadoop and Hadoop Streaming which permits any
program that uses standard input and output to be used for map tasks
and reduce tasks.
• With this utility, one can create and run MapReduce jobs with any
executable or script as the mapper and/or the reducer.
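For example, a streaming word-count job could use two small Python scripts as the mapper and the reducer. This is a minimal sketch: the file names mapper.py and reducer.py are just a convention, the tab character is the streaming default key-value separator, and the reducer relies on the framework delivering its input sorted by key.

mapper.py:
import sys

# Read raw text lines from standard input; emit one "word<TAB>1" pair per word
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

reducer.py:
import sys

# Input arrives sorted by key, so all counts for a word are contiguous
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

These scripts are passed to the streaming utility through its -mapper and -reducer options (together with the mandatory -input and -output options mentioned on the next slide), and they can also be tested locally with a Unix pipeline, as shown later in the unit-testing slide.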
36. Features of Hadoop Streaming
Some of the key features associated with Hadoop Streaming are as
follows :
1. Hadoop Streaming is a utility that ships as part of the Hadoop distribution.
2. It facilitates ease of writing Map Reduce programs and codes.
3. Hadoop Streaming supports almost all types of programming
languages such as Python, C++, Ruby, Perl etc.
4. The entire Hadoop Streaming framework runs on Java. However,
the codes might be written in different languages as mentioned
in the above point.
5. The Hadoop Streaming process uses Unix Streams that act as an
interface between Hadoop and Map Reduce programs.
6. Hadoop Streaming uses various Streaming Command Options
and the two mandatory ones are – -input directoryname or
filename and -output directoryname
38. A Hadoop Streaming architecture has eight key parts.
They are:
• Input Reader/Format
• Key Value
• Mapper Stream
• Key-Value Pairs
• Reduce Stream
• Output Format
• Map External
• Reduce External
39. The six steps involved in the working of Hadoop Streaming are:
• Step 1: The input data is divided into chunks or blocks, typically 64MB to 128MB in size
automatically. Each chunk of data is processed by a separate mapper.
• Step 2: The mapper reads the input data from standard input (stdin) and generates an intermediate
key-value pair based on the logic of the mapper function which is written to standard output
(stdout).
• Step 3: The intermediate key-value pairs are sorted and partitioned based on their keys ensuring
that all values with the same key are directed to the same reducer.
• Step 4: The key-value pairs are passed to the reducers for further processing where each reducer
receives a set of key-value pairs with the same key.
• Step 5: The reducer function, implemented by the developer, performs the required computations or
aggregations on the data and generates the final output which is written to the standard output
(stdout).
• Step 6: The final output generated by the reducers is stored in the specified output location in the
HDFS.
40. Hadoop Pipes
• Hadoop Pipes is a C++ interface that allows users to write MapReduce applications using Hadoop.
It uses sockets to enable communication between tasktrackers and processes running the C++ map
or reduce functions.
Working of Hadoop Pipes:
• C++ code is split into a separate process that performs the application-specific code
• Writable serialization is used to convert types into bytes that are sent to the process via a socket
• The application programs link against a thin C++ wrapper library that handles communication
with the rest of the Hadoop system
41. Hadoop Ecosystem
• Hadoop Ecosystem is a platform or a suite which provides various services to solve the big data problems.
• The Hadoop Ecosystem is a suite of tools and technologies built around Hadoop to process and manage large-scale data.
Key components include:
1. HDFS (Hadoop Distributed File System) – Distributed storage system.
2. MapReduce – Programming model for parallel data processing.
3. YARN (Yet Another Resource Negotiator) – Resource management and job scheduling.
Supporting Tools:
• Hive – SQL-like querying.
• Pig – Data flow scripting.
• HBase – NoSQL database.
• Spark – Fast in-memory data processing.
• Sqoop – Data transfer between Hadoop and RDBMS.
• Flume – Data ingestion from logs.
• Oozie – Workflow scheduling.
• Zookeeper – Coordination service.
43. Classical Data Processing
[Diagram: a single-node machine with CPU, memory, and disk]
1. Data fits into memory
• Load data from disk into memory and then
process from memory
2. Data does not fit into memory
• Load part of the data from disk into memory
• Process the data
44. Motivation: Simple Example
• 10 Billion web pages
• Average size of webpage: 20KB
• Total 200 TB
• Disk read bandwidth = 50MB/sec
• Time to read = 4 million seconds = 46+ days
• Longer, if you want to do useful analytics with the data
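A quick sanity check of this estimate in Python (decimal units, i.e. 20 KB = 20,000 bytes and 50 MB = 50,000,000 bytes):
pages = 10_000_000_000            # 10 billion web pages
page_bytes = 20_000               # about 20 KB per page
bandwidth = 50_000_000            # 50 MB/sec disk read bandwidth

total_bytes = pages * page_bytes            # 2.0e14 bytes = 200 TB
read_seconds = total_bytes / bandwidth      # 4,000,000 seconds
print(read_seconds / 86_400)                # about 46.3 days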
45. What is the Solution?
• Use multiple interconnected machines as follows
(known as Distributed Data Processing in a Cluster of Computers):
1. Split data into small chunks
2. Send different chunks to
different machines and process
3. Collect the results from
different machines
47. Cluster Architecture: Rack Servers
[Diagram: two racks of machines, each with its own switch, connected by a backbone switch;
the backbone is typically 2-10 gbps, with about 1 gbps between any pair of nodes in a rack.]
• Each rack contains 16-64 commodity (low cost) computers (also called nodes)
• In 2011, Google had roughly 1 million nodes
50. Challenge # 1
• Node failures
• Single server lifetime: 1000 days
• 1000 servers in cluster => 1 failure/day
• 1M servers in clusters => 1000 failures/day
• Consequences of node failure
• Data loss
• Node failure in the middle of long and expensive computation
• Need to restart the computation from scratch
51. Challenge # 2
• Network bottleneck
• Computers in a cluster exchange data over the network
• Example
• Network bandwidth = 1 gbps
• Moving 10 TB of data (about 80,000 gigabits) therefore takes roughly 80,000 seconds, i.e. about 1 day
52. Challenge # 3
• Distributed Programming is hard!
• Why?
1. Data distribution across machines is non-trivial
• (It is desirable that machines have roughly the same load)
2. Avoiding race conditions
• Given two tasks T1 and T2,
• The correctness of the result depends on the order in which the tasks execute
• For example, T1 must run before T2, and NOT T2 before T1
53. What is the Solution?
• Map-Reduce
• It is a simple programming model for processing really big data using a cluster of
computers
54. Map Reduce
• MapReduce is a programming model in Hadoop used for processing large datasets in a distributed
and parallel manner.
• Generally, the MapReduce paradigm is based on sending the computation to where the data resides!
• MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage.
• Map stage − The map or mapper’s job is to process the input data. Generally the input data is
in the form of file or directory and is stored in the Hadoop file system (HDFS). The input file
is passed to the mapper function line by line. The mapper processes the data and creates
several small chunks of data.
• Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from the mapper. After processing, it produces
a new set of output, which will be stored in the HDFS.
55. Map Reduce
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in
the cluster.
• The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks, which reduces the network
traffic.
• After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
57. How Map-Reduce addresses the
challenges?
1. Data loss prevention
• By keeping multiple copies of data in different machines
2. Data movement minimization
• By moving computation to the data
• (send your computer program to machines containing data)
3. Simple programming model
• Mainly using two functions
1.Map
2.Reduce
Programmer’s responsibility:
Write only two functions, Map and Reduce, suitable for your problem.
You DO NOT need to worry about anything else.
58. Redundant Storage Infrastructure
• Distributed File System
• Global file namespaces, redundancy
• Multiple copies of data and in different nodes
• Example: Google File System (GFS), HDFS (Hadoop, an open-source map-reduce system)
• Typical usage pattern
• Data is rarely updated in place
• Reads and appends are common
59. Distributed File System: Inside Look
• Data is kept in chunks, spread across machines
• Each chunk is replicated on different machines
• Ensures persistence
• Example:
• We have two files, A and B
• 3 computers
• 2 times replication of data
[Diagram: file A is split into chunks a1, a2, a3 and file B into chunks b1, b2; with 2x replication,
the chunks are spread across the three chunk servers. The chunk servers also serve as compute
nodes, bringing computation to the data.]
60. Distributed File System: Summary
• Chunk servers
• File is split into contiguous chunks (16-64 MB)
• Each chunk is replicated (usually 2 times or 3 times)
• Try to keep replicas in different racks
• Master node
• Stores metadata about where the files are stored
• It might also be replicated
62. Example Problem: Counting Words
• We have a huge text document and want to count the number of
times each distinct word appears in the file
• Sample application
• Analyze web server logs to find popular URLs
• How would you solve this using a single machine?
63. Word Count
• Case 1: File too large for memory, but all <word, count> pairs fit in
memory
• You can create a big string array OR you can create
a hash table
• Case 2: All <word, count> pairs do not fit in memory, but fit into disk
• A possible approach (write computer
programs/functions for each step)
1.Break the text document into a sequence of words
2.Sort the words (this brings identical words together)
3.Count the frequencies in a single pass
Pipeline: getWords(textFile) → sort → count
64. Map-Reduce: In a Nutshell
• The same pipeline, getWords(dataFile) → sort → count, maps onto the three phases:
• Map: extract something you care about (here, word and count)
• Group by key: sort and shuffle
• Reduce: aggregate, summarize, etc., and save the results
Summary
1. The outline stays the same
2. Map and Reduce are defined to fit the problem
65. MapReduce: The Map Step
[Diagram: several map tasks run in parallel; each takes input key-value pairs (file name and its
content), e.g. (k1, v1), (k2, v2), ..., and emits intermediate key-value pairs (word and count).]
67. Map-reduce: Word Count
Big document (input):
"The crew of the space shuttle Endeavor recently returned to Earth as ambassadors,
harbingers of a new era of space exploration. Crew members at …"

MAP (provided by the programmer): reads the input and produces a set of key-value pairs:
(the, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (endeavor, 1), (recently, 1),
(returned, 1), (to, 1), (earth, 1), (as, 1), (ambassadors, 1), …, (crew, 1), …

Group by key: collects all pairs with the same key:
(crew, 1), (crew, 1), (space, 1), (the, 1), (the, 1), (the, 1), (shuttle, 1), (recently, 1), …

Reduce (provided by the programmer): collects all values belonging to each key and outputs (key, value):
(crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), …
68. Word Count Using MapReduce: Pseudocode
map(key, value):
// key: document name; value: text of the document
for each word w in value
emit(w, 1)
reduce(key, values):
// key: a word; values: list of counts for the word
result = 0
for each count v in values:
result += v
emit(key, result)
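The same pseudocode can be simulated in plain Python on a single machine to see how the pieces fit together; this is only a toy sketch, with an in-memory grouping step standing in for Hadoop's shuffle and sort:
from collections import defaultdict

def map_fn(doc_name, text):            # emit (word, 1) for every word
    for word in text.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):           # sum the counts for one word
    return (word, sum(counts))

documents = {"doc1": "the crew of the space shuttle"}

# Map phase
intermediate = [pair for name, text in documents.items()
                for pair in map_fn(name, text)]

# Group by key (what shuffle and sort do in Hadoop)
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase
for word in sorted(groups):
    print(reduce_fn(word, groups[word]))   # e.g. ('crew', 1), ..., ('the', 2)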
69. Map-reduce System: Under the Hood
• All phases are distributed, with many tasks doing the work in parallel
• Between the map and reduce phases, data is moved across machines
70. Map-Reduce Algorithm Design
• Programmer’s responsibility is to design two functions:
1. Map
2. Reduce
• A very important issue
• Often the network is the bottleneck
• Your design should minimize data communications across machines
71. Problems Suitable for Map-reduce
• Map-reduce is suitable for batch processing
• Updates are made after whole batch of data is processed
• The mappers do not need data from one another while they are running
• Example
1. Word count
• In general, map-reduce is suitable:
if the problem can be broken into independent sub-problems
72. Problems NOT Suitable for Map-reduce
• In general, when the machines need to exchange
data too often during computation
• Examples
1. Applications that require very quick response time
• In IR, indexing is okay, but query processing is not suitable for map-reduce
2. Machine learning algorithms that require frequent
parameter update
• Stochastic gradient descent
74. Warm-up Exercise
• Matrix Addition
• Can it be done in map-reduce?
• YES
• What is the map function (key and value)?
• Key = row number; value = elements of row (as an array)
• What is the reduce function?
• For each key, reducer will have two arrays
• Reduce function simply adds numbers, position-wise
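A toy local sketch of that scheme (keys are row numbers, values are rows; the reducer adds the two rows it receives position-wise):
from collections import defaultdict

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Map: emit (row_number, row) for every row of both matrices
pairs = [(i, row) for matrix in (A, B) for i, row in enumerate(matrix)]

# Group by key: each row number now has one row from A and one from B
groups = defaultdict(list)
for i, row in pairs:
    groups[i].append(row)

# Reduce: add the two rows position-wise
result = {i: [x + y for x, y in zip(r1, r2)] for i, (r1, r2) in groups.items()}
print(result)   # {0: [6, 8], 1: [10, 12]}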
75. Advanced Exercise: Join By Map-Reduce
• Compute the natural join T1(A,B) ⋈ T2(B,C)
(combine rows from T1 and T2 such that the rows have a common value in column B)
T1(A, B) = {(a1, b1), (a2, b1), (a3, b2), (a4, b3)}
T2(B, C) = {(b2, c1), (b2, c2), (b3, c3)}
T1 ⋈ T2 (A, B, C) = {(a3, b2, c1), (a3, b2, c2), (a4, b3, c3)}
76. Map-Reduce Join
• Map process
• Each row (a,b) from T1 into key-value pair (b,(a,T1))
• Each row (b,c) from T2 into (b,(c,T2))
• Reduce process
• Each reduce process matches all the pairs (b,(a,T1))
with all (b,(c,T2)) and outputs (a,b,c)
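A small local simulation of this reduce-side join, using the tables from the previous slide (again, an in-memory grouping step stands in for the shuffle):
from collections import defaultdict

T1 = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]   # rows (A, B)
T2 = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]                 # rows (B, C)

# Map: key on the join column B, tag each value with its table of origin
mapped = [(b, ("T1", a)) for a, b in T1] + [(b, ("T2", c)) for b, c in T2]

# Group by key (shuffle)
groups = defaultdict(list)
for b, tagged in mapped:
    groups[b].append(tagged)

# Reduce: pair every T1 value with every T2 value that shares the key
for b, values in groups.items():
    left = [a for tag, a in values if tag == "T1"]
    right = [c for tag, c in values if tag == "T2"]
    for a in left:
        for c in right:
            print((a, b, c))   # ('a3','b2','c1'), ('a3','b2','c2'), ('a4','b3','c3')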
77. Advanced Exercise
• You have a dataset with thousands of features. Find the most correlated features
in that data.
78. Take Home Exercises
• Design Map and Reduce functions for the following
1. Pagerank
2. HITS
79. Developing Map Reduce Application
• Map Program for word Count in Python
sentences = [
"Artificial Intelligence is fascinating",
"Deep learning improves medical imaging",
"Generative AI is transforming healthcare"
]
# Using map to count words in each sentence
word_counts = list(map(lambda sentence:
len(sentence.split()), sentences))
# Print results
for i, count in enumerate(word_counts):
print(f"Sentence {i+1}: {count} words")
80. Reduce Program for word Count in Python
from functools import reduce
sentences = [
"Artificial Intelligence is fascinating",
"Deep learning improves medical imaging",
"Generative AI is transforming healthcare"
]
# Using reduce to count total words
total_word_count = reduce(lambda total,
sentence: total + len(sentence.split()),
sentences, 0)
# Print result
print(f"Total Word Count: {total_word_count}")
81. Unit Test MapReduce using MRUnit
• In order to make sure that your code is correct, you need to Unit test
your code first.
• And just as you unit test your Java code using the JUnit testing framework,
the same can be done using MRUnit to test MapReduce jobs.
• MRUnit is built on top of JUnit framework. So we will use the JUnit
classes to implement unit test code for MapReduce.
• If you are familiar with JUnits then you will find unit testing for
MapReduce jobs also follows the same pattern.
82. Unit Test MapReduce using MRUnit
• To Unit test MapReduce jobs:
• Create a new test class to the existing project
• Add the mrunit jar file to build path
• Declare the drivers
• Write a method for initializations & environment setup
• Write a method to test mapper
• Write a method to test reducer
• Write a method to test the whole MapReduce job
• Run the test
83. Unit Testing: Streaming by Python
• We can test it on a Hadoop cluster, or test it on our local machine using a Unix
pipeline.
•cat test_input | python3 mapper.py | sort | python3 reducer.py
84. Handling failures in Hadoop MapReduce
• In the real world, user code is buggy, processes crash, and machines fail.
• One of the major benefits of using Hadoop is its ability to handle such failures
and allow your job to complete successfully.
• In MapReduce there are three failure modes to consider:
• Failure of the running task
• Failure of the tasktracker
• Failure of the jobtracker
85. Handling failures in Hadoop MapReduce
•Task Failure
• The most common occurrence of this failure is when user code in the
map or reduce task throws a runtime exception.
• If this happens, the child JVM reports the error back to its parent
application master before it exits.
• The error ultimately makes it into the user logs.
• The application master marks the task attempt as failed, and frees up
the container so its resources are available for another task.
86. Handling failures in Hadoop MapReduce
•Tasktracker Failure
• If a tasktracker fails by crashing or running very slowly, it will stop sending heartbeats to the jobtracker.
• The jobtracker will notice a tasktracker that has stopped sending heartbeats if it has not received one
for 10 minutes (the interval can be changed) and will remove it from its pool of tasktrackers.
• A tasktracker can also be blacklisted by the jobtracker.
88. Handling failures in Hadoop MapReduce
•Jobtracker Failure
• It is the most serious failure mode.
• Hadoop has no mechanism for dealing with jobtracker failure.
• This situation is improved with YARN.
91. Job Scheduling in Hadoop
• In Hadoop 1, the Hadoop MapReduce framework is responsible for
scheduling tasks, monitoring them, and re-executing failed tasks.
• In Hadoop 2, YARN (Yet Another Resource Negotiator) was
introduced.
• The basic idea behind introducing YARN is to split the
functionalities of resource management and job scheduling or
monitoring into separate daemons: the ResourceManager,
ApplicationMaster, and NodeManager.
92. Job Scheduling in Hadoop
• The ResourceManager has two main components: the Scheduler
and the ApplicationsManager.
• The Scheduler in the YARN ResourceManager is a pure scheduler,
responsible for allocating resources to the various running
applications.
• The FIFO Scheduler, CapacityScheduler, and FairScheduler are
pluggable policies that are responsible for allocating resources to the
applications.
93. FIFO Scheduler
• First In First Out is the default scheduling policy used in Hadoop.
• The FIFO Scheduler gives preference to applications submitted earlier
over those submitted later.
• It places the applications in a queue and executes them in the
order of their submission (first in, first out).
94. Capacity Scheduler
• It is designed to run Hadoop applications in a shared, multi-
tenant cluster while maximizing the throughput and the
utilization of the cluster.
• It supports hierarchical queues to reflect the structure of the
organizations or groups that utilize the cluster resources.
95. Capacity Scheduler
Advantages
• It maximizes the utilization of resources and throughput in the Hadoop cluster.
• Provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations utilizing the cluster.
Disadvantage:
• It is the most complex of the schedulers.
96. Fair Scheduler
• It is like the Capacity Scheduler.
• It assigns resources to applications in such a way that all applications get,
on average, an equal amount of resources over time.
• When a single application is running, that application uses the entire
cluster's resources.
• When other applications are submitted, the freed-up resources are assigned
to the new apps so that every app eventually gets roughly the same amount
of resources.
97. Shuffle Sort
• Shuffling is the process of transferring intermediate key-value pairs
from the Mapper to the Reducer.
• It ensures that all values belonging to the same key are grouped
together before reaching the Reducer.
• This process is automatic in Hadoop and managed by the framework.
• Sorting happens after shuffling, where intermediate key-value pairs
are sorted by key in ascending order.
• Sorting ensures that the reducer processes data in a structured
manner.
• Hadoop uses merge sort for this operation to handle large-scale data
efficiently.
98. Steps in Shuffle and Sort Phase
1. Mapper output collection
2. Partitioning
3. Shuffling
4. Sorting
99. Task Execution in Hadoop MapReduce
• Hadoop MapReduce follows a distributed processing model where tasks
are executed in multiple phases. The key phases include Map, Shuffle &
Sort, and Reduce, with task execution managed by the YARN (Yet
Another Resource Negotiator) framework.
1. Job Execution Overview
A MapReduce job consists of multiple tasks, which are divided into:
• Map Tasks (Mappers) – Process input data and generate intermediate
key-value pairs.
• Reduce Tasks (Reducers) – Process the sorted key-value pairs and
generate the final output.
• Each task runs independently and is managed by Hadoop's JobTracker
(Hadoop 1.x) or ResourceManager (Hadoop 2.x with YARN).
100. 2. Phases of Task Execution
(a) Input Splitting
• The input dataset is divided into chunks called input splits.
• Each split is processed by an individual Mapper.
(b) Mapping Phase
• Each Mapper takes an input split and processes it in parallel.
• The output of this phase is a set of intermediate key-value pairs.
(c) Shuffle and Sort Phase
• Intermediate key-value pairs are shuffled across nodes to group similar keys together.
• The data is sorted by key before being sent to the Reducer.
(d) Reduce Phase
• The Reducer processes grouped key-value pairs and performs aggregation, filtering, or
transformation.
• The final output is written to HDFS (Hadoop Distributed File System).
101. 3. Task Execution Components
• JobTracker (Hadoop 1.x) or ResourceManager (Hadoop 2.x -
YARN) manages job scheduling.
• TaskTracker (Hadoop 1.x) or NodeManager (Hadoop 2.x -
YARN) executes individual tasks.
• Speculative Execution helps handle slow tasks by running
duplicates on different nodes.
4. Fault Tolerance in Task Execution
• If a task fails, Hadoop automatically retries the execution on
another node.
• Checkpointing and replication in HDFS ensure no data loss.
102. Map Reduce types and format
• Hadoop MapReduce processes large datasets in a
distributed manner. It supports different types and data
formats to handle diverse data processing needs.
103. 1. MapReduce Data Types
Hadoop uses the Writable interface for data serialization and efficient processing. The key
MapReduce data
types include:
(a) Key-Value Data Types
•Text → Stores string values (Text is preferred over String for efficiency).
•IntWritable → Stores integer values.
•LongWritable → Stores long integer values.
•FloatWritable → Stores floating-point values.
•DoubleWritable → Stores double-precision values.
•BooleanWritable → Stores boolean values (true/false).
•ArrayWritable → Stores an array of writable objects.
•MapWritable → Stores key-value pairs where keys and values are writable.
104. 2. Input Formats in Hadoop
• How the input files are split up and read in Hadoop is defined by the
InputFormat.
• A Hadoop InputFormat is the first component in MapReduce; it is
responsible for creating the input splits and dividing them into records.
• Initially, the data for a MapReduce task is stored in input files, and input files
typically reside in HDFS.
• Using InputFormat we define how these input files are split and read.
105. Input Formats
• The InputFormat class is one of the fundamental classes in the Hadoop
MapReduce framework which provides the following functionality:
• The files or other objects that should be used for input are selected by the
InputFormat.
• InputFormat defines the Data splits, which defines both the size of individual Map tasks and its
potential execution server.
• InputFormat defines the RecordReader, which is responsible for reading actual records from
the input files.
106. Types of InputFormat in MapReduce
• FileInputFormat in Hadoop
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
• SequenceFileAsBinaryInputFormat
• NLineInputFormat
• DBInputFormat
107. FileInputFormat in Hadoop
• It is the base class for all file-based InputFormats.
• Hadoop FileInputFormat specifies input directory where data files are
located.
• When we start a Hadoop job, FileInputFormat is provided with a path
containing files to read.
• FileInputFormat will read all files and divides these files into one or
more InputSplits.
108. TextInputFormat
• It is the default InputFormat of MapReduce.
• TextInputFormat treats each line of each input file as a separate
record and performs no parsing.
• This is useful for unformatted data or line-based records like log
files.
• Key – It is the byte offset of the beginning of the line within the file
(not whole file just one split), so it will be unique if combined with the
file name.
• Value – It is the contents of the line, excluding line terminators.
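• For illustration, a split containing the two lines "hello world" and "bye" (with newline line endings) would produce the records (0, "hello world") and (12, "bye"), since the first line plus its newline occupies bytes 0 to 11.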
109. KeyValueTextInputFormat
• It is similar to TextInputFormat as it also treats each line of input as a
separate record.
• While TextInputFormat treats the entire line as the value,
KeyValueTextInputFormat breaks the line itself into a key and a value at a
tab character (‘\t’).
• Here the key is everything up to the tab character, while the value is the
remaining part of the line after the tab character.
110. SequenceFileInputFormat
• Hadoop SequenceFileInputFormat is an InputFormat which reads
sequence files.
• Sequence files are binary files that store sequences of binary key-
value pairs.
• Sequence files block-compress and provide direct serialization and
deserialization of several arbitrary data types (not just text).
• Here Key & Value both are user-defined.
111. SequenceFileAsTextInputFormat
• Hadoop SequenceFileAsTextInputFormat is another form of
SequenceFileInputFormat which converts the sequence file key values
to Text objects.
• The conversion is performed by calling toString() on the keys and values.
• This InputFormat makes sequence files suitable input for streaming.
112. NLineInputFormat
• Hadoop NLineInputFormat is another form of TextInputFormat where
the keys are the byte offsets of the lines and the values are the contents of the lines.
• If we want our mapper to receive a fixed number of lines of input,
then we use NLineInputFormat.
• N is the number of lines of input that each mapper receives. By
default (N=1), each mapper receives exactly one line of input. If N=2,
then each split contains two lines. One mapper will receive the first
two Key-Value pairs and another mapper will receive the second
two key-value pairs.
113. DBInputFormat
• Hadoop DBInputFormat is an InputFormat that reads data from a
relational database, using JDBC.
• As it doesn’t have partitioning capabilities, we need to be careful not to
swamp the database we are reading from by running too many mappers.
• So it is best for loading relatively small datasets, perhaps for joining
with large datasets from HDFS using MultipleInputs.
114. Hadoop Output Format
• The Hadoop Output Format checks the Output-Specification of the job.
• It determines how RecordWriter implementation is used to write output to output
files.
• Hadoop RecordWriter
• As we know, the Reducer takes as input the set of intermediate key-value pairs produced
by the mapper and runs a reducer function on them to generate output that is again
zero or more key-value pairs.
• RecordWriter writes these output key-value pairs from the Reducer phase to output
files.
• As we saw above, Hadoop RecordWriter takes output data from Reducer and
writes this data to output files.
• The way these output key-value pairs are written in output files by RecordWriter is
determined by the Output Format.
116. TextOutputFormat
• The default Hadoop reducer Output Format in MapReduce is
TextOutputFormat, which writes (key, value) pairs on individual lines
of text files; its keys and values can be of any type.
• Each key-value pair is separated by a tab character, which can be
changed using the mapreduce.output.textoutputformat.separator
property.
117. SequenceFileOutputFormat
• It is an Output Format which writes sequence files for its output. It is an
intermediate format used between MapReduce jobs, which rapidly
serializes arbitrary data types to the file.
• SequenceFileInputFormat will deserialize the file into the same types
and present the data to the next mapper in the same manner as it
was emitted by the previous reducer.
118. MapFileOutputFormat
• It is another form of FileOutputFormat in Hadoop Output Format,
which is used to write output as map files.
• The keys in a MapFile must be added in order, so we need to ensure
that the reducer emits keys in sorted order.
119. MultipleOutputs
• It allows writing data to files whose names are derived from the
output keys and values, or in fact from an arbitrary string.
• DBOutputFormat
• DBOutputFormat in Hadoop is an Output Format for writing to
relational databases and HBase.
• It sends the reduce output to a SQL table.
• It accepts key-value pairs.
120. Features of MapReduce
• Scalability
• Flexibility
• Security and Authentication
• Cost-effective solution
• Fast
• Simple model of programming
• Parallel Programming
• Availability and resilient nature
121. Real World Map Reduce
MapReduce is widely used in industry for processing large datasets in a distributed manner. Below
are some real-world applications where MapReduce plays a crucial role:
1. Healthcare and Medical Imaging
• Disease detection (X-rays, MRIs, CT scans)
• Genomic analysis for genetic mutations
• Processing electronic health records (EHRs)
2. E-Commerce and Retail
• Product recommendation systems
• Customer sentiment analysis from reviews
• Fraud detection in online transactions
3. Social Media and Online Platforms
• Trending topic detection (Twitter, Facebook)
• Spam and fake news filtering
• User engagement analysis (likes, shares, comments)
122. 4. Finance and Banking
• Stock market predictions
• Risk management and fraud detection
• Customer segmentation for targeted marketing
5. Search Engines (Google, Bing, Yahoo)
• Web indexing for search results
• Ad targeting based on user behavior
6. Transportation and Logistics
• Traffic analysis for route optimization
• Supply chain and warehouse management
7. Scientific Research and Big Data Analytics
• Climate change analysis using satellite data
• Astronomy research for celestial object detection