International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 06 | June -2017 www.irjet.net p-ISSN: 2395-0072
A SURVEY ON DIFFERENT FILE HANDLING MECHANISMS IN HDFS
Revathi S1, Saniya Kauser2, Sushmitha S3, Vinodini G4
1Assistant Professor, Dept of CSE, Dr.TTIT, Karnataka, India
2UG Student, Dept of CSE, Dr.TTIT, Karnataka, India
3 UG Student, Dept of CSE, Dr.TTIT, Karnataka, India
4UG Student, Dept of CSE, Dr.TTIT, Karnataka, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Hadoop is a software framework for distributed processing of large datasets across large clusters of computers. The Hadoop framework consists of two main layers: the Hadoop Distributed File System (HDFS) and the execution engine (MapReduce). HDFS handles large files (in MBs, GBs or TBs) well, but its performance degrades when handling small files. Huge numbers of small files impose a heavy burden on the NameNode of HDFS, and correlations between small files are not considered for data placement. There are three common mechanisms to handle small files in HDFS: Hadoop Archive (HAR), Sequence File and TLB-MapFile. To improve access efficiency and to quickly locate a small file, a common strategy is to merge small files into large ones. This paper discusses the different small file handling mechanisms, namely Hadoop Archive (HAR), Sequence Files and TLB-MapFile, and compares them.
Keywords: Hadoop, HDFS, HAR, Sequence File, TLB-MapFile
1. INTRODUCTION
Hadoop is an open-source software framework [1] that offers a cost-efficient solution to store, manage and analyze large amounts of data; it provides distributed processing and storage of huge data across thousands of computers [2]. Google initiated the idea behind Hadoop to store and process large amounts of information on the web, and it has since been adopted by other web giants such as Facebook, Twitter, LinkedIn and Yahoo. Hadoop comes with two layers, called the MapReduce framework and the Hadoop Distributed File System (HDFS).
1.1 MapReduce framework: MapReduce is a core component of the Apache Hadoop software framework [3]. It is a parallel programming model for processing and generating large data sets and is, in general, the execution unit of the Hadoop framework [4]. It is based on two functions, map and reduce: the map function processes a key/value pair to generate a set of intermediate key/value pairs, and the reduce function combines all intermediate values associated with the same intermediate key. MapReduce provides good fault tolerance, with each node periodically reporting its status to a master node.
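As an illustration of this model, the following is a minimal sketch of the canonical word-count job written against the Hadoop MapReduce Java API: the mapper emits (word, 1) pairs and the reducer sums the values that share the same intermediate key. Class names and input/output paths are illustrative only.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every word in the input split.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all counts that share the same intermediate key.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```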
1.2 HDFS: HDFS is a distributed file system designed to store and process large datasets [3]. HDFS is scalable and fault-tolerant and is organized on low-cost hardware. It provides efficient access to application data and is suitable for applications that have large data sets, offering a stable storage layer for distributed applications. HDFS has a master/slave architecture consisting of three main components [4]: NameNode, DataNodes and clients, as shown in Figure 1.
1. NameNode: a master server that manages the file system namespace and regulates access to files by clients.
2. DataNodes: a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on.
3. Clients: applications that access the file system; they obtain metadata from the NameNode and then read from or write to the DataNodes directly.
Figure 1: Components of HDFS
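To illustrate how a client interacts with these components, here is a minimal sketch using the standard HDFS Java FileSystem API: metadata operations (file creation, block-size lookup) go through the NameNode, while the file bytes themselves are streamed to DataNodes. The file path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);              // client contacts the NameNode for metadata
    Path file = new Path("/user/demo/sample.txt");     // hypothetical path
    try (FSDataOutputStream out = fs.create(file)) {   // data is streamed to DataNodes
      out.writeUTF("hello hdfs");
    }
    System.out.println("Default block size: " + fs.getDefaultBlockSize(file) + " bytes");
    fs.close();
  }
}
```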
However, storing and accessing a large number of small files poses a big challenge to HDFS because of two main issues:
1. The NameNode memory is heavily consumed by large numbers of files.
2. File correlations are not considered for data placement.
Based on this analysis of the small file problem, efficient approaches have been designed for HDFS to reduce the memory consumption of the NameNode and to improve the storage and access efficiency of small files.
II. Small File Handling Mechanisms
The three common mechanisms are HAR, Sequence Files and TLB-MapFile [5].
a. HADOOP ARCHIVE FILES:
A Hadoop Archive (HAR) is an archiving facility that packs files into HDFS blocks efficiently; thus HARs are used to tackle the small file handling problem in Hadoop. HAR files are created from a collection of files, and the archiving tool runs a MapReduce job to process the input files in parallel and create the archive file.
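As a hedged illustration, assuming an archive has already been created with the standard `hadoop archive` tool (e.g. `hadoop archive -archiveName small.har -p /user/demo/input /user/demo/archives`), the files inside it can be listed and read transparently through the `har://` file-system scheme. The archive path below is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical archive created beforehand by the hadoop archive tool.
    Path har = new Path("har:///user/demo/archives/small.har");
    FileSystem harFs = har.getFileSystem(conf);        // resolves to the HAR file system
    for (FileStatus status : harFs.listStatus(har)) {  // original files appear as normal entries
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
  }
}
```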
Advantages:
1. Hadoop Archives (HAR) can be used to address the namespace limitations associated with storing many small files.
2. HAR packs a number of small files into larger files so that the original files can be accessed transparently.
3. HAR increases the scalability of the system by reducing the namespace usage and decreasing the operation load on the NameNode.
4. This improvement is orthogonal to memory optimization in the NameNode and to distributing namespace management across multiple NameNodes.
5. Hadoop Archive is also compatible with MapReduce; it allows parallel access to the original files by MapReduce jobs.
Limitations of HAR files:
1. Once an archive is created, it cannot be updated to add or remove files; the archived files are immutable.
2. The archive holds a copy of all the original files, so once a .har is created it takes as much space as the original files; .har files are not compressed.
3. When a .har file is given as input to a MapReduce job, the small files inside it are processed individually by separate mappers, which is inefficient.
b. SEQUENCE FILES:
A Sequence File is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format, and the temporary results of maps are also stored using Sequence Files. The SequenceFile class provides Writer, Reader and Sorter classes for writing, reading and sorting respectively.
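The sketch below shows a typical write/read cycle with the SequenceFile Writer and Reader classes when packing small files: the original file name is used as the key and the raw file bytes as the value. The HDFS path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/demo/small-files.seq");   // hypothetical container file

    // Write: key = original small-file name, value = raw file bytes.
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path, Text.class, BytesWritable.class);
    try {
      byte[] content = "tiny file contents".getBytes("UTF-8");
      writer.append(new Text("file-0001.txt"), new BytesWritable(content));
    } finally {
      IOUtils.closeStream(writer);
    }

    // Read the key/value pairs back in insertion order.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      while (reader.next(key, value)) {
        System.out.println(key + " -> " + value.getLength() + " bytes");
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}
```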
Advantages
1. As binary files, these are more compact than text files.
2. They provide optional support for compression at different levels: record and block.
3. Files can be split and processed in parallel.
4. As HDFS and MapReduce are optimized for large files, Sequence Files can be used as containers for large numbers of small files, thus addressing Hadoop's drawback in processing huge numbers of small files.
5. They are extensively used in MapReduce jobs as input and output formats; internally, the temporary outputs of maps are also stored using the Sequence File format.
Limitations
1. Like other HDFS files, Sequence Files are append-only.
2. As they are specific to Hadoop, as of now only a Java API is available to interact with Sequence Files; multi-language access is not supported.
c. TLB-MapFile
TLB-MapFile consists of three parts:
• Small files merge module
• Audit log mining module
• Small files prefetching module
1. Small files merge module
By accessing the HDFS audit logs, TLB-MapFile obtains the access frequency of small files. Then, small files are merged into large files according to their level of access frequency. Finally, the merged file blocks are stored in HDFS.
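The following minimal sketch illustrates the general merge idea using Hadoop's MapFile class: small files under an input directory are packed into one MapFile keyed by file name. This is not the authors' implementation; in TLB-MapFile the grouping of files into a merged file follows the access frequencies mined from the audit log, and that grouping is assumed to have been decided beforehand.

```java
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SmallFileMergeSketch {
  // Merge the small files under 'inputDir' into one MapFile at 'mergedDir'.
  // MapFile.Writer requires keys in sorted order, so entries are indexed by file name.
  public static void merge(FileSystem fs, Configuration conf, Path inputDir, String mergedDir) throws Exception {
    Map<String, Path> sortedByName = new TreeMap<>();
    for (FileStatus status : fs.listStatus(inputDir)) {
      if (status.isFile()) sortedByName.put(status.getPath().getName(), status.getPath());
    }
    MapFile.Writer writer = new MapFile.Writer(conf, fs, mergedDir, Text.class, BytesWritable.class);
    try {
      for (Map.Entry<String, Path> e : sortedByName.entrySet()) {
        byte[] data = readFully(fs, e.getValue());
        writer.append(new Text(e.getKey()), new BytesWritable(data));  // key = small-file name
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }

  private static byte[] readFully(FileSystem fs, Path p) throws Exception {
    int len = (int) fs.getFileStatus(p).getLen();
    byte[] buf = new byte[len];
    try (FSDataInputStream in = fs.open(p)) {
      in.readFully(0, buf);
    }
    return buf;
  }
}
```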
2. Audit log mining module
This module analyzes the HDFS audit log, obtains the strength of the association between any two small files, and gets the access frequency of small files within a specified time window. The module also creates a TLB table and a high-frequency access table based on the correlation strength and the access frequency.
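To make the mining step concrete, the following minimal sketch counts per-file open operations from HDFS audit log lines. The log path and the `cmd=... src=...` field layout are assumptions based on the standard HDFS audit log format; the TLB table and the pairwise correlation computation described in [5] are not reproduced here.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuditLogFrequencySketch {
  // Matches the cmd and src fields of an HDFS audit log entry, e.g.
  // "... allowed=true ugi=alice ... cmd=open src=/data/small/f1.txt dst=null ..."
  private static final Pattern ENTRY = Pattern.compile("cmd=(\\S+)\\s+src=(\\S+)");

  public static Map<String, Integer> openCounts(String auditLogPath) throws Exception {
    Map<String, Integer> counts = new HashMap<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(auditLogPath))) {
      String line;
      while ((line = reader.readLine()) != null) {
        Matcher m = ENTRY.matcher(line);
        if (m.find() && "open".equals(m.group(1))) {
          counts.merge(m.group(2), 1, Integer::sum);   // access frequency per file path
        }
      }
    }
    return counts;   // input for the merge order and the high-frequency access table
  }
}
```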
3. Small file prefetching module
This module looks up, in the TLB table, the mapping information between blocks and the small files being read. If the mapping relationship is found, the location of the small file is determined directly and its content is read.
Meanwhile, the module pre-reads correlated small files into the cache according to their association strength, and the number of small files to pre-read can be controlled by the ratio between the user's waiting-time threshold and the reading time of a small file.
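As a simple illustration of that last point, the prefetch depth can be bounded by the ratio of the waiting-time threshold to the average small-file read time; the parameter names and the example values below are assumptions, not figures from [5].

```java
// Hypothetical sketch: cap the number of correlated small files pre-read into the
// cache by the ratio of the user's acceptable waiting time to the average read
// time of one small file.
public final class PrefetchBudget {
  public static int maxFilesToPrefetch(long userWaitThresholdMs, long avgSmallFileReadMs) {
    if (avgSmallFileReadMs <= 0) return 0;                    // avoid division by zero
    return (int) (userWaitThresholdMs / avgSmallFileReadMs);  // e.g. 200 ms / 25 ms -> 8 files
  }
}
```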
Advantages of TLB-MapFile
1. TLB-MapFile improves NameNode memory efficiency.
2. It also improves the file download scheme in the case of a massive number of files.
3. File reading speed is increased.
III. Comparison between the three small file handling mechanisms

| Functions | HAR | Sequence File | TLB-MapFile |
|---|---|---|---|
| Working | Provides an archiving facility to pack files into HDFS blocks | A flat binary file consisting of key/value pairs | Merges massive numbers of small files into large files based on high-frequency access log files |
| Purpose | To tackle the small file handling problem in HDFS | To tackle the small file handling problem in HDFS | To tackle the small file handling problem in HDFS |
| MapReduce | Uses MapReduce to efficiently create Hadoop archives | Used in MapReduce as an input/output format | Uses a MapFile to store and merge the mapping information |
| NameNode memory consumption | Reduces NameNode memory usage | Also reduces NameNode memory usage | Reduces NameNode memory consumption more efficiently than HAR |
| Reading efficiency of small files | Very low, as two index files and the data file must be read | Reading efficiency is still slower than HAR | Reading speed is high, as it enhances analysis speed |
| File updating | Files are immutable; once created they cannot be updated | Binary key/value pairs can be updated | The files are updated regularly |
| Support for compression | Does not support compression | Optional support | Supports compression |
| Retrieval efficiency | Higher | Based on the binary key processed | Higher than HAR |
| Download speed | Less compared to TLB-MapFile | Still less | Improves download speed |
| Capability to split | Splittable, hence used in MapReduce | Files can be split and processed in parallel | A fast table structure storing log files, which cannot be split |

Table 1: Comparison of file handling mechanisms
IV. Conclusion
Hadoop is a framework for handling big data. Hadoop processes big files efficiently but handles small files inefficiently. To overcome the problem of small file access in the HDFS layer of Hadoop, there are three mechanisms: HAR, Sequence Files and TLB-MapFile. This paper outlines the advantages and disadvantages of all three, and also compares the three small file handling mechanisms with respect to their working and efficiency in HDFS.
References:
[1] Apache Hadoop, Hadoop Distributed File System. http://hadoop.apache.org
[2] Apache Hadoop in Heterogeneous Distributed File System. http://www.cloudera.com/hadoop/
[3] K. Kiruthika, E. Gothai, "An Efficient Approach for Storing and Accessing Small Files in HDFS," Karpagam Journal of Engineering Research (KJER), Volume II, Special Issue on IEEE Sponsored International Conference on Intelligent Systems and Control (ISCO'15).
[4] Deeksha S, R. Kanagavalli, Dr. Kavitha K S, Dr. Kavitha C, "Efficient Resolution for the NameNode Memory Issue for the Access of Small Files in HDFS," International Research Journal of Engineering and Technology (IRJET), Volume 04, Issue 04, Apr 2017.
[5] Bing Meng, Wei-bin Guo, Gui-sheng Fan, "A Novel Approach for Efficient Accessing of Small Files in HDFS: TLB-MapFile," IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), July 2016.