Big Data Testing Using Hadoop Platform
Tushar Kumar Sharma1, Chirag Jindal2, Akhil Saini3, Satyam Gupta4
Computer Science and Engineering, Chandigarh University, Mohali, India
-------------------------------------------------------------------------***------------------------------------------------------------------------
Abstract — Big data analysis has emerged as a crucial
technology in recent years due to the exponential growth
of data generated from various sources. This data can come
in structured, unstructured, or semi-structured formats
and is generated from diverse channels such as social
media platforms, smart city sensors, e-commerce websites,
and numerous applications. These vast amounts of data
encompass a wide range of formats, including text, images,
audio, and video. Hadoop provides a comprehensive
ecosystem of tools and frameworks that enable efficient
storage and processing of big data. One of the key
components of Hadoop is the Hadoop Distributed File
System (HDFS), which is designed to store and manage
data across a cluster of commodity hardware. HDFS breaks
down large files into smaller blocks and replicates them
across multiple nodes, ensuring data reliability and
availability. MapReduce allows for parallel computation of
data across a cluster of machines, making it suitable for
processing large-scale datasets. By breaking down complex
tasks into smaller subtasks and distributing them across
multiple nodes, MapReduce enables faster and more
efficient data processing. As big data analysis continues to
evolve, Hadoop has expanded its ecosystem with various
technologies to enhance its capabilities. These include
YARN (Yet Another Resource Negotiator), which serves as
the cluster resource management framework, enabling
efficient allocation of computing resources for different
applications. Pig, an abstraction layer on top of Hadoop,
provides a high-level language called Pig Latin for
expressing data analysis tasks. MRjob is a Python
framework that simplifies the development of MapReduce
jobs. Zookeeper is a centralized service for maintaining
configuration information, synchronization, and naming
services. Hive offers a data warehouse infrastructure and a
query language called HiveQL for querying and analyzing
data stored in Hadoop. Apache Spark, a fast and general-
purpose data processing engine, is integrated with Hadoop
to provide faster in-memory computation capabilities. In
this paper, our focus is on exploring big data analysis and
demonstrating how Hadoop, along with its associated
technologies, can be used for analyzing, storing, and
processing large volumes of data. By leveraging the power
of Hadoop's distributed architecture and the
complementary tools within its ecosystem, organizations
can effectively harness the potential of big data and derive
valuable insights for various applications and industries.
Keywords— Big Data, Hadoop, HDFS (Hadoop Distributed File System), MapReduce, Software Testing, YARN, Pig, MRjob, Zookeeper, Hive, Apache Spark, HBase

I. INTRODUCTION
In recent years, information technology systems have played a major role in handling organizational data and providing business insights. This information can come from sectors such as education, traffic, healthcare, or commerce. The data can be structured or semi-structured and can take the form of text, images, audio, video, and log files. The amount of data to be stored and processed is often in terabytes.[15] It becomes extremely hard for a single system to manage such big data.
Big Data is commonly defined in terms of the 3 V's: volume, velocity, and variety.
Volume – The amount of data produced by organizations in the social media, healthcare, education, and business sectors is ever increasing. Big companies like Google and Facebook process information in petabytes. IoT (sensor) data is also growing day by day, so it becomes difficult to manage such big data using traditional systems.
Variety – Data in today's world is divided into structured (schemas, columns), semi-structured (JSON, emails, XML), and unstructured (images, videos, audio) categories. This data can also be raw, and converting it into useful information with traditional analytical systems requires heavy processing.
Velocity – This concept defines the speed at which data arrives from its source and the speed with which it is processed. The volume of incoming data is huge, and it needs to be processed at a comparable speed. Companies like Google and Facebook process petabytes of data on a daily basis.[23] Certain software is used to operate and manage this big data. This paper focuses on one such framework, Hadoop, and how it is used in big data testing. The paper first describes what big data is and then moves on to technologies like Hadoop and its components, HDFS and MapReduce. Several budding technologies also come into play, such as YARN, Pig, Zookeeper, Hive, Apache Spark, and HBase. YARN enables the processing and running of batch, stream, interactive, and graph workloads on data stored in HDFS.[27] Hive and Pig are two integral
parts of the Hadoop ecosystem, but there are some key
differences between them.
Fig 1: 3V’s of Big Data
Spark is a free-to-use framework dedicated to providing interactive querying, machine learning, and real-time workloads. Apache Zookeeper is a service that provides operational support to Hadoop clusters, offering a distributed configuration service, a synchronization tool, and a naming registry.[7] Finally, Apache HBase is a free-to-use, distributed NoSQL store for big data that provides consistent real-time access to zettabytes of data and is effective for managing large sparse datasets.[5] The paper focuses on the practical usage of these technologies and compares their performance.
However, with the growing amount of business data being stored, this technology has also become quite vulnerable to attacks. The paper also describes the challenges, security measures, and vulnerabilities of this architecture and how to tackle them.
II. CHALLENGES ASSOCIATED WITH BIG DATA

With the humongous growth in big data technology, there is a massive influx of unstructured data in various forms and sizes, which is quite difficult to handle. Hence, several of the existing problems related to big data, such as analysis methods and applications, storage models, and privacy and security, are discussed below.
A. Analysis methods and Applications
It is important to analyze big data to gain meaningful and useful insights. Handling complex, inconsistent, and unstructured data requires an accurate method, and working with such data poses a further problem for data scientists. There are methods such as inquisitive, predictive, prescriptive, and pre-emptive analytics, but the skills, tools, and knowledge required to work with them are limited.[21]
Some of the challenges in dealing with big data are live or on-the-spot data transfer, situation discovery, research, insight gaining, and responses. Applying them efficiently requires considerable knowledge. Conventional applications are limited by processing limitations, high computational cost, and memory constraints.
B. Storage models
Millions of new IoT devices are being connected to the internet every day. Social media apps like Instagram and Facebook are explosive contributors to big data. Following this burst in data, many companies are switching from traditional data storage systems to cloud storage.[9] It is still a challenge to upload data to cloud storage in real time because of its volume. During real-time uploading to the cloud from IoT devices associated with flood and debris-flow management, for example, the sensors may send large chunks of data, which makes timely processing computationally challenging.
C. Privacy and data security
Despite the rise in popularity of big data analytics, the problem of how to process large amounts of data while maintaining privacy is still unresolved. This is because data are decentralized and originate from a variety of sources, including sensors, mobile devices, and Internet of Things (IoT) devices. Privacy and security issues have also arisen from the analysis of heterogeneous data sources as a result of contact with other external systems.[9] It is also essential to ensure that the source itself is not subject to attacks. Illegal transfer of backup data, for instance, is a major worry in big data research for the healthcare industry; to prevent further breaches of patient security and privacy, such situations must be fully rectified. These events force the big data community to reconsider analytics and data privacy.
III. RELATED WORK
A. Hadoop and budding tools
Hadoop, developed by the Apache Software Foundation, is a distributed computing framework that enables large-scale data processing on commodity hardware. It is widely used by global corporations such as eBay, Facebook, and Yahoo. Hadoop comprises several components, including HDFS for storage, MapReduce for data processing, and YARN for cluster resource management. Initially created in 2005 to support Yahoo's Nutch search engine project, Hadoop was published in 2011 and can scale up to 4000 nodes per
cluster, enabling fast processing of vast amounts of
data.[13] To enhance security and administration, vendor-
specific Hadoop distributions have been developed. The
key components of Hadoop include the Name Node, Data
Node, Job Tracker, and Task Tracker. However, Hadoop's
lack of internal cluster security makes it susceptible to
threats.
B. Hadoop Distributed File System (HDFS)
HDFS is designed to efficiently handle large-scale data
distributed across clusters of commodity computers. This
data can originate from various sources such as IoT
sensors, social media platforms, and e-commerce servers.
HDFS breaks big data down into smaller blocks, typically 128 MB in size, and distributes multiple copies of each
block across the cluster for fault tolerance. If a machine
fails, the data can be retrieved from other machines.
Fig 2: HDFS Nodes
The architecture of HDFS consists of three main
components: the Name Node, Client Node, and Data Nodes.
The Name Node keeps track of the files present on the Data
Nodes and their locations within the cluster. However, the
Name Node itself does not store data. The Client Node is
responsible for reading and writing files in HDFS.[34]
When reading a file, the Client Node requests the file's
location from the Name Node, which provides the
information needed to access the file from the Data Nodes.
When writing a file, the Client Node informs the Name
Node to create a new file, and the data is then sent to the
Data Nodes, which communicate with each other to create
backups of the file. Various methods can be used to
interact with HDFS, including command-line interface, user
interface (such as Ambari), Java interface, and HTTP/HDFS
proxies. The command-line interface offers several
commands for working with HDFS.
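For illustration, a few representative HDFS shell commands are listed below in the same style as the rest of this paper (the file and directory paths are illustrative, not taken from the experiment later in this paper):

List a directory: hadoop fs -ls /user/hadoop

Copy a local file into HDFS: hadoop fs -put ratings.txt /user/hadoop/ratings.txt

Print a file's contents: hadoop fs -cat /user/hadoop/ratings.txt

Copy a file back to local disk: hadoop fs -get /user/hadoop/ratings.txt .

Delete a file: hadoop fs -rm /user/hadoop/ratings.txt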
Fig 3: Functions of different nodes in HDFS
C. MapReduce
MapReduce is a parallel programming paradigm and
processing approach for building distributed applications
that efficiently handle large volumes of data on clusters of
shared hardware.[35] It was originally developed at Google and is implemented in Hadoop, an open-source platform developed by Apache and written primarily in Java.
The MapReduce algorithm consists of two key elements:
Map and Reduce. Using tuples to represent each element,
the Map operation transforms one collection of data into
another, while the Reduce operation combines the output
of the Map into a smaller set of tuples. The MapReduce
paradigm allows easy scaling of data processing across
multiple CPU nodes. Mappers and reducers are the basic
components used in the MapReduce model for data
processing. A MapReduce application can be easily scaled
by updating the configuration, allowing it to run on
hundreds or even thousands of servers in a cluster.
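To make the Map and Reduce roles concrete, the following single-machine Python sketch imitates the classic word-count pattern; it is a conceptual illustration of the paradigm only, not Hadoop code, and all names in it are ours:

from collections import defaultdict

def map_phase(lines):
    # Map: turn each input record into (key, value) tuples.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group values by key; Reduce: combine each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    for key, values in groups.items():
        yield (key, sum(values))

lines = ["big data big clusters", "data processing"]
print(dict(reduce_phase(map_phase(lines))))
# prints {'big': 2, 'data': 2, 'clusters': 1, 'processing': 1}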
Fig 4: MapReduce flow chart
The MapReduce program's execution consists of two
stages: the Map stage and the Reduce stage.[28] The Map
stage processes the input data, typically stored in the Hadoop Distributed File System (HDFS), and passes it record by record to the mapper function, producing intermediate
data. The Reduce stage combines the intermediate outputs
and produces a final output, which is then stored back in
HDFS. MapReduce follows a master-slave paradigm, where
the master (JobTracker) divides the problem into jobs and
sends them to various Data Nodes (TaskTracker) for
parallel processing. The map jobs often run on the same
cluster nodes as the processed data, and if a node is
overloaded, another node close to the data is selected. The
intermediate results of the Map and Shuffle phases are
combined to reduce the amount of data transferred
between map tasks and reduce tasks. While the input data
and the final output are stored in HDFS, the intermediate
results are stored in the local file systems of the Data
Nodes.
The JobTracker is responsible for monitoring the Data
Nodes, managing the TaskTrackers, and ensuring the
completion of jobs. If a task fails or a Data Node becomes
unresponsive, the JobTracker restarts the job on another
server, preferably one with a copy of the data. Additionally,
if a task is running slowly, the JobTracker can initiate
speculative execution by starting the same task on a
different server to complete the job within the allocated
time.
D. MrJob
MrJob is a Python framework that simplifies the
development and execution of MapReduce tasks on
Hadoop servers. It provides a high-level API for creating
MapReduce jobs and abstracts the complexities of Hadoop.
Some key features and benefits of MrJob include:
i. Compatibility with various runtime environments, including Hadoop clusters, Amazon Elastic MapReduce (EMR), local machines, and Docker containers.
ii. High-level API that eliminates the need for dealing with
low-level Hadoop details.
iii. Built-in test framework for testing MapReduce tasks
before deploying them to a Hadoop cluster.
To create and run a job using MrJob, one can use
commands such as creating a new job, running a job locally
or on a Hadoop cluster, and testing a job locally.[19]
Creating a new job: python mrjob create-job my_job
Running a job locally: python my_job.py -r local input_file

Running a job on a Hadoop cluster: python my_job.py -r hadoop input_file

Running a job on Amazon Elastic MapReduce: python my_job.py -r emr s3://input_bucket/input_file
Testing a job locally: python my_job_test.py
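For completeness, a minimal MrJob job definition is sketched below. It is the standard word-count example, not the movie-rating job used later in this paper; saved as my_job.py, it can be run with any of the commands listed above.

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the counts emitted for each word.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()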
E. YARN
YARN, which stands for Yet Another Resource Negotiator,
is a cluster management technology used in Hadoop for
resource allocation and application scheduling. It offers the
following characteristics and advantages:
i. Resource management: YARN enables efficient
sharing of cluster resources among multiple
applications.
ii. Centralized scheduler: YARN includes a centralized
scheduler that controls resource allocation and
schedules applications on the cluster.
iii. Fault tolerance: YARN incorporates fault tolerance
mechanisms to ensure uninterrupted application
execution even in the presence of node failures.
iv. Scalability: YARN is designed to handle large
clusters with thousands of nodes and millions of
tasks.
To interact with YARN, various commands can be used,
such as starting the ResourceManager, starting the
NodeManager, submitting MapReduce or Spark jobs, and
monitoring application progress.
Start ResourceManager: yarn resourcemanager
Start NodeManager: yarn nodemanager
Submit MapReduce job: yarn jar mapreduce.jar MyJob
input output
Submit Spark job: spark-submit --master yarn --deploy-mode cluster myapp.py input output

Monitor application progress: yarn application -status <application_id>
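Two further standard YARN commands are useful when checking on jobs:

List running applications: yarn application -list

Fetch the logs of an application: yarn logs -applicationId <application_id>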
F. Pig
Pig is a framework for analyzing large datasets using Pig
Latin, a high-level scripting language. Pig Latin provides a
simplified syntax for expressing data analysis tasks,
allowing users to focus on the analysis logic rather than
low-level implementation details.[9] Pig handles data
storage and processing complexities, making data handling
easier. It offers a scripting interface for conducting data
analysis jobs and is designed to operate on large clusters
and handle petabytes of data in a distributed manner.
Some common Pig commands include loading data from a
file, filtering data, grouping and aggregating data, and
writing data to a file.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 05 | May 2023 www.irjet.net p-ISSN: 2395-0072
© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 1524
Start Pig shell: pig
Load from a file: mydata = LOAD 'mydata.txt' USING PigStorage(',') AS (col1:int, col2:chararray, col3:double);

Filter data: mydata_filtered = FILTER mydata BY col1 > 100;

Group & aggregate data: mydata_grouped = GROUP mydata BY col2; mydata_aggregated = FOREACH mydata_grouped GENERATE group, AVG(mydata.col3);
Write to a file: STORE mydata_aggregated INTO 'output'
USING PigStorage();
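Besides the interactive shell, a Pig Latin script can be saved to a file and executed in batch mode (myscript.pig is an illustrative file name):

Run a script on the cluster: pig myscript.pig

Run a script locally without a cluster: pig -x local myscript.pig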
G. Hive
Hive is a data warehousing utility that provides a SQL-like interface for querying and analyzing large Hadoop datasets. Hive translates SQL-like queries into MapReduce tasks, enabling developers and analysts to interact with
Hadoop using familiar SQL syntax. Key characteristics and
benefits of Hive include:
i. SQL-like interface: Hive offers a user-friendly SQL-like
interface that simplifies the creation and execution of
queries on large databases.
ii. Scalability: Hive is designed to handle big clusters and
process petabytes of data in a distributed and parallel
manner.
iii. Data warehousing features: Hive supports features like
table creation and partitioning for effective data
warehousing.
To use Hive, one can start the Hive shell, create tables, load
data into tables, and run queries using SQL syntax.
Starting the Hive shell: hive
Creating a table: CREATE TABLE mytable (col1 INT, col2
STRING, col3 DOUBLE);
Loading data into a table: LOAD DATA LOCAL INPATH
'data.txt' INTO TABLE mytable;
Running a query: SELECT col1, AVG(col3) FROM
mytable GROUP BY col1;
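Since partitioning is mentioned above, a minimal sketch of a partitioned table follows (the table, column, and file names are illustrative):

Creating a partitioned table: CREATE TABLE sales (id INT, amount DOUBLE) PARTITIONED BY (yr INT);

Loading data into one partition: LOAD DATA LOCAL INPATH 'sales2023.txt' INTO TABLE sales PARTITION (yr=2023);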
Overall, these tools and technologies, such as Hadoop,
HDFS, MapReduce, MrJob, YARN, Pig, and Hive, provide
powerful capabilities for distributed data processing and
analysis, making it easier to handle large datasets.
IV. IMPLEMENTATION OF BIG DATA PROCESSING

This paper runs a MapReduce job in a sandbox environment to better understand the usage of Hadoop HDFS and MapReduce commands. A sample movie dataset is used, on which various functions and commands are applied. With the help of the Ambari (Hadoop user interface) bash command line, we type the following HDFS commands (a third, illustrative step is shown after this list):

i. hadoop fs -ls (lists files in the cluster)

ii. hadoop fs -mkdir ml-100k (makes a new folder named ml-100k)
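The dataset file itself would then be uploaded into the new folder with a command of the following form (u.data is the ratings file name in the MovieLens 100k distribution; the exact name depends on the dataset copy used):

iii. hadoop fs -put u.data ml-100k/u.data (uploads the local ratings file into the new folder)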
Fig 5: MrJob commands in Ambari Bash Shell
With the help of the nano text editor, we input the functions for sorting and summing the movie ratings according to the user ID. This task is completed with the help of a Python library called MRjob. Below are the functions of MRjob (a sketch is given after Fig 6):
Fig 6: MrJob functions in nano-text editor
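The exact functions are shown in Fig 6. For readers who cannot view the figure, a minimal sketch of what such a job typically looks like is given below; it assumes the MovieLens u.data format (tab-separated user ID, movie ID, rating, timestamp), and the class and function bodies are our approximation, not necessarily the code in the figure:

from mrjob.job import MRJob

class RatingsByUser(MRJob):
    def mapper(self, _, line):
        # u.data format: user_id <tab> movie_id <tab> rating <tab> timestamp
        user_id, movie_id, rating, timestamp = line.split('\t')
        yield user_id, float(rating)

    def reducer(self, user_id, ratings):
        # Sum all ratings given by this user; Hadoop's shuffle
        # delivers the keys (user IDs) in sorted order.
        yield user_id, sum(ratings)

if __name__ == '__main__':
    RatingsByUser.run()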
Once the functions are defined in the editor, we save the Python file.[12] Then we run the MRjob Python file on the movie dataset to view the required results:
Fig 7: Required Result
Fig 8: Required Result
After running the Python commands, we get the MapReduce result from the dataset, which contains the sorted (ascending order) ratings of the movies (right) according to the user IDs (left).

V. CONCLUSION AND FUTURE WORK

Big Data is a type of data whose size, variety, and complexity necessitate new designs, methods, algorithms, and analytics in order to handle it and derive value and hidden knowledge from it. Data is now produced from a variety of sources and can enter the system at varying speeds.[4] Today, processing huge amounts of data is a major problem. We covered the Hadoop tool for Big Data in depth in this article. Hadoop is the foundational framework for structuring Big Data and solving the issue of making it usable for analytics. We also discussed some Hadoop components that help with the handling of big data sets in distributed computing settings. In the future, we can apply some clustering methods and test their performance.
Hadoop offers several advantages for big data testing, including scalability, fault tolerance, and support for different data types. However, due to the platform's complexity and the diversity of available tools, testing big data on Hadoop can be difficult. To guarantee the accuracy and dependability of big data processed with Hadoop, it is critical to have a comprehensive grasp of its architecture, testing methods, and tools.
As the demand for big data analytics grows rapidly across various sectors, the potential scope of big data testing using Hadoop is significant. As more businesses implement big data technologies, the need for efficient testing strategies and tools grows.[17] Because of its scalability, fault tolerance, and support for different data types, Hadoop has become a common tool for big data handling and analysis. As Hadoop evolves, it is anticipated to provide more sophisticated testing methods and instruments to guarantee the accuracy and dependability of big data.
REFERENCES

[1] Bhosale, H. S., Gadekar, P. D. (2014). "A Review Paper on Big Data and Hadoop." International Journal of Scientific and Research Publications, 4(10).
[2] Smitha, T., Kumar, V. S. (2013). "Application of Big Data
in Data Mining." International Journal of Emerging
Technology and Advanced Engineering, 3(7).
[3] IBM Big Data Analytics HUB. (n.d.). "Four V's of Big
Data." Retrieved from
www.ibmbigdatahub.com/infographic/four-vs-big-data.
[4] Mridul, M., Khajuria, A., Dutta, S., Kumar, N. (2014).
"Analysis of Big Data using Apache Hadoop and
MapReduce." International Journal of Advance Research in
Computer Science and Software Engineering, 4(5).
[5] Apache Hadoop Project. (n.d.). "Apache Hadoop."
Retrieved from http://hadoop.apache.org/.
[6] Smitha, T., Sundaram, V. (2012). "Classification Rules
by Decision Tree for Disease Prediction." International
Journal for Computer Applications, 43(8).
[7] Baker, N. (2017). "3 Universities That Are Using Big
Data - QS." Retrieved from https://www.qs.com/3-
universities-thatare-using-big-
data/?fbclid=IwAR2eH_8StopgXHIlnGn5IQvnwIKmDQVM_
xTWZc5oq6FDp4_vwNkUzG4xZOQ.
[8] Information Technology Research and Development
Centre, University of Kufa. (2018). "E-learning server
statistics." Retrieved from
http://elearning.uokufa.edu.iq/?page_id=100.
[9] Parsola, J., et al. (2018). "Post Event Investigation of
Multi-stream Video Data Utilizing Hadoop Cluster."
International Journal of Electrical and Computer
Engineering, 8, 5089.
[10] Apache Hadoop. (2018). "Hadoop – Apache Hadoop
2.9.2." Retrieved from
https://hadoop.apache.org/docs/r2.9.2/.
[11] White, T. (2015). "Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale." Fourth Edition. O'Reilly Media.
[12] Insights SAS. (n.d.). "What is Hadoop?" Retrieved from
https://www.sas.com/en_us/insights/big-
data/hadoop.html.
[13] Yang, H. C., Dasdan, A., Hsiao, R. L., Parker, D. S.
(2007). "Map–reduce–merge: Simplified relational data
processing on large clusters." Proceedings of the 2007 ACM
SIGMOD International Conference on Management of Data.
[14] Hunt, P., Konar, M., Junqueira, F. P., Reed, B. (2010).
"ZooKeeper: Wait-free coordination for internet-scale
systems." Proceedings of the USENIX Annual Technical
Conference.
[15] Junqueira, F., Reed, B. (2013). "ZooKeeper: Distributed
Process Coordination." O'Reilly Media, Inc.
[16] Ranjan, R. (2014). "Streaming big data processing in
datacenter clouds." IEEE Cloud Computing, 1(1), 78-83.
[17] Bahga, A., Madisetti, V. (2014). "Internet of Things: A
Hands-On Approach." Vpt.
[18] Meng, X., et al. (2016). "MLlib: Machine learning in
Apache Spark." Journal of Machine Learning Research,
17(1), 1235-1241.
[19] Gonzalez, J. E., et al. (2014). "Graphx: Graph
processing in a distributed dataflow framework."
Proceedings of the 14th USENIX Symposium on Operating
Systems Design and Implementation.
[20] Kulkarni, S., et al. (2015). "Twitter Heron: Stream
processing at scale." Proceedings of the 2015 ACM
SIGMOD International Conference on Management of
Data.
[21] Aji, A., et al. (2013). "Hadoop GIS: A high-performance
spatial data warehousing system over MapReduce."
Proceedings of the VLDB Endowment, 6(11), 1009-1020.
[22] Apache Hadoop. (2016). Retrieved from
http://hadoop.apache.org/.
[23] Apache Hive. (2016). Retrieved from
https://hive.apache.org/.
[24] Apache Hadoop HDFS. (2016). Retrieved from
http://hadoop.apache.org/hdfs.
[25] Dean, J., Ghemawat, S. (2008). "MapReduce:
Simplified data processing on large clusters."
Communications of the ACM, 51(1), 107-113.
[26] White, T. (2012). "Hadoop: The Definitive Guide."
O'Reilly Media, Inc.
[27] Zaharia, M., et al. (2010). "Spark: Cluster computing
with working sets." HotCloud, 10(10-10), 95.
[28] Akbarinia, R., Pournaras, E., Aberer, K. (2013).
"Dynamic load balancing in distributed stream processing
systems." IEEE Transactions on Parallel and Distributed
Systems, 24(7), 1362-1371.
[29] Zaharia, M., et al. (2016). "Apache Spark: A unified
engine for big data processing." Communications of the
ACM, 59(11), 56-65.
[30] Kshemkalyani, A. D., Singhal, M. (2010). "Distributed
Computing: Principles, Algorithms, and Systems."
Cambridge University Press.
[31] Abadi, D. J., Chu, A., Eksombatchai, P. (2013). "The
power of comparative reasoning." Communications of the
ACM, 56(3), 70-77.
[32] Chen, Q., et al. (2014). "A survey of big data storage
and computational frameworks." Journal of Computer
Science and Technology, 29(2), 165-182.
[33] Qiu, M., et al. (2014). "Performance modeling and
analysis of big data processing in cloud systems." IEEE
Transactions on Parallel and Distributed Systems, 25(9),
2193-2203.
[34] Bhatia, R., Kumar, S., Goyal, P. (2013). "Hadoop: A
framework for big data analytics." International Journal of
Emerging Technology and Advanced Engineering, 3(3),
238-241.
[35] Vavilapalli, V. K., et al. (2013). "Apache Hadoop YARN:
Yet Another Resource Negotiator." Proceedings of the 4th
Annual Symposium on Cloud Computing, 5(7), 1-16.
More Related Content

Similar to Big Data Testing Using Hadoop Platform (20)

PDF
Big Data Summarization : Framework, Challenges and Possible Solutions
aciijournal
 
PDF
A Roadmap Towards Big Data Opportunities, Emerging Issues and Hadoop as a Sol...
Rida Qayyum
 
PDF
An Analytical Study on Research Challenges and Issues in Big Data Analysis.pdf
April Knyff
 
PDF
Big Data-Survey
ijeei-iaes
 
PPTX
How Big Data ,Cloud Computing ,Data Science can help business
Ajay Ohri
 
PDF
[IJCT-V3I2P32] Authors: Amarbir Singh, Palwinder Singh
IJET - International Journal of Engineering and Techniques
 
PDF
Moving Toward Big Data: Challenges, Trends and Perspectives
IJRESJOURNAL
 
DOCX
Encroachment in Data Processing using Big Data Technology
MangaiK4
 
PDF
Big Data Systems: Past, Present & (Possibly) Future with @techmilind
EMC
 
PDF
Research paper on big data and hadoop
Shree M.L.Kakadiya MCA mahila college, Amreli
 
PDF
TCS_DATA_ANALYSIS_REPORT_ADITYA
Aditya Srinivasan
 
PDF
Big Data
Kirubaburi R
 
PDF
DOCUMENT SELECTION USING MAPREDUCE Yenumula B Reddy and Desmond Hill
ClaraZara1
 
PDF
DOCUMENT SELECTION USING MAPREDUCE
ijsptm
 
PDF
IJSRED-V2I3P43
IJSRED
 
PDF
Elementary Concepts of Big Data and Hadoop
rahulmonikasharma
 
PDF
Influence of Hadoop in Big Data Analysis and Its Aspects
IJMER
 
PDF
Hadoop
Veera Sundari
 
PDF
Big Data Processing with Hadoop : A Review
IRJET Journal
 
PDF
IRJET- A Scenario on Big Data
IRJET Journal
 
Big Data Summarization : Framework, Challenges and Possible Solutions
aciijournal
 
A Roadmap Towards Big Data Opportunities, Emerging Issues and Hadoop as a Sol...
Rida Qayyum
 
An Analytical Study on Research Challenges and Issues in Big Data Analysis.pdf
April Knyff
 
Big Data-Survey
ijeei-iaes
 
How Big Data ,Cloud Computing ,Data Science can help business
Ajay Ohri
 
[IJCT-V3I2P32] Authors: Amarbir Singh, Palwinder Singh
IJET - International Journal of Engineering and Techniques
 
Moving Toward Big Data: Challenges, Trends and Perspectives
IJRESJOURNAL
 
Encroachment in Data Processing using Big Data Technology
MangaiK4
 
Big Data Systems: Past, Present & (Possibly) Future with @techmilind
EMC
 
Research paper on big data and hadoop
Shree M.L.Kakadiya MCA mahila college, Amreli
 
TCS_DATA_ANALYSIS_REPORT_ADITYA
Aditya Srinivasan
 
Big Data
Kirubaburi R
 
DOCUMENT SELECTION USING MAPREDUCE Yenumula B Reddy and Desmond Hill
ClaraZara1
 
DOCUMENT SELECTION USING MAPREDUCE
ijsptm
 
IJSRED-V2I3P43
IJSRED
 
Elementary Concepts of Big Data and Hadoop
rahulmonikasharma
 
Influence of Hadoop in Big Data Analysis and Its Aspects
IJMER
 
Big Data Processing with Hadoop : A Review
IRJET Journal
 
IRJET- A Scenario on Big Data
IRJET Journal
 

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
PDF
Kiona – A Smart Society Automation Project
IRJET Journal
 
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
PDF
Breast Cancer Detection using Computer Vision
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
Kiona – A Smart Society Automation Project
IRJET Journal
 
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
Breast Cancer Detection using Computer Vision
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 

Recently uploaded (20)

PPTX
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
PDF
Basic_Concepts_in_Clinical_Biochemistry_2018كيمياء_عملي.pdf
AdelLoin
 
PPTX
GitOps_Without_K8s_Training_detailed git repository
DanialHabibi2
 
PPTX
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
PPTX
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
PPTX
Product Development & DevelopmentLecture02.pptx
zeeshanwazir2
 
PPTX
Element 11. ELECTRICITY safety and hazards
merrandomohandas
 
PPTX
Evaluation and thermal analysis of shell and tube heat exchanger as per requi...
shahveer210504
 
PDF
Reasons for the succes of MENARD PRESSUREMETER.pdf
majdiamz
 
PPTX
Big Data and Data Science hype .pptx
SUNEEL37
 
PDF
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
PPTX
Worm gear strength and wear calculation as per standard VB Bhandari Databook.
shahveer210504
 
DOCX
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
PDF
smart lot access control system with eye
rasabzahra
 
PDF
Design Thinking basics for Engineers.pdf
CMR University
 
PDF
Halide Perovskites’ Multifunctional Properties: Coordination Engineering, Coo...
TaameBerhe2
 
PDF
MAD Unit - 1 Introduction of Android IT Department
JappanMavani
 
PPTX
MATLAB : Introduction , Features , Display Windows, Syntax, Operators, Graph...
Amity University, Patna
 
PDF
Viol_Alessandro_Presentazione_prelaurea.pdf
dsecqyvhbowrzxshhf
 
PPT
Carmon_Remote Sensing GIS by Mahesh kumar
DhananjayM6
 
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
Basic_Concepts_in_Clinical_Biochemistry_2018كيمياء_عملي.pdf
AdelLoin
 
GitOps_Without_K8s_Training_detailed git repository
DanialHabibi2
 
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
Product Development & DevelopmentLecture02.pptx
zeeshanwazir2
 
Element 11. ELECTRICITY safety and hazards
merrandomohandas
 
Evaluation and thermal analysis of shell and tube heat exchanger as per requi...
shahveer210504
 
Reasons for the succes of MENARD PRESSUREMETER.pdf
majdiamz
 
Big Data and Data Science hype .pptx
SUNEEL37
 
PORTFOLIO Golam Kibria Khan — architect with a passion for thoughtful design...
MasumKhan59
 
Worm gear strength and wear calculation as per standard VB Bhandari Databook.
shahveer210504
 
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
smart lot access control system with eye
rasabzahra
 
Design Thinking basics for Engineers.pdf
CMR University
 
Halide Perovskites’ Multifunctional Properties: Coordination Engineering, Coo...
TaameBerhe2
 
MAD Unit - 1 Introduction of Android IT Department
JappanMavani
 
MATLAB : Introduction , Features , Display Windows, Syntax, Operators, Graph...
Amity University, Patna
 
Viol_Alessandro_Presentazione_prelaurea.pdf
dsecqyvhbowrzxshhf
 
Carmon_Remote Sensing GIS by Mahesh kumar
DhananjayM6
 

Big Data Testing Using Hadoop Platform

  • 1. © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 1520 Big Data Testing Using Hadoop Platform Tushar Kumar Sharma1, Chirag Jindal2, Akhil Saini3, Satyam Gupta4 Computer Science and Engineering, Chandigarh University, Mohali India -------------------------------------------------------------------------***------------------------------------------------------------------------ Abstract — Big data analysis has emerged as a crucial technology in recent years due to the exponential growth of data generated from various sources. This data can come in structured, unstructured, or semi-structured formats and is generated from diverse channels such as social media platforms, smart city sensors, ecommerce websites, and numerous applications. These vast amounts of data encompass a wide range of formats, including text, images, audios, and videos. Hadoop provides a comprehensive ecosystem of tools and frameworks that enable efficient storage and processing of big data. One of the key components of Hadoop is the Hadoop Distributed File System (HDFS), which is designed to store and manage data across a cluster of commodity hardware. HDFS breaks down large files into smaller blocks and replicates them across multiple nodes, ensuring data reliability and availability. MapReduce allows for parallel computation of data across a cluster of machines, making it suitable for processing large-scale datasets. By breaking down complex tasks into smaller subtasks and distributing them across multiple nodes, MapReduce enables faster and more efficient data processing.As big data analysis continues to evolve, Hadoop has expanded its ecosystem with various technologies to enhance its capabilities. These include YARN (Yet Another Resource Negotiator), which serves as the cluster resource management framework, enabling efficient allocation of computing resources for different applications. Pig, an abstraction layer on top of Hadoop, provides a high-level language called Pig Latin for expressing data analysis tasks. MRjob is a Python framework that simplifies the development of MapReduce jobs. Zookeeper is a centralized service for maintaining configuration information, synchronization, and naming services. Hive offers a data warehouse infrastructure and a query language called HiveQL for querying and analyzing data stored in Hadoop. Apache Spark, a fast and general- purpose data processing engine, is integrated with Hadoop to provide faster in-memory computation capabilitiesIn this paper, our focus is on exploring big data analysis and demonstrating how Hadoop, along with its associated technologies, can be used for analyzing, storing, and processing large volumes of data. By leveraging the power of Hadoop's distributed architecture and the complementary tools within its ecosystem, organizations can effectively harness the potential of big data and derive valuable insights for various applications and industries. Keywords— Big Data, Hadoop, HDFS (Hadoop Distributed File System), MapReduce, Software Testing, Yarn, Pig, MRjob, Zookeeper, Hive, Apache Spark, Hive, HBase Recently, information technology systems have been playing a major role in handling and giving insights to organization’s business. This information can come from education, traffic, healthcare, or commerce sectors. This data can be structured or semi-structured and can be in the form of text, images, audio, video, and log files. 
The amount of data to be stored and processed is often in terabytes.[15] It becomes extremely hard for a single system to manage such big data. 3 V’s, on the basis of which Big Data is defined includes– volume, velocity and variety. Volume – The size of data from organizations in social media, healthcare, education, and business section comes is ever increasing. Big companies like Google and Facebook process information in petabytes. The IOT (sensors data) is also increasing day by day. So, it becomes difficult to manage such big data using traditions systems. Variety – data in today’s world is divided into structured (schema, columns), semi-structured (json, emails, xml) and unstructured (images, videos, audio) categories. This data can also be raw and requires heavy system work to convert it into useful information using traditional analytical systems. Velocity – this concept defines the speed at which the data arrives from source destination and the speed with which it is processed. The size of the incoming data is huge andit needs to be processed at a similar speed. Companies like Google, Facebook process petabytes of data on a daily basis.[23] There is certain software that come in use to operate and manage this big data. This paper focuses one such software called Hadoop and how it is used in big data testing. The paper first describes what is big data and then moves to technologies like Hadoop and its components –HDFS and MapReduce. There are several budding technologies which come into play like YARN, Pig, Spark, Zookeeper, Hive, Apache Spark and HBase. Yarn helps in enabling the processing and running of batch, stream, interactive, and graph data stored in HDFS.[27] Hive and Pig are two integral International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 05 | May 2023 www.irjet.net p-ISSN: 2395-0072 I. INTRODUCTION
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 05 | May 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 1521 parts of the Hadoop ecosystem, but there are some key differences between them. Fig 1: 3V’s of Big Data Spark is an free to use framework which is dedicated for providingto-way communicative questioning, machine learning, and live workloads. Apache Zookeeper is a service that provides operational support to Hadoop clusters, offering a distributed configuration service, a synchronization tool, and a naming registry.[7] Finally, Apache HBase is a free-to-use, NoSQL, distributed, warehouse for big data that provides consistent live access to zetabytes of data. It is effective for managing large sparse datasets. [5] The paper will focus on practical usage of these technologies and compare them performance wise. However, with growing number of business storage this technology has also become quite vulnerable to attacks. The paper will focus on describing challenges, security measures and vulnerability of this architectureand how to tackle them. With the humongous growth in the big data technology, there is massive rush of unstructured data in various forms and size. It becomes quite difficult to handle such data. Hence, many of the existing problems related to big data such as analysis methods and applications, storage models, privacy and security are discussed below. A. Analysis methods and Applications It is important to analyze big data to gain further meaningful and useful insights. To handle a complex, inconsistent and unstructured data, an accurate method is required. To work with such an inconsistent and complicated data, poses another problem for the data scientists. There are methods such as inquisitive, predictive, prescriptive, and pre-emptive analytics but the skills, tools and knowledge required to work on them are limited.[21] some of the challenges in dealing with big data are live or on-the-spot data transfer, situation discovery, research, insight gaining and responses. Their application require a more than sufficient knowledge to apply an efficient approach. Conventional applications are limited by processing limitations, high computational processing, and memory constraints. B. Storage models Millions of new IOT devices are being linked to the internet every day. Social media apps like Instagram, Facebook are an explosive contributor to the big data. Following this burst in data, many companies switch from traditional data storage systems towards cloud storage.[9] It is still a challenge to upload to cloud storage in real time because the data is huge. During the real time uploading of IOT devices associated with floods and debris flow management to the cloud, the sensors may be sending big chunks of data which can cause timely processing and processing computationally challenging. C. Privacy and data security Despite the rise in popularity of big data analytics, the problem of how to deal with a lot of data while maintaining privacy is still unresolved. Data are decentralized and originate from a variety of sources, including sensors, mobile devices, and Internet of Things (IoT) devices, which explains why. A privacy and security issue has also arisen from the analysis of heterogeneous data sources as a result of contact with other external systems.[9] It is also essential to ensure that the source is not subject to any attacks. 
Illegal transfer of backup data, for instance, becomes a major worry in big data research for the healthcare industry. To stop a new breach of patient security and privacy, this situation must be fully rectified. This event forces big data to reconsider analytics and developer privacy. III. RELATED WORK A. Hadoop and budding tools Hadoop, developed by the Apache Software Foundation, is a distributed computing framework that enables unrestricted access to large-scale data processing. It is widely used by global corporations such as eBay, Facebook, and Yahoo. Hadoop comprises several components, including HDFS for user interface and storage, MapReduce for data processing, and YARN for system management. Initially created in 2005 to support Yahoo's Nutch search engine project, Hadoop was published in 2011 and can scale up to 4000 nodes per cluster, enabling fast processing of vast amounts of data.[13] To enhance security and administration, vendor- specific Hadoop distributions have been developed. The key components of Hadoop include the Name Node, Data Node, Job Tracker, and Task Tracker. However, Hadoop's lack of internal cluster security makes it susceptible to threats. II. CHALLENGES ASSOCIATED WITH BIG DATA
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 05 | May 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 1522 B. Hadoop Distributed File System (HDFS) HDFS is designed to efficiently handle large-scale data distributed across clusters of commodity computers. This data can originate from various sources such as IoT sensors, social media platforms, and e-commerce servers. HDFS breaks down big data into smaller blocks, typically weighing 128MB, and distributes multiple copies of each block across the cluster for fault tolerance. If a machine fails, the data can be retrieved from other machines. Fig 2: HFDS Nodes The architecture of HDFS consists of three main components: the Name Node, Client Node, and Data Nodes. The Name Node keeps track of the files present on the Data Nodes and their locations within the cluster. However, the Name Node itself does not store data. The Client Node is responsible for reading and writing files in HDFS.[34] When reading a file, the Client Node requests the file's location from the Name Node, which provides the information needed to access the file from the Data Nodes. When writing a file, the Client Node informs the Name Node to create a new file, and the data is then sent to the Data Nodes, which communicate with each other to create backups of the file. Various methods can be used to interact with HDFS, including command-line interface, user interface (such as Ambari), Java interface, and HTTP/HDFS proxies. The command-line interface offers several commands for working with HDFS. Fig 3: Functions of different nodes in HDFS C. MapReduce MapReduce is a parallel programming paradigm and processing approach for building distributed applications that efficiently handle large volumes of data on clusters of shared hardware.[35] It was originally developed at Google and is powered by Hadoop, an open-source platform developed by Apache, primarily based on Java. The MapReduce algorithm consists of two key elements: Map and Reduce. Using tuples to represent each element, the Map operation transforms one collection of data into another, while the Reduce operation combines the output of the Map into a smaller set of tuples. The MapReduce paradigm allows easy scaling of data processing across multiple CPU nodes. Mappers and reducers are the basic components used in the MapReduce model for data processing. A MapReduce application can be easily scaled by updating the configuration, allowing it to run on hundreds or even thousands of servers in a cluster. Fig 4: MapReduce flow chart
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 05 | May 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 1523 The MapReduce program's execution consists of two stages: the Map stage and the Reduce stage.[28] The Map stage processes the input data, typically stored in Hadoop Distributed File System (HDFS), and passes it character by character to the mapper function, producing intermediate data. The Reduce stage combines the intermediate outputs and produces a final output, which is then stored back in HDFS. MapReduce follows a master-slave paradigm, where the master (JobTracker) divides the problem into jobs and sends them to various Data Nodes (TaskTracker) for parallel processing. The map jobs often run on the same cluster nodes as the processed data, and if a node is overloaded, another node close to the data is selected. The intermediate results of the Map and Shuffle phases are combined to reduce the amount of data transferred between map tasks and reduce tasks. While the input data and the final output are stored in HDFS, the intermediate results are stored in the local file systems of the Data Nodes. The JobTracker is responsible for monitoring the Data Nodes, managing the TaskTrackers, and ensuring the completion of jobs. If a task fails or a Data Node becomes unresponsive, the JobTracker restarts the job on another server, preferably one with a copy of the data. Additionally, if a task is running slowly, the JobTracker can initiate speculative execution by starting the same task on a different server to complete the job within the allocated time. D. MrJob MrJob is a Python framework that simplifies the development and execution of MapReduce tasks on Hadoop servers. It provides a high-level API for creating MapReduce jobs and abstracts the complexities of Hadoop. Some key features and benefits of MrJob include: i. Compatibility with various Hadoop versions, including Amazon Elastic MapReduce (EMR), local computers, and Docker containers. ii. High-level API that eliminates the need for dealing with low-level Hadoop details. iii. Built-in test framework for testing MapReduce tasks before deploying them to a Hadoop cluster. To create and run a job using MrJob, one can use commands such as creating a new job, running a job locally or on a Hadoop cluster, and testing a job locally.[19] Creating a new job: python mrjob create-job my_job Running a job locally: python my_job.py -r localinput_file Running a job on a Hadoop cluster: python my_job.py-r hadoop input_file Running a job on Amazon Elastic MapReduce: python my_job.py -r emr s3://input_bucket/input_file Testing a job locally: python my_job_test.py E. YARN YARN, which stands for Yet Another Resource Negotiator, is a cluster management technology used in Hadoop for resource allocation and application scheduling. It offers the following characteristics and advantages: i. Resource management: YARN enables efficient sharing of cluster resources among multiple applications. ii. Centralized scheduler: YARN includes a centralized scheduler that controls resource allocation and schedules applications on the cluster. iii. Fault tolerance: YARN incorporates fault tolerance mechanisms to ensure uninterrupted application execution even in the presence of node failures. iv. Scalability: YARN is designed to handle large clusters with thousands of nodes and millions of tasks. 
To interact with YARN, various commands can be used, such as starting the ResourceManager, starting the NodeManager, submitting MapReduce or Spark jobs, and monitoring application progress. petabytes of data in awidely distributed way. Start ResourceManager: yarn resourcemanager Start NodeManager: yarn nodemanager Submit MapReduce job: yarn jar mapreduce.jar MyJob input output Submit Spark job: spark-submit --master yarn -- deploy- mode cluster myapp.py input output Monitor application progress: yarn application - status <application_id> F. Pig Pig is a framework for analyzing large datasets using Pig Latin, a high-level scripting language. Pig Latin provides a simplified syntax for expressing data analysis tasks, allowing users to focus on the analysis logic rather than low-level implementation details.[9] Pig handles data storage and processing complexities, making data handling easier. It offers a scripting interface for conducting data analysis jobs and is designed to operate on large clusters and handle petabytes of data in a distributed manner. Some common Pig commands include loading data from a file, filtering data, grouping and aggregating data, and writing data to a file.
  • 5. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 05 | May 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 1524 Start Pig shell: pig Load from a file: mydata = LOAD 'mydata.txt' USING PigStorage(',') AS (col1:int, col2:chararray,col3:double); Filter data: mydata_filtered = FILTER mydata BY col1 > 100; Group & aggregate data: mydata_grouped = GROUP mydata BY col2; mydata_aggregated = FOREACH mydata_grouped GENERATE group,AVG(mydata.col3); Write to a file: STORE mydata_aggregated INTO 'output' USING PigStorage(); G. Hive Hive is a data storage utility that provides a SQL-like interface for querying and analyzing large Hadoop datasets. Hive translates SQL queries into MapReduce tasks, enabling developers and analysts to interact with Hadoop using familiar SQL syntax. Key characteristics and benefits of Hive include: i. SQL-like interface: Hive offers a user-friendly SQL-like interface that simplifies the creation and execution of queries on large databases. ii. Scalability: Hive is designed to handle big clusters and process petabytes of data in a distributed and parallel manner. iii. Data warehousing features: Hive supports features like table creation and partitioning for effective data warehousing. To use Hive, one can start the Hive shell, create tables, load data into tables, and run queries using SQL syntax. Starting the Hive shell: hive Creating a table: CREATE TABLE mytable (col1 INT, col2 STRING, col3 DOUBLE); Loading data into a table: LOAD DATA LOCAL INPATH 'data.txt' INTO TABLE mytable; Running a query: SELECT col1, AVG(col3) FROM mytable GROUP BY col1; Overall, these tools and technologies, such as Hadoop, HDFS, MapReduce, MrJob, YARN, Pig, and Hive, provide powerful capabilities for distributed data processing and analysis, making it easier to handle large datasets. This paper runs a MapReduce job using sandbox environment to better understand the usage of Hadoop HDFS and MapReduce commands. A sample movie dataset is used to apply various functions and commands on. With the help of Ambari (Hadoop user interface) bash command line we type the following HDFS commands: i. Hadoop fs -ls (lists files in the cluster) ii. Hadoop fs -mkdir ml-100k(makes a new folder named ml-100k) Fig 5: MrJob commands in Ambari Bash Shell With the help of nano text editor we input the functions for sorting and adding the movie ratings according to the user id. This task is completed with the help of a python library called MRjob. Below are the functions of MRjob: Fig 6: MrJob functions in nano-text editor Once the functions are defined in the editor we make sure to save the python file.[12] Then we run the MRjob python file on the movie dataset only to view the required results: Fig 7: Required Result IV. IMPLEMENTATION OF BIG DATA PROCESSING
After running the Python commands, we get the MapReduce output from the dataset, which contains the sorted (ascending order) ratings of the movies (right) according to the user ids (left).
Fig 8: Required Result
V. CONCLUSION AND FUTURE WORK
Big Data is a type of data whose size, variety, and complexity necessitate new designs, methods, algorithms, and analytics in order to handle it and derive value and hidden knowledge from it. Data is now produced from a variety of sources and can enter the system at varying speeds.[4] Today, processing huge amounts of data is a major challenge. We covered the Hadoop tool for Big Data in depth in this article. Hadoop is the foundational framework for structuring Big Data and solving the issue of making it usable for analytics. We also discussed some Hadoop components that help with the handling of large data sets in distributed computing settings. In the future, we can apply some clustering methods and test their performance.
Hadoop offers several advantages for big data testing, including scalability, fault tolerance, and support for different data types. However, due to the platform's complexity and the diversity of available tools, testing big data on Hadoop can be difficult. To guarantee the accuracy and dependability of big data processed with Hadoop, it is critical to have a comprehensive grasp of its architecture, testing methods, and tools. As the demand for big data analytics grows rapidly across various sectors, the potential scope of big data testing using Hadoop is significant. As more businesses implement big data technologies, the need for efficient testing strategies and tools grows.[17] Because of its scalability, fault tolerance, and support for different data types, Hadoop has become a common tool for big data handling and analysis. As Hadoop evolves, it is anticipated to provide more sophisticated testing methods and instruments to guarantee the accuracy and dependability of big data.
REFERENCES
[1] Bhosale, H. S., Gadekar, P. D. (2014). "A Review Paper on Big Data and Hadoop." International Journal of Scientific and Research Publications, 4(10).
[2] Smitha, T., Kumar, V. S. (2013). "Application of Big Data in Data Mining." International Journal of Emerging Technology and Advanced Engineering, 3(7).
[3] IBM Big Data Analytics Hub. (n.d.). "Four V's of Big Data." Retrieved from www.ibmbigdatahub.com/infographic/four-vs-big-data.
[4] Mridul, M., Khajuria, A., Dutta, S., Kumar, N. (2014). "Analysis of Big Data using Apache Hadoop and MapReduce." International Journal of Advance Research in Computer Science and Software Engineering, 4(5).
[5] Apache Hadoop Project. (n.d.). "Apache Hadoop." Retrieved from http://hadoop.apache.org/.
[6] Smitha, T., Sundaram, V. (2012). "Classification Rules by Decision Tree for Disease Prediction." International Journal for Computer Applications, 43(8).
[7] Baker, N. (2017). "3 Universities That Are Using Big Data - QS." Retrieved from https://www.qs.com/3-universities-that-are-using-big-data/.
[8] Information Technology Research and Development Centre, University of Kufa. (2018). "E-learning server statistics." Retrieved from http://elearning.uokufa.edu.iq/?page_id=100.
"Post Event Investigation of Multi-stream Video Data Utilizing Hadoop Cluster." International Journal of Electrical and Computer Engineering, 8, 5089. [10] Apache Hadoop. (2018). "Hadoop – Apache Hadoop 2.9.2." Retrieved from https://hadoop.apache.org/docs/r2.9.2/. [11] Tom, W. (2015). "Hadoop: The Definitive Guide." Fourth Edition The Definitive Guide Storage and Analysis at Internet Scale. [12] Insights SAS. (n.d.). "What is Hadoop?" Retrieved from https://www.sas.com/en_us/insights/big- data/hadoop.html. [13] Yang, H. C., Dasdan, A., Hsiao, R. L., Parker, D. S. (2007). "Map–reduce–merge: Simplified relational data processing on large clusters." Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. V. CONLUSION AND FUTURE WORK After running the Python commands we get the MapReduced result from the dataset which contains the sorted (ascending order) ratings of the movies (right) according to the user ids(left). REFERENCES
[14] Hunt, P., Konar, M., Junqueira, F. P., Reed, B. (2010). "ZooKeeper: Wait-free coordination for internet-scale systems." Proceedings of the USENIX Annual Technical Conference.
[15] Junqueira, F., Reed, B. (2013). "ZooKeeper: Distributed Process Coordination." O'Reilly Media, Inc.
[16] Ranjan, R. (2014). "Streaming big data processing in datacenter clouds." IEEE Cloud Computing, 1(1), 78-83.
[17] Bahga, A., Madisetti, V. (2014). "Internet of Things: A Hands-On Approach." VPT.
[18] Meng, X., et al. (2016). "MLlib: Machine learning in Apache Spark." Journal of Machine Learning Research, 17(1), 1235-1241.
[19] Gonzalez, J. E., et al. (2014). "GraphX: Graph processing in a distributed dataflow framework." Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation.
[20] Kulkarni, S., et al. (2015). "Twitter Heron: Stream processing at scale." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.
[21] Aji, A., et al. (2013). "Hadoop GIS: A high-performance spatial data warehousing system over MapReduce." Proceedings of the VLDB Endowment, 6(11), 1009-1020.
[22] Apache Hadoop. (2016). Retrieved from http://hadoop.apache.org/.
[23] Apache Hive. (2016). Retrieved from https://hive.apache.org/.
[24] Apache Hadoop HDFS. (2016). Retrieved from http://hadoop.apache.org/hdfs.
[25] Dean, J., Ghemawat, S. (2008). "MapReduce: Simplified data processing on large clusters." Communications of the ACM, 51(1), 107-113.
[26] White, T. (2012). "Hadoop: The Definitive Guide." O'Reilly Media, Inc.
[27] Zaharia, M., et al. (2010). "Spark: Cluster computing with working sets." HotCloud, 10(10-10), 95.
[28] Akbarinia, R., Pournaras, E., Aberer, K. (2013). "Dynamic load balancing in distributed stream processing systems." IEEE Transactions on Parallel and Distributed Systems, 24(7), 1362-1371.
[29] Zaharia, M., et al. (2016). "Apache Spark: A unified engine for big data processing." Communications of the ACM, 59(11), 56-65.
[30] Kshemkalyani, A. D., Singhal, M. (2010). "Distributed Computing: Principles, Algorithms, and Systems." Cambridge University Press.
[31] Abadi, D. J., Chu, A., Eksombatchai, P. (2013). "The power of comparative reasoning." Communications of the ACM, 56(3), 70-77.
[32] Chen, Q., et al. (2014). "A survey of big data storage and computational frameworks." Journal of Computer Science and Technology, 29(2), 165-182.
[33] Qiu, M., et al. (2014). "Performance modeling and analysis of big data processing in cloud systems." IEEE Transactions on Parallel and Distributed Systems, 25(9), 2193-2203.
[34] Bhatia, R., Kumar, S., Goyal, P. (2013). "Hadoop: A framework for big data analytics." International Journal of Emerging Technology and Advanced Engineering, 3(3), 238-241.
[35] Vavilapalli, V. K., et al. (2013). "Apache Hadoop YARN: Yet Another Resource Negotiator." Proceedings of the 4th Annual Symposium on Cloud Computing, 1-16.