Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data
Matrix
K.L.S. Soujanya (a), B. Shirisha (b), Challa Madhavi Latha (c)
(a) Professor, Department of CSE, CMR College of Engineering & Technology, Hyderabad, India.
klssoujanya@cmrcet.org
(b) M.Tech Student, Department of CSE, CMR College of Engineering & Technology, Hyderabad, India.
shirishabadithela@gmail.com
(c) Assistant Professor, Department of Information Technology, Faculty of Informatics, University of
Gondar, Gondar, Ethiopia. saidatta2009@gmail.com (Corresponding Author)
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 17, No. 5, May 2019
https://sites.google.com/site/ijcsis/
ISSN 1947-5500
Abstract - Data is the most valuable entity in today’s world, and it has to be managed. The huge volume
of available data must be processed to extract knowledge and make predictions. This huge data, in other
words big data, is available from various sources such as Facebook, Twitter, and many more. The
processing time taken by frameworks such as Spark, MapReduce, and the Hierarchical Distributed Matrix
(HHDM) is high. Hence the Hybrid Hierarchically Distributed Data Matrix (HHHDM) is proposed. This
framework is used to develop big data applications. In existing systems, developed programs are by
default roughly defined, and jobs are written without any functionality being described in a reusable
way. This also reduces the ability to optimize the data flow of job sequences and pipelines. To overcome
the problems of existing frameworks, we introduce the HHHDM method for developing big data processing
jobs. The proposed method is a hybrid method that has the advantages of the Hierarchical Distributed
Matrix (HHDM), which is a functional, strongly-typed abstraction for writing composable big data
applications. To improve the performance of executing HHHDM jobs, multiple optimizations are applied to
the HHHDM method. The experimental results show that processing time improves by 65-70 percent compared
to the existing technology, Spark.
Keywords: Big data processing, Optimization, Hybrid Hierarchically Distributed Data Matrix framework,
strongly-typed.
I. Introduction
Big data refers to the exponential growth and availability of data, and the term has become a buzzword
in the software industry. Large-scale data keeps growing year by year; recent research reports projected
that by 2020 the data stored in systems would reach the zettabyte and yottabyte scale. Problems of this
kind call for the development of novel solutions. The fundamental principle of the MapReduce framework
is to move the analysis to the data, rather than moving the data to a system that can analyze it. The
framework invites programmers to think in a data-centric fashion: they focus on applying
transformations to sets of data records, while the details of distributed execution and fault tolerance
are managed transparently by the framework. However, in recent years the requirements of applications
in the data analytics domain have kept increasing, and various limitations of the Hadoop framework have
been recognized. As a result, we have witnessed an unprecedented interest in tackling these challenges
with new solutions, which has constituted a new wave of mostly domain-specific, optimized big data
processing platforms.
In recent years, more methods have been presented that use distributed clusters of commodity machines
to handle ever-increasing data sets. Several frameworks (e.g., Spark) significantly reduce the
complexity of developing big data programs and applications.
The main challenges present in big data analytical applications are listed below :
• Real-time software applications and programs require a chain of operations for processing.
• Manual optimizations are time-consuming and error-prone. Merging, developing, and interaction
in big data programs are not natively supported.
• MapReduce and Spark jobs are roughly defined and carry no information about their
functionality. Because of this, the applications are not reusable.
To overcome the above challenges, a new framework, HHHDM, is proposed.
II. Literature Survey
Various approaches have been used to secure data sets with millions of records and to maintain their
efficiency and performance in the face of variety, velocity, and volume (Anju & Shyma, 2018). In the
recent past, the flow of data produced by various computations has increased, and processing is
shifting to large-scale data mechanisms. MapReduce is one of the most widespread methods for managing
big data; it helps reduce processing time and memory space and supports efficient parallel processing
of large data sets (Triguero et al., 2015). Qian et al. (2015) implemented hierarchical attribute
reduction algorithms using MapReduce. Manogaran et al. (2017) followed the same process to monitor
smart health care in a secure way. The majority of researchers have found that performance improved
for their proposed frameworks and methods implemented with the MapReduce data processing model.
However, MapReduce implementations have some limitations, which some researchers have addressed.
Mattew et al. (2018) overcome big data complications and limitations such as data storage,
partitioning, transformation, retrieval, extraction, and indexing by generating a training data set.
In intelligent transportation systems (ITS), big data plays a major role around the globe. ITS big data
has a major impact on the applications and design of ITS, making the systems more secure and efficient
(Li et al., 2018).
However, in recent years the requirements of applications in the data analytics domain have kept
increasing, and various limitations of the Hadoop framework have been recognized. This has established
a new trend of domain-specific systems that optimize big data processing and handle unprecedented
problems with new solutions (Sharif et al., 2013). Big data has become a widespread term in the
software industry, and large-scale data keeps growing year by year; recent research reports projected
that by 2020 the data stored in systems would reach the zettabyte and yottabyte scale. Storage-related
and data set problems can therefore be addressed with novel approaches (Sharif & Mohammad, 2014). Some
researchers have suggested hybrid models to solve various data set issues. Hybrid models integrate more
than one model, which is an efficient way to improve performance and to handle the problems and
weaknesses of data sets (Mohamad et al., 2016; Paradarami et al., 2017).
III. Proposed System
In order to overcome these challenges and improve the efficiency of big data processing, we
propose HHHDM, which integrates MapReduce with HHDM.
A. Introduction to HHDM :-
• HHDM is a functional core abstraction for developing, optimizing, and executing big data
programs and applications in parallel.
• HHDM is defined as HHDM[T, R]: a function whose input is of type T and whose output is of type R.
• Apart from this core function, an HHDM object also includes data dependencies, location,
functionality, and state.
• HHDM is strongly typed, lightweight, and functionally defined.
B. Representation of HHDM :-
Attributes of HHDM :
1. ID : The identifier of the HHDM. It must be unique within each HHDM context.
2. INTYPE and OUTTYPE : The input and output types, used to ensure type correctness during
program planning and execution.
3. CATEGORY : The node category of the HHDM, either DDM or DFM, used during program execution.
4. CHILDREN and DEPENDENCY : These attributes are used to reorder or reuse big data jobs.
5. LOCATION : The HHDM holds an address (URL) through which it can be accessed from anywhere.
6. FUNCTION : The core attribute of the HHDM; it defines how the output is computed.
7. STATE : The status of the program or application.
How these attributes are used depends on the programmer's design. Through these attributes, HHDM
is defined as
• Functional
• Portable
• Location-aware
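The attribute list above can be pictured as a small record type. The following is a minimal sketch, not the authors' implementation: the class name `HHDMNode`, the field names, the default state name `"declared"`, the helper `check_unique_ids`, and the example URLs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Category(Enum):
    DDM = "DDM"  # Distributed Data Matrix: leaf node that holds data
    DFM = "DFM"  # Distributed Functional Matrix: non-leaf node


@dataclass
class HHDMNode:
    """Illustrative record of the attributes listed above (names are assumptions)."""
    id: str                      # ID: unique within one HHDM context
    in_type: type                # INTYPE
    out_type: type               # OUTTYPE
    category: Category           # CATEGORY: DDM or DFM
    location: str                # LOCATION: address (URL) of the node
    function: Optional[Callable[[list], list]] = None  # FUNCTION
    children: List["HHDMNode"] = field(default_factory=list)  # CHILDREN
    dependency: str = "1:1"      # DEPENDENCY: 1:1, 1:N, N:1 or N:N
    state: str = "declared"      # STATE: hypothetical state name


def check_unique_ids(nodes: List[HHDMNode]) -> bool:
    """Return True when every ID is unique within the given context."""
    ids = [n.id for n in nodes]
    return len(ids) == len(set(ids))


# The URLs below are made-up placeholders.
lines = HHDMNode("lines", str, str, Category.DDM, "hdfs://example/input.txt")
words = HHDMNode(
    "words", str, str, Category.DFM, "hhdm://example/words",
    function=lambda xs: [w for x in xs for w in x.split()],
    children=[lines],
)
context_ok = check_unique_ids([lines, words])
```

Keeping the input and output types next to the function is what would let a planner reject a pipeline whose stages do not fit together before any data is read.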
Categories of HHDM :
An HHDM is an independent core object with a tree-based structure, which consists of the following two
types of nodes :
DDM (Distributed Data Matrix) :- The leaf nodes of the HHDM hierarchy, which hold the actual data.
A DDM is atomic and includes the ID, SIZE, and LOCATION of the data. It is used to specify a
data path and is defined as HHDM[Path, T].
DFM (Distributed Functional Matrix) :- The higher-level, non-leaf nodes, which hold child nodes.
A DFM represents composable computation whose output serves as input to subsequent nodes.
During execution, it wraps the data of its children and of other nodes.
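The two node categories can be illustrated with a tiny in-memory tree. This is a sketch under simplifying assumptions: data lives in local Python lists rather than on a cluster, and the class layout, the `compute` method, and the example locations are hypothetical.

```python
from typing import Callable, List


class DDM:
    """Leaf node: holds a partition of the actual data (in memory here)."""

    def __init__(self, node_id: str, location: str, data: list):
        self.node_id = node_id
        self.location = location
        self.data = data
        self.size = len(data)

    def compute(self) -> list:
        return list(self.data)


class DFM:
    """Non-leaf node: applies a function to the output of its children."""

    def __init__(self, node_id: str, func: Callable[[list], list], children: list):
        self.node_id = node_id
        self.func = func
        self.children = children

    def compute(self) -> list:
        # Gather the output of every child, then apply this node's function.
        gathered: List = []
        for child in self.children:
            gathered.extend(child.compute())
        return self.func(gathered)


# Two leaf partitions feed a DFM that splits lines into words,
# whose output in turn feeds a DFM that upper-cases each word.
part1 = DDM("d1", "node-a:/data/part1", ["big data", "data matrix"])
part2 = DDM("d2", "node-b:/data/part2", ["big jobs"])
split = DFM("split", lambda ls: [w for l in ls for w in l.split()], [part1, part2])
upper = DFM("upper", lambda ws: [w.upper() for w in ws], [split])
result = upper.compute()
```

Because a DFM only refers to its children, the same subtree can be reused as the input of several later jobs.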
Data Dependencies of HHDM :
Data dependencies are divided into four types :
1. One-To-One (1:1)
2. One-To-N (1:N)
3. N-To-One (N:1)
4. N-To-N (N:N)
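The paper only names the four dependency types. One plausible reading, assumed here and not stated in the source, is that they describe how input partitions map to output partitions. Under that assumption, the hypothetical helper `classify` below derives the type from two counts: `fan_in` (how many input partitions one output partition reads) and `fan_out` (how many output partitions one input partition feeds).

```python
from enum import Enum


class Dependency(Enum):
    ONE_TO_ONE = "1:1"
    ONE_TO_N = "1:N"
    N_TO_ONE = "N:1"
    N_TO_N = "N:N"


def classify(fan_in: int, fan_out: int) -> Dependency:
    """Classify a dependency from partition counts (assumed interpretation)."""
    if fan_in < 1 or fan_out < 1:
        raise ValueError("fan_in and fan_out must be at least 1")
    if fan_in == 1 and fan_out == 1:
        return Dependency.ONE_TO_ONE   # e.g. an element-wise transformation
    if fan_in == 1:
        return Dependency.ONE_TO_N     # one input partition is split up
    if fan_out == 1:
        return Dependency.N_TO_ONE     # several partitions are merged
    return Dependency.N_TO_N           # every input may reach every output


kinds = [classify(1, 1), classify(1, 4), classify(4, 1), classify(3, 3)]
```

A planner could use such a classification to decide which adjacent operations can be fused: chains of 1:1 steps need no data movement between partitions.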
Advantages of proposed system :
• HHDM is functionally defined.
• HHDM is strongly typed.
• It allows jobs to be reused across programs and applications.
• The location path of the data is available in HHDM.
IV. Implementation
HHDM Function :- A function transforms input data into output. In HHDM, functions have different
semantics that target different execution contexts for different data sets. One HHDM function has three
semantics, Fp, Fa, and Fc :
Fp is the basic semantics of a function
Fp : List[T] -> List[R]
Fa is the aggregation semantics of a function
Fa : (List[T], List[R]) -> List[R]
Fc is the combination semantics of a function
Fc : (List[R], List[R]) -> List[R]
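The three signatures above can be made concrete for a simple element-wise function. The sketch below is an illustration, not the framework's code: the class name `MapFunction` and the method names `fp`, `fa`, and `fc` are hypothetical, and plain Python lists stand in for distributed partitions.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class MapFunction(Generic[T, R]):
    """An element-wise function exposed through the three semantics."""

    def __init__(self, f: Callable[[T], R]):
        self.f = f

    def fp(self, xs: List[T]) -> List[R]:
        # Basic semantics: transform one batch of input.
        return [self.f(x) for x in xs]

    def fa(self, xs: List[T], acc: List[R]) -> List[R]:
        # Aggregation semantics: fold a new input batch into earlier output.
        return acc + self.fp(xs)

    def fc(self, left: List[R], right: List[R]) -> List[R]:
        # Combination semantics: merge two partial outputs.
        return left + right


length = MapFunction(len)
first = length.fp(["big", "data"])                  # output of a first batch
running = length.fa(["matrix"], first)              # add a later batch
merged = length.fc(running, length.fp(["spark"]))   # merge with another partial result
```

Having all three forms lets a scheduler choose between processing a whole partition at once, streaming batches into an accumulator, or merging results computed on different nodes.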
HHDM Composition :- HHDM inherits the idea of functional composition and supports two basic types of
composition :
HHDM[T, R] compose HHDM[I, T] => HHDM[I, R]
HHDM[T, R] andThen HHDM[R, U] => HHDM[T, U]
These two patterns are commonly used in functional programming and can be applied recursively to
HHDM sequences to meet complicated composition requirements.
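The two composition patterns can be sketched as follows. This is a minimal illustration in which an HHDM is reduced to a wrapped list-to-list function; the class name `HHDM` and the method name `and_then` (standing in for `andThen`) are assumptions made for the example.

```python
from typing import Callable


class HHDM:
    """A wrapped list-to-list function that supports composition."""

    def __init__(self, func: Callable[[list], list]):
        self.func = func

    def __call__(self, xs: list) -> list:
        return self.func(xs)

    def compose(self, before: "HHDM") -> "HHDM":
        # self: T -> R, before: I -> T, result: I -> R
        return HHDM(lambda xs: self.func(before.func(xs)))

    def and_then(self, after: "HHDM") -> "HHDM":
        # self: T -> R, after: R -> U, result: T -> U
        return HHDM(lambda xs: after.func(self.func(xs)))


tokenize = HHDM(lambda lines: [w for line in lines for w in line.split()])
lengths = HHDM(lambda words: [len(w) for w in words])

via_compose = lengths.compose(tokenize)    # tokenize first, then lengths
via_and_then = tokenize.and_then(lengths)  # the same pipeline, written forwards
out_a = via_compose(["big data"])
out_b = via_and_then(["big data"])
```

Both pipelines compute the same result; `compose` reads right to left while `and_then` reads in execution order, and either result can be composed again.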
Interaction with HHDM : Five types of actions are used to interact with HHDM: Compute, Sample, Count,
Traverse, and Trace. HHDM applications are designed to be interactive at runtime in an asynchronous
manner.
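Asynchronous interaction can be pictured with futures: an action returns immediately and its result is collected later. The sketch below uses Python's standard `concurrent.futures` module and covers only three of the five actions, since the paper does not describe Traverse and Trace in detail. The class `InteractiveJob` and its method signatures are hypothetical.

```python
import random
from concurrent.futures import Future, ThreadPoolExecutor


class InteractiveJob:
    """Exposes actions that return futures instead of blocking."""

    def __init__(self, data: list, executor: ThreadPoolExecutor):
        self._data = list(data)
        self._executor = executor

    def compute(self) -> Future:
        # Compute: materialize the full result.
        return self._executor.submit(lambda: list(self._data))

    def count(self) -> Future:
        # Count: number of records in the result.
        return self._executor.submit(lambda: len(self._data))

    def sample(self, k: int, seed: int = 0) -> Future:
        # Sample: k records drawn without replacement.
        rng = random.Random(seed)
        return self._executor.submit(lambda: rng.sample(self._data, k))


with ThreadPoolExecutor(max_workers=2) as pool:
    job = InteractiveJob(["big", "data", "matrix", "jobs"], pool)
    pending = job.count()              # returns at once with a Future
    count_result = pending.result()    # block only when the value is needed
    sample_result = job.sample(2).result()
    compute_result = job.compute().result()
```

The caller stays responsive between submitting an action and reading its result, which is the property an interactive application needs.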
Creating a frame :-
Method :
In the first method, the frame is created by extending the Frame class, which is defined in the java.awt
package.
The program uses three methods :
setTitle : Sets the title of the frame. It takes a String argument, which becomes the title.
setVisible : Makes the frame visible. It takes a boolean argument: if true is passed the window is
visible, otherwise it is not.
setSize : Sets the size of the window. The first argument is the width of the frame and the second
argument is its height.
V. Case Study
As a case study, different data sets of text documents are used for word count and sorting with two
methods, HHDM and Spark. The running software is shown in the following figures.
Figure 1: Home screen
Figure 1 shows the home screen of the project. The frame provides options for uploading the intended
data set, selecting the execution type, and running the job with HHDM and with Spark separately.
Figure 2: Upload dataset
Figure 2 shows how the input file or data set is chosen from its stored location on the system. In this
step, the required input data set is selected and then loaded using the Open button of the frame.
Figure 3: Dataset output type display
Figure 3 shows the selection of the execution type, either word count or sort, which determines the
output displayed after execution. The chosen functionality is run by HHDM on the selected input.
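For reference, the two jobs selected in this step can be written on a single machine in a few lines. This is only a sketch of what word count and sort compute, not the HHDM or Spark job used in the experiments; the tokenization rule (lower-cased runs of letters, digits, and apostrophes) is an assumption.

```python
import re
from collections import Counter
from typing import List

WORD = re.compile(r"[A-Za-z0-9']+")


def tokenize(text: str) -> List[str]:
    """Split text into lower-cased words (assumed tokenization rule)."""
    return [w.lower() for w in WORD.findall(text)]


def word_count(text: str) -> Counter:
    """Word count job: map every word to its number of occurrences."""
    return Counter(tokenize(text))


def sort_words(text: str) -> List[str]:
    """Sort job: return all words in ascending order."""
    return sorted(tokenize(text))


sample_text = "Big data needs big data tools"
counts = word_count(sample_text)
ordered = sort_words(sample_text)
```

In a distributed setting the same logic is split into per-partition work followed by a merge of partial counts or sorted runs.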
Figure 4: Run HHDM
Figure 4 shows the execution of the selected program with the HHDM method. The HHDM job completion time
for this program is 1375 ms. In the HHDM method, the DDM and DFM techniques work similarly to the
MapReduce framework in Hadoop.
Figure 5: Run using Spark
Figure 5 shows the execution of the same data set and job with Spark. The job completion time with
Spark is 8709 ms.
Figure 6: Job completion chart
Figure 6 displays the job completion times of HHDM and Spark as a chart of the results above. The HHDM
value in the chart is smaller than that of Spark, i.e., the HHDM job completion time is less than the
Spark job completion time.
VI. Results
The execution times of the jobs are shown in the following tables.
Table 1: Execution time for word count (ms)
Input        HHDM    Spark
Test1.txt    1066    7870
Test2.txt    1078    8063
Test3.txt    1328    7907
Table 1 shows the job completion times for different data sets in the word count application.
These values are represented graphically in Figure 7.
Figure 7: Graphical representation of word count jobs
Figure 7 shows that the execution time for HHDM is far lower than that of Spark for all three data
sets, which are represented in different colors (blue, red, and green).
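The gap in Table 1 can be quantified directly from the reported values. The sketch below assumes two common definitions that the paper does not spell out: relative reduction, (Spark - HHDM) / Spark, and speedup, Spark / HHDM. With these definitions the Table 1 values give reductions of roughly 83 to 87 percent and speedups of about 6 to 7.5 times; the paper does not state which formula underlies its 65-70 percent figure, so the numbers need not match it.

```python
# Execution times from Table 1 as (HHDM, Spark) pairs.
table1 = {
    "Test1.txt": (1066, 7870),
    "Test2.txt": (1078, 8063),
    "Test3.txt": (1328, 7907),
}


def reduction_percent(hhdm: float, spark: float) -> float:
    """Relative reduction in execution time (assumed definition)."""
    return 100.0 * (spark - hhdm) / spark


def speedup(hhdm: float, spark: float) -> float:
    """How many times faster HHDM ran than Spark (assumed definition)."""
    return spark / hhdm


reductions = {name: reduction_percent(h, s) for name, (h, s) in table1.items()}
speedups = {name: speedup(h, s) for name, (h, s) in table1.items()}

for name in table1:
    print(f"{name}: {reductions[name]:.1f}% less time, {speedups[name]:.2f}x faster")
```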
Table 2: Execution time for sorting (ms)
Input        Run HHDM    Run Spark
Text1.txt    1066        7870
Text2.txt    1078        8063
Text3.txt    1328        7907
Table 2 shows the job completion times for different data sets in the sorting application. These
values are represented graphically in Figure 8.
Figure 8: Graphical representation of sorting jobs
Figure 8 shows that the execution time for HHDM is far lower than that of Spark for all three data
sets, which are represented in different colors (blue, red, and green).
VII. Conclusion
In this paper, HHDM, a functional and strongly-typed meta-data abstraction, is implemented,
together with a runtime system that supports the execution, management, and optimization of HHDM
applications. Applications written in HHDM are natively composable and can be merged with existing
software applications. The data flow of HHDM jobs is optimized automatically before execution at
runtime. HHDM lets the programmer concentrate on program logic by automating the integration and
optimization processes. The results show that execution time is reduced when HHDM is used instead of
Spark. The improvement in performance is 65-70%.
References:
[1] Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian
Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias
Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas,
and Daniel Warneke. "The Stratosphere Platform for Big Data Analytics". VLDB J., 23(6), 2014.
[2] Anju Abraham and Shyma Kareem. "Security and Clustering of Big Data in Map Reduce
Framework". International Journal of Advance Research, Ideas and Innovations in Technology,
Volume 4, Issue 1, pp. 199, ISSN: 2454-132X, 2018.
[3] Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun C Murthy, and Carlo
Curino. "Apache Tez: A Unifying Framework for Modeling and Building Data Processing
Applications". In SIGMOD, 2015.
[4] ChunWei Tsai, Chin Feng Lai, Han Chieh Chao, and Athanasios V. Vasilakos. "Big Data
Analytics: A Survey". Journal of Big Data, 2(21), 2015.
[5] Corrigan, P. Zikopoulos, K. Parasuraman, T. Deutsch, D. Deroos, and J. Giles. "Harness the
Power of Big Data: The IBM Big Data Platform". 1st ed. New York, NY, USA: McGraw-Hill, Nov.
2012.
[6] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay
Chaudhary, and Michael Young. "Machine Learning: The High Interest Credit Card of Technical
Debt". In SE4ML: Software Engineering for Machine Learning, 2014.
[7] Dongyao Wu, Sherif Sakr, Liming Zhu, and Qinghua Lu. "Composable and Efficient Functional
Big Data Processing Framework". In IEEE Big Data, 2015.
[8] Jiawei Yuan and Yifan Tian. "Practical Privacy-Preserving MapReduce Based K-means
Clustering over Large-scale Dataset". IEEE, 2017.
[9] Li Zhu, Fei Richard Yu, Yige Wang, Bin Ning, and Tao Tang. "Big Data Analytics in Intelligent
Transportation Systems: A Survey". 1524-9050, IEEE, 2018.
[10] Manogaran, G., Varatharajan, R., Lopez, D., Kumar, P.M., Sundarasekar, R., Thota, C., 2017. A
new architecture of Internet of Things and big data ecosystem for secured smart healthcare
monitoring and alerting system. Future Gener. Comput. Syst. 82, 375–387.
[11] Mattew Malensek, Walid Budgaga, Ryan Stern, Sangmi Lee Pallickara, and Shrideep Pallickara.
"Trident: Distributed Storage, Analysis, and Exploration of Multidimensional Phenomena". IEEE,
2018.
[12] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley,
Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. "Spark SQL:
Relational Data Processing in Spark". In SIGMOD, pages 1383–1394, 2015.
[13] Mohamad, M., Selamat, A., 2016. A new hybrid rough set and soft set parameter reduction method
for spam e-mail classification task. Lecture Notes in Artificial Intelligence, LNAI 9806 (9806),
18–30.
[14] Paradarami, N.D., Tulasi, K., Bastian, Wightman, J.L., 2017. A hybrid recommender system using
artificial neural networks. Expert Syst. Appl. 83, 300–313.
[15] Qian, J., Lv, P., Yue, X., Liu, C., Jing, Z., 2015. Hierarchical attribute reduction algorithms for
big data using MapReduce. Knowl.-Based Syst. 73, 18–31.
[16] Sherif Sakr and Mohamed Medhat Gaber, editors. "Large Scale and Big Data - Processing and
Management". Auerbach Publications, 2014.
[17] Sherif Sakr, Anna Liu, and Ayman G. Fayoumi. "The Family of MapReduce and Large-scale Data
Processing Systems". ACM CSUR, 46(1):11, 2013.
[18] Triguero, I., Peralta, D., Bacardit, J., García, S., Herrera, F., 2015. MRPR: a MapReduce solution
for prototype reduction in big data classification. Neurocomputing 150, 331–345.
[19] Y. Zhang, S. Chen, Q. Wang, and G. Yu. "i2mapreduce: Incremental MapReduce for mining
evolving Big Data". IEEE Transactions on Knowledge and Data Engineering, vol. 27, 2015.
[20] Zhipeng Gao, Kun Niu, Yidan Fan, and Zhenyiying. "MR-Mafia: Parallel Subspace Clustering
Algorithm Based on MapReduce for Large Multi-dimensional Datasets". International Conference
on Big Data, IEEE, 2018.
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 17, No. 5, May 2019
124 https://sites.google.com/site/ijcsis/
ISSN 1947-5500

More Related Content

What's hot (17)

PDF
A Survey on Graph Database Management Techniques for Huge Unstructured Data
IJECEIAES
 
PDF
Granularity analysis of classification and estimation for complex datasets wi...
IJECEIAES
 
PDF
IRJET- Recommendation System based on Graph Database Techniques
IRJET Journal
 
PDF
Comparison Between WEKA and Salford System in Data Mining Software
Universitas Pembangunan Panca Budi
 
PDF
A Survey of Agent Based Pre-Processing and Knowledge Retrieval
IOSR Journals
 
PDF
The Big Data Importance – Tools and their Usage
IRJET Journal
 
PDF
IRJET- Analysis for EnhancedForecastof Expense Movement in Stock Exchange
IRJET Journal
 
PDF
RESEARCH IN BIG DATA – AN OVERVIEW
ieijjournal
 
PDF
An Introduction to CCDH
Nicole Vasilevsky
 
PDF
Query Optimization Techniques in Graph Databases
IJDMS
 
PDF
Characterizing and Processing of Big Data Using Data Mining Techniques
IJTET Journal
 
PDF
Efficient Cost Minimization for Big Data Processing
IRJET Journal
 
PDF
Z36149154
IJERA Editor
 
PDF
A NOVEL APPROACH FOR PROCESSING BIG DATA
IJDMS
 
PDF
Data mining seminar report
mayurik19
 
PDF
Lecture1-IS322(Data&InfoMang-introduction)
Taibah University, College of Computer Science & Engineering
 
A Survey on Graph Database Management Techniques for Huge Unstructured Data
IJECEIAES
 
Granularity analysis of classification and estimation for complex datasets wi...
IJECEIAES
 
IRJET- Recommendation System based on Graph Database Techniques
IRJET Journal
 
Comparison Between WEKA and Salford System in Data Mining Software
Universitas Pembangunan Panca Budi
 
A Survey of Agent Based Pre-Processing and Knowledge Retrieval
IOSR Journals
 
The Big Data Importance – Tools and their Usage
IRJET Journal
 
IRJET- Analysis for EnhancedForecastof Expense Movement in Stock Exchange
IRJET Journal
 
RESEARCH IN BIG DATA – AN OVERVIEW
ieijjournal
 
An Introduction to CCDH
Nicole Vasilevsky
 
Query Optimization Techniques in Graph Databases
IJDMS
 
Characterizing and Processing of Big Data Using Data Mining Techniques
IJTET Journal
 
Efficient Cost Minimization for Big Data Processing
IRJET Journal
 
Z36149154
IJERA Editor
 
A NOVEL APPROACH FOR PROCESSING BIG DATA
IJDMS
 
Data mining seminar report
mayurik19
 
Lecture1-IS322(Data&InfoMang-introduction)
Taibah University, College of Computer Science & Engineering
 

Similar to Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data Matrix (20)

PDF
An Efficient Approach for Clustering High Dimensional Data
IJSTA
 
PDF
A REVIEW PAPER ON BIG DATA ANALYTICS
Sarah Adams
 
PDF
A Novel Integrated Framework to Ensure Better Data Quality in Big Data Analyt...
IJECEIAES
 
PDF
Framework for efficient transformation for complex medical data for improving...
IJECEIAES
 
PDF
Framework for efficient transformation for complex medical data for improving...
IJECEIAES
 
PPTX
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 
PDF
Big Data Testing Using Hadoop Platform
IRJET Journal
 
PDF
Elementary Concepts of Big Data and Hadoop
rahulmonikasharma
 
PDF
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
oj08
 
PPTX
Data mining with big data
Sandip Tipayle Patil
 
PDF
A Comprehensive Study on Big Data Applications and Challenges
ijcisjournal
 
PDF
Revolutionizing Big Data with AI-Driven Hybrid Soft Computing Techniques
mlaij
 
PDF
Revolutionizing Big Data with AI-Driven Hybrid Soft Computing Techniques
mlaij
 
PDF
Big Data Summarization : Framework, Challenges and Possible Solutions
aciijournal
 
PDF
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
aciijournal
 
PDF
Big Data Summarization : Framework, Challenges and Possible Solutions
aciijournal
 
PDF
Big Data Summarization : Framework, Challenges and Possible Solutions
aciijournal
 
PDF
Survey Paper on Big Data and Hadoop
IRJET Journal
 
PDF
A Review Paper on Big Data and Hadoop for Data Science
ijtsrd
 
PDF
Data minig with Big data analysis
Poonam Kshirsagar
 
An Efficient Approach for Clustering High Dimensional Data
IJSTA
 
A REVIEW PAPER ON BIG DATA ANALYTICS
Sarah Adams
 
A Novel Integrated Framework to Ensure Better Data Quality in Big Data Analyt...
IJECEIAES
 
Framework for efficient transformation for complex medical data for improving...
IJECEIAES
 
Framework for efficient transformation for complex medical data for improving...
IJECEIAES
 
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 
Big Data Testing Using Hadoop Platform
IRJET Journal
 
Elementary Concepts of Big Data and Hadoop
rahulmonikasharma
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
oj08
 
Data mining with big data
Sandip Tipayle Patil
 
A Comprehensive Study on Big Data Applications and Challenges
ijcisjournal
 
Revolutionizing Big Data with AI-Driven Hybrid Soft Computing Techniques
mlaij
 
Revolutionizing Big Data with AI-Driven Hybrid Soft Computing Techniques
mlaij
 
Big Data Summarization : Framework, Challenges and Possible Solutions
aciijournal
 
BIG DATA SUMMARIZATION: FRAMEWORK, CHALLENGES AND POSSIBLE SOLUTIONS
aciijournal
 
Big Data Summarization : Framework, Challenges and Possible Solutions
aciijournal
 
Big Data Summarization : Framework, Challenges and Possible Solutions
aciijournal
 
Survey Paper on Big Data and Hadoop
IRJET Journal
 
A Review Paper on Big Data and Hadoop for Data Science
ijtsrd
 
Data minig with Big data analysis
Poonam Kshirsagar
 
Ad

Recently uploaded (20)

PPTX
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 
PPTX
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
PPTX
Thermal runway and thermal stability.pptx
godow93766
 
PPT
PPT2_Metal formingMECHANICALENGINEEIRNG .ppt
Praveen Kumar
 
PDF
Reasons for the succes of MENARD PRESSUREMETER.pdf
majdiamz
 
DOCX
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
PDF
Pressure Measurement training for engineers and Technicians
AIESOLUTIONS
 
PPTX
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
PDF
Viol_Alessandro_Presentazione_prelaurea.pdf
dsecqyvhbowrzxshhf
 
PDF
MAD Unit - 2 Activity and Fragment Management in Android (Diploma IT)
JappanMavani
 
PPTX
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
PPTX
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
PPTX
fatigue in aircraft structures-221113192308-0ad6dc8c.pptx
aviatecofficial
 
PPTX
Damage of stability of a ship and how its change .pptx
ehamadulhaque
 
PDF
smart lot access control system with eye
rasabzahra
 
PDF
International Journal of Information Technology Convergence and services (IJI...
ijitcsjournal4
 
PPTX
DATA BASE MANAGEMENT AND RELATIONAL DATA
gomathisankariv2
 
DOCX
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 
PPT
Electrical Safety Presentation for Basics Learning
AliJaved79382
 
PDF
AI TECHNIQUES FOR IDENTIFYING ALTERATIONS IN THE HUMAN GUT MICROBIOME IN MULT...
vidyalalltv1
 
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
Thermal runway and thermal stability.pptx
godow93766
 
PPT2_Metal formingMECHANICALENGINEEIRNG .ppt
Praveen Kumar
 
Reasons for the succes of MENARD PRESSUREMETER.pdf
majdiamz
 
CS-802 (A) BDH Lab manual IPS Academy Indore
thegodhimself05
 
Pressure Measurement training for engineers and Technicians
AIESOLUTIONS
 
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
Viol_Alessandro_Presentazione_prelaurea.pdf
dsecqyvhbowrzxshhf
 
MAD Unit - 2 Activity and Fragment Management in Android (Diploma IT)
JappanMavani
 
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
fatigue in aircraft structures-221113192308-0ad6dc8c.pptx
aviatecofficial
 
Damage of stability of a ship and how its change .pptx
ehamadulhaque
 
smart lot access control system with eye
rasabzahra
 
International Journal of Information Technology Convergence and services (IJI...
ijitcsjournal4
 
DATA BASE MANAGEMENT AND RELATIONAL DATA
gomathisankariv2
 
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 
Electrical Safety Presentation for Basics Learning
AliJaved79382
 
AI TECHNIQUES FOR IDENTIFYING ALTERATIONS IN THE HUMAN GUT MICROBIOME IN MULT...
vidyalalltv1
 
Ad

Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data Matrix

  • 1. Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data Matrix K.L.S.Soujanyaa , B.Shirishab , Challa Madhavi Lathac a Professor, Department of CSE,CMR College of Engineering & Technology, Hyderabad, India. [email protected] b M .Tech Student, Department of CSE,CMR College of Engineering & Technology, Hyderabad, India. [email protected] c Assistant Professor, Department of Information Technology, Faculty of Informatics, University of Gondar, Gondar, Ethiopia. [email protected] (Corresponding Author) International Journal of Computer Science and Information Security (IJCSIS), Vol. 17, No. 5, May 2019 112 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
  • 2. Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data Matrix Abstract - Data is the most valuable entity in today’s world which has to be managed. The huge data available is to be processed for knowledge and predictions.This huge data in other words big data is available from various sources like Facebook, twitter and many more resources. The processing time taken by the frameworks such as Spark ,MapReduce Hierachial Distributed Matrix(HHDM) is more. Hence Hybrid Hierarchically Distributed Data Matrix(HHHDM) is proposed. This framework is used to develop Bigdata applications. In existing system developed programs are by default or automatically roughly defined, jobs are without any functionality being described to be reusable.It also reduces the ability to optimize data flow of job sequences and pipelines. To overcome the problems of existing framework we introduce a HHHDM method for developing the big data processing jobs. The proposed method is a Hybrid method which has the advantages of Hierarchial Distributed Matrix (HHDM) which is functional, stronglytyped for writing big data applications which are composable. To improve the performance of executing HHHDM jobs multiple optimizationsis applied to the HHHDM method. The experimental results show that the improvement of the processing time is 65-70 percent when compared to the existing technology that is spark. Keywords: Big data processing, Optimization, Hybrid Hierarchically Distributed Data Matrixframework, strongly-typed. I. Introduction The exponential growth and availability of data is described by the Big data. It has become a buzz word in software environment. But the growing large-scale data is exponential as year by year recent research report says in the year of 2020 exponential growth of large-scale data is zetta bytes and yotta byte of storage in systems. For this kind of problems introduces the new development of novel solutions to overcome this problems or challenges. 
The fundamental principle of the MapReduce framework is to move the analysis to the data, rather than moving the data to a system that can analyze it. It invites programmers to think in a data-centric fashion: they focus on applying transformations to sets of data records, while the framework transparently manages the details of distributed execution and fault tolerance. However, in recent years in the data analytics domain
  • 3. application requirements have kept increasing, various limitations of the Hadoop framework have been recognized, and we have witnessed an unprecedented interest in tackling these challenges with a new wave of mostly domain-specific, optimized big data processing platforms.
In recent years, more methods have been presented that use distributed clusters of commodity machines to handle ever-increasing data sets. Several frameworks (e.g. Spark) significantly reduce the complexity of developing big data programs and applications. The main challenges in present big data analytical applications are listed below:
• Real-time software applications and programs require a chain of operations for processing.
• Manual optimizations are time-consuming and prone to errors. Merging, developing and interaction in big data programs are not natively supported.
• MapReduce and Spark jobs are roughly defined, without any information about their functionality. Because of this, applications are not reusable.
To overcome the above challenges, a new framework, HHHDM, is proposed.
II. Literature Survey
Various approaches have been used to secure data sets with millions of records and to maintain their efficiency and performance in the face of variety, velocity and volume (Anju & Shyma, 2018). In the recent past, the flow of data produced by various computations has increased, and processing is shifting to large-scale data mechanisms. MapReduce is one of the most widespread methods for managing big data; it helps reduce processing time and memory use and supports efficient parallel processing of large data sets (Triguero et al., 2015). Qian et al. (2015) used MapReduce to implement hierarchical attribute reduction algorithms. Manogaran et al. (2017) followed the same approach to monitor smart health care securely.
The majority of researchers found that their proposed frameworks and methods, implemented with the MapReduce data processing model, improved performance. However, MapReduce implementations have some limitations, which some researchers have addressed. Mattew et al. (2018) overcome big data complications and limitations such as data storage, partitioning, transformation, retrieval, extraction and indexing by generating a training data set.
  • 4. In intelligent transportation systems (ITS), big data plays a major role around the globe. ITS big data has a major impact on the design and applications of ITS, making systems more secure and efficient (Li et al., 2018). However, in recent years application requirements in the data analytics domain have kept increasing, various limitations of the Hadoop framework have been recognized, and a new trend of domain-specific platforms has emerged to optimize big data processing and handle unprecedented problems with new solutions (Sharif et al., 2013). Big data has become a widespread term in the software world, and data volumes keep growing year after year; a recent research report estimates that by 2020 large-scale data will reach zettabytes and yottabytes of storage. Storage and data set problems could therefore be solved with novel approaches (Sharif & Mohammad, 2014). Some researchers have suggested hybrid models to solve various data set issues. A hybrid model integrates more than one model, which is an efficient way to improve performance and to handle the problems and weaknesses of data sets (Mohamad et al., 2016; Paradarami et al., 2017).
III. Proposed System
To overcome these challenges and improve the efficiency of big data processing, we propose HHHDM, which integrates MapReduce with HHDM.
A. Introduction to HHDM
• HHDM is a functional core abstraction that supports the optimization and parallel execution of big data programs and applications.
• An HHDM is written HHDM[T, R]: a function whose input is of type T and whose output is of type R.
• Apart from this core, an HHDM object includes data dependency, location, functionality and state.
• HHDM is strongly typed, lightweight and functionally defined.
B. Representation of HHDM
Attributes of HHDM:
1. ID: the identifier of the HHDM.
It must be unique within each HHDM context.
2. INTYPE and OUTTYPE: used to check type correctness during program planning and execution.
3. CATEGORY: the category of the HHDM, either DDM or DFM, used for program execution.
  • 5. 4. CHILDREN and DEPENDENCY: these attributes are used for reordering or reusing big data jobs.
5. LOCATION: the HHDM holds the address (URL) of the program, which can be used from anywhere within the HHDM location.
6. FUNCTION: the core attribute of an HHDM; it defines how the output is computed.
7. STATE: the status of the program or application.
How these attributes are used depends on the programmer's design. Through these attributes an HHDM is defined as:
• Functional
• Portable
• Location-aware
Categories of HHDM:
HHDM is an independent core object with a tree-based structure that consists of the following two types of nodes:
DDM (Distributed Data Matrix): the leaf nodes of the HHDM hierarchy, which hold the data. A DDM is atomic and includes the ID, SIZE and LOCATION of the data. It is used for path specification and is defined as HHDM[path, T].
DFM (Distributed Functional Matrix): the non-leaf nodes, which represent the high-level program and hold the child nodes. A DFM produces composable output, which becomes the input of subsequent nodes. During execution it wraps the data of its children and of other nodes.
Data Dependencies of HHDM: these are divided into four types:
1. One-To-One (1:1)
2. One-To-N (1:N)
3. N-To-One (N:1)
4. N-To-N (N:N)
Advantages of the proposed system:
• HHDM is functionally defined.
• It is strongly typed.
• It provides reusability of jobs in programs and applications.
• The location path is available in HHDM.
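The attribute list and the two node categories described above can be sketched as a small tree structure. This is a minimal illustration in Python, not the authors' implementation: the class name `Node`, the `data` field (an in-memory stand-in for data stored at a location), the state names and the `mem://` addresses are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Category(Enum):
    DDM = "DDM"  # leaf node that holds data
    DFM = "DFM"  # non-leaf node that applies a function to its children


class Dependency(Enum):
    ONE_TO_ONE = "1:1"
    ONE_TO_N = "1:N"
    N_TO_ONE = "N:1"
    N_TO_N = "N:N"


@dataclass
class Node:
    """Hypothetical node carrying the seven attributes listed in the text."""
    id: str
    in_type: type
    out_type: type
    category: Category
    location: str
    children: List["Node"] = field(default_factory=list)
    dependency: Dependency = Dependency.ONE_TO_ONE
    function: Optional[Callable[[list], list]] = None
    state: str = "declared"                    # assumed state name
    data: list = field(default_factory=list)   # assumed in-memory payload

    def compute(self) -> list:
        """Evaluate the tree bottom-up: leaves return data, DFMs apply their function."""
        if self.category is Category.DDM:
            result = list(self.data)
        else:
            inputs = [item for child in self.children for item in child.compute()]
            result = self.function(inputs)
        self.state = "computed"                # assumed state name
        return result


# Two leaf partitions and one functional node that depends on both (N:1).
part_a = Node("ddm-0", str, str, Category.DDM, "mem://part-0",
              data=["big data", "data matrix"])
part_b = Node("ddm-1", str, str, Category.DDM, "mem://part-1",
              data=["hybrid"])
to_upper = Node("dfm-0", str, str, Category.DFM, "mem://driver",
                children=[part_a, part_b],
                dependency=Dependency.N_TO_ONE,
                function=lambda lines: [line.upper() for line in lines])

result = to_upper.compute()
```

Keeping the children and the dependency type on the node itself is what would let a planner reorder or reuse sub-trees before anything is executed, which is the role the text assigns to the CHILDREN and DEPENDENCY attributes.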
  • 6. IV. Implementation
HHDM Function: the function that transforms input data to output using various semantics. Functions have different semantics targeting different execution contexts for different data sets in HHDM. One HHDM function has three semantics, Fp, Fa and Fc:
Fp is the basic semantics of a function: Fp : List[T] -> List[R]
Fa is the aggregation semantics of a function: Fa : (List[T], List[R]) -> List[R]
Fc is the combination semantics of a function: Fc : (List[R], List[R]) -> List[R]
HHDM Composition: HHDM inherits the idea of functional composition to support two basic types of composition:
HDM[T, R] compose HDM[I, T] => HDM[I, R]
HDM[T, R] andThen HDM[R, U] => HDM[T, U]
These two patterns are commonly used in functional programming and can be applied recursively in HHDM sequences to achieve complicated composition requirements.
Interaction with HHDM: five types of actions are used to interact with and integrate HHDM: Compute, Sample, Count, Traverse and Trace. HHDM applications are designed to be interactive at runtime in an asynchronous manner.
Creating a frame:
Method: in the first method we create a frame by extending the Frame class, which is defined in the java.awt package. The program uses three methods:
setTitle: sets the title of the frame. It takes a String argument, which becomes the title.
setVisible: makes the frame visible. It takes a boolean argument: if true is passed the window is visible, otherwise it is not.
setSize: sets the size of the window. The first argument is the width of the frame and the second argument is its height.
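The three function semantics and the two composition patterns from this section can be illustrated with word count, one of the jobs used later in the case study. The sketch below is an assumption-laden example in Python rather than the paper's code: the names `fp`, `fa`, `fc`, `compose`, `and_then`, `lower` and `most_frequent` are hypothetical, with T taken to be a line of text and R a (word, count) pair.

```python
from collections import Counter
from typing import Callable, List, Tuple, TypeVar

I = TypeVar("I")
T = TypeVar("T")
R = TypeVar("R")
U = TypeVar("U")

Pair = Tuple[str, int]


def fp(lines: List[str]) -> List[Pair]:
    """Basic semantics Fp: List[T] -> List[R]; count words in one partition."""
    counts = Counter(word for line in lines for word in line.split())
    return sorted(counts.items())


def fc(left: List[Pair], right: List[Pair]) -> List[Pair]:
    """Combination semantics Fc: (List[R], List[R]) -> List[R]; merge two partial results."""
    merged = Counter(dict(left))
    merged.update(dict(right))  # Counter.update adds counts instead of replacing them
    return sorted(merged.items())


def fa(lines: List[str], partial: List[Pair]) -> List[Pair]:
    """Aggregation semantics Fa: (List[T], List[R]) -> List[R]; fold new input into a result."""
    return fc(partial, fp(lines))


def compose(f: Callable[[List[T]], List[R]],
            g: Callable[[List[I]], List[T]]) -> Callable[[List[I]], List[R]]:
    """[T, R] compose [I, T] => [I, R]: run g first, then f."""
    return lambda xs: f(g(xs))


def and_then(f: Callable[[List[T]], List[R]],
             g: Callable[[List[R]], List[U]]) -> Callable[[List[T]], List[U]]:
    """[T, R] andThen [R, U] => [T, U]: run f first, then g."""
    return lambda xs: g(f(xs))


def lower(lines: List[str]) -> List[str]:
    return [line.lower() for line in lines]


def most_frequent(pairs: List[Pair]) -> List[Pair]:
    return [max(pairs, key=lambda pair: pair[1])]


word_count = compose(fp, lower)                  # normalise case, then count
pipeline = and_then(word_count, most_frequent)   # count, then keep the top word

part1 = ["Big data", "big Matrix"]
part2 = ["data data"]

combined = fc(word_count(part1), word_count(part2))
aggregated = fa(part2, word_count(part1))
top = pipeline(part1)
```

Because `fc` merges two partial results without needing the raw input, partitions can be counted independently and combined in any order, which is the property a distributed runtime relies on.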
  • 7. V. Case Study
As a case study, different data sets of text documents are used to perform word count and sorting with two methods, HHDM and Spark. The running software is shown in the following figures.
Figure 1: Home screen
Figure 1 shows the home screen of the project, where the frame provides options for uploading the intended data set, choosing the execution type and so on, for HHDM and Spark separately.
Figure 2: Upload dataset
Figure 2 shows choosing the input file or data set from its stored location in the system. In this step we select the input data set we need and open it using the Open button of the frame.
  • 8. Figure 3: Dataset output type display
Figure 3 shows the selection of the output display type: here we select the execution type in the frame, either word count or sort. The chosen functionality is then run by HHDM on the required input.
Figure 4: Run HHDM
Figure 4 shows the execution of the chosen program with the HHDM method. The HHDM job completion time for this program is 1375 ms. In the HHDM method, the DDM and DFM techniques work similarly to the MapReduce framework in Hadoop.
  • 9. Figure 5: Run using Spark
Figure 5 shows the execution of the same data set or job with Spark. The observed job completion time with Spark is 8709 ms.
Figure 6: Job completion chart
Figure 6 displays the job completion times for both HHDM and Spark as a graph of the results above. The HHDM graph is shorter than the Spark graph, i.e., the HHDM job completion time is less than the Spark job completion time.
VI. Results
The execution times of the jobs are shown in the following tables.
  • 10. Table 1: Execution time for Word Count
Input        HHDM    Spark
Test1.txt    1066    7870
Test2.txt    1078    8063
Test3.txt    1328    7907
Table 1 shows the job completion times for the different data sets in the word count application. These values are represented graphically in Figure 7.
Figure 7: Graphical representation of word count jobs
The graphical representation of word count in Figure 7 shows that the execution time for HHDM is far less than that of Spark for all three data sets, which are represented in different colors (blue, red and green).
Table 2: Execution time for sorting
Input        Run HHDM    Run Spark
Text1.txt    1066        7870
Text2.txt    1078        8063
Text3.txt    1328        7907
Table 2 shows the job completion times for the different data sets in the sorting application. These values are represented graphically in Figure 8.
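The relative saving implied by each row of Table 1 can be computed directly from the printed values. The sketch below assumes the times are in milliseconds (the tables state no unit) and uses a hypothetical helper, `reduction_percent`, that expresses the saving as a share of the Spark time.

```python
# Execution times copied from Table 1 as (HHDM, Spark); unit assumed to be milliseconds.
table1 = {
    "Test1.txt": (1066, 7870),
    "Test2.txt": (1078, 8063),
    "Test3.txt": (1328, 7907),
}


def reduction_percent(hhdm_time: float, spark_time: float) -> float:
    """Share of the Spark time that is saved by HHDM, in percent."""
    return 100.0 * (spark_time - hhdm_time) / spark_time


reductions = {
    name: round(reduction_percent(hhdm, spark), 1)
    for name, (hhdm, spark) in table1.items()
}

for name, value in reductions.items():
    print(f"{name}: {value}% less time than Spark")
```

For the printed values the per-file reductions come out at roughly 83 to 87 percent. The sketch does not attempt to reproduce the 65-70 percent overall figure quoted in the abstract and conclusion.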
  • 11. Figure 8: Graphical representation of sorting jobs
The graphical representation of sorting jobs in Figure 8 shows that the execution time for HHDM is far less than that of Spark for all three data sets, which are represented in different colors (blue, red and green).
VII. Conclusion
In this paper, HHDM, a functional and strongly-typed meta-data abstraction, is implemented, together with a runtime system that supports the execution, management and optimization of HHDM applications. Applications written in HHDM are natively composable and can be merged with existing software applications. The movement of data in HHDM jobs is optimized before execution at runtime. HHDM lets the programmer concentrate on the logic by automating the integration and optimization processes. The results show that the execution time is lower with HHDM than with Spark. The improvement in performance is 65-70%.
References:
  • 12. [1] Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. "The Stratosphere Platform for Big Data Analytics". VLDB J., 23(6), 2014.
[2] Anju Abraham and Shyma Kareem, 2018. "Security and Clustering of Big Data in Map Reduce Framework". International Journal of Advance Research, Ideas and Innovations in Technology, Volume 4, Issue 1, pp. 199, ISSN: 2454-132X.
[3] Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun C Murthy, and Carlo Curino. "Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications". In SIGMOD, 2015.
[4] ChunWei Tsai, Chin Feng Lai, Han Chieh Chao, and Athanasios V. Vasilakos. "Big Data Analytics: A Survey". Journal of Big Data, 2(21), 2015.
[5] Corrigan, P. Zikopoulos, K. Parasuraman, T. Deutsch, D. Deroos, and J. Giles. "Harness the Power of Big Data: The IBM Big Data Platform". 1st ed. New York, NY, USA: McGraw-Hill, Nov. 2012.
[6] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. "Machine Learning: The High Interest Credit Card of Technical Debt". In SE4ML: Software Engineering for Machine Learning, 2014.
[7] Dongyao Wu, Sherif Sakr, Liming Zhu, and Qinghua Lu. "Composable and Efficient Functional Big Data Processing Framework". In IEEE Big Data, 2015.
[8] Jiawei Yuan and Yifan Tian. "Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset". IEEE, 2017.
[9] Li Zhu, Fei Richard Yu, Yige Wang, Bin Ning, and Tao Tang. "Big Data Analytics in Intelligent Transportation Systems: A Survey". ISSN 1524-9050, IEEE, 2018.
[10] Manogaran, G., Varatharajan, R., Lopez, D., Kumar, P.M., Sundarasekar, R., Thota, C., 2017.
A new architecture of Internet of Things and big data ecosystem for secured smart healthcare monitoring and alerting system. Future Gener. Comput. Syst. 82, 375–387.
[11] Mattew Malensek, Walid Budgaga, Ryan Stern, Sangmi Lee Pallickara, and Shrideep Pallickara. "Trident: Distributed Storage, Analysis, and Exploration of Multidimensional Phenomena". IEEE, 2018.
[12] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. "Spark SQL: Relational Data Processing in Spark". In SIGMOD, pages 1383–1394, 2015.
  • 13. [13] Mohamad, M., Selamat, A., 2016. A new hybrid rough set and soft set parameter reduction method for spam e-mail classification task. Lecture Notes in Artificial Intelligence, LNAI 9806 (9806), 18–30.
[14] Paradarami, N.D., Tulasi, K., Bastian, Wightman, J.L., 2017. A hybrid recommender system using artificial neural networks. Expert Syst. Appl. 83, 300–313.
[15] Qian, J., Lv, P., Yue, X., Liu, C., Jing, Z., 2015. Hierarchical attribute reduction algorithms for big data using MapReduce. Knowl.-Based Syst. 73, 18–31.
[16] Sherif Sakr and Mohamed Medhat Gaber, editors. "Large Scale and Big Data - Processing and Management". Auerbach Publications, 2014.
[17] Sherif Sakr, Anna Liu, and Ayman G. Fayoumi. "The Family of MapReduce and Large-scale Data Processing Systems". ACM CSUR, 46(1):11, 2013.
[18] Triguero, I., Peralta, D., Bacardit, J., García, S., Herrera, F., 2015. MRPR: a MapReduce solution for prototype reduction in big data classification. Neurocomputing 150, 331–345.
[19] Y. Zhang, S. Chen, Q. Wang, and G. Yu. "i2mapreduce: Incremental MapReduce for mining evolving Big Data". IEEE Transactions on Knowledge and Data Engineering, vol. 27, 2015.
[20] Zhipeng Gao, Kun Niu, Yidan Fan, and Zhenyiying. "MR-Mafia: Parallel Subspace Clustering Algorithm Based on MapReduce For large Multi-dimensional Datasets". International Conference on Big Data, IEEE, 2018.