Optimizing Bigdata Processing by using Hybrid Hierarchically Distributed Data Matrix

K.L.S. Soujanya (a), B. Shirisha (b), Challa Madhavi Latha (c)
(a) Professor, Department of CSE, CMR College of Engineering & Technology, Hyderabad, India. klssoujanya@cmrcet.org
(b) M.Tech Student, Department of CSE, CMR College of Engineering & Technology, Hyderabad, India. shirishabadithela@gmail.com
(c) Assistant Professor, Department of Information Technology, Faculty of Informatics, University of Gondar, Gondar, Ethiopia. saidatta2009@gmail.com (Corresponding Author)
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 17, No. 5, May 2019
https://sites.google.com/site/ijcsis/
ISSN 1947-5500

Abstract - Data is among the most valuable assets in today's world and has to be managed carefully. The huge volume of data available, in other words big data, comes from sources such as Facebook, Twitter and many others, and has to be processed to extract knowledge and make predictions. The processing time taken by frameworks such as Spark, MapReduce and the Hierarchical Distributed Matrix (HHDM) is high. Hence the Hybrid Hierarchically Distributed Data Matrix (HHHDM) is proposed as a framework for developing big data applications. In existing systems, programs are by default only roughly defined: jobs are written without any description of their functionality, so they cannot be reused. This also reduces the ability to optimize the data flow of job sequences and pipelines. To overcome these problems we introduce the HHHDM method for developing big data processing jobs. The proposed method is a hybrid that retains the advantages of the Hierarchical Distributed Matrix (HHDM), which is functional and strongly typed and supports writing composable big data applications. To improve the performance of HHHDM jobs, multiple optimizations are applied to the HHHDM method. The experimental results show that processing time improves by 65-70 percent compared with the existing technology, Spark.
Keywords: Big data processing, Optimization, Hybrid Hierarchically Distributed Data Matrix framework, strongly-typed.
I. Introduction
The term big data describes the exponential growth and availability of data, and it has become a buzzword in the software industry. Large-scale data keeps growing year after year; recent research reports say that by 2020 it will have grown to zettabytes and yottabytes of storage in systems. Problems of this kind call for novel solutions. The fundamental principle of the MapReduce framework is to move the analysis to the data, rather than moving the data to a system that can analyze it. It invites programmers to think in a data-centric fashion: they focus on applying transformations to sets of data records, while the framework transparently manages the details of distributed execution and fault tolerance. In recent years, however, the requirements of applications in the data analytics domain have kept increasing, various limitations of the Hadoop framework have been recognized, and a new wave of mostly domain-specific, optimized big data processing platforms has attracted unprecedented interest as a way to tackle these challenges with new solutions.
In recent years, with the use of distributed clusters of commodity machines, more methods have been presented to handle ever-increasing data sets. Several frameworks (e.g. Spark) significantly reduce the complexity of developing big data programs and applications.
The main challenges in present big data analytical applications are listed below:
• Real-time software applications and programs require a chain of operations for processing.
• Manual optimizations are time-consuming and error-prone. Merging, developing and interacting with big data programs is not natively supported.
• MapReduce and Spark jobs are only roughly defined and give no information about their functionality. Because of this, applications are not reusable.
To overcome the above challenges, a new framework, HHHDM, is proposed.
II. Literature Survey
Various approaches have been used to secure big data and to maintain efficiency and performance across millions of data sets characterized by variety, velocity and volume (Anju & Shyma, 2018). In the recent past, the flow of data produced by various computations has increased and is shifting towards large-scale data mechanisms. MapReduce is one of the most widespread methods for managing big data; it is useful for reducing processing time and memory space and for efficient parallel processing of large data sets (Triguero et al., 2015). Qian et al. (2015) implemented hierarchical attribute reduction algorithms using MapReduce. Manogaran et al. (2017) followed the same process to monitor smart health care in a secure way. The majority of researchers have found that performance improved with their proposed frameworks and methods implemented on the MapReduce data processing model. However, MapReduce implementations have some limitations, which some researchers have addressed. Mattew et al. (2018) overcome big data complications and limitations such as data storage, partitioning, transformation, retrieval, extraction and indexing by generating a training data set.
Big data plays a major role in intelligent transportation systems (ITS) around the globe. It has a major impact on the design and application of ITS, making these systems more secure and efficient (Li et al., 2018).
In recent years, however, application requirements in the data analytics domain have kept increasing, and various limitations of the Hadoop framework have been recognized. This has established a new trend of domain-specific platforms that optimize big data processing and handle unprecedented problems with new solutions (Sharif et al., 2013). Big data has become a widespread term in the software industry, and large-scale data keeps growing year after year; recent research reports say that by 2020 it will have grown to zettabytes and yottabytes of storage in systems. Storage-related and data set problems can therefore be addressed with novel approaches (Sharif & Mohammad, 2014). Some researchers have suggested hybrid models to solve various data set issues. A hybrid model integrates more than one model, which is an efficient way to improve performance and to handle the problems and weaknesses of data sets (Mohamad et al., 2016; Paradarami et al., 2017).
III. Proposed System
To overcome these challenges and improve the efficiency of big data processing, we propose HHHDM, which integrates MapReduce with HHDM.
A. Introduction to HHDM
• HHDM is a functional core abstraction that supports the optimization and parallel execution of big data programs and applications.
• An HHDM is defined as HHDM[T, R]: a function whose input is of type T and whose output is of type R.
• Apart from this core function, an HHDM includes information about data dependencies, location, functionality and state.
• HHDM is strongly typed, lightweight and functionally defined.
B. Representation of HHDM
Attributes of HHDM:
1. ID: The identifier of the HHDM. It must be unique within each HHDM context.
2. INTYPE and OUTTYPE: Used to check type correctness during program planning and execution.
3. CATEGORY: The category of the HHDM, either DDM or DFM, used during program execution.
4. CHILDREN and DEPENDENCY: These attributes are used to reorder or reuse big data jobs.
5. LOCATION: The HHDM holds the address (URL) of the program, so that it can be used from anywhere within the HHDM context.
6. FUNCTION: The core attribute of an HHDM; it specifies how the output of the program is computed.
7. STATE: Provides the status of the program or application.
These attributes are used according to the programmer's design. Through these attributes an HHDM is defined to be
• Functional
• Portable
• Location-aware
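The attribute list above can be pictured as a plain record type. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the class name HHDMNode, the field names and the three state values are hypothetical, chosen only to mirror the attributes named in the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")  # input element type
R = TypeVar("R")  # output element type


class Category(Enum):
    DDM = "DDM"  # distributed data matrix (leaf node)
    DFM = "DFM"  # distributed functional matrix (non-leaf node)


class State(Enum):
    # Hypothetical state names; the paper only says that a state is kept.
    DECLARED = "declared"
    RUNNING = "running"
    COMPUTED = "computed"


@dataclass
class HHDMNode(Generic[T, R]):
    node_id: str                      # ID: unique within one context
    in_type: type                     # INTYPE
    out_type: type                    # OUTTYPE
    category: Category                # CATEGORY: DDM or DFM
    function: Optional[Callable[[List[T]], List[R]]] = None  # FUNCTION
    children: List["HHDMNode"] = field(default_factory=list)  # CHILDREN
    dependency: str = "1:1"           # DEPENDENCY on the children
    location: Optional[str] = None    # LOCATION: address (URL) of the node
    state: State = State.DECLARED     # STATE
```

Keeping the input and output types next to the function is what allows a planner to check type correctness before a job runs, as item 2 describes.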
Categories of HHDM:
An HHDM is an independent core object with a tree-based structure that consists of the following two types of nodes:
DDM (Distributed Data Matrix): the leaf nodes of the HHDM hierarchy, which hold the data. A DDM is atomic and includes the ID, SIZE and LOCATION of the data. It is used for path specification in HHDM and is defined as HHDM[path, T].
DFM (Distributed Functional Matrix): the non-leaf nodes, which represent the high-level programming abstraction in HHDM and hold their children. A DFM produces composable output: its output is the input of subsequent operations. While it is in the execution state, a DFM wraps the data of its children and of other nodes.
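The tree structure described above can be sketched in a few lines of Python. This is an illustrative single-process model only, assuming that a DFM simply concatenates the outputs of its children before applying its function; the class layout and the compute helper are hypothetical and do not represent the distributed runtime.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class DDM:
    """Leaf node: holds the data together with its ID and location."""
    node_id: str
    location: str
    data: List[str]

    @property
    def size(self) -> int:
        return len(self.data)


@dataclass
class DFM:
    """Non-leaf node: a function applied to the outputs of its children."""
    node_id: str
    function: Callable[[List], List]
    children: List[Union["DDM", "DFM"]] = field(default_factory=list)


def compute(node):
    """Evaluate the tree bottom-up; a DFM's output feeds its parent."""
    if isinstance(node, DDM):
        return list(node.data)
    inputs = []
    for child in node.children:
        inputs.extend(compute(child))
    return node.function(inputs)
```

For example, two DDM leaves holding lines of text can be placed under a DFM that splits lines into words, and that DFM can in turn be the child of another DFM that post-processes the words.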
Data Dependencies of HHDM:
Data dependencies are divided into four types:
1. One-To-One (1:1)
2. One-To-N (1:N)
3. N-To-One (N:1)
4. N-To-N (N:N)
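One possible reading of these four types is in terms of partitions: how many output partitions each input partition feeds, and how many input partitions each output partition reads. The Python sketch below classifies a dependency under that assumption; the function name and the edge-map representation are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def classify_dependency(edges: Dict[int, List[int]]) -> str:
    """Classify a dependency from a map: input partition -> output partitions.

    Assumes at least one input partition and that every input partition
    feeds at least one output partition.
    """
    # Largest number of distinct outputs fed by a single input partition.
    fan_out = max(len(set(outs)) for outs in edges.values())

    # Number of distinct inputs read by each output partition.
    fan_in: Dict[int, int] = {}
    for outs in edges.values():
        for out in set(outs):
            fan_in[out] = fan_in.get(out, 0) + 1

    many_out = fan_out > 1
    many_in = max(fan_in.values()) > 1
    if many_out and many_in:
        return "N:N"
    if many_out:
        return "1:N"
    if many_in:
        return "N:1"
    return "1:1"
```

Under this reading, an element-wise transformation would be 1:1, while a full shuffle, in which every input partition contributes to every output partition, would be N:N.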
Advantages of proposed system:
• HHDM is functionally defined.
• It is strongly typed.
• It provides reusability of jobs across programs and applications.
• The location path is available in HHDM.
IV. Implementation
HHDM Function: An HHDM function transforms input data into output and can be applied under different semantics, each targeting a different execution context for the datasets in HHDM. One HHDM function has three semantics, Fp, Fa and Fc:
Fp is the basic semantics of a function
Fp : List[T] -> List[R]
Fa is the aggregation semantics of a function
Fa : (List[T], List[R]) -> List[R]
Fc is the combination semantics of a function
Fc : (List[R], List[R]) -> List[R]
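To make the three signatures concrete, the following Python sketch instantiates them for word count, with T = str (a word) and R = (word, count) pairs. The function names f_p, f_a and f_c and the choice of sorted pair lists are assumptions made for illustration; only the signatures come from the text.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, int]


def f_p(words: List[str]) -> List[Pair]:
    """Basic semantics Fp: List[T] -> List[R]."""
    return sorted(Counter(words).items())


def f_c(left: List[Pair], right: List[Pair]) -> List[Pair]:
    """Combination semantics Fc: (List[R], List[R]) -> List[R]."""
    merged = Counter(dict(left))
    merged.update(dict(right))  # adds the counts of matching words
    return sorted(merged.items())


def f_a(words: List[str], acc: List[Pair]) -> List[Pair]:
    """Aggregation semantics Fa: (List[T], List[R]) -> List[R]."""
    return f_c(acc, f_p(words))
```

In this sketch all three routes agree: applying f_p to the whole input, folding a second chunk into an earlier partial result with f_a, and merging two partial results with f_c give the same counts. That agreement is what would let a runtime pick whichever semantics suits the execution context.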
HHDM Composition: HHDM inherits the idea of functional composition and supports two basic types of composition:
HHDM[T, R] compose HHDM[I, T] => HHDM[I, R]
HHDM[T, R] andThen HHDM[R, U] => HHDM[T, U]
These two patterns are commonly used in functional programming and can be applied recursively to HHDM sequences to meet complicated composition requirements.
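The two composition patterns can be sketched with an ordinary wrapper around a list-to-list function. The Python class below is a hypothetical illustration, not the framework's API; and_then stands in for andThen to follow Python naming conventions.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class HHDM(Generic[T, R]):
    """Wraps a function from List[T] to List[R]."""

    def __init__(self, fn: Callable[[List[T]], List[R]]) -> None:
        self.fn = fn

    def __call__(self, data: List[T]) -> List[R]:
        return self.fn(data)

    def compose(self, before: "HHDM") -> "HHDM":
        """HHDM[T, R] compose HHDM[I, T] => HHDM[I, R]: run `before` first."""
        return HHDM(lambda data: self.fn(before.fn(data)))

    def and_then(self, after: "HHDM") -> "HHDM":
        """HHDM[T, R] andThen HHDM[R, U] => HHDM[T, U]: run `after` last."""
        return HHDM(lambda data: after.fn(self.fn(data)))


# Example: split lines into words, then map each word to its length.
tokenize = HHDM(lambda lines: [w for line in lines for w in line.split()])
lengths = HHDM(lambda words: [len(w) for w in words])
pipeline = tokenize.and_then(lengths)
```

Here lengths.compose(tokenize) and tokenize.and_then(lengths) describe the same pipeline; the two operators differ only in which side of the expression runs first.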
Interaction with HHDM: Five types of actions are used to interact with an HHDM: Compute, Sample, Count, Traverse and Trace. HHDM applications are designed to be interactive at runtime in an asynchronous manner.
Creating a frame:
Method:
In the first method, a frame is created by extending the Frame class, which is defined in the java.awt package.
The program uses three methods:
setTitle: Sets the title of the frame. It takes a String argument, which becomes the title.
setVisible: Makes the frame visible. It takes a boolean argument: if true is passed the window is visible, otherwise it is not.
setSize: Sets the size of the window. The first argument is the width of the frame and the second argument is its height.
V. Case Study
As a case study, different data sets of text documents are used to perform word count and sorting with two methods, HHDM and Spark. The running software is shown in the following figures.
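For reference, the two jobs used in the case study can be stated as plain single-machine Python functions. This sketch only pins down what word count and sorting compute; it uses neither HHDM nor Spark, and the whitespace tokenization and lower-casing are assumptions, since the paper does not specify them.

```python
from collections import Counter
from typing import Dict, List


def word_count(text: str) -> Dict[str, int]:
    """Count occurrences of each whitespace-separated, lower-cased word."""
    return dict(Counter(text.lower().split()))


def sort_words(text: str) -> List[str]:
    """Return the lower-cased words of the text in ascending order."""
    return sorted(text.lower().split())
```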
Figure 1: Home screen
Figure 1 shows the home screen of the project. The frame provides options for uploading the intended data set, choosing the execution type and so on, and for running the job with HHDM and with Spark separately.
Figure 2: Upload dataset
Figure 2 shows how the input file or data set is chosen from the place where it is stored on the system. In this step we select the required input data set and load it using the Open button of the frame.
Figure 3: Dataset output type display
Figure 3 shows the selection of the execution type, word count or sort, which determines the output displayed after execution. The chosen functionality is then run by HHDM on the selected input.
Figure 4: Run HHDM
Figure 4 shows the execution of the chosen program with the HHDM method. The HHDM job completion time for this program is 1375 ms. The DDM and DFM techniques in the HHDM method work similarly to the MapReduce framework in Hadoop.
Figure 5: Run using Spark
Figure 5 shows the execution of the same data set or job with Spark. The observed job completion time with Spark is 8709 ms.
Figure 6: Job completion chart
Figure 6 shows the job completion times of HHDM and Spark as a chart of the results above. The HHDM value in the chart is lower than that of Spark, i.e. the HHDM job completion time is less than the Spark job completion time.
VI. Results
The execution times of the jobs are shown in the following tables.
Table 1: Execution time for Word Count
Input        HHDM    Spark
Test1.txt    1066    7870
Test2.txt    1078    8063
Test3.txt    1328    7907
Table 1 shows the job completion times for the different data sets in the word count application. These values are represented graphically in Figure 7.
Figure 7: Graphical representation of word count jobs
Figure 7 shows that the execution time for HHDM is far less than that of Spark for all three data sets, which are represented in different colors (blue, red and green).
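The relative difference between the two systems can be computed directly from the tabulated times. The Python sketch below uses the word count values from Table 1; the helper names are illustrative, and the unit is assumed to be milliseconds, following the "M.sec" notation used in the case study.

```python
# Execution times from Table 1 as (HHDM, Spark); milliseconds is an assumption.
times_ms = {
    "Test1.txt": (1066, 7870),
    "Test2.txt": (1078, 8063),
    "Test3.txt": (1328, 7907),
}


def reduction_percent(hhdm_ms: float, spark_ms: float) -> float:
    """Share of the Spark execution time saved by HHDM, in percent."""
    return 100.0 * (spark_ms - hhdm_ms) / spark_ms


def speedup(hhdm_ms: float, spark_ms: float) -> float:
    """How many times faster the HHDM run is than the Spark run."""
    return spark_ms / hhdm_ms


for name, (hhdm, spark) in times_ms.items():
    print(f"{name}: reduction {reduction_percent(hhdm, spark):.1f}%, "
          f"speedup {speedup(hhdm, spark):.2f}x")
```

Reporting both figures is useful because a percentage reduction and a speedup factor describe the same measurement on different scales.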
Table 2: Execution time for sorting.
Input        Run HHDM    Run Spark
Text1.txt    1066        7870
Text2.txt    1078        8063
Text3.txt    1328        7907
Table 2 shows the job completion times for the different data sets in the sorting application. These values are represented graphically in Figure 8.
Figure 8: Graphical representation of sorting jobs
Figure 8 shows that the execution time for HHDM is far less than that of Spark for all three data sets, which are represented in different colors (blue, red and green).
VII. Conclusion
In this paper, HHDM, a functional and strongly-typed meta-data abstraction, is implemented, together with a runtime system that supports the execution, management and optimization of HHDM applications. Applications written in HHDM are natively composable and can be merged with existing software applications. The movement of data in HHDM jobs is optimized automatically before execution at runtime. HHDM lets the programmer concentrate on the logic by automating the integration and optimization processes. The results show that execution time is reduced when HHDM is used instead of Spark. The improvement in performance is 65-70%.
References:
[1] Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. "The Stratosphere Platform for Big Data Analytics". VLDB J., 23(6), 2014.
[2] Anju Abraham and Shyma Kareem. "Security and Clustering of Big Data in Map Reduce Framework". International Journal of Advance Research, Ideas and Innovations in Technology, Volume 4, Issue 1, pp. 199, ISSN: 2454-132X, 2018.
[3] Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun C Murthy, and Carlo Curino. "Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications". In SIGMOD, 2015.
[4] ChunWei Tsai, Chin Feng Lai, Han Chieh Chao, and Athanasios V. Vasilakos. "Big Data Analytics: A Survey". Journal of Big Data, 2(21), 2015.
[5] Corrigan, P. Zikopoulos, K. Parasuraman, T. Deutsch, D. Deroos, and J. Giles. "Harness the Power of Big Data: The IBM Big Data Platform". 1st ed. New York, NY, USA: McGraw-Hill, Nov. 2012.
[6] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. "Machine Learning: The High Interest Credit Card of Technical Debt". In SE4ML: Software Engineering for Machine Learning, 2014.
[7] Dongyao Wu, Sherif Sakr, Liming Zhu, and Qinghua Lu. "Composable and Efficient Functional Big Data Processing Framework". In IEEE Big Data, 2015.
[8] Jiawei Yuan and Yifan Tian. "Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset". IEEE, 2017.
[9] Li Zhu, Fei Richard Yu, Yige Wang, Bin Ning, and Tao Tang. "Big Data Analytics in Intelligent Transportation Systems: A Survey". 1524-9050, IEEE, 2018.
[10] Manogaran, G., Varatharajan, R., Lopez, D., Kumar, P.M., Sundarasekar, R., Thota, C., 2017. A new architecture of Internet of Things and big data ecosystem for secured smart healthcare monitoring and alerting system. Future Gener. Comput. Syst. 82, 375–387.
[11] Mattew Malensek, Walid Budgaga, Ryan Stern, Sangmi Lee Pallickara, and Shrideep Pallickara. "Trident: Distributed Storage, Analysis, and Exploration of Multidimensional Phenomena". IEEE, 2018.
[12] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. "Spark SQL: Relational Data Processing in Spark". In SIGMOD, pages 1383–1394, 2015.
[13] Mohamad, M., Selamat, A., 2016. A new hybrid rough set and soft set parameter reduction method for spam e-mail classification task. Lecture Notes in Artificial Intelligence, LNAI 9806 (9806), 18–30.
[14] Paradarami, N.D., Tulasi, K., Bastian, Wightman, J.L., 2017. A hybrid recommender system using artificial neural networks. Expert Syst. Appl. 83, 300–313.
[15] Qian, J., Lv, P., Yue, X., Liu, C., Jing, Z., 2015. Hierarchical attribute reduction algorithms for big data using MapReduce. Knowl.-Based Syst. 73, 18–31.
[16] Sherif Sakr and Mohamed Medhat Gaber, editors. "Large Scale and Big Data - Processing and Management". Auerbach Publications, 2014.
[17] Sherif Sakr, Anna Liu, and Ayman G. Fayoumi. "The Family of MapReduce and Large-scale Data Processing Systems". ACM CSUR, 46(1):11, 2013.
[18] Triguero, I., Peralta, D., Bacardit, J., García, S., Herrera, F., 2015. MRPR: a MapReduce solution for prototype reduction in big data classification. Neurocomputing 150, 331–345.
[19] Y. Zhang, S. Chen, Q. Wang, and G. Yu. "i2mapreduce: Incremental MapReduce for mining evolving Big Data". IEEE Transactions on Knowledge and Data Engineering, vol. 27, 2015.
[20] Zhipeng Gao, Kun Niu, Yidan Fan, and Zhenyiying. "MR-Mafia: Parallel Subspace Clustering Algorithm Based on MapReduce for Large Multi-dimensional Datasets". International Conference on Big Data, IEEE, 2018.