SlideShare a Scribd company logo
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 960
Association Rule Mining using RHadoop
Anurag Agrahari1, Prof D.T.V. Dharmaji Rao2
1M.tech Student, Deptt. Of Computer Sci & Engg, AITAM College, Tekkali, Srikakulam, Andhra Pradesh, India.
2Professor, Deptt of Computer Sci & Engg, AITAM College, Tekkali, Srikakulam, Andhra Pradesh, India.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - With the advancement of computer related
technology, the world is generating several Exabyte of
data per year. Data generated through many sources like
business processes, transactions, social networking sites,
web servers, and many other source are remains in
structured as well as unstructured form increased by
exponential rates. Processing of huge amount of data or
extracting hidden information is always a challenging and
tuff task. Many Data Mining and statistical techniques are
being used to find out the hidden pattern hidden in it.
Association Rule Mining is one of core Data mining
techniques being widely used in Big Data Analytics. Many
tools are used in the Big Data Analysis. R is statistical
language, widely used in the analysis. Hadoop can be
integrated with R known as RHadoop. In this paper, we
will briefly discuss what is big data, Hadoop, integration
with R and analysis of result on training data set using
association Rule.
Key Words: Big Data, Data mining, Association Rule
mining, RHadoop, Ambari.
1. INTRODUCTION
Today large datasets are a part of every field that
somehow involves digital information. Companies,
programs and network sensors produce huge amount of
fresh data, around 2.5 EBs (Exabyte’s) every single day.
The total amount of data in the world was 4.4 zettabytes in
2013. According to prediction, this is set to rise steeply to
44 zettabytes by 2020[1].A lot of statistical techniques and
data mining techniques are used to analysis the Big data
set. Big Data analysis refers to data that exceeds the
typical storage, processing, and computing capacity of
conventional databases using data analysis techniques
[17]. Most popular distributed computing paradigms
nowadays uses Hadoop framework for handling the large
scale data. Data mining, a technique to understand and
find out the hidden knowledge from raw data and convert
into useful information, is increasingly being used in a
variety of fields like marketing, business intelligence,
scientific discoveries, biotechnology, Internet searches,
and multimedia. Data mining is an interdisciplinary field
combining ideas from statistics, machine learning, and
natural language processing. Data mining in such
environments requires a utilization of the available
resources. The application of Big data in expert system in
agriculture for making of decision support system, will
open the new research area [21][22].
2. RELETED WORKS
Association rule mining works was carried out by [2]
where she applied association rule with Map Reduce
concept in cloud. Cloud itself provides the platform to
analysis the Big Data. Most of Organization is using cloud
for their application. Furthermore a lot of work for
Association Rule Mining (ARM) is one of important and
popular technique of data mining which find interesting
correlation or association between set of items or
attributes and also frequent patterns. in large database
[3].Association Rule mining is widely used in Many other
area like business application, bi-informatics, medical
diagnosis, text analysis and business Intelligence and
consumer behavior. A lot of works carried out using
Hadoop and Association Rule mining.
Furthermore, very few and limited works carried
out using Rhadoop. The concept of Rhadoop is discussed
by[4],where they conclude the concept of Rhadoop, how it
work with the Hadoop Component, integration with the
hadoop, ODBC and other component. Memory-Sensitive
Heterogeneous Earliest Finish Time (MSHEFT) algorithms
were implemented by [5] where he used the Hadoop and
spark platform with R to analysis and performance
evaluation of the algorithms. Being well known statistical
language, open source software, and new package can be
developed for computation of large and complex data,
large genotype–phenotype datasets [6]. A new tool was
developed by [6] named BlueSNP using R and Hadoop for
large scale cluster. Linear regression analysis was
developed by Bogdan Oancea using R and Hadoop[7]. A
state of art research on sentiment analysis was done by
[8][9] using Rhadoop of twitter data. R is also providing
the API interface for extracting the twitter data.
Furthermore, R can be integrated with other Big data
other Hadoop technology such as Spark known
SparkR[10].Comparative study between the Rapriori and
spark was done by[34].
Furthermore, in Map Reduce model, some parallel Apriori
algorithms are implemented using Hadoop-MapReduce
model [23][24][25][26]. The name of the algorithm
Apriori is based on the Apriori property which states that
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 961
all nonempty subsets of a frequent itemset must also be
frequent [2]. Many sequential and parallel algorithms have
been proposed to improve the performance of Apriori
algorithm. Some popular sequential approaches are Hash-
based technique, Transaction reduction, Partitioning,
Sampling and Dynamic itemset counting (DIC)
[26][27][28][29][30]. A algorithms proposed for three
parallel version of Apriori algorithm,Count Distribution
(CD), Data Distribution (DD) and Candidate
Distribution[32].
3. ASSOCIATION RULE MINING
Association rule mining (ARM) is a popular method of data
mining method for discovering interesting relations
between terms in the dataset. The concept of strong rules
was used by Agarwal et al [11] to find association rules in
items. An association rule defines relation between two set
of items for e.g.
{A, B} => {C}.
In a purchase relation this would indicate if a person buys
A and B together, he/she is more likely to also buy C.
Mining association rule consists of following two steps:
Finding the item set which are frequent in the
data set: The frequent item sets are set of those items
whose support (sup (item)) in the data set is greater than
the minimum required support (min_sup). Considering the
above example all three A, B and C belongs to frequent
item set and sup {A, B} and sup {C} would be greater than
the min_sup. The support of an item set is defined as
proportion of transactions which contains the item set.
Generating association rule from frequent item
set: Generating the interesting rules from the frequent
item sets on the basis of confidence (conf). The confidence
of the above rule will be sup {A, B} divided by sup{C}. If
the confidence of the rule is greater than the required
confidence, the rule can be considered as an interesting
one.
The frequent item set required for generation of
association rule can be generated using Apriori algorithm.
If the confidence of the rule is greater than the required
confidence, then the rule can be considered as an
interesting one.
Fig 1. Map/reduce at n node showing n iterations [33].
The following Step in association mining will take place in
Map reduce. As figure 1 show that data submitted to the
process will have the multiple phase/step. Those
step/phase are as below[23][24][25]. Hadoop paradigm
have two function i.e. mapper and reducer.The input and
output of these functions must be in form of (key, value)
pairs. The Mapper takes the input (k1, v1) pairs from
HDFS and produces a list of intermediate (k2, v2) pairs. An
optional Combiner function is applied to reduce
communication cost of transferring intermediate outputs
of mappers to reducers.
Phase 1
Mapper (key, value)
// key: TID
// value: itemsets in
transaction Ti
for each transaction Ti
assigned to Mapper do
for each itemset in Ck
do
if itemset ∈ Ti
output (itemset, 1);
end if
end for
end for
Phase 2
Combiner (key, value)
// key: itemset
// value: list (1)
for each itemset do
for each 1 in list (1) of
corresponding itemset do
itemset.local_sup + = 1;
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 962
end for
output (itemset,itemset.local_sup);
end for
Phase 3
Reducer (key, value)
// key: itemset
// value: list (local_sup)
for each itemset do
for each local_sup in
list (local_sup) of
corresponding itemset do
itemset.sup + = local_sup;
end for
if itemset.sup ≥ minimum
support;
output (itemset, itemset.sup);
end for
These algorithms can be classified in two categories: 1-
phase of map-reduce and k-phase of mapreduce [2]. Some
algorithms used all the three functions mapper, reducer
and combiner while some used only mapper and reducer
function.
4. RHADOOP
4.1 Integrating with R with Hadoop
R is a language and environment for statistical computing
was developed at Bell Laboratories (formerly AT&T, now
Lucent Technologies) by John Chambers and colleagues
[12]. It is upgraded version of S. R was initially written by
Robert Gentleman and Ross Ihaka—also known as “R & R”
of the Statistics Department of the University of
Auckland[13].
There are several approaches to integrate R and
Hadoop: R and Streaming, Rhipe and RHadoop but in this
paper we will present only RHadoop. RHadoop is an open
source project developed by Revolution Analytics
(http://www.revolutionanalytics.com/) that provides
client-side integration between R and Hadoop. This allows
running Map Reduce jobs within R and consists of a
collection of several packages:
plyrmr - provides plyr like processing functions for
structured data type, having capabilities of handling large
data sets stored with Hadoop;
rmr - contains a collection of functions that provide Map
Reduce model implementation in R;
rdfs - is an interface between R and HDFS, providing file
management operations in R for data stored in HDFS;
rhbase - is an interface between R and Hbase, and
provides management functions in R for Hbase databases;
ravro -A package that adds the ability to read and write
avro files from local and HDFS file system and adds an
avro input format for rmr2[16].
Fig 2. RHadoop Framework[14]
As figure 1 clearly show the working with different
component of Mapreduce and HDFS. The package rhdfs is
used to read and write function used for data in cluster.
Same wise,rmr2 and rmr is used to do task of instruction
with Mapreduce programming paradigm. The rhdfs
package provides the mechanism to store and fetch the
data from HDFS. It only installed at the each data node of
the cluster. The rmr2 and rmr(for hadoop version 1)
provide the facilities to do map reduce in cluster. R script
or program only runs over the rmr and rhdfs package.
Running the R script need some other package of R, will be
discussed on next subsection.
4.2 R Programming With Rhadoop
Fig 3. R interface RHadoop over Hadoop [15].
As Figure 3 show that the detailed Architecture of
Hadoop and Rhadoop. The brief explanation was covered
by author on [15]. The first layer known as hardware layer
which include the commodity computer over the network
or in cloud cluster. Middle layer manage the Map reduce
and HDFS system component of Hadoop. The script
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 963
provided over the RHadoop via R is act through Rhadoop
to Hadoop. The next layer is a layer that provides an
interface for data analysis. Interface layer which having
streaming component to other language like c,c++,Python
and other programming language. R script communicates
to hadoop component through the Rhadoop libraries
which defined above section. We can use Rhipe or
Rhadoop libraries that build an interface between Hadoop
and R, allowing users to access data from the Hadoop file
system and write their own scripts for implementing Map
and Reduce jobs [15].R con not communicate to hadoop
component, al through it need some more and necessary
component of R core language detailed discussed on next
section of Project setup.
5. PROJECT SETUP
For implementing the concept, we had created the
environment of 8 node cluster AITAM Central Library.
Ambari, open source tools for managing the cluster,
monitor the cluster and installing the component of
Hadoop component. For the implementing the concept the
Horton work’s HDP 2.2 ware used along with Ambari
Version 1.7.1 over the Cent Os version 6.6 using the Virtual
Platform.
Commodity Specification detail-
For implementing the algorithms, we had used different
node with different processor based system, memory,
RAM, cache Size, Different clock speed, and different core
system. This is heterogeneous kind of system cluster.
Along with this we have used the VMware for some system
(n0.exmple.com, n1.example.com, n2.example.com).
No.example.com machine have two installation i.e one for
ambari server of RAM size 2 GB and rest for the Name
node server based on the window 7 base operating
system. The guest operating system was cent os 6.6.
Name Core Ip address RAM
N0.example.com 8(i7) 172.16.6.181 7.67 GB
N1.example.com
N2.example.com
8(i7)
8(i7)
172.16.6.182
172.16.6.183
3.73 GB
3.73 GB
N3.example.com
N4.example.com
N5.example.com
N6.example.com
N7.example.com
N8.example.com
4(i3)
172.16.6.184
172.16.6.185
172.16.6.186
172.16.6.187
172.16.6.188
172.16.6.190
3.73 GB
Table 1. Cluster commodity Hardware Specification
Ambari Installation A completely open source
management platform for provisioning, managing,
monitoring and securing Apache Hadoop clusters [18][17].
Ambari enables System Administrators to:
Provision a Hadoop Cluster Ambari provides a
step-by-step wizard for installing Hadoop services across
any number of hosts. It handles configuration of Hadoop
services for the cluster.
Manage a Hadoop Cluster Ambari provides central
management for starting, stopping, and reconfiguring
Hadoop services across the entire cluster.
Monitor a Hadoop Cluster Ambari provides a
dashboard for monitoring health and status of the Hadoop
cluster. Ambari leverages Ambari Metrics System for
metrics collection. Ambari leverages Ambari Alert
Framework for system alerting and will notify you when
your attention is needed (e.g., a node goes down,
remaining disk space is low, etc).
Apart from this, Ambari is very useful in large
scale cluster setup because it provide consistent, secure
platform for operational control.
Fig 4. Conceptual Layout Of Ambari server installation
It is Simplified Installation and Configuration and
Management, Centralized Security Setup, Full Visibility
into Cluster Health and Highly Extensible and
Customizable [17].
Fig 5. Screenshot of ambari server of 4 node cluster.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 964
More node can be add and remove from ambari server
dashboard. Service can start and stop from same dash
board. For installing the Ambari we need to create repo for
installation of hadoop service over node/client.
name Address/base Url
HDP-2.2 http://n0.example.com/hdp/HDP/cent
os6/2.x/GA/2.2.0.0
HDP-UTILS-
1.1.0.20
http://n0.example.com/hdp/HDP-TILS-
1.1.0.20/repos/centos6
Table 2- Ambari repo base Url(Cent os6.6)
We have to provide the repositories’ for hadoop. In our
case we have used Horton work product HDP 2.2 and
other hadoop utilities. The components of HDP 2.2 are
summarized in Table 2.
SERVICE VERSION DESCRIPTION
Zookeeper 3.4.6.2.2.0.0 Centralized
service which
provides highly
reliable
distributed
coordination
HDFS 2.6.0.2.2.0.0 Apache Hadoop
Distributed File
System
YARN +
MapReduce2
2.6.0.2.2.0.0 Apache Hadoop
NextGen
MapReduce
(YARN)
Ganglia 3.5.0 Ganglia Metrics
Collection system
(RRDTool will be
installed too)
Nagios 3.5.0 Nagios
Monitoring and
Alerting system
Table 3- HDP 2.2 Cluster Stack Descriptions
Name ip address services
n1.example.com 172.16.6.181 App Timeline Server
Ganglia Monitor
HDFS Client
History Server
NameNode
NodeManager
ResourceManager
n2.example.com 172.16.6.182 DataNode
Ganglia Monitor
Ganglia Server
HDFS Client
MapReduce2 Client
Nagios Server
YARN Client
ZooKeeper Client
n3.example.com 172.16.6.183 DataNode
Ganglia Monitor
HDFS Client
MapReduce2 Client
SNameNode
YARN Client
ZooKeeper Client
ZooKeeper Server
n4.example.com
n5.example.com
n6.example.com
n7.example.com
n8.example.com
172.16.6.184
172.16.6.185
172.16.6.186
172.16.6.187
172.16.6.188
172.16.6.189
DataNode
Ganglia Monitor
HDFS Client
MapReduce2 Client
YARN Client
ZooKeeper Client
Table 4- Description of service running at nodes
Rhadoop And R installation- R and Rhadoop installation
is easy task. Before installing Rhadoop, The dependent
package of Rhadoop should be installed in each node of
cluster. Revolution R Open (RRO) is the enhanced
distribution of R from Revolution Analytics. It is a
complete open source platform for statistical analysis and
data science. RRO 8.0.1 is based on (and 100% compatible
with) version 3.1.2 of the R language, and includes
additional capabilities for performance, reproducibility
and platform support. Along with this Rstudio ware
integrated with R. The concerning package of the Rhadoop
and R software shown in table 3.
SERVICE/Package Version Description
rmr2 3.4.6.2.2.0.0 Master, client
rhdfs 1.0.8 Only at Master
plyrmr 0.6 Master, client
rHadoopClient 0.2 Master, client
Nagios 3.5.0 Master, client
Rcurl 1.95-4.5 Master, client
rJava 0.9-6 Master, client
RJSONIO 1.3-0 Master, client
iterators 1.0.7 Master, client
itertools 0.1-3 Master, client
digest 0.6.8 Master, client
Rcpp 0.11.4 Master, client
jsonlite 0.9.14 Master, client
mime 0.2 Master, client
R6 2.0.1 Master, client
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 965
functional 0.6 Master, client
httr 0.6.1 Master, client
stringr 0.6.2 Master, client
memoise 0.2.1 Master, client
whisker 0.3-2 Master, client
evaluate 0.5.5 Master, client
rstudioapi 0.2 Master, client
brew 1.0-6 Master, client
roxygen2 4.1.0 Master, client
reshape2 1.4.1 Master, client
plyr 1.8.1 Master, client
devtools 1.7.0 Master, client
caTools 1.17.1 Master, client
R.methodsS3 1.7.0 Master, client
RColorBrewer 1.1-2 Master, client
pryr 0.1 Master, client
proto 0.3-10 Master, client
dichromat 2.0-0 Master, client
gtable 0.1.2 Master, client
colorspace 1.2-4 Master, client
scales 0.2.4 Master, client
munsell 0.4.2 Master, client
labeling 0.3 Master, client
DBI 0.3.1 Master, client
ggplot2 1.0.0 Master, client
acepack 1.3-3.3 Master, client
latticeExtra 0.6-26 Master, client
lhs 0.10 Master, client
rmr2 3.3.1 Master, client
sp 1.0-17 Master, client
Hmisc 3.15-0 Master, client
hydroPSO 0.3-4 Master, client
plyrmr 0.6.0 Master, client
Table 5. R and Rhadoop Package Description in
Node/master
6. RESULT AND ANALYSIS
Data Set was taken from well known source UCI data set
repository and other source. The dataset named chess,
connect, and mushroom ware downloaded from UCI data
set repository [19][20]. The short description about the
data set summarized in table 4.
Data set Instance attributes
Chess 3196 36
pumsb 49046 74
Connect 67557 42
Table 6. Data set And there Description [19][20].
The dataset provided by [19] are in integer form and space
delimiter. Each row of the dataset represents the new data
in the file.
For the conceptualizing the concept, we had fix the
support value as (.5) and count value as (.9). Table 5 show
the result of training data set with same support and count
value in without hadoop and 2 node, 4 node, 6 node and 8
node cluster. Running time count in second and process
run 4 times each to find out the average running time of
the cluster.
Dataset Without
hadoop
2
node
4
node
6
node
8
node
chess 31.157 24 12 10 7
pumsb 271.117 37 36 34 25
connect 82.150 42 40 25 18
Table 7. Result showing in different node
Result may have difference because of the network and
one single system is running logically two system. The
reduction of execution time is more evident for those
datasets which have longer execution time. Moreover, we
can see that when the size of the dataset increases, the
performance on multi-node Hadoop also gets better. The
result could get better if the ambari server is running on
separate system and master node running separately in
the cluster and having more RAM size. The value of
running time is average mean value and round of value in
second.
7. CONCLUSIONS
We implemented the improved Apriori Algorithm on
MapReduce programming model on the Hadoop platform.
The R language is widely used in the hadoop due to easy
and comfort for the analyst. Moreover, this distributed
algorithm can also cater to the distributed nature of the
input data. Furthermore, details of the management of the
distributed systems, such as data transferring among
nodes and node failures are taken care by Hadoop, which
adds a great deal of robustness and scalability to the
system. The performance could be enhanced if we have
the facilities to increase the RAM and high performing
Machine.
ACKNOWLEDGEMENT
I really thankful to the Director, Shri Prof. V.V Nageswara
Rao and Head of computer science department Shri Dr. G.
S. N. Murty and my guide shri prof D.T.V. Dharmaji Rao,
for there kind support and help.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 966
REFERENCES
[1] D. Kornack and P. Rakic, “Cell Proliferation
without Neurogenesis in Adult Primate
Neocortex,” Science, vol. 294, Dec. 2001, pp. 2127-
2130, doi:10.1126/science.1065467.
[2] Pallavi roys(2012), MINING ASSOCIATION RULES
IN CLOUD, MS thesis, North Dakota State
University of Agriculture and Applied Science,
Fargo, North Dakota August 2012Mr.
[3] Neeraj Raheja, Ravish Kumar(2012) ,
Optimization of Association Rule Learning in
Distributed Database using Clustering
Technique,International Journal of Scientific and
Research Publications, Volume 2, Issue 12,
December 2012 ,ISSN 2250-3153.
[4] HARISH D,ANUSHA M.S ,Dr. DAYA SAGAR
K.V(2015).,”BIG DATA ANALYSIS USING
RHADOOP”, ISSN: 2349-2163,Issue 4, Volume 2
(April 2015) pp 180-85.
[5] Bao Rong Chang, Yun-Da Lee, and Po-Hao Liao,
“Development of Multiple Big Data, Analytics
Platforms with Rapid Response”
https://doi.org/10.1155/2017/6972461.
[6] H.Huang et al., BlueSNP: R package for highly
scalable genome-wide associationstudies using
Hadoop clusters, BIOINFORMATICS
APPLICATIONS NOTE , Vol. 29 no. 1 2013, pages
135–136 doi:10.1093/bioinformatics/bts647.
[7] Bogdan OANCEA, LINEAR REGRESSION WITH R
AND HADOOP, Challenges of the Knowledge
Society. IT in Social Sciences.
[8] Thasnim K M1, M Sudheep Elayidom,
IMPLEMENTATION OF A NOVEL SENTIMENT
ANALYSIS TECHNIQUE FOR TWITTER DATA
USING RHADOOP, IJRET: International Journal of
Research in Engineering and Technology, eISSN:
2319-1163 | pISSN: 2321-7308.
[9] Shubham S. Deshmukh et al(2017)., Twitter Data
Analysis using R, International Journal of Science,
Engineering and Technology Research (IJSETR)
Volume 6, Issue 4, April 2017, ISSN: 2278 -7798.
[10] Shivaram Venkataraman et al(2016),
SparkR: Scaling R Programs with Spark, SIGMOD
’16, June 26–July 1, 2016, San Francisco, CA, USA.,
ISBN 978-1-4503-3531-7/16/06.
[11] R. Agrawal and R. Srikant(1994). Fast
algorithms for mining association rules. In Proc.
1994 Int. Conf.Very Large Data Bases, pages 487-
499, Santiago, Chile, September 1994.
[12] https://www.r-project.org/about.html
access on 03/09/17.
[13] R project,Cran project https://www.r-
project.org/contributors.html access on
05/09/17.
[14] Bao Rong Chang, Yun-Da Lee, and Po-Hao
Liao(2017), Development of Multiple Big Data
Analytics Platforms with Rapid Response,
Hindawi Scientific Programming Volume 2017,
Article ID 6972461,
https://doi.org/10.1155/2017/6972461.
[15] Bogdan OANCEA, Raluca Mariana
DRAGOESCU(2014), Integrating R and Hadoop for
Big Data Analysis, Romanian Statistical Review nr.
2 / 2014.
[16] Rhadoop wiki
https://github.com/RevolutionAnalytics/RHadoo
p/wiki access on 03/09/2017.
[17] Anurag Agrahari, Prof D.T.V. Dharmaji
Rao(2017), “A Review paper on Big Data:
Technologies, Tools and Trends,”
www.irjet.net,Volume: 04 Issue: 10,Oct -2017,e-
ISSN: 2395-0056 pp 640-649.
[18] Ambari wiki,https://ambari.apache.org/
access on 2/08/2017.
[19] http://fimi.ua.ac.be/data/ access on 12-
04-2015.
[20] Lichman, M. (2013). UCI Machine
Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA:
University of California, School of Information and
Computer Science.
[21] Violeta nicoleta opriş & ciprian
racuciu(2015), the expert systems analysis using
the concept of big data and cloud computing
services, Military Technical Academy, Faculty of
Military Electronic and Information Systems,
Bucharest, Naval Academy Scientific Bulletin,
Volume XVIII – 2015 – Issue 2.
[22] Anurag Agrahari ,Dr. ShashiKant
Tripathi(2012),A Theoretical Framework for
Development of Decision Support System for
Agriculture, RESEARCH INVENTY: International
Journal of Engineering and Science ISSN: 2278-
4721, Vol. 1, Issue 6 (October 2012), PP 50-55.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 967
[23] Li L. & Zhang M. (2011). The Strategy of
Mining Association Rule Based on Cloud
Computing. Proceeding of the 2011 International
Conference on Business Computing and Global
Informatization (BCGIN ’11). Washington, DC,
USA, IEEE: 475- 478.
[24] Li N., Zeng L., He Q. & Shi Z. (2012).
Parallel Implementation of Apriori Algorithm
Based on MapReduce. Proc. of the 13
th
ACIS
International Conference on Software
Engineering, Artificial Intelligence, Networking
and Parallel & Distributed Computing (SNPD ’12).
Kyoto, IEEE: 236 – 241.
[25] Lin M., Lee P. & Hsueh S. (2012). Apriori-
based Frequent Itemset Mining Algorithms on
MapReduce. Proc. of the 16
th
International
Conference on Ubiquitous Information
Management and Communication (ICUIMC ’12).
New York, NY, USA, ACM: Article No. 76.
[26] Yang X.Y., Liu Z. & Fu Y. (2010).
MapReduce as a Programming Model for
Association Rules Algorithm on Hadoop. Proc. of
the 3
rd
International Conference on Information
Sciences and Interaction Sciences (ICIS ’10).
Chengdu, China, IEEE: 99 – 102
[27] S. Park, Ming-Syan and P. S. Yu, “Using a
Hash-Based Method with Transaction Trimming
for Mining Association Rules,” in IEEE
Transactions on Knowledge and Data
Engineering, vol. 9, no. 5, pp. 813-825, 1997.
[28] J. S. Park, Ming-Syan and P. S. Yu, “An
Effective Hash-Based Algorithm for Mining
Association Rules,” in SIGMOD ACM, vol. 24, no. 2,
pp. 175-186,1995.
[29] M. J. Zaki, S. Parthasarathy, W. Li and M.
Ogihara, “Evaluation of Sampling for Data Mining
of Association Rules,” in Proceedings IEEE 7th
International Workshop on Research Issues in
Data Engineering, 1997, pp. 42-50.
[30] A. Savasere, E. Omiecinski and S. Navathe,
“An Efficient Algorithm for Mining Association
Rules in Large Databases,” in Proceedings 21st
VLDB Conference,Switzerland, 1995, pp. 432-444.
[31] S. Brin, R. Motwani, J. D. Ullman and S.
Tsur, “Dynamic Itemset Counting and Implication
Rules for Market Basket Data,” in ACM SIGMOD
Record, vol. 26, no. 2, pp. 255-264, 1997.
[32] R. Agrawal and J.C. Shafer, “Parallel
Mining of Association Rules,” in IEEETransactions
on Knowledge and Data Engineering, vol.8, no. 6,
pp.962-969,1996.
[33] Poonam Modgi et al, / (IJCSIT)
International Journal of Computer Science and
Information Technologies, Vol. 5 (3) , 2014, pp
3093 – 3097.
[34] Sanjay Rathee, Manohar Kaul, Arti
Kashyap, R-Apriori: An Efficient Apriori based
Algorithm on Spark,
http://www.iith.ac.in/~mkaul/papers/pikm09-
rathee.pdf. access on 03/09/17.

More Related Content

What's hot (20)

PDF
A Relative Study on Various Techniques for High Utility Itemset Mining from T...
IRJET Journal
 
PDF
A Firefly based improved clustering algorithm
IRJET Journal
 
PDF
Association Rule Hiding using Hash Tree
ijtsrd
 
PDF
5 parallel implementation 06299286
Ninad Samel
 
PDF
An Improved Frequent Itemset Generation Algorithm Based On Correspondence
cscpconf
 
PDF
The International Journal of Engineering and Science (The IJES)
theijes
 
PDF
A Survey on Improve Efficiency And Scability vertical mining using Agriculter...
Editor IJMTER
 
PDF
D0352630
iosrjournals
 
PDF
Modifed Bit-Apriori Algorithm for Frequent Item- Sets in Data Mining
idescitation
 
PDF
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
AM Publications,India
 
PDF
Review Over Sequential Rule Mining
ijsrd.com
 
PDF
Preprocessing and secure computations for privacy preservation data mining
IAEME Publication
 
PDF
REVIEW: Frequent Pattern Mining Techniques
Editor IJMTER
 
PPTX
Mining single dimensional boolean association rules from transactional
ramya marichamy
 
PDF
B017550814
IOSR Journals
 
PDF
An Efficient and Scalable UP-Growth Algorithm with Optimized Threshold (min_u...
IRJET Journal
 
PDF
3.[18 22]hybrid association rule mining using ac tree
Alexander Decker
 
PDF
Usage and Research Challenges in the Area of Frequent Pattern in Data Mining
IOSR Journals
 
PDF
Parallel Key Value Pattern Matching Model
ijsrd.com
 
A Relative Study on Various Techniques for High Utility Itemset Mining from T...
IRJET Journal
 
A Firefly based improved clustering algorithm
IRJET Journal
 
Association Rule Hiding using Hash Tree
ijtsrd
 
5 parallel implementation 06299286
Ninad Samel
 
An Improved Frequent Itemset Generation Algorithm Based On Correspondence
cscpconf
 
The International Journal of Engineering and Science (The IJES)
theijes
 
A Survey on Improve Efficiency And Scability vertical mining using Agriculter...
Editor IJMTER
 
D0352630
iosrjournals
 
Modifed Bit-Apriori Algorithm for Frequent Item- Sets in Data Mining
idescitation
 
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
AM Publications,India
 
Review Over Sequential Rule Mining
ijsrd.com
 
Preprocessing and secure computations for privacy preservation data mining
IAEME Publication
 
REVIEW: Frequent Pattern Mining Techniques
Editor IJMTER
 
Mining single dimensional boolean association rules from transactional
ramya marichamy
 
B017550814
IOSR Journals
 
An Efficient and Scalable UP-Growth Algorithm with Optimized Threshold (min_u...
IRJET Journal
 
3.[18 22]hybrid association rule mining using ac tree
Alexander Decker
 
Usage and Research Challenges in the Area of Frequent Pattern in Data Mining
IOSR Journals
 
Parallel Key Value Pattern Matching Model
ijsrd.com
 

Similar to Association Rule Mining using RHadoop (20)

PDF
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
BRNSSPublicationHubI
 
PDF
IRJET- Effecient Support Itemset Mining using Parallel Map Reducing
IRJET Journal
 
PDF
IRJET- Improving the Performance of Smart Heterogeneous Big Data
IRJET Journal
 
PDF
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
IJTET Journal
 
PDF
International journal of computer science and innovation vol 2015-n1-paper4
sophiabelthome
 
PDF
A Survey on Frequent Patterns To Optimize Association Rules
IRJET Journal
 
PPTX
Mining on Relationships in Big Data era using Improve Apriori Algorithm with ...
KamleshKumar394
 
PDF
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
BRNSSPublicationHubI
 
PDF
Intelligent Supermarket using Apriori
IRJET Journal
 
PDF
Hu3414421448
IJERA Editor
 
PDF
A Performance Based Transposition algorithm for Frequent Itemsets Generation
Waqas Tariq
 
PDF
Gr2411971203
IJERA Editor
 
PDF
Volume 2-issue-6-2081-2084
Editor IJARCET
 
PDF
Volume 2-issue-6-2081-2084
Editor IJARCET
 
PDF
PROJECT-109,93.pdf data miiining project
sampathkumarkorada
 
PDF
K355662
IJERA Editor
 
PDF
K355662
IJERA Editor
 
PDF
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
IRJET Journal
 
PDF
A framework for visualizing association mining results
Gurdal Ertek
 
PPT
Apriori and Eclat algorithm in Association Rule Mining
Wan Aezwani Wab
 
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
BRNSSPublicationHubI
 
IRJET- Effecient Support Itemset Mining using Parallel Map Reducing
IRJET Journal
 
IRJET- Improving the Performance of Smart Heterogeneous Big Data
IRJET Journal
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
IJTET Journal
 
International journal of computer science and innovation vol 2015-n1-paper4
sophiabelthome
 
A Survey on Frequent Patterns To Optimize Association Rules
IRJET Journal
 
Mining on Relationships in Big Data era using Improve Apriori Algorithm with ...
KamleshKumar394
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
BRNSSPublicationHubI
 
Intelligent Supermarket using Apriori
IRJET Journal
 
Hu3414421448
IJERA Editor
 
A Performance Based Transposition algorithm for Frequent Itemsets Generation
Waqas Tariq
 
Gr2411971203
IJERA Editor
 
Volume 2-issue-6-2081-2084
Editor IJARCET
 
Volume 2-issue-6-2081-2084
Editor IJARCET
 
PROJECT-109,93.pdf data miiining project
sampathkumarkorada
 
K355662
IJERA Editor
 
K355662
IJERA Editor
 
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
IRJET Journal
 
A framework for visualizing association mining results
Gurdal Ertek
 
Apriori and Eclat algorithm in Association Rule Mining
Wan Aezwani Wab
 
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
PDF
Kiona – A Smart Society Automation Project
IRJET Journal
 
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
PDF
Breast Cancer Detection using Computer Vision
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
Kiona – A Smart Society Automation Project
IRJET Journal
 
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
Breast Cancer Detection using Computer Vision
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Ad

Recently uploaded (20)

PPT
PPT2_Metal formingMECHANICALENGINEEIRNG .ppt
Praveen Kumar
 
PDF
Biomechanics of Gait: Engineering Solutions for Rehabilitation (www.kiu.ac.ug)
publication11
 
PPTX
Damage of stability of a ship and how its change .pptx
ehamadulhaque
 
PPTX
Day2 B2 Best.pptx
helenjenefa1
 
PPTX
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
PPTX
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
DOC
MRRS Strength and Durability of Concrete
CivilMythili
 
PDF
Set Relation Function Practice session 24.05.2025.pdf
DrStephenStrange4
 
PPTX
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
PPTX
Depth First Search Algorithm in 🧠 DFS in Artificial Intelligence (AI)
rafeeqshaik212002
 
PPTX
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
PPTX
Types of Bearing_Specifications_PPT.pptx
PranjulAgrahariAkash
 
PPTX
Heart Bleed Bug - A case study (Course: Cryptography and Network Security)
Adri Jovin
 
DOCX
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 
PDF
GTU Civil Engineering All Semester Syllabus.pdf
Vimal Bhojani
 
PPTX
The Role of Information Technology in Environmental Protectio....pptx
nallamillisriram
 
PPTX
Introduction to Design of Machine Elements
PradeepKumarS27
 
PPTX
Evaluation and thermal analysis of shell and tube heat exchanger as per requi...
shahveer210504
 
PDF
Pressure Measurement training for engineers and Technicians
AIESOLUTIONS
 
PPTX
artificial intelligence applications in Geomatics
NawrasShatnawi1
 
PPT2_Metal formingMECHANICALENGINEEIRNG .ppt
Praveen Kumar
 
Biomechanics of Gait: Engineering Solutions for Rehabilitation (www.kiu.ac.ug)
publication11
 
Damage of stability of a ship and how its change .pptx
ehamadulhaque
 
Day2 B2 Best.pptx
helenjenefa1
 
Shinkawa Proposal to meet Vibration API670.pptx
AchmadBashori2
 
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
MRRS Strength and Durability of Concrete
CivilMythili
 
Set Relation Function Practice session 24.05.2025.pdf
DrStephenStrange4
 
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
Depth First Search Algorithm in 🧠 DFS in Artificial Intelligence (AI)
rafeeqshaik212002
 
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
Types of Bearing_Specifications_PPT.pptx
PranjulAgrahariAkash
 
Heart Bleed Bug - A case study (Course: Cryptography and Network Security)
Adri Jovin
 
8th International Conference on Electrical Engineering (ELEN 2025)
elelijjournal653
 
GTU Civil Engineering All Semester Syllabus.pdf
Vimal Bhojani
 
The Role of Information Technology in Environmental Protectio....pptx
nallamillisriram
 
Introduction to Design of Machine Elements
PradeepKumarS27
 
Evaluation and thermal analysis of shell and tube heat exchanger as per requi...
shahveer210504
 
Pressure Measurement training for engineers and Technicians
AIESOLUTIONS
 
artificial intelligence applications in Geomatics
NawrasShatnawi1
 

Association Rule Mining using RHadoop

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 960 Association Rule Mining using RHadoop Anurag Agrahari1, Prof D.T.V. Dharmaji Rao2 1M.tech Student, Deptt. Of Computer Sci & Engg, AITAM College, Tekkali, Srikakulam, Andhra Pradesh, India. 2Professor, Deptt of Computer Sci & Engg, AITAM College, Tekkali, Srikakulam, Andhra Pradesh, India. ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract - With the advancement of computer related technology, the world is generating several Exabyte of data per year. Data generated through many sources like business processes, transactions, social networking sites, web servers, and many other source are remains in structured as well as unstructured form increased by exponential rates. Processing of huge amount of data or extracting hidden information is always a challenging and tuff task. Many Data Mining and statistical techniques are being used to find out the hidden pattern hidden in it. Association Rule Mining is one of core Data mining techniques being widely used in Big Data Analytics. Many tools are used in the Big Data Analysis. R is statistical language, widely used in the analysis. Hadoop can be integrated with R known as RHadoop. In this paper, we will briefly discuss what is big data, Hadoop, integration with R and analysis of result on training data set using association Rule. Key Words: Big Data, Data mining, Association Rule mining, RHadoop, Ambari. 1. INTRODUCTION Today large datasets are a part of every field that somehow involves digital information. Companies, programs and network sensors produce huge amount of fresh data, around 2.5 EBs (Exabyte’s) every single day. The total amount of data in the world was 4.4 zettabytes in 2013. According to prediction, this is set to rise steeply to 44 zettabytes by 2020[1].A lot of statistical techniques and data mining techniques are used to analysis the Big data set. Big Data analysis refers to data that exceeds the typical storage, processing, and computing capacity of conventional databases using data analysis techniques [17]. Most popular distributed computing paradigms nowadays uses Hadoop framework for handling the large scale data. Data mining, a technique to understand and find out the hidden knowledge from raw data and convert into useful information, is increasingly being used in a variety of fields like marketing, business intelligence, scientific discoveries, biotechnology, Internet searches, and multimedia. Data mining is an interdisciplinary field combining ideas from statistics, machine learning, and natural language processing. Data mining in such environments requires a utilization of the available resources. The application of Big data in expert system in agriculture for making of decision support system, will open the new research area [21][22]. 2. RELETED WORKS Association rule mining works was carried out by [2] where she applied association rule with Map Reduce concept in cloud. Cloud itself provides the platform to analysis the Big Data. Most of Organization is using cloud for their application. Furthermore a lot of work for Association Rule Mining (ARM) is one of important and popular technique of data mining which find interesting correlation or association between set of items or attributes and also frequent patterns. in large database [3].Association Rule mining is widely used in Many other area like business application, bi-informatics, medical diagnosis, text analysis and business Intelligence and consumer behavior. A lot of works carried out using Hadoop and Association Rule mining. Furthermore, very few and limited works carried out using Rhadoop. The concept of Rhadoop is discussed by[4],where they conclude the concept of Rhadoop, how it work with the Hadoop Component, integration with the hadoop, ODBC and other component. Memory-Sensitive Heterogeneous Earliest Finish Time (MSHEFT) algorithms were implemented by [5] where he used the Hadoop and spark platform with R to analysis and performance evaluation of the algorithms. Being well known statistical language, open source software, and new package can be developed for computation of large and complex data, large genotype–phenotype datasets [6]. A new tool was developed by [6] named BlueSNP using R and Hadoop for large scale cluster. Linear regression analysis was developed by Bogdan Oancea using R and Hadoop[7]. A state of art research on sentiment analysis was done by [8][9] using Rhadoop of twitter data. R is also providing the API interface for extracting the twitter data. Furthermore, R can be integrated with other Big data other Hadoop technology such as Spark known SparkR[10].Comparative study between the Rapriori and spark was done by[34]. Furthermore, in Map Reduce model, some parallel Apriori algorithms are implemented using Hadoop-MapReduce model [23][24][25][26]. The name of the algorithm Apriori is based on the Apriori property which states that
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 961 all nonempty subsets of a frequent itemset must also be frequent [2]. Many sequential and parallel algorithms have been proposed to improve the performance of Apriori algorithm. Some popular sequential approaches are Hash- based technique, Transaction reduction, Partitioning, Sampling and Dynamic itemset counting (DIC) [26][27][28][29][30]. A algorithms proposed for three parallel version of Apriori algorithm,Count Distribution (CD), Data Distribution (DD) and Candidate Distribution[32]. 3. ASSOCIATION RULE MINING Association rule mining (ARM) is a popular method of data mining method for discovering interesting relations between terms in the dataset. The concept of strong rules was used by Agarwal et al [11] to find association rules in items. An association rule defines relation between two set of items for e.g. {A, B} => {C}. In a purchase relation this would indicate if a person buys A and B together, he/she is more likely to also buy C. Mining association rule consists of following two steps: Finding the item set which are frequent in the data set: The frequent item sets are set of those items whose support (sup (item)) in the data set is greater than the minimum required support (min_sup). Considering the above example all three A, B and C belongs to frequent item set and sup {A, B} and sup {C} would be greater than the min_sup. The support of an item set is defined as proportion of transactions which contains the item set. Generating association rule from frequent item set: Generating the interesting rules from the frequent item sets on the basis of confidence (conf). The confidence of the above rule will be sup {A, B} divided by sup{C}. If the confidence of the rule is greater than the required confidence, the rule can be considered as an interesting one. The frequent item set required for generation of association rule can be generated using Apriori algorithm. If the confidence of the rule is greater than the required confidence, then the rule can be considered as an interesting one. Fig 1. Map/reduce at n node showing n iterations [33]. The following Step in association mining will take place in Map reduce. As figure 1 show that data submitted to the process will have the multiple phase/step. Those step/phase are as below[23][24][25]. Hadoop paradigm have two function i.e. mapper and reducer.The input and output of these functions must be in form of (key, value) pairs. The Mapper takes the input (k1, v1) pairs from HDFS and produces a list of intermediate (k2, v2) pairs. An optional Combiner function is applied to reduce communication cost of transferring intermediate outputs of mappers to reducers. Phase 1 Mapper (key, value) // key: TID // value: itemsets in transaction Ti for each transaction Ti assigned to Mapper do for each itemset in Ck do if itemset ∈ Ti output (itemset, 1); end if end for end for Phase 2 Combiner (key, value) // key: itemset // value: list (1) for each itemset do for each 1 in list (1) of corresponding itemset do itemset.local_sup + = 1;
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 962 end for output (itemset,itemset.local_sup); end for Phase 3 Reducer (key, value) // key: itemset // value: list (local_sup) for each itemset do for each local_sup in list (local_sup) of corresponding itemset do itemset.sup + = local_sup; end for if itemset.sup ≥ minimum support; output (itemset, itemset.sup); end for These algorithms can be classified in two categories: 1- phase of map-reduce and k-phase of mapreduce [2]. Some algorithms used all the three functions mapper, reducer and combiner while some used only mapper and reducer function. 4. RHADOOP 4.1 Integrating with R with Hadoop R is a language and environment for statistical computing was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues [12]. It is upgraded version of S. R was initially written by Robert Gentleman and Ross Ihaka—also known as “R & R” of the Statistics Department of the University of Auckland[13]. There are several approaches to integrate R and Hadoop: R and Streaming, Rhipe and RHadoop but in this paper we will present only RHadoop. RHadoop is an open source project developed by Revolution Analytics (http://www.revolutionanalytics.com/) that provides client-side integration between R and Hadoop. This allows running Map Reduce jobs within R and consists of a collection of several packages: plyrmr - provides plyr like processing functions for structured data type, having capabilities of handling large data sets stored with Hadoop; rmr - contains a collection of functions that provide Map Reduce model implementation in R; rdfs - is an interface between R and HDFS, providing file management operations in R for data stored in HDFS; rhbase - is an interface between R and Hbase, and provides management functions in R for Hbase databases; ravro -A package that adds the ability to read and write avro files from local and HDFS file system and adds an avro input format for rmr2[16]. Fig 2. RHadoop Framework[14] As figure 1 clearly show the working with different component of Mapreduce and HDFS. The package rhdfs is used to read and write function used for data in cluster. Same wise,rmr2 and rmr is used to do task of instruction with Mapreduce programming paradigm. The rhdfs package provides the mechanism to store and fetch the data from HDFS. It only installed at the each data node of the cluster. The rmr2 and rmr(for hadoop version 1) provide the facilities to do map reduce in cluster. R script or program only runs over the rmr and rhdfs package. Running the R script need some other package of R, will be discussed on next subsection. 4.2 R Programming With Rhadoop Fig 3. R interface RHadoop over Hadoop [15]. As Figure 3 show that the detailed Architecture of Hadoop and Rhadoop. The brief explanation was covered by author on [15]. The first layer known as hardware layer which include the commodity computer over the network or in cloud cluster. Middle layer manage the Map reduce and HDFS system component of Hadoop. The script
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 963 provided over the RHadoop via R is act through Rhadoop to Hadoop. The next layer is a layer that provides an interface for data analysis. Interface layer which having streaming component to other language like c,c++,Python and other programming language. R script communicates to hadoop component through the Rhadoop libraries which defined above section. We can use Rhipe or Rhadoop libraries that build an interface between Hadoop and R, allowing users to access data from the Hadoop file system and write their own scripts for implementing Map and Reduce jobs [15].R con not communicate to hadoop component, al through it need some more and necessary component of R core language detailed discussed on next section of Project setup. 5. PROJECT SETUP For implementing the concept, we had created the environment of 8 node cluster AITAM Central Library. Ambari, open source tools for managing the cluster, monitor the cluster and installing the component of Hadoop component. For the implementing the concept the Horton work’s HDP 2.2 ware used along with Ambari Version 1.7.1 over the Cent Os version 6.6 using the Virtual Platform. Commodity Specification detail- For implementing the algorithms, we had used different node with different processor based system, memory, RAM, cache Size, Different clock speed, and different core system. This is heterogeneous kind of system cluster. Along with this we have used the VMware for some system (n0.exmple.com, n1.example.com, n2.example.com). No.example.com machine have two installation i.e one for ambari server of RAM size 2 GB and rest for the Name node server based on the window 7 base operating system. The guest operating system was cent os 6.6. Name Core Ip address RAM N0.example.com 8(i7) 172.16.6.181 7.67 GB N1.example.com N2.example.com 8(i7) 8(i7) 172.16.6.182 172.16.6.183 3.73 GB 3.73 GB N3.example.com N4.example.com N5.example.com N6.example.com N7.example.com N8.example.com 4(i3) 172.16.6.184 172.16.6.185 172.16.6.186 172.16.6.187 172.16.6.188 172.16.6.190 3.73 GB Table 1. Cluster commodity Hardware Specification Ambari Installation A completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters [18][17]. Ambari enables System Administrators to: Provision a Hadoop Cluster Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts. It handles configuration of Hadoop services for the cluster. Manage a Hadoop Cluster Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster. Monitor a Hadoop Cluster Ambari provides a dashboard for monitoring health and status of the Hadoop cluster. Ambari leverages Ambari Metrics System for metrics collection. Ambari leverages Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc). Apart from this, Ambari is very useful in large scale cluster setup because it provide consistent, secure platform for operational control. Fig 4. Conceptual Layout Of Ambari server installation It is Simplified Installation and Configuration and Management, Centralized Security Setup, Full Visibility into Cluster Health and Highly Extensible and Customizable [17]. Fig 5. Screenshot of ambari server of 4 node cluster.
  • 5. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 964 More node can be add and remove from ambari server dashboard. Service can start and stop from same dash board. For installing the Ambari we need to create repo for installation of hadoop service over node/client. name Address/base Url HDP-2.2 http://n0.example.com/hdp/HDP/cent os6/2.x/GA/2.2.0.0 HDP-UTILS- 1.1.0.20 http://n0.example.com/hdp/HDP-TILS- 1.1.0.20/repos/centos6 Table 2- Ambari repo base Url(Cent os6.6) We have to provide the repositories’ for hadoop. In our case we have used Horton work product HDP 2.2 and other hadoop utilities. The components of HDP 2.2 are summarized in Table 2. SERVICE VERSION DESCRIPTION Zookeeper 3.4.6.2.2.0.0 Centralized service which provides highly reliable distributed coordination HDFS 2.6.0.2.2.0.0 Apache Hadoop Distributed File System YARN + MapReduce2 2.6.0.2.2.0.0 Apache Hadoop NextGen MapReduce (YARN) Ganglia 3.5.0 Ganglia Metrics Collection system (RRDTool will be installed too) Nagios 3.5.0 Nagios Monitoring and Alerting system Table 3- HDP 2.2 Cluster Stack Descriptions Name ip address services n1.example.com 172.16.6.181 App Timeline Server Ganglia Monitor HDFS Client History Server NameNode NodeManager ResourceManager n2.example.com 172.16.6.182 DataNode Ganglia Monitor Ganglia Server HDFS Client MapReduce2 Client Nagios Server YARN Client ZooKeeper Client n3.example.com 172.16.6.183 DataNode Ganglia Monitor HDFS Client MapReduce2 Client SNameNode YARN Client ZooKeeper Client ZooKeeper Server n4.example.com n5.example.com n6.example.com n7.example.com n8.example.com 172.16.6.184 172.16.6.185 172.16.6.186 172.16.6.187 172.16.6.188 172.16.6.189 DataNode Ganglia Monitor HDFS Client MapReduce2 Client YARN Client ZooKeeper Client Table 4- Description of service running at nodes Rhadoop And R installation- R and Rhadoop installation is easy task. Before installing Rhadoop, The dependent package of Rhadoop should be installed in each node of cluster. Revolution R Open (RRO) is the enhanced distribution of R from Revolution Analytics. It is a complete open source platform for statistical analysis and data science. RRO 8.0.1 is based on (and 100% compatible with) version 3.1.2 of the R language, and includes additional capabilities for performance, reproducibility and platform support. Along with this Rstudio ware integrated with R. The concerning package of the Rhadoop and R software shown in table 3. SERVICE/Package Version Description rmr2 3.4.6.2.2.0.0 Master, client rhdfs 1.0.8 Only at Master plyrmr 0.6 Master, client rHadoopClient 0.2 Master, client Nagios 3.5.0 Master, client Rcurl 1.95-4.5 Master, client rJava 0.9-6 Master, client RJSONIO 1.3-0 Master, client iterators 1.0.7 Master, client itertools 0.1-3 Master, client digest 0.6.8 Master, client Rcpp 0.11.4 Master, client jsonlite 0.9.14 Master, client mime 0.2 Master, client R6 2.0.1 Master, client
  • 6. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 965 functional 0.6 Master, client httr 0.6.1 Master, client stringr 0.6.2 Master, client memoise 0.2.1 Master, client whisker 0.3-2 Master, client evaluate 0.5.5 Master, client rstudioapi 0.2 Master, client brew 1.0-6 Master, client roxygen2 4.1.0 Master, client reshape2 1.4.1 Master, client plyr 1.8.1 Master, client devtools 1.7.0 Master, client caTools 1.17.1 Master, client R.methodsS3 1.7.0 Master, client RColorBrewer 1.1-2 Master, client pryr 0.1 Master, client proto 0.3-10 Master, client dichromat 2.0-0 Master, client gtable 0.1.2 Master, client colorspace 1.2-4 Master, client scales 0.2.4 Master, client munsell 0.4.2 Master, client labeling 0.3 Master, client DBI 0.3.1 Master, client ggplot2 1.0.0 Master, client acepack 1.3-3.3 Master, client latticeExtra 0.6-26 Master, client lhs 0.10 Master, client rmr2 3.3.1 Master, client sp 1.0-17 Master, client Hmisc 3.15-0 Master, client hydroPSO 0.3-4 Master, client plyrmr 0.6.0 Master, client Table 5. R and Rhadoop Package Description in Node/master 6. RESULT AND ANALYSIS Data Set was taken from well known source UCI data set repository and other source. The dataset named chess, connect, and mushroom ware downloaded from UCI data set repository [19][20]. The short description about the data set summarized in table 4. Data set Instance attributes Chess 3196 36 pumsb 49046 74 Connect 67557 42 Table 6. Data set And there Description [19][20]. The dataset provided by [19] are in integer form and space delimiter. Each row of the dataset represents the new data in the file. For the conceptualizing the concept, we had fix the support value as (.5) and count value as (.9). Table 5 show the result of training data set with same support and count value in without hadoop and 2 node, 4 node, 6 node and 8 node cluster. Running time count in second and process run 4 times each to find out the average running time of the cluster. Dataset Without hadoop 2 node 4 node 6 node 8 node chess 31.157 24 12 10 7 pumsb 271.117 37 36 34 25 connect 82.150 42 40 25 18 Table 7. Result showing in different node Result may have difference because of the network and one single system is running logically two system. The reduction of execution time is more evident for those datasets which have longer execution time. Moreover, we can see that when the size of the dataset increases, the performance on multi-node Hadoop also gets better. The result could get better if the ambari server is running on separate system and master node running separately in the cluster and having more RAM size. The value of running time is average mean value and round of value in second. 7. CONCLUSIONS We implemented the improved Apriori Algorithm on MapReduce programming model on the Hadoop platform. The R language is widely used in the hadoop due to easy and comfort for the analyst. Moreover, this distributed algorithm can also cater to the distributed nature of the input data. Furthermore, details of the management of the distributed systems, such as data transferring among nodes and node failures are taken care by Hadoop, which adds a great deal of robustness and scalability to the system. The performance could be enhanced if we have the facilities to increase the RAM and high performing Machine. ACKNOWLEDGEMENT I really thankful to the Director, Shri Prof. V.V Nageswara Rao and Head of computer science department Shri Dr. G. S. N. Murty and my guide shri prof D.T.V. Dharmaji Rao, for there kind support and help.
  • 7. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 966 REFERENCES [1] D. Kornack and P. Rakic, “Cell Proliferation without Neurogenesis in Adult Primate Neocortex,” Science, vol. 294, Dec. 2001, pp. 2127- 2130, doi:10.1126/science.1065467. [2] Pallavi roys(2012), MINING ASSOCIATION RULES IN CLOUD, MS thesis, North Dakota State University of Agriculture and Applied Science, Fargo, North Dakota August 2012Mr. [3] Neeraj Raheja, Ravish Kumar(2012) , Optimization of Association Rule Learning in Distributed Database using Clustering Technique,International Journal of Scientific and Research Publications, Volume 2, Issue 12, December 2012 ,ISSN 2250-3153. [4] HARISH D,ANUSHA M.S ,Dr. DAYA SAGAR K.V(2015).,”BIG DATA ANALYSIS USING RHADOOP”, ISSN: 2349-2163,Issue 4, Volume 2 (April 2015) pp 180-85. [5] Bao Rong Chang, Yun-Da Lee, and Po-Hao Liao, “Development of Multiple Big Data, Analytics Platforms with Rapid Response” https://doi.org/10.1155/2017/6972461. [6] H.Huang et al., BlueSNP: R package for highly scalable genome-wide associationstudies using Hadoop clusters, BIOINFORMATICS APPLICATIONS NOTE , Vol. 29 no. 1 2013, pages 135–136 doi:10.1093/bioinformatics/bts647. [7] Bogdan OANCEA, LINEAR REGRESSION WITH R AND HADOOP, Challenges of the Knowledge Society. IT in Social Sciences. [8] Thasnim K M1, M Sudheep Elayidom, IMPLEMENTATION OF A NOVEL SENTIMENT ANALYSIS TECHNIQUE FOR TWITTER DATA USING RHADOOP, IJRET: International Journal of Research in Engineering and Technology, eISSN: 2319-1163 | pISSN: 2321-7308. [9] Shubham S. Deshmukh et al(2017)., Twitter Data Analysis using R, International Journal of Science, Engineering and Technology Research (IJSETR) Volume 6, Issue 4, April 2017, ISSN: 2278 -7798. [10] Shivaram Venkataraman et al(2016), SparkR: Scaling R Programs with Spark, SIGMOD ’16, June 26–July 1, 2016, San Francisco, CA, USA., ISBN 978-1-4503-3531-7/16/06. [11] R. Agrawal and R. Srikant(1994). Fast algorithms for mining association rules. In Proc. 1994 Int. Conf.Very Large Data Bases, pages 487- 499, Santiago, Chile, September 1994. [12] https://www.r-project.org/about.html access on 03/09/17. [13] R project,Cran project https://www.r- project.org/contributors.html access on 05/09/17. [14] Bao Rong Chang, Yun-Da Lee, and Po-Hao Liao(2017), Development of Multiple Big Data Analytics Platforms with Rapid Response, Hindawi Scientific Programming Volume 2017, Article ID 6972461, https://doi.org/10.1155/2017/6972461. [15] Bogdan OANCEA, Raluca Mariana DRAGOESCU(2014), Integrating R and Hadoop for Big Data Analysis, Romanian Statistical Review nr. 2 / 2014. [16] Rhadoop wiki https://github.com/RevolutionAnalytics/RHadoo p/wiki access on 03/09/2017. [17] Anurag Agrahari, Prof D.T.V. Dharmaji Rao(2017), “A Review paper on Big Data: Technologies, Tools and Trends,” www.irjet.net,Volume: 04 Issue: 10,Oct -2017,e- ISSN: 2395-0056 pp 640-649. [18] Ambari wiki,https://ambari.apache.org/ access on 2/08/2017. [19] http://fimi.ua.ac.be/data/ access on 12- 04-2015. [20] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. [21] Violeta nicoleta opriş & ciprian racuciu(2015), the expert systems analysis using the concept of big data and cloud computing services, Military Technical Academy, Faculty of Military Electronic and Information Systems, Bucharest, Naval Academy Scientific Bulletin, Volume XVIII – 2015 – Issue 2. [22] Anurag Agrahari ,Dr. ShashiKant Tripathi(2012),A Theoretical Framework for Development of Decision Support System for Agriculture, RESEARCH INVENTY: International Journal of Engineering and Science ISSN: 2278- 4721, Vol. 1, Issue 6 (October 2012), PP 50-55.
  • 8. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 6.171 | ISO 9001:2008 Certified Journal | Page 967 [23] Li L. & Zhang M. (2011). The Strategy of Mining Association Rule Based on Cloud Computing. Proceeding of the 2011 International Conference on Business Computing and Global Informatization (BCGIN ’11). Washington, DC, USA, IEEE: 475- 478. [24] Li N., Zeng L., He Q. & Shi Z. (2012). Parallel Implementation of Apriori Algorithm Based on MapReduce. Proc. of the 13 th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing (SNPD ’12). Kyoto, IEEE: 236 – 241. [25] Lin M., Lee P. & Hsueh S. (2012). Apriori- based Frequent Itemset Mining Algorithms on MapReduce. Proc. of the 16 th International Conference on Ubiquitous Information Management and Communication (ICUIMC ’12). New York, NY, USA, ACM: Article No. 76. [26] Yang X.Y., Liu Z. & Fu Y. (2010). MapReduce as a Programming Model for Association Rules Algorithm on Hadoop. Proc. of the 3 rd International Conference on Information Sciences and Interaction Sciences (ICIS ’10). Chengdu, China, IEEE: 99 – 102 [27] S. Park, Ming-Syan and P. S. Yu, “Using a Hash-Based Method with Transaction Trimming for Mining Association Rules,” in IEEE Transactions on Knowledge and Data Engineering, vol. 9, no. 5, pp. 813-825, 1997. [28] J. S. Park, Ming-Syan and P. S. Yu, “An Effective Hash-Based Algorithm for Mining Association Rules,” in SIGMOD ACM, vol. 24, no. 2, pp. 175-186,1995. [29] M. J. Zaki, S. Parthasarathy, W. Li and M. Ogihara, “Evaluation of Sampling for Data Mining of Association Rules,” in Proceedings IEEE 7th International Workshop on Research Issues in Data Engineering, 1997, pp. 42-50. [30] A. Savasere, E. Omiecinski and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Databases,” in Proceedings 21st VLDB Conference,Switzerland, 1995, pp. 432-444. [31] S. Brin, R. Motwani, J. D. Ullman and S. Tsur, “Dynamic Itemset Counting and Implication Rules for Market Basket Data,” in ACM SIGMOD Record, vol. 26, no. 2, pp. 255-264, 1997. [32] R. Agrawal and J.C. Shafer, “Parallel Mining of Association Rules,” in IEEETransactions on Knowledge and Data Engineering, vol.8, no. 6, pp.962-969,1996. [33] Poonam Modgi et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (3) , 2014, pp 3093 – 3097. [34] Sanjay Rathee, Manohar Kaul, Arti Kashyap, R-Apriori: An Efficient Apriori based Algorithm on Spark, http://www.iith.ac.in/~mkaul/papers/pikm09- rathee.pdf. access on 03/09/17.