SlideShare a Scribd company logo
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
_______________________________________________________________________________________
Volume: 03 Issue: 12 | Dec-2014, Available @ http://www.ijret.org 277
IMPROVING PERFORMANCE OF APRIORI ALGORITHM USING
HADOOP
Ravindra Bachate1
, Hyder Ali Hingoliwala2
1
Department of Computer Engineering, JSPM’s JSCOE, Pune, Maharashtra, India
2
Department of Computer Engineering, JSPM’s JSCOE, Pune, Maharashtra, India
Abstract
Spatial data is a data having a geological information. This paper explores the use of Hadoop framework to improve the
performance of Apriori algorithm for spatial data mining. FP growth algorithm is better than Apriori but it fails in certain
situations. By applying the Apriori algorithm parallely using Hadoop framework to spatial data, we can perform well as compare
to FP growth. This paper includes clustering based on geological location, classification based on mineral resource type and
spatial coherence between mineral resources. Spatial data mining find out the different association rules by observing the spatial
data by using Apriori algorithm. The result of the paper will indicate the accurate prediction of occurrence of commodity with
respect to other commodity of mineral resources.
Keywords: Hadoop, data mining, association rules, clustering, spatial coherence
--------------------------------------------------------------------***----------------------------------------------------------------------
1. INTRODUCTION
FP growth algorithm is a very popular algorithm used for
the association mining due its performance and less storage
requirement. With these pros, it has some limitations also
which are considered rarely. This paper focuses on the
limitations of FP growth association mining algorithm and
trying to suggest improvements in Apriori algorithm which
can overcome the limitations of FP Growth association
mining algorithm. As we aware about the hardware cost, it’s
decreasing day by day, so there is no need to concentrate
more on storage requirement. FP growth has two limitations
– 1. It is difficult to use for interactive mining where the
user may change support value as per requirement. 2. It is
not suitable where the data has been increasing with time.
So to deal with limitations of FP growth, this paper suggest
to implement Apriori algorithm using Hadoop for
association rule mining.
1.1 Mineral Resources Data System
Mineral Resources Data System is a collection of data
describing metallic and nonmetallic mineral resources in the
world [7]. It includes resource name, location, commodity,
geologic characteristics, resource description, production,
reserves, and references. As MRDS contains mineral
resources data around the world, it is large and complex. If
data size goes beyond the Tera Byte, it is difficult to process
and mine using FP growth algorithm. The performance of
FP growth algorithm hampers when adding the new records
and also by changing the support value. As the mineral
resources data set is collected from various regions and
people, we need to perform ETL (extract, transform and
load) operations on the data for processing and mining.
1.2 Hadoop Map Reduce
To deal with unstructured and big data like mineral
resources data system (MRDS), we need a best technology
which can cope with it. There are two options available,
parallel DBMS and Hadoop Map Reduce technology.
Hadoop has another projects also which can be used for the
mining purpose like Hive. But internally again all the
projects using Hadoop Map Reduce technique [5]. It gives a
better data processing performance with minimum cost and
time as compare to parallel DBMS because it works with
commodity hardware. Hadoop stores data in the form of
blocks on Hadoop Distributed File System i.e. HDFS. The
Hadoop framework provides a solution for problems of
massive data processing; because it runs applications on
large cluster built of commodity hardware with failure
tolerance [4].Unstructured data can be processed with
Hadoop Map Reduce technique which is not possible with
RDBMS. Map Reduce provides flexibility and fault
tolerance which is not with parallel DBMS. Map Reduce
provides automatic parallelization, data partitioning, task
scheduling, handling machine failures and manages inter-
machine communication. Hadoop is totally transparent from
the end user. The rate of growing an unstructured data is
much more as compare to the structured data. The
unstructured data includes media files, heavy text files, csv
files, log files etc.
2. RELATED WORK
Association rule mining algorithm includes to find coverage,
support, confidence, lift and interesting [4]. Coverage
defines the proportion of case data specified on the Left Haft
Hand Side of the rule. Support of an association rule means
percentage of task relevant data for which the pattern is true.
Confidence gives the trustworthiness associated with the
patterns discovered. Lift is a measure of the importance of
the association. The term interesting gives the strength of
associations between sets of items in the association rule.
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
_______________________________________________________________________________________
Volume: 03 Issue: 12 | Dec-2014, Available @ http://www.ijret.org 278
Hongyong Yu, Deshuai Wang [1] proposed a system for
data processing and mining log data of SaaS cloud using
Hadoop. This paper focuses on Hadoop’s Map Reduce
technique and the algorithm used for data mining by
Hongyong Yu, Deshuai Wang. They suggest by applying
Apriori algorithm concurrently in the distributed system,
performance of Apriori algorithm can be increased in
proportional to the number of nodes in the distributed
system [1].
Every day, 2.5 quintillion bytes of data are created and 90
percent of the data in the world today were produced within
the past two years [3].The commonly used software
technology cannot cope with massive data and big challenge
is to extract an important information from it. Big data has
large volume, heterogeneous format and decentralized data
control. The example of big data applications are Facebook,
Twitter and Google. It is a big challenge to manage and
mining a massive data because of its volume, different file
formats and growing rate of the data in the world. There are
many challenges with big data such as storage, processing,
variety and cost.
3. PROPOSED SYSTEM
This paper proposes a system which overcomes the problem
faced in FP growth association mining algorithm. To
improve the performance of Apriori algorithm, it is
implemented parallely using Hadoop map reduce technique.
The proposed system has three modules
1. Spatial Clustering
2. Spatial Classification
3. Spatial Coherence
Before implementing these modules, we need to perform
Extract, Transfer and Load operations on the raw data.
Because the data available may be in the various form and
having an unnecessary information into to it. So first we
need to extract the required data from raw data and then
transfer it in to the csv format. To process this data, it should
be loaded on to the HDFS.
Fig-1 Proposed System
3.1 Spatial Clustering
There are various algorithms available for clustering like k
means but here in this paper, Hadoop partition technique is
used to perform spatial clustering based on the geological
location like Alaska, California etc[2]. It runs parallel on
different nodes concurrently. Because of this, time required
for spatial clustering is less as compared to k-means
clustering algorithm. In this paper, U.S. mineral resources
dataset is taken. In effect, there are 50 clusters formed with
respect to the states in United State.
These clusters have all the records which belongs to
respective state. Intention behind to form the clusters is to
make it specialize and reduce the processing time and cost.
3.2 Spatial Classification
Here, the same technique is used for implementing a spatial
classification which is used for spatial clustering. In this
module, again the data set is clustered according to the type
of commodity which is required for the next module. In
spatial classification, the records having same commodity
put in to the separate cluster to make it more specialize. It
means, e.g. Alaska state may have another sub clusters with
commodity type like gold, silver, copper etc.
3.3 Spatial Coherence
Objective of this paper is to find out the spatial coherence.
Spatial coherence means if we are mining for gold then what
are the possibilities of getting other mineral resources at the
same location. For this, association mining is required. As
we have discussed above, FP Growth algorithms fails in two
situations. So to overcome this, this paper introduces
parallel Apriori algorithm for spatial association mining. In
Apriori algorithm, first of all, we have to find out the
candidate keys. After this, we have to perform two task, one
is to find out the support values and second is to find out the
confidence. Formula’s to calculate a support and confidence
for gold with copper.
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
_______________________________________________________________________________________
Volume: 03 Issue: 12 | Dec-2014, Available @ http://www.ijret.org 279
1. Support Formula
Where N is all the records in a cluster
2. Confidence Formula
Here, a formulas for calculating support and confidence of
gold is given. Support of (gold => copper) gives the
probability of having gold with copper in all the records
whereas confidence gives the probability of having of gold
with copper with respect to all the gold records in a cluster.
The value of support is always less than the confidence.
4. RESULTS
To implement this idea, we have taken a dataset of MRDS
which is a spatial data set of mineral resources in U.S.[7].
This sample data has around 3.5 lack records.
Table -1: Support and Confidence
Commodity Support (%) Confidence (%)
(Gold -> Copper) 2.89 9.30
(Gold -> Copper, Lead,
Silver)
0.72 2.3
(Gold -> Silver) 2.17 6.97
(Gold -> null) 24.63 79.06
Here, the Apriori algorithm is performed on single node and
multi node Hadoop to compare the performance. Also we
find the association rules for this spatial data for a single
cluster.
Table -2: Execution on Single and Multinode System
No. of Node Execution Time in Sec
1 5.2
2 3.1
3 2.3
The result in Table-1 shows the different association rules
with respect to gold.Table-2 shows the performance of
Apriori algorithm using single node and multi node system.
It is observed that if we implement the Apriori Algorithm
parallel on Hadoop, performance is improved.
5. CONCLUSION
FP growth algorithm has two limitations, it cannot be used
for dynamic data size and where the support needs to change
according to situation. By applying the Apriori Algorithm
concurrently using Hadoop, we can overcome the problems
faced by FP growth association mining algorithm. Also we
can improve the speed of association rule mining for spatial
data as compare to the FP growth algorithm as Hadoop is a
distributed system.
ACKNOWLEDGEMENTS
I express true sense of gratitude towards my project guide
Prof. H.A. Hingoliwala, Associate Professor Computer
Department for his invaluable co-operation and guidance
that he gave me throughout my project. I specially thank our
P.G coordinator Prof. M. D. Ingle for inspiring me and
providing me all the lab facilities. I would also like to
express my appreciation and thanks to HOD Prof. S.M.
Shinde & JSCOE Principal Dr. M.G. Jadhav and all my
friends who knowingly or unknowingly have assisted me
throughout my hard work
REFERENCES
[1]. Hongyong Yu, Deshuai Wang, “Mass Log Data
Processing and Mining Based on Hadoop and Cloud
Computing” .The 7th International Conference on Computer
Science & Education (ICCSE 2012)July 14-17, 2012.
Melbourne, Australia.
[2]. Duck-Ho Bae Coll. of Inf. & Commun., Hanyang Univ.,
Seoul, South Korea Ji-Haeng Baek ; Hyun-Kyo Oh ; Ju-
Won Song ; Sang-Wook Kim, “SD-Miner: A SPATIAL
DATA MINING SYSTEM” Network Infrastructure and
Digital Content, 2009.
[3]. Xindong Wu,Fellow, IEEE,Xingquan Zhu, Senior
Member, IEEE, Gong-Qing Wu, and Wei Ding,Senior
Member, IEEE, “Data mining with big data,” IEEE
TRANSACTIONS ON KNOWLEDGE AND DATA
ENGINEERING, VOL. 26, NO. 1, JANUARY 2014.
[4]. Binbin He, Ying Cui, Jianhua Chen, Pingjing Xie, “A
Spatial Data Mining Method for Mineral Resources
Potential Assessment,” IEEE 978-1-4244-8351-
8/11,2011IEEE.
[5]. Hadoop: The definitive Guide, 3rd ed., O’Reilly, Tom
White, 2012
[6]. Hadoop, http://hadoop.apache.org/
[7]. MRDS, http://tin.er.usgs.gov/mrds/

More Related Content

What's hot (20)

PDF
Implementation of p pic algorithm in map reduce to handle big data
eSAT Publishing House
 
PPTX
Berlin Hadoop Get Together Apache Drill
MapR Technologies
 
PDF
Web Oriented FIM for large scale dataset using Hadoop
dbpublications
 
PDF
IRJET- Performing Load Balancing between Namenodes in HDFS
IRJET Journal
 
PDF
Parallel Key Value Pattern Matching Model
ijsrd.com
 
PDF
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
idescitation
 
PDF
Effect of countries in performance of hadoop.
Computer Science Journals
 
PDF
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET Journal
 
PDF
Survey Paper on Big Data and Hadoop
IRJET Journal
 
PDF
An effective classification approach for big data with parallel generalized H...
riyaniaes
 
PDF
An efficient data mining framework on hadoop using java persistence api
João Gabriel Lima
 
PDF
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
IRJET Journal
 
PDF
Improved Map reduce Framework using High Utility Transactional Databases
International Journal of Engineering Inventions www.ijeijournal.com
 
PDF
A data aware caching 2415
SANTOSH WAYAL
 
PPTX
Hadoop interview questions
barbie0909
 
PDF
Leveraging Map Reduce With Hadoop for Weather Data Analytics
iosrjce
 
PPTX
Graphlab Ted Dunning Clustering
MapR Technologies
 
PPTX
Introduction to HADOOP
Shital Kat
 
PDF
IRJET - Weather Log Analysis based on Hadoop Technology
IRJET Journal
 
PDF
InternReport
Swetha Tanamala
 
Implementation of p pic algorithm in map reduce to handle big data
eSAT Publishing House
 
Berlin Hadoop Get Together Apache Drill
MapR Technologies
 
Web Oriented FIM for large scale dataset using Hadoop
dbpublications
 
IRJET- Performing Load Balancing between Namenodes in HDFS
IRJET Journal
 
Parallel Key Value Pattern Matching Model
ijsrd.com
 
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...
idescitation
 
Effect of countries in performance of hadoop.
Computer Science Journals
 
IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET Journal
 
Survey Paper on Big Data and Hadoop
IRJET Journal
 
An effective classification approach for big data with parallel generalized H...
riyaniaes
 
An efficient data mining framework on hadoop using java persistence api
João Gabriel Lima
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
IRJET Journal
 
Improved Map reduce Framework using High Utility Transactional Databases
International Journal of Engineering Inventions www.ijeijournal.com
 
A data aware caching 2415
SANTOSH WAYAL
 
Hadoop interview questions
barbie0909
 
Leveraging Map Reduce With Hadoop for Weather Data Analytics
iosrjce
 
Graphlab Ted Dunning Clustering
MapR Technologies
 
Introduction to HADOOP
Shital Kat
 
IRJET - Weather Log Analysis based on Hadoop Technology
IRJET Journal
 
InternReport
Swetha Tanamala
 

Viewers also liked (16)

PDF
CKu-band
Luis Garcia
 
DOCX
Paratransit Services Research
P.C.D.I Healthcare and Consultants of Texas L.L.C.
 
PDF
Chapters 1 and 2 Summary
livvy milner
 
PDF
Chapter 7 Summary
livvy milner
 
PPTX
Major 2 p pt
Rahul Agarwal
 
PDF
Webinar: Inside Social (April 2016)
Universum Webinars
 
PDF
RF 55f - Fondation Raoul Follereau - Rapport Annuel 2008
Bernard hardy
 
PDF
Frontend performance metrics
Артем Захарченко
 
PPTX
Sunk Costs
tutor2u
 
PPSX
Trabajo ctm
Matias ascanio
 
PPTX
Fixed, variable and Incremental Cost
Rahat Inayat Ali
 
DOCX
Utterback_Douglas - Resume
Douglas Utterback
 
PPTX
Building a Modern Windows App
Brent Edwards
 
PPTX
Apriori algorithm
Ashis Kumar Chanda
 
PDF
Александр Кашеверов — Коротко про WEB: HTML, CSS, JS.
DataArt
 
CKu-band
Luis Garcia
 
Chapters 1 and 2 Summary
livvy milner
 
Chapter 7 Summary
livvy milner
 
Major 2 p pt
Rahul Agarwal
 
Webinar: Inside Social (April 2016)
Universum Webinars
 
RF 55f - Fondation Raoul Follereau - Rapport Annuel 2008
Bernard hardy
 
Frontend performance metrics
Артем Захарченко
 
Sunk Costs
tutor2u
 
Trabajo ctm
Matias ascanio
 
Fixed, variable and Incremental Cost
Rahat Inayat Ali
 
Utterback_Douglas - Resume
Douglas Utterback
 
Building a Modern Windows App
Brent Edwards
 
Apriori algorithm
Ashis Kumar Chanda
 
Александр Кашеверов — Коротко про WEB: HTML, CSS, JS.
DataArt
 
Ad

Similar to Improving performance of apriori algorithm using hadoop (20)

PDF
Association Rule Mining using RHadoop
IRJET Journal
 
PDF
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
IJTET Journal
 
PDF
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET Journal
 
PDF
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
BRNSSPublicationHubI
 
PDF
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
acijjournal
 
PDF
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
BRNSSPublicationHubI
 
PDF
Irjet v4 iA Survey on FP (Growth) Tree using Association Rule Mining7351
IRJET Journal
 
PDF
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
cscpconf
 
PDF
Distributed Framework for Data Mining As a Service on Private Cloud
IJERA Editor
 
PDF
Frequent Pattern Analysis, Apriori and FP Growth Algorithm
ShivarkarSandip
 
PDF
A Framework for Visualizing Association Mining Results
ertekg
 
PDF
COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG S...
IAEME Publication
 
PDF
Generating Frequent Itemsets by RElim on Hadoop Clusters
BRNSSPublicationHubI
 
PDF
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
ijdpsjournal
 
PDF
Ijariie1184
IJARIIE JOURNAL
 
PDF
Ijariie1184
IJARIIE JOURNAL
 
PDF
REVIEW: Frequent Pattern Mining Techniques
Editor IJMTER
 
PDF
K355662
IJERA Editor
 
PDF
K355662
IJERA Editor
 
PDF
Az36311316
IJERA Editor
 
Association Rule Mining using RHadoop
IRJET Journal
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
IJTET Journal
 
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET Journal
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
BRNSSPublicationHubI
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
acijjournal
 
Hadoop Map-Reduce To Generate Frequent Item Set on Large Datasets Using Impro...
BRNSSPublicationHubI
 
Irjet v4 iA Survey on FP (Growth) Tree using Association Rule Mining7351
IRJET Journal
 
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...
cscpconf
 
Distributed Framework for Data Mining As a Service on Private Cloud
IJERA Editor
 
Frequent Pattern Analysis, Apriori and FP Growth Algorithm
ShivarkarSandip
 
A Framework for Visualizing Association Mining Results
ertekg
 
COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG S...
IAEME Publication
 
Generating Frequent Itemsets by RElim on Hadoop Clusters
BRNSSPublicationHubI
 
LOAD BALANCING LARGE DATA SETS IN A HADOOP CLUSTER
ijdpsjournal
 
Ijariie1184
IJARIIE JOURNAL
 
Ijariie1184
IJARIIE JOURNAL
 
REVIEW: Frequent Pattern Mining Techniques
Editor IJMTER
 
K355662
IJERA Editor
 
K355662
IJERA Editor
 
Az36311316
IJERA Editor
 
Ad

More from eSAT Journals (20)

PDF
Mechanical properties of hybrid fiber reinforced concrete for pavements
eSAT Journals
 
PDF
Material management in construction – a case study
eSAT Journals
 
PDF
Managing drought short term strategies in semi arid regions a case study
eSAT Journals
 
PDF
Life cycle cost analysis of overlay for an urban road in bangalore
eSAT Journals
 
PDF
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
eSAT Journals
 
PDF
Laboratory investigation of expansive soil stabilized with natural inorganic ...
eSAT Journals
 
PDF
Influence of reinforcement on the behavior of hollow concrete block masonry p...
eSAT Journals
 
PDF
Influence of compaction energy on soil stabilized with chemical stabilizer
eSAT Journals
 
PDF
Geographical information system (gis) for water resources management
eSAT Journals
 
PDF
Forest type mapping of bidar forest division, karnataka using geoinformatics ...
eSAT Journals
 
PDF
Factors influencing compressive strength of geopolymer concrete
eSAT Journals
 
PDF
Experimental investigation on circular hollow steel columns in filled with li...
eSAT Journals
 
PDF
Experimental behavior of circular hsscfrc filled steel tubular columns under ...
eSAT Journals
 
PDF
Evaluation of punching shear in flat slabs
eSAT Journals
 
PDF
Evaluation of performance of intake tower dam for recent earthquake in india
eSAT Journals
 
PDF
Evaluation of operational efficiency of urban road network using travel time ...
eSAT Journals
 
PDF
Estimation of surface runoff in nallur amanikere watershed using scs cn method
eSAT Journals
 
PDF
Estimation of morphometric parameters and runoff using rs & gis techniques
eSAT Journals
 
PDF
Effect of variation of plastic hinge length on the results of non linear anal...
eSAT Journals
 
PDF
Effect of use of recycled materials on indirect tensile strength of asphalt c...
eSAT Journals
 
Mechanical properties of hybrid fiber reinforced concrete for pavements
eSAT Journals
 
Material management in construction – a case study
eSAT Journals
 
Managing drought short term strategies in semi arid regions a case study
eSAT Journals
 
Life cycle cost analysis of overlay for an urban road in bangalore
eSAT Journals
 
Laboratory studies of dense bituminous mixes ii with reclaimed asphalt materials
eSAT Journals
 
Laboratory investigation of expansive soil stabilized with natural inorganic ...
eSAT Journals
 
Influence of reinforcement on the behavior of hollow concrete block masonry p...
eSAT Journals
 
Influence of compaction energy on soil stabilized with chemical stabilizer
eSAT Journals
 
Geographical information system (gis) for water resources management
eSAT Journals
 
Forest type mapping of bidar forest division, karnataka using geoinformatics ...
eSAT Journals
 
Factors influencing compressive strength of geopolymer concrete
eSAT Journals
 
Experimental investigation on circular hollow steel columns in filled with li...
eSAT Journals
 
Experimental behavior of circular hsscfrc filled steel tubular columns under ...
eSAT Journals
 
Evaluation of punching shear in flat slabs
eSAT Journals
 
Evaluation of performance of intake tower dam for recent earthquake in india
eSAT Journals
 
Evaluation of operational efficiency of urban road network using travel time ...
eSAT Journals
 
Estimation of surface runoff in nallur amanikere watershed using scs cn method
eSAT Journals
 
Estimation of morphometric parameters and runoff using rs & gis techniques
eSAT Journals
 
Effect of variation of plastic hinge length on the results of non linear anal...
eSAT Journals
 
Effect of use of recycled materials on indirect tensile strength of asphalt c...
eSAT Journals
 

Recently uploaded (20)

PDF
Halide Perovskites’ Multifunctional Properties: Coordination Engineering, Coo...
TaameBerhe2
 
PDF
AI TECHNIQUES FOR IDENTIFYING ALTERATIONS IN THE HUMAN GUT MICROBIOME IN MULT...
vidyalalltv1
 
PPT
Footbinding.pptmnmkjkjkknmnnjkkkkkkkkkkkkkk
mamadoundiaye42742
 
PPTX
Damage of stability of a ship and how its change .pptx
ehamadulhaque
 
PPTX
Introduction to Internal Combustion Engines - Types, Working and Camparison.pptx
UtkarshPatil98
 
PPTX
fatigue in aircraft structures-221113192308-0ad6dc8c.pptx
aviatecofficial
 
PPTX
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 
PPTX
DATA BASE MANAGEMENT AND RELATIONAL DATA
gomathisankariv2
 
PDF
Viol_Alessandro_Presentazione_prelaurea.pdf
dsecqyvhbowrzxshhf
 
PPTX
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
PPT
New_school_Engineering_presentation_011707.ppt
VinayKumar304579
 
PPTX
How Industrial Project Management Differs From Construction.pptx
jamespit799
 
PDF
Electrical Machines and Their Protection.pdf
Nabajyoti Banik
 
PDF
MAD Unit - 1 Introduction of Android IT Department
JappanMavani
 
PDF
REINFORCEMENT LEARNING IN DECISION MAKING SEMINAR REPORT
anushaashraf20
 
PDF
AN EMPIRICAL STUDY ON THE USAGE OF SOCIAL MEDIA IN GERMAN B2C-ONLINE STORES
ijait
 
PPTX
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
DOC
MRRS Strength and Durability of Concrete
CivilMythili
 
PDF
SERVERLESS PERSONAL TO-DO LIST APPLICATION
anushaashraf20
 
PPTX
澳洲电子毕业证澳大利亚圣母大学水印成绩单UNDA学生证网上可查学历
Taqyea
 
Halide Perovskites’ Multifunctional Properties: Coordination Engineering, Coo...
TaameBerhe2
 
AI TECHNIQUES FOR IDENTIFYING ALTERATIONS IN THE HUMAN GUT MICROBIOME IN MULT...
vidyalalltv1
 
Footbinding.pptmnmkjkjkknmnnjkkkkkkkkkkkkkk
mamadoundiaye42742
 
Damage of stability of a ship and how its change .pptx
ehamadulhaque
 
Introduction to Internal Combustion Engines - Types, Working and Camparison.pptx
UtkarshPatil98
 
fatigue in aircraft structures-221113192308-0ad6dc8c.pptx
aviatecofficial
 
VITEEE 2026 Exam Details , Important Dates
SonaliSingh127098
 
DATA BASE MANAGEMENT AND RELATIONAL DATA
gomathisankariv2
 
Viol_Alessandro_Presentazione_prelaurea.pdf
dsecqyvhbowrzxshhf
 
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
New_school_Engineering_presentation_011707.ppt
VinayKumar304579
 
How Industrial Project Management Differs From Construction.pptx
jamespit799
 
Electrical Machines and Their Protection.pdf
Nabajyoti Banik
 
MAD Unit - 1 Introduction of Android IT Department
JappanMavani
 
REINFORCEMENT LEARNING IN DECISION MAKING SEMINAR REPORT
anushaashraf20
 
AN EMPIRICAL STUDY ON THE USAGE OF SOCIAL MEDIA IN GERMAN B2C-ONLINE STORES
ijait
 
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
MRRS Strength and Durability of Concrete
CivilMythili
 
SERVERLESS PERSONAL TO-DO LIST APPLICATION
anushaashraf20
 
澳洲电子毕业证澳大利亚圣母大学水印成绩单UNDA学生证网上可查学历
Taqyea
 

Improving performance of apriori algorithm using hadoop

  • 1. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 _______________________________________________________________________________________ Volume: 03 Issue: 12 | Dec-2014, Available @ http://www.ijret.org 277 IMPROVING PERFORMANCE OF APRIORI ALGORITHM USING HADOOP Ravindra Bachate1 , Hyder Ali Hingoliwala2 1 Department of Computer Engineering, JSPM’s JSCOE, Pune, Maharashtra, India 2 Department of Computer Engineering, JSPM’s JSCOE, Pune, Maharashtra, India Abstract Spatial data is a data having a geological information. This paper explores the use of Hadoop framework to improve the performance of Apriori algorithm for spatial data mining. FP growth algorithm is better than Apriori but it fails in certain situations. By applying the Apriori algorithm parallely using Hadoop framework to spatial data, we can perform well as compare to FP growth. This paper includes clustering based on geological location, classification based on mineral resource type and spatial coherence between mineral resources. Spatial data mining find out the different association rules by observing the spatial data by using Apriori algorithm. The result of the paper will indicate the accurate prediction of occurrence of commodity with respect to other commodity of mineral resources. Keywords: Hadoop, data mining, association rules, clustering, spatial coherence --------------------------------------------------------------------***---------------------------------------------------------------------- 1. INTRODUCTION FP growth algorithm is a very popular algorithm used for the association mining due its performance and less storage requirement. With these pros, it has some limitations also which are considered rarely. This paper focuses on the limitations of FP growth association mining algorithm and trying to suggest improvements in Apriori algorithm which can overcome the limitations of FP Growth association mining algorithm. As we aware about the hardware cost, it’s decreasing day by day, so there is no need to concentrate more on storage requirement. FP growth has two limitations – 1. It is difficult to use for interactive mining where the user may change support value as per requirement. 2. It is not suitable where the data has been increasing with time. So to deal with limitations of FP growth, this paper suggest to implement Apriori algorithm using Hadoop for association rule mining. 1.1 Mineral Resources Data System Mineral Resources Data System is a collection of data describing metallic and nonmetallic mineral resources in the world [7]. It includes resource name, location, commodity, geologic characteristics, resource description, production, reserves, and references. As MRDS contains mineral resources data around the world, it is large and complex. If data size goes beyond the Tera Byte, it is difficult to process and mine using FP growth algorithm. The performance of FP growth algorithm hampers when adding the new records and also by changing the support value. As the mineral resources data set is collected from various regions and people, we need to perform ETL (extract, transform and load) operations on the data for processing and mining. 1.2 Hadoop Map Reduce To deal with unstructured and big data like mineral resources data system (MRDS), we need a best technology which can cope with it. There are two options available, parallel DBMS and Hadoop Map Reduce technology. Hadoop has another projects also which can be used for the mining purpose like Hive. But internally again all the projects using Hadoop Map Reduce technique [5]. It gives a better data processing performance with minimum cost and time as compare to parallel DBMS because it works with commodity hardware. Hadoop stores data in the form of blocks on Hadoop Distributed File System i.e. HDFS. The Hadoop framework provides a solution for problems of massive data processing; because it runs applications on large cluster built of commodity hardware with failure tolerance [4].Unstructured data can be processed with Hadoop Map Reduce technique which is not possible with RDBMS. Map Reduce provides flexibility and fault tolerance which is not with parallel DBMS. Map Reduce provides automatic parallelization, data partitioning, task scheduling, handling machine failures and manages inter- machine communication. Hadoop is totally transparent from the end user. The rate of growing an unstructured data is much more as compare to the structured data. The unstructured data includes media files, heavy text files, csv files, log files etc. 2. RELATED WORK Association rule mining algorithm includes to find coverage, support, confidence, lift and interesting [4]. Coverage defines the proportion of case data specified on the Left Haft Hand Side of the rule. Support of an association rule means percentage of task relevant data for which the pattern is true. Confidence gives the trustworthiness associated with the patterns discovered. Lift is a measure of the importance of the association. The term interesting gives the strength of associations between sets of items in the association rule.
  • 2. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 _______________________________________________________________________________________ Volume: 03 Issue: 12 | Dec-2014, Available @ http://www.ijret.org 278 Hongyong Yu, Deshuai Wang [1] proposed a system for data processing and mining log data of SaaS cloud using Hadoop. This paper focuses on Hadoop’s Map Reduce technique and the algorithm used for data mining by Hongyong Yu, Deshuai Wang. They suggest by applying Apriori algorithm concurrently in the distributed system, performance of Apriori algorithm can be increased in proportional to the number of nodes in the distributed system [1]. Every day, 2.5 quintillion bytes of data are created and 90 percent of the data in the world today were produced within the past two years [3].The commonly used software technology cannot cope with massive data and big challenge is to extract an important information from it. Big data has large volume, heterogeneous format and decentralized data control. The example of big data applications are Facebook, Twitter and Google. It is a big challenge to manage and mining a massive data because of its volume, different file formats and growing rate of the data in the world. There are many challenges with big data such as storage, processing, variety and cost. 3. PROPOSED SYSTEM This paper proposes a system which overcomes the problem faced in FP growth association mining algorithm. To improve the performance of Apriori algorithm, it is implemented parallely using Hadoop map reduce technique. The proposed system has three modules 1. Spatial Clustering 2. Spatial Classification 3. Spatial Coherence Before implementing these modules, we need to perform Extract, Transfer and Load operations on the raw data. Because the data available may be in the various form and having an unnecessary information into to it. So first we need to extract the required data from raw data and then transfer it in to the csv format. To process this data, it should be loaded on to the HDFS. Fig-1 Proposed System 3.1 Spatial Clustering There are various algorithms available for clustering like k means but here in this paper, Hadoop partition technique is used to perform spatial clustering based on the geological location like Alaska, California etc[2]. It runs parallel on different nodes concurrently. Because of this, time required for spatial clustering is less as compared to k-means clustering algorithm. In this paper, U.S. mineral resources dataset is taken. In effect, there are 50 clusters formed with respect to the states in United State. These clusters have all the records which belongs to respective state. Intention behind to form the clusters is to make it specialize and reduce the processing time and cost. 3.2 Spatial Classification Here, the same technique is used for implementing a spatial classification which is used for spatial clustering. In this module, again the data set is clustered according to the type of commodity which is required for the next module. In spatial classification, the records having same commodity put in to the separate cluster to make it more specialize. It means, e.g. Alaska state may have another sub clusters with commodity type like gold, silver, copper etc. 3.3 Spatial Coherence Objective of this paper is to find out the spatial coherence. Spatial coherence means if we are mining for gold then what are the possibilities of getting other mineral resources at the same location. For this, association mining is required. As we have discussed above, FP Growth algorithms fails in two situations. So to overcome this, this paper introduces parallel Apriori algorithm for spatial association mining. In Apriori algorithm, first of all, we have to find out the candidate keys. After this, we have to perform two task, one is to find out the support values and second is to find out the confidence. Formula’s to calculate a support and confidence for gold with copper.
  • 3. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 _______________________________________________________________________________________ Volume: 03 Issue: 12 | Dec-2014, Available @ http://www.ijret.org 279 1. Support Formula Where N is all the records in a cluster 2. Confidence Formula Here, a formulas for calculating support and confidence of gold is given. Support of (gold => copper) gives the probability of having gold with copper in all the records whereas confidence gives the probability of having of gold with copper with respect to all the gold records in a cluster. The value of support is always less than the confidence. 4. RESULTS To implement this idea, we have taken a dataset of MRDS which is a spatial data set of mineral resources in U.S.[7]. This sample data has around 3.5 lack records. Table -1: Support and Confidence Commodity Support (%) Confidence (%) (Gold -> Copper) 2.89 9.30 (Gold -> Copper, Lead, Silver) 0.72 2.3 (Gold -> Silver) 2.17 6.97 (Gold -> null) 24.63 79.06 Here, the Apriori algorithm is performed on single node and multi node Hadoop to compare the performance. Also we find the association rules for this spatial data for a single cluster. Table -2: Execution on Single and Multinode System No. of Node Execution Time in Sec 1 5.2 2 3.1 3 2.3 The result in Table-1 shows the different association rules with respect to gold.Table-2 shows the performance of Apriori algorithm using single node and multi node system. It is observed that if we implement the Apriori Algorithm parallel on Hadoop, performance is improved. 5. CONCLUSION FP growth algorithm has two limitations, it cannot be used for dynamic data size and where the support needs to change according to situation. By applying the Apriori Algorithm concurrently using Hadoop, we can overcome the problems faced by FP growth association mining algorithm. Also we can improve the speed of association rule mining for spatial data as compare to the FP growth algorithm as Hadoop is a distributed system. ACKNOWLEDGEMENTS I express true sense of gratitude towards my project guide Prof. H.A. Hingoliwala, Associate Professor Computer Department for his invaluable co-operation and guidance that he gave me throughout my project. I specially thank our P.G coordinator Prof. M. D. Ingle for inspiring me and providing me all the lab facilities. I would also like to express my appreciation and thanks to HOD Prof. S.M. Shinde & JSCOE Principal Dr. M.G. Jadhav and all my friends who knowingly or unknowingly have assisted me throughout my hard work REFERENCES [1]. Hongyong Yu, Deshuai Wang, “Mass Log Data Processing and Mining Based on Hadoop and Cloud Computing” .The 7th International Conference on Computer Science & Education (ICCSE 2012)July 14-17, 2012. Melbourne, Australia. [2]. Duck-Ho Bae Coll. of Inf. & Commun., Hanyang Univ., Seoul, South Korea Ji-Haeng Baek ; Hyun-Kyo Oh ; Ju- Won Song ; Sang-Wook Kim, “SD-Miner: A SPATIAL DATA MINING SYSTEM” Network Infrastructure and Digital Content, 2009. [3]. Xindong Wu,Fellow, IEEE,Xingquan Zhu, Senior Member, IEEE, Gong-Qing Wu, and Wei Ding,Senior Member, IEEE, “Data mining with big data,” IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 1, JANUARY 2014. [4]. Binbin He, Ying Cui, Jianhua Chen, Pingjing Xie, “A Spatial Data Mining Method for Mineral Resources Potential Assessment,” IEEE 978-1-4244-8351- 8/11,2011IEEE. [5]. Hadoop: The definitive Guide, 3rd ed., O’Reilly, Tom White, 2012 [6]. Hadoop, http://hadoop.apache.org/ [7]. MRDS, http://tin.er.usgs.gov/mrds/