International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 07 | July -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 743
Mining Big Data using Genetic Algorithm
Surbhi Jain
Assistant Professor, Department of Computer Science, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract – The amount of data available in the world is
growing rapidly because of the use of the internet, smart
phones, social networks, etc. Such a collection of large and
complex data sets is referred to as Big Data. Traditional
database systems are unable to capture, store and analyse
this large amount of data. Text processing must improve so
that relevant information and knowledge which was
previously unknown can be mined from the text. This paper
argues the need for an algorithm for the clustering problem
of big data that combines the genetic algorithm with some of
the known clustering algorithms. The main idea is to combine
the advantages of genetic algorithms and clustering to process
large amounts of data. A genetic algorithm is a search
technique used to optimize results. This paper gives an
overview of concepts like data mining, genetic algorithms and
big data.
Key Words: Genetic Algorithms, Big Data, Clustering,
Chromosomes, Mining
1. INTRODUCTION
In the current Big Data age, data is becoming more and more
available owing to advances in information and
communication technology, and enterprises are gaining
meaningful information, relevant knowledge and insight from
this huge volume of data to support decision making. Big
data mining is the ability to extract valuable information
from huge and complex sets of data or data streams, i.e. Big
Data. One of the important data mining techniques for big
data analysis is clustering. Applying clustering techniques
to big data is difficult because of the enormous amount of
data arriving on a daily basis. There are many clustering
techniques available, the most common of which is the
K-means algorithm, which groups the objects of a dataset
around k centroids. With the plethora of data that big data
brings, however, the available clustering algorithms are not
very efficient. As Big Data refers to terabytes and petabytes
of data, we need clustering algorithms with low
computational cost. We can think of designing an algorithm
which combines the features of some of the clustering
algorithms and the genetic algorithm to process big data.
Mining is the process of extracting meaningful information
from source data. It is a set of computerized techniques
used to extract previously unknown or buried information
from large sets of databases. Successful data mining makes it
possible to uncover patterns and relationships, and then to
use this “new” information for making proactive,
knowledge-driven business decisions.
There are many algorithms in use for mining information
from plain text. Genetic Algorithms are algorithms used to
solve optimization problems. They work as search procedures
over candidate solutions and eventually generate useful
solutions for such problems.
2. GENETIC ALGORITHMS
Genetic Algorithms (GAs) are a family of computational models
inspired by Darwin's theory of evolution. According to
Darwin, the species which are fittest and can adapt to
changing surroundings survive; the remainder tend to die
away. A GA mirrors this idea: a population is carried from
one generation to the next through reproduction, crossover
and mutation. The basic working mechanism is as follows:
the algorithm starts with a set of solutions (represented
by chromosomes) called the population. Solutions from one
population are taken and used to form a new population
(reproduction). This is driven by the hope that the new
population will be superior to the old one, which is why
GAs are sometimes termed optimistic search algorithms.
Reproductive opportunities are distributed in such a way
that chromosomes which represent a better solution to the
target problem are given more chances to reproduce than
those which represent inferior solutions.
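The idea that fitter chromosomes get more chances to reproduce can be illustrated with fitness-proportional (roulette-wheel) selection. The sketch below is one common way of doing this and is not taken from the paper; the function name `roulette_select` and the toy population are assumptions made for illustration.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("fitness values must sum to a positive number")
    threshold = rng.random() * total  # point on the wheel, in [0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if running > threshold:
            return individual
    return population[-1]  # guard against floating-point round-off

# Hypothetical population: "C" is six times fitter than "A".
rng = random.Random(0)
population = ["A", "B", "C"]
fitnesses = [1.0, 3.0, 6.0]
counts = {name: 0 for name in population}
for _ in range(10_000):
    counts[roulette_select(population, fitnesses, rng)] += 1
print(counts)  # "C" is drawn most often, "A" least often
```

Inferior chromosomes are still selected occasionally, which keeps some diversity in the population.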
GAs search through a huge number of parameter combinations
to find the best match. For example, they can search through
different combinations of materials and designs to find the
combination of both which results in a stronger, lighter and
overall better final product.
As an example, consider “Face Recognition Systems” which
are used for drawing sketches based on descriptions. Such a
system is mainly used in investigations, where a sketch of a
criminal is to be made on the basis of the description given
by an eye witness to the crime. The initial population is a
collection of facial features already stored in the system.
The features may include many varieties of noses, ears,
lips, eyes etc., which may differ in color, size or any other
attribute. As the witness gives descriptions, the features
which are most likely to match are selected (selection). The
selected features then go through crossover and mutation to
produce more likely features. For instance, the eyes of one
face and the lips of another can be chosen for crossover to
produce a new individual in which both features match the
criminal. The process continues until the witness recognizes
the final face as the one desired.
3. BIG DATA
Big data is a term for data sets that are so large or complex
that traditional data processing application software is
inadequate to deal with them. Big data represents a new
period in data study and utilization. Big Data platforms
typically leverage open source technology to provide a
robust, secure, highly available, enterprise-class
environment. Challenges include capture, storage, analysis,
querying, and updating data safely and securely. While the
term “big data” is relatively new, the practice of collecting
and storing large amounts of information for eventual
analysis is ages old.
The significance of big data is not based on how much data
we have, but on how we use that data. We can take data from
any source and analyze it to find answers that enable us to
produce results at reduced cost and time through smart
decision making. In this paper we try to combine big data
with genetic algorithms to analyze data efficiently. The
reason for the interest in genetic algorithms is that
they are very powerful and broadly applicable search
techniques. As said earlier, Big Data refers to large-volume,
complex, growing data sets with numerous autonomous
sources. Big Data is now rapidly expanding in all fields of
science and engineering, including the physical, biological
and biomedical sciences, with the fast development of
networking, data storage, and data collection capacity.
With Big Data technology, computations can be sped up. In
the usual case, if a system slows down because the data has
become too big for it to manage, we add RAM or free
resources by terminating certain processes. Big data
platforms, on the contrary, add more systems to the pool and
thereby promote parallelism. This, however, makes fault
tolerance a necessity: the greater the number of systems, the
greater the probability of a system failure. Fortunately,
big data platforms handle this automatically by duplicating
data across the systems, so that if one system fails,
requests for its data can be redirected to some other
system.
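The replication idea can be sketched in a few lines. The example below is an illustrative toy, not the placement policy of any particular platform; the names `place_replicas` and `readable_blocks`, the round-robin placement and the replication factor are all assumptions.

```python
def place_replicas(block_ids, nodes, replication=2):
    """Assign each data block to `replication` distinct nodes (round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    placement = {}
    for n, block in enumerate(block_ids):
        placement[block] = [nodes[(n + r) % len(nodes)] for r in range(replication)]
    return placement

def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one copy on a healthy node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }

placement = place_replicas(["b0", "b1", "b2", "b3"], ["n0", "n1", "n2", "n3"])
print(readable_blocks(placement, failed_nodes={"n1"}))  # every block survives one failure
```

With two copies per block, any single node failure leaves all data readable; losing two adjacent nodes in this toy layout would lose one block, which is why larger replication factors are commonly used.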
4. DATA MINING
Data Mining technology is used to extract knowledge from
data sets by searching and analyzing the data. The data to be
mined varies from a small data set to an enormously sized
data set, i.e. big data. In Data Mining, the source data is
kept in databases, i.e. in the form of tables if we are
considering relational databases. We then apply algorithms
to extract data from the databases. The Data Mining
environment produces voluminous data. The information
retrieved in the data mining step is transformed into a
structure that is easily understood by users. Once data has
been extracted and transformed, it is loaded into systems
from where it can be read. Data mining includes methods such
as genetic algorithms, support vector machines, decision
trees, neural networks and cluster analysis, all of which
aim to disclose the hidden patterns inside huge data sets.
Handling such large data sets requires algorithms that
define the structures and approaches used to process Big
Data, together with the tools developed for analyzing it.
Data mining and text mining are often used synonymously,
which is not right. Although both are mining techniques,
there is a thin line of difference between the two. Data
mining refers to the extraction of useful, previously
unknown information from databases, while text mining
refers to the extraction of useful knowledge from plain
text, i.e. naturally occurring text. Unlike in data mining,
this text need not be transformed into any other format.
5. CLUSTERING
Clustering refers to grouping similar kinds of objects. It is
a method of exploring the data, a technique for finding
patterns in the dataset. It falls in the category of
unsupervised learning, i.e. we do not know in advance how
the data objects (of similar types) should be grouped
together. It is one of the most vital research fields in
data mining. In clustering we aim at making collections of
objects in such a manner that objects having the same
attributes belong to the same group and objects with
different behaviors fall in dissimilar groups. With the
formation of groups, we can easily identify areas where the
object space is dense and where it is sparsely filled, and
hence can determine the distribution patterns. We can find
interesting patterns directly from the data sets without
needing much background knowledge. One of the popular
approaches to clustering is partitioning. Partitioning
works by moving objects from one cluster to another,
starting from an initial partition. The number of clusters
must be defined in advance for this technique (as in the
k-means algorithm).
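A partitioning method such as k-means can be sketched as follows. This is a minimal illustration of the standard Lloyd iteration, assuming points are tuples of numbers and that the initial centroids are supplied by the caller; the helper names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

def recompute(points, labels, centroids):
    """Move each centroid to the mean of its members (empty clusters stay put)."""
    updated = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            updated.append(old)
    return updated

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    for _ in range(max_iter):
        updated = recompute(points, assign(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, labels = k_means(points, [(0, 0), (10, 10)])
print(centroids, labels)
```

The result depends on the initial centroids, which is the weakness a genetic search over centroid sets tries to address.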
6. GENETIC ALGORITHM FOR CLUSTERING
The voluminous data available to us can be divided into
small groups, where each group can be considered as a
population. By applying genetic operators iteratively to the
population we can find the optimum solution for the current
scenario. Search is a problem-solving method in which the
sequence of steps leading to the solution cannot be
determined in advance. Its success depends on how well the
search operators are applied. An ideal search should be
capable of searching locally as well as in a random manner.
Random search explores the entire solution space and is
proficient in avoiding local optima, while local search
helps in exploring all the local possibilities and reaching
the best solution among them.
As discussed earlier, a genetic algorithm is capable of
effectively searching the problem domain and solving
complex problems by simulating natural evolution. It
performs a search and provides near-optimal solutions for
the objective function of an optimization problem. A set of
chromosomes is referred to as a population, wherein a
chromosome (represented as a string) encodes a point in the
search space, here a combination of cluster centroids.
The first step is to create a random population, which
represents different solutions in the search space. Next, a
few chromosomes are selected as per the principle of
survival of the fittest and passed into the next generation.
Chromosomes are encoded strings (binary in the classical
formulation) which represent candidate solutions to the
optimization problem. Each string is evaluated with the
fitness function (objective function), giving a measure of
solution quality called the fitness value. A new population
of candidate solutions is created by applying recombination
(crossover and mutation) to the selected candidates.
Individual representation and population initialization,
fitness computation, selection, crossover and mutation are
thus the basic steps of a genetic algorithm for data
clustering. The algorithm is given below:
Input:
k: the number of clusters
d: the data set containing n objects
p: population size
Tmax: maximum number of iterations
Output:
A set of k clusters
1) Initialize every chromosome to have k random centroids
selected from the data set.
2) For T = 1 to Tmax
(i) For every chromosome i
a. Allocate each data object to the cluster
with the closest centroid.
b. Recompute the k cluster centroids of
chromosome i as the mean of their data objects.
c. Compute the fitness of chromosome i.
(ii) Generate the new group of chromosomes using
GA selection, crossover and mutation.
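The steps above can be sketched in Python. The paper does not fix the fitness function or the genetic operators, so the following choices are assumptions made only to obtain a runnable illustration: fitness is 1 / (1 + SSE), selection is a binary tournament, crossover is one-point over the list of centroids, and mutation replaces one centroid with a random data object.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def refine(chromosome, data):
    """Steps a-c: assign objects, recompute centroids, compute fitness."""
    groups = [[] for _ in chromosome]
    for point in data:
        nearest = min(range(len(chromosome)), key=lambda j: sq_dist(point, chromosome[j]))
        groups[nearest].append(point)
    refined = [
        tuple(sum(c) / len(g) for c in zip(*g)) if g else old
        for g, old in zip(groups, chromosome)
    ]
    # Assumed fitness: inverse of the sum of squared errors (SSE).
    sse = sum(min(sq_dist(p, c) for c in refined) for p in data)
    return refined, 1.0 / (1.0 + sse)

def tournament(scored, rng):
    """Binary tournament: the fitter of two random chromosomes wins."""
    a, b = rng.sample(scored, 2)
    return a[0] if a[1] >= b[1] else b[0]

def ga_cluster(data, k, pop_size, t_max, rng, mutation_rate=0.1):
    # Step 1: every chromosome holds k centroids drawn from the data set.
    population = [rng.sample(data, k) for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(t_max):  # Step 2
        scored = [refine(ch, data) for ch in population]  # Step 2(i)
        for chromosome, fit in scored:
            if fit > best_fit:
                best, best_fit = chromosome, fit
        population = []  # Step 2(ii): selection, crossover, mutation
        while len(population) < pop_size:
            p1, p2 = tournament(scored, rng), tournament(scored, rng)
            cut = rng.randint(1, k - 1) if k > 1 else 0
            child = p1[:cut] + p2[cut:]
            if rng.random() < mutation_rate:
                child[rng.randrange(k)] = rng.choice(data)
            population.append(child)
    return best, best_fit

data = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, fitness = ga_cluster(data, k=2, pop_size=6, t_max=5, rng=random.Random(1))
print(centroids, fitness)
```

The function keeps the best chromosome seen in any generation, since the pseudocode does not say which individual is reported at the end.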
The backbone of a Genetic Algorithm is the fitness function
F(x). Its purpose is to score the candidate solutions
produced in each generation of the GA.
It is derived from the objective function and then used in
the successive genetic operations such as selection,
crossover and mutation. Fitness is a quality value which
measures the reproductive efficiency of an individual string
(chromosome). A score is given to each individual
chromosome with the help of the fitness function. The
proposal is to develop a Genetic Algorithm based clustering
algorithm which is expected to provide a better clustering
than the K-Means approach. This may, however, induce a
little more time complexity.
The major benefit of using genetic algorithms is that they
are easily parallelized. Parallel implementation of GAs is
realized using two commonly used models, namely:
• Coarse-grained parallel GA
• Fine-grained parallel GA
In the first model every node is given a split of the
population to process, while in the second model each
individual is provided with a separate node for fitness
evaluation. Adjoining nodes communicate with each other for
selection and the remaining operations.
6.1 PARALLEL IMPLEMENTATION FOR CLUSTERING USING GAs
At first, the input data set is fragmented according to the
block size by the input format. Each fragment is then given
to a mapper to perform the first phase of clustering, the
results of which are passed on to a single reducer to
perform the second phase of clustering.
Step 1: Population initialization
Each mapper forms the initial population of individuals
after receiving its input fragment. Each individual is a
chromosome of size 𝑁. Every segment of the chromosome is
a centroid. Centroids are randomly selected data points from
the received data split. For each chromosome, clustering is
performed by assigning every data point to the cluster of
the closest centroid. Then the fitness is evaluated.
Step 2: Mating & Selection
Crossover and mutation techniques are used for mating.
For crossover, we generally use arithmetic crossover, which
generates one offspring from two parents. Each centroid of
the offspring is the arithmetic average of the corresponding
centroids of the parents. The swap mutation technique is
used for mutation; in this, the 9’s complement of the data
points is taken. Offspring of the older population are
selected to produce a new population. For selection, an
approach known as tournament selection is used, wherein an
individual is selected by holding a tournament, based on
fitness evaluation, among several individuals chosen at
random from the population.
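The three operators of Step 2 can be sketched as below. Arithmetic crossover and tournament selection follow the description in the text. For mutation, the sketch uses the textbook meaning of swap mutation (exchanging two gene positions); this is an assumption and does not reproduce the 9’s complement variant mentioned above. All function names are hypothetical.

```python
import random

def arithmetic_crossover(parent_a, parent_b):
    """One offspring: each centroid is the average of the parents' centroids."""
    return [
        tuple((x + y) / 2 for x, y in zip(centroid_a, centroid_b))
        for centroid_a, centroid_b in zip(parent_a, parent_b)
    ]

def swap_mutation(chromosome, rng):
    """Exchange two randomly chosen centroid positions (textbook swap mutation)."""
    mutated = list(chromosome)
    if len(mutated) >= 2:
        i, j = rng.sample(range(len(mutated)), 2)
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

def tournament_select(population, fitnesses, size, rng):
    """Return the fittest of `size` individuals drawn at random."""
    contenders = rng.sample(range(len(population)), size)
    winner = max(contenders, key=lambda idx: fitnesses[idx])
    return population[winner]

rng = random.Random(0)
child = arithmetic_crossover([(0, 0), (2, 2)], [(2, 4), (4, 6)])
print(child)  # [(1.0, 2.0), (3.0, 4.0)]
print(swap_mutation(child, rng))
```

Swapping positions does not change the clustering a chromosome encodes, but it does change which centroids are paired during a later arithmetic crossover.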
Step 3: Termination
The new population thus generated replaces the older
population, and in turn forms a newer population using the
mating and selection procedure. This whole procedure is
repeated until the termination condition is met. The
termination condition can be, for example, reaching a
specified number of iterations or reaching a particular
solution quality. The fittest individual of the final
population of each mapper is handed on as the result to the
reducer. The reducer then performs the second phase of
clustering on the results of all mappers.
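The two-phase flow can be sketched without any MapReduce framework. To keep the example short, each mapper and the reducer use a plain k-means routine as a stand-in for the per-fragment GA; that substitution, the function names and the toy data are assumptions made for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, k, iterations=10):
    """Stand-in clusterer: plain k-means seeded with the first k points."""
    centroids = list(points[:k])
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda idx: sq_dist(p, centroids[idx]))
            groups[j].append(p)
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centroids)
        ]
    return centroids

def split(data, block_size):
    """Fragment the input data set according to the block size."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def mapper(fragment, k):
    """First phase: cluster one fragment and emit its centroids."""
    return lloyd(fragment, k)

def reducer(mapped, k):
    """Second phase: cluster the centroids collected from all mappers."""
    merged = [centroid for centroids in mapped for centroid in centroids]
    return lloyd(merged, k)

data = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10), (1, 1), (11, 11)]
fragments = split(data, block_size=4)
final_centroids = reducer([mapper(f, 2) for f in fragments], 2)
print(final_centroids)  # [(0.5, 0.5), (10.5, 10.5)]
```

The reducer only sees a handful of centroids rather than the full data set, which is what makes a single reducer feasible.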
6.2 GENETIC K-MEANS ALGORITHM
Apart from a parallel implementation using Genetic
Algorithms, we can also have an algorithm that combines the
advantages of the genetic algorithm and the K-means
algorithm for clustering. It is expected to provide a better
clustering than the K-means approach, but probably with a
little more time complexity.
The major steps of the GK-means algorithm are:
1) Set the population.
2) Compute the fitness of every individual using the
equation
$\mathrm{Fitness}(i) = \frac{2\,(p_i - 1)}{Q - 1}$
where i is the individual, $p_i$ is its position and Q is
the total number of individuals.
3) If the fitness condition is satisfied, then assign the
solution, else
4) Calculate the sub populations and migrate.
5) The count of the i-th individual depends on the rate
$S_i$, which is relative to its level of fitness:
$S_i = \mathrm{Fitness}(i) / \sum_{j=1}^{Q} \mathrm{Fitness}(j)$
6) Translate the population and assess individual fitness.
7) Perform crossover and mutation on each sub population.
8) If the termination condition is satisfied, stop; else go
to step 5.
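The two formulas in steps 2 and 5 can be worked through in code. The sketch assumes that the position p_i is the rank of individual i when the population is sorted from worst (position 1) to best (position Q); the text does not state the ordering, so this direction is an assumption, as are the function names.

```python
def rank_fitness(position, total):
    """Fitness(i) = 2 * (p_i - 1) / (Q - 1) for position p_i in 1..Q."""
    if total < 2:
        raise ValueError("at least two individuals are needed")
    return 2.0 * (position - 1) / (total - 1)

def selection_rates(raw_scores):
    """S_i = Fitness(i) / sum_j Fitness(j), ranking the worst score first."""
    total = len(raw_scores)
    order = sorted(range(total), key=lambda i: raw_scores[i])
    fitness = [0.0] * total
    for position, idx in enumerate(order, start=1):
        fitness[idx] = rank_fitness(position, total)
    normaliser = sum(fitness)
    return [f / normaliser for f in fitness]

rates = selection_rates([0.3, 0.9, 0.1])
print(rates)  # the best raw score receives the largest rate
```

Under this ranking the fitness values run from 0 to 2 and always sum to Q, so the rate reduces to Fitness(i) / Q and the worst individual is never selected.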
A major drawback of the k-means algorithm is that it does
not cope well with large amounts of data. With a small
amount of data k-means is easy to run, but for a large
amount of data it may not give the desired results. Since we
are dealing with Big Data, plain k-means is not a sufficient
solution to our problem. GK-means, on the contrary, is
expected to process big data more effectively and still give
the desired results. The genetic k-means search gradually
converges towards the global optimum.
7. DISADVANTAGES OF GA
A major difficulty in applying Genetic Algorithms is how to
handle constraints. Genetic operators often produce
infeasible offspring while manipulating chromosomes. A
penalty technique is used to keep a check on the number of
infeasible solutions produced in each generation. This helps
in steering the genetic search towards an optimal solution.
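A penalty technique can be sketched as a fitness adjustment. The example below assumes a hypothetical constraint (no cluster may be empty) and a fixed penalty weight; neither is specified in the paper, and both function names are illustrative.

```python
def empty_cluster_violations(labels, k):
    """One violation per cluster that received no data objects."""
    counts = [0] * k
    for label in labels:
        counts[label] += 1
    return [1.0 if count == 0 else 0.0 for count in counts]

def penalized_fitness(raw_fitness, violations, penalty_weight=10.0):
    """Lower the fitness of infeasible solutions in proportion to their violations."""
    return raw_fitness - penalty_weight * sum(violations)

feasible = penalized_fitness(5.0, empty_cluster_violations([0, 1, 2], k=3))
infeasible = penalized_fitness(5.0, empty_cluster_violations([0, 0, 1], k=3))
print(feasible, infeasible)  # 5.0 -5.0
```

Infeasible offspring are not discarded; they simply compete with a handicap, so useful genetic material they carry can still be passed on.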
Apart from this, a few other disadvantages are:
1) GAs are challenging to understand and to describe to
end users.
2) Abstracting the problem and finding a means to represent
individuals is quite difficult.
3) Determining the best fitness function is difficult.
4) Deciding how to perform crossover and mutation is another
difficulty.
5) The large over-production of individuals and the random
character of the search process are further drawbacks.
8. FUTURE SCOPE
The paper compares and reviews the methods available for
clustering data based on genetic algorithms. A more robust
and time-saving algorithm can be designed so that big data
can be mined effectively, overcoming the challenges faced by
Genetic Algorithms.
9. CONCLUSION
This paper provides the reader with a review of the
terminology related to analysing big data. Concepts such as
Text Mining, Big Data and Genetic Algorithms, along with
their scope, methods, advantages and challenges, are all
discussed here. The paper reviews various methods that are
available for text mining. The paper concludes that since
the prime focus is on mining big data, the algorithm
followed has to be both space efficient and time efficient.
The paper presents the need for an algorithm that
characterizes the features of the Big Data revolution, and
proposes a Big Data processing model from the data mining
perspective.

More Related Content

What's hot (20)

PDF
[IJET-V1I3P10] Authors : Kalaignanam.K, Aishwarya.M, Vasantharaj.K, Kumaresan...
IJET - International Journal of Engineering and Techniques
 
PDF
Big Data
Putchong Uthayopas
 
PDF
Cri big data
Putchong Uthayopas
 
PDF
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Robert Grossman
 
PDF
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET Journal
 
PDF
Data Mining – A Perspective Approach
IRJET Journal
 
PPTX
Big data road map
karthika karthi
 
PDF
Big Data Analytics: Challenge or Opportunity?
NUS-ISS
 
PDF
Keynote on 2015 Yale Day of Data
Robert Grossman
 
PPT
Data mining in agriculture
Sibananda Khatai
 
PDF
Introduction to Data Mining
AbcdDcba12
 
PPTX
No Free Lunch: Metadata in the life sciences
Chris Dwan
 
PDF
Data Mining in the World of BIG Data-A Survey
Editor IJCATR
 
PDF
A Novel Integrated Framework to Ensure Better Data Quality in Big Data Analyt...
IJECEIAES
 
PDF
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
IJMIT JOURNAL
 
PDF
6.a survey on big data challenges in the context of predictive
EditorJST
 
PDF
Performance analysis of data mining algorithms with neural network
IAEME Publication
 
PDF
A Survey on Data Mining
IOSR Journals
 
PDF
Real World Application of Big Data In Data Mining Tools
ijsrd.com
 
PPTX
Big Data for Ag (2019)
Benjamin Wielgosz
 
[IJET-V1I3P10] Authors : Kalaignanam.K, Aishwarya.M, Vasantharaj.K, Kumaresan...
IJET - International Journal of Engineering and Techniques
 
Cri big data
Putchong Uthayopas
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Robert Grossman
 
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET Journal
 
Data Mining – A Perspective Approach
IRJET Journal
 
Big data road map
karthika karthi
 
Big Data Analytics: Challenge or Opportunity?
NUS-ISS
 
Keynote on 2015 Yale Day of Data
Robert Grossman
 
Data mining in agriculture
Sibananda Khatai
 
Introduction to Data Mining
AbcdDcba12
 
No Free Lunch: Metadata in the life sciences
Chris Dwan
 
Data Mining in the World of BIG Data-A Survey
Editor IJCATR
 
A Novel Integrated Framework to Ensure Better Data Quality in Big Data Analyt...
IJECEIAES
 
A REVIEW ON CLASSIFICATION OF DATA IMBALANCE USING BIGDATA
IJMIT JOURNAL
 
6.a survey on big data challenges in the context of predictive
EditorJST
 
Performance analysis of data mining algorithms with neural network
IAEME Publication
 
A Survey on Data Mining
IOSR Journals
 
Real World Application of Big Data In Data Mining Tools
ijsrd.com
 
Big Data for Ag (2019)
Benjamin Wielgosz
 

Similar to Mining Big Data using Genetic Algorithm (20)

PPT
Large Scale Data Mining using Genetics-Based Machine Learning
jaumebp
 
PDF
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
ijcsa
 
PDF
Clustering of Big Data Using Different Data-Mining Techniques
IRJET Journal
 
PPTX
Genetic algorithms in Data Mining
Atul Khanna
 
PDF
Characterizing and Processing of Big Data Using Data Mining Techniques
IJTET Journal
 
PDF
Dunham - Data Mining.pdf
ssuserf71896
 
PDF
Dunham - Data Mining.pdf
PRAJITBHADURI
 
PDF
Big Data Mining - Classification, Techniques and Issues
Karan Deep Singh
 
PDF
An Efficient Approach for Clustering High Dimensional Data
IJSTA
 
PDF
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Editor IJMTER
 
PDF
Ijariie1184
IJARIIE JOURNAL
 
PDF
Ijariie1184
IJARIIE JOURNAL
 
PPTX
Data mining approaches and methods
sonangrai
 
PDF
Data minig with Big data analysis
Poonam Kshirsagar
 
PDF
Data Clustering Algorithms and Applications First Edition Charu C. Aggarwal
gylainloippu
 
PDF
reference paper.pdf
MayuRana1
 
PDF
An Heterogeneous Population-Based Genetic Algorithm for Data Clustering
ijeei-iaes
 
PPTX
DM
sowfi
 
PPTX
Big data and data mining
Polash Halder
 
Large Scale Data Mining using Genetics-Based Machine Learning
jaumebp
 
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
ijcsa
 
Clustering of Big Data Using Different Data-Mining Techniques
IRJET Journal
 
Genetic algorithms in Data Mining
Atul Khanna
 
Characterizing and Processing of Big Data Using Data Mining Techniques
IJTET Journal
 
Dunham - Data Mining.pdf
ssuserf71896
 
Dunham - Data Mining.pdf
PRAJITBHADURI
 
Big Data Mining - Classification, Techniques and Issues
Karan Deep Singh
 
An Efficient Approach for Clustering High Dimensional Data
IJSTA
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Editor IJMTER
 
Ijariie1184
IJARIIE JOURNAL
 
Ijariie1184
IJARIIE JOURNAL
 
Data mining approaches and methods
sonangrai
 
Data minig with Big data analysis
Poonam Kshirsagar
 
Data Clustering Algorithms and Applications First Edition Charu C. Aggarwal
gylainloippu
 
reference paper.pdf
MayuRana1
 
An Heterogeneous Population-Based Genetic Algorithm for Data Clustering
ijeei-iaes
 
DM
sowfi
 
Big data and data mining
Polash Halder
 
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
PDF
Kiona – A Smart Society Automation Project
IRJET Journal
 
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
PDF
Breast Cancer Detection using Computer Vision
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
Kiona – A Smart Society Automation Project
IRJET Journal
 
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Mining Big Data using Genetic Algorithm

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 07 | July-2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 743

Surbhi Jain
Assistant Professor, Department of Computer Science, India

Abstract – The amount of data available in the world is growing at a very rapid pace because of the use of the internet, smart phones, social networks, etc. This collection of large and complex data sets is referred to as Big Data. Traditional database systems are unable to capture, store and analyse this large amount of data. It is necessary to improve text processing so that relevant, previously unknown knowledge can be mined from the text. This paper argues the need for an algorithm for the clustering problem of big data that combines the genetic algorithm with some of the known clustering algorithms. The main idea is to combine the advantages of genetic algorithms and clustering to process large amounts of data. A Genetic Algorithm is an algorithm used to optimize results. This paper gives an overview of concepts like data mining, genetic algorithms and big data.

Key Words: Genetic Algorithms, Big Data, Clustering, Chromosomes, Mining

1. INTRODUCTION

In the current Big Data age, data is becoming more and more available owing to advances in information and communication technology, and enterprises are gaining meaningful information, relevant knowledge and vision from this huge data to support decision making. Big data mining is the ability to extract valuable information from huge and complex data sets or data streams, i.e. Big Data.
One of the important data mining techniques for big data analysis is clustering. Applying clustering techniques to big data is difficult because of the enormous amount of data arriving daily. Many clustering techniques are available, the most common of which is the K-means algorithm, which is used to analyze information from a dataset. With the plethora of data that big data brings, however, the available clustering algorithms are not very efficient. As Big Data refers to terabytes and petabytes of data, we need clustering algorithms that can cope with high computational costs. We can think of designing an algorithm that combines the features of some of the clustering algorithms with the genetic algorithm to process big data.

Extracting meaningful information from source data is the process called mining. It is a set of computerized techniques used to extract formerly unknown or buried information from large sets of databases. Successful data mining makes it possible to uncover patterns and relationships, and then to use this "new" information for making proactive, knowledge-driven business decisions. Many algorithms are used for mining information from plain text. Genetic Algorithms are algorithms used to solve optimization problems. They are search based and eventually generate useful solutions for such problems.

2. GENETIC ALGORITHMS

Genetic Algorithms are a family of computational models inspired by Darwin's theory of evolution. According to Darwin, the species that are fittest and can adapt to changing surroundings survive; the rest tend to die away. In the same spirit, the survival of a population is maintained through the processes of reproduction, crossover and mutation. The basic working mechanism of a GA is as follows: the algorithm starts with a set of solutions (represented by chromosomes) called the population.
Solutions from one population are taken and used to form a new population (reproduction). This is driven by the hope that the new population will be better than the old one, which is why GAs are often termed optimistic search algorithms. The reproductive opportunities are distributed so that chromosomes representing a better solution to the target problem get more chances to reproduce than those representing inferior solutions. GAs search through a huge combination of parameters to find the best match. For example, they can search through different combinations of materials and designs to find the combination that results in a stronger, lighter and overall better final product.
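The mechanism described above can be illustrated with a minimal sketch. It is not taken from the paper: the toy objective (counting 1-bits), the function names, the parameter values, and the choice of fitness-proportionate selection, one-point crossover and bit-flip mutation are all assumptions made for illustration.

```python
import random

def fitness(chromosome):
    # Toy objective (an assumption for illustration): count the 1-bits.
    return sum(chromosome)

def select(population, scores, rng):
    # Fitness-proportionate selection: better chromosomes get more chances.
    weights = [s + 1e-9 for s in scores]  # avoid an all-zero weight vector
    return rng.choices(population, weights=weights, k=2)

def crossover(a, b, rng):
    # One-point crossover: prefix of one parent, suffix of the other.
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chromosome, rate, rng):
    # Flip each bit independently with a small probability.
    return [1 - g if rng.random() < rate else g for g in chromosome]

def run_ga(n_bits=20, pop_size=30, generations=50, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Initial population: random bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        # Reproduction: build the next population from selected parents.
        population = [
            mutate(crossover(*select(population, scores, rng), rng), mutation_rate, rng)
            for _ in range(pop_size)
        ]
        # Remember the best chromosome seen so far.
        best = max(population + [best], key=fitness)
    return best
```

Calling `run_ga()` returns the best bit string seen across all generations; swapping in a different `fitness` function is the only change needed to target another problem.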
As an example, consider "Face Recognition Systems" that are used for drawing sketches based on visualizations. Such a system is mainly used in investigations, where a sketch of a criminal has to be made from the description given by an eyewitness to the crime. The initial population is the collection of facial features already stored in the system, which may include many varieties of noses, ears, lips, eyes etc. that differ in color, size or other attributes. As the witness gives descriptions, the features most likely to match are selected (selection). The selected features then go through crossover and mutation to produce more likely features. For instance, the eyes of one face and the lips of another can be chosen for crossover to produce a new individual in which both features match the criminal. The process continues until the witness recognizes the final face as the desired one.

3. BIG DATA

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Big data represents a new period in data study and utilization. Big Data platforms leverage open-source technology to provide a robust, secure, highly available, enterprise-class environment. Challenges include capturing, storing, analyzing, querying and updating data safely and securely. While the term "big data" is relatively new, the practice of collecting and storing large amounts of information for eventual analysis is ages old. The significance of big data lies not in how much data we have, but in how we use that data.
We can take data from any source and analyze it to find answers that enable us to produce results at reduced cost and time with smart decision making. In this paper we try to combine big data with genetic algorithms to generate an efficient analysis of data. The reason for the interest in genetic algorithms is that they are very powerful and broadly applicable search techniques. As said earlier, Big Data refers to large-volume, complex, growing data sets with numerous, autonomous sources. Big Data is now rapidly expanding in all fields of science and engineering, including the physical, biological and biomedical sciences, with the fast development of networking, data storage and data collection capacity.

With Big Data technology, computations can be sped up. Usually, if our system starts getting slow because the data is becoming too big for it to manage, we add RAM or free some space by terminating certain processes. Big data, on the contrary, adds more systems to the pool and thereby promotes parallelism. This, however, makes fault tolerance a necessity: the more systems there are, the higher the probability of a system failure. Fortunately, big data platforms handle this automatically by replicating data across systems, so that if one system fails, requests for its data can be redirected to another system.

4. DATA MINING

Knowledge is extracted from data sets using data mining technology, which is used to search and analyze data. The data to be mined varies from a small data set to an enormously sized one, i.e. big data. In data mining, the source data is kept in databases, i.e. in the form of tables if we are considering relational databases. We only have to apply the algorithms to extract data from the databases. The data mining environment produces voluminous data. The information retrieved in the data mining step is transformed into a structure that is easily understood by users.
Once data has been extracted and transformed, it is loaded into systems from which we can read it. Data mining includes methods such as genetic algorithms, support vector machines, decision trees, neural networks and cluster analysis, all of which disclose the hidden patterns inside huge data sets. Handling such large data sets requires algorithms that define the structures and approaches for dealing with Big Data, as well as the tools developed for analyzing them.

Data mining and text mining are often used synonymously, which is not right. Although both are mining techniques, there is a thin line of difference between the two. Data mining refers to extracting useful, previously unknown information from databases, while text mining refers to extracting useful knowledge from plain text, i.e. naturally occurring text. Unlike in data mining, this text need not be transformed into any other format.

5. CLUSTERING

Clustering refers to grouping similar objects. It is a method of exploring data, a technique for finding patterns in the dataset. It falls in the category of unsupervised learning, i.e. we do not know in advance how the data objects (of similar types) should be grouped together. It is one of the most vital research fields in data mining. In clustering we aim to form collections of objects such that objects with the same attributes belong to the same group and objects with different behaviors fall in different groups. With the groups formed, we can easily identify where the object space is dense and where it is sparse, and hence determine the distribution patterns. We can find interesting patterns directly from the data sets without much background knowledge. One popular approach to clustering is partitioning, which works by moving objects from one cluster to another, starting from an initial partition. The number of clusters must be defined in advance for this technique (as in the k-means algorithm).

6. GENETIC ALGORITHM FOR CLUSTERING

The voluminous data available to us can be divided into small groups, each of which can be considered a population. By applying genetic operators iteratively to the population, we can find the optimum solution for the current scenario. Search is a problem-solving method in which the sequence of steps leading to the solution cannot be determined in advance; its success depends on how well the search operators are applied. An ideal search should be able to search locally as well as randomly. Random search explores the entire solution space and is proficient at avoiding local optima, while local search explores the local possibilities and reaches the best solution among them. As discussed earlier, a genetic algorithm can effectively search the problem domain and solve complex problems by simulating natural evolution. It performs a search and provides near-optimal solutions for the objective function of an optimization problem.
A set of chromosomes is referred to as a population, where a chromosome (represented as a string) holds the parameters in the search space, encoded as a combination of cluster centroids.

The first step is to create a random population, which represents different solutions in the search space. Next, a few chromosomes are selected according to the principle of survival of the fittest and passed on to the next generation. Chromosomes are encoded strings (binary in the classical formulation) that represent candidate solutions to the optimization problem. Each string is evaluated with the fitness function (objective function), which gives a measure of solution quality called the fitness value. A new population of candidate solutions is created by applying recombination (crossover and mutation) to the selected candidates. Individual representation and population initialization, fitness computation, selection, crossover and mutation are thus the basic steps of a genetic algorithm for data clustering. The algorithm is given below.

Input:
k: the number of clusters
d: the data set containing n objects
p: population size
Tmax: maximum number of iterations

Output: a set of k clusters

1) Initialize every chromosome to have k random centroids selected from the data set.
2) For T = 1 to Tmax
   (i) For every chromosome i
      a. Allocate each data object to the cluster with the closest centroid.
      b. Recompute the k cluster centroids of chromosome i as the means of their data objects.
      c. Compute the fitness of chromosome i.
   (ii) Generate the new group of chromosomes using GA selection, crossover and mutation.

The backbone of a Genetic Algorithm is the fitness function F(x). The purpose of this function is to score the results produced at each generation of the GA. It is first derived from the objective function and then used in the successive genetic operations such as crossover and mutation. Fitness is a quality value that measures the reproductive efficiency of an individual string (chromosome).
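The listed steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the fitness 1 / (1 + SSE), fitness-proportionate selection, one-point crossover over the centroid list, mutation by replacing a centroid with a random data object, and all function names and default parameter values are hypothetical choices, since the paper leaves these details open.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(data, centroids):
    # Step (i)a: index of the closest centroid for every object.
    return [min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
            for x in data]

def recompute(data, labels, centroids):
    # Step (i)b: each centroid becomes the mean of its objects
    # (an empty cluster keeps its old centroid).
    new = []
    for j, old in enumerate(centroids):
        members = [x for x, lab in zip(data, labels) if lab == j]
        new.append(tuple(sum(c) / len(members) for c in zip(*members)) if members else old)
    return new

def fitness(data, centroids):
    # Step (i)c: assumed fitness, higher when clusters are tighter.
    sse = sum(min(sq_dist(x, c) for c in centroids) for x in data)
    return 1.0 / (1.0 + sse)

def ga_cluster(data, k, p=10, t_max=20, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    population = [rng.sample(data, k) for _ in range(p)]  # step 1
    best = population[0]
    for _ in range(t_max):  # step 2
        population = [recompute(data, assign(data, c), c) for c in population]
        scores = [fitness(data, c) for c in population]
        best = max(population + [best], key=lambda c: fitness(data, c))
        children = []
        for _ in range(p):  # step 2(ii)
            # Selection: fitness-proportionate choice of two parents.
            a, b = rng.choices(population, weights=scores, k=2)
            # Crossover: one-point exchange of centroids.
            cut = rng.randrange(1, k) if k > 1 else 0
            child = a[:cut] + b[cut:]
            # Mutation: occasionally replace one centroid by a random object.
            if rng.random() < mutation_rate:
                child[rng.randrange(k)] = rng.choice(data)
            children.append(child)
        population = children
    return best, assign(data, best)
```

Here `data` is a list of equal-length numeric tuples; the function returns the best centroid set seen so far together with the cluster index of every object.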
A score is given to each individual chromosome by the fitness function. The proposal is to design a Genetic Algorithm based clustering algorithm that is expected to provide better clustering than the K-Means approach, possibly at the cost of a little more time complexity. The major benefit of genetic algorithms is that they are easily parallelized. Parallel implementations of GAs are realized using two commonly used models:

• Coarse-grained parallel GA
• Fine-grained parallel GA

In the first model every node is given a population split to process, while in the second model each individual is provided with a separate node for fitness evaluation. Adjoining nodes communicate with each other for selection and the remaining operations.

6.1 PARALLEL IMPLEMENTATION FOR CLUSTERING USING GAs

At first, the input data set is fragmented according to the block size by the input format. Each fragment is then given to a mapper, which performs the first-phase clustering. The results are passed on to a single reducer, which performs the second-phase clustering.

Step 1: Population initialization

Each mapper forms the initial population of individuals after receiving its input fragment. Each individual is a chromosome of size N, and every segment of the chromosome is a centroid. Centroids are randomly selected data points from the received data split. For each chromosome, clustering is performed by assigning every data point to the cluster of the closest centroid. Then the fitness is evaluated.

Step 2: Mating & Selection

Crossover and mutation are used for mating. For crossover, we generally use arithmetic crossover, which generates one offspring from two parents: each centroid of the offspring is the arithmetic average of the corresponding centroids of the parents. Swap mutation is used for mutation; in this, the 9's complement of the data points is taken. The offspring of the older population are selected to produce a new population. For selection, an approach known as tournament selection is used, in which an individual is selected by holding a tournament, based on fitness evaluation, among several individuals chosen at random from the population.
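The arithmetic crossover and tournament selection described in Step 2 can be sketched as follows. The function names, the representation of a chromosome as a list of centroid tuples, and the tournament size of 3 are assumptions for illustration; the mutation step is left out because the paper does not specify it in enough detail.

```python
import random

def arithmetic_crossover(parent_a, parent_b):
    # One offspring: each centroid is the average of the parents'
    # corresponding centroids, coordinate by coordinate.
    return [tuple((x + y) / 2 for x, y in zip(ca, cb))
            for ca, cb in zip(parent_a, parent_b)]

def tournament_select(population, fitness, size=3, rng=random):
    # Draw `size` individuals at random and keep the fittest of them.
    contenders = rng.sample(population, size)
    return max(contenders, key=fitness)

# Two parents with two 2-D centroids each.
child = arithmetic_crossover([(0, 0), (4, 4)], [(2, 2), (6, 8)])
# child == [(1.0, 1.0), (5.0, 6.0)]
```

A larger tournament size increases selection pressure, since weak individuals are less likely to win a tournament against more contenders.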
Step 3: Termination

The new population replaces the older one and in turn forms a newer population through the mating and selection procedure. The whole procedure is repeated until the termination condition is met. The termination condition can be, for example, reaching a specified number of iterations or a particular solution. The fittest individual of the final population of each mapper is handed to the reducer as the result. The reducer then performs the second-phase clustering on the results of all mappers.

6.2 GENETIC K-MEANS ALGORITHM

Apart from the parallel implementation using Genetic Algorithms, we can also have an algorithm that combines the advantages of the Genetic Algorithm and the K-means algorithm for clustering. It is expected to provide better clustering than the K-Means approach, but probably with a little more time complexity. The major steps of the GK-means algorithm are:

1) Set the population.
2) Compute the fitness of every individual using the equation $\mathrm{Fitness}(i) = 2\,(p_i - 1)/(Q - 1)$, where $i$ is the individual, $p_i$ its position and $Q$ the total number of individuals.
3) If the fitness condition is satisfied, assign the solution; else continue.
4) Calculate the sub-populations and migrate.
5) The selection of the i-th individual depends on the rate $S_i$, which is proportional to its fitness: $S_i = \mathrm{fitness}(i) / \sum_j \mathrm{fitness}(j)$.
6) Translate the population and assess individual fitness.
7) Perform crossover and mutation on each sub-population.
8) If the termination condition is satisfied, stop; else go to step 5.

The major drawback of the k-means algorithm is that it cannot process large amounts of data. With a small amount of data k-means is easy to run, but for large amounts of data it will not give the desired results. Since we are dealing with Big Data, k-means alone is not the solution to our problem. GK-means, on the contrary, is expected to take less memory and time to process big data and to give the desired results as well.
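The rank-based fitness of step 2 and the selection rate of step 5 can be illustrated with a small worked example. It reads the fitness formula as 2 (p_i - 1) / (Q - 1) and assumes that positions run from 1 (worst) to Q (best); both the reading and the function names are assumptions, since the paper does not spell them out.

```python
def rank_fitness(position, total):
    # Fitness(i) = 2 * (p_i - 1) / (Q - 1), positions 1..Q (1 = worst, assumed).
    return 2 * (position - 1) / (total - 1)

def selection_rates(fitnesses):
    # S_i = fitness(i) / sum of all fitness values.
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

fits = [rank_fitness(p, 5) for p in range(1, 6)]  # 0.0, 0.5, 1.0, 1.5, 2.0
rates = selection_rates(fits)                     # 0.0, 0.1, 0.2, 0.3, 0.4
```

Under this reading the worst-ranked individual is never selected and the best-ranked one is selected twice as often as the median individual.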
The Genetic k-means gradually converges to the global optimum as desired.

7. DISADVANTAGES OF GA

A major difficulty in applying Genetic Algorithms is how to handle constraints. Genetic operators often produce infeasible offspring while manipulating chromosomes. A penalty technique is used to keep a check on the number of infeasible solutions produced in each generation. This helps steer the genetic search towards an optimal solution. Apart from this, a few other disadvantages are:

1) They are challenging to understand and to describe to end users.
2) Abstracting the problem and choosing a representation for individuals is quite difficult.
3) Determining the best fitness function is difficult.
4) Deciding how to perform crossover and mutation is another difficulty.
5) The large over-production of individuals and the random character of the search process are further drawbacks.

8. FUTURE SCOPE

The paper compares and reviews the methods available for clustering data based on genetic algorithms. A more robust and time-saving algorithm can be designed so that big data can be mined effectively, overcoming the challenges faced by Genetic Algorithms.

9. CONCLUSION

This paper provides the reader with a review of the terminology related to analysing big data. Text Mining, Big Data and Genetic Algorithms are discussed in terms of their concepts, examples, scope, methods, advantages and challenges. The paper reviews various methods that are available for text mining. It concludes that, since the prime focus is on mining big data, the algorithm used has to be efficient in both space and time. The paper presents the need for an algorithm that characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.