SAVITRIBAI PHULE PUNE UNIVERSITY
A MINI PROJECT REPORT ON
OPTIMIZATION OF GENETIC ALGORITHM USING IRIS
FLOWER DATASET
SUBMITTED TOWARDS THE
PARTIAL FULFILLMENT OF THE REQUIREMENTS OF
BACHELOR OF ENGINEERING (Computer Engineering)
BY
Sunil Rajput Exam No: 71720728F
Ashish Kumar Singh Exam No: 71324943K
Ashish Yadav Exam No: 71741665J
Mayank Patil Exam No: 71550097L
Under The Guidance of
Prof. Mangesh Ghonge
DEPARTMENT OF COMPUTER ENGINEERING
SANDIP INSTITUTE OF TECHNOLOGY AND RESEARCH
CENTRE
MAHIRAVANI, TRIMBAK ROAD, NASHIK 422213
SANDIP INSTITUTE OF TECHNOLOGY AND RESEARCH CENTRE
DEPARTMENT OF COMPUTER ENGINEERING
CERTIFICATE
This is to certify that the Project Entitled
OPTIMIZATION OF GENETIC ALGORITHM USING IRIS
FLOWER DATASET
Submitted by
Sunil Rajput Exam No: 71720728F
Ashish Kumar Singh Exam No: 71324943K
Ashish Yadav Exam No: 71741665J
Mayank Patil Exam No: 71550097L
is a bonafide work carried out by Students under the supervision of Prof. Mangesh
Ghonge and it is submitted towards the partial fulfillment of the requirement of
Bachelor of Engineering (Computer Engineering) Project.
Prof. Mangesh Ghonge Prof. A. D. Potgantwar
Internal Guide H.O.D
Dept. of Computer Engg. Dept. of Computer Engg.
SITRC, Department of Computer Engineering 2019-20 I
Abstract
Machine learning is at the core of Artificial Intelligence (AI), and pattern recognition
is another important branch of AI. This report introduces the concept of machine
learning and machine learning algorithms, including K-means, a typical and simple
machine learning algorithm. A case study on Iris classification shows how K-means
works in pattern recognition.
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced
by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of
multiple measurements in taxonomic problems" as an example of linear discriminant
analysis. It is also called Anderson's Iris data set because Edgar Anderson collected
the data to quantify the morphologic variation of Iris flowers of three related species.
Two of the three species were collected in the Gaspé Peninsula, all from the same
pasture, picked on the same day and measured at the same time by the same person
with the same apparatus. The data set consists of 50 samples from each of three
species of Iris: 1) Iris Setosa 2) Iris Virginica 3) Iris Versicolor. Four features were
measured from each sample: 1) Sepal Length 2) Sepal Width 3) Petal Length
4) Petal Width. All four are measured in centimeters. Based on the combination of
these four features, the species can be predicted. The aim of the case study is to
design and implement a pattern recognition system for the Iris flower based on
machine learning. This project shows the workflow of pattern recognition and how to
use a machine learning approach to achieve this goal. The data set was obtained from
an open-source machine learning website. The programming language used in this
project is Python.
Keywords : Genetic Algorithm Optimization, Iris Dataset, Machine Learning,
Python.
Acknowledgments
It gives us great pleasure to present the mini project report on OPTIMIZATION
OF GENETIC ALGORITHM USING IRIS FLOWER DATASET.
We would like to take this opportunity to thank our internal guide Prof. Mangesh
Ghonge for giving us all the help and guidance we needed. We are really grateful for
this kind support; the suggestions we received were very valuable.
We are also grateful to Prof. A. D. Potgantwar, Head of the Computer Engineering
Department, Sandip Institute of Technology and Research Centre, for his
indispensable support and suggestions.
In the end, our special thanks go to Prof. Gokul Patil for providing various resources
for our project, such as a laboratory with all the needed software platforms and a
continuous Internet connection.
Sunil Rajput
Ashish Kumar Singh
Ashish Yadav
Mayank Patil
(B.E. Computer Engg.)
INDEX
1 Synopsis
1.1 Project Title
1.2 Project Option
1.3 Internal Guide
1.4 Sponsorship and External Guide
1.5 Technical Keywords (As per ACM Keywords)
1.6 Problem Statement
1.7 Abstract
1.8 Goals and Objective
1.9 Relevant mathematics associated with the Project
1.10 Names of Conferences / Journals where papers can be published
1.11 Review of Conference/Journal Papers supporting Project idea
1.12 Plan of Project Execution
2 Technical Keywords
2.1 Area of Project
2.2 Technical Keywords
3 Introduction
3.1 Project Idea
3.2 Motivation of the Project
3.3 Literature Survey
4 Problem Definition and Scope
4.1 Problem Statement
4.1.1 Goals and objectives
4.1.2 Major Constraints
4.2 Methodologies
4.3 Outcome
4.4 Applications
4.5 Hardware Resources Required
4.6 Software Resources Required
5 Methodology
5.1 Methodology
5.1.1 Iris Dataset & Algorithms
5.1.2 Implementation/Results
5.2 Team Organization
5.2.1 Team structure
6 Summary & Conclusion
7 References
CHAPTER 1
SYNOPSIS
1.1 PROJECT TITLE
OPTIMIZATION OF GENETIC ALGORITHM USING IRIS FLOWER DATASET
1.2 PROJECT OPTION
Mini project
1.3 INTERNAL GUIDE
Prof. Mangesh Ghonge
1.4 SPONSORSHIP AND EXTERNAL GUIDE
SITRC Computer Department
1.5 TECHNICAL KEYWORDS (AS PER ACM KEYWORDS)
Genetic Algorithm Optimization, Iris Dataset, Machine Learning, Python.
1.6 PROBLEM STATEMENT
To apply the Genetic Algorithm for optimization on a dataset obtained from the UCI
ML repository, for example the Iris dataset, using Python.
1.7 ABSTRACT
Machine learning is at the core of Artificial Intelligence (AI), and pattern recognition
is another important branch of AI. This report introduces the concept of machine
learning and machine learning algorithms, including K-means, a typical and simple
machine learning algorithm. A case study on Iris classification shows how K-means
works in pattern recognition.
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced
by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of
multiple measurements in taxonomic problems" as an example of linear discriminant
analysis. It is also called Anderson's Iris data set because Edgar Anderson collected
the data to quantify the morphologic variation of Iris flowers of three related species.
Two of the three species were collected in the Gaspé Peninsula, all from the same
pasture, picked on the same day and measured at the same time by the same person
with the same apparatus. The data set consists of 50 samples from each of three
species of Iris: 1) Iris Setosa 2) Iris Virginica 3) Iris Versicolor. Four features were
measured from each sample: 1) Sepal Length 2) Sepal Width 3) Petal Length
4) Petal Width. All four are measured in centimeters. Based on the combination of
these four features, the species can be predicted.
1.8 GOALS AND OBJECTIVE
The aim of the case study is to design and implement a pattern recognition system
for the Iris flower based on machine learning. This project shows the workflow of
pattern recognition and how to use a machine learning approach to achieve this goal.
The data set was obtained from an open-source machine learning website. The
programming language used in this project is Python.
1.9 REVIEW OF CONFERENCE/JOURNAL PAPERS SUPPORTING
PROJECT IDEA
[1] Li Liu, Murat Kantarcioglu and Bhavani Thuraisingham discussed securing data
using a decision tree algorithm. The data is classified using the perturbed data set,
and this process improves accuracy. It also reduces the costs of communication and
computation compared with cryptographic services. They also provide a direction for
mapping the data mining functions instead of reconstructing the original data, which
provides more privacy at less cost [3].
Ahmad Ashari, Paryudi and Min Tjoa describe the performance of various
classification algorithms for an alternative design in an energy simulation tool. This
shows that there is a possible way of comparing multiple algorithms. In their
comparison of decision tree, naive Bayes and K-Nearest Neighbour algorithms, the
accuracy of the decision tree is better than that of the other algorithms [4].
Sagar S. Nikam presents a comparative study of classification techniques that mainly
focuses on the performance analysis of classification algorithms and their
limitations. It also focuses on classifying data into different classes according to some
constraint. The first approach is the statistical approach, a classical approach that
works on linear discrimination. The second is machine learning, which helps to solve
more complex problems, and the third is neural networks, which draw on diverse
sources ranging from understanding and emulating the human brain to broader
issues of human abilities [6].
Rachna Raghuwanshi describes the performance of the naive Bayes classifier and the
decision tree on the Fire data set to compare their accuracy, where the problem with
cross-validation is avoided [7].
Xhemali, Hinde and Stone focus on the automatic analysis and classification of
attribute data from training course web pages. They choose naive Bayes, decision tree
and neural network algorithms to classify the same data set. According to their
results, naive Bayes is more accurate than the other classification algorithms [8].
Bhaskar N. Patel, Satish G. Prajapati and Dr. Kamaljit I. Lakhtaria describe
classification as the categorization of data into different categories based on some
rules. Classification with a decision tree gives a pictorial view, categorizing is easier,
and accuracy is better than with other classification algorithms [11].
Learning is a very important feature of Artificial Intelligence. Many scientists
have tried to explain learning and give it a proper definition. However, learning is
not easy to cover in a few simple sentences. Computer scientists, sociologists,
logicians and other scientists have discussed it for a long time. Some scientists
think learning is an adaptive skill that lets a system perform a similar task better
the next time (Simon 1987). Others claim that learning is a process of collecting
knowledge (Feigenbaum 1977). Even though there is no agreed definition of
learning, we still need a definition of machine learning. In general, machine
learning aims to find out how computer algorithms can be improved automatically
through experience (Mitchell 1997).
Machine learning has an important position in the field of Artificial Intelligence.
At the beginning of the development of AI, AI systems did not have a thorough
learning ability, so the systems were far from perfect. For instance, a computer
could not adjust itself when it faced problems. Moreover, it could not
automatically collect and discover new knowledge. The inference of the program
needs more induction than deduction. Therefore, a computer could only figure out
already existing truths; it did not have the ability to discover new logical
theories, rules and so on.
1.10 PLAN OF PROJECT EXECUTION
Using Planner or a similar project management tool.
CHAPTER 3
INTRODUCTION
3.1 PROJECT IDEA
Applying Genetic Algorithm Optimization using Iris Dataset in Python.
3.2 MOTIVATION OF THE PROJECT
The aim of the case study is to design and implement a pattern recognition system
for the Iris flower based on machine learning. This project shows the workflow of
pattern recognition and how to use a machine learning approach to achieve this goal.
The data set was obtained from an open-source machine learning website. The
programming language used in this project is Python.
CHAPTER 4
PROBLEM DEFINITION AND SCOPE
4.1 PROBLEM STATEMENT
To apply the Genetic Algorithm for optimization on a dataset obtained from the UCI
ML repository, for example the Iris dataset, using Python.
4.1.1 Goals and objectives
The aim of the case study is to design and implement a pattern recognition system
for the Iris flower based on machine learning. This project shows the workflow of
pattern recognition and how to use a machine learning approach to achieve this goal.
The data set was obtained from an open-source machine learning website. The
programming language used in this project is Python.
4.1.2 Major Constraints
The approach is not suitable for incomplete datasets.
4.4 OUTCOME
These results show the effect that the number of clusters k and the random
initialization have on the clustering result. It is also possible to see the advantages
and disadvantages of the K-means clustering algorithm.
4.5 APPLICATIONS
Software engineering.
Traveling Salesman Problem.
Mobile communications infrastructure optimization.
Electronic circuit design, known as Evolvable hardware.
4.6 HARDWARE RESOURCES REQUIRED
1. Desktop
2. Inbuilt Compiler
3. Anaconda / Simulink tool
4. NumPy
5. Iris dataset
Sr. No.   Parameter   Minimum Requirement   Justification
1         CPU Speed   2 GHz                 Remark Required
2         RAM         3 GB                  Remark Required
Table 4.1: Hardware Requirements
4.7 SOFTWARE RESOURCES REQUIRED
Platform:
1. Operating System: Windows, Ubuntu/Linux
2. Programming Language: Python (with Anaconda, NumPy, etc.)
CHAPTER 5
METHODOLOGY
5.1 THE DESCRIPTION OF MACHINE LEARNING FORMS
A learning method is a complicated topic that takes many different forms. Everyone
has different methods of studying, and so do machines. We can categorize machine
learning systems by different conditions. In general, learning problems fall into two
main categories: supervised learning and unsupervised learning.
Supervised learning is a commonly used machine learning approach that appears in
many different fields of computer science. In supervised learning, the computer
establishes a learning model based on the training data set. Using this model, the
computer can predict or analyze new information. With suitable algorithms, the
computer can find the best result and reduce the error rate by itself. Supervised
learning is mainly used for two kinds of problems: classification and regression.
In supervised learning, each sample that the developer gives to the computer comes
with its classification information (a label). The computer analyzes these samples to
gain learning experience, so that the error rate is reduced when the classifier
recognizes each pattern.
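The supervised workflow described above (learn a model from labelled training samples, then measure the error rate on samples the model has not seen) can be sketched in plain Python. This is only an illustration under stated assumptions: the twelve labelled rows are illustrative values in the style of the Iris data, the helper names split_per_class, fit_centroids and predict are hypothetical, and the nearest-centroid classifier is chosen for brevity; it is not the method used in this report.

```python
from collections import defaultdict

# Illustrative labelled rows in the style of the Iris data (assumption):
# (sepal length, sepal width, petal length, petal width) in cm, species.
SAMPLES = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((4.7, 3.2, 1.3, 0.2), "setosa"),
    ((5.0, 3.6, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "versicolor"),
    ((5.5, 2.3, 4.0, 1.3), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
    ((7.1, 3.0, 5.9, 2.1), "virginica"),
    ((6.3, 2.9, 5.6, 1.8), "virginica"),
]


def split_per_class(samples, n_test=1):
    """Hold out the last n_test samples of each class as the test set."""
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append((features, label))
    train, test = [], []
    for rows in by_class.values():
        train.extend(rows[:-n_test])
        test.extend(rows[-n_test:])
    return train, test


def fit_centroids(train):
    """Learn the model: one mean feature vector (centroid) per class."""
    grouped = defaultdict(list)
    for features, label in train:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))


train, test = split_per_class(SAMPLES)
model = fit_centroids(train)
errors = sum(predict(model, features) != label for features, label in test)
error_rate = errors / len(test)
print(f"train={len(train)} test={len(test)} error rate={error_rate:.2f}")
```

The test samples are kept out of fit_centroids on purpose, so the reported error rate reflects how the model behaves on unseen flowers rather than on the data it was trained on.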
1.1 Iris Flower Species
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by
the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple
measurements in taxonomic problems" as an example of linear discriminant analysis. It
is also called Anderson's Iris data set because Edgar Anderson collected the data to
quantify the morphologic variation of Iris flowers of three related species. Two of the
three species were collected in the Gaspé Peninsula, all from the same pasture, picked on
the same day and measured at the same time by the same person with the same apparatus.
The data set consists of 50 samples from each of three species of Iris: 1) Iris Setosa
2) Iris Virginica 3) Iris Versicolor. Four features were measured from each sample:
1) Sepal Length 2) Sepal Width 3) Petal Length 4) Petal Width. All four are measured
in centimeters. Based on the combination of these four features, the species can be
predicted.
2. IMPLEMENTATION OF ALGORITHMS
2.1 K-Nearest Neighbors Algorithm
The k-Nearest Neighbors algorithm (or kNN for short) is an easy algorithm to understand
and to implement, and a powerful tool to have at your disposal. The implementation will
be specific for classification problems and will be demonstrated using the Iris flowers
classification problem.
5.1.1 What is k-Nearest Neighbors
The model for kNN is the entire training dataset. When a prediction is required for an
unseen data instance, the kNN algorithm searches the training dataset for the k most
similar instances. The prediction attribute of these most similar instances is
summarized and returned as the prediction for the unseen instance. The similarity
measure depends on the type of data. For real-valued data, the Euclidean distance
can be used. For other types of data, such as categorical or binary data, the Hamming
distance can be used. In the case of regression problems, the average of the predicted
attribute may be returned. In the case of classification, the most prevalent class may be
returned.
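The description above maps directly onto a few lines of code: compute the Euclidean distance from the unseen instance to every training row, keep the k closest rows, and return the most prevalent class among them. The sketch below is a minimal illustration, not the implementation used in this report; the training rows are illustrative values in the style of the Iris data, and the names TRAIN, euclidean and knn_predict are hypothetical.

```python
import math
from collections import Counter

# Illustrative training rows (assumption): four measurements in cm and a label.
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((4.7, 3.2, 1.3, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
    ((7.1, 3.0, 5.9, 2.1), "virginica"),
]


def euclidean(a, b):
    """Euclidean distance between two real-valued feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training rows."""
    neighbours = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


print(knn_predict(TRAIN, (5.0, 3.6, 1.4, 0.2)))
print(knn_predict(TRAIN, (6.3, 2.9, 5.6, 1.8)))
```

Note that there is no training step at all: the whole of TRAIN is the model, and all the work happens inside knn_predict when a prediction is requested.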
5.1.2 How does k-Nearest Neighbors Work
The kNN algorithm belongs to the family of instance-based, competitive learning and
lazy learning algorithms. Instance-based algorithms are those algorithms that model the
problem using data instances (or rows) in order to make predictive decisions. The kNN
algorithm is an extreme form of instance-based methods because all training
observations are retained as part of the model.
It is a competitive learning algorithm, because it internally uses competition between
model elements (data instances) in order to make a predictive decision. The objective
similarity measure between data instances causes each data instance to compete to be
the most similar to a given unseen data instance and so contribute to a prediction.
Lazy learning refers to the fact that the algorithm does not build a model until the time
that a prediction is required. It is lazy because it only does work at the last second. This
has the benefit of only including data relevant to the unseen data, called a localized
model. A disadvantage is that it can be computationally expensive to repeat the same or
similar searches over larger training datasets.
Finally, kNN is powerful because it does not assume anything about the data, other than
a distance measure can be calculated consistently between any two instances. As such, it
is called non-parametric or non-linear as it does not assume a functional form.
3. LOGISTIC REGRESSION ALGORITHM
Logistic Regression is a type of regression that predicts the probability of occurrence of
an event by fitting data to a logit function (logistic function). Like many forms of
regression analysis, it makes use of several predictor variables that may be either
numerical or categorical. For instance, the probability that a person has a heart attack
within a specified time period might be predicted from knowledge of the person's age,
sex and body mass index. Logistic regression is widely used in scenarios such as
predicting a customer's propensity to purchase a product or cancel a subscription in
marketing applications, among many others.
3.1 What is Logistic Regression?
Logistic Regression, also known as Logit Regression or Logit Model, is a mathematical
model used in statistics to estimate (guess) the probability of an event occurring having
been given some previous data. Logistic Regression works with binary data, where either
the event happens (1) or the event does not happen (0). So given some feature x it tries
to find out whether some event y happens or not. So y can either be 0 or 1. In the case
where the event happens, y is given the value 1. If the event does not happen, then y is
given the value of 0. For example, if y represents whether a sports team wins a match,
then y will be 1 if they win the match or y will be 0 if they do not. This is known as
Binomial Logistic Regression.
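As a rough illustration of binomial logistic regression, the sketch below fits p(y = 1 | x) = 1 / (1 + e^-(w*x + b)) with plain gradient descent on a single feature. The setup is an assumption made for this example only: x is a petal length in cm, y is 1 for Iris setosa and 0 for any other species, the data values are illustrative, and the function names and hyperparameters (lr, epochs) are hypothetical rather than taken from this report.

```python
import math

# Illustrative petal lengths in cm (assumption); y = 1 for setosa, 0 otherwise.
XS = [1.4, 1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]
YS = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=20000):
    """Fit weight w and bias b by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Estimated probability that the event (y = 1) happens for feature x."""
    return sigmoid(w * x + b)


w, b = fit_logistic(XS, YS)
print(f"P(setosa | petal length 1.4) = {predict_proba(1.4, w, b):.3f}")
print(f"P(setosa | petal length 5.0) = {predict_proba(5.0, w, b):.3f}")
```

A prediction of 0 or 1 is obtained by thresholding the probability, typically at 0.5. The feature is left unscaled here, which is why a large number of epochs is assumed.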
4. Implementation
4.1 Python
Python is a programming language created by Guido van Rossum in 1989. Python is
an interpreted, object-oriented, dynamically typed, high-level programming language
(Python Software Foundation 2013). Its style is simple and clear, and it provides
many powerful classes. Moreover, Python can easily be combined with other
programming languages, such as C or C++.
As a successful programming language, it has its own advantages:
1. Simple & easy to learn: The concepts of the language are as simple as they can
be, and the syntax is easy to understand. That makes it easy for everyone to
learn and use.
2. Open source: Python is completely free, as it is open-source software. Several
open-source scientific computing libraries provide an API for Python. Users can
easily install Python on their own computers and use the standard and extension
libraries.
3. Extensibility: Programmers can write code in C or C++ and run it from
Python.
4.2 SciKit-learn
Scikit-learn is an open-source machine learning library for the Python programming
language. It features various classification, regression, and clustering algorithms and
is designed to interoperate with the Python numerical libraries NumPy and SciPy
(Pedregosa et al. 2011). Scikit-learn contains a Python implementation of the K-means
algorithm, which helps to show how the algorithm can be implemented in a program.
4.3 Numpy, Scipy and Matplotlib
Python has no built-in multi-dimensional numeric array type. NumPy and SciPy
provide this array type and are the essential libraries for analyzing and calculating
data. Both are open-source libraries. NumPy is mainly used for matrix calculation.
SciPy is built on NumPy and is mainly used for scientific computing.
They can be made available in a Python program with two simple commands:
>>> import numpy
>>> import scipy
After that, Python can call the methods of numpy and scipy.
Matplotlib is a well-known plotting library for Python. It provides a series of APIs and
is suitable for making interactive plots. In this case, we use it to find the best result
visually.
4.4 Preparing the Iris flower data set
The Iris flower data set can be found in the UCI Machine Learning Repository (Bache
The Iris flower data set can also be found in the Scikit-learn library. In site-packages,
there is a folder named sklearn. In this folder, there is a datasets subfolder that
contains many kinds of data sets for machine learning study.
The data set can be found in Appendix 1.
In the species column of this table, 0 represents setosa, 1 represents versicolor and
2 represents virginica.
In the process of preparing a training data set and a testing data set, the greatest
problem is how to find the most appropriate way to divide the data set into training
data set and testing data set. In some cases, by using sampling theory and estimation
theory, we can separate the whole data set into training data set and testing data set.
However, sometimes, the method would be changed. The attributes and the property
K-means is an unsupervised learning algorithm and does not learn from labelled
training samples. Therefore, there is no need to separate the data set into a training
data set and a testing data set; the whole data set can simply be used to obtain the
clustering result.
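To make the clustering step concrete, the following is a minimal plain-Python sketch of the standard K-means procedure (Lloyd's algorithm): pick k initial centroids at random, assign every sample to its nearest centroid, recompute each centroid as the mean of its samples, and repeat until nothing changes. It uses the whole unlabelled data set, as described above. The rows are illustrative values in the style of the Iris data, and the names POINTS, sq_dist and kmeans and the parameters seed and max_iter are assumptions for this example; the report itself relies on the Scikit-learn implementation.

```python
import random

# Illustrative unlabelled rows (assumption): four measurements in cm per flower.
POINTS = [
    (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2),
    (5.0, 3.6, 1.4, 0.2), (7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5),
    (6.9, 3.1, 4.9, 1.5), (5.5, 2.3, 4.0, 1.3), (6.3, 3.3, 6.0, 2.5),
    (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1), (6.3, 2.9, 5.6, 1.8),
]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, labels


centroids, labels = kmeans(POINTS, k=3)
print(labels)
```

Changing seed changes the initial centroids, so different runs can end in different clusterings.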
4.5 Machine learning system design
In general, machine learning system design should follow two basic requirements:
the selection and creation of the model, and
the selection and design of the learning algorithm.
In addition, different models can have different learning systems. On the other hand,
the objective function is also different in different learning models. The objective
function can help the machine to establish a learning system. Moreover, the accuracy
and complexity of different algorithms would be the most important factor of the
learning system. If the chosen algorithm is not very adaptive to the learning system,
then the efficiency and result of the learning system would be reduced. The selection
of training data set can have an influence on learning performance and feature
selection.
ILLUSTRATION OF SAMPLE IRIS DATASET
Sample datasets of Iris Setosa
Sample datasets of Iris Versicolor
Sample datasets of Iris Virginica
5. Evaluating results
The result is shown in four images for the clustering results. Figure 9 will be the result
with eight clusters. Figure 10 shows the result with three clusters.
Figure 9. Clustering of Iris dataset with eight clusters
Figure 10. Clustering of Iris dataset with three clusters
As seen in Figures 9 and 10, the data set is separated into eight clusters in Figure 9
and into three clusters in Figure 10, shown in different colors. In Figure 9, most of
the samples stick together and are hard to distinguish clearly; the differences
between samples are small. In this case, the clustering result is not acceptable. In
Figure 10, on the other hand, the clustering result is clearly much better than in
Figure 9. Even though there are still some overlapping parts between green and
purple, the difference between the three clusters is quite clear. This case shows
the importance of choosing the number of clusters for the K-means algorithm.
For real data sets, it is sometimes difficult to know how many clusters should be
used, so it is hard to choose the number of clusters. One method is to use the
ISODATA algorithm, which obtains a reasonable k by merging and splitting
clusters.
Figure 11. Clustering of Iris dataset with bad initialization
Figure 11 shows the clustering result with three clusters but a bad initialization.
We can see that some of the samples change their cluster compared to Figure 10.
With a random initialization, the system obtains different clustering results from
run to run. Therefore, the initialization is very important for a good clustering
result. However, we do not know in advance what a good initialization would be.
For this reason, some machine learning systems use a GA (Genetic Algorithm) to
choose the initialization points.
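The report does not show its genetic algorithm, so the following is only a hypothetical sketch of how a GA could search for good K-means initialization points. Every design choice here is an assumption: a chromosome is a set of k sample indices used as initial centroids, fitness is the sum of squared distances (SSE) from each sample to its nearest chosen centroid (lower is better), the best half of the population survives each generation (selection with elitism), and children are produced by crossover and mutation. The data rows are illustrative values in the style of the Iris data.

```python
import random

# Illustrative unlabelled rows (assumption): four measurements in cm per flower.
POINTS = [
    (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2),
    (5.0, 3.6, 1.4, 0.2), (7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5),
    (6.9, 3.1, 4.9, 1.5), (5.5, 2.3, 4.0, 1.3), (6.3, 3.3, 6.0, 2.5),
    (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1), (6.3, 2.9, 5.6, 1.8),
]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(chrom, points):
    """Fitness: squared distance of every point to its nearest chosen seed."""
    return sum(min(sq_dist(p, points[i]) for i in chrom) for p in points)


def crossover(a, b, k, rng):
    """Child takes k distinct indices from the union of both parents' genes."""
    pool = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(pool, k)))


def mutate(chrom, n_points, rate, rng):
    """With probability rate, replace one gene by an index not yet in use."""
    genes = list(chrom)
    if rng.random() < rate:
        unused = [i for i in range(n_points) if i not in genes]
        genes[rng.randrange(len(genes))] = rng.choice(unused)
    return tuple(sorted(genes))


def ga_init(points, k=3, pop_size=8, generations=20, rate=0.3, seed=0):
    """Evolve index sets; returns (best chromosome, best SSE per generation)."""
    rng = random.Random(seed)
    n = len(points)
    population = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda c: sse(c, points))
        history.append(sse(population[0], points))
        parents = population[: pop_size // 2]  # selection; survivors are kept
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, k, rng), n, rate, rng))
        population = parents + children
    best = min(population, key=lambda c: sse(c, points))
    return best, history


best, history = ga_init(POINTS)
print("initial centroids:", [POINTS[i] for i in best])
print("best SSE, first and last generation:", history[0], history[-1])
```

Because the best individuals are carried over unchanged, the best SSE can never get worse from one generation to the next. The selected rows could then be passed to K-means as its starting centroids instead of a purely random choice.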
Figure 12 below illustrates a standard result for the Iris data set, obtained with
training data sets in supervised learning. There are three clusters, with a good
initialization point. This is the best classification of all shown here: the whole
data set is separated properly and the clusters are clearly distinct. Figure 10
shows the standard result of clustering in unsupervised learning. Compared to
Figure 12, Figure 10 still has some small differences, but it works very well:
almost every sample is in the right place.
Figure 12. Clustering of Iris dataset in ground truth
These results show the effect that the number of clusters k and the random
initialization have on the clustering result. It is also possible to see the
advantages and disadvantages of the K-means clustering algorithm.
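The visual comparison between a clustering and the ground truth can also be expressed as a number. One common choice, used here as an assumption since the report only compares the figures by eye, is cluster purity: the fraction of samples that belong to the majority species of their cluster. The cluster assignments below are hypothetical examples, not results from this project.

```python
from collections import Counter


def purity(cluster_labels, true_labels):
    """Fraction of samples that match the majority species of their cluster."""
    by_cluster = {}
    for cluster, species in zip(cluster_labels, true_labels):
        by_cluster.setdefault(cluster, Counter())[species] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(true_labels)


# Hypothetical ground truth and two hypothetical clusterings of nine samples.
truth = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3
well_separated = [0, 0, 0, 1, 1, 1, 2, 2, 2]
partly_mixed = [0, 0, 0, 1, 1, 2, 1, 2, 2]

print(purity(well_separated, truth))
print(purity(partly_mixed, truth))
```

A purity of 1.0 means every cluster contains a single species; the partly mixed example scores 7/9 because one versicolor and one virginica sample end up in each other's cluster.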
HISTOGRAM:
BOX AND WHISKER PLOTS (give an idea of the distribution of the input
attributes)
SCATTER PLOT (ALL ATTRIBUTES)
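The quantities behind a histogram and a box-and-whisker plot can be computed without a plotting library. The sketch below is an illustration under assumptions: it uses only the petal lengths of a 15-row Iris excerpt, and the bin range of 1 to 7 cm is chosen by hand. In practice a plotting library such as matplotlib would draw these values.

```python
import statistics

# Petal length (cm) for the first five rows of each Iris species.
PETAL_LENGTH = [1.4, 1.4, 1.3, 1.5, 1.4,
                4.7, 4.5, 4.9, 4.0, 4.6,
                6.0, 5.1, 5.9, 5.6, 5.8]


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum:
    the five values a box-and-whisker plot is built from."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def histogram_counts(values, bins, low, high):
    """Count values per equal-width bin on [low, high]; the top edge
    falls into the last bin."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - low) / width), bins - 1)] += 1
    return counts


summary = five_number_summary(PETAL_LENGTH)
counts = histogram_counts(PETAL_LENGTH, bins=6, low=1.0, high=7.0)
print("five-number summary:", summary)
print("histogram counts:", counts)  # [5, 0, 0, 5, 4, 1]
```

The empty bins between 2 cm and 4 cm reflect the gap that separates Iris setosa from the other two species on this attribute.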
5.1 TEAM ORGANIZATION
Team Structure :
Fig 5.4.1 : Team Structure
5.1.1 Team structure
Team Leader: Sunil Rajput
Software Developer: Sunil Rajput
Hardware Developer: Ashish Yadav, Mayank Patil.
Documentation: Ashish Kumar Singh.
CHAPTER 8
SUMMARY AND CONCLUSION
The primary goal of supervised learning is to build a model that generalizes. In
this project we make predictions on unseen data, that is, data not used to train
the model, so the model should accurately predict the
species of future flowers rather than merely reproduce the labels of the data it was
trained on. With the rapid development of technology, AI has been applied in many fields,
and machine learning is the most fundamental approach to achieving AI. This thesis
describes the working principle of machine learning, two different forms of
machine learning, and an application of machine learning. In addition, a case study
of Iris flower recognition introduces the workflow of machine learning in pattern
recognition and describes what pattern recognition means and how
machine learning works within it. The K-means
algorithm, a very simple unsupervised learning
algorithm, is used. Evolutionary algorithms have been around since the early
sixties. They apply the rules of nature: evolution through selection of the fittest
individuals, where each individual represents a candidate solution to the mathematical problem.
Genetic algorithms are generally regarded as among the best and most robust kinds of evolutionary
algorithm. The work also shows how to use scikit-learn or Anaconda 3.0 software
to learn machine learning.
CHAPTER 9
REFERENCES
[1] Abbas MAkbari Z. (2010). "A multilevel evolutionary algorithm for optimizing numerical
functions". IJIEC 2 (2011): 419-430.
[2] Ananya (2017). What is Diabetes. Retrieved online from
https://www.news-medical.net/health/What-is-Diabetes.aspx
[3] Coffin, D.; S., Robert E. (2008). "Linkage Learning in Estimation of Distribution Algorithms".
Linkage in Evolutionary Computation. Springer Berlin Heidelberg: 141-156.
doi:10.1007/978-3-540-85068-7_7.
[4] Eiben, A. E. et al. (1994). "Genetic algorithms with multi-parent recombination". PPSN III:
Proceedings of the International Conference on Evolutionary Computation. The Third
Conference on Parallel Problem Solving from Nature: 78-87. ISBN 3-540-58484-6.
[5] Clustering - K-means - Interactive demo. Available at:
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html. Consulted 22
AUG 2013.
[6] Bache, K. & Lichman, M. 2013. UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and
Computer Science.
[7] Bishop, C. 2006. Pattern Recognition and Machine Learning. New York: Springer, pp. 424-428.
[8] Fisher, R.A. 1936. UCI Machine Learning Repository: Iris Data Set. Available at:
http://archive.ics.uci.edu/ml/datasets/Iris. Consulted 10 AUG 2013.
[9] Mitchell, T. 1997. Machine learning. McGraw Hill.
[10]
[11] dy of Classification Techniques for Fire Data
7 (1), 2016, 78-82.

  • 4. SITRC, Department of Computer Engineering 2019-20 II Acknowledgments It gives us great pleasure in presenting the Mini project report on OPTIMIZATION OF GENETIC ALGORITHM USING IRIS FLOWER DATASET I would like to take this opportunity to thank my internal guide Prof. Mangesh Ghonge for giving me all the help and guidance I needed. I am really grateful to them for their kind support. Their valuable suggestions were very helpful. I am also grateful to Prof. A. D.Potgantwar, Head of Computer Engineering De- partment, CollegeName for his indispensable support, suggestions. In the end our special thanks to Prof. Gokul Patil for providing various resources such as laboratory with all needed software platforms, continuous Internet connec- tion, for Our Project. Sunil Rajput Ashish Kumar Singh Ashish Yadav Mayank Patil (B.E. Computer Engg.)
  • 5. INDEX 1 Synopsis 1 1.1 Project Title.........................................................................................2 1.2 Project Option ....................................................................................2 1.3 Internal Guide .....................................................................................2 1.4 Sponsorship and External Guide........................................................2 1.5 Technical Keywords (As per ACM Keywords)..................................2 1.6 Problem Statement..............................................................................2 1.7 Abstract...............................................................................................2 1.8 Goals and Objective............................................................................3 1.9 Relevant mathematics associated with the Project..............................3 1.10 Names of Conferences / Journals where papers can be published......5 1.11 Review of Conference/Journal Papers supporting Project idea ..........5 1.12 Plan of Project Execution....................................................................6 2 Technical Keywords 7 2.1 Area of Project ....................................................................................8 2.2 Technical Keywords ...........................................................................8 3 Introduction 9 3.1 Project Idea .........................................................................................10 3.2 Motivation of the Project ....................................................................10 3.3 Literature Survey.................................................................................10 4 Problem Definition and scope 14 4.1 Problem Statement..............................................................................15 4.1.1 Goals and objectives...............................................................15 4.1.2 Major 
Constraints...................................................................15 4.2 Methodologiess...................................................................................15 4.3 Outcome..............................................................................................15
  • 6. 4.4 Applications ........................................................................................15 4.5 Hardware Resources Required............................................................16 4.6 Software Resources Required.............................................................16 5 METHODOLY 17 5.1Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . .18 5.1.1 Iris Dataset & Algorithms . . . . . . . . . . . . . . . . . . . . .19 5.1.2 Implementation/Results . . . . . . . . . . . . . . . . . . . . . . .20 5.2Team Organization . . . . . . . . . . . . . . . . . . . . . . . . . .25 5.2.1 Team structure . . . . . . . . . . . . . . . . . . . . . . . .26 6 Summary & Conclusion 29 7 References 31
  • 8. SITRC, Department of Computer Engineering 2019-20 2 1.1 PROJECT TITLE OPTIMIZATION OF GENETIC ALGORITHM USING IRIS FLOWER DATASET 1.2 PROJECT OPTION Mini project 1.3 INTERNAL GUIDE Prof. Mangesh Ghonge 1.4 SPONSORSHIP AND EXTERNAL GUIDE SITRC Computer Department 1.5 TECHNICAL KEYWORDS (AS PER ACM KEYWORDS) Genetic Algorithm Optimization, Iris Dataset, Machine Learning, Python. 1.6 PROBLEM STATEMENT To Apply the Genetic Algorithm for optimization on a dataset obtained from UCI ML repository. For Example: IRIS Dataset using Python. 1.7 ABSTRACT Machine learning is the core of Artificial Intelligence (AI) and pattern recognition is also an important branch of AI. In this thesis, the conception of machine learning and machine learning algorithms are introduced. Moreover, a typical and simple machine learning algorithm called K-means is introduced. A case study about Iris classification is introduced to show how the K-means works in pattern recognition. The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. The use of multiple measurements in taxonomic problems as an example of linear discriminant data set because Edgar Anderson of
  • 9. SITRC, Department of Computer Engineering 2019-20 3 related species. Two of the three species were collected in Gaspe Peninsula all from the same pasture, and picked on the same day and measured at the same time by the same person with same apparatus.The data set consists of 50 samples from each of three species of Iris that is 1) Iris Setosa 2) Iris Virginica 3) Iris Versicolor. Four features were measured from each sample. They are 1) Sepal Length 2) Sepal Width 3) Petal Length4) Petal Width. All these four parameters are measured in Centimeters. Based on the combination of these four features, the species among three can be predicted. 1.8 GOALS AND OBJECTIVE The aim of the case study is to design and implement a system of pattern recognition for the Iris flower based on Machine Learning. This project shows the workflow of pattern recognition and how to use machine learning approach to achieve this goal. The data set was collected from an open source website of machine learning. The programming language used in this project was Python. 1.9 REVIEW OFCONFERENCE/JOURNAL PAPERS SUPPORTING PROJECT IDEA [1] Author Li Liu, Murat Kantarcioglu and Bhavani Thurasingham discussed about the securing of data using decision tree algorithm. . It is classified with the perturbed data set, and this process improves the accuracy of data. It also reduce the costs off communicatio and computation compared to any other cryptographici services They also provide the direction for mapping the data mining functions instead of reconstructing the original data which provide more privacy with less cost [3]. Author Ahmad Ashari, Paryudi, Min tjoa describes about the performance of various classification algorithm for an alternative design in an energy simulation tool. This shows there is possible way of comparing multiple algorithms. As per the comparision of decision tree, naive bayes, K-Nearest Neighbour algorithm the accuracy of decision tree is better than the other algorithms [4].
  • 10. SITRC, Department of Computer Engineering 2019-20 4 Author Sagar S.Nikam has defined the comparitive study on classification techniques which mainly focus in the performance analysis of classification algorithms and its Limitations. Also focus on classifying data into different classes according to some constraint. The first approach is the Statistical approach which is classical approach works on linear discrimination. The second is Machine Learning which helps to solve more complex problems and third approach is Neural Network shows the diverse source ranging from the understanding and emulating the human brain to border issues of human abilities [6]. Author Rachna Raghuwanshi has describe about performance of the Naïve bayes classifier and Decision Tree with the Fire Data Set to compare the accuracy. Where as the problem with Cross Validation is avoided [7]. Author XHEMALI, J.HINDE, G.STONE precises on the automatic analysis and classification of attribute data from training course web pages. They choose Naive bayes, Decision Tree, Neural Network algorithm to classify the best data with same data set. As per the result gained the accuracy of naive bayes is more accurate than any other classification algorithm [8]. Author Bhaskar N.Patel, Satish G. Prajapati, Dr.Kamaljit I. Lakhtaria describes the classification is the categorization of data into different category based on some rules. The classification of data with decision tree is the pictorial view, and categorizing is easier, accuracy is better than othe classification algorithm [11]. Learning is a very important feature of Artificial Intelligence. Many scientists tried to explain and give a proper definition for learning. However, learning is not that easy to cover with few simple sentences. Many computer scientists, sociologists, logicians and other scientists discussed about this for a long time. 
Some scientists think learning is an adaptive skill so that the system can perform the similar task better in the next time(Simon 1987). Others claim that learning is a process of collecting knowledge(Feigenbaum 1977). Even though there is no proper definition for learning skill, we still need to give a definition for machine learning. In general, machine learning aims to find out how the computer algorithms can be improved automatically through experience(Mitchell 1997).
  • 11. SITRC, Department of Computer Engineering 2019-20 5 Machine learning has an important position in the field of Artificial Intelligence. At the beginning of development of Artificial Intelligence(AI), the AI system does not have a thorough learning ability so the whole system is not perfect. For instance, a computer cannot do self-adjustment when it faces problems. Moreover, the computer cannot automatically collect and discover new knowledge. The inference of the program needs more induction than deduction. Therefore, computer only can figure out already existing truths. It does not have the ability to discover a new logical theory, rules and so on. 1.10PLAN OF PROJECT EXECUTION Using planner or alike project management tool.
  • 12. SITRC, Department of Computer Engineering 2019-20 6 CHAPTER 2 INTRODUCTION
  • 13. SITRC, Department of Computer Engineering 2019-20 7 3.1 PROJECT IDEA Applying Genetic Algorithm Optimization using Iris Dataset in Python. 3.2 MOTIVATION OF THE PROJECT The aim of the case study is to design and implement a system of pattern recognition for the Iris flower based on Machine Learning. This project shows the workflow of pattern recognition and how to use machine learning approach to achieve this goal. The data set was collected from an open source website of machine learning. The programming language used in this project was Python. 3.3 LITERATURE SURVEY [1]. Author Li Liu, Murat Kantarcioglu and Bhavani Thurasingham discussed about the securing of data using decision tree algorithm. . It is classified with the perturbed data set, and this process improves the accuracy of data. It also reduce the costs off communicatio and computation compared to any other cryptographici services They also provide the direction for mapping the data mining functions instead of reconstructing the original data which provide more privacy with less cost [3]. Author Ahmad Ashari, Paryudi, Min tjoa describes about the performance of various classification algorithm for an alternative design in an energy simulation tool. This shows there is possible way of comparing multiple algorithms. As per the comparision of decision tree, naive bayes, K-Nearest Neighbour algorithm the accuracy of decision tree is better than the other algorithms [4]. Author Sagar S.Nikam has defined the comparitive study on classification techniques which mainly focus in the performance analysis of classification algorithms and its Limitations. Also focus on classifying data into different classes according to some constraint. The first approach is the Statistical approach which is classical approach works on linear discrimination. The second is Machine Learning which helps to solve more complex problems and third approach is Neural Network
  • 14. SITRC, Department of Computer Engineering 2019-20 8 shows the diverse source ranging from the understanding and emulating the human brain to border issues of human abilities [6]. Author Rachna Raghuwanshi has describe about performance of the Naïve bayes classifier and Decision Tree with the Fire Data Set to compare the accuracy. Where as the problem with Cross Validation is avoided [7]. Author XHEMALI, J.HINDE, G.STONE precises on the automatic analysis and classification of attribute data from training course web pages. They choose Naive bayes, Decision Tree, Neural Network algorithm to classify the best data with same data set. As per the result gained the accuracy of naive bayes is more accurate than any other classification algorithm [8]. Author Bhaskar N.Patel, Satish G. Prajapati, Dr.Kamaljit I. Lakhtaria describes the classification is the categorization of data into different category based on some rules. The classification of data with decision tree is the pictorial view, and categorizing is easier, accuracy is better than othe classification algorithm [11]. Learning is a very important feature of Artificial Intelligence. Many scientists tried to explain and give a proper definition for learning. However, learning is not that easy to cover with few simple sentences. Many computer scientists, sociologists, logicians and other scientists discussed about this for a long time. Some scientists think learning is an adaptive skill so that the system can perform the similar task better in the next time(Simon 1987). Others claim that learning is a process of collecting knowledge(Feigenbaum 1977). Even though there is no proper definition for learning skill, we still need to give a definition for machine learning. In general, machine learning aims to find out how the computer algorithms can be improved automatically through experience(Mitchell 1997).
  • 16. SITRC, Department of Computer Engineering 2019-20 15 4.1 PROBLEM STATEMENT To Apply the Genetic Algorithm for optimization on a dataset obtained from UCI ML repository. For Example: IRIS Dataset using Python. 4.1.1 Goals and objectives The aim of the case study is to design and implement a system of pattern recognition for the Iris flower based on Machine Learning. This project shows the workflow of pattern recognition and how to use machine learning approach to achieve this goal. The data set was collected from an open source website of machine learning. The programming language used in this project was Python. 4.1.2 Major Constraints It is not sustainable for incomplete Datasets. 4.4 OUTCOME These results show the effect that the number of k and the random initialization number have on the clustering result. It is also possible to see the advantages and disadvantages of the K-means clustering algorithm. 4.5APPLICATIONS Software engineering. Traveling Salesman Problem. Mobile communications infrastructure optimization. Electronic circuit design, known as Evolvable hardware.
  • 17. SITRC, Department of Computer Engineering 2019-20 16 4.6 HARDWARE RESOURCES REQUIRED 1. Desktop 2. Inbuilt Compiler 3. Anaconda /Simulink tool 4. Numpy. 5. Iris Dataset. Sr. No. Parameter Minimum Requirement Justification 1 CPU Speed 2 GHz Remark Required 2 RAM 3 GB Remark Required Table 4.1: Hardware Requirements 4.7SOFTWARE RESOURCES REQUIRED Platform : 1. Operating System: Windows,Ubunu/Linux 2. Programming Language: Python, Anaconda ,Numpy, etc.
  • 18. SITRC, Department of Computer Engineering 2019-20 17 CHAPTER 5 METHODLOGY
  • 19. 5.1 THE DESCRIPTION OF MACHINE LEARNING FORMS A learning method is a complicated topic which has many different kinds of forms. Everyone has different methods to study, so does the machine. We can categorize various machine learning systems by different conditions. In general, we can separate learning problems in two main categories: supervised learning and unsupervised learning. Supervised learning is a commonly used machine learning algorithm which appears in many different fields of computer science. In the supervised learning method, the computer can establish a learning model based on the training data set. According to this learning model, a computer can use the algorithm to predict or analyze new information. By using special algorithms, a computer can find the best result and reduce the error rate all by itself. Supervised learning is mainly used for two different patterns: classification and regression. In supervised learning, when a developer gives the computer some samples, each sample is always attached with some classification information. The computer will analyze these samples to get learning experiences so that the error rate would be reduced when a classifier does recognitions for each patterns. 1.1 Iris Flower Species The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It to quantify the morphologic variation of Iris Flower of three related species. Two of the three species were collected in Gaspe Peninsula all from the same pasture, and picked on the same day and measured at the same time by the same person with same apparatus.
  • 20. The data set consists of 50 samples from each of three species of Iris: 1) Iris Setosa 2) Iris Virginica 3) Iris Versicolor. Four features were measured from each sample: 1) Sepal Length 2) Sepal Width 3) Petal Length 4) Petal Width. All four features are measured in centimeters. Based on the combination of these four features, the species of a flower can be predicted.

2. IMPLEMENTATION OF ALGORITHMS
2.1 K-Nearest Neighbors Algorithm
The k-Nearest Neighbors algorithm (kNN for short) is easy to understand and to implement, and it is a powerful tool to have at your disposal. The implementation here is specific to classification problems and is demonstrated on the Iris flower classification problem.

5.1.1 What is k-Nearest Neighbors
The model for kNN is the entire training dataset. When a prediction is required for an unseen data instance, the kNN algorithm searches the training dataset for the k most similar instances. The prediction attribute of those instances is summarized and returned as the prediction for the unseen instance. The similarity measure depends on the type of data. For real-valued data, the Euclidean distance can be used. For other types of data, such as categorical or binary data, the Hamming distance can be used. For regression problems, the average of the predicted attribute may be returned. For classification, the most prevalent class may be returned.

5.1.2 How does k-Nearest Neighbors Work
The kNN algorithm belongs to the family of instance-based, competitive learning and lazy learning algorithms. Instance-based algorithms model the problem using data instances (or rows) in order to make predictive decisions. The kNN algorithm is an extreme form of instance-based method because all training observations are retained as part of the model.
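The search-and-vote procedure described in 5.1.1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the helper names `euclidean_distances_to` and `knn_predict`, the choice k = 5, the 100/50 split and the random seed are all assumptions made for the example. The data comes from the copy of the Iris data set bundled with scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris


def euclidean_distances_to(train_X, query):
    """Euclidean distance from one query row to every training row."""
    return np.sqrt(((train_X - query) ** 2).sum(axis=1))


def knn_predict(train_X, train_y, query, k=5):
    """Return the most prevalent class among the k nearest training rows."""
    distances = euclidean_distances_to(train_X, query)
    nearest = np.argsort(distances)[:k]      # indices of the k closest rows
    votes = np.bincount(train_y[nearest])    # number of votes per class label
    return int(np.argmax(votes))             # ties go to the lowest label


# Hypothetical split for illustration: shuffle, then 100 training / 50 test rows.
iris = load_iris()
rng = np.random.default_rng(0)
order = rng.permutation(len(iris.data))
train_idx, test_idx = order[:100], order[100:]
train_X, train_y = iris.data[train_idx], iris.target[train_idx]
test_X, test_y = iris.data[test_idx], iris.target[test_idx]

predictions = np.array([knn_predict(train_X, train_y, row, k=5) for row in test_X])
accuracy = float((predictions == test_y).mean())
print(f"kNN accuracy on the held-out rows: {accuracy:.2f}")
```

Note that there is no training step: the "model" is simply `train_X` and `train_y`, and all the work happens inside `knn_predict` when a prediction is requested.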
It is a competitive learning algorithm, because it internally uses competition between model elements (data instances) in order to make a predictive decision. The objective similarity measure between data instances causes each data instance to compete to the
  • 21. Lazy learning refers to the fact that the algorithm does not build a model until a prediction is required. It is lazy because it only does work at the last second. This has the benefit of only including data relevant to the unseen data, called a localized model. A disadvantage is that it can be computationally expensive to repeat the same or similar searches over larger training datasets. Finally, kNN is powerful because it assumes nothing about the data, other than that a distance measure can be calculated consistently between any two instances. As such, it is called non-parametric or non-linear, as it does not assume a functional form.

3. LOGISTIC REGRESSION ALGORITHM
Logistic Regression is a type of regression that predicts the probability of occurrence of an event by fitting data to a logit function (logistic function). Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For instance, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. This kind of regression is widely used, for example in marketing applications to predict a customer's propensity to purchase a product or to cancel a subscription.

3.1 What is Logistic Regression?
Logistic Regression, also known as Logit Regression or the Logit Model, is a mathematical model used in statistics to estimate the probability of an event occurring, given some previous data. Logistic Regression works with binary data, where either the event happens (1) or the event does not happen (0). Given some feature x, it tries to find out whether some event y happens or not, so y can be either 0 or 1. If the event happens, y is given the value 1. If the event does not happen, y is given the value 0.
For example, if y represents whether a sports team wins a match, then y is 1 if the team wins and 0 if it does not. This is known as Binomial Logistic Regression.
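The binomial case can be illustrated on the Iris data by defining a binary event. The sketch below assumes, purely for illustration, that y = 1 means "the flower is Iris setosa" and y = 0 means "it is one of the other two species"; that choice, the `sigmoid` helper name and the `max_iter` value are assumptions, not part of the original project. It uses scikit-learn's `LogisticRegression` to fit the weights and then applies the logistic function by hand to show where the probability comes from.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    """Logistic function: maps any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


iris = load_iris()
X = iris.data
# Assumed binary event: y = 1 if the flower is Iris setosa, otherwise y = 0.
y = (iris.target == 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# The linear score w.x + b is passed through the logistic function
# to obtain the estimated probability that y = 1.
scores = X @ model.coef_[0] + model.intercept_[0]
probabilities = sigmoid(scores)
predicted = (probabilities >= 0.5).astype(int)
accuracy = float((predicted == y).mean())

print(f"P(setosa) for the first flower: {probabilities[0]:.3f}")
print(f"training accuracy: {accuracy:.2f}")
```

The 0.5 threshold used to turn a probability into a 0/1 decision is the conventional default; a different threshold can be chosen when the two kinds of error have different costs.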
  • 22. 4. Implementation
4.1 Python
Python is a programming language whose development was started by Guido van Rossum in 1989 and which was first released in 1991. Python is an interpreted, object-oriented, dynamically typed, high-level programming language (Python Software Foundation 2013). Its style is simple and clear, and it ships with a powerful set of classes. Moreover, Python can easily be combined with other programming languages, such as C or C++. As a successful programming language, it has its own advantages:
1. Simple and easy to learn: The concepts of the language are kept as simple as possible, and the syntax is easy to understand. That makes it easy for everyone to learn and use.
2. Open source: Python is completely free, as it is open source software. Several open source scientific computing libraries provide an API for Python. Users can easily install Python on their own computer and use the standard and extension libraries.
3. Extensibility: Programmers can write code in C or C++ and call it from Python.

4.2 SciKit-learn
Scikit-learn is an open source machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms and is designed to interoperate with the Python numerical libraries NumPy and SciPy (Pedregosa et al. 2011). Scikit-learn contains a Python implementation of the K-means algorithm, which helps in working out how to implement this algorithm in a program.
  • 23. 4.3 Numpy, Scipy and Matplotlib
The core Python language has no built-in multidimensional array type. To work with arrays in Python, numpy and scipy are the essential libraries for analyzing and calculating data. Both are open source libraries. Numpy is mainly used for matrix calculation. Scipy is built on numpy and is mainly used for scientific computing. They are made available in a Python session with two simple commands:
>>> import numpy
>>> import scipy
After that, Python code can call the functions that numpy and scipy provide. Matplotlib is a well-known plotting library for Python. It provides a range of APIs and is suitable for interactive plotting. In this case, we use it to find the best result visually.

4.4 Preparing the Iris flower data set
The Iris flower data set can be found in the UCI Machine Learning Repository (Bache & Lichman 2013). It can also be found in the Scikit-learn library. In site-packages, there is a folder named sklearn. In this folder, there is a datasets subfolder that contains many kinds of data sets for machine learning study. The data set can be found in Appendix 1. In the species column of this table, 0 represents setosa, 1 represents versicolor, and 2 represents virginica.

In preparing a training data set and a testing data set, the greatest problem is finding the most appropriate way to divide the data. In some cases, sampling theory and estimation theory can be used to separate the whole data set into a training data set and a testing data set. However, sometimes, the method would be changed. The attributes and the property
  • 24. K-means is an unsupervised learning algorithm, so it does not learn from a labeled training data set. Therefore, there is no need to separate the dataset into a training data set and a testing data set. The whole dataset can simply be used to obtain the clustering result.

4.5 Machine learning system design
In general, machine learning system design has to meet two basic requirements: selecting and creating the model, and selecting and designing the learning algorithm. Different models can have different learning systems, and the objective function also differs between learning models. The objective function helps the machine to establish a learning system. Moreover, the accuracy and complexity of the chosen algorithm are the most important factors in the learning system. If the chosen algorithm is not well suited to the learning system, the efficiency and the results of the learning system are reduced. The selection of the training data set can influence both learning performance and feature selection.

ILLUSTRATION OF SAMPLE IRIS DATASET
[Figure: Sample datasets of Iris Setosa]
[Figure: Sample datasets of Iris Versicolor]
[Figure: Sample datasets of Iris Virginica]
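As a rough illustration of the sample data, the snippet below loads the copy of the Iris data set that ships with scikit-learn and prints the first few rows of each species. The variable names and the choice to show three rows per species are assumptions made for this sketch; the label coding (0 = setosa, 1 = versicolor, 2 = virginica) is the one described in section 4.4.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # 150 x 4 measurements in cm, labels 0/1/2

print("shape:", X.shape)
print("features:", iris.feature_names)

samples_per_species = np.bincount(y)
for label, name in enumerate(iris.target_names):
    first_rows = X[y == label][:3]   # three example rows for this species
    print(f"{label} = {name} ({samples_per_species[label]} samples)")
    print(first_rows)
```

Because K-means ignores the labels, only `X` would be passed to the clustering step; `y` is kept aside for comparing the clusters with the true species afterwards.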
  • 25. 5. Evaluating results
The clustering results are shown in four figures (Figures 9 to 12). Figure 9 shows the result with eight clusters, and Figure 10 shows the result with three clusters.
Figure 9. Clustering of Iris dataset with eight clusters
Figure 10. Clustering of Iris dataset with three clusters
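Clusterings like the two in Figures 9 and 10 could be produced with scikit-learn's `KMeans` along the following lines. This is a sketch under assumptions, not the code behind the figures: the `n_init=10` restarts and `random_state=0` are chosen here only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data   # labels are not used: K-means is unsupervised

results = {}
for k in (8, 3):
    # n_init=10 and random_state=0 are assumptions chosen for repeatability.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = model
    sizes = np.bincount(model.labels_, minlength=k)
    print(f"k={k}: cluster sizes {sizes.tolist()}, "
          f"within-cluster SSE {model.inertia_:.1f}")
```

The within-cluster sum of squared errors (`inertia_`) always tends to fall as k grows, so the lower value for eight clusters does not by itself mean a better clustering. That is why the figures, and knowledge of the three real species, are needed to judge the result.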
  • 26. As seen in Figures 9 and 10, the whole dataset is separated into eight clusters in Figure 9 and into three clusters in Figure 10, each shown in a different color. In Figure 9, most of the samples stick together and it is hard to distinguish the clusters clearly. The differences between neighboring clusters are small, so this clustering result is not acceptable. In Figure 10, on the other hand, the clustering result is clearly much better than in Figure 9. Although the green and purple clusters still overlap in places, the difference between the three clusters is quite clear.

This case shows the importance of choosing the number of clusters for the K-means algorithm. For real datasets, it is sometimes difficult to know how many clusters should be used, so choosing the number of clusters is hard. One method is to use the ISODATA algorithm, which merges and splits clusters to arrive at a reasonable value of k.
Figure 11. Clustering of Iris dataset with bad initialization
  • 27. Figure 11 shows the clustering result with three clusters but a bad initialization. Some of the samples change their cluster compared with Figure 10. With different random initial points, the system obtains different clustering results, so the choice of initial points is very important for a good clustering result. However, we do not know in advance what a good initialization would be. For this reason, some machine learning systems use a GA (Genetic Algorithm) to choose the initialization points.

Figure 12 below shows the ground-truth classification of the Iris data set, that is, the labels that a supervised learner would be trained on. There are three classes, and this is the best classification of all those shown here: the whole dataset is separated properly and the classes are clearly distinct. Figure 10 shows the standard result of clustering in unsupervised learning. Compared with Figure 12, Figure 10 still has some small differences, but it works very well: almost every sample falls in the right group.
Figure 12. Clustering of Iris dataset in ground truth

These results show the effect that the value of k and the random initialization have on the clustering result. They also make it possible to see the advantages and disadvantages of the K-means clustering algorithm.
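The sensitivity to the starting points can be checked numerically. The sketch below runs K-means ten times, each time from a single random initialization, and compares every result with the ground-truth species using the adjusted Rand index, which ignores how the clusters happen to be numbered. The seeds 0 to 9 and the variable names are assumptions for illustration; `init="random"`, `n_init=1` and `adjusted_rand_score` are standard scikit-learn options.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X, truth = iris.data, iris.target

runs = []
for seed in range(10):
    # init="random" with n_init=1: one run from randomly chosen starting centroids.
    model = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    agreement = adjusted_rand_score(truth, model.labels_)
    runs.append((seed, float(model.inertia_), float(agreement)))
    print(f"seed={seed}: SSE={model.inertia_:.2f}, "
          f"agreement with ground truth={agreement:.2f}")

# Keep the run with the lowest within-cluster SSE, as a multi-start K-means would.
best_seed, best_sse, best_agreement = min(runs, key=lambda run: run[1])
print(f"best run: seed={best_seed}, SSE={best_sse:.2f}")
```

Repeating the run from several starting points and keeping the lowest-SSE result is the simplest remedy for a bad initialization; scikit-learn's `n_init` parameter and its default `k-means++` initialization serve the same purpose.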
  • 28. HISTOGRAM; BOX AND WHISKER PLOTS (give an idea of the distribution of the input attributes)
  • 29. SCATTER PLOT (ALL ATTRIBUTES)
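Plots of this kind (histograms, box and whisker plots, and a scatter plot of all attributes) can be drawn with Matplotlib. The following is a minimal sketch rather than the code used for the slides: the function names, figure sizes, bin count and the off-screen `Agg` backend are assumptions made so that the example runs anywhere.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the script also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
names = iris.feature_names


def plot_distributions():
    """Top row: histogram of each attribute. Bottom row: box plot per species."""
    fig, axes = plt.subplots(2, 4, figsize=(16, 7))
    for j, name in enumerate(names):
        axes[0, j].hist(X[:, j], bins=15)
        axes[0, j].set_title(name)
        axes[1, j].boxplot([X[y == label, j] for label in range(3)])
        axes[1, j].set_xticklabels(iris.target_names)
    fig.tight_layout()
    return fig


def plot_scatter_matrix():
    """4 x 4 grid: attribute j on the x axis against attribute i on the y axis."""
    fig, axes = plt.subplots(4, 4, figsize=(12, 12))
    for i in range(4):
        for j in range(4):
            axes[i, j].scatter(X[:, j], X[:, i], c=y, s=8)  # colour = species
            if i == 3:
                axes[i, j].set_xlabel(names[j])
            if j == 0:
                axes[i, j].set_ylabel(names[i])
    fig.tight_layout()
    return fig


distribution_fig = plot_distributions()
scatter_fig = plot_scatter_matrix()
```

Calling, for example, `distribution_fig.savefig("iris_distributions.png")` writes the figure to a file; the file name is arbitrary.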
  • 30. 5.1 TEAM ORGANIZATION
Team Structure:
Fig 5.4.1: Team Structure
5.1.1 Team structure
Team Leader: Sunil Rajput
Software Developer: Sunil Rajput
Hardware Developer: Ashish Yadav, Mayank Patil
Documentation: Ashish Kumar Singh
  • 31. CHAPTER 8 SUMMARY AND CONCLUSION
  • 32. The primary goal of supervised learning is to build a model that generalizes. In this project, predictions are made on unseen data, that is, data not used to train the model. The model should therefore accurately predict the species of future flowers, rather than merely reproduce the labels of the data it was trained on.

With the rapid development of technology, AI has been applied in many fields. Machine learning is the most fundamental approach to achieving AI. This thesis describes the working principle of machine learning, two different forms of machine learning, and an application of machine learning. In addition, a case study of Iris flower recognition is presented to introduce the workflow of machine learning in pattern recognition. In this case study, the meaning of pattern recognition and the way machine learning works in pattern recognition are described. The K-means algorithm, a very simple machine learning algorithm from the unsupervised learning family, is used.

Evolutionary algorithms have been around since the early sixties. They apply the rules of nature: evolution through selection of the fittest individuals, where the individuals represent candidate solutions to the mathematical problem. Genetic algorithms are among the most widely used and robust kinds of evolutionary algorithm. The work also shows how to use SciKit-learn or the Anaconda 3.0 software to learn machine learning.
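To make the idea of "selection of the fittest individuals" concrete, the sketch below applies a very small genetic algorithm to the problem raised in the discussion of Figure 11: choosing initial points for K-means. Nothing here is taken from the project's code. The function names `sse` and `ga_initial_centroids`, the encoding (an individual is a set of three row indices into the data), the population size, the number of generations, the mutation rate and the seed are all assumptions made for illustration. Fitness is the within-cluster sum of squared errors (lower is fitter).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris


def sse(X, centroids):
    """Sum of squared distances from each sample to its nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


def ga_initial_centroids(X, k=3, pop_size=20, generations=30,
                         mutation_rate=0.2, seed=0):
    """Hypothetical GA: evolve sets of k data points to use as starting centroids."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Each individual is k row indices into X (its candidate centroids).
    population = np.array([rng.choice(n, size=k, replace=False)
                           for _ in range(pop_size)])
    history = []  # best (lowest) SSE seen in each generation
    for _ in range(generations):
        costs = np.array([sse(X, X[ind]) for ind in population])
        order = np.argsort(costs)
        population = population[order]
        history.append(float(costs[order[0]]))
        # Selection: the fitter half survives unchanged (elitism).
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.choice(len(parents), size=2, replace=False)]
            # Crossover: each centroid slot is inherited from one parent.
            child = np.where(rng.random(k) < 0.5, a, b)
            # Mutation: occasionally replace one centroid by a random sample.
            if rng.random() < mutation_rate:
                child[rng.integers(k)] = rng.integers(n)
            children.append(child)
        population = np.vstack([parents, np.array(children)])
    costs = np.array([sse(X, X[ind]) for ind in population])
    history.append(float(costs.min()))
    return X[population[np.argmin(costs)]], history


X = load_iris().data
start, history = ga_initial_centroids(X)
# Hand the evolved starting points to K-means as explicit initial centroids.
model = KMeans(n_clusters=3, init=start, n_init=1).fit(X)
print(f"SSE of GA starting points: {history[-1]:.2f}")
print(f"SSE after K-means refinement: {model.inertia_:.2f}")
```

Because the fitter half of each generation is carried over unchanged, the best SSE in `history` can never get worse from one generation to the next. Whether this is worth the extra computation compared with simply restarting K-means several times is a design question the sketch does not settle.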
  • 34. [1] Abbas MAkbari Z. (2010). "A multilevel evolutionary algorithm for optimizing numerical functions". IJIEC 2 (2011): 419-430.
[2] Ananya (2017). What is Diabetes. Retrieved online from https://www.news-medical.net/health/What-is-Diabetes.aspx
[3] Coffin, D.; S., Robert E. (2008). "Linkage Learning in Estimation of Distribution Algorithms". Linkage in Evolutionary Computation. Springer Berlin Heidelberg: 141-156. doi:10.1007/978-3-540-85068-7_7.
[4] Eiben, A. E. et al. (1994). "Genetic algorithms with multi-parent recombination". PPSN III: Proceedings of the International Conference on Evolutionary Computation. The Third Conference on Parallel Problem Solving from Nature: 78-87. ISBN 3-540-58484-6.
[5] Clustering - K-means - Interactive demo. Available at: http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html. Consulted 22 AUG 2013.
[6] Bache, K. & Lichman, M. 2013. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[7] Bishop, C. 2006. Pattern Recognition and Machine Learning. New York: Springer, pp. 424-428.
[8] Fisher, R.A. 1936. UCI Machine Learning Repository: Iris Data Set. Available at: http://archive.ics.uci.edu/ml/datasets/Iris. Consulted 10 AUG 2013.
[9] Mitchell, T. 1997. Machine learning. McGraw Hill.
[10]
[11] dy of Classification Techniques for Fire Data 7 (1), 2016, 78-82.