Machine Learning
Name: Muhammad Gulraj
Muhammad.gulraj@yahoo.com
Performance analysis of KNN & K-Means using internet
advertisements data
Abstract
Advertisements are one of the main sources of revenue on the internet. In this article we have
analyzed the performance of two different classifiers, K-nearest neighbor (KNN) and K-Means,
using internet advertisements data [1]. The classifiers are used to identify advertisement and
non-advertisement data, and their performance is evaluated based on the results.
K-nearest neighbor:
In machine learning and pattern recognition, K-nearest neighbor (KNN) is a non-parametric
algorithm that can be used for both classification and regression; the form of its output depends
on which of the two it is used for. It is among the simplest algorithms in machine learning.
For classification, KNN assigns an object by a majority vote of its neighbors: the object is
assigned to the class that has the most members among the object's nearest neighbors.
KNN is an instance-based (lazy) learning method. It is a simple but powerful algorithm: no
training phase is required, and new training examples can be added easily. However, it is slow
and expensive at query time, with a computational complexity of O(md) for m training examples
of dimension d. Its run-time performance can be improved significantly by removing redundant
data, computing approximate distances, or pre-sorting the training examples into fast data
structures such as KD-trees.
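As a minimal sketch of the last idea, a library-based KNN can build a KD-tree index over the
training examples (the use of scikit-learn here is an assumption; the article does not name a
library, and X_train, y_train, X_test are hypothetical data):

```python
from sklearn.neighbors import KNeighborsClassifier

# Pre-sort training examples into a fast data structure: the KD-tree
# index is built once at fit time, so each query avoids a brute-force
# scan over all training instances.
clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
clf.fit(X_train, y_train)          # X_train, y_train: hypothetical training data
predictions = clf.predict(X_test)  # X_test: hypothetical query instances
```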
KNN can be used in handwritten character classification, content-based image retrieval,
intrusion detection, and fault detection.
Let’s consider a simple example of an object that has many sea bass and salmon as neighbors,
and assume that K = 3, which means that we consider the 3 nearest neighbors of the object. If
the object has 2 sea bass and 1 salmon among those 3 neighbors (as the original figure
illustrated), the KNN algorithm will classify the object as a sea bass.
KNN Algorithm:
The KNN algorithm is very simple. The training period consists of storing all instances and their
class labels. If feature selection has to be performed, n-fold cross-validation should be used on
the training set. To classify a new instance X given a training set Y, the following steps are
needed.
i. Compute the distance of X from each instance in set Y.
ii. Sort the distances in increasing order and pick the K instances with the smallest distances.
iii. Return the most frequent class among those K nearest neighbors.
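These three steps translate almost directly into code. The following is a minimal NumPy sketch,
assuming numeric feature vectors and Euclidean distance (both assumptions; the article does not
fix a distance measure):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # i. Compute the distance of x from each instance in the training set
    dists = np.linalg.norm(X_train - x, axis=1)
    # ii. Sort the distances in increasing order and keep the k smallest
    nearest = np.argsort(dists)[:k]
    # iii. Return the most frequent class among the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

For the sea bass example above, calling knn_predict with k = 3 on an object whose 3 nearest
neighbors are 2 sea bass and 1 salmon returns the sea bass label.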
K-Means Clustering:
K-Means clustering is a popular method used for vector quantization. It partitions n
observations into K clusters such that each observation belongs to the cluster with the nearest
mean. Finding an exact solution is NP-hard (computationally difficult), but efficient heuristic
algorithms can be used which converge rapidly to a local minimum.
Assume a set of observations x1, x2, x3, ..., xn, where every observation is a d-dimensional
vector. The main aim of K-Means clustering is to partition the n observations into k sets
(clusters) S = {S1, S2, ..., Sk}, where k <= n, so as to minimize the within-cluster sum of
squares, where μi is the mean of the points in Si.
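Written out, the objective implied by this description is:

```latex
\underset{S}{\operatorname{arg\,min}} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2}
```

The algorithm below is the standard heuristic (Lloyd's algorithm) for approximately minimizing
this objective.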
K-Means Algorithm:
The algorithm of K-Means clustering is as follows.
i. Specify K (the number of clusters).
ii. Select K points randomly as the initial cluster centers.
iii. Assign each object/instance to the closest cluster center.
iv. Find the centroid (mean) of every cluster and use it as the new cluster center.
v. Now reassign all the objects/instances to the closest cluster center.
vi. Iterate until the centers of the clusters no longer change.
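A minimal NumPy sketch of these steps (the convergence test and the handling of empty clusters
are implementation choices not specified in the text):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # ii. Select K points randomly as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # iii./v. Assign each instance to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # iv. Use the mean of every cluster as its new center
        # (an empty cluster keeps its old center)
        new_centers = np.array([X[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i]
                                for i in range(k)])
        # vi. Stop iterating once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```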
Training Data:
The data [1] used for training the classifiers is already pre-classified into ad and non-ad
instances, along with some test data.
 Every image inside an <a> anchor tag is a candidate ad; images outside <a> anchor tags are
mostly non-ads. Udest is the URL to which the anchor <a> points, while Uimg is the image's
URL.
 The geometric information of the image is captured by its aspect ratio, height, and width.
If any of the geometric features is missing, it is represented by (?).
 Another feature is the binary 'local' flag, which is 1 when Uimg is served from the page's
own host (e.g., data.host.com/image.jpg) and 0 when it is served from elsewhere (e.g.,
elsewhere.com/picture.gif).
 The alt text used in the anchor tag (its alternative text) can be used as a feature in
the data set.
 Another feature used in the data set is the text inside the <a> anchor tag.
 The whole set of features is built from these geometric, URL, and text attributes,
together with the phrase encodings described below.
 More sets of features are provided by the base URL (Ubase), the destination URL (Udest),
and the image URL (Uimg). Phrases that normally have one word are ignored if they are
members of low-information terminologies such as html, jpeg, gif, www, and http.
 This procedure generates a number of encodings (1558 features per encoding), one for
each combination of K = maximum phrase length and M = minimum phrase count. A sketch of
loading such an encoding follows this list.
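A minimal loading sketch for this data set. The file name ad.data, the '?' missing-value markers,
and the 'ad.'/'nonad.' class labels follow the common UCI distribution of these data and are
assumptions rather than details given in the text:

```python
import pandas as pd

# 1558 feature columns followed by one class-label column
df = pd.read_csv("ad.data", header=None, skipinitialspace=True)

# Geometric features may be missing, marked by '?'; coerce them to NaN
# and impute with the column mean (one simple choice among many).
features = df.iloc[:, :-1].apply(pd.to_numeric, errors="coerce")
X = features.fillna(features.mean()).to_numpy()

# 1 = advertisement, 0 = non-advertisement
y = (df.iloc[:, -1].str.strip() == "ad.").astype(int).to_numpy()
```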
Improving the existing data:
The existing data can be improved by using the words/phrases that appear at the end of the
URLs in the anchor tag. For example:
 <a href="www.example.com/sales.html">
 <img src="www.example.com/mobile.jpg">
The above links have sales.html and mobile.jpg as their last words; if we separate out low-
information words like jpg and html, we can use the words sales and mobile as features.
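A small sketch of this extraction (the set of low-information words below is illustrative, not
exhaustive):

```python
import re

# Illustrative, not exhaustive, set of low-information words
LOW_INFO = {"html", "htm", "jpg", "jpeg", "gif", "www", "http", "https"}

def last_word_features(url):
    # Take the last path segment of the URL, split it into alphabetic
    # tokens, and drop the low-information words.
    last_segment = url.rstrip("/").rsplit("/", 1)[-1]
    tokens = re.split(r"[^a-zA-Z]+", last_segment.lower())
    return [t for t in tokens if t and t not in LOW_INFO]

print(last_word_features("www.example.com/sales.html"))  # ['sales']
print(last_word_features("www.example.com/mobile.jpg"))  # ['mobile']
```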
Learning Rules:
The learning classifier algorithms should have the following properties (a rule-check sketch
follows this list).
 The algorithms should not be over-sensitive to missing information.
 The learning algorithms should be insensitive to irrelevant information and must scale
with the number of features.
 Some URLs or phrases might change over time, in which case the classifier has to be
relearned from scratch; ideally the learning algorithm would be incremental (as in the
case of K-nearest neighbor).
 If the aspect ratio is greater than 4.5 and 'alt' contains 'to' but does not contain
'click+here' and Udest does not contain 'http+www', then the instance is an advertisement (ad).
 If Ubase does not contain 'messier' while Udest contains 'redirect+cgi', then the instance
is an advertisement (ad).
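The last two bullets are concrete learned rules, and checking them on an instance is mechanical.
In the sketch below the dictionary keys (aspect_ratio, alt, udest, ubase) are hypothetical field
names chosen for illustration:

```python
def is_ad(instance):
    # Rule 1: aspect ratio > 4.5, alt contains 'to' but not 'click+here',
    # and Udest does not contain 'http+www'
    rule1 = (instance["aspect_ratio"] > 4.5
             and "to" in instance["alt"]
             and "click+here" not in instance["alt"]
             and "http+www" not in instance["udest"])
    # Rule 2: Ubase does not contain 'messier' while Udest contains 'redirect+cgi'
    rule2 = ("messier" not in instance["ubase"]
             and "redirect+cgi" in instance["udest"])
    return rule1 or rule2
```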
Evaluation:
We conducted several experiments using a cross-validation methodology. The data is randomly
divided into a training set containing 89% of the instances and a test set containing the
remaining data. The results are
cross-validated 12 times after applying the rules mentioned above to the data. The accuracy of
the learned rules is found to be 95.6% for K-nearest neighbor (KNN) and 93.5% for the K-Means
classifier. To further understand their limitations, we measured a learning curve for both
classifiers.
Learning curves for K-NN and K-Means:
To calculate a learning curve, we provide the classifier with 10, 20, 30, ..., 90, and 100
percent of the training data, after which we calculate 11-fold cross-validated accuracy on the
remainder. The resulting curves (shown in the original figure) reach 95.6% for KNN and 93.5%
for the K-Means classifier.
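A minimal sketch of this measurement with scikit-learn (which library to use is an assumption;
X and y are the feature matrix and labels loaded earlier, and the KNN settings are illustrative):

```python
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_learning_curve(X, y,
                       fractions=(0.1, 0.2, 0.3, 0.4, 0.5,
                                  0.6, 0.7, 0.8, 0.9, 1.0)):
    points = []
    for frac in fractions:
        # Provide the classifier with the given fraction of the data
        if frac < 1.0:
            X_part, _, y_part, _ = train_test_split(
                X, y, train_size=frac, random_state=0, stratify=y)
        else:
            X_part, y_part = X, y
        # 11-fold cross-validated accuracy on that fraction
        clf = KNeighborsClassifier(n_neighbors=3)
        acc = cross_val_score(clf, X_part, y_part, cv=11).mean()
        points.append((frac, acc))
    return points
```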
Related Work:
There are various software systems available on the internet that are used for removing
unwanted advertisements from web pages, for example muffin (muffin.doit.org), web-filter
(www.uni-paderborn.de), and ByProxy (www.besiex.org/Byproxy). These systems rely on
different filtering rules. As the internet grows, good learning algorithms and classifiers are
required for these systems to scale.
Future Work:
A system/software can be developed from this work to remove unwanted and annoying
advertisements from web pages while browsing the internet. The data set is quite old (1998)
and could be improved, since internet and web tools are now far more advanced.
References
1. Andrew Ng (2013), Machine Learning (online course), Stanford University,
https://class.coursera.org/ml-004/class.
2. R. O. Duda and P. E. Hart (2001), Pattern Classification, Wiley, New York.
3. http://en.wikipedia.org
4. http://www.mathworks.com/help/stats/hidden-markov-models-hmm.html
5. Tsung-Hsien Chiang (2012), A Ranking-based KNN Approach for Multi-Label Classification,
National Taiwan University.
6. D. T. Pham, S. S. Dimov, and C. D. Nguyen, Selection of K in K-means Clustering,
Manufacturing Engineering Centre, Cardiff University, Cardiff, UK.