Classification of Breast Cancer Dataset Using Decision Tree Induction
Sunil Nair, Abel Gebreyesus
Masters of Health Informatics, Dalhousie University
HINF6210 Project Presentation, November 25, 2008

Agenda
  • Objective
  • Dataset
  • Approach
  • Classification Methods
  • Decision Tree Problems
  • Future direction

Introduction
  • Breast cancer prognosis
  • Breast cancer incidence is high
  • Improvement in diagnostic methods
  • Early diagnosis and treatment
  • But recurrence is high
  • Good prognosis is important

Objective
  • Significance of project
  • Previous work done using this dataset
  • Most previous work indicated room for improvement in classifier accuracy

Breast Cancer Dataset
  • Number of instances: 699
  • Number of attributes: 10, plus the class attribute
  • Class distribution: Benign (2): 458 (65.5%); Malignant (4): 241 (34.5%)
  • Missing values: 16
  • Source: Wisconsin Breast Cancer Database (1991), University of Wisconsin Hospitals, Dr. William H. Wolberg

Attributes
  • Attributes indicate cellular characteristics
  • Variables are continuous, ordinal with 10 levels

| # | Attribute | Domain |
|---|---|---|
| 1 | Sample code number | id number |
| 2 | Clump Thickness | 1-10 |
| 3 | Uniformity of Cell Size | 1-10 |
| 4 | Uniformity of Cell Shape | 1-10 |
| 5 | Marginal Adhesion | 1-10 |
| 6 | Single Epithelial Cell Size | 1-10 |
| 7 | Bare Nuclei | 1-10 |
| 8 | Bland Chromatin | 1-10 |
| 9 | Normal Nucleoli | 1-10 |
| 10 | Mitoses | 1-10 |
| 11 | Class | Benign (2), Malignant (4) |

Attributes / class distribution (figure)
  • The dataset is unbalanced

Our Approach
  • Data pre-processing
  • Comparison between classification techniques
  • Decision tree induction
  • Attribute selection
  • J48
  • Evaluation

Data Pre-processing
  • Filter out the ID column
  • Handle missing values
  • Tool: WEKA

Data Pre-processing: Missing Data
Two options to manage missing data:
  • WEKA "ReplaceMissingValues" filter (weka.filters.unsupervised.attribute.ReplaceMissingValues): missing values of nominal and numeric attributes are replaced with the mode and the mean, respectively
  • Remove (delete) the tuples with missing values
Notes:
  • All 16 missing values are in the Bare Nuclei attribute
  • Outliers

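The two options for missing data can be sketched in plain Python. This is a minimal illustration, not WEKA's implementation. The four-row sample, the use of `None` as the missing-value marker, and both helper names are assumptions made for the example. Only mean replacement for a numeric column is shown.

```python
from statistics import mean

MISSING = None  # assumed marker for a missing value


def replace_missing_with_mean(rows, col):
    """Return a copy of rows where missing values in column `col` get the column mean."""
    present = [r[col] for r in rows if r[col] is not MISSING]
    fill = mean(present)
    return [
        [fill if (i == col and v is MISSING) else v for i, v in enumerate(r)]
        for r in rows
    ]


def remove_rows_with_missing(rows):
    """Return only the rows that have no missing value in any column."""
    return [r for r in rows if all(v is not MISSING for v in r)]


# Hypothetical mini-sample: [clump_thickness, bare_nuclei, class]
rows = [
    [5, 1, 2],
    [8, 10, 4],
    [3, None, 2],
    [6, 4, 4],
]

replaced = replace_missing_with_mean(rows, col=1)  # keeps all 4 rows
removed = remove_rows_with_missing(rows)           # keeps the 3 complete rows
```

Replacement keeps every instance but injects an estimated value. Removal keeps only observed values at the price of a smaller dataset.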
Comparison Chart: Handling Missing Values

Confusion matrix (test split):

| Class | B | M | Total |
|---|---|---|---|
| B | 160 | 7 | 167 |
| M | 3 | 63 | 66 |
| Total | 163 | 70 | 233 |

  • Total correctly classified instances in the test split: 223
  • Accuracy rate: 223 / 233 = 95.71% (shown on the slide as 95.78%)
  • How many predictions are correct by chance? The Expected Accuracy Rate is the Kappa statistic, which measures the agreement between predicted and actual categorization of the data while correcting for agreement that occurs by chance.

Performance evaluation:

| Dataset | # Rules | MAE | Act. Acc. Rate | Exp. Acc. Rate |
|---|---|---|---|---|
| Complete | 14 | 8% | 94% | 87% |
| Missing Removed | 11 | 5% | 96% | 90% |
| Missing Replaced | 14 | 7% | 95% | 89% |

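As a worked check, the accuracy and the Kappa statistic can be recomputed from the confusion matrix on this slide. The sketch below uses the standard Cohen's kappa formula, (observed - chance) / (1 - chance). It assumes rows are actual classes and columns are predicted classes, and the function name is hypothetical.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j] = count of instances with actual class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    chance = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - chance) / (1 - chance)
    return observed, chance, kappa


# Rows: actual B, M. Columns: predicted B, M (orientation assumed).
matrix = [[160, 7], [3, 63]]
observed, chance, kappa = accuracy_and_kappa(matrix)
print(f"accuracy={observed:.3f} chance={chance:.3f} kappa={kappa:.3f}")
```

The kappa value comes out close to 0.90, which appears consistent with the 90% "Exp. Acc. Rate" reported for the dataset with missing values removed.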
Data Pre-processing (figure)
  • Panels: "Missing Value Replaced - Mean-Mode" and "Missing Value Removed - Mean-Mode"

Agenda
  • Objective
  • Dataset
  • Approach
  • Data Pre-Processing
  • Classification Methods
  • Decision Tree Problems
  • Future direction

Classification Methods Comparison

Performance evaluation (test set):

| Classifier | # Total Inst. | MAE | Act. Acc. Rate | Exp. Acc. Rate |
|---|---|---|---|---|
| Naïve Bayes | 233 | 4% | 96% | 90% |
| Neural Network | 233 | 10% | 91% | 79% |
| DT-J48 | 233 | 4% | 97% | 92% |
| Support Vector M | 233 | 3% | 97% | 94% |

Classification using Decision Tree
  • Decision tree: WEKA J48 (C4.5)
  • Divide-and-conquer algorithm
  • Convert the tree to classification rules
  • J48 can handle numeric attributes
  • Attribute selection: information gain

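Information gain, the attribute selection criterion named here, is the drop in class entropy after splitting on an attribute. The sketch below is a simplified illustration on hypothetical toy data: it treats each distinct attribute value as its own category, and it is not WEKA's code.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of labels minus the weighted entropy after splitting on each distinct value."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: one attribute separates the classes, the other does not.
cell_size = [1, 1, 1, 8, 8, 8]
mitoses = [1, 1, 2, 1, 1, 2]
labels = ["B", "B", "B", "M", "M", "M"]

gain_cell_size = information_gain(cell_size, labels)  # perfect split
gain_mitoses = information_gain(mitoses, labels)      # uninformative split
```

Ranking attributes by this score, highest first, is the idea behind the information gain ranking used in the presentation.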
Attributes Selected: Highest Information Gain

WEKA filter: weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S weka.attributeSelection.Ranker

| Rank | Attribute | Information Gain |
|---|---|---|
| 1 | Uniformity of Cell Size | 0.675 |
| 2 | Uniformity of Cell Shape | 0.66 |
| 3 | Bare Nuclei | 0.564 |
| 4 | Bland Chromatin | 0.543 |
| 5 | Single Epithelial Cell Size | 0.505 |
| 6 | Normal Nucleoli | 0.466 |
| 7 | Clump Thickness | 0.459 |
| 8 | Marginal Adhesion | 0.443 |
| 9 | Mitoses | 0.198 |

Performance evaluation:

| Dataset | # Rules | MAE | Act. Acc. Rate | Exp. Acc. Rate |
|---|---|---|---|---|
| Attributes Selected | 11 | 4% | 97% | 92% |
| Missing Removed | 11 | 5% | 96% | 90% |
| Missing Replaced | 14 | 7% | 95% | 89% |

The DT: IG / Attribute Selection Visualization (figure)

Decision Tree: Problems
Concerns:
  • Missing values
  • Pruning: pre-prune or post-prune
  • Estimating error rates
  • Unbalanced dataset
  • Bias in prediction
  • Overfitting (evident on the test set)
  • Underfitting

Confusion Matrix: Performance Evaluation
  • The overall accuracy rate is the number of correct classifications divided by the total number of classifications: Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Error Rate = 1 - Accuracy
  • Accuracy is not a reliable measure for an unbalanced dataset, where classes are unequally represented

| Actual class | Predicted B (2) | Predicted M (4) |
|---|---|---|
| B (2) | TP | FN |
| M (4) | FP | TN |

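A short example shows why accuracy alone can mislead on this dataset. It uses the class counts given earlier (458 benign, 241 malignant) and treats Benign as the positive class, as in the confusion matrix layout. The "always predict Benign" classifier is a hypothetical baseline, not a result from the presentation.

```python
def accuracy(tp, tn, fp, fn):
    """Overall accuracy: correct classifications over all classifications."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical baseline that labels every instance Benign (the positive class):
# all 458 benign cases are TP, all 241 malignant cases are FP.
tp, fn = 458, 0
fp, tn = 241, 0

baseline_accuracy = accuracy(tp, tn, fp, fn)
error_rate = 1 - baseline_accuracy
benign_recall = tp / (tp + fn)     # fraction of benign cases recognised
malignant_recall = tn / (tn + fp)  # fraction of malignant cases recognised

print(f"accuracy={baseline_accuracy:.3f} malignant recall={malignant_recall:.3f}")
```

The baseline reaches about 65.5% accuracy while detecting no malignant case at all. This is the reason per-class measures and chance-corrected measures are reported alongside accuracy.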
Unbalanced Dataset Problem
Solution: Stratified Sampling Method
  • Partitioning of the dataset based on class
  • Random sampling process
  • Create training and test sets with equal-size classes
  • Testing set data independent from the training set
  • Standard verification technique
  • Best error estimate

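The partition-then-sample idea can be sketched as follows. Note the assumption: this version is the common proportion-preserving stratified split, where each class contributes the same fraction to the test set. The equal-class-size variant described on the slide would instead draw the same number of instances from each partition. The function name, the one-third test fraction, and the seed are illustrative choices.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_indices, test_indices), splitting every class by the same fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)  # partition the dataset by class
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)         # random sampling within each partition
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Class counts from the dataset description: 458 benign, 241 malignant.
labels = ["B"] * 458 + ["M"] * 241
train_idx, test_idx = stratified_split(labels, test_fraction=1 / 3)

test_benign = sum(1 for i in test_idx if labels[i] == "B")
test_malignant = len(test_idx) - test_benign
```

Because every index lands in exactly one of the two lists, the test set stays independent of the training set.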
Stratified Sampling Method (figure)

Performance Evaluation

| Dataset | # Instances | # Rules | MAE | Act. Acc. Rate | Exp. Acc. Rate |
|---|---|---|---|---|---|
| Training set | 476 | 13 | 2% | 99% | 97% |
| Testing set | 412 | 13 | 3% | 96% | 92% |

Tree Visualization (figure)

Unbalanced Dataset Problem
Solution: Cost Matrix
  • Cost-sensitive classification
  • Costs are not known
  • A complete financial analysis is needed, i.e. the cost of:
      • using the ML tool
      • gathering training data
      • using the model
      • determining the attributes for test
  • Cross-validation once all costs are known

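To illustrate how a cost matrix changes the evaluation, the sketch below weights each cell of a confusion matrix by a misclassification cost. The presentation states that the real costs are not known, so the cost values here are purely hypothetical, as is the second confusion matrix used for comparison.

```python
def total_cost(confusion, costs):
    """Sum of count * cost over every cell (rows: actual class, columns: predicted class)."""
    return sum(
        confusion[i][j] * costs[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


# Hypothetical costs: a malignant case predicted benign is 5x worse than a false alarm.
# Order of classes: B, M.
costs = [[0, 1],
         [5, 0]]

# Confusion matrix from the comparison chart (orientation assumed: rows = actual).
model_a = [[160, 7], [3, 63]]
# Hypothetical alternative with the same number of errors, distributed differently.
model_b = [[164, 3], [7, 59]]

cost_a = total_cost(model_a, costs)
cost_b = total_cost(model_b, costs)
```

Both matrices contain 10 errors out of 233, so their accuracy is identical, yet their total cost differs under these assumed weights.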
Future Direction
  • The overall accuracy of the classifier needs to be increased
  • Cluster-based stratified sampling: partitioning the original dataset using the k-means algorithm
  • Multiple classifier model: bagging and boosting techniques
  • ROC (Receiver Operating Characteristic): plotting the TP rate (Y-axis) against the FP rate (X-axis)
  • Advantage of ROC: it does not depend on class distribution or error costs

ROC Curve Visualization (figure)
  • Panels: for the Benign class and for the Malignant class
  • Area under the curve (AUC): the larger the area, the better the model

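An ROC curve and its AUC can be computed by sweeping a decision threshold over the classifier's scores. The sketch below is a minimal version using the trapezoidal rule. The six scores and labels are hypothetical, and the function names are illustrative.

```python
def roc_points(scores, labels, positive):
    """Return ROC points (FP rate, TP rate) obtained by lowering the threshold step by step."""
    pos = sum(1 for y in labels if y == positive)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all instances sharing this score so ties produce one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical predicted probabilities of the Malignant class.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
labels = ["M", "M", "B", "M", "B", "B"]

points = roc_points(scores, labels, positive="M")
area = auc(points)
```

An area of 1.0 means every positive instance is ranked above every negative one, while 0.5 corresponds to random ranking.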
Questions / Comments
Thank you!
