Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes, one of which is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
Examples of Classification Tasks
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
- Decision tree based methods
- Rule-based methods
- Memory based reasoning
- Neural networks
- Naïve Bayes and Bayesian belief networks
- Support vector machines
Decision Tree Induction
Many algorithms:
- Hunt’s Algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT
Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
  - Determine when to stop splitting.
How to Specify the Test Condition?
- Depends on attribute type: nominal, ordinal, or continuous.
- Depends on the number of ways to split: 2-way split or multi-way split.
Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as distinct values (e.g., CarType split into Family, Sports, Luxury).
- Binary split: divides values into two subsets; need to find the optimal partitioning (e.g., CarType split into {Sports, Luxury} vs. {Family}).
Splitting Based on Continuous Attributes
Different ways of handling:
- Discretization to form an ordinal categorical attribute:
  - Static: discretize once at the beginning.
  - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
- Binary decision: (A < v) or (A ≥ v); consider all possible splits and find the best cut. Can be more compute intensive.
How to Determine the Best Split
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- Need a measure of node impurity.
Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error
Measure of Impurity: Gini
Gini index at a given node $t$:
$GINI(t) = 1 - \sum_{j} [\,p(j \mid t)\,]^2$
where $p(j \mid t)$ is the relative frequency of class $j$ at node $t$.
- Maximum ($1 - 1/n_c$) when records are equally distributed among all classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.
Splitting Based on Gini
- Used in CART, SLIQ, SPRINT.
- When a node $p$ is split into $k$ partitions (children), the quality of the split is computed as
  $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$
  where $n_i$ = number of records at child $i$ and $n$ = number of records at node $p$. (A sketch of both formulas follows.)
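To make the two Gini formulas concrete, here is a minimal sketch in Python with NumPy; the function names and the toy counts are illustrative, not from the slides.

```python
import numpy as np

def gini(class_counts):
    """Gini impurity of one node: GINI(t) = 1 - sum_j p(j|t)^2."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(children_counts):
    """Weighted Gini of a split: sum_i (n_i / n) * GINI(child i)."""
    n = sum(sum(c) for c in children_counts)
    return sum(sum(c) / n * gini(c) for c in children_counts)

# Toy example: parent with 6 records of class 0 and 6 of class 1,
# split into two children.
print(gini([6, 6]))                      # 0.5  (maximally impure for n_c = 2)
print(gini_split([[5, 1], [1, 5]]))      # ~0.278 (purer children, good split)
```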
Binary Attributes: Computing Gini Index
- Splits the records into two partitions.
- Effect of weighting the partitions: larger and purer partitions are sought.

Categorical Attributes: Computing Gini Index
- For each distinct value, gather counts for each class in the dataset.
- Use the count matrix to make decisions.
- Two-way split (find the best partition of values) or multi-way split.
Continuous Attributes: Computing Gini Index (two-way split)
- Use binary decisions based on one value $v$.
- Several choices for the splitting value: the number of possible splitting values equals the number of distinct values.
- Each splitting value $v$ has a count matrix associated with it: class counts in each of the two partitions, $A < v$ and $A \ge v$.
- Simple method to choose the best $v$: for each $v$, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient! Work is repeated; a common remedy is sketched below.
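A common remedy, assumed here rather than taken from the slides, is to sort the attribute once and sweep the candidate cut points while updating class counts incrementally. A minimal sketch with illustrative names:

```python
import numpy as np

def best_gini_cut(values, labels):
    """Sort once, then sweep candidate cuts between consecutive distinct
    values, updating class counts incrementally (O(n log n) overall)."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    classes = np.unique(y)
    left = {c: 0 for c in classes}                     # counts for A < cut
    right = {c: int(np.sum(y == c)) for c in classes}  # counts for A >= cut
    n, best = len(y), (None, 1.0)

    def gini(counts, total):
        if total == 0:
            return 0.0
        return 1.0 - sum((k / total) ** 2 for k in counts.values())

    for i in range(n - 1):
        left[y[i]] += 1
        right[y[i]] -= 1
        if v[i] == v[i + 1]:                  # not a valid cut point
            continue
        nl = i + 1
        g = nl / n * gini(left, nl) + (n - nl) / n * gini(right, n - nl)
        if g < best[1]:
            best = ((v[i] + v[i + 1]) / 2, g)  # midpoint as the cut
    return best

print(best_gini_cut([60, 70, 75, 85, 90, 95], [0, 0, 0, 1, 1, 1]))
# -> (80.0, 0.0): the cut at 80 separates the two classes perfectly
```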
Measure of Impurity: Entropy
Entropy at a given node $t$:
$Entropy(t) = -\sum_{j} p(j \mid t)\, \log_2 p(j \mid t)$
- Measures the homogeneity of a node.
- Maximum ($\log n_c$) when records are equally distributed among all classes, implying the least information.
- Minimum (0.0) when all records belong to one class, implying the most information.
Splitting Based on Entropy
- When a parent node $p$ is split into $k$ partitions, with $n_i$ records in partition $i$, the information gain of the split is
  $GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$

Measure of Impurity: Classification Error
- Classification error at a node $t$:
  $Error(t) = 1 - \max_j p(j \mid t)$
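To round out the impurity measures, a minimal sketch of entropy, classification error, and information gain; the function names and toy counts are illustrative:

```python
import numpy as np

def entropy(class_counts):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t), with 0 log 0 taken as 0."""
    p = np.asarray(class_counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def misclassification_error(class_counts):
    """Error(t) = 1 - max_j p(j|t)."""
    p = np.asarray(class_counts, dtype=float)
    return 1.0 - p.max() / p.sum()

def information_gain(parent_counts, children_counts):
    """GAIN = Entropy(parent) - sum_i (n_i / n) Entropy(child i)."""
    n = float(sum(parent_counts))
    return entropy(parent_counts) - sum(
        sum(c) / n * entropy(c) for c in children_counts)

print(entropy([6, 6]))                               # 1.0 (= log2 of n_c = 2)
print(misclassification_error([5, 1]))               # ~0.167
print(information_gain([6, 6], [[5, 1], [1, 5]]))    # ~0.35
```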
Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have similar attribute values.
- Early termination (to be discussed later).
Decision Tree Based Classification
Advantages:
- Inexpensive to construct.
- Extremely fast at classifying unknown records.
- Easy to interpret for small-sized trees.
- Accuracy is comparable to other classification techniques for many simple data sets.
Practical Issues of Classification
- Underfitting and overfitting
- Missing values
- Costs of classification
Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary.
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- Need new ways of estimating errors.
How to Address Overfitting: Pre-Pruning
- Stop the algorithm before it becomes a fully-grown tree.
- Typical stopping conditions for a node:
  - Stop if all instances belong to the same class.
  - Stop if all the attribute values are the same.
- More restrictive conditions:
  - Stop if the number of instances is less than some user-specified threshold.
  - Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test), as sketched below.
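As one illustration of the last condition, independence of the class distribution from a candidate feature can be checked with a χ² test on a contingency table. A minimal sketch using SciPy's chi2_contingency; the table values and the 0.05 cutoff are made up for illustration:

```python
from scipy.stats import chi2_contingency

# Rows: values of a candidate feature at this node; columns: class counts.
# Made-up numbers purely for illustration.
table = [[20, 18],
         [22, 19]]

chi2, p_value, dof, expected = chi2_contingency(table)
if p_value > 0.05:
    # Class distribution looks independent of the feature:
    # stop splitting this node (pre-pruning).
    print(f"stop: p = {p_value:.2f}")
```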
How to Address Overfitting: Post-Pruning
- Grow the decision tree to its entirety.
- Trim the nodes of the decision tree in a bottom-up fashion.
- If the generalization error improves after trimming, replace the sub-tree with a leaf node; the class label of the leaf node is determined from the majority class of instances in the sub-tree.
- Can use MDL for post-pruning. (A minimal sketch of the trimming loop follows.)
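A minimal sketch of the bottom-up trimming idea, assuming a toy dict-based tree and a held-out validation set for estimating generalization error. The node format and helper names are illustrative; for simplicity the candidate leaf label is taken from the validation majority rather than the training sub-tree majority the slide describes:

```python
import numpy as np

def predict(node, x):
    """Route one record down the tree; leaves hold a class label."""
    while isinstance(node, dict):             # internal node
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node                                # leaf: class label

def error(node, X, y):
    """Fraction of validation records the (sub)tree misclassifies."""
    preds = np.array([predict(node, x) for x in X])
    return np.mean(preds != y)

def prune(node, X, y):
    """Bottom-up: prune the children first, then try collapsing this node
    into a leaf; keep the collapse only if validation error does not worsen.
    Assumes integer class labels (for np.bincount)."""
    if not isinstance(node, dict):
        return node
    mask = X[:, node["feature"]] < node["threshold"]
    node["left"] = prune(node["left"], X[mask], y[mask])
    node["right"] = prune(node["right"], X[~mask], y[~mask])
    if len(y) == 0:
        return node
    majority = np.bincount(y).argmax()         # candidate leaf label
    if error(majority, X, y) <= error(node, X, y):
        return majority                         # replace sub-tree by a leaf
    return node
```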
Other Issues
- Data fragmentation
- Search strategy
- Expressiveness
- Tree replication
Data Fragmentation
- The number of instances gets smaller as you traverse down the tree.
- The number of instances at the leaf nodes could be too small to make any statistically significant decision.
Search Strategy
- Finding an optimal decision tree is NP-hard.
- The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution.
- Other strategies? Bottom-up; bi-directional.
Expressiveness
- Decision trees provide an expressive representation for learning discrete-valued functions.
- But they do not generalize well to certain types of Boolean functions.
- Not expressive enough for modeling continuous variables, particularly when the test condition involves only a single attribute at a time.
Tree Replication
- The same subtree can appear in multiple branches.
Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?
Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
- Performance is determined using a confusion matrix and, where misclassification costs differ, a cost matrix; a sketch of both follows.
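A minimal sketch of both structures for a binary problem; the labels, predictions, and cost values are made up:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # made-up test labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # made-up predictions

# Confusion matrix: rows = actual class, columns = predicted class.
tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
print(np.array([[tp, fn], [fp, tn]]))

accuracy = (tp + tn) / len(y_true)

# Cost matrix: here a missed positive (fn) is 5x worse than a false alarm.
total_cost = fn * 5 + fp * 1
print(accuracy, total_cost)
```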
Methods for Performance Evaluation
- How to obtain a reliable estimate of performance?
- Performance of a model may depend on factors other than the learning algorithm: class distribution, cost of misclassification, and the size of the training and test sets.
Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing.
- Random subsampling: repeated holdout.
- Cross-validation: partition the data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n. (A k-fold sketch follows.)
- Stratified sampling: oversampling vs. undersampling.
- Bootstrap: sampling with replacement.
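A minimal k-fold cross-validation sketch using scikit-learn's KFold and DecisionTreeClassifier; the synthetic dataset stands in for real training data:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                 # synthetic data, for illustration
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):       # k disjoint partitions
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(scores))                        # average accuracy over k folds
```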
Methods for Model Comparison: ROC
- Developed in the 1950s for signal detection theory to analyze noisy signals; characterizes the trade-off between positive hits and false alarms.
- The ROC curve plots the true positive rate (y-axis) against the false positive rate (x-axis).
- The performance of each classifier is represented as a point on the ROC curve: changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point, as sketched below.
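A sketch of how sweeping the threshold traces out ROC points; the labels and classifier scores below are made up:

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])           # made-up labels
scores = np.array([.9, .8, .7, .6, .55, .5, .4, .2])  # made-up scores

# Each distinct score is a threshold; each threshold yields one ROC point.
for t in sorted(set(scores), reverse=True):
    pred = (scores >= t).astype(int)
    tpr = np.sum((pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum((pred == 1) & (y_true == 0)) / np.sum(y_true == 0)
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```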
Test of Significance
Given two models:
- Model M1: accuracy = 85%, tested on 30 instances.
- Model M2: accuracy = 75%, tested on 5000 instances.
Questions:
- Can we say M1 is better than M2?
- How much confidence can we place on the accuracy of M1 and M2?
- Can the difference in performance be explained as the result of random fluctuations in the test set? (See the sketch below.)
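For intuition, a rough normal-approximation confidence interval for an accuracy measured on N test instances shows why M1's figure is less trustworthy. This simplified interval is an assumption of the sketch, not the exact formula from any particular text:

```python
import math

def accuracy_ci(acc, n, z=1.96):
    """Approximate 95% CI for accuracy: acc +/- z * sqrt(acc(1-acc)/n)."""
    half = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half, acc + half

print(accuracy_ci(0.85, 30))     # wide interval:   ~(0.72, 0.98)
print(accuracy_ci(0.75, 5000))   # narrow interval: ~(0.738, 0.762)
```

The intervals overlap heavily, so the 10-point gap between M1 and M2 could well be a random fluctuation of the small test set.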
Conclusion
Decision tree induction, an algorithm for decision tree induction, model overfitting, and evaluating the performance of a classifier were studied in detail.