Data Mining Methodology
          Kevin Swingler
       University of Stirling
   Lecturer, Computing Science
        kms@cs.stir.ac.uk
What is Data Mining?
• Generally, methods of using large quantities of data
  and appropriate algorithms to allow a computer to
  ‘learn’ to perform a task
• Task oriented:
   – Predict outcomes or forecast the future
   – Classify objects as belonging to one of several categories
   – Separate data into clusters of similar objects
• Most methods produce a model of the data that
  performs the task

Some Examples
• Predicting patterns of drug side-effects
• Spotting credit card or insurance fraud
• Controlling complex machinery
• Predicting the outcome of medical
  interventions
• Predicting the price of stocks and shares or
  exchange rates
• Knowing when a cow is most fertile (really!)
Examples in LIS
• Text Mining
  – Automatically determine what an article is ‘about’
  – Classify attitudes in social media
• Demand Prediction
  – Predicting demand for resources such as new books or
    journals or buildings
• Search and Recommend
  – Analysis of borrowing history to make recommendations
  – Links analysis for citation clustering


Data Sources
• In-house – data you own
  – Borrowing records
  – Search histories
  – Catalogue data
• Bought in
  – Demographic data about customers
  – Demographic data about the locality around a
    library

Methods
• Techniques for data mining are based on
  mathematics and statistics, but are
  implemented in easy-to-use software
  packages
• Methodology matters most in pre-processing
  the data, choosing the techniques, and
  interpreting the results


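CRISP-DM Standard
• CRoss-Industry Standard Process for Data
  Mining
• Six phases: business understanding, data
  understanding, data preparation, modelling,
  evaluation and deployment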




Data Preparation
• Clean the data
  – Remove rows with missing values
  – Remove rows with obvious data entry errors – e.g.
    Age = 200
  – Recode obvious data entry inconsistencies – e.g.
    Gender is usually coded M or F, but occasionally Male
  – Remove rows with minority values
  – Select which variables to use in the model
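
The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration only: the records, the field names and the age cut-off of 120 are assumptions made for the example, not part of any particular package.

```python
# Hypothetical borrower records; field names and values are illustrative.
rows = [
    {"age": 34, "gender": "M"},
    {"age": 200, "gender": "F"},     # obvious data entry error
    {"age": None, "gender": "F"},    # missing value
    {"age": 51, "gender": "Male"},   # inconsistent coding
    {"age": 27, "gender": "F"},
]

# Map every spelling we are prepared to accept onto one code.
GENDER_CODES = {"M": "M", "Male": "M", "F": "F", "Female": "F"}
MAX_PLAUSIBLE_AGE = 120  # assumed cut-off for this example


def clean(records):
    """Drop rows with missing or implausible values and recode gender."""
    cleaned = []
    for row in records:
        if row["age"] is None or row["gender"] is None:
            continue  # remove rows with missing values
        if not 0 <= row["age"] <= MAX_PLAUSIBLE_AGE:
            continue  # remove rows with obvious entry errors
        if row["gender"] not in GENDER_CODES:
            continue  # unknown code: treat it as an entry error
        cleaned.append({"age": row["age"],
                        "gender": GENDER_CODES[row["gender"]]})
    return cleaned


cleaned_rows = clean(rows)
print(cleaned_rows)
```

Three of the five rows survive, and the "Male" row is kept with its gender recoded to "M".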


Data Quantity
• Choose the variables to be used for the model
• Look at the distributions of the chosen variables
• Look at the level of noise in the data
• Look at the degree of linearity in the data
• Decide whether or not there are sufficient
  examples in the data
• Treat unbalanced data


Consider Error Costs
• Imagine a system that classifies input patterns
  into one of several possible categories
• Sometimes it will get things wrong; how often
  depends on the problem:
  – Direct mail targeting – very often
  – Credit risk assessment – quite often
  – Medical reasoning – very infrequently



Error Costs
• An error in one direction can cost more than
  an error in the opposite direction
  – Recommending a blood test based on a false
    positive is better than missing an infection due to
    a false negative
  – Missing a case of insurance fraud is more costly
    than flagging a claim to be double checked
• The balance of examples in each case can be
  manipulated to reflect the cost
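
A small worked example shows why the direction of an error matters. The two cost figures below are assumptions chosen only to illustrate the point; real costs depend on the application.

```python
# Assumed costs in arbitrary units (illustrative only).
COST_FALSE_POSITIVE = 1    # e.g. a claim is double checked unnecessarily
COST_FALSE_NEGATIVE = 20   # e.g. a fraudulent claim is missed


def total_cost(false_positives, false_negatives):
    """Total cost of a model's errors under the assumed unit costs."""
    return (false_positives * COST_FALSE_POSITIVE
            + false_negatives * COST_FALSE_NEGATIVE)


# Two hypothetical models that each make 50 errors in total.
cautious = total_cost(false_positives=45, false_negatives=5)
lenient = total_cost(false_positives=5, false_negatives=45)
print(cautious, lenient)
```

Both models have the same error rate, but under these assumed costs the one that errs towards false positives is far cheaper to run.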

Check Points
• Data quantity and quality: do you have
  sufficient good data for the task?
  – How many variables are there?
  – How complex is the task?
  – Is the data’s distribution appropriate?
     • Outliers
     • Balance
     • Value set size


Distributions
• A frequency distribution is a count of how
  often each value of a variable occurs in a
  data set
• For discrete numbers and categorical values,
  this is simply a count of each value
• For continuous numbers, the count is of how
  many values fall into each of a set of
  sub-ranges
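
Both kinds of count can be sketched with the Python standard library. The sample values, the bin width of 10 and the helper name `bin_counts` are assumptions made for this illustration.

```python
from collections import Counter

# Illustrative data only.
genders = ["M", "F", "F", "M", "F"]
ages = [23.5, 31.0, 38.2, 45.9, 47.1, 62.4]

# Categorical values: simply count each value.
category_counts = Counter(genders)


def bin_counts(values, low, width, n_bins):
    """Count how many values fall into each equal-width sub-range."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - low) // width)
        if 0 <= index < n_bins:   # values outside the range are ignored
            counts[index] += 1
    return counts


# Continuous values: count per sub-range 20-30, 30-40, ..., 60-70.
age_counts = bin_counts(ages, low=20, width=10, n_bins=5)

# A crude text histogram of the age distribution.
for i, count in enumerate(age_counts):
    print(f"{20 + 10 * i}-{30 + 10 * i}: {'#' * count}")
```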

Plotting Distributions
• The easiest way to visualise a distribution is to
  plot it as a histogram




Features of a Distribution to Look For
• Outliers
• Minority values
• Data balance
• Data entry errors




Outliers
• A small number of values that are much larger
  or much smaller than all the others
• Can disrupt the data mining process and give
  misleading results
• You should either remove them or, if they are
  important, collect more data to reflect this
  aspect of the world you are modelling
• Could be data entry errors
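
One common way to flag candidate outliers is the interquartile-range rule. The slides do not prescribe a method, so this is only one possible sketch; the multiplier of 1.5 is a widely used convention, and the ages are made up for the example.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


ages = [21, 25, 28, 30, 33, 35, 38, 41, 44, 200]  # illustrative data
outliers = iqr_outliers(ages)
print(outliers)
```

Flagged values still need a human decision: remove them, correct them, or collect more data around them.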

Minority Values
• Values that only appear infrequently in the data
• Do they appear often enough to contribute to the
  model?
• Might be worth removing them from the data or
  collecting more data where they are represented
• Are they needed in the finished system?
• Could they be the result of data entry errors?
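
Minority values can be listed by counting each value and comparing the count with a minimum. The threshold of 5 and the membership data below are assumptions for illustration; a sensible threshold depends on the size of the data set.

```python
from collections import Counter


def minority_values(values, min_count):
    """Return the values that occur fewer than min_count times."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n < min_count}


# Hypothetical membership-type column with a rare class and a typo.
membership = ["adult"] * 60 + ["child"] * 35 + ["staff"] * 3 + ["adlut"] * 2
rare = minority_values(membership, min_count=5)
print(rare)
```

Here "staff" is a genuine but rare value, while "adlut" is almost certainly a data entry error to be recoded.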



Minority Values
[Bar chart: number of records (axis 0 to 600) for each value of the
gender variable: Male, Female, M, F]

What does this chart tell you about the gender variable in a data set?
What should you do before modelling or mining the data?

Flat and Wide Variables
• Variables where all the values are minority values
  have a flat, wide distribution – one or two of each
  possible value
• Such variables are of little use in data mining because
  the goal of DM is to find general patterns from
  specific data
• No such patterns can exist if each data point is
  completely different
• Such variables should be excluded from a model
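
A simple check for flat, wide variables is the share of distinct values in a column. The cut-off of 0.9 and the column names below are assumptions for this sketch.

```python
def uniqueness_ratio(values):
    """Fraction of entries that are distinct (1.0 = every value unique)."""
    return len(set(values)) / len(values)


# Hypothetical columns from a borrowing table.
columns = {
    "member_id": [1001, 1002, 1003, 1004, 1005, 1006],
    "branch": ["North", "South", "North", "North", "South", "South"],
}

FLAT_WIDE_THRESHOLD = 0.9  # assumed cut-off
excluded = [name for name, values in columns.items()
            if uniqueness_ratio(values) >= FLAT_WIDE_THRESHOLD]
print(excluded)
```

Identifiers such as member numbers are the typical case: every row has a different value, so no general pattern can be learned from them.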

Data Balance
• Imagine I want to predict whether or not a
  prospective customer will respond to a mailing
  campaign
• I collect the data, put it into a data mining
  algorithm, which learns and reports a success
  rate of 98%
• Sounds good, but when I put a new set of
  prospects through to see who to mail, what
  happens?

A Problem
• … the system predicts ‘No’ for every single
  prospect.
• With a campaign response rate of 2%, the
  system is right 98% of the time if it
  always says ‘No’.
• So it never chooses anybody to target in the
  campaign
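
The arithmetic can be checked directly. The sketch assumes 1,000 prospects with a 2% response rate and a "model" that always answers ‘No’.

```python
# 1,000 hypothetical prospects, 2% of whom respond.
actual = ["Yes"] * 20 + ["No"] * 980
predicted = ["No"] * len(actual)   # the model always says 'No'

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# How many genuine responders did the model pick out?
responders_found = sum(a == "Yes" and p == "Yes"
                       for a, p in zip(actual, predicted))
print(accuracy, responders_found)
```

The accuracy is high, yet not one responder is identified, which is why percentage correct alone is a poor measure on unbalanced data.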


A Solution
• One data pre-processing solution is to balance the number of
  examples of each target class in the output variable
• In our previous example: 50% customers and 50%
  non-customers
• That way, any gain in accuracy over 50% would certainly be
  due to patterns in the data, not the prior distribution
• This is not always easy to achieve – you might need to throw
  away a lot of data to balance the examples, or build several
  models on balanced subsets
• Not always necessary – if an event is rare because its cause is
  rare, then the problem won’t arise
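
One way to balance the classes is to undersample the larger class at random. This is a minimal sketch; the function name `undersample`, the field name `responded` and the fixed seed are assumptions for the example.

```python
import random


def undersample(rows, label_key, seed=0):
    """Keep an equal-sized random sample of every class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)   # avoid leaving the classes in blocks
    return balanced


# 20 responders and 980 non-responders (illustrative).
prospects = ([{"id": i, "responded": "Yes"} for i in range(20)]
             + [{"id": i, "responded": "No"} for i in range(20, 1000)])
balanced = undersample(prospects, "responded")
print(len(balanced))
```

Calling the function with different seeds gives the several balanced subsets mentioned above, at the cost of discarding most of the majority class each time.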


Data Quantity
• How much data do you need?
• How long is a piece of string?
• Data must be sufficient to:
  – Represent the dynamics of the system to be
    modelled
  – Cover all situations likely to be encountered when
    predictions are needed
  – Compensate for any noise in the data

Model Building
• Choose a number of techniques suitable to
  the task:
  – Neural network for prediction or classification
  – Decision tree for classification
  – Rule induction for classification
  – Bayesian network for classification
  – K-Means for clustering



Train Models
• For each technique:
  – Run a series of experiments with different
    parameters
  – Each experiment should use around 70% of the
    data for training and the rest for testing
  – When a good solution is found, use
    cross-validation (10-fold is a good choice) to
    verify the result
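
A 70/30 split can be sketched as a shuffle followed by a cut. The function name and the fixed seed are assumptions; most data mining packages provide an equivalent.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows, then cut them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(100))
print(len(train), len(test))
```

Shuffling first matters: if the data are sorted (by date or by class, say), an unshuffled cut gives a test set that does not resemble the training set.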


Cross Validation
• Split the data into ten subsets, then train ten
  models – each one using nine of the subsets
  as training data and the remaining one as test,
  so every subset is used for testing exactly once.
  The score is the average of all ten test scores.
• This is a more accurate representation of how
  well the data may be modelled, as it reduces
  the risk of getting a lucky test set
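
The procedure can be sketched as follows. The helper names are hypothetical, the rows are assumed to be shuffled already, and the "model" is a deliberately trivial majority-class predictor used only to make the example runnable.

```python
from collections import Counter


def k_fold_indices(n_rows, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_rows))
    for fold in range(k):
        test = indices[fold::k]          # every k-th row, offset by fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validate(rows, fit, score, k=10):
    """Average the test score over k train/test splits."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx])
        scores.append(score(model, [rows[i] for i in test_idx]))
    return sum(scores) / len(scores)


# Toy model: always predict the most common training label.
def fit_majority(labels):
    return Counter(labels).most_common(1)[0][0]


def accuracy(model, labels):
    return sum(label == model for label in labels) / len(labels)


labels = ["No"] * 80 + ["Yes"] * 20   # illustrative data
cv_score = cross_validate(labels, fit_majority, accuracy, k=10)
print(round(cv_score, 3))
```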


Assess Models
• You can measure the success of your model in a
  number of ways
   – Mean squared error – not always meaningful
   – Percentage correct for classification
   – Confusion matrix for classification

               Output =    True    False
               True         80      30
               False        20      90
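
The usual summary measures can be read off the confusion matrix above. The sketch assumes that rows are the actual classes and columns are the model's output; the slide does not label the rows, so if the layout is the other way round, precision and recall swap while accuracy is unchanged.

```python
# Counts from the confusion matrix, under the assumed layout
# (rows = actual class, columns = model output).
true_positive, false_negative = 80, 30    # actual True row
false_positive, true_negative = 20, 90    # actual False row

total = true_positive + false_negative + false_positive + true_negative

# Percentage correct: the diagonal divided by everything.
accuracy = (true_positive + true_negative) / total

# Of the cases the model called True, how many really were?
precision = true_positive / (true_positive + false_positive)

# Of the cases that really were True, how many did the model find?
recall = true_positive / (true_positive + false_negative)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```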

Probability Outputs
• Most classification techniques provide a score
  with the classification – either a probability or
  some other measure
• This can be used to:
  – Allow an answer of “unsure” for cases where no
    single class has a high enough probability
  – Weight outputs to allow for unequal cost of
    outcomes
  – Draw lift charts and ROC curves
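
The “unsure” option amounts to a threshold on the highest class probability. The threshold of 0.7 and the class names are assumptions for this sketch.

```python
def classify(probabilities, threshold=0.7):
    """Return the most probable class, or 'unsure' if it is not probable enough."""
    label, p = max(probabilities.items(), key=lambda item: item[1])
    return label if p >= threshold else "unsure"


confident = classify({"fraud": 0.92, "genuine": 0.08})
borderline = classify({"fraud": 0.55, "genuine": 0.45})
print(confident, borderline)
```

Cases answered “unsure” can be passed to a person for checking, trading some automation for fewer costly mistakes.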

Generalisation and Over Fitting
• Most data mining models have a degree of
  complexity that can be controlled by the
  designer
• The goal is to find the degree of complexity
  that is best suited to the data
• A model that is too simple over-generalises (under-fits)
• A model that is too complex over-fits
• Both have an adverse effect on performance
Gen-Spec Trade Off
• Beyond a certain point, adding to the complexity
  of the model fits the training data better at the
  expense of higher test error
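
The trade-off can be explored with a deliberately simple model whose complexity is a single number. In this sketch the model predicts the mean training target within each of `n_bins` equal sub-ranges of the input; the synthetic sine data, the noise level and the sample sizes are all assumptions. Training error can only fall as the bins are halved, while test error typically falls at first and then rises again.

```python
import math
import random

rng = random.Random(0)


def make_data(n):
    """Noisy samples of a smooth curve on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)) for x in xs]


def fit(train, n_bins):
    """Model = mean training target in each of n_bins equal bins."""
    overall = sum(y for _, y in train) / len(train)
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in train:
        b = int(x * n_bins)
        sums[b] += y
        counts[b] += 1
    # Bins with no training data fall back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]


def mse(model, data):
    """Mean squared error of the binned model on a data set."""
    n_bins = len(model)
    return sum((y - model[int(x * n_bins)]) ** 2 for x, y in data) / len(data)


train, test = make_data(40), make_data(400)
results = []
for n_bins in (1, 2, 4, 8, 16, 32, 64):   # increasing complexity
    model = fit(train, n_bins)
    results.append((n_bins, mse(model, train), mse(model, test)))

for n_bins, train_error, test_error in results:
    print(f"{n_bins:3d} bins  train={train_error:.3f}  test={test_error:.3f}")
```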




Repeat or Finish
• The result of the data mining will leave you
  with either a model that works or the need to
  improve
• More data may need to be collected
• Different variables might be tried
• The process can loop several times before a
  satisfactory answer is found


Understanding and Using the Results
• The resulting model has the ability to perform
  the task it was set, so can be embedded in an
  automated system
• Some techniques produce models that are
  human readable and allow insights into the
  structure of the data
• Others produce models from which it is almost
  impossible to extract knowledge

Kevin Swingler: Introduction to Data Mining
