Principal Component Analysis for
                  Novelty Detection
A journal article submitted to and accepted by Pattern Recognition Letters



                                                 Jordan McBain, P.Eng.
                                            Markus Timusk, PhD, P.Eng.
Condition Monitoring
   Maintenance technique
       Maintenance undertaken when some indicator of health is
        flagged
       Advanced technique employed when cost-benefit analysis
        justifies the expense of monitoring equipment
       Alternative to run-to-failure maintenance and statistically
        determined time-based maintenance
   Employ pattern recognition to automate diagnosis
       Expert system employed to replicate a technician's
        maintenance insight
           Computer and sensors replace the technician with a screwdriver
            set atop a vibrating machine – the nature of the vibration is
            used to discern machine state
Pattern Recognition
   Equality is an insufficient means of classifying real-world
    members of a class (noise, variance, etc.)
   Pattern recognition
       Real-world signals presumed representative of a class are
        reduced to representative n-dimensional feature vectors
       Plotted in n-dimensional space
       Decision boundary generated with pattern recognition
        techniques
           Employed as classification rule
       Problems
           Choice of features
               How representative?
               Maximize number of features?
               Curse of dimensionality
           Imbalance of data
Principal Component Analysis
   One technique used to find “optimal” set of features
       Finds the axes of normally distributed data
       Select the largest axes and omit smaller ones to define
        new basis
       Project data onto basis to reduce dimensionality of
        problem space
   Each feature presumed to be normally distributed
   N-dimensional scattering of features presumed
    independent
   Combined probability:
             $P(A \cap B) = P(A) \cdot P(B)$
$$
p(x) \;=\; \prod_{i=1}^{d} p(x_i)
     \;=\; \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i}\,
            e^{-\frac{1}{2}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^{2}}
     \;=\; \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i}\,
            e^{-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^{2}}
     \;=\; \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\,
            e^{-\frac{1}{2}(x-\mu)^{t}\,\Sigma^{-1}\,(x-\mu)}
$$
   Find principal components (i.e. axes of the
    hyper-ellipsoidal distribution)
   Select maximum variance (largest axes)
   Eigenvalue problem
       Eigenvectors – principal components
       Eigenvalues – size of axes
    (a minimal sketch of this decomposition follows below)
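
    As a concrete illustration, a minimal sketch of PCA by eigendecomposition,
    assuming NumPy; the function name and toy data are illustrative, not from
    the paper:

        import numpy as np

        def pca(X, k):
            """Project the rows of X onto the k principal axes of largest variance."""
            mean = X.mean(axis=0)
            Xc = X - mean                      # centre the data at its mean
            cov = np.cov(Xc, rowvar=False)     # d x d covariance matrix
            vals, vecs = np.linalg.eigh(cov)   # symmetric matrix: eigh, ascending eigenvalues
            order = np.argsort(vals)[::-1]     # largest-variance axes first
            basis = vecs[:, order[:k]]         # keep the k largest axes as the new basis
            return Xc @ basis, basis           # projected data and the basis itself

        # Toy usage: 500 samples of correlated 3-d Gaussian data reduced to 2-d.
        rng = np.random.default_rng(0)
        X = rng.multivariate_normal([0, 0, 0], [[3, 1, 0], [1, 2, 0], [0, 0, 0.1]], size=500)
        Z, W = pca(X, k=2)
        print(Z.shape, W.shape)  # (500, 2) (3, 2)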
Novelty Detection
   Deals with imbalance of data between classes
   Fault detection in machinery
       Easy to collect data representative of healthy state
       Difficult to collect data representative of faulted states
           Costly to break machinery
           Operationally unacceptable
           Poor database of faults kept
           Can never capture them all!
   Model healthy data with decision boundary
       If test patterns fall outside, classify as a fault!
Problem
   PCA selects the subspace that best represents the data
   In pattern recognition, we seek to discriminate
    between classes
   The objectives of most feature reduction techniques are
    not optimized for novelty detection
Feature Reduction Techniques
   Feature Selection vs. Feature Extraction
   Selection
       Choosing small subsets of features that are adequate to
        describe classes
       E.g. “Search”
           Examines all subsets of feature combinations to find the one which
            maximizes some objective function
           May employ classifier error as objective function
           Exponential explosion
               Heuristics exist to mitigate this
           If computationally feasible, gives the best results
   Extraction
       Computes a small number of new features from the set of old
        features
       E.g. PCA
Principal Component Analysis
   Seeks a subspace in which the data representation
    error is minimal
   Development
       For a set of n vectors in d-dimensional space
           seek the equation of a hyperplane onto which the data may be
            projected with minimal representation error
           Hyperplane fixed at the data's mean, m
           Hyperplane's orientation defined by direction vector, w (the
            normal definition of a plane)



           Derive error function
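    A hedged reconstruction of the omitted development, following the
    standard treatment in Duda (2000): each point is approximated on the
    line through the mean,
$$
x_k \;\approx\; m + a_k w, \qquad
J(a_1,\dots,a_n,w) \;=\; \sum_{k=1}^{n}\bigl\| (m + a_k w) - x_k \bigr\|^{2}
$$
    minimizing over the $a_k$ gives $a_k = w^{t}(x_k - m)$; minimizing $J$
    over $w$ (with $\|w\| = 1$) is then equivalent to maximizing
    $w^{t} S\, w$, where $S = \sum_k (x_k - m)(x_k - m)^{t}$ is the scatter
    matrix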
   Optimization problem reduces to the well-known
    eigenvalue problem
   Resultant feature space is linear
       May not represent non-linear and changing data well
       Kernel PCA and Dynamic PCA
   Techniques only suitable for representing data, not
    discriminating between classes
                 Source: Duda, 2000
Multiple Discriminant Analysis
   Seeks to find efficient subspaces for discrimination
    rather than representation
   Development
       Two-class problem with a d-dimensional set of n vectors
        grouped into D1 and D2
       Projected onto some direction vector w to give
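         $y_i = w^{t} x_i$ (projection reconstructed from context; notation assumed)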

       Consequently grouped into subsets Y1 and Y2
       Find the direction vector w such that the distance
        between projected sample means m1 and m2 is
        maximized
           Normalized against the relative spread (scatter) of the samples
   Reduces to
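    a plausible reconstruction of the omitted criterion (the standard Fisher
    form, per Duda, 2000):
$$
J(w) \;=\; \frac{\left|\tilde m_1 - \tilde m_2\right|^{2}}{\tilde s_1^{\,2} + \tilde s_2^{\,2}}
     \;=\; \frac{w^{t} S_B\, w}{w^{t} S_W\, w}
$$
    where $S_B$ is the between-class and $S_W$ the within-class scatter matrix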



   Solution is described as "analogous to the well-known
    Rayleigh quotient":
$$
w \;=\; S_W^{-1}\,(m_1 - m_2)
$$
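
    A minimal sketch of computing this direction, assuming NumPy; the
    names are illustrative:

        import numpy as np

        def fisher_direction(X1, X2):
            """Two-class Fisher direction w = S_W^{-1} (m1 - m2); rows are samples."""
            m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
            Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
            w = np.linalg.solve(Sw, m1 - m2)   # solve S_W w = m1 - m2, no explicit inverse
            return w / np.linalg.norm(w)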

   Technique extended for problems with n classes
       Objective: maximize the spread between all classes in the
        projected space
                                                     Source: Duda, 2000
Extraction for Novelty Detection
Development
   Objective: distinguish between normal and abnormal
    classes
       KFDA inappropriate (assumes classes cluster into
        well-separated groups)
       Novelty detection – classes may cluster well, but abnormal
        classes are expected to orbit the normal data
           Means could overlap
               Ruling out the previous objective functions
   Approach: find the subspace maximizing difference
    between average spread of the normal class and
    average spread of the abnormal class measured
    from the mean of the normal class
   Mathematically, for an outlier class containing b
    elements and a target class containing a elements
    with mean m_t
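    one consistent reconstruction of the omitted objective, matching the
    "average spread" wording above (the paper's exact normalization may
    differ):
$$
J(w) \;=\; \frac{1}{a}\sum_{i=1}^{a} w^{t}(x_i - m_t)(x_i - m_t)^{t}\, w
       \;-\; \frac{1}{b}\sum_{j=1}^{b} w^{t}(z_j - m_t)(z_j - m_t)^{t}\, w
$$
    with $x_i$ the target (normal) patterns and $z_j$ the outlier patterns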




   To simplify, introduce outlier scatter matrix, O, for
    outlier data centered at m_t
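    reconstructed from the definition above (the target scatter $S_t$ is
    defined analogously from the $x_i$):
$$
O \;=\; \frac{1}{b}\sum_{j=1}^{b} (z_j - m_t)(z_j - m_t)^{t}
$$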

   Reducing to
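$$
J(w) \;=\; w^{t}\,(S_t - O)\,w
$$
    (reconstructed; consistent with the eigenproblem stated below)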
   Maximize this objective function
       Find the eigenvectors and eigenvalues of the matrix S_t − O
   Select the k largest eigenvalues and use the
    corresponding eigenvectors as the new basis
   Project data onto the new basis
   Proceed with classification
    (a minimal sketch of these steps follows below)
   Limitations
       Still dependent on the assumption of normally distributed data
           (as are other PCA techniques)
       Assumption: normal data scatter somewhat circularly and
        outlier data orbit neatly without intruding
           (as with PCA and MDA)
       Machinery vibration data are not, in general, Gaussian (a
        heuristic observation)
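
    A minimal sketch of the steps above, assuming NumPy and the
    reconstructed S_t − O objective; an illustration, not the paper's
    implementation:

        import numpy as np

        def novelty_basis(X_target, X_outlier, k):
            """New k-dimensional basis: top eigenvectors of S_t - O, both centred at m_t."""
            m_t = X_target.mean(axis=0)
            Dt, Do = X_target - m_t, X_outlier - m_t   # centre BOTH classes at the target mean
            S_t = Dt.T @ Dt / len(Dt)                  # average target scatter
            O = Do.T @ Do / len(Do)                    # average outlier scatter about m_t
            vals, vecs = np.linalg.eigh(S_t - O)       # symmetric difference matrix
            order = np.argsort(vals)[::-1]             # k largest eigenvalues first
            return vecs[:, order[:k]]

        # Usage: W = novelty_basis(X_healthy, X_fault, k=2), then project
        # centred data with (X - m_t) @ W and train any novelty detector.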
Validation: Artificial Data
   Artificial 3-d data set
       Normal distribution:
           spherical (radius 50) centered at origin
       Outlier distribution:
           randomly generated spherical distribution (radius 100)
           Not permitted to fall within cylinder concentric with the normal
            data’s sphere and oriented with length parallel to [1,1,1]
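
    A hedged sketch of generating such a data set, assuming NumPy; the
    radii follow the description above, while the cylinder radius (50) and
    the sample counts are assumptions:

        import numpy as np
        rng = np.random.default_rng(1)

        def sphere(n, radius):
            """n points drawn uniformly inside a 3-d sphere of the given radius."""
            p = rng.normal(size=(n, 3))
            p /= np.linalg.norm(p, axis=1, keepdims=True)      # uniform directions
            return p * radius * rng.random((n, 1)) ** (1 / 3)  # uniform radial density

        axis = np.ones(3) / np.sqrt(3)                   # cylinder axis along [1, 1, 1]
        normal = sphere(500, 50)                         # normal class: radius-50 sphere at origin
        cand = sphere(5000, 100)                         # outlier candidates: radius-100 sphere
        dist = np.linalg.norm(cand - np.outer(cand @ axis, axis), axis=1)
        outlier = cand[dist > 50]                        # reject points inside the cylinder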
Validation: Artificial Data
   Results (reduced to 2 dimensions)
       Subspace’s normal vector only 7 degrees off from
        expected [1,1,1]
Experimental Methodology
Apparatus
   Spectraquest gear dynamics simulator
       3-hp motor
       Magnetic particle brake loading
       National Instruments PXI data acquisition and control
       Accelerometers (sampled at 4 kHz)
Faults
   4 motors employed
       healthy
       Combo bearing faults
       Broken rotor bars
       Rotor unbalance
   Gearbox faults
       Fault-free conditions
       Missing tooth gear
       Chipped tooth
       Bearing with outer race faults
       Bearings with inner and outer race faults
Feature Extraction
   Autoregressive model
       A model of a statistical process obtained by regressing the
        process on its own previous values
       A compact set of coefficients that best reproduces the
        sampled signal
       Order 10
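
    A minimal sketch of extracting order-10 AR coefficients by least
    squares, assuming NumPy; the paper's exact estimator (e.g. Yule-Walker
    or Burg) is not specified here:

        import numpy as np

        def ar_features(x, order=10):
            """Least-squares AR coefficients: predict x[t] from the previous `order` samples."""
            X = np.column_stack([x[order - 1 - i : len(x) - 1 - i] for i in range(order)])
            y = x[order:]
            coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
            return coeffs  # one 10-dimensional feature vector per signal segment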

Segmentation
   Vibration data segmented into groups based on
    intervals spanning a constant number of shaft rotations
       Gaussian window
       70% overlap between segments
    (a windowing sketch follows below)
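
    A hedged sketch of the windowing, assuming NumPy; fixed-length segments
    stand in for the constant-shaft-rotation intervals (which would need a
    tachometer signal), and the window width sigma is an assumption:

        import numpy as np

        def segments(x, length, overlap=0.7):
            """Gaussian-windowed segments of x with the stated 70% overlap."""
            step = max(1, int(length * (1 - overlap)))
            n = np.arange(length)
            sigma = 0.2 * length                       # assumed window width
            win = np.exp(-0.5 * ((n - (length - 1) / 2) / sigma) ** 2)
            return [x[s:s + length] * win for s in range(0, len(x) - length + 1, step)]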
Results: Proposed Algorithm
Results: Kernel PCA
Results: Kernel FDA
   N.B. Potential for singular matrices
Results: Exhaustive Feature Search
Feature Extraction in the
     Absence of Outliers
Motivation and Development
   The technique above requires outlier data, violating a
    premise of novelty detection
       Limited (or no) data available from the fault classes
   In the case where we know nothing of the outlier
    classes
       Work with what we have: normal data
           Minimize the variance of the normal data (one reading of this
            criterion is stated below)
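    One consistent reading of "minimize the variance of the normal data",
    stated as a criterion (an interpretation, not a quotation from the
    paper):
$$
\min_{\|w\| = 1} \; w^{t} S_t\, w
\quad\Longrightarrow\quad
\text{keep the eigenvectors of } S_t \text{ with the smallest eigenvalues}
$$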
Results: Novelty Reduction (Outlier
Absence)
Conclusions
   Reduce a large feature space to a smaller one
       Mitigate the curse of dimensionality
   Objective function tweaked for novelty detection
   Similar to MDA but modified to accommodate the case
    where the normal and outlier means are close together
    (or coincide)
   Results good for artificial and machinery data
   Future work
       Extend technique with kernels
           Difficult problem due to need for mean
   Thanks
       CEMI
       Dr. Mechefske, Queen's University
