Document Classification
In PHP


        @ianbarber - ian@ibuildings.com
                        http://phpir.com
Document Classification


Defining The Task
Document Pre-processing
Term Selection
Algorithms
What is
Document Classification?
Uses



 Ian Barber / @ianbarber / ian@ibuildings.com
 Filter          Organise           Metadata
Filtering - Binary Classification
Organising - Single Label Classification
Metadata - Multiple Label Classification
Manual - Rules Written by Domain Experts
Machine Learning - Automatically Extract Rules
Classes

Training Documents
Test Documents
Evaluation

                    actual spam       actual ham

  classified spam   true positive     false positive

  classified ham    false negative    true negative
Measures

$accuracy  = ($tp + $tn) /
             ($tp + $tn + $fp + $fn);

$precision = $tp / ($tp + $fp);

$recall    = $tp / ($tp + $fn);
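
The speaker notes mention breakeven and F-beta as combined measures; a minimal sketch of those, building on the $precision and $recall values above (not on the original slide):

// F1 weighs precision and recall equally; beta < 1 makes recall
// less important than precision, e.g. for a spam checker.
$f1 = 2 * $precision * $recall / ($precision + $recall);

$beta  = 0.5;
$fbeta = (1 + $beta * $beta) * $precision * $recall /
         ($beta * $beta * $precision + $recall);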
Vector Space Model -
Bag Of Words
$doc   = strtolower(strip_tags($doc));

$regex = '/\w+/'; // continuous runs of word characters
preg_match_all($regex, $doc, $matches);

$words = $matches[0];




Extract Tokens
A: I really like eggs
B: I don't like cabbage, and don't like stew

        i   really  like  eggs  cabbage  and  don't  stew

  A     1     1      1     1       0      0     0      0

  B     1     0      1     0       1      1     1      1
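
A minimal sketch of building these binary vectors, assuming each document is already a $words array as extracted above ($docs and $vectors are hypothetical names):

// Collect every token seen into a shared dictionary.
$dictionary = array();
foreach ($docs as $id => $words) {
    foreach ($words as $word) {
        $dictionary[$word] = true;
    }
}

// One vector per document: 1 if it contains the term, 0 if not.
$vectors = array();
foreach ($docs as $id => $words) {
    foreach (array_keys($dictionary) as $term) {
        $vectors[$id][$term] = in_array($term, $words) ? 1 : 0;
    }
}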
[Plot: documents A and B as points in a two-term space, with the
'really' weight on the horizontal axis and the 'i' weight on the
vertical axis.]
$tf    = $termCount;

$idf   = log($totalDocs / $docsWithTerm, 2);

$tfidf = $tf * $idf;

Term Weighting
A: I really like eggs
B: I don't like cabbage, and don't like stew
C: I really, really like stew

        i   really  like  eggs  cabbage  and    don't  stew

  A     0    0.58    0    1.58     0       0      0      0

  B     0     0      0     0      1.58    1.58  3.16   0.58

  C     0    1.17    0     0       0       0      0    0.58
A: I really like eggs
B: I don't like cabbage, and don't like stew
C: I really, really like stew

        i   really  like  eggs  cabbage  and    don't  stew

  A     0    0.35    0    0.94     0       0      0      0

  B     0     0      0     0      0.31    0.31  0.63   0.11

  C     0    0.89    0     0       0       0      0    0.44
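
The second table is the first normalised to unit length: as the speaker notes put it, each value is divided by the vector's total length (the square root of the sum of the squared weights). A sketch:

function normalise(array $vector) {
    // Euclidean length of the tf-idf vector.
    $length = 0;
    foreach ($vector as $weight) {
        $length += $weight * $weight;
    }
    $length = sqrt($length);

    if ($length > 0) {
        foreach ($vector as $term => $weight) {
            $vector[$term] = $weight / $length;
        }
    }
    return $vector;
}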
Dimensionality Reduction
Stop Words
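
Per the notes, stop words can come from a fixed list or be inferred from low idf scores; a sketch of both (the list and threshold are assumptions):

// 1. A fixed stoplist.
$stopWords = array('a', 'an', 'the', 'of', 'and', 'i');
$words = array_diff($words, $stopWords);

// 2. Inferred: a term with idf near zero appears in almost every
// document in the collection and has no distinguishing power.
foreach ($idf as $term => $weight) {
    if ($weight < 0.1) {
        unset($dictionary[$term]);
    }
}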
happening - happen
happens   - happen
happened  - happen

http://tartarus.org/~martin/PorterStemmer
http://snowball.tartarus.org/algorithms/dutch/stemmer.html

Stemming
             spam   ham
  term        $a     $b
  not term    $c     $d

Chi-Square
$a = $termSpam; $b = $termHam;
$c = $restSpam; $d = $restHam;

$total = $a + $b + $c + $d;
$diff  = ($a * $d) - ($c * $b);

$chisquare =
  ($total * pow($diff, 2)) /
  (($a+$c) * ($b+$d) *
   ($a+$b) * ($c+$d));

Chi-Square 1DF

    p       chi2
  0.1       2.71
  0.05      3.84
  0.01      6.63
  0.005     7.88
  0.001    10.83

p-Value
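
For term selection we don't need the exact p-value: per the notes, just keep terms whose chi-square score clears a threshold from the table above. A sketch ($chisquares is a hypothetical term => score map):

$threshold = 10.83; // p = 0.001: 99.9% sure term and class move together
foreach ($chisquares as $term => $score) {
    if ($score < $threshold) {
        unset($dictionary[$term]); // drop uninformative terms
    }
}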
Decision Tree - ID3

[Diagram: a small decision tree - each internal node (?) tests a
term, each leaf (✔ / ✖) is a class decision.]
Entropy

$entropy =
   -( ($spam/$total) * log($spam/$total, 2))
   -( ($ham/$total)  * log($ham/$total, 2));
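
The notes point out the minus term can be repeated for multiple classes; a general sketch over a class => count array:

function entropy(array $classCounts) {
    $total   = array_sum($classCounts);
    $entropy = 0;
    foreach ($classCounts as $count) {
        if ($count > 0) { // log(0) is undefined; empty classes add nothing
            $entropy -= ($count / $total) * log($count / $total, 2);
        }
    }
    return $entropy;
}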
[Plot: entropy (vertical, 0 to 1) against spam/total (horizontal, 0
to 1): zero when the set is all spam or all ham, peaking at 1 when
the split is 50/50.]
Information Gain

$gain = $baseEntropy
  - (($withCount / $total) * $withEntropy)
  - (($woutCount / $total) * $woutEntropy);
            Split   Entropy   Proportion    E*P

  Base      50/50     1          1           1

  With      20/5    0.722       0.25       0.1805

  Without   30/45   0.97        0.75       0.7275


  1 - With - Without = 0.092
function build($tree) {
  if(!$tree->count('spam')) {
    $tree->setLeaf('ham');
  } else if(!$tree->count('ham')) {
    $tree->setLeaf('spam');
  } else {
    $term = $tree->findMaxGain();
    $tree->addSubtree($term,
      build($tree->getWith()),
      build($tree->getWout())
    );
  }
  return $tree;
}
[Diagram: an example output tree - a chain of term tests, each with
a ✔ or ✖ class leaf on one branch and a further term test on the
other.]
Classification
function classify($doc, $tree) {
  if($tree->isLeaf()) {
    return $tree->class;
  }
  $term = $tree->getSplitTerm();
  if(in_array($term, $doc)) {
    return classify($doc, $tree->getWith());
  } else {
    return classify($doc, $tree->getWout());
  }
}
Overfitting:
Pruning or Stop Conditions
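
One stop condition the notes suggest is a minimum information gain: if even the best split gains too little, emit a majority-class leaf instead of splitting further. A sketch against the build() function above ($tree->maxGain() and the threshold are assumptions):

define('MIN_GAIN', 0.01);

// Inside build(), before choosing a split term:
if ($tree->maxGain() < MIN_GAIN) {
    // Too specific to be worth splitting - take the majority class.
    $tree->setLeaf(
        $tree->count('spam') > $tree->count('ham') ? 'spam' : 'ham'
    );
    return $tree;
}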
K Nearest Neighbour
[Plots: spam and ham documents as points in Term X / Term Y space,
forming separate spatial clusters. A new document is placed in the
same space and takes the majority class of its K nearest
neighbours - here K = 3, and all three neighbours share one class.]
Cosine Similarity

$similarity = 0;
foreach($doca as $term => $tfidf) {
  if(isset($docb[$term])) {
    $similarity += floatval($tfidf) *
                   floatval($docb[$term]);
  }
}
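
That loop scores one pair of documents; a full KNN pass (a sketch, with $trainingDocs, $labels and K assumed) scores every training vector and votes over the top K:

$k      = 11; // typically 11-30, uneven relative to the class count
$scores = array();
foreach ($trainingDocs as $id => $vector) {
    // Dot product of the unit-length vectors = cosine similarity.
    $sim = 0;
    foreach ($newDoc as $term => $tfidf) {
        if (isset($vector[$term])) {
            $sim += $tfidf * $vector[$term];
        }
    }
    $scores[$id] = $sim;
}
arsort($scores);

// Vote over the K most similar training documents.
$votes = array();
foreach (array_slice(array_keys($scores), 0, $k) as $id) {
    $label = $labels[$id];
    $votes[$label] = isset($votes[$label]) ? $votes[$label] + 1 : 1;
}
arsort($votes);
$class = key($votes);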
Zend_Search_Lucene
$index = Zend_Search_Lucene::create($db);
$doc = new Zend_Search_Lucene_Document();

$doc->addField(
  Zend_Search_Lucene_Field::Text(
    'class', $class));
$doc->addField(
  Zend_Search_Lucene_Field::UnStored(
    'contents', $content));
$index->addDocument($doc);
Zend_Search_Lucene::setResultSetLimit(25);

$analyser =
  Zend_Search_Lucene_Analysis_Analyzer::getDefault();
$tokens = $analyser->tokenize($content);

foreach($tokens as $key => $token) {
  $tok = $token->getTermText();
  if(strlen($tok) > 4)
    $filtered[$tok]++;
}
arsort($filtered);

              Classifying with ZSL
$q = new Zend_Search_Lucene_Search_Query_MultiTerm();

$tc = 0;
foreach($filtered as $t => $tf) {
  $q->addTerm(
    new Zend_Search_Lucene_Index_Term($t));
  if(++$tc > 49) { break;}
}

$results = $index->find($q);
foreach($results as $result) {
  $classes[$result->class] += 1;
}

arsort($classes);
$class = key($classes);
Flax/Xapian Search Service
http://www.flax.co.uk
$flax = new FlaxSearchService('ip:8080');

$db = $flax->createDatabase('test');
$db->addField('class', array(
  'store'     => true,
  'exacttext' => true));
$db->addField('contents', array(
  'store'    => false,
  'freetext' => array('language' => 'en')));
$db->commit();

$db->addDocument(array(
  'class'    => $class,
  'contents' => $document));
$db->commit();
$db->addDocument(
        array('contents' => $doc), 'foo');
$db->commit();

$results = $db->searchSimilar('foo',0,25);
$db->deleteDocument('foo');
$db->commit();

foreach($results['results'] as $r) {
  if($r['docid'] != 'foo') {
    $classes[$r['data']['class'][0]] += 1;
  }
}

arsort($classes);
$class = key($classes);
[Plot: spam and ham documents in Term X / Term Y space, with each
class reduced to a single prototype (centroid) point.]
Prototypes For Rocchio

$mul = 1 / count($classDocs);

foreach($classDocs as $doc) {
  foreach($doc as $tid => $tfidf) {
    $prototype[$tid] += $mul * $tfidf;
  }
}
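
Classification is then just similarity against each class prototype, taking the closest, as the notes describe ($prototypes and $docVector are assumed names):

$similarity = array();
foreach ($prototypes as $class => $prototype) {
    $similarity[$class] = 0;
    foreach ($docVector as $tid => $tfidf) {
        if (isset($prototype[$tid])) {
            // Dot product against the class centroid.
            $similarity[$class] += $tfidf * $prototype[$tid];
        }
    }
}
arsort($similarity);
$class = key($similarity);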
Naive Bayes -
Probability Based Classifier
Bayes Theorem

  Pr(Class|Doc) = Pr(Doc|Class) * Pr(Class) / Pr(Doc)

Pr(Doc) is the same for every class, so for ranking we only need:

  Pr(Class|Doc) ∝ Pr(Doc|Class) * Pr(Class)
Likelihood Of Term Occurring Given Class

   word       spam freq   pr(word|spam)   ham freq   pr(word|ham)

  register      1757          0.11           246         0.02

  sent           487          0.03          4600         0.36
Estimating Likelihood
$this->db->query("
   INSERT INTO class_terms
       (class, term, likelihood)
   SELECT d.class, dt.term,
       count(*) / " . $classCount . "
   FROM documents AS d
   JOIN document_terms AS dt USING (did)
   WHERE d.class = '" . $class . "'
   GROUP BY d.class, dt.term"
);
Classifying A Document
foreach($classes as $class) {
  $prob[$class] = 0.5; // assume prior

  foreach($document as $term) {
    $prob[$class] *= $likely[$term][$class];
  }
}

arsort($prob);
$class = key($prob);
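
One caveat not on the slides: multiplying many small likelihoods underflows floating point, and an unseen term zeroes a class out entirely. A common variant sums log likelihoods and smooths missing terms (the smoothing here is a crude assumption):

foreach ($classes as $class) {
    $logProb[$class] = log(0.5); // assumed prior, as above
    foreach ($document as $term) {
        $likelihood = isset($likely[$term][$class])
            ? $likely[$term][$class]
            : 1 / ($classCount + 1); // crude smoothing for unseen terms
        $logProb[$class] += log($likelihood);
    }
}
arsort($logProb);
$class = key($logProb);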
Document Classification


Defining The Problem
Document Processing
Term Selection
Algorithm
Image Credits
Title          http://www.flickr.com/photos/themacinator/3499579760/
What is...     http://www.flickr.com/photos/austinevan/1225274637/
Filter         http://www.flickr.com/photos/benimoto/2913950616/
Organise       http://www.flickr.com/photos/ellasdad/425813314/
Metadata       http://www.flickr.com/photos/banky177/2282734063/
Manual         http://www.flickr.com/photos/foundphotoslj/1134150364/
Automatic      http://www.flickr.com/photos/29278394@N00/59538978/
Vector Space   http://www.flickr.com/photos/ethanhein/2260878305/sizes/o/
Reduction      http://www.flickr.com/photos/wili/157220657/sizes/l/
Stemming       http://www.flickr.com/photos/clearlyambiguous/20847530/sizes/l/
Stop words     http://www.flickr.com/photos/afroswede/22237769/
Chi-Squared    http://www.flickr.com/photos/kdkd/2837565850/sizes/o/
ID3            http://www.flickr.com/photos/tonythemisfit/2414239471
Overfitting     http://www.flickr.com/photos/akirkley/3222128726/sizes/l/
Bayes          http://www.flickr.com/photos/darwinbell/440080655/sizes/l/
Conclusion     http://www.flickr.com/photos/mukluk/241256203
Credits        http://www.flickr.com/photos/librarianavengers/413762956/
Questions?

       @ianbarber - ian@ibuildings.com
                       http://phpir.com

Editor's Notes

  • #2: Hello! PSC @ Ibuildings Twitter Email Blog - related posts
  • #3: This is what you need to do to implement a classifier And also our table of contents a note on PHP - Qs at end, but ask syntax qs straight away First, talk about what and why?
  • #4: What is it - Assign documents to classes from a predefined set Classes can be any label - e.g. topic words, categories Documents in this case is text, web pages, emails, books But it can be really anything as long as you can extract features from it Algos not hard, applicable in all langs. Python/Java have good library versions So - Why do in PHP? Integrate into web apps - WP, Drupal, MediaWiki
  • #5: Classification is really organising of information - every day Lots of uses - can group into common tasks of filter, organise, add metadata Might do all three with uploading photos to flickr or facebook Filter, get rid of bad ones. Organise, upload to album or set Tag photos with people in them etc.
  • #6: Filtering is binary - Class OR Not Class - often hide or remove one lot Can break others types down into series of this binary choices often BUT: simple, not easy. In flickr example, what is good? - photographer, composition, light etc. - regular person, contains their friends etc. - SUBJECTIVE
  • #7: Organising is putting document in one place - one label chosen from a set of many possible Single label only (often EXACTLY 1, 0 not allowed) Folders, albums, libraries, handwriting recognition
  • #8: Tagging, can have multiple Often 0 to many labels Often for tagging topics in content E.g. a news story on us-china embargo talk might be filed under: US, China, Trade
  • #9: In 80s people would come up with rules - computers would apply IF this term AND this term THEN this category Took a lot of time - Needed domain expert -Needed knowledge engineer to get knowledge out of expert Hard to scale, need more experts for new cats - Subjective - experts disagree Usually result was 60%-90% accurate
  • #10: Machine Learning - ‘look at examples’ - Supervised Learning Work out rules based on manually classified data People don’t need to explain their thinking - just organise - easier Scales better, is cheaper, and about as accurate! In the picture, it’s easy to see by looking at the groupings what the ‘rule’ for classifying m&ms is
  • #11: So what do we need? 1. the classes to classify to 2. A set of manually classified documents to train the classifier on 3. A set of manually classified docs to test on In some cases may have a third set of manually classified docs for validation How do we use these? We train a learner on training data to create a model
  • #12: Then use the model to classify each test document Compare manual to automatic judgements Here we’ve got a binary classification, for a spam checker Top is the manual judgement, side is classifier judgement Boxes will just be counts of judgements With that we can calculate some stats
  • #13: Accuracy is just correct percentage - BUT big biased sets make ‘no to all’ to all accurate. Or we desire bias e.g. FN over FP with spam Precision measures the number of positives that are true positives Recall measures what percentage of the available positives we captures Can have one w/out other: high threshold for precision, all positive for recall Researchers quote breakeven or fbeta
  • #14: To compare classifiers, researchers often quote breakeven point This is just where recall and precision are equal F-Beta allows weighing precision more than recall, or vice versa. Beta = 1 is balanced If beta = 0.5, recall is half as important as precision, such as spam checker Before classifying, we need to extract features. How do we represent text
  • #15: All this work is classic Information Retrieval Bag of Words is so called because we discard the structure, and just note down appearances of words Throw away the ordering, any structure at all from web pages etc. See why called vector space in a couple of slides
  • #16: First we have to extract words Simplest version: Take continuous sequences of word character Ignore all punctuation including apostrophes etc. Each new token we find in each document will be added to a dictionary Each document has a vector - there is a dimension for each dictionary word Value is 1 if the document contained that token, 0 if it did not
  • #17: Here is the collection of these two phrases as a vector. 1 if the word is in the document, 0 if not Note both vectors have the same dimensions In a real document collection there are lots of dimensions!
  • #18: We can plot the documents on a graph - using 2 terms ‘i’ and ‘really’ Here the green circle is document A the red triangle B The documents on the last page are in 8 dimensional space - 8 terms But we want more information - how important a term is to a document Need to capture a position in that dimension other than 0 & 1 A weight
  • #19: TFIDF is a classic and very common weighting - are a lot of variations though TF is just count of instances of that term in the doc IDF is number of docs divided by number with term Gives less common terms a higher weight So best is uncommon term that appears a lot Lets look at a similar example to before, with some term weights added
  • #20: The idf means that the ‘i’ and ‘like’ actually disappear here In all docs - no distinguishing power - no value to doc Don’t gets weighted higher in B Then normalise to unit length
  • #21: Normalise is just each value divided by total length (sqrt of the sqrd values) I and Like still 0 though Waste of time processing Maybe there are others that are a waste of time?
  • #22: DR or term space reduction is removing terms that don’t contribute much This can often be by a factor of 10 or 100 Speeds up execution
  • #23: May have heard of stop words - Common in search engines Words like ‘of’ ‘the’ ‘an’ - or ‘het’, ‘de’ in dutch Little to no semantic value to us Can use a stoplist of words, or infer it from low idf scores Collection stop words Pokemon in english, not a stop word. Pokemon on pokemon forum: stop word.
  • #24: Try to come up with ‘root’ word Maps lots of different variations onto one term, reducing dimensions Result is usually not a real world, it’s just repeatable
  • #25: Chi-square - 'kai', Greek not Chinese - helps choose indicative terms for each class Statistical technique - Calculates how related a term is to a class Take 4 counts from data. How many spam docs contain term etc. We look for difference between expected and actual counts For a given cell Expected is the row sum * col sum / total Square the difference, divided by expected value, and add them all up
  • #26: Plug the numbers into this formula: a one step way of doing the same thing Comes out with a number - not interesting absolutely But is interesting relatively Chi-square is a distribution, so we can calculate a probability of the events being unrelated using the area from this distribution 1DF because there is one variable and one dependent (term) (class)
  • #27: P is the chance that variables are independent For > 10.83 we are 99.9% certain the variables change together Can work out the probability number from a chi-square distribution But for DR, can just use a threshold and remove terms below OK, so we’ve got a good set of data, now we need a learner
  • #28: Tree of has term questions - ends in class decision Easy to classify, and recursive building algorithm pretty easy Algo is: If all collection is class, then leaf of class Else, choose the best term - Split into 2 collections, WITH and WITHOUT term Recurse on each half But how does it determine best?
  • #29: First, calculate entropy Take counts for how many total docs, how many spam, how many ham minus section could be repeated for multiple classes Represents num bits needed to encode the class of a random choice of document from this set How much new information we get - Easier to see on graph
  • #30: Percentage of spam on horizontal entropy on vertical If all spam or no spam no entropy - we know what will come out If 50/50 entropy is 1 - we can’t guess ahead of time We want to reduce entropy - so that the sets are more consistent
  • #31: We're using the entropy to calculate the maximum information gain This is the overall reduction in entropy The original entropy minus the new entropy New is weighted by the proportion of docs in each group withCount is the number of docs that have the feature woutCount is the number without, total is the total
  • #32: The split is how many of each class are in the group The entropy is calculated with the formula before The proportion is just the percentage of the total documents Final col is just entropy times proportion Note that the with class is very biased with a low entropy BUT - only a small proportion, so the final information gain is low
  • #33: Easy to implement recursive builder If 'spam' or 'ham' are empty - we say the tree is a leaf node. If not, we find the term with the highest info gain And build a subtree based on the set of terms with and without the term Just need to traverse to classify
  • #34: A completely made-up example of an output tree.
  • #35: Millions of ways to do this, of course! Simple function to return leaf node Assumes document is an array of words
  • #36: Problem: Tree gets too specific to training data - Need to generalise Stop condition - min info gain or other Pruning - test by trimming off bottom parts of tree Use validation set to test effectiveness of measures DTs generate human interpretable rules - very handy BUT expensive to train, need small N dimension, and often require rebuilding
  • #37: KNN is much cheaper at training time - as there is no training Recall we can regard documents as vectors in a N-dimensional space Where N is the size of the dictionary
  • #38: Lets consider only 2 terms Docs with weights for terms X and Y Documents of class triangle and class circle They seem to have a spatial cluster This is also true in higher dimension for real documents
  • #39: Class of new doc = class of its K nearest neighbours The K is how many we look for
  • #40: In this case K is three, and the nearest three are all green circles. Choosing K is kind of hard, you might try a few different values but it’s usually in the 11-30 doc range - uneven to num classes Only real challenge is comparing documents Here we are looking at just the X and Y distance, this is the euclidean distance
  • #41: Very easy. Simply looking at the difference between one and the other Can actually do the whole thing in the database! But it has some problems, so more common...
  • #42: Similarity measure, goes to 1 for identical, 0 for orthogonal Easy to do with normalised vectors - just take dot product Multiply each dimension in Doc A with it in Doc B, and sum Provides better matching than Euclidean We could just loop over documents, find K most similar But search engines do a very similar job - why not use one?
  • #43: Two options when classifying: count most common or add similarities Second helps, e.g. if 5 good matches in class A, 10 poor matches class B For multiple class tagging: use thresholds BUT: Have to compare all documents Search engines do a very similar job, use similar scoring. Why not use one?
  • #44: We can use Zend Framework native PHP implementation of Lucene We add an unindexed ‘class’ field, and our contents We would loop over our training data this way, adding documents
  • #45: Then, we construct a query. Use the same analyser to tokenise documents the same as training data And take a count of how often each word appears We don’t have IDF, so we’re just filtering short words
  • #46: Construct a query with the top 50 words by term frequency Results: take the most common class Works OK, not great. Java Lucene, can get a term vector - includes the true weights We aren’t limited to using pure PHP search engines though
  • #47: Flax is based on the open source Xapian engine, kind of like their Solr Has a similarity search that makes KNN ridiculously easy and very effective It works along the same lines as before, but extracting a set of relevant terms from the document or documents in question Weighting scheme is BM25 - more advanced
  • #48: This code creates a database, adds two fields to it, and indexes a document Uses a restful web service - available from any language
  • #49: Very similar to lucene loop Except we add then remove a document to use searchSimilar feature Gets good accuracy and is really fast. However, if we want to use this kind of technique and don’t have a flax handy, there is another related technique
  • #50: Instead of taking each value and comparing it We take the average of all the documents in each class And compare against that Very easy This works surprisingly well!
  • #51: Here we compute the centroid or average of all the class By summing the weight * 1/count. You might do this in the database, pretty straightforward op. Called a rohk-key-oh classifier because it’s based on a relevance feedback technique by Rocchio Classify by doing similarity against each - taking closest
  • #52: Quick and easy probability based classifier Very commonly used in spam checking, very trendy a couple of years back Naive assumption is that words are independent One word does not influence chances of seeing another - not true! BUT: Means that we don’t need an example for each combination of attributes Bayes is good at very high dimensionality because of this
  • #53: This is the Bayes theorem. Read the pipe as ‘given’, pr as probability of Pr(Doc) is constant, can be dropped for ranking Pr(Class) is either count or assumed - e.g. 60% spam = 0.6, or just use 0.5/0.5 for binary Have to work out Pr(Doc|Class) We calculate that by looking at the probability of the features in the docs
  • #54: We can look at the data itself to calculate the term likelihoods Conditional probability: Docs with term in class / Docs in class We had 1757 docs with the word register in the spam class, and about 16,000 docs in the spam class, so the probability is about 0.11. Register is more spam than ham, sent is more ham than spam
  • #55: Can calculate in SQL directly ClassCount is the number of docs in that class - from earlier query Divide: Number of docs in class containing term / Number of docs in class The stored value is the likelihood of seeing that term in a doc of that class Would call once for each class
  • #56: Independence assumption lets us treat probability of doc as product of probabilities of word for the given class Loop over the terms and multiply likelihood for each class Assumed prior of 0.5 Multi-Bernoulli - the multinomial variant uses term count in class over overall term count
  • #57: To sum up, these are the steps for a wide range of problems Step 1: Recognising that something is a classification problem - context spelling, author ident, intrusion detection, find genes in DNA Then extract features from the docs Apply a learner to generate a model for classifying Something for your mental toolbox!
  • #58: Thanks to the people who put their photos on flickr under Creative Commons
  • #59: Any questions?