Statistical Learning and Text Classification with
NLTK and scikit-learn

Olivier Grisel
http://twitter.com/ogrisel

PyCon FR – 2010
Applications of Text Classification

Task                                      Predicted outcome
Spam filtering                            Spam, Ham
Language guessing                         English, Spanish, French, ...
Sentiment Analysis for Product Reviews    Positive, Neutral, Negative
News Feed Topic Categorization            Politics, Business, Technology, Sports, ...
Pay-per-click optimal ads placement       Will yield money, Won't
Personal twitter filter                   Will interest me, Won't
Malware detection in log files            Normal, Malware
Supervised Learning Overview
●   Convert the training data to a set of feature vectors
    (input) & labels (output)
●   Build a model based on the statistical properties of the
    features in the training set, e.g.
    ●   Naïve Bayesian Classifier
    ●   Logistic Regression / Maxent Classifier
    ●   Support Vector Machines
●   For each new text document to classify
    ●   Extract features
    ●   Ask the model to predict the most likely outcome
Supervised Learning Summary

Training:
    Text Documents → feature vectors ─┐
                                      ├→ Machine Learning Algorithm → Predictive Model
    Labels ───────────────────────────┘

Prediction:
    New Text Document → feature vector → Predictive Model → Expected Label
Typical features for text documents
●   Tokenize the document into a list of words: uni-grams
['the', 'quick', 'brown', 'fox', 'jumps', 'over',
 'the', 'lazy', 'dog']
●   Then choose one of (first two sketched below):
    ●    Binary occurrences of uni-grams:
        {'the': True, 'quick': True, ...}
    ●    Frequencies of uni-grams: count of word_i / total
         number of words in the document:
    {'the': 0.22, 'quick': 0.11, ...}
    ●    TF-IDF of uni-grams (see next slides)
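The first two representations take only a couple of lines of plain
Python. A minimal sketch (not from the original slides):

from collections import Counter

words = ['the', 'quick', 'brown', 'fox', 'jumps', 'over',
         'the', 'lazy', 'dog']

# binary occurrences of uni-grams
binary = dict((w, True) for w in words)

# frequencies of uni-grams: count of word_i / total words
counts = Counter(words)
total = float(len(words))
freqs = dict((w, c / total) for w, c in counts.items())

print freqs['the']   # 2 / 9 = 0.22...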
Better than freqs: TF-IDF
●   Term Frequency

    tf(t, d) = (occurrences of term t in document d)
               / (total number of terms in d)

●   Inverse Document Frequency

    idf(t) = log(N / df(t))

    with N the number of documents in the corpus and df(t)
    the number of documents containing term t

=> No real need for stop words any more: non-informative
words such as “the” are scaled down by the IDF term
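These two definitions fit in a few lines of Python. A minimal sketch
(not from the slides) on a toy corpus:

import math

docs = [['the', 'quick', 'brown', 'fox'],
        ['the', 'lazy', 'dog'],
        ['the', 'fox', 'jumps']]

def tf(term, doc):
    return doc.count(term) / float(len(doc))

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / float(df))

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print tfidf('the', docs[0], docs)  # 0.0: 'the' occurs in every document
print tfidf('fox', docs[0], docs)  # ~0.10: 'fox' is discriminative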
More advanced features
●   Instead of uni-grams use
    ●   bi-grams of words: “New York”, “very bad”, “not
        good”
    ●   n-grams of chars: “the”, “ed ”, “ a ” (useful for
        language guessing)
●   And then combine with one of (sketch below):
    ●   Binary occurrences
    ●   Frequencies
    ●   TF-IDF
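Both kinds of n-grams are one-liners in plain Python (NLTK and
scikit-learn ship their own extractors, shown later). A minimal sketch:

def word_bigrams(words):
    # pairs of consecutive words
    return zip(words, words[1:])

def char_ngrams(text, n):
    # overlapping character windows of length n
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print word_bigrams(['not', 'good', 'at', 'all'])
# [('not', 'good'), ('good', 'at'), ('at', 'all')]
print char_ngrams('the lazy dog', 3)[:4]
# ['the', 'he ', 'e l', ' la']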
NLTK
●   Code: ASL 2.0 & Book: CC-BY-NC-ND
●   Tokenizers, Stemmers, Parsers, Classifiers,
    Clusterers, Corpus Readers
NLTK Corpus Downloader
>>> import nltk
>>> nltk.download()
Using an NLTK corpus
>>> from nltk.corpus import movie_reviews
>>> pos_ids = movie_reviews.fileids('pos')
>>> neg_ids = movie_reviews.fileids('neg')
>>> len(pos_ids), len(neg_ids)
(1000, 1000)
>>> print movie_reviews.raw(pos_ids[0])[:100]
films adapted from comic books have had plenty of success , 
whether they're about superheroes ( batm 
>>> movie_reviews.words(pos_ids[0])
['films', 'adapted', 'from', 'comic', 'books', 'have', ...]
Common data cleanup operations
●   Switch to lower case: s.lower()
●   Remove accented chars:
  import unicodedata
  s = ''.join(c for c in unicodedata.normalize('NFD', s)
                if unicodedata.category(c) != 'Mn')

●   Extract only word tokens of at least 2 chars
    ●   Using NLTK tokenizers & stemmers
    ●   Using a simple regexp:

     re.compile(r"\b\w\w+\b", re.U).findall(s)
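Chaining the three operations into one helper — a hedged sketch, not
from the slides:

import re
import unicodedata

WORD_RE = re.compile(r"\b\w\w+\b", re.U)

def clean_tokens(s):
    # lower-case, strip accents via NFD decomposition,
    # then keep word tokens of at least 2 chars
    s = s.lower()
    s = ''.join(c for c in unicodedata.normalize('NFD', s)
                if unicodedata.category(c) != 'Mn')
    return WORD_RE.findall(s)

print clean_tokens(u"J'ai mangé du kangourou ce midi")
# [u'ai', u'mange', u'du', u'kangourou', u'ce', u'midi']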
Feature Extraction with NLTK
●   Simple word binary occurrence features:
def word_features(words):
    return dict((word, True) for word in words)
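For example (a quick check, not from the slides), duplicate words
collapse into a single binary feature:

>>> word_features(['the', 'quick', 'the'])
{'the': True, 'quick': True}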

●   Word Bigrams occurrence features:
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures as BAM
from itertools import chain


def bigram_word_features(words, score_fn=BAM.chi_sq, n=200):
    # keep the n bigrams that score best on a chi-square
    # collocation test, in addition to all single words
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict((bg, True) for bg in chain(words, bigrams))
The NLTK Naïve Bayes Classifier
from nltk.classify import NaiveBayesClassifier
mr = movie_reviews
neg_examples = [(features(mr.words(i)), 'neg')
                for i in neg_ids]
pos_examples = [(features(mr.words(i)), 'pos')
                for i in pos_ids]
train_set = pos_examples + neg_examples
classifier = NaiveBayesClassifier.train(train_set)


# later, on a previously unseen document
predicted_label = classifier.classify(new_doc_features)
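To measure accuracy, hold out part of the corpus — a hedged sketch
(not on the original slide) using nltk.classify.util:

from nltk.classify.util import accuracy

train_set = pos_examples[:800] + neg_examples[:800]
test_set = pos_examples[800:] + neg_examples[800:]
classifier = NaiveBayesClassifier.train(train_set)
print accuracy(classifier, test_set)  # ~0.7 with unigram occurrences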
Most informative features
>>> classifier.show_most_informative_features()
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0
scikit-learn
Feature Extraction in scikit-learn
from scikits.learn.features.text import *


text = u"J'ai mangé du kangourou ce midi, c'était pas très bon."
print WordNGramAnalyzer(min_n=1, max_n=2).analyze(text)
[u'ai', u'mange', u'du', u'kangourou', u'ce', u'midi',
 u'etait', u'pas', u'tres', u'bon', u'ai mange', u'mange du',
 u'du kangourou', u'kangourou ce', u'ce midi', u'midi etait',
 u'etait pas', u'pas tres', u'tres bon']
char_ngrams = CharNGramAnalyzer(min_n=3, max_n=6).analyze(text)
print char_ngrams[:5] + char_ngrams[-5:]
[u"j'a", u"'ai", u'ai ', u'i m', u' ma', u's tres', u' tres ',
 u'tres b', u'res bo', u'es bon']
TF-IDF features & SVMs
from scikits.learn.features.text import *
from scikits.learn.sparse.svm import LinearSVC


# the analyzer value was truncated on the original slide; a word
# n-gram analyzer is a plausible choice
hv = SparseHashingVectorizer(dim=1000000,
                             analyzer=WordNGramAnalyzer())
hv.vectorize(list_of_documents)
features = hv.get_tfidf()
clf = LinearSVC(C=10, dual=False)
clf.fit(features, labels)


# later, with the same clf instance
predicted_labels = clf.predict(features_of_new_docs)
Typical performance results
●   Naïve Bayesian Classifier with unigram
    occurrences on movie reviews: ~ 70%
●   Same as above selecting the top 10000 most
    informative features only: ~ 93%
●   TF-IDF unigram features + Linear SVC on 20
    newsgroups ~93% (with 20 target categories)
●   Language guessing with character ngram
    frequencies features + Linear SVC: almost
    perfect if document is long enough
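The “top 10000 most informative features” step can be done by scoring
every word with a chi-square test against the class labels. A hedged
sketch of one common recipe (not the exact code from the talk):

from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.metrics import BigramAssocMeasures as BAM

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()
for label, ids in (('pos', pos_ids), ('neg', neg_ids)):
    for i in ids:
        for word in movie_reviews.words(i):
            word_fd[word] += 1
            label_word_fd[label][word] += 1

total = word_fd.N()
scores = {}
for word, freq in word_fd.items():
    # how strongly the word's distribution deviates between classes
    pos_score = BAM.chi_sq(label_word_fd['pos'][word],
                           (freq, label_word_fd['pos'].N()), total)
    neg_score = BAM.chi_sq(label_word_fd['neg'][word],
                           (freq, label_word_fd['neg'].N()), total)
    scores[word] = pos_score + neg_score

best = set(sorted(scores, key=scores.get, reverse=True)[:10000])

def best_word_features(words):
    return dict((w, True) for w in words if w in best)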
Confusion Matrix (20 newsgroups)
00 alt.atheism
01 comp.graphics
02 comp.os.ms-windows.misc
03 comp.sys.ibm.pc.hardware
04 comp.sys.mac.hardware
05 comp.windows.x
06 misc.forsale
07 rec.autos
08 rec.motorcycles
09 rec.sport.baseball
10 rec.sport.hockey
11 sci.crypt
12 sci.electronics
13 sci.med
14 sci.space
15 soc.religion.christian
16 talk.politics.guns
17 talk.politics.mideast
18 talk.politics.misc
19 talk.religion.misc
Handling many possible outcomes
●   Example: possible outcomes are all the
    categories of Wikipedia (565,108)
●   Document Categorization becomes Information
    Retrieval
●   Instead of building one linear model for each
    outcome build a fulltext index and perform TF-
    IDF similarity queries
●   MoreLikeThis is a smart way to find the top ~10 most
    informative terms of a document to use as search keywords
●   Available in Apache Lucene / Solr as MoreLikeThisQuery
    (sketch below)
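A hedged sketch (not from the talk) of a TF-IDF similarity query
against Solr's MoreLikeThis handler; the URL, handler path and field
name ('text') are assumptions about the Solr setup:

import json
import urllib
import urllib2

new_doc = "Text of the document to categorize ..."
params = urllib.urlencode({
    'stream.body': new_doc,          # the new document as a content stream
    'mlt.fl': 'text',                # indexed field to mine for terms
    'mlt.interestingTerms': 'list',  # also return the selected keywords
    'wt': 'json',
})
url = 'http://localhost:8983/solr/mlt?' + params
results = json.load(urllib2.urlopen(url))

# most similar (already categorized) documents and the selected keywords
print results['response']['docs'][:10]
print results['interestingTerms']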
NLTK – Online demos
NLTK – REST APIs
% curl -d "text=Inception is the best movie ever" \
       http://text-processing.com/api/sentiment/


{
    "probability": {
         "neg": 0.36647424288117808,
         "pos": 0.63352575711882186
    },
    "label": "pos"
}
Google Prediction API
Some pointers
●   http://www.nltk.org (Code & Doc & PDF Book)
●   http://scikit-learn.sf.net (Doc & Examples)
    http://github.com/scikit-learn (Code)
●   http://www.slideshare.net/ogrisel (These slides)

●   http://streamhacker.com/ (Blog on NLTK & APIs)
●   http://github.com/hmason/tc (Twitter classifier –
    work in progress)
