The Science behind
   Predictive Analytics:
a Text Mining Perspective


         Ankur Pandey
     [24]7 Innovation Labs
Predictive Analytics

• Use of statistical techniques to analyse
  current data to make predictions about future
  events.

• In most cases, we deal with unstructured text
  data.

• Text data is very high dimensional & sparse.
Text Data: Representation
• As a pre-processing step, typically:
    Stop-words removal
    Stemming

• Bag of words model
    For example, in the Vector Space model, each document in
     a corpus is represented by a vector with one element per
     word (term). The value of an element could be:
      Term Frequency (TF)
      TF-IDF
    Most often we use the cosine of the angle between two
     vectors to measure the similarity of documents.
Vector Space Model




A term-document matrix
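
As a rough illustration of the vector space model, the sketch below builds a small term-document matrix, weights it with TF-IDF, and compares documents by cosine similarity. The three documents, the tiny stop-word list, and the IDF variant log(N / df) are assumptions chosen for the example; stemming is omitted for brevity.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the slides).
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the cat chased the mouse",
    "d3": "stock markets fell sharply",
}
STOP_WORDS = {"the", "on"}  # tiny illustrative stop-word list

def tokenize(text):
    """Lower-case, split on whitespace, and drop stop-words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

tokens = {name: tokenize(text) for name, text in docs.items()}
vocab = sorted({w for ws in tokens.values() for w in ws})

# Term frequency: raw count of each term in each document.
tf = {name: Counter(ws) for name, ws in tokens.items()}

# Term-document matrix: one row per term, one column per document.
term_document = [[tf[name][t] for name in docs] for t in vocab]

# Inverse document frequency, using the common log(N / df) variant.
n_docs = len(docs)
df = {t: sum(1 for ws in tokens.values() if t in ws) for t in vocab}
idf = {t: math.log(n_docs / df[t]) for t in vocab}

def tfidf_vector(name):
    """TF-IDF vector of one document over the shared vocabulary."""
    return [tf[name][t] * idf[t] for t in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = {name: tfidf_vector(name) for name in docs}
sim_12 = cosine(vectors["d1"], vectors["d2"])  # share the term "cat"
sim_13 = cosine(vectors["d1"], vectors["d3"])  # share no terms
```

Here d1 and d2 get a small positive similarity because they share one term, while d1 and d3 have similarity zero. Most entries of the matrix are zero even in this toy corpus, which is the sparsity mentioned earlier.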
Text Data: Representation
• String models
    e.g., n-gram model



• Semantic models
    Parsing

    Dimensionality Reduction
       e.g., LSA (also known as LSI) applies a singular value
        decomposition to the term-document matrix and keeps only the
        largest singular values, giving a lower-rank matrix.

    Topic Models
       e.g., LDA assumes a distribution of latent topics in each
        document.
Latent Semantic Analysis
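
A minimal sketch of the idea behind LSA, assuming NumPy is available: take the singular value decomposition of a term-document matrix and keep only the k largest singular values. The matrix values, the term names, and k = 2 are made up for illustration.

```python
import numpy as np

# Hypothetical term-document counts: rows are terms, columns are documents.
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([
    [2, 0, 1, 0],
    [0, 2, 1, 0],
    [1, 1, 2, 0],
    [0, 0, 0, 3],
    [0, 0, 0, 2],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions to keep (an assumption)

# Rank-k approximation of the original matrix.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Coordinates of each document in the k-dimensional latent space.
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T
```

Documents can then be compared by cosine similarity in the reduced space (`doc_coords`) instead of the original high-dimensional term space.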
Latent Dirichlet Allocation
• Assume that the number of words in each document
  (in a corpus) follows some distribution.
• Assume that each document is some mixture
  (more accurately, distribution) of k abstract
  topics.
• Assume that each topic is some mixture (more
  accurately, distribution) of words.

• LDA then tries to backtrack from the documents
  to find a set of topics that are likely to have
  generated the collection.
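
The generative assumptions above can be sketched as a small simulation. This is only the forward (generating) direction, not the inference step that LDA actually performs. The vocabulary, the two topics, and the Dirichlet parameter `alpha` are hypothetical, and the document length is passed in as a fixed number rather than drawn from a distribution, to keep the sketch short.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocab, alpha, n_words, rng):
    """Generate one document under the assumptions listed above."""
    # Each document is a distribution over the k topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position according to theta...
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # ...then pick a word from that topic's distribution over words.
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return theta, words

rng = random.Random(0)
vocab = ["goal", "match", "team", "vote", "party", "election"]
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "sports" topic
    [0.0, 0.0, 0.0, 0.3, 0.3, 0.4],  # hypothetical "politics" topic
]
alpha = [0.5, 0.5]  # assumed Dirichlet prior over the two topics

theta, words = generate_document(topics, vocab, alpha, n_words=8, rng=rng)
```

Fitting LDA means going the other way: given only the observed words of many documents, estimate the topics and the per-document topic mixtures that most plausibly produced them.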
Text Data: Clustering
• To group similar documents together, in an
  unsupervised manner.

• Topic Modelling and several Dimensionality
  Reduction methods are inherently clustering
  approaches.

• Other important approaches are: k-means
  clustering, and hierarchical clustering.
k-means Clustering
• We select k initial points (seeds) and assign every
  other point to the most similar of these k points,
  based on (cosine) similarity.
• The k points are then replaced by the centroids of
  the resulting groups (clusters), and the assignment
  step is repeated until convergence.
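
The two steps above can be sketched as follows. This is a minimal version that uses cosine similarity on length-normalised vectors (sometimes called spherical k-means). The seeds are passed in by hand rather than chosen at random, and the six example points are made up.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(u, v):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(a * b for a, b in zip(u, v))

def kmeans_cosine(points, seeds, max_iter=100):
    points = [normalize(p) for p in points]
    centroids = [normalize(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its most similar centroid.
        new_assignment = [
            max(range(len(centroids)), key=lambda c: cosine(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: replace each centroid by the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assignment, centroids

# Hypothetical document vectors: the first three and last three are similar.
points = [[3, 0, 1], [2, 0, 0], [4, 1, 0], [0, 2, 3], [0, 3, 3], [1, 4, 4]]
labels, centroids = kmeans_cosine(points, seeds=[points[0], points[3]])
# labels → [0, 0, 0, 1, 1, 1]
```

In practice the result depends on the choice of seeds, so the procedure is often run several times from different starting points.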
Text Data: Classification
• The classification problem is to assign one of a set
  of predefined labels to a document in the corpus.

• We assume that a collection of labelled training
  data is available.
Rule-based Classifier
• The data space is modelled with a set of rules. A
  rule can test for the presence (or absence) of
  words (or phrases).

• The rules can be further enriched with the length of
  the neighbourhood, the position of words (or
  phrases), etc.

• Decision tree learning is a kind of rule-based
  classification.
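
A minimal sketch of a rule-based text classifier along these lines. The labels, the words in the rules, and the helper names (`contains`, `absent`, `near`) are all hypothetical. Each rule is a label plus a list of conditions, and the first rule whose conditions all hold determines the label.

```python
def contains(word):
    """Condition: the word is present in the document."""
    return lambda tokens: word in tokens

def absent(word):
    """Condition: the word is not present in the document."""
    return lambda tokens: word not in tokens

def near(word_a, word_b, window):
    """Condition: the two words occur within `window` tokens of each other."""
    def test(tokens):
        pos_a = [i for i, t in enumerate(tokens) if t == word_a]
        pos_b = [i for i, t in enumerate(tokens) if t == word_b]
        return any(abs(i - j) <= window for i in pos_a for j in pos_b)
    return test

# Ordered list of (label, conditions); all conditions of a rule must hold.
RULES = [
    ("billing", [contains("refund")]),
    ("billing", [near("charged", "twice", 2)]),
    ("technical", [contains("error"), absent("refund")]),
]

def classify(text, rules=RULES, default="other"):
    """Return the label of the first matching rule, or `default`."""
    tokens = text.lower().split()
    for label, conditions in rules:
        if all(cond(tokens) for cond in conditions):
            return label
    return default
```

For example, `classify("I was charged twice this month")` returns `"billing"` through the neighbourhood rule, and a text that matches no rule falls back to `"other"`.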
Decision Tree Learning




Decision tree showing survival of passengers on the Titanic
(‘sibsp’ is the number of spouses or siblings aboard). The figures
under the leaves show the probability of survival and the percentage
of observations in the leaf.
Linear Classifier
• Linear classifiers are those for which the output is
  defined as f = A·D + b, where D is a document
  vector, A is a coefficient vector of the same
  dimensionality, and b is a scalar.

• Logistic regression is an example, in which the
  output f is related to the probability of D being in
  some class C (i.e., p(C|D)) via the logit (log-odds)
  function: f = log(p(C|D) / (1 − p(C|D))).
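
To make the relationship concrete, the sketch below computes the linear score f for one document and converts it to a probability by inverting the log-odds. The coefficient vector, bias, and document vector are hypothetical values, not learned from data.

```python
import math

def linear_score(A, D, b):
    """Linear classifier output f = A.D + b."""
    return sum(a * d for a, d in zip(A, D)) + b

def probability(f):
    """Invert the log-odds: p = 1 / (1 + exp(-f))."""
    return 1.0 / (1.0 + math.exp(-f))

A = [1.2, -0.8, 0.5]  # hypothetical coefficients, one per term
b = -0.3              # hypothetical bias
D = [2, 1, 0]         # hypothetical term counts for one document

f = linear_score(A, D, b)  # 1.2*2 - 0.8*1 + 0.5*0 - 0.3 = 1.3
p = probability(f)         # p(C|D); greater than 0.5 because f > 0
log_odds = math.log(p / (1.0 - p))  # recovers f
```

A document is typically assigned to class C when p exceeds 0.5, which is the same as f being positive.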
Query Categorization is used to classify incoming queries
into pre-defined multi-level categories.




                                                    © [24]7, Inc.
Ontology Learning is the process of developing an
ontology in an automated manner.




                                                © Wong, Wilson Yiksen
References
• Mining Text Data, edited by Charu C. Aggarwal &
  ChengXiang Zhai.

• Introduction to Information Retrieval, by
  Christopher D. Manning, Prabhakar Raghavan &
  Hinrich Schütze.

• Pattern Recognition and Machine Learning, by
  Christopher M. Bishop.
