Learning to Learn -
to retrieve information
Pramit Choudhary,
Sr. DataScientist@DataScience Inc.
Agenda
● Introduction to effective retrieval of information
● Some Natural Language Processing Techniques and concepts
● Build a simple Sentiment Analyzer
Introduction
● Information is nothing without retrieval
● Natural Language Processing techniques can help improve retrieval
Understand Ranking
Stopwords
● Removing stop words before processing documents and queries helps improve performance
● Function words
○ articles (the, a), pronouns (he/him, she/her), conjunctions (either...or, neither...nor),
auxiliary verbs and more
● High Frequency Words
● Handle with care when adjusting the stop-word list: some queries are made up largely of stop words
○ to be or not to be
○ Will and Grace
○ On the road again
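A minimal sketch of why the stop-word list needs care. The list below is a small hand-picked assumption, not NLTK's list, and `remove_stop_words` is a hypothetical helper that only lowercases and splits on whitespace.

```python
# Hypothetical, hand-picked stop-word list; real lists are much longer.
STOP_WORDS = {
    "a", "an", "the", "to", "be", "or", "not", "and",
    "on", "of", "in", "is", "will", "again",
}


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop every stop word."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


queries = [
    "to be or not to be",
    "Will and Grace",
    "On the road again",
    "the history of information retrieval",
]

# Map each query to the tokens that survive filtering.
filtered = {query: remove_stop_words(query) for query in queries}
for query, tokens in filtered.items():
    print(f"{query!r} -> {tokens}")
```

With this list the first query loses every token, so an index built after filtering could never match it.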
Stemming
● Maps a word to its root by removing morphological affixes
● Different kinds of stemmers
○ Lancaster Stemmer
○ Porter Stemmer - I have been quite successful with this one
○ Snowball Stemmer
● Useful when computing similarity scores between documents and queries, or between
similar documents, because stemming leaves more common words to compare in the vector space
● The benefit of stemming is subjective
● A better question to ask: how to avoid both overstemming and understemming
● A non-NLP alternative to stemming could be n-grams
Let’s get our system ready
● Mac/Linux:
○ sudo pip install -U nltk
○ sudo pip install -U numpy
○ sudo pip install -U pandas
Stemming Example
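A minimal sketch of the three stemmers named earlier, assuming NLTK is installed as in the setup slide. The word list is an illustrative assumption; none of these stemmers needs a corpus download.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Illustrative words that share a root.
words = ["connected", "connecting", "connection", "running"]

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

# Stem every word with every stemmer so the results can be compared.
stems = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}
for name, result in stems.items():
    print(name, result)
```

Comparing the three rows on your own vocabulary is a quick way to spot overstemming or understemming.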
Parts of Speech Tagging (Lexical Categories)
● Assigns a syntactic category to each word in a text
● Helps remove ambiguities
Parts Of Speech
The Treebank tag set might come in handy
POS Example
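A minimal sketch of how POS tags separate two uses of the same word. The sentences and their tags are hand-written assumptions in word/TAG form; a trained tagger such as `nltk.pos_tag` would produce such pairs itself but needs a model download first. `parse_tagged` and `tags_for` are hypothetical helpers.

```python
from nltk.tag import str2tuple

# Hand-tagged sentences (Penn Treebank style tags), written as word/TAG strings.
tagged_text = {
    "verb": "They/PRP book/VBP a/DT room/NN",
    "noun": "She/PRP reads/VBZ a/DT book/NN",
}


def parse_tagged(sentence):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    return [str2tuple(token) for token in sentence.split()]


def tags_for(word, pairs):
    """Collect the tags assigned to a given word, ignoring case."""
    return [tag for token, tag in pairs if token.lower() == word]


parsed = {name: parse_tagged(sentence) for name, sentence in tagged_text.items()}
book_as_verb = tags_for("book", parsed["verb"])
book_as_noun = tags_for("book", parsed["noun"])
print(book_as_verb, book_as_noun)
```

The surface form "book" is identical in both sentences; only the tag tells a retrieval system whether it is an action or an object.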
Compound and Statistical Phrases
● Use multi-token phrases made of non-stop words, kept only above a frequency threshold
● Blindly using n-grams could be expensive
● A mix of single-token and multi-token terms works well
● How do we evaluate a query?
○ Single token - a search for York should not return documents
that match only because they contain New York
○ Compound token - a search for New York should not return
documents that contain only New or only York
● Most of the time, one finds oneself adding tailored heuristics to improve
retrieval accuracy
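A minimal sketch of mixing single and compound tokens. The phrase list is a hypothetical stand-in for frequency-thresholded phrases, and `tokenize` and `search` are illustrative helpers, not part of any library.

```python
# Hypothetical phrase list; in practice it would come from frequent n-grams.
COMPOUNDS = {("new", "york")}


def tokenize(text, compounds=COMPOUNDS):
    """Split on whitespace, merging known two-word phrases into one token."""
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in compounds:
            tokens.append("_".join(pair))  # e.g. "new_york"
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def search(query, documents):
    """Return the documents that contain every token of the query."""
    query_tokens = set(tokenize(query))
    return [doc for doc in documents if query_tokens <= set(tokenize(doc))]


documents = [
    "flights to New York",
    "the Duke of York",
    "a new bridge in the city",
]
york_hits = search("York", documents)
new_york_hits = search("New York", documents)
print(york_hits, new_york_hits)
```

Because "New York" is indexed as one token, neither query leaks into the other's results.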
Chunking - Shallow Parsing
● The process of extracting noun phrases and verb phrases from text
● Also called shallow parsing
● Helps in entity recognition
● Helps in relation extraction
Chunking Example
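A minimal sketch of noun-phrase chunking with NLTK's `RegexpParser`. The sentence is pre-tagged by hand so no tagger model is needed, and the one-rule grammar (optional determiner, any adjectives, then a noun) is an illustrative assumption.

```python
from nltk import RegexpParser

# Hand-tagged sentence as (word, tag) pairs.
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# NP chunk: optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = RegexpParser(grammar)
tree = chunker.parse(tagged_sentence)

# Collect the words of every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

The extracted phrases are natural candidates for entities, and the verb between them hints at a relation.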
SENTIMENT ANALYSIS
Definition
● As per wiki “(also known as opinion mining) refers to the use of natural
language processing, text analysis and computational linguistics to identify
and extract subjective information in source materials. Sentiment analysis is
widely applied to reviews and social media for a variety of applications,
ranging from marketing to customer service.”
● To summarize - Identify the polarity of a given text at document level,
sentence level or feature level
Other Names
● Subjectivity Analysis
● Sentiment Mining
● Opinion Mining
● Opinion Extraction
Applications
● Product Review
● Crowd sentiment - how confident do others feel about a product ?
● Market trends - positive sentiment vs negative sentiment
● Movie/Food ratings
● Identifying better deals at stores based on crowd feedback
● Restaurant Reviews
Techniques for Sentiment Analysis
1. Knowledge based approach
a. Knowledge about the presence of words with obvious affect
b. happy, sad, angry
c. Words with a probable affinity to a particular emotion
2. Statistical Methods
a. Machine Learning Algorithms
i. SVM ( Support Vector Machines )
ii. LSI/LSA ( Latent Semantic Analysis/Indexing )
1. Helps identify relationships between terms and concepts in a collection of
documents
2. Basic Principle: Words used in the same context have similar meaning
iii. Naive Bayes - a simple classifier
iv. Pointwise Mutual Information
Continued ...
2. Statistical Methods
v. Grammatical dependency relations - deep/shallow parsing of text
3. Hybrid Approach
a. Text is non-monotonic in nature
b. Combination of knowledge representation and statistical models
c. Human in the loop
d. Remember we as humans disagree as well
Let’s evaluate
They would not let my dog stay in this hotel
Vs
I would not let my dog stay in this hotel
Can we form an opinion on the quality of the hotel ?
Is the person not happy with the hotel ?
Does ‘They’ refer to staff members of the hotel ?
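A minimal sketch of a purely knowledge-based scorer applied to the two hotel sentences. The lexicon, the negation list and `lexicon_score` are hypothetical and deliberately tiny.

```python
# Hypothetical mini-lexicon: +1 for positive words, -1 for negative ones.
LEXICON = {"happy": 1, "great": 1, "clean": 1, "sad": -1, "angry": -1, "dirty": -1}
NEGATIONS = {"not", "never", "no"}


def lexicon_score(text):
    """Sum word polarities, flipping the sign of a word that directly follows a negation."""
    score, negate = 0, False
    for token in text.lower().split():
        if token in NEGATIONS:
            negate = True
            continue
        polarity = LEXICON.get(token, 0)
        score += -polarity if negate else polarity
        negate = False
    return score


they_score = lexicon_score("They would not let my dog stay in this hotel")
i_score = lexicon_score("I would not let my dog stay in this hotel")
print(they_score, i_score, lexicon_score("not happy"))
```

Both sentences get the same score because neither contains a lexicon word, even though a human reads them very differently. That gap is one motivation for the hybrid approach.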
Evaluation
● How effectively can we answer the question ?
○ How many times did humans agree with results ? - Remember humans
disagree too
● Typically, precision and recall are used
○ Precision (information retrieval): how many of the selected items were relevant?
How precise is the solution?
○ Another definition: measures the exactness of a classifier
○ Higher precision means fewer false positives
Evaluation continued ...
● Precision and Recall
○ Recall (information retrieval): how many of the relevant items were selected?
○ Another definition: measures the sensitivity of a classifier
○ Lower recall means more false negatives
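A minimal sketch of computing both measures for the positive class. The labels below are made-up illustrative data, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, actual, positive="+"):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up gold labels and classifier output for eight reviews.
actual = ["+", "+", "+", "+", "-", "-", "-", "-"]
predicted = ["+", "+", "-", "-", "+", "-", "-", "-"]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predicted positives are correct, while only two of the four real positives were found.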
Famous Picture
Naive Bayes Classifier
● Is a supervised probabilistic learning algorithm
● Is a conditional probability model
● Assumption: independence between the features
● Based on Bayes’ theorem
○ For example, define E (a row of text) as a feature vector: $E = (x_1, x_2, \ldots, x_n)$
○ E belongs to one of two classes, $c \in \{+, -\}$
○ Probability of E belonging to class c: $p(c \mid E) = \frac{p(E \mid c)\, p(c)}{p(E)}$
○ E is assigned to class + if $f(E) = \frac{p(c = + \mid E)}{p(c = - \mid E)} \geq 1$
○ Applying the chain rule for conditional probability together with conditional independence:
■ $p(E \mid c) = p(x_1, x_2, \ldots, x_n \mid c) = \prod_{i=1}^{n} p(x_i \mid c)$
○ NB classifier: $f(E) = \frac{p(c = +)}{p(c = -)} \prod_{i=1}^{n} \frac{p(x_i \mid c = +)}{p(x_i \mid c = -)}$
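A minimal sketch of the ratio f(E) as a word-level sentiment classifier, computed in log space. The training data is a made-up toy set, and add-one smoothing is an assumption added here so that unseen word/class pairs do not produce zero probabilities.

```python
import math
from collections import Counter

# Hypothetical toy training data: (text, class) pairs.
TRAIN = [
    ("great acting and a great story", "+"),
    ("clever and gripping", "+"),
    ("dull story and bad acting", "-"),
    ("bad and boring", "-"),
]


def train(examples):
    """Count words per class and documents per class."""
    word_counts = {"+": Counter(), "-": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["+"]) | set(word_counts["-"])
    return word_counts, class_counts, vocabulary


def log_ratio(text, model):
    """log f(E) = log p(+)/p(-) + sum_i log p(x_i|+)/p(x_i|-), with add-one smoothing."""
    word_counts, class_counts, vocabulary = model
    score = math.log(class_counts["+"] / class_counts["-"])
    totals = {c: sum(word_counts[c].values()) for c in ("+", "-")}
    for word in text.lower().split():
        if word not in vocabulary:
            continue  # ignore words never seen in training
        p_pos = (word_counts["+"][word] + 1) / (totals["+"] + len(vocabulary))
        p_neg = (word_counts["-"][word] + 1) / (totals["-"] + len(vocabulary))
        score += math.log(p_pos / p_neg)
    return score


def classify(text, model):
    """f(E) >= 1 is equivalent to log f(E) >= 0."""
    return "+" if log_ratio(text, model) >= 0 else "-"


model = train(TRAIN)
positive_label = classify("a gripping story", model)
negative_label = classify("boring and bad", model)
print(positive_label, negative_label)
```

Working with the log of the ratio avoids numeric underflow when many small probabilities are multiplied.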
Tree Representation
[Figure: tree representation for a review of Black Mirror; Class(C) = +ve / -ve]
Demo
Simple Sentiment Analyzer
QnA
@MaverickPramit
References
● Information Retrieval using robust Natural Language Processing - Tomek
Strzalkowski and Barbara Vauthey, '92
● NLTK: http://www.nltk.org/
● Extracting Information from text (a must read)
● Wikipedia, of course
