Learning to learn - to retrieve information

Pramit Choudhary,
Sr. DataScientist@DataScience Inc.

Agenda
● Introduction to effective retrieval of information
● Some Natural Language Processing Techniques and concepts
● Build a simple Sentiment Analyzer

Introduction
● Information is nothing without retrieval
● Natural Language Processing techniques can help improve retrieval

Stopwords
● Removing stop-words before processing documents and queries helps in improving
performance
● Function words
○ articles (the, a), pronouns (he/him, she/her), conjunctions (either...or, neither...nor),
auxiliary verbs and more
● High Frequency Words
● Adjust the stop word list with care; some queries are made up largely of stop words
○ to be or not to be
○ Will and Grace
○ On the road again
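A minimal sketch of stop word removal and of why it needs care. The stop list below is a tiny illustrative set and the keep_if_empty fallback is a hypothetical safeguard, both assumed for this example rather than taken from the talk.

```python
# Tiny illustrative stop list; a real system would use a curated list
# (this set is an assumption made for the example).
STOP_WORDS = {"a", "an", "and", "be", "not", "on", "or", "the", "to"}

def remove_stop_words(text, keep_if_empty=True):
    """Drop stop words; optionally fall back to the original tokens
    when every token would otherwise be removed."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    if not kept and keep_if_empty:
        return tokens
    return kept

print(remove_stop_words("The road to the hotel"))
# Without the fallback, this query loses every token.
print(remove_stop_words("to be or not to be", keep_if_empty=False))
# With the fallback, the query survives intact.
print(remove_stop_words("to be or not to be"))
```

Falling back to the unfiltered tokens is only one possible heuristic; another is to keep a list of protected phrases.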

Stemming
● Maps a word to its root by removing morphological affixes
● Different kinds of stemmer
○ Lancaster Stemmer
○ Porter Stemmer - have been quite successful with this one
○ Snowball Stemmer
● Useful when computing similarity scores between documents and queries, or between
similar documents, because stemming yields more common terms to compare in the Vector Space
● The benefit of stemming is subjective
● A better question to ask: how to balance overstemming against understemming
● Non-NLP alternative to stemming could be n-grams
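A minimal sketch of the n-gram alternative mentioned above: two word forms are compared by the overlap of their character trigrams instead of being reduced to a common stem. The function names and the choice of Jaccard overlap are assumptions made for illustration.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word."""
    word = word.lower()
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Related word forms share many trigrams; unrelated words share few or none.
print(ngram_similarity("retrieval", "retrieving"))
print(ngram_similarity("retrieval", "sentiment"))
```

Related forms score high without any language-specific rules, at the cost of a larger index.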

Let’s get our system ready
● Mac/Linux:
○ sudo pip install -U nltk
○ sudo pip install -U numpy
○ sudo pip install -U pandas

Part-of-Speech Tagging (Lexical Categories)
● Assigns syntactic category to each word in a text
● Helps remove ambiguities

Compound and Statistical Phrases
● Use multi-token phrases made of non-stop words, selected by a frequency threshold
● Blindly using n-grams could be expensive
● A mix of single and multi-token works great
● How do we evaluate a query ?
○ Single token - a user searching for York should not get documents
that match only New York
○ Compound token - a user searching for New York should not get
documents that contain only New or only York
● Most of the time, one finds oneself adding tailored heuristics to improve
retrieval accuracy
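A minimal sketch of the single-token versus compound-token behaviour described above. The phrase list, the underscore-joined token format and the matching rule are hypothetical choices for this example; in practice the phrases would come from frequency thresholds over the corpus.

```python
# Hypothetical phrase list (assumed for the example).
PHRASES = {("new", "york")}

def tokenize(text, phrases=PHRASES):
    """Split on whitespace and merge known two-word phrases into one token."""
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in phrases:
            tokens.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

def matches(query, document):
    """A document matches when it contains every query token."""
    return set(tokenize(query)) <= set(tokenize(document))

docs = ["Flights to New York", "The Duke of York", "A new hotel"]
print([d for d in docs if matches("York", d)])
print([d for d in docs if matches("New York", d)])
```

Because "New York" is indexed as one token, neither query leaks into the other's results.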

Chunking - Shallow Parsing
● Is the process of extracting Noun Phrases and Verb Phrases
● Shallow parsing
● Helps in entity recognition
● Relation extraction

Definition
● As per Wikipedia, sentiment analysis “(also known as opinion mining) refers to the use of natural
language processing, text analysis and computational linguistics to identify
and extract subjective information in source materials. Sentiment analysis is
widely applied to reviews and social media for a variety of applications,
ranging from marketing to customer service.”
● To summarize - Identify the polarity of a given text at document level,
sentence level or feature level

Other Names
● Subjectivity Analysis
● Sentiment Mining
● Opinion Mining
● Opinion Extraction

Applications
● Product Review
● Crowd sentiment - how confident do others feel about a product ?
● Market trends - positive sentiment vs negative sentiment
● Movie/Food ratings
● Identifying better deals at stores based on crowd feedback
● Restaurant Reviews

Techniques for Sentiment Analysis
1. Knowledge based approach
a. Relies on knowledge about the presence of words with obvious affect
b. e.g. happy, sad, angry
c. Also uses words that are likely to indicate a particular emotion
2. Statistical Methods
a. Machine Learning Algorithms
i. SVM ( Support Vector Machines )
ii. LSI/LSA ( Latent Semantic Indexing/Analysis )
1. helps in identifying relationships between terms and concepts in a collection of
documents
2. Basic Principle: Words used in the same context have similar meaning
iii. Naive Bayes - a simple classifier
iv. Pointwise Mutual Information
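A minimal sketch of the knowledge based approach from item 1: text is scored by summing the polarity of known affect words. The lexicon and its scores are hypothetical values assumed for the example.

```python
# Hypothetical mini-lexicon: word -> polarity score (assumed values).
LEXICON = {"happy": 1, "great": 1, "love": 1,
           "sad": -1, "angry": -1, "terrible": -1}

def lexicon_polarity(text):
    """Sum the polarity of known words; the sign of the total is the label."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(LEXICON.get(w, 0) for w in words)
    if score > 0:
        return "+"
    if score < 0:
        return "-"
    return "neutral"

print(lexicon_polarity("I love this hotel, the staff made me happy!"))
print(lexicon_polarity("Sad and angry after a terrible stay."))
# No lexicon word appears here, so a pure word-list approach has nothing to go on.
print(lexicon_polarity("They would not let my dog stay in this hotel"))
```

The last sentence contains no affect words at all, which is one reason statistical and hybrid methods are considered alongside word lists.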

Continued ...
2. Statistical Methods
v. Grammatical dependency relations - deep/shallow parsing of text
3. Hybrid Approach
a. Text is non-monotonic in nature
b. Combination of knowledge representation and statistical models
c. Human in the loop
d. Remember we as humans disagree as well

Let’s evaluate
They would not let my dog stay in this hotel
Vs
I would not let my dog stay in this hotel
Can we form an opinion on the quality of the hotel ?
Is the person not happy with the hotel ?
Does ‘They’ refer to staff members of the hotel ?

Evaluation
● How effectively can we answer the question ?
○ How many times did humans agree with results ? - Remember humans
disagree too
● Typically Precision and Recall are used
○ Precision (in Information Retrieval): how many of the selected items were relevant ?
How precise is one’s solution ?
○ Another definition: measures the exactness of a classifier
○ Higher precision means fewer false positives

Evaluation continued ...
● Precision and Recall
○ Recall: Information Retrieval: How many relevant items are selected ?
○ Another definition: Measures the sensitivity of a classifier
○ Lower recall means more false negatives
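A minimal sketch of both measures computed from true and predicted labels, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The label data is made up for illustration.

```python
def precision_recall(y_true, y_pred, positive="+"):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when nothing was selected or relevant.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative labels: 4 truly positive reviews, 3 predicted positive.
y_true = ["+", "+", "+", "+", "-", "-"]
y_pred = ["+", "+", "-", "-", "+", "-"]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here one false positive lowers precision, while two missed positives (false negatives) lower recall.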

Naive Bayes Classifier
● Is a supervised probabilistic learning algorithm
● Is a conditional probability model
● Assumption: independence between the features
● Based on Bayes’ theorem
○ For example, let E (a row of text) be the feature vector $E = (x_1, x_2, \dots, x_n)$
○ E can belong to class $c \in \{+, -\}$
○ Probability of E belonging to class c (+ or -): $p(c \mid E) = \dfrac{p(E \mid c)\, p(c)}{p(E)}$
○ E is assigned to class + if $f(E) = \dfrac{p(c = + \mid E)}{p(c = - \mid E)} \ge 1$
○ Applying the chain rule for conditional probability together with conditional independence:
■ $p(E \mid c) = p(x_1, x_2, \dots, x_n \mid c) = \prod_{i=1}^{n} p(x_i \mid c)$
○ NB Classifier: $f(E) = \dfrac{p(c = +)}{p(c = -)} \prod_{i=1}^{n} \dfrac{p(x_i \mid c = +)}{p(x_i \mid c = -)}$
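A minimal sketch of the ratio f(E) above, computed in log space over word features. The add-one (Laplace) smoothing, the whitespace tokenization and the toy training data are assumptions added for the example; this is not the demo code from the talk.

```python
import math
from collections import Counter

def train(examples):
    """examples: list of (text, label) pairs with label '+' or '-'.
    Both classes must appear at least once."""
    word_counts = {"+": Counter(), "-": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["+"]) | set(word_counts["-"])
    return word_counts, class_counts, vocab

def log_f(text, model):
    """log f(E) = log p(+)/p(-) + sum_i log p(x_i|+)/p(x_i|-)."""
    word_counts, class_counts, vocab = model
    score = math.log(class_counts["+"] / class_counts["-"])
    totals = {c: sum(word_counts[c].values()) for c in "+-"}
    for word in text.lower().split():
        if word not in vocab:
            continue  # ignore words never seen in training
        # Add-one (Laplace) smoothing avoids zero probabilities.
        p_pos = (word_counts["+"][word] + 1) / (totals["+"] + len(vocab))
        p_neg = (word_counts["-"][word] + 1) / (totals["-"] + len(vocab))
        score += math.log(p_pos / p_neg)
    return score

def classify(text, model):
    # f(E) >= 1 is equivalent to log f(E) >= 0.
    return "+" if log_f(text, model) >= 0 else "-"

# Toy training data, made up for illustration.
training = [
    ("great show loved it", "+"),
    ("brilliant and great acting", "+"),
    ("boring show hated it", "-"),
    ("terrible and boring plot", "-"),
]
model = train(training)
print(classify("great acting", model))
print(classify("boring plot", model))
```

Working with logarithms turns the product into a sum and avoids numeric underflow on long texts.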

Tree Representation
[Figure: tree representation of a review of Black Mirror; Class(C) = +ve / -ve]

Demo
Simple Sentiment Analyzer

References
● Tomek Strzalkowski and Barbara Vauthey, "Information Retrieval Using Robust Natural Language Processing", 1992
● NLTK: http://www.nltk.org/
● Extracting Information from Text (a must read)
● Wikipedia, of course
