Chapter Two: Text Operations
Introduction to Information Storage and Retrieval
Chapter Objectives
• At the end of this chapter students will be
able to
– Identify major steps in text operation
– Understand challenges in text operation
– Differentiate content-bearing and function terms
– Tokenize text into index terms
– Eliminate stopwords from text
– Understand stemming and stemming
algorithms
– Understand the need for thesaurus
02: Text Operation 2
Text Operations
• Five text operations
1. Lexical analysis
– Objective is to handle digits, hyphens, punctuation marks, and case of
letters
2. Stop word elimination
– Objective is filtering out words with very low discriminating power for
retrieval purpose
3. Stemming
– Removing affixes to allow retrieval of documents with syntactical variation
of query terms
4. Term selection
– Select index terms that carry more semantic content
5. Thesaurus construction
– Term categorization to allow expansion of query with related terms
02: Text Operation 3
Text Operations …
• Not all words in a document are equally significant to represent the
contents/meanings of a document
– Some words carry more meaning than others
– Nouns are usually the most representative of a document's content
• Therefore, need to preprocess the text of a document in a collection
to be used as index terms
• Using the set of all words in a collection to index documents creates
too much noise for the retrieval task
– Reducing noise means reducing the number of words that can be used to refer to the
document
• Preprocessing is the process of controlling the size of the
vocabulary or the number of distinct words used as index terms
– Preprocessing will lead to an improvement in the information retrieval
performance
• However, some search engines on the Web omit preprocessing
– Every word in the document is an index term
02: Text Operation 4
Generating Document Representatives
 Text Processing System
 Input text – full text, abstract or title
 Output – a document representative adequate for use in an
automatic retrieval system
 The document representative consists of a list of class
names, each name representing a class of words
occurring in the total input text. A document will be indexed
by a name if one of its significant words occurs as a member of that class.
Documents → Tokenization → Stop word elimination → Stemming → Thesaurus → Index Terms
02: Text Operation 5
Tokenization/Lexical Analysis
 Change text of the documents into words to be adopted
as index terms
 Objective - identify words in the text
 Digits, hyphens, punctuation marks, case of letters
 Numbers alone are usually not good index terms (like 1910, 1999),
but combinations such as 510 B.C. are unique and may be worth keeping
 Hyphens – usually break up hyphenated words (e.g. state-of-the-art →
state of the art), but some words, e.g. gilt-edged, B-49, are
unique words which require their hyphens
 gilt-edged and gilt edged do not mean the same thing; similarly,
B-49 refers to a Soviet submarine while B49 refers to a city bus route
in the US
 Punctuation marks – remove them entirely unless they are
significant, e.g. in program code, x.exe and xexe are not the same
 Case of letters – usually not important; convert everything to a single case (typically lowercase)
02: Text Operation 6
Tokenization …
• Analyze text into a sequence of discrete tokens (words).
• Input: “Friends, Romans and Countrymen”
• Output: Tokens (an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing)
– Friends
– Romans
– and
– Countrymen
• Each such token is now a candidate for an index entry,
after further processing
• But what are valid tokens to produce?
02: Text Operation 7
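A minimal sketch of such a tokenizer in Python (the regular expression and behaviour are illustrative assumptions, not the only valid design):

```python
import re

def tokenize(text):
    """Split text into lowercase alphabetic tokens, dropping punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Friends, Romans and Countrymen"))
# ['friends', 'romans', 'and', 'countrymen']
```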
Issues in Tokenization
One word or multiple: How do you decide whether it is one
token or two or more?
Splitting on white space can also split what
should be regarded as a single token. This
occurs most commonly with names (San
Francisco, Los Angeles)
 Hewlett-Packard → Hewlett and Packard as two tokens?
 state-of-the-art: break up hyphenated sequence.
 San Francisco, Los Angeles
 lowercase, lower-case, lower case ?
 data base, database, data-base
• Cases with internal spaces that we might wish to
regard as a single token include phone numbers
((800) 234-2333) and dates (Mar 11, 1983)
• Splitting tokens on spaces can cause bad retrieval
02: Text Operation 8
Issues in Tokenization …
• The problems of hyphens and non-separating whitespace can even
interact; Advertisements for air fares frequently contain items like
San Francisco-Los Angeles, where simply doing whitespace splitting
would give unfortunate results
• Elimination of periods is problematic for items such as
 IP addresses (100.2.86.144)
How to handle special cases involving apostrophes,
hyphens etc? C++, C#, URLs, emails, …
 Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a
token, though usually they are not
 Simplest approach is to ignore all numbers and
punctuation and use only case-insensitive unbroken
strings of alphabetic characters as tokens
Generally, don’t index numbers as text
Meta-data, including creation date, format, etc., is often indexed
separately
Issues of tokenization are language specific
02: Text Operation 9
Elimination of STOPWORD
Stopwords are extremely common words across document
collections that have no discriminatory power
 They may occur in 80% of the documents in a collection.
 They would appear to be of little value in helping select documents
matching a user need and need to be filtered out of the index list
Examples of stopword:
 articles (a, an, the);
 pronouns: (I, he, she, it, their, his)
 prepositions (on, of, in, about, besides, against),
 conjunctions/ connectors (and, but, for, nor, or, so, yet),
 verbs (is, are, was, were),
 adverbs (here, there, out, because, soon, after) and
 adjectives (all, any, each, every, few, many, some)
Stopwords are language dependent
02: Text Operation 10
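A minimal stopword-filtering sketch, assuming a small hand-built stopword set (real systems use longer, language-specific lists):

```python
STOPWORDS = {"a", "an", "the", "i", "he", "she", "it", "on", "of", "in",
             "and", "but", "or", "is", "are", "was", "were"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cat", "slept", "in", "the", "living", "room"]))
# ['cat', 'slept', 'living', 'room']
```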
Elimination of STOPWORD …
Stop word elimination used to be standard in older IR
systems
But the trend is away from doing this. Most web search
engines index stop words:
Good query optimization techniques mean you pay little
at query time for including stop words.
You need stopwords for:
 Phrase queries: “King of Denmark”
 Various song titles, etc.: “Let it be”, “To be or not to be”
 “Relational” queries: “flights to London”
 Elimination of stopwords might reduce recall (e.g. “To
be or not to be” – all eliminated except “be” – no or
irrelevant retrieval)
02: Text Operation 11
How to determine a list of stopwords?
Intuition:
Stopwords have little semantic content; it is typical to remove
such high-frequency words
Stopwords take up about 50% of the text; hence, document size
reduces by 30-50% when stopwords are eliminated
 One method: Sort terms (in decreasing order) by collection
frequency and take the most frequent ones
In a collection about insurance practices, “insurance” would be a
stop word
Another method: Build a stop word list that contains a set of
articles, pronouns, etc.
 Why do we need stop lists: with a stop list, we can exclude
the most common words from the index terms entirely
 With the removal of stopwords, we can get a better
approximation of term importance for classification,
summarization, etc.
02: Text Operation 12
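A sketch of the first method above (sorting terms by collection frequency and taking the most frequent ones); the tiny example collection is hypothetical:

```python
from collections import Counter

def top_frequency_stopwords(documents, n=20):
    """Sort terms by collection frequency and return the n most frequent ones."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    return [term for term, _ in counts.most_common(n)]

docs = ["the insurance policy covers the car",
        "the insurance claim for the car was filed"]
print(top_frequency_stopwords(docs, n=3))  # e.g. ['the', 'insurance', 'car']
```

Note how a domain word like "insurance" surfaces alongside genuine function words, which is exactly the caveat mentioned above.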
Normalization
• Normalization is standardizing tokens so that matches occur
despite superficial differences in the character
sequences of the tokens
 Need to “normalize” terms in indexed text as well as
query terms into the same form
 Example: We want to match U.S.A. and USA, by
deleting periods in a term
Case Folding: Often best to lower case everything,
since users will use lowercase regardless of
‘correct’ capitalization…
 Republican vs. republican
 Fasil vs. fasil vs. FASIL
02: Text Operation 13
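A minimal normalization sketch covering the two cases above (period deletion and case folding); the function name and exact rules are illustrative:

```python
def normalize(token):
    """Case-fold and delete periods so that U.S.A. and USA map to one form."""
    return token.replace(".", "").lower()

print(normalize("U.S.A."))      # 'usa'
print(normalize("USA"))         # 'usa'
print(normalize("Republican"))  # 'republican'
```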
Normalization issues
 Good for
 Allow instances of Automobile at the beginning of a
sentence to match with a query of automobile
 Helps a search engine when most users type ferrari
when they are interested in a Ferrari car
 Bad for
 Proper names vs. common nouns
 E.g. General Motors, Associated Press, …
 Solution:
 lowercase only words at the beginning of the sentence
 In IR, lowercasing is most practical because of the way
users issue their queries
02: Text Operation 14
Stemming
• Stemming is the conflation of variant forms of a
word into a single representation, the stem,
semantically related to the variants
– The words connect, connects, connected, connecting,
connectivity, connection can be stemmed into connect
• The stem does not need to be a valid word, but it should
capture the meaning of the word
• Stemming aims to increase the effectiveness of
information retrieval, in particular recall, whereby more of the
relevant documents in the collection are retrieved
• It is also used to reduce the size of index files: since a
single stem typically corresponds to several full
terms, storing stems instead of terms can achieve a
compression factor of about 50 percent
• Conflation of words, or so-called stemming, can be done either
manually or automatically; the automatic approaches are outlined below
02: Text Operation 15
Term conflation
• One of the problems involved in the use of free
text for indexing and retrieval is the variation
in word forms that is likely to be encountered
• The most common types of variations are
– spelling errors (father, fathor)
– alternative spellings, i.e. local or national usage
(color vs colour, labor vs labour)
– multi-word concepts (database, data base)
– affixes (dependent, independent, dependently)
– abbreviations (e.g. "i.e." for "that is")
02: Text Operation 16
Conflation or stemming methods
• Automatic approaches:
– Affix removal method (longest match; simple removal)
– Table lookup method
– Successor variety method
– n-gram method
02: Text Operation 17
Affix removal
• Affix removal method removes suffix or prefix
from the words so as to convert them into a
common stem form
• Most of the stemmers that are currently used use
this type of approach for conflation
• The affix removal method is based on two
principles: one is iteration and the other is
longest match
• An iterative stemming algorithm is simply a
recursive procedure, as its name implies, which
removes strings in each order-class one at a time
02: Text Operation 18
Affix removal
• Iteration is usually based on the fact that suffixes
are attached to stems in a certain order, that is,
there exist order-classes of suffixes
• The longest-match principle states that within any
given class of endings, if more than one ending
provides a match, the one which is longest should
be removed
• The first stemmer based on this approach is the one
developed by Lovins (1968); MF Porter (1980) also
used this method
• However, Porter’s stemmer is more compact and has become the most widely used
02: Text Operation 19
Table lookup approach
• Store terms and their corresponding stems in a table
• Stemming is then done via lookups in the table
• One way to do stemming is to store a table of all index
terms and their stems
• Terms from queries and indexes could then be stemmed
via table lookup
• Problems with this approach
– building these lookup tables requires extensive
work on the language
– there is some probability that these tables will
miss some exceptional cases
– storage overhead for such a table
02: Text Operation 20
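A sketch of table-lookup stemming with a hypothetical, hand-filled table; the last call shows the coverage problem noted above:

```python
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}

def lookup_stem(term):
    """Return the stored stem, or the term unchanged if it is not in the table."""
    return STEM_TABLE.get(term.lower(), term.lower())

print(lookup_stem("engineered"))  # 'engineer'
print(lookup_stem("retrieval"))   # 'retrieval' (not covered by the table)
```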
Successor Variety approach
• Determine word and morpheme boundaries based on the distribution of
phonemes in a large body of utterances
• The successor variety of a string is the number of different characters that follow
it in words in some body of text
• The successor variety of substrings of a term will decrease as more characters
are added until a segment boundary is reached
Test Word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,
READING, READS, RED, ROPE, RIPE
Prefix     Successor Variety   Letters
R          3                   E, I, O
RE         2                   A, D
REA        1                   D
READ       3                   A, I, S
READA      1                   B
READAB     1                   L
READABL    1                   E
READABLE   1                   (blank)
02: Text Operation 21
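A small sketch that reproduces the successor-variety counts in the table above for the prefixes of READABLE (end-of-word, the "(blank)" row, is not counted here):

```python
def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` across the corpus words."""
    successors = {w[len(prefix)] for w in corpus
                  if w.startswith(prefix) and len(w) > len(prefix)}
    return len(successors), sorted(successors)

corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]
word = "READABLE"
for i in range(1, len(word) + 1):
    prefix = word[:i]
    print(prefix, successor_variety(prefix, corpus))
# R (3, ['E', 'I', 'O']), RE (2, ['A', 'D']), READ (3, ['A', 'I', 'S']), ...
```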
n-gram stemmers
• Another method of conflating terms is called the shared digram method
• A digram is a pair of consecutive letters
• Besides digrams we can also use trigrams and hence it is called n-gram method
in general
• Association measures are calculated between pairs of terms based on shared
unique digrams
statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti
• Dice’s coefficient (similarity): S = 2C / (A + B)
• A and B are the numbers of unique digrams in the first and the second words;
C is the number of unique digrams shared by the two words
• For the example above: S = (2 × 6) / (7 + 8) = 12 / 15 = 0.80
02: Text Operation 22
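The digram extraction and Dice computation above can be sketched directly (the helper names are illustrative):

```python
def unique_digrams(word):
    """Set of unique adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice's coefficient 2C / (A + B) over unique digrams."""
    da, db = unique_digrams(a), unique_digrams(b)
    return 2 * len(da & db) / (len(da) + len(db))

print(sorted(unique_digrams("statistics")))   # ['at', 'cs', 'ic', 'is', 'st', 'ta', 'ti']
print(sorted(unique_digrams("statistical")))  # ['al', 'at', 'ca', 'ic', 'is', 'st', 'ta', 'ti']
print(dice("statistics", "statistical"))      # 0.8
```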
n-gram stemmers …
• Similarity measures are determined for all pairs
of terms in the database, forming a similarity
matrix
• Once such similarities are computed for all the word
pairs, the words are clustered into groups
• The high value of the Dice coefficient hints
that this pair of words shares a common stem
02: Text Operation 23
Criteria for judging stemmers
• Correctness
– Overstemming: too much of a term is removed.
• Can cause unrelated terms to be conflated ➔ retrieval of
non-relevant documents
• Over-stemming is when two words with different stems
are stemmed to the same root. This is also known as a
false positive
– Understemming: too little of a term is removed.
• Prevent related terms from being conflated ➔ relevant
documents may not be retrieved
• Under-stemming is when two words that should be
stemmed to the same root are not. This is also known as a
false negative
Term      Stem
GASES     GAS (correct)
GAS       GA (over-stemming)
GASEOUS   GASE (under-stemming)
02: Text Operation 24
Assumptions in Stemming
• Words with the same stem are semantically
related and have the same meaning to the
user of the text
• The chance of matching increases when the
index terms are reduced to their word stems,
because a user may just as well search using
“retrieve” as “retrieving”
02: Text Operation 25
• Of the four types of stemming strategies (affix
removal, table
lookup, successor variety, and n-grams) which is
preferable?
• Table lookup consists simply of looking for the
stem of a word in a table, a simple procedure but
dependent on data on stems for the whole
language. Since such data is not readily available
and might require considerable storage space, this
type of stemming algorithm might not be practical.
• Successor variety stemming is based on the
determination of morpheme boundaries, uses
knowledge from structural linguistics, and is more
complex than affix removal
02: Text Operation 26
Porter Stemmer
 Porter stemmer is the most popular affix (suffix) removal algorithm, for its simplicity
and elegance
• Despite being simpler, the Porter algorithm yields results comparable to those of the
more sophisticated algorithms
• The Porter algorithm uses a suffix list for suffix stripping; the idea is to apply a
series of rules to the suffixes of the words in the text. For instance, the suffix s is
replaced by nil as shown below, converting plural into singular:
s → ∅
• When more than one rule could apply, the rule whose left-hand side matches the
longest suffix is used:
sses → ss
s → ∅
• Applied to the word stresses, this yields the stem stress instead of the stem stresse.
• A detailed description of the Porter algorithm can be found in the appendix of the
text book and its implementation at
http://tartarus.org/~martin/PorterStemmer/index.html
02: Text Operation 27
Porter stemmer
Most common algorithm for stemming English words to
their common grammatical root
It is a simple procedure for removing known affixes in
English without using a dictionary. To get rid of plurals
the following rules are used:
 SSES → SS    caresses → caress
 IES → Y      ponies → pony
 SS → SS      caress → caress
 S → ∅        cats → cat
 MENT → ∅     (delete the suffix only if what remains is long enough)
               replacement → replace
               cement → cement
02: Text Operation 28
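A sketch of just these plural rules, applied in longest-match order; this is only a tiny fragment of the full Porter algorithm and follows the slide's simplified IES → Y rule:

```python
def strip_plural(word):
    """Apply the simplified plural rules in longest-match order."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]        # caresses -> caress
    if w.endswith("ies"):
        return w[:-3] + "y"  # ponies -> pony (simplified; full Porter uses IES -> I)
    if w.endswith("ss"):
        return w             # caress -> caress
    if w.endswith("s"):
        return w[:-1]        # cats -> cat
    return w

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_plural(w))
```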
Thesauri
 Full-text searching often cannot be accurate, since different
authors may select different words to represent the same
concept
 Problem: The same meaning can be expressed using different
terms that are synonyms, homonyms, and related terms
 How can we ensure that the same terms are used in the index
and the query when the same meaning is intended?
 Thesaurus: The vocabulary of a controlled indexing language,
formally organized so that a priori relationships between
concepts (for example as "broader" and “related") are made
explicit.
 A thesaurus contains terms and relationships between terms
 IR thesauri rely typically upon the use of symbols such as
USE/UF (UF=used for), BT, and RT to demonstrate inter-
term relationships.
e.g., car = automobile, truck, bus, taxi, motor vehicle
02: Text Operation 29
Aim of Thesaurus
A thesaurus tries to control the use of the vocabulary by
showing a set of related words to handle synonyms and
homonyms
The aims of a thesaurus are therefore:
 to provide a standard vocabulary for indexing and searching
A thesaurus rewrites terms into equivalence classes, and we index
such equivalence classes
When the document contains automobile, index it under car as
well (usually, also vice-versa)
 to assist users with locating terms for proper query
formulation: When the query contains automobile, look
under car as well for expanding query
 to provide classified hierarchies that allow the broadening
and narrowing of the current request according to user needs
02: Text Operation 30
Thesaurus Construction
• Example: thesaurus built to assist IR for
searching cars and vehicles :
Term: Motor vehicles
UF : Automobiles, Cars, Trucks
BT: Vehicles
RT: Road Engineering, Road Transport
• Example: thesaurus built to assist IR in the fields of
computer science:
TERM: natural languages
– UF natural language processing (UF=used for
NLP)
– BT languages (BT=broader term is languages)
– TT languages (TT = top term is languages)
02: Text Operation 31
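A toy query-expansion sketch using a hypothetical mini-thesaurus keyed by preferred term (the entries echo the examples above):

```python
# Preferred term -> equivalent / related terms (UF / RT entries)
THESAURUS = {
    "motor vehicles": ["automobiles", "cars", "trucks"],
    "natural languages": ["natural language processing"],
}

def expand_query(terms):
    """Add the thesaurus entries of any query term that has them."""
    expanded = list(terms)
    for t in terms:
        expanded.extend(THESAURUS.get(t.lower(), []))
    return expanded

print(expand_query(["motor vehicles", "safety"]))
# ['motor vehicles', 'safety', 'automobiles', 'cars', 'trucks']
```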
Language-specificity
• Many of the above features embody
transformations that are
– Language-specific and
– Often, application-specific
• These are “plug-in” addenda to the indexing
process
• Both open source and commercial plug-ins
are available for handling these
02: Text Operation 32
Statistical Properties of Text
• How is the frequency of different words
distributed? Refer to Zipf’s law
• How fast does vocabulary size grow with the
size of a corpus? Refer to Heaps’ law
– Such factors affect the performance of IR system
& can be used to select suitable term weights &
other aspects of the system.
• A few words are very common.
– 2 most frequent words (e.g. “the”, “of”) can
account for about 10% of word occurrences in a
document.
• Most words are very rare.
– Half the words in a corpus appear only once; such words are
called hapax legomena (Greek for “read only once”)
02: Text Operation 33
Sample Word Frequency Data
02: Text Operation 34
Zipf’s distributions
For all the words in a document collection, for each word w
• f : is the frequency that w appears
• r : is rank of w in order of frequency.
(The most commonly occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies (frequency f vs. rank r) according to Zipf’s law; a word w has rank r and frequency f]
02: Text Operation 35
Word distribution: Zipf's Law
• Zipf's Law- named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
– attempts to capture the distribution of the frequencies
(i.e., number of occurrences) of the words within a text
• Zipf's Law states that when the distinct words in a text
are arranged in decreasing order of their frequency of
occurrence (most frequent words first), the occurrence
characteristics of the vocabulary can be characterized by
the constant rank-frequency law of Zipf:
Frequency * Rank = constant
• If the words, w, in a collection are ranked, r, by their
frequency, f, they roughly fit the relation: r * f = c
– Different collections have different constants c
02: Text Operation 36
Example: Zipf's Law
 The table shows the most frequently occurring words
from a 336,310-document collection containing 125,720,891
total words, of which 508,209 are unique words
02: Text Operation 37
Zipf’s law: modeling word distribution
• The collection frequency of the ith most common term is
proportional to 1/i
– If the most frequent term occurs f1 then the second
most frequent term has half as many occurrences, the
third most frequent term has a third as many, etc
• Zipf's Law states that the frequency of the i-th most
frequent word is 1/i^θ times that of the most frequent
word
– occurrence of some event (P), as a function of the
rank (i) when the rank is determined by the frequency
of occurrence, is a power-law function Pi ~ 1/i^θ with
the exponent θ close to unity
– equivalently, fi ∝ 1/i
02: Text Operation 38
More Example: Zipf’s Law
• Illustration of Rank-Frequency Law. Let the total number of word
occurrences in the sample N = 1,000,000
Rank (R)   Term   Frequency (F)   R·(F/N)
1          the    69,971          0.070
2          of     36,411          0.073
3          and    28,852          0.086
4          to     26,149          0.104
5          a      23,237          0.116
6          in     21,341          0.128
7          that   10,595          0.074
8          is     10,099          0.081
9          was     9,816          0.088
10         he      9,543          0.095
02: Text Operation 39
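Re-computing the last column of the table above (a quick sketch; the frequencies are taken from the slide):

```python
N = 1_000_000
table = [(1, "the", 69_971), (2, "of", 36_411), (3, "and", 28_852),
         (4, "to", 26_149), (5, "a", 23_237), (6, "in", 21_341),
         (7, "that", 10_595), (8, "is", 10_099), (9, "was", 9_816),
         (10, "he", 9_543)]
for rank, term, freq in table:
    # rank * (freq / N) should stay roughly constant if Zipf's law holds
    print(f"{rank:2d} {term:5s} {rank * freq / N:.3f}")
# the values cluster around 0.07-0.13, i.e. rank x frequency is roughly constant
```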
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words (upper cut-
off)
• Used by almost all systems
• Significant words: Take words in between the most
frequent (upper cut-off) and least frequent words (lower
cut-off)
• Used in IR systems
• Term weighting: Give differing weights to terms based
on their frequency, with most frequent words weighted
less
• Used by almost all ranking methods
02: Text Operation 40
Explanations for Zipf’s Law
The law has been explained by the “principle of least effort”,
which makes it easier for a speaker or writer of a language
to repeat certain words instead of coining new words
Zipf’s explanation was his “principle of least effort”, which
balances the speaker’s desire for a small
vocabulary against the hearer’s desire for a large one
 Zipf’s Law Impact on IR
 Good News: Stopwords will account for a large fraction
of text so eliminating them greatly reduces inverted-index
storage costs
 Bad News: For most words, gathering sufficient data for
meaningful statistical analysis (e.g. for correlation
analysis for query expansion) is difficult since they are
extremely rare
02: Text Operation 41
Word significance: Luhn’s Ideas
Luhn Idea (1958): the frequency of word occurrence in
a text furnishes a useful measurement of word
significance
Luhn suggested that both extremely common and
extremely uncommon words were not very useful for
indexing
02: Text Operation 42
For this, Luhn specifies two cutoff points: an upper and a
lower cutoffs based on which non-significant words are
excluded
The words exceeding the upper cutoff were considered to
be common
The words below the lower cutoff were considered to be
rare
Hence they are not contributing significantly to the
content of the text
The ability of words to discriminate content reaches a
peak at a rank-order position halfway between the two
cutoffs
Let f be the frequency of occurrence of words in a text, and r
their rank in decreasing order of word frequency; a plot
relating f and r then yields a hyperbolic curve, as described by Zipf’s law
02: Text Operation 43
Luhn’s Ideas
Luhn (1958) suggested that both extremely common and extremely
uncommon words were not very useful for document representation
& indexing
02: Text Operation 44
Vocabulary Growth: Heaps’ Law
• How does the size of the overall vocabulary (number of
unique words) grow with the size of the corpus?
– This determines how the size of the inverted index will
scale with the size of the corpus.
• Heaps’ law: estimates the vocabulary size (number of distinct
words) of a given corpus
– For a text of n words, the vocabulary size grows as O(n^β),
where β is a constant, 0 < β < 1
– If V is the size of the vocabulary and n is the length of the
corpus in words, Heaps provides the following equation:
V = K · n^β
• where the constants are typically:
– K ≈ 10-100
– β ≈ 0.4-0.6 (approx. square-root growth)
02: Text Operation 45
Heaps’ distributions
• Distribution of the size of the vocabulary: on a log-log plot there is a linear
relationship between vocabulary size and number of
tokens
• Example: from 1,000,000,000 words, there may be
1,000,000 distinct words. Do you agree?
02: Text Operation 46
Example
• We want to estimate the size of the vocabulary for
a corpus of 1,000,000 words. However, we only
know the statistics computed on smaller corpora
sizes:
– For 100,000 words, there are 50,000 unique words
– For 500,000 words, there are 150,000 unique words
– Estimate the vocabulary size for the 1,000,000 words
corpus?
– How about for a corpus of 1,000,000,000 words?
02: Text Operation 47
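One way to work the exercise, assuming Heaps’ law V = K·n^β and fitting K and β to the two given data points (a sketch, not the only possible estimate):

```python
import math

n1, v1 = 100_000, 50_000     # first known corpus
n2, v2 = 500_000, 150_000    # second known corpus

beta = math.log(v2 / v1) / math.log(n2 / n1)  # log(3)/log(5) ~ 0.68
K = v1 / n1 ** beta                           # ~ 19

for n in (1_000_000, 1_000_000_000):
    print(f"n = {n:>13,d}  estimated V ~ {K * n ** beta:,.0f}")
# roughly 240,000 unique words for 1M running words,
# and on the order of tens of millions for 1B running words
```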
Relevance Measure
             retrieved                  not retrieved
relevant     retrieved & relevant       not retrieved but relevant
irrelevant   retrieved & irrelevant     not retrieved & irrelevant
02: Text Operation 48
Issues: recall and precision
• breaking up hyphenated terms increases
recall but decreases precision
• preserving case distinctions enhances
precision but decreases recall
• commercial information systems usually
take recall enhancing approach (numbers
and words containing digits are index
terms, and all are case insensitive)
02: Text Operation 49
Exercise: Tokenization
• The cat slept peacefully in the living room.
It’s a very old cat.
• Mr. O’Neill thinks that the boys’ stories
about Chile’s capital aren’t amusing.
02: Text Operation 50
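One possible answer for the exercise, sketched with a regular expression that lowercases, keeps internal apostrophes, and drops other punctuation (other tokenization policies are equally defensible):

```python
import re

sentences = [
    "The cat slept peacefully in the living room. It's a very old cat.",
    "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.",
]

for s in sentences:
    # keep apostrophes inside words (o'neill, aren't); drop trailing ones (boys')
    print(re.findall(r"[a-z]+(?:'[a-z]+)?", s.lower()))
```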
Index term selection
• Index language is the language used to describe
documents and requests
• Elements of the index language are index terms which
may be derived from the text of the document to be
described, or may be arrived at independently.
– If a full text representation of the text is adopted, then all
words in the text are used as index terms = full text
indexing
– Otherwise, we need to select the words to be used as index
terms in order to reduce the size of the index file, which is
basic to designing an efficient IR search system
02: Text Operation 51
End of Chapter Two
02: Text Operation 52