Chapter Two: Text Operations
Introduction to Information Storage and Retrieval
Chapter Objectives
• At the end of this chapter students will be
able to
– Identify major steps in text operation
– Understand challenges in text operation
– Differentiate content-bearing and function terms
– Tokenize text into index terms
– Eliminate stopwords from text
– Understand stemming and stemming
algorithms
– Understand the need for thesaurus
02: Text Operation 2
Text Operations
• Five text operations
1. Lexical analysis
– Objective is to handle digits, hyphens, punctuation marks, and case of
letters
2. Stop word elimination
– Objective is filtering out words with very low discriminating power for
retrieval purpose
3. Stemming
– Removing affixes to allow retrieval of documents with syntactical variation
of query terms
4. Term selection
– Select index terms that carry more semantic content
5. Thesaurus construction
– Term categorization to allow expansion of query with related terms
02: Text Operation 3
Text Operations …
• Not all words in a document are equally significant to represent the
contents/meanings of a document
– Some words carry more meaning than others
– Nouns are usually the most representative of a document's content
• Therefore, need to preprocess the text of a document in a collection
to be used as index terms
• Using the set of all words in a collection to index documents creates
too much noise for the retrieval task
– Reducing noise means reducing the number of words that can be used to refer to the
document
• Preprocessing is the process of controlling the size of the
vocabulary or the number of distinct words used as index terms
– Preprocessing will lead to an improvement in the information retrieval
performance
• However, some search engines on the Web omit preprocessing
– Every word in the document is an index term
02: Text Operation 4
Generating Document Representatives
 Text Processing System
 Input text – full text, abstract or title
 Output – a document representative adequate for use in an
automatic retrieval system
 The document representative consists of a list of class
names, each name representing a class of words
occurring in the total input text. A document will be indexed
by a name if one of its significant words occurs as a member of that class.
Documents → Tokenization → Stop word elimination → Stemming → Thesaurus → Index Terms
02: Text Operation 5
Tokenization/Lexical Analysis
 Change text of the documents into words to be adopted
as index terms
 Objective - identify words in the text
 Digits, hyphens, punctuation marks, case of letters
 Numbers alone are usually not good index terms (like 1910, 1999),
but combinations such as 510 B.C. are unique and may be worth keeping
 Hyphens – usually break up hyphenated words (e.g. state-of-the-art →
state of the art), but some words, e.g. gilt-edged, B-49, are
unique words which require their hyphens
 gilt-edged and gilt edged do not mean the same thing; similarly,
B-49 refers to a Soviet submarine while B49 refers to a city bus route
in the US
 Punctuation marks – remove them entirely unless they are
significant, e.g. in program code, x.exe and xexe are not the same
 Case of letters – usually not important; convert everything to a single case (typically lowercase)
02: Text Operation 6
Tokenization …
• Analyze text into a sequence of discrete tokens (words).
• Input: “Friends, Romans and Countrymen”
• Output: Tokens (an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing)
– Friends
– Romans
– and
– Countrymen
• Each such token is now a candidate for an index entry,
after further processing
• But what are valid tokens to produce?
02: Text Operation 7
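A minimal sketch of such a tokenizer in Python (the regular expression and behaviour are illustrative assumptions, not the only valid design):

```python
import re

def tokenize(text):
    """Split text into lowercase alphabetic tokens, dropping punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Friends, Romans and Countrymen"))
# ['friends', 'romans', 'and', 'countrymen']
```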
Issues in Tokenization
One word or multiple: How do you decide whether it is one
token or two or more?
Splitting on white space can also split what
should be regarded as a single token. This
occurs most commonly with names (San
Francisco, Los Angeles)
 Hewlett-Packard → Hewlett and Packard as two tokens?
 state-of-the-art: break up hyphenated sequence.
 San Francisco, Los Angeles
 lowercase, lower-case, lower case ?
 data base, database, data-base
• Cases with internal spaces that we might wish to
regard as a single token include phone numbers
((800) 234-2333) and dates (Mar 11, 1983)
• Splitting tokens on spaces can cause bad retrieval
02: Text Operation 8
Issues in Tokenization …
• The problems of hyphens and non-separating whitespace can even
interact; Advertisements for air fares frequently contain items like
San Francisco-Los Angeles, where simply doing whitespace splitting
would give unfortunate results
• Elimination of periods is problematic for items such as
 IP addresses (100.2.86.144)
How to handle special cases involving apostrophes,
hyphens etc? C++, C#, URLs, emails, …
 Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a
token, though usually they are not
 Simplest approach is to ignore all numbers and
punctuation and use only case-insensitive unbroken
strings of alphabetic characters as tokens
Generally, don’t index numbers as text
Meta-data, including creation date, format, etc., is often indexed
separately
Issues of tokenization are language specific
02: Text Operation 9
Elimination of STOPWORD
Stopwords are extremely common words across document
collections that have no discriminatory power
 They may occur in 80% of the documents in a collection.
 They would appear to be of little value in helping select documents
matching a user need and need to be filtered out of the index list
Examples of stopword:
 articles (a, an, the);
 pronouns: (I, he, she, it, their, his)
 prepositions (on, of, in, about, besides, against),
 conjunctions/ connectors (and, but, for, nor, or, so, yet),
 verbs (is, are, was, were),
 adverbs (here, there, out, because, soon, after) and
 adjectives (all, any, each, every, few, many, some)
Stopwords are language dependent
02: Text Operation 10
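A minimal stopword-filtering sketch, assuming a small hand-built stopword set (real systems use longer, language-specific lists):

```python
STOPWORDS = {"a", "an", "the", "i", "he", "she", "it", "on", "of", "in",
             "and", "but", "or", "is", "are", "was", "were"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cat", "slept", "in", "the", "living", "room"]))
# ['cat', 'slept', 'living', 'room']
```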
Elimination of STOPWORD …
Stop word elimination used to be standard in older IR
systems
But the trend is away from doing this. Most web search
engines index stop words:
Good query optimization techniques mean you pay little
at query time for including stop words.
You need stopwords for:
 Phrase queries: “King of Denmark”
 Various song titles, etc.: “Let it be”, “To be or not to be”
 “Relational” queries: “flights to London”
 Elimination of stopwords might reduce recall (e.g. “To
be or not to be” – all eliminated except “be” – no or
irrelevant retrieval)
02: Text Operation 11
How to determine a list of stopwords?
Intuition:
Stopwords have little semantic content; it is typical to remove
such high-frequency words
Stopwords take up about 50% of the text; hence, document size
reduces by 30-50% when stopwords are eliminated
 One method: Sort terms (in decreasing order) by collection
frequency and take the most frequent ones
In a collection about insurance practices, “insurance” would be a
stop word
Another method: Build a stop word list that contains a set of
articles, pronouns, etc.
 Why do we need stop lists: with a stop list, we can exclude
the most common words from the index terms entirely
 With the removal of stopwords, we can get a better
approximation of term importance for classification,
summarization, etc.
02: Text Operation 12
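A sketch of the first method above (sorting terms by collection frequency and taking the most frequent ones); the tiny example collection is hypothetical:

```python
from collections import Counter

def top_frequency_stopwords(documents, n=20):
    """Sort terms by collection frequency and return the n most frequent ones."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    return [term for term, _ in counts.most_common(n)]

docs = ["the insurance policy covers the car",
        "the insurance claim for the car was filed"]
print(top_frequency_stopwords(docs, n=3))  # e.g. ['the', 'insurance', 'car']
```

Note how a domain word like "insurance" surfaces alongside genuine function words, which is exactly the caveat mentioned above.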
Normalization
• Normalization is standardizing tokens so that matches occur
despite superficial differences in the character
sequences of the tokens
 Need to “normalize” terms in indexed text as well as
query terms into the same form
 Example: We want to match U.S.A. and USA, by
deleting periods in a term
Case Folding: Often best to lower case everything,
since users will use lowercase regardless of
‘correct’ capitalization…
 Republican vs. republican
 Fasil vs. fasil vs. FASIL
02: Text Operation 13
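A minimal normalization sketch covering the two cases above (period deletion and case folding); the function name and exact rules are illustrative:

```python
def normalize(token):
    """Case-fold and delete periods so that U.S.A. and USA map to one form."""
    return token.replace(".", "").lower()

print(normalize("U.S.A."))      # 'usa'
print(normalize("USA"))         # 'usa'
print(normalize("Republican"))  # 'republican'
```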
Normalization issues
 Good for
 Allow instances of Automobile at the beginning of a
sentence to match with a query of automobile
 Helps a search engine when most users type ferrari
when they are interested in a Ferrari car
 Bad for
 Proper names vs. common nouns
 E.g. General Motors, Associated Press, …
 Solution:
 lowercase only words at the beginning of the sentence
 In IR, lowercasing is most practical because of the way
users issue their queries
02: Text Operation 14
Stemming
• Stemming is the conflation of variant forms of a
word into a single representation, the stem,
semantically related to the variants
– The words connect, connects, connected, connecting,
connectivity, connection can be stemmed into connect
• The stem does not need to be a valid word, but it should
capture the meaning of the word
• Stemming aims to increase the effectiveness of
information retrieval, in particular recall, whereby more of the
relevant documents in the collection are retrieved
• It is also used to reduce the size of index files: since a
single stem typically corresponds to several full
terms, storing stems instead of terms can achieve a
compression factor of about 50 percent
• Conflation of words, or so-called stemming, can be done either
manually or automatically; the automatic approaches are outlined below
02: Text Operation 15
Term conflation
• One of the problems involved in the use of free
text for indexing and retrieval is the variation
in word forms that is likely to be encountered
• The most common types of variations are
– spelling errors (father, fathor)
– alternative spellings, i.e. local or national usage
(color vs colour, labor vs labour)
– multi-word concepts (database, data base)
– affixes (dependent, independent, dependently)
– abbreviations (e.g. "i.e." for "that is")
02: Text Operation 16
Conflation or stemming methods
• Automatic approaches:
– Affix removal method (longest match; simple removal)
– Table lookup method
– Successor variety method
– n-gram method
02: Text Operation 17
Affix removal
• Affix removal method removes suffix or prefix
from the words so as to convert them into a
common stem form
• Most of the stemmers that are currently used use
this type of approach for conflation
• The affix removal method is based on two
principles: one is iteration and the other is
longest match
• An iterative stemming algorithm is simply a
recursive procedure, as its name implies, which
removes strings in each order-class one at a time
02: Text Operation 18
Affix removal
• Iteration is usually based on the fact that suffixes
are attached to stems in a certain order, that is,
there exist order-classes of suffixes
• The longest-match principle states that within any
given class of endings, if more than one ending
provides a match, the one which is longest should
be removed
• The first stemmer based on this approach is the one
developed by Lovins (1968); MF Porter (1980) also
used this method
• However, Porter’s stemmer is more compact and has become the most widely used
02: Text Operation 19
Table lookup approach
• Store terms and their corresponding stems in a table
• Stemming is then done via lookups in the table
• One way to do stemming is to store a table of all index
terms and their stems
• Terms from queries and indexes could then be stemmed
via table lookup
• Problems with this approach
– building these lookup tables requires extensive
work on the language
– there is some probability that these tables will
miss some exceptional cases
– storage overhead for such a table
02: Text Operation 20
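A sketch of table-lookup stemming with a hypothetical, hand-filled table; the last call shows the coverage problem noted above:

```python
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}

def lookup_stem(term):
    """Return the stored stem, or the term unchanged if it is not in the table."""
    return STEM_TABLE.get(term.lower(), term.lower())

print(lookup_stem("engineered"))  # 'engineer'
print(lookup_stem("retrieval"))   # 'retrieval' (not covered by the table)
```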
Successor Variety approach
• Determine word and morpheme boundaries based on the distribution of
phonemes in a large body of utterances
• The successor variety of a string is the number of different characters that follow
it in words in some body of text
• The successor variety of substrings of a term will decrease as more characters
are added until a segment boundary is reached
Test Word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,
READING, READS, RED, ROPE, RIPE
Prefix     Successor Variety   Letters
R          3                   E, I, O
RE         2                   A, D
REA        1                   D
READ       3                   A, I, S
READA      1                   B
READAB     1                   L
READABL    1                   E
READABLE   1                   (blank)
02: Text Operation 21
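A small sketch that reproduces the successor-variety counts in the table above for the prefixes of READABLE (end-of-word, the "(blank)" row, is not counted here):

```python
def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` across the corpus words."""
    successors = {w[len(prefix)] for w in corpus
                  if w.startswith(prefix) and len(w) > len(prefix)}
    return len(successors), sorted(successors)

corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]
word = "READABLE"
for i in range(1, len(word) + 1):
    prefix = word[:i]
    print(prefix, successor_variety(prefix, corpus))
# R (3, ['E', 'I', 'O']), RE (2, ['A', 'D']), READ (3, ['A', 'I', 'S']), ...
```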
n-gram stemmers
• Another method of conflating terms is called the shared digram method
• A digram is a pair of consecutive letters
• Besides digrams we can also use trigrams and hence it is called n-gram method
in general
• Association measures are calculated between pairs of terms based on shared
unique digrams
statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti
• Dice’s coefficient (similarity): S = 2C / (A + B)
• A and B are the numbers of unique digrams in the first and the second words;
C is the number of unique digrams shared by the two words
• For the example above: S = (2 × 6) / (7 + 8) = 12 / 15 = 0.80
02: Text Operation 22
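The digram extraction and Dice computation above can be sketched directly (the helper names are illustrative):

```python
def unique_digrams(word):
    """Set of unique adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice's coefficient 2C / (A + B) over unique digrams."""
    da, db = unique_digrams(a), unique_digrams(b)
    return 2 * len(da & db) / (len(da) + len(db))

print(sorted(unique_digrams("statistics")))   # ['at', 'cs', 'ic', 'is', 'st', 'ta', 'ti']
print(sorted(unique_digrams("statistical")))  # ['al', 'at', 'ca', 'ic', 'is', 'st', 'ta', 'ti']
print(dice("statistics", "statistical"))      # 0.8
```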
n-gram stemmers …
• Similarity measures are determined for all pairs
of terms in the database, forming a similarity
matrix
• Once such similarities are computed for all the word
pairs, the words are clustered into groups
• The high value of the Dice coefficient hints
that this pair of words shares a common stem
02: Text Operation 23
Criteria for judging stemmers
• Correctness
– Overstemming: too much of a term is removed.
• Can cause unrelated terms to be conflated ➔ retrieval of
non-relevant documents
• Over-stemming is when two words with different stems
are stemmed to the same root. This is also known as a
false positive
– Understemming: too little of a term is removed.
• Prevent related terms from being conflated ➔ relevant
documents may not be retrieved
• Under-stemming is when two words that should be
stemmed to the same root are not. This is also known as a
false negative
Term      Stem
GASES     GAS (correct)
GAS       GA (over-stemming)
GASEOUS   GASE (under-stemming)
02: Text Operation 24
Assumptions in Stemming
• Words with the same stem are semantically
related and have the same meaning to the
user of the text
• The chance of matching increases when the
index terms are reduced to their word stems,
because a user may just as well search using
“retrieve” as “retrieving”
02: Text Operation 25
• Of the four types of stemming strategies (affix
removal, table
lookup, successor variety, and n-grams) which is
preferable?
• Table lookup consists simply of looking for the
stem of a word in a table, a simple procedure but
dependent on data on stems for the whole
language. Since such data is not readily available
and might require considerable storage space, this
type of stemming algorithm might not be practical.
• Successor variety stemming is based on the
determination of morpheme boundaries, uses
knowledge from structural linguistics, and is more
complex than affix removal
02: Text Operation 26
Porter Stemmer
 Porter stemmer is the most popular affix (suffix) removal algorithm, for its simplicity
and elegance
• Despite being simpler, the Porter algorithm yields results comparable to those of the
more sophisticated algorithms
• The Porter algorithm uses a suffix list for suffix stripping; the idea is to apply a
series of rules to the suffixes of the words in the text. For instance, the suffix s is
replaced by nil as shown below, converting plural into singular:
s → ∅
• When more than one rule could apply, the rule whose left-hand side matches the
longest suffix is used:
sses → ss
s → ∅
• Applied to the word stresses, this yields the stem stress instead of the stem stresse.
• A detailed description of the Porter algorithm can be found in the appendix of the
text book and its implementation at
http://tartarus.org/~martin/PorterStemmer/index.html
02: Text Operation 27
Porter stemmer
Most common algorithm for stemming English words to
their common grammatical root
It is a simple procedure for removing known affixes in
English without using a dictionary. To get rid of plurals
the following rules are used:
 SSES → SS    caresses → caress
 IES → Y      ponies → pony
 SS → SS      caress → caress
 S → ∅        cats → cat
 MENT → ∅     (delete the suffix only if what remains is long enough)
               replacement → replace
               cement → cement
02: Text Operation 28
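A sketch of just these plural rules, applied in longest-match order; this is only a tiny fragment of the full Porter algorithm and follows the slide's simplified IES → Y rule:

```python
def strip_plural(word):
    """Apply the simplified plural rules in longest-match order."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]        # caresses -> caress
    if w.endswith("ies"):
        return w[:-3] + "y"  # ponies -> pony (simplified; full Porter uses IES -> I)
    if w.endswith("ss"):
        return w             # caress -> caress
    if w.endswith("s"):
        return w[:-1]        # cats -> cat
    return w

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_plural(w))
```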
Thesauri
 Full-text searching often cannot be accurate, since different
authors may select different words to represent the same
concept
 Problem: The same meaning can be expressed using different
terms that are synonyms, homonyms, and related terms
 How can we ensure that the same terms are used in the index
and the query when the same meaning is intended?
 Thesaurus: The vocabulary of a controlled indexing language,
formally organized so that a priori relationships between
concepts (for example as "broader" and “related") are made
explicit.
 A thesaurus contains terms and relationships between terms
 IR thesauri rely typically upon the use of symbols such as
USE/UF (UF=used for), BT, and RT to demonstrate inter-
term relationships.
e.g., car = automobile, truck, bus, taxi, motor vehicle
02: Text Operation 29
Aim of Thesaurus
A thesaurus tries to control the use of the vocabulary by
showing a set of related words to handle synonyms and
homonyms
The aims of a thesaurus are therefore:
 to provide a standard vocabulary for indexing and searching
A thesaurus rewrites terms into equivalence classes, and we index
such equivalence classes
When the document contains automobile, index it under car as
well (usually, also vice-versa)
 to assist users with locating terms for proper query
formulation: When the query contains automobile, look
under car as well for expanding query
 to provide classified hierarchies that allow the broadening
and narrowing of the current request according to user needs
02: Text Operation 30
Thesaurus Construction
• Example: thesaurus built to assist IR for
searching cars and vehicles :
Term: Motor vehicles
UF : Automobiles, Cars, Trucks
BT: Vehicles
RT: Road Engineering, Road Transport
• Example: thesaurus built to assist IR in the fields of
computer science:
TERM: natural languages
– UF natural language processing (UF=used for
NLP)
– BT languages (BT=broader term is languages)
– TT languages (TT = top term is languages)
02: Text Operation 31
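A toy query-expansion sketch using a hypothetical mini-thesaurus keyed by preferred term (the entries echo the examples above):

```python
# Preferred term -> equivalent / related terms (UF / RT entries)
THESAURUS = {
    "motor vehicles": ["automobiles", "cars", "trucks"],
    "natural languages": ["natural language processing"],
}

def expand_query(terms):
    """Add the thesaurus entries of any query term that has them."""
    expanded = list(terms)
    for t in terms:
        expanded.extend(THESAURUS.get(t.lower(), []))
    return expanded

print(expand_query(["motor vehicles", "safety"]))
# ['motor vehicles', 'safety', 'automobiles', 'cars', 'trucks']
```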
Language-specificity
• Many of the above features embody
transformations that are
– Language-specific and
– Often, application-specific
• These are “plug-in” addenda to the indexing
process
• Both open source and commercial plug-ins
are available for handling these
02: Text Operation 32
Statistical Properties of Text
• How is the frequency of different words
distributed? Refer to Zipf’s law
• How fast does vocabulary size grow with the
size of a corpus? Refer to Heaps’ law
– Such factors affect the performance of IR system
& can be used to select suitable term weights &
other aspects of the system.
• A few words are very common.
– 2 most frequent words (e.g. “the”, “of”) can
account for about 10% of word occurrences in a
document.
• Most words are very rare.
– Half the words in a corpus appear only once; such words are
called hapax legomena (Greek for “read only once”)
02: Text Operation 33
Sample Word Frequency Data
02: Text Operation 34
Zipf’s distributions
For all the words in a document collection, for each word w
• f : is the frequency that w appears
• r : is rank of w in order of frequency.
(The most commonly occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies (frequency f vs. rank r) according to Zipf’s law; a word w has rank r and frequency f]
02: Text Operation 35
Word distribution: Zipf's Law
• Zipf's Law- named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
– attempts to capture the distribution of the frequencies
(i.e., number of occurrences) of the words within a text
• Zipf's Law states that when the distinct words in a text
are arranged in decreasing order of their frequency of
occurrence (most frequent words first), the occurrence
characteristics of the vocabulary can be characterized by
the constant rank-frequency law of Zipf:
Frequency * Rank = constant
• If the words, w, in a collection are ranked, r, by their
frequency, f, they roughly fit the relation: r * f = c
– Different collections have different constants c
02: Text Operation 36
Example: Zipf's Law
 The table shows the most frequently occurring words
from a 336,310-document collection containing 125,720,891
total words, of which 508,209 are unique words
02: Text Operation 37
Zipf’s law: modeling word distribution
• The collection frequency of the ith most common term is
proportional to 1/i
– If the most frequent term occurs f1 then the second
most frequent term has half as many occurrences, the
third most frequent term has a third as many, etc
• Zipf's Law states that the frequency of the i-th most
frequent word is 1/i^θ times that of the most frequent
word
– occurrence of some event (P), as a function of the
rank (i) when the rank is determined by the frequency
of occurrence, is a power-law function Pi ~ 1/i^θ with
the exponent θ close to unity
– equivalently, fi ∝ 1/i
02: Text Operation 38
More Example: Zipf’s Law
• Illustration of Rank-Frequency Law. Let the total number of word
occurrences in the sample N = 1,000,000
Rank (R)   Term   Frequency (F)   R·(F/N)
1          the    69,971          0.070
2          of     36,411          0.073
3          and    28,852          0.086
4          to     26,149          0.104
5          a      23,237          0.116
6          in     21,341          0.128
7          that   10,595          0.074
8          is     10,099          0.081
9          was     9,816          0.088
10         he      9,543          0.095
02: Text Operation 39
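Re-computing the last column of the table above (a quick sketch; the frequencies are taken from the slide):

```python
N = 1_000_000
table = [(1, "the", 69_971), (2, "of", 36_411), (3, "and", 28_852),
         (4, "to", 26_149), (5, "a", 23_237), (6, "in", 21_341),
         (7, "that", 10_595), (8, "is", 10_099), (9, "was", 9_816),
         (10, "he", 9_543)]
for rank, term, freq in table:
    # rank * (freq / N) should stay roughly constant if Zipf's law holds
    print(f"{rank:2d} {term:5s} {rank * freq / N:.3f}")
# the values cluster around 0.07-0.13, i.e. rank x frequency is roughly constant
```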
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words (upper cut-
off)
• Used by almost all systems
• Significant words: Take words in between the most
frequent (upper cut-off) and least frequent words (lower
cut-off)
• Used in IR systems
• Term weighting: Give differing weights to terms based
on their frequency, with most frequent words weighted
less
• Used by almost all ranking methods
02: Text Operation 40
Explanations for Zipf’s Law
The law has been explained by the “principle of least effort”,
which makes it easier for a speaker or writer of a language
to repeat certain words instead of coining new words
Zipf’s explanation was his “principle of least effort”, which
balances the speaker’s desire for a small
vocabulary against the hearer’s desire for a large one
 Zipf’s Law Impact on IR
 Good News: Stopwords will account for a large fraction
of text so eliminating them greatly reduces inverted-index
storage costs
 Bad News: For most words, gathering sufficient data for
meaningful statistical analysis (e.g. for correlation
analysis for query expansion) is difficult since they are
extremely rare
02: Text Operation 41
Word significance: Luhn’s Ideas
Luhn Idea (1958): the frequency of word occurrence in
a text furnishes a useful measurement of word
significance
Luhn suggested that both extremely common and
extremely uncommon words were not very useful for
indexing
02: Text Operation 42
For this, Luhn specifies two cutoff points: an upper and a
lower cutoffs based on which non-significant words are
excluded
The words exceeding the upper cutoff were considered to
be common
The words below the lower cutoff were considered to be
rare
Hence they are not contributing significantly to the
content of the text
The ability of words to discriminate content reaches a
peak at a rank-order position halfway between the two
cutoffs
Let f be the frequency of occurrence of words in a text, and r
their rank in decreasing order of word frequency; a plot
relating f and r then yields a hyperbolic curve, as described by Zipf’s law
02: Text Operation 43
Luhn’s Ideas
Luhn (1958) suggested that both extremely common and extremely
uncommon words were not very useful for document representation
& indexing
02: Text Operation 44
Vocabulary Growth: Heaps’ Law
• How does the size of the overall vocabulary (number of
unique words) grow with the size of the corpus?
– This determines how the size of the inverted index will
scale with the size of the corpus.
• Heaps’ law: estimates the vocabulary size (number of distinct
words) of a given corpus
– For a text of n words, the vocabulary size grows as O(n^β),
where β is a constant, 0 < β < 1
– If V is the size of the vocabulary and n is the length of the
corpus in words, Heaps provides the following equation:
V = K · n^β
• where the constants are typically:
– K ≈ 10-100
– β ≈ 0.4-0.6 (approx. square-root growth)
02: Text Operation 45
Heaps’ distributions
• Distribution of the size of the vocabulary: on a log-log plot there is a linear
relationship between vocabulary size and number of
tokens
• Example: from 1,000,000,000 words, there may be
1,000,000 distinct words. Do you agree?
02: Text Operation 46
Example
• We want to estimate the size of the vocabulary for
a corpus of 1,000,000 words. However, we only
know the statistics computed on smaller corpora
sizes:
– For 100,000 words, there are 50,000 unique words
– For 500,000 words, there are 150,000 unique words
– Estimate the vocabulary size for the 1,000,000 words
corpus?
– How about for a corpus of 1,000,000,000 words?
02: Text Operation 47
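One way to work the exercise, assuming Heaps’ law V = K·n^β and fitting K and β to the two given data points (a sketch, not the only possible estimate):

```python
import math

n1, v1 = 100_000, 50_000     # first known corpus
n2, v2 = 500_000, 150_000    # second known corpus

beta = math.log(v2 / v1) / math.log(n2 / n1)  # log(3)/log(5) ~ 0.68
K = v1 / n1 ** beta                           # ~ 19

for n in (1_000_000, 1_000_000_000):
    print(f"n = {n:>13,d}  estimated V ~ {K * n ** beta:,.0f}")
# roughly 240,000 unique words for 1M running words,
# and on the order of tens of millions for 1B running words
```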
Relevance Measure
             retrieved                  not retrieved
relevant     retrieved & relevant       not retrieved but relevant
irrelevant   retrieved & irrelevant     not retrieved & irrelevant
02: Text Operation 48
Issues: recall and precision
• breaking up hyphenated terms increases
recall but decreases precision
• preserving case distinctions enhances
precision but decreases recall
• commercial information systems usually
take recall enhancing approach (numbers
and words containing digits are index
terms, and all are case insensitive)
02: Text Operation 49
Exercise: Tokenization
• The cat slept peacefully in the living room.
It’s a very old cat.
• Mr. O’Neill thinks that the boys’ stories
about Chile’s capital aren’t amusing.
02: Text Operation 50
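One possible answer for the exercise, sketched with a regular expression that lowercases, keeps internal apostrophes, and drops other punctuation (other tokenization policies are equally defensible):

```python
import re

sentences = [
    "The cat slept peacefully in the living room. It's a very old cat.",
    "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.",
]

for s in sentences:
    # keep apostrophes inside words (o'neill, aren't); drop trailing ones (boys')
    print(re.findall(r"[a-z]+(?:'[a-z]+)?", s.lower()))
```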
Index term selection
• Index language is the language used to describe
documents and requests
• Elements of the index language are index terms which
may be derived from the text of the document to be
described, or may be arrived at independently.
– If a full text representation of the text is adopted, then all
words in the text are used as index terms = full text
indexing
– Otherwise, we need to select the words to be used as index
terms in order to reduce the size of the index file, which is
basic to designing an efficient IR search system
02: Text Operation 51
End of Chapter Two
02: Text Operation 52