Full Text Search
for when a database is not enough...
TOC
● What is "Full text search"?
● How does it work?
● What is it good for?
● What makes it so good?
● Common Characteristics
● Some of the most known solutions
● Who uses them?
● Practical Example
What is full text search?
Wikipedia says: full text search refers to a technique for searching a
computer-stored document or database. In a full text search, the search engine
examines all of the words in every stored document as it tries to match search
words supplied by the user.
I say: Full text search is a technique for searching documents or databases
that returns more relevant results (the results that we need, instead
of the results that just "match" our query).
How does it work?
In order to do a full text search, we first have to index all the information.
There are several techniques for indexing, but the basic idea behind it is as
follows:
1. Scan the document
2. For every word within the document, create an entry in the index with that
word, along with its position within the document.
3. Apply specific rules to the terms, such as:
○ Ignoring stop words
○ Stemming
○ etc.
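The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular engine does it: the stop-word list, the `naive_stem` helper and the `build_index` function are all hypothetical names made up for this example, and the "stemmer" only strips a trailing "s".

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "by", "can"}

def naive_stem(word):
    """Very rough stand-in for a real stemmer: strips a trailing 's'."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word

def build_index(documents):
    """Map each term to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Step 1: scan the document and split it into lowercase words.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            # Step 3: apply the rules (skip stop words, stem the rest).
            if word in STOP_WORDS:
                continue
            term = naive_stem(word)
            # Step 2: record the term together with its position.
            index[term].append((doc_id, position))
    return index

docs = {1: "The cats sat by the door", 2: "A cat can open doors"}
index = build_index(docs)
print(index["cat"])   # postings for "cat" across both documents
print(index["door"])  # "door" and "doors" end up under the same term
```

Storing positions (and not just document ids) is what later makes phrase queries possible, since the engine can check that terms appear next to each other.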
... how? part II
We have the index ready, now what?
Depending on the solution used, we'll have access to a formal query
language. Using it, we can tell the engine what we're looking for.
Something like:
title:"The Right Way" AND text:goorjakarta^4 apache
This tells our search engine to look for documents whose title matches "The
Right Way" and whose text contains the words "goorjakarta" and "apache".
The only difference is that "goorjakarta" is boosted, so it counts 4 times
as much as the word "apache" when ranking the results.
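To get a feeling for what the boost in the query above does, here is a deliberately simplified scoring sketch. It is an assumption made for illustration, not the formula used by any real engine: the `score` function simply multiplies each term's boost by the number of times the term appears in the document.

```python
def score(text_terms, boosts):
    """Sum boost * occurrences for every query term found in the document."""
    return sum(boost * text_terms.count(term) for term, boost in boosts.items())

# Boosts mirroring the query above: "goorjakarta" weighted 4, "apache" weighted 1.
boosts = {"goorjakarta": 4.0, "apache": 1.0}

# Hypothetical documents, already split into terms.
docs = {
    "doc_a": "apache apache server".split(),
    "doc_b": "goorjakarta project".split(),
}

# Rank the documents from the highest score to the lowest.
ranked = sorted(docs, key=lambda name: score(docs[name], boosts), reverse=True)
print(ranked)
```

Even though "apache" appears twice in doc_a, a single boosted match in doc_b is enough to rank it first.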
What is it good for?
Full text search allows us to search (well duh!) very large amounts of
information in a very small time frame.
This type of solution is generally used when the size of the database to be
searched reaches the gigabytes.
It is normally used for searching inside the content of documents, such as Word
documents, Excel spreadsheets, web pages, etc.
What makes it so good?
Full text search is great! (but why?)
Some of the most important characteristics shared by full text search
solutions are:
-Relevant search: The results can be sorted by relevance, which makes it
easier for users to find what they are looking for. (e.g. if we search for "red"
and "apple" we want to get the fruit and not results about the Apple company)
-Keywords: When indexing, keywords can be assigned to different parts of the
documents, allowing for more specific queries.
-Wildcards: A great tool that lets us search for terms when we don't know
exactly how they are spelled.
-Fuzzy search: Using this technique, we can find terms that are close to
the ones in our query string.
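As a rough illustration of fuzzy search, the sketch below uses `difflib.get_close_matches` from the Python standard library to find indexed terms that look similar to a misspelled query term. The term list and the 0.6 cutoff are assumptions chosen for the example; real engines typically rely on their own similarity measures (edit distance, for instance).

```python
from difflib import get_close_matches

# Hypothetical list of terms already present in the index.
indexed_terms = ["apple", "maple", "ample", "red", "company"]

# The user misspelled "apple"; look for terms that are close enough.
query_term = "aple"
matches = get_close_matches(query_term, indexed_terms, n=3, cutoff=0.6)
print(matches)
```

Raising the cutoff makes the search stricter (fewer, closer matches), while lowering it lets more distant terms through.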
Common characteristics
Let's talk about some of the most common characteristics
amongst full text search solutions.
● Precision vs. Recall
● Stopwords
● Stemming
● Wildcards
Precision vs. recall tradeoff
Precision: Number of relevant results returned divided by the
total number of results returned.
Recall: Number of relevant results returned divided by the total
number of relevant results that exist.
When choosing a solution, it is important to balance these two
concepts correctly. An increase in precision usually means a
decrease in recall, and vice versa.
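Both definitions are easy to compute once we know which documents were returned and which ones were actually relevant. The following sketch assumes documents are identified by plain integers; the `precision_recall` helper is a hypothetical name used only for this example.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned document ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant results that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# The engine returned 4 documents; 6 documents were relevant in total,
# and only 2 of them (ids 2 and 4) made it into the results.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here half of the returned results are relevant (precision 0.5), but only a third of all the relevant documents were found (recall of about 0.33).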
Stopwords
Stopwords are terms that are too common in a language and
therefore are not specific enough to be of use when
searching.
Some examples are words like "the", "a", "an", "by",
"can", etc.
They're normally ignored by full text analyzers when indexing
information.
Stemming
Stemming allows us to reduce a word to its root form (or stem)
in order to generalize terms while searching. Note that this is
not the same as matching synonyms.
For example, a stemmer would generalize words like "catlike",
"catty" and "cats" to their root form: "cat".
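A toy suffix stripper is enough to reproduce the example above. Note that this is just a sketch: the suffix list is hypothetical and was picked to fit these three words, whereas real stemmers (the Porter stemmer, for instance) apply much more careful rules.

```python
# Hypothetical rule set, chosen only to fit the "cat" example.
SUFFIXES = ("like", "ty", "s")

def strip_suffix(word):
    """Remove the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [strip_suffix(w) for w in ("catlike", "catty", "cats", "cat")]
print(stems)
```

The same function has to be applied both when indexing and when querying, otherwise a search for "cats" would never meet the indexed term "cat".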
W?ldc*ds (A.k.a: Wildcards)
Wildcards are a bit better known and they do what you'd expect
them to do: they are used in place of characters when you don't
know exactly how your search terms are spelled.
Wildcard characters may vary from one solution to the other,
but there are normally two: one that represents a single
character, and one that represents a group of them.
For example: the string "hel*" would match words like "hello",
"helium" and others, while the string "hel?" would only match
words that begin with "hel" and end with one more character,
like "hell" but not "helium".
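The "hel*" and "hel?" patterns above happen to use the same symbols as shell-style patterns, so Python's standard `fnmatch` module can be used to try them out. This is only a sketch over a hypothetical word list; a search engine would match the pattern against its index, not against a plain list.

```python
from fnmatch import fnmatchcase

# Hypothetical list of indexed words.
words = ["hello", "helium", "hell", "help", "shell"]

# "*" stands for any group of characters (including none).
star_matches = [w for w in words if fnmatchcase(w, "hel*")]

# "?" stands for exactly one character.
question_matches = [w for w in words if fnmatchcase(w, "hel?")]

print(star_matches)
print(question_matches)
```

Note that "shell" matches neither pattern, because both of them require the word to start with "hel".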
Some of the most known solutions
There are different types of solutions: some of them are just
APIs that can be integrated into our projects, whilst others are
servers that provide an entire layer of services between our
application and the information.
Some examples are:
APIs:
● Xapian
● Lucene
Servers:
● Sphinx
● Solr
... a bit more about Lucene and Xapian
There are many more, but those are some of the best-known
ones...
Xapian and Lucene are both APIs, but they work differently:
Xapian needs bindings for each language it is used from.
In the case of Lucene, there are separate implementations of
Lucene for each supported language.
... and a bit more about Sphinx and Solr
On the other hand, Solr (which is based on Lucene) and
Sphinx are both full text search servers.
They both provide their functionality through external interfaces
rather than directly inside the application.
Sphinx is designed to be efficient at indexing database
content.
Who uses them?
These types of solutions are used by many companies, for
example:
- Debian uses Xapian for many tasks, one of them
is searching their archive of software packages
- NASA Planetary Data System (PDS) uses Solr to search for
dataset, mission, instrument, target, and host information
- Digg uses Solr for searching their site
- Craigslist uses Sphinx
- Moove-it! has used Sphinx on some of its projects
- And many more...
Practical Example
Let's take a look at a very original example...
Thanks for reading...
... and happy searching!
