Full Text Search
David LeBer
Align Software Inc.
What is full text search?
Full Text Search with Lucene
How?

•   Wild card database queries

•   Database implementations

•   Third party search engines

•   Text indexing libraries
Wild Card Queries

SELECT * FROM SOME_TABLE WHERE SOME_COLUMN LIKE '%Some String%'
Wild Card Queries



•   Easy
Wild Card Queries


•   Slow

•   Hard to optimize

•   Difficult to rank
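To make the drawbacks concrete, here is a minimal plain-Java sketch (not database code) of roughly what a leading-wildcard LIKE amounts to when no index can help: every row is read and tested for the substring, and matches come back in table order with no notion of relevance. The class and method names are hypothetical, and the case-insensitive comparison is an assumption (real behaviour depends on the column collation).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of LIKE '%needle%' evaluated without an index.
class WildcardScanSketch {

    // Tests every row; returns the positions of matching rows in table order.
    static List<Integer> likeContains(List<String> rows, String needle) {
        List<Integer> matches = new ArrayList<>();
        String lowered = needle.toLowerCase();
        for (int i = 0; i < rows.size(); i++) {
            // Work grows with the number of rows and the length of each row.
            if (rows.get(i).toLowerCase().contains(lowered)) {
                matches.add(i);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "WebObjects and Lucene",
            "Cooking with apples",
            "Lucene in Action");
        System.out.println(likeContains(rows, "lucene"));
    }
}
```

Nothing in the result says which row is the better match, which is the ranking problem the slide refers to.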
Database Implementations


•   MySQL FULLTEXT index and MATCH queries

•   PostgreSQL tsvector & tsquery
Database Implementations



•   Fairly Easy
Database Implementations

•   Database specific SQL

•   May include additional limitations
    (e.g. MySQL before 5.6: MyISAM tables only)

•   Functionality defined by the DB engine
Third Party Search Engines



•   Google indexing / searching of your content
Third Party Search Engines


•   Easy

•   Matches user expectations
Third Party Search Engines


•   Content must be available for indexing

•   Loss of control

•   Enhances the Google hegemony
Text Indexing Library



•   Lucene
Text Indexing Library

•   Complete control

•   Database independent

•   Flexible search behaviour

•   Ranked results
Text Indexing Library


•   Adds complexity

•   Additional query language

•   Parallel index
Lucene Overview

•   Open Source - part of the Apache Project

•   Very flexible

•   Wickedly fast

•   Index based
Lucene : Installing


•   Add the Lucene jars to your classpath

•   Use ERIndexing
Lucene : Tasks


•   Indexing

•   Searching
Indexing
What is Indexing?
Indexing : Steps


•   Conversion (to plain text)

•   Analysis (clean and convert the text to tokens)

•   Index (save the tokens to the index)
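The three steps can be sketched in plain Java without Lucene. This is only an illustration of the idea, not how Lucene implements it: the class name, the crude tag-stripping regex, and the in-memory map of term to document ids are all assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical, in-memory stand-in for an index: term -> ids of documents containing it.
class IndexingStepsSketch {
    final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Step 1, conversion: crudely replace anything that looks like a markup tag.
    static String toPlainText(String raw) {
        return raw.replaceAll("<[^>]*>", " ");
    }

    // Step 2, analysis: lower-case, then split on runs of non-letter, non-digit characters.
    static String[] analyze(String text) {
        return text.toLowerCase().trim().split("[^\\p{L}\\p{N}]+");
    }

    // Step 3, index: record each token against the document id.
    void add(int docId, String raw) {
        for (String token : analyze(toPlainText(raw))) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    public static void main(String[] args) {
        IndexingStepsSketch index = new IndexingStepsSketch();
        index.add(1, "<h1>Full Text Search</h1>");
        index.add(2, "<p>Search with Lucene</p>");
        System.out.println(index.postings);
    }
}
```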
Indexing : Parts


•   Index - either file or memory based

•   Document - represents a unique object added to the index

•   Field - identifies a chunk of data in the document
Indexing : Classes

•   IndexWriter

•   Directory

•   Analyzer

•   Document

•   Field
Creating an Index

// Lucene 2.9/3.0-era API
URL indexDirectoryURL = ... // assume exists
File indexFile = new File(indexDirectoryURL.getPath());
FSDirectory indexDirectory = FSDirectory.open(indexFile);
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// true = create a new index, replacing any existing one at that location
IndexWriter indexWriter = new IndexWriter(indexDirectory, analyzer, true,
                                IndexWriter.MaxFieldLength.UNLIMITED);
Indexing : Field Parameters


•   Stored or not

•   Analyzed or not, with and without norms

•   Include position, offset, and term frequency
Indexing : Analyzers

•   SimpleAnalyzer

•   StopAnalyzer

•   StandardAnalyzer

•   ...
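The difference between the first two analyzers can be approximated in plain Java. This is a rough imitation for illustration only: SimpleAnalyzer splits at non-letters and lower-cases, and StopAnalyzer additionally drops stop words. The stop list below is a tiny hypothetical one, not Lucene's default, and StandardAnalyzer's grammar-based tokenization is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical imitation of two analyzers; not Lucene code.
class AnalyzerSketch {
    // Tiny made-up stop list for the example.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "with"));

    // Roughly SimpleAnalyzer: split at non-letters and lower-case.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Roughly StopAnalyzer: the simple pass, then remove stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Full-Text Search with the Lucene 2.9 API";
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

Whatever analyzer is chosen, the same one should normally be used when parsing queries, otherwise query terms may not line up with the indexed tokens.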
Adding a Document

String value = ... // assume exists
Document doc = new Document();
Field docField = new Field("title", value,
                            Field.Store.YES, Field.Index.ANALYZED);
doc.add(docField);
...
indexWriter.addDocument(doc);
Indexing : Fun with indexes



•   Multiple Access
Searching
What is Searching?
Searching : Steps

•   Clean the user input

•   Create a Query

•   Query the Index

•   Return the results
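The four steps can be walked through against a toy in-memory index. This is a plain-Java sketch under stated assumptions, not Lucene's search path: the hard-coded index contents, the OR semantics, and ranking by number of matched terms are all choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical walk-through of the search steps; not Lucene code.
class SearchStepsSketch {
    // Toy index: term -> ids of documents containing it.
    static final Map<String, Set<Integer>> INDEX = new HashMap<>();
    static {
        INDEX.put("lucene", new TreeSet<>(Arrays.asList(1, 2)));
        INDEX.put("webobjects", new TreeSet<>(Arrays.asList(2, 3)));
        INDEX.put("apple", new TreeSet<>(Arrays.asList(3)));
    }

    static List<Integer> search(String userInput) {
        // 1. Clean the user input: lower-case and replace punctuation with spaces.
        String cleaned = userInput.toLowerCase()
                                  .replaceAll("[^\\p{L}\\p{N}\\s]", " ")
                                  .trim();
        // 2. Create a query: here simply the list of terms, OR-ed together.
        String[] terms = cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
        // 3. Query the index, counting how many query terms each document matches.
        Map<Integer, Integer> hitCounts = new TreeMap<>();
        for (String term : terms) {
            for (Integer docId : INDEX.getOrDefault(term, Collections.<Integer>emptySet())) {
                hitCounts.merge(docId, 1, Integer::sum);
            }
        }
        // 4. Return the results, documents matching the most terms first.
        List<Integer> results = new ArrayList<>(hitCounts.keySet());
        results.sort((a, b) -> hitCounts.get(b) - hitCounts.get(a));
        return results;
    }

    public static void main(String[] args) {
        System.out.println(search("Lucene, WebObjects!"));
    }
}
```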
Searching : Search Classes
•   IndexReader

•   IndexSearcher

•   Query

•   QueryParser

•   TopDocs/ScoreDoc

•   Document
Searching : QueryTypes
•   TermQuery

•   RangeQuery

•   PrefixQuery

•   BooleanQuery

•   PhraseQuery

•   WildcardQuery

•   FuzzyQuery
Searching : QueryParser
•   'webobjects' - contains an exact match - TermQuery

•   'webobjects apple', 'webobjects OR apple' - an OR Query

•   +webobjects +apple / webobjects AND apple - an AND Query

•   title:webobjects - Contains the term in title field

•   title:webobjects -subject:iTunes / title:webobjects AND NOT
    subject:iTunes

•   (webobjects OR apple) AND iTunes
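The boolean operators in these queries boil down to set operations on the document ids matched by each term. The plain-Java sketch below shows only that set logic, with made-up document ids; Lucene's BooleanQuery also scores the matches, which is not modelled here.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of boolean query semantics as set operations.
class BooleanQuerySketch {

    // a OR b: documents matching either term.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // a AND b: documents matching both terms.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // a AND NOT b: documents matching a but not b.
    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Made-up postings for three terms.
        Set<Integer> webobjects = new TreeSet<>(Arrays.asList(1, 2));
        Set<Integer> apple = new TreeSet<>(Arrays.asList(2, 3));
        Set<Integer> itunes = new TreeSet<>(Arrays.asList(3, 4));

        // (webobjects OR apple) AND iTunes
        System.out.println(and(or(webobjects, apple), itunes));
        // webobjects AND NOT iTunes
        System.out.println(andNot(webobjects, itunes));
    }
}
```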
Searching : QueryParser

•   title:"apple webobjects" - Phrase Query

•   title:"apple webobjects"~5 - slop of 5

•   webobj* - Prefix Query

•   webobjicts~ - Fuzzy Query

•   lastmodified:[1/1/10 TO 1/1/11] - Range Query
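Prefix and fuzzy matching both work by expanding the typed text into real terms from the index. The plain-Java sketch below illustrates the idea with a sorted term dictionary and Levenshtein edit distance. It is not Lucene's implementation, and the `maxEdits` cut-off is a simplification chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration of prefix and fuzzy term expansion.
class TermMatchingSketch {

    // Prefix match: the slice of the sorted dictionary that starts with the prefix.
    static SortedSet<String> prefix(TreeSet<String> dictionary, String prefix) {
        return dictionary.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    // Levenshtein distance: minimum number of single-character edits between a and b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Fuzzy match: dictionary terms within maxEdits of the (possibly misspelled) term.
    static List<String> fuzzy(TreeSet<String> dictionary, String term, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String candidate : dictionary) {
            if (editDistance(candidate, term) <= maxEdits) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> dictionary = new TreeSet<>(
            Arrays.asList("apple", "webobjects", "webobjective", "website"));
        System.out.println(prefix(dictionary, "webobj"));
        System.out.println(fuzzy(dictionary, "webobjicts", 1));
    }
}
```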
Performing a Search

Query query = ... // assume exists
// indexDirectory is the Directory the index was written to; true = open read-only
IndexSearcher searcher = new IndexSearcher(indexDirectory, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true); // keep the top 10 hits
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
Using a QueryParser

QueryParser queryParser = new QueryParser(Version.LUCENE_29,
                                          "content", analyzer);
Query query = queryParser.parse(queryString);
Demo
Scoring
“The more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query”
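The idea in the quote is commonly expressed as TF-IDF. The plain-Java sketch below uses a textbook simplification, term frequency times ln(N / document frequency), and is not Lucene's actual scoring formula, which adds further factors. The class name and the tokenized sample documents are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical TF-IDF sketch; a textbook simplification, not Lucene's scoring.
class TfIdfSketch {

    // Score of one term for one document within a corpus of tokenized documents.
    static double score(String term, List<String> doc, List<List<String>> corpus) {
        long termFrequency = doc.stream().filter(term::equals).count();
        long documentFrequency = corpus.stream().filter(d -> d.contains(term)).count();
        if (termFrequency == 0 || documentFrequency == 0) {
            return 0.0;
        }
        // Rare terms (low document frequency) get a larger weight.
        double inverseDocumentFrequency =
            Math.log((double) corpus.size() / documentFrequency);
        return termFrequency * inverseDocumentFrequency;
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("lucene", "lucene", "search");
        List<String> d2 = Arrays.asList("search", "engine");
        List<String> d3 = Arrays.asList("database", "search");
        List<List<String>> corpus = Arrays.asList(d1, d2, d3);

        // "lucene" is frequent in d1 and rare elsewhere, so it scores high for d1.
        System.out.println(score("lucene", d1, corpus));
        // "search" appears in every document, so it carries no weight here.
        System.out.println(score("search", d1, corpus));
    }
}
```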
Boost

•   While Indexing

    •   Document

    •   Field

•   While Searching

    •   Query
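A boost is, in effect, a multiplier that makes some matches count for more than others. The plain-Java sketch below shows the idea for field boosts only, with hypothetical field names and raw scores. It is an illustration of the arithmetic, not of how Lucene stores or applies boosts internally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of boosting as a per-field score multiplier.
class BoostSketch {

    // Sums the raw per-field scores, each multiplied by that field's boost (default 1.0).
    static double boostedScore(Map<String, Double> rawFieldScores,
                               Map<String, Double> fieldBoosts) {
        double total = 0.0;
        for (Map.Entry<String, Double> entry : rawFieldScores.entrySet()) {
            total += entry.getValue() * fieldBoosts.getOrDefault(entry.getKey(), 1.0);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> raw = new HashMap<>();
        raw.put("title", 1.0);
        raw.put("body", 1.0);

        Map<String, Double> boosts = new HashMap<>();
        boosts.put("title", 2.0); // a title match counts double

        System.out.println(boostedScore(raw, boosts));
    }
}
```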
Luke
Demo
ERIndexing
ERIndexing : Strengths

•   Hides some of the complexity of integrating Lucene with WO

•   Offers lots of utility and helper methods

•   Speaks WebObjects collection classes

•   Simplifies index creation
ERIndexing : Weaknesses


•   Hides some of the complexity of integrating Lucene with WO

•   Not fully baked

•   Auto indexing may be dangerous
Demo
Beyond Lucene


•   Solr

•   Compass

•   ElasticSearch
Q&A
Lucene: http://lucene.apache.org
Luke: http://code.google.com/p/luke/
Solr: http://lucene.apache.org/solr/
Compass: http://www.compass-project.org/overview.html
ElasticSearch: http://www.elasticsearch.com/
