Introduction to Apache Lucene

Sumit Luthra
Agenda
What is Apache Lucene?
Focus of Apache Lucene
Lucene Architecture
Core Indexing Classes
Core Searching Classes
Demo
Questions & Answers
What is Apache Lucene?
"Apache Lucene is a high-performance, full-featured text search
engine library written entirely in Java."
It is also known as an information retrieval library.
Lucene is an API (a library), not an application.
It is open source.
Focus
Indexing Documents
Searching Documents

Note:
You can use Lucene to provide consistent full-text indexing across
both database objects and documents in various formats (Microsoft
Office documents, PDF, HTML, text, emails and so on).
Lucene Architecture
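[Figure: Lucene architecture.
Indexing side: Raw content -> Acquire content -> Build document -> Analyze document -> Index document -> Index.
Search side: Users -> Search UI -> Build query -> Run query (against the Index) -> Render results.]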
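The flow in the architecture diagram can be sketched in plain Java. This is a toy model under stated assumptions, not Lucene's implementation: the class PipelineSketch and its methods are hypothetical names, the "index" is a simple in-memory map from term to document ids, and the analyzer only lowercases and splits on non-letters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy model of the flow in the diagram; NOT Lucene's implementation.
class PipelineSketch {
    // The "index": term -> ids of the documents containing that term.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // Stored raw content, kept so that results can be rendered.
    private final List<String> stored = new ArrayList<>();

    // "Analyze document": lowercase, then split on non-letters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "Build document" + "Index document": returns the new document id.
    int addDocument(String rawContent) {
        int docId = stored.size();
        stored.add(rawContent);
        for (String term : analyze(rawContent)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // "Build query" + "Run query": looks up the first term of the query text.
    List<Integer> search(String queryText) {
        List<String> terms = analyze(queryText);
        if (terms.isEmpty()) return new ArrayList<>();
        TreeSet<Integer> hits = postings.get(terms.get(0));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    // "Render results": the stored content of a hit.
    String render(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        PipelineSketch index = new PipelineSketch();
        index.addDocument("Hello World");
        index.addDocument("Hello Lucene");
        for (int id : index.search("lucene")) {
            System.out.println(id + ": " + index.render(id)); // prints "1: Hello Lucene"
        }
    }
}
```

The same analyze step is applied to both documents and queries, which is why a search for "HELLO" would still find both documents.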
Indexing Documents
// "true" creates a new index, overwriting any existing one
IndexWriter writer = new IndexWriter(directory, analyzer, true);
Document doc = new Document();
// each field is stored (retrievable from hits) and analyzed (searchable)
doc.add(new Field("content", "Hello World",
    Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("name", "filename.txt",
    Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("path", "http://myfile/",
    Field.Store.YES, Field.Index.ANALYZED));
// [...]
writer.addDocument(doc);
writer.close();
Core indexing classes
IndexWriter
Directory
Analyzer
Document
Field
IndexWriter construction
// Deprecated
IndexWriter(Directory d,
            Analyzer a,                      // default analyzer
            IndexWriter.MaxFieldLength mfl);

// Preferred
IndexWriter(Directory d,
            IndexWriterConfig c);
Directory
FSDirectory
RAMDirectory
DbDirectory
FileSwitchDirectory
JEDirectory
Analyzers
An Analyzer tokenizes the input text.
Common Analyzers
– WhitespaceAnalyzer
  Splits tokens on whitespace
– SimpleAnalyzer
  Splits tokens on non-letters, and then lowercases
– StopAnalyzer
  Same as SimpleAnalyzer, but also removes stop words
– StandardAnalyzer
  Most sophisticated analyzer that knows about certain token types,
  lowercases, removes stop words, ...
Analysis examples
• Input: "The quick brown fox jumped over the lazy dog"
• WhitespaceAnalyzer
  – [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
• SimpleAnalyzer
  – [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
• StopAnalyzer
  – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
• StandardAnalyzer
  – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
More analysis examples
• Input: "XY&Z Corporation - xyz@example.com"
• WhitespaceAnalyzer
  – [XY&Z] [Corporation] [-] [xyz@example.com]
• SimpleAnalyzer
  – [xy] [z] [corporation] [xyz] [example] [com]
• StopAnalyzer
  – [xy] [z] [corporation] [xyz] [example] [com]
• StandardAnalyzer
  – [xy&z] [corporation] [xyz@example.com]
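The first three analyzers in the examples above can be approximated in a few lines of plain Java. This is only a sketch of the described behaviour, not Lucene's code: AnalyzerSketch is a hypothetical class, and its STOP_WORDS set is a small illustrative subset rather than Lucene's actual stop-word list. StandardAnalyzer is left out because it is grammar-based and cannot be reduced to a simple split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java approximations of three analyzers; NOT Lucene's code.
class AnalyzerSketch {
    // Illustrative subset of English stop words (an assumption, not Lucene's list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    // WhitespaceAnalyzer-like: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // SimpleAnalyzer-like: split on non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // StopAnalyzer-like: SimpleAnalyzer behaviour plus stop-word removal.
    static List<String> stop(String text) {
        List<String> out = simple(text);
        out.removeAll(STOP_WORDS);
        return out;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dog";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

Running the three methods on "XY&Z Corporation - xyz@example.com" shows why the choice of analyzer matters for e-mail addresses and company names: only the whitespace variant keeps "xyz@example.com" as a single token.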
Document & Fields
A Document is the atomic unit of indexing and
searching. It contains Fields.
Fields have a name and a value
– You have to translate raw content into Fields
– Examples: title, author, date, abstract, body, URL, keywords, ...
– Different documents can have different fields
Field options
Field.Store
– NO : Don't store the field value in the index
– YES : Store the field value in the index
Field.Index
– ANALYZED : Tokenize with an Analyzer
– NOT_ANALYZED : Do not tokenize
– NO : Do not index this field
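The effect of these options can be illustrated with a small plain-Java model. It is a sketch under assumptions, not Lucene's classes: FieldOptionsSketch and its enums only mirror the option names above, and the stand-in analyzer simply lowercases and splits on non-letters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model of the Field.Store / Field.Index options; NOT Lucene's classes.
class FieldOptionsSketch {
    enum Store { YES, NO }
    enum Index { ANALYZED, NOT_ANALYZED, NO }

    // Terms that would become searchable for a value under each Index option.
    static List<String> indexedTerms(String value, Index index) {
        switch (index) {
            case ANALYZED:
                // Stand-in analyzer (assumption): lowercase, split on non-letters.
                List<String> terms = new ArrayList<>();
                for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                    if (!t.isEmpty()) terms.add(t);
                }
                return terms;
            case NOT_ANALYZED:
                // The whole value is indexed as one single term.
                return Collections.singletonList(value);
            default:
                // Index.NO: the field cannot be searched at all.
                return Collections.emptyList();
        }
    }

    // Value that can be read back from a search hit, or null if not stored.
    static String storedValue(String value, Store store) {
        return store == Store.YES ? value : null;
    }
}
```

In this model a file name such as "filename.txt" stays one exact-match term with NOT_ANALYZED, whereas ANALYZED would break it into "filename" and "txt".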
Searching an Index
IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(matchVersion, field_name, analyzer);
Query query = parser.parse(WORD_SEARCHED);
TopDocs hits = searcher.search(query, noOfHits);
ScoreDoc[] scoreDocs = hits.scoreDocs;
// look at the first match (assumes at least one hit)
Document doc = searcher.doc(scoreDocs[0].doc);
System.out.println("name=" + doc.get("name"));
searcher.close();
Core searching classes
IndexSearcher
Query
QueryParser
TopDocs
ScoreDoc
IndexSearcher
Constructors:
– IndexSearcher(Directory d);   // Deprecated
– IndexSearcher(IndexReader r);
  • Construct an IndexReader with the static method
    IndexReader.open(dir)
Query
• TermQuery
  – Constructed from a Term
• TermRangeQuery
• NumericRangeQuery
• PrefixQuery
• BooleanQuery
• PhraseQuery
• WildcardQuery
• FuzzyQuery
• MatchAllDocsQuery
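What some of these query types test for can be sketched at the level of a single indexed term. The following plain-Java helpers are illustrative assumptions, not Lucene's classes: TermMatchSketch is a hypothetical name, the wildcard helper supports only '*' (any sequence) and '?' (exactly one character), and the range helper is an inclusive lexicographic comparison.

```java
import java.util.regex.Pattern;

// Term-level matching rules behind some query types; NOT Lucene's classes.
class TermMatchSketch {
    // TermQuery-like: the indexed term must equal the query term exactly.
    static boolean term(String query, String indexedTerm) {
        return indexedTerm.equals(query);
    }

    // PrefixQuery-like: the indexed term must start with the prefix.
    static boolean prefix(String prefix, String indexedTerm) {
        return indexedTerm.startsWith(prefix);
    }

    // WildcardQuery-like: '*' matches any sequence, '?' matches one character.
    static boolean wildcard(String pattern, String indexedTerm) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL)
                .matcher(indexedTerm).matches();
    }

    // TermRangeQuery-like: inclusive lexicographic range over terms.
    static boolean range(String lower, String upper, String indexedTerm) {
        return indexedTerm.compareTo(lower) >= 0 && indexedTerm.compareTo(upper) <= 0;
    }
}
```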
QueryParser
• Constructor
  – QueryParser(Version matchVersion,
                String defaultField,
                Analyzer analyzer);
• Parsing methods
  – Query parse(String query) throws
    ParseException;
  – ... and many more
QueryParser syntax examples
Query expression                     Document matches if…
java                                 Contains the term java in the default field
java junit                           Contains the term java or junit or both in the default field
java OR junit                          (the default operator can be changed to AND)
+java +junit                         Contains both java and junit in the default field
java AND junit
title:ant                            Contains the term ant in the title field
title:extreme -subject:sports        Contains extreme in the title and not sports in subject
(agile OR extreme) AND java          Boolean expression matches
title:"junit in action"              Phrase matches in title
title:"junit action"~5               Proximity matches (within 5) in title
java*                                Wildcard matches
java~                                Fuzzy matches
lastmodified:[1/1/09 TO 12/31/09]    Range matches
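The required (+), prohibited (-), and optional forms in the table can be illustrated with a toy matcher. This is a minimal sketch, not Lucene's QueryParser: BooleanMatchSketch is a hypothetical class that handles only whitespace-separated single terms against one set of document terms, with OR as the default operator.

```java
import java.util.Set;

// Toy evaluator for "+required -prohibited optional" terms; NOT Lucene's QueryParser.
class BooleanMatchSketch {
    // Returns true if a document with the given terms matches the expression.
    static boolean matches(String expression, Set<String> docTerms) {
        boolean sawRequired = false;
        boolean sawOptional = false;
        boolean optionalHit = false;
        for (String clause : expression.trim().split("\\s+")) {
            if (clause.isEmpty()) continue;
            if (clause.startsWith("+")) {
                // Required term: the document must contain it.
                sawRequired = true;
                if (!docTerms.contains(clause.substring(1))) return false;
            } else if (clause.startsWith("-")) {
                // Prohibited term: the document must not contain it.
                if (docTerms.contains(clause.substring(1))) return false;
            } else {
                // Optional term: counts only when nothing is required.
                sawOptional = true;
                if (docTerms.contains(clause)) optionalHit = true;
            }
        }
        // Without a required term, at least one optional term must match.
        if (!sawRequired) return sawOptional && optionalHit;
        return true;
    }
}
```

For a document containing the terms java and lucene, "java junit" matches (one optional term is present), "+java +junit" does not (junit is missing), and "java -lucene" does not (lucene is prohibited).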
TopDocs
Class containing the top N ranked documents (hits)
that match a given query.

ScoreDoc
A single hit: the id of a matching document (doc) and its score.
TopDocs.scoreDocs is an array of ScoreDoc, one per returned hit.
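The relationship between the two classes can be sketched with plain-Java stand-ins. These are assumptions for illustration, not Lucene's classes: Hit plays the role of ScoreDoc, Top plays the role of TopDocs, and topN is a hypothetical helper that keeps the n best-scoring hits (n is assumed to be non-negative).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy stand-ins that mirror the shape of TopDocs and ScoreDoc; NOT Lucene's classes.
class TopDocsSketch {
    static final class Hit {            // plays the role of ScoreDoc
        final int doc;                  // document id
        final float score;              // relevance score
        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    static final class Top {            // plays the role of TopDocs
        final int totalHits;            // number of all matching documents
        final List<Hit> scoreDocs;      // only the best n, highest score first
        Top(int totalHits, List<Hit> scoreDocs) {
            this.totalHits = totalHits;
            this.scoreDocs = scoreDocs;
        }
    }

    // Keeps the n highest-scoring hits, ordered by descending score.
    static Top topN(List<Hit> allHits, int n) {
        List<Hit> sorted = new ArrayList<>(allHits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int limit = Math.min(n, sorted.size());
        return new Top(allHits.size(), new ArrayList<>(sorted.subList(0, limit)));
    }
}
```

Note that totalHits counts every match, while scoreDocs holds at most n entries, so the two numbers differ whenever more than n documents match.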
Demo of simple indexing and searching
using Apache Lucene

You will require lucene-core-x.y.jar for this demo.
Any questions?
Thank You.
