Enterprise Search platform
Building solid scalable enterprise search REST services on top of Apache Lucene




Tommaso Teofili
Agenda

• Apache Lucene overview


• Why do we need Apache Solr?


• Everyman tales from Solr


• Enterprise what?


• One step beyond...
Apache Lucene overview

• Information Retrieval library


• Inverted indexes are quick and efficient


• Vector space model


• Advanced search options (synonyms, stopwords, similarity, proximity)


• Different language implementations (Java, .NET, C, Python)
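To make the inverted index idea concrete, here is a minimal Python sketch. It is a toy model only, not Lucene's data structure or API: the function names and the sample documents are made up for illustration, and tokenization is a plain whitespace split.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Return the ids of the documents that contain every query term (AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Apache Lucene is an information retrieval library",
    2: "Apache Solr is a search server built on Lucene",
    3: "Jetty is a servlet container",
}
index = build_inverted_index(docs)
print(sorted(search_all(index, "apache lucene")))  # [1, 2]
```

A lookup touches only the posting sets of the query terms, never the full text of every document, which is why this structure is quick to search.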
The Lucene API

• Lucene indexes are built on a Directory


• Directory can be accessed by IndexReaders and IndexWriters


• IndexSearchers are built on top of Directories and IndexReaders


• IndexWriters can write Documents inside the index


• Documents are made of Fields


• Fields have value(s) and options


• Directory > IndexReader/Writer > Document > Field
Indexing Lucene

• A Lucene index has one or more segments and a generation


• Changes to the index must be committed (and, optionally, optimized)


• No fixed schema


• Each field can be STORED, INDEXED and ANALYZED


• Each field can have NORMS and TERM VECTORS
Searching Lucene

• Open an IndexSearcher on top of an IndexReader over a Directory


• Many query types: TermQuery, MultiTermQuery, BooleanQuery,
  WildcardQuery, PhraseQuery, PrefixQuery, MultiPhraseQuery, FuzzyQuery,
  TermRangeQuery, NumericRangeQuery


• Get results from a TopDocs object
Why do we need Apache Solr?

• Lucene is a library

• Lucene by itself can only be queried programmatically

• Often the search system has to be totally independent from other
  systems (e.g., a CMS)

• A ready to deploy search server is what you need

• Need to scale both vertically and horizontally
The Solar System
Everyman tales with Solr
Apache Solr - Overview

• Ready to use enterprise search server


• REST (and programmatic) API


• Results in XML, JSON, PHP, Ruby, etc...


• Exploit Lucene power


• Scaling capabilities (replication, distributed search)


• Easy administration interface


• Easy to extend and customize (plugin architecture)
Apache Solr - Project status

• Latest release: 1.4.1 (June 2010)


• Lots of new features on trunk


• Most of the new features are on branch 3.0


• A huge, very active community


• Lucid Imagination powered project
Solr - 5 minutes tutorial

• Download latest release (1.4.1)


• cd $SOLR_HOME/example


• java -jar -server start.jar


• You now have a running Solr instance that you can access via http://localhost:8983/solr
  (this runs on top of Jetty)


• cd $SOLR_HOME/example/exampledocs


• Index with the command: sh post.sh *.xml


• Search with your browser
Solr - Query syntax

• Default operator is OR (override it by adding &q.op=AND to the HTTP request)


• You can query fields with fieldname:value


• Common + - AND OR NOT modifiers


• Range queries on date or numeric fields timestamp:[* TO NOW]


• Boost terms, e.g.: roma^2 inter


• Fuzzy search roam~0.6


• ...
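Query strings like the ones above contain characters that must be escaped before they go into a URL. A minimal Python sketch, assuming the example instance from the tutorial at localhost:8983 and its /select handler; the helper name select_url is made up for illustration.

```python
from urllib.parse import urlencode

# Assumed base URL of the example Solr instance started in the tutorial.
SOLR_SELECT = "http://localhost:8983/solr/select"

def select_url(query, **params):
    """Build a /select URL; urlencode escapes spaces, colons, brackets, etc."""
    return SOLR_SELECT + "?" + urlencode({"q": query, **params})

# Override the default OR operator for this request only.
print(select_url("roma inter", **{"q.op": "AND"}))
# Range query on a date field.
print(select_url("timestamp:[* TO NOW]"))
# Boosted term and fuzzy term.
print(select_url("roma^2 inter"))
print(select_url("roam~0.6"))
```

The parameter q.op is passed through a dictionary because it is not a valid Python keyword argument name.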
Solr - Basic configuration steps
• Define fields, types and analysis inside schema.xml


• Play with solrconfig.xml:


    • request handlers (update, search)


    • index parameters


    • caches


    • deletion policy


    • autowarming


    • replication, clustering, etc...
Solr - schema.xml

• Types


• Analyzers to use for each type


• Fields with name, type and options


• Unique key


• Dynamic fields


• Copy fields


• Don’t use the default schema.xml, write it from scratch!
Solr - Type definition

• Analyzers for querying and indexing inside the schema
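As a rough illustration of how these pieces fit together, the Python sketch below embeds a small schema and reads it back with the standard library. The schema is hypothetical: the field names (id, title, text) and the type names are assumptions, while the solr.* class names are ones shipped with Solr. It is a sketch of the layout, not a recommended schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal schema; field and type names are illustrative only.
SCHEMA_XML = """
<schema name="sketch" version="1.2">
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="text" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <copyField source="title" dest="text"/>
</schema>
"""

schema = ET.fromstring(SCHEMA_XML)
# Field name -> type name for the explicitly declared fields.
fields = {f.get("name"): f.get("type") for f in schema.iter("field")}
# The "text" type declares one analyzer for indexing and one for querying.
analyzers = [a.get("type") for a in schema.iter("analyzer")]
unique_key = schema.findtext("uniqueKey")
print(fields, analyzers, unique_key)
```

Keeping the index and query analyzers identical, as here, is the simplest choice; they are declared separately so that they can differ, for example to expand synonyms on one side only.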
Solr - solrconfig.xml

• Where Solr will write the index


• Index merge factor


• Control different caches: documents, query results, filters


• Request handlers available to consume (HTTP) requests, typically at least a (standard)
  search and an update handler exist


• Update request processor chains to configure indexing behavior


• Event listeners (newSearcher, firstSearcher)


• and much more...
Solr - Indexing

• Index updates are sent as XML commands via HTTP POST


• <add> to insert and update


• <delete> to remove by unique key or by query
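The XML commands can be generated with any XML library. A minimal Python sketch that only builds the message bodies (it does not send them); the helper names and the sample field names are assumptions. In Solr's XML update format the removal command is the <delete> element, wrapping either an <id> or a <query>.

```python
import xml.etree.ElementTree as ET

def add_message(doc):
    """Serialize one document (dict of field name -> value) as an <add> command."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def delete_by_id(doc_id):
    """Build a <delete> command that removes one document by unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")

def delete_by_query(query):
    """Build a <delete> command that removes every document matching a query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")

# Hypothetical document; "id" and "title" are illustrative field names.
print(add_message({"id": "doc1", "title": "Hello Solr"}))
print(delete_by_id("doc1"))
print(delete_by_query("category:old"))
```

Each body would then be POSTed to the update handler, followed by a commit so that the change becomes visible to searches.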
Solr - Searching

• HTTP GET to the Solr instance with the mandatory q parameter, which specifies the
  query


• df - the default field to query


• fl - the list of fields to return (stored fields only)


• sort - fields used for sorting, defaults to score (which is not a field)


• start, rows - paging attributes


• wt - response type, defaults to xml, can be json, php, ruby, etc.
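The parameters above combine into a single GET request. A minimal Python sketch, assuming the example instance at localhost:8983; the helper name search_url and its page argument are made up, and paging is expressed by computing start from a zero-based page number.

```python
from urllib.parse import urlencode

def search_url(q, page=0, rows=10, fields=("id", "score"), sort="score desc",
               df=None, wt="json", base="http://localhost:8983/solr/select"):
    """Build a search URL; start is derived from a zero-based page number."""
    params = {
        "q": q,
        "start": page * rows,      # offset of the first result to return
        "rows": rows,              # page size
        "fl": ",".join(fields),    # stored fields (and score) to return
        "sort": sort,
        "wt": wt,                  # response writer
    }
    if df is not None:
        params["df"] = df          # default field for unqualified terms
    return base + "?" + urlencode(params)

# Third page of ten results for the query "solr".
print(search_url("solr", page=2))
```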
Solr - Data import

• Typically “old” systems rely on databases


• Data can be imported from DBs using the DataImportHandler component


• Define datasource, driver and mappings
Solr - Highlighting

• Useful when a snippet of the search results is needed


• In Solr 1.4.1 only stored fields can be highlighted


• Add &hl=true&hl.fl=field1,field2 to HTTP search request in order to enable
  highlighting on field1 and field2
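With highlighting enabled, the snippets arrive in a separate section of the response, keyed by each document's unique key. A minimal Python sketch of reading that section from a JSON response; the sample response body, the document id and the field name are hypothetical.

```python
import json

# Hypothetical response body for q=solr&hl=true&hl.fl=title&wt=json
SAMPLE_RESPONSE = json.loads("""
{
  "response": {"numFound": 1, "start": 0, "docs": [{"id": "doc1"}]},
  "highlighting": {"doc1": {"title": ["Apache <em>Solr</em> overview"]}}
}
""")

def snippets(response, field):
    """Map each document id to its highlighted snippets for one field."""
    highlighting = response.get("highlighting", {})
    return {doc_id: per_field.get(field, [])
            for doc_id, per_field in highlighting.items()}

print(snippets(SAMPLE_RESPONSE, "title"))
```

A document with no match in the requested field simply gets an empty list, so the caller can fall back to a stored field value.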
Solr - Faceting

• Break up search results into multiple categories showing counts for each


• Often used in stores


• Can be very useful in guiding user experience


• Users can then drill down into the results of a single category
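Field faceting is requested with facet=true&facet.field=<field>. In a JSON response the counts for a field come back as a flat list that alternates values and counts. A minimal Python sketch of turning that list into pairs and of building a drill-down filter; the sample data, the field name category and the helper names are assumptions.

```python
# Hypothetical facet section of a wt=json response to
# q=*:*&facet=true&facet.field=category
SAMPLE_RESPONSE = {
    "facet_counts": {
        "facet_queries": {},
        "facet_fields": {"category": ["electronics", 14, "memory", 3, "music", 1]},
    }
}

def facet_counts(response, field):
    """Pair up the flat [value, count, value, count, ...] list for one field."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))

def drill_down(field, value):
    """Filter query that restricts the results to one facet value."""
    return '%s:"%s"' % (field, value)

for value, count in facet_counts(SAMPLE_RESPONSE, "category"):
    print(value, count, drill_down("category", value))
```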
Solr - Filter queries

• Queries used as filters against the actual query


• Define document superset without influencing score


• Useful for domain specific queries where you want the user to search only in
  certain “areas” of the index


• Add &fq=somefilterquery with the default Solr syntax
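The fq parameter can be repeated, one filter per occurrence. A minimal Python sketch, assuming the example instance at localhost:8983; the helper name and the filter fields (category, inStock) are hypothetical.

```python
from urllib.parse import urlencode

def filtered_url(q, filters, base="http://localhost:8983/solr/select"):
    """Build a search URL with one fq parameter per filter query."""
    params = [("q", q)] + [("fq", fq) for fq in filters]
    return base + "?" + urlencode(params)

# The main query is scored; the two filters only restrict the candidate set.
print(filtered_url("ipod", ["category:electronics", "inStock:true"]))
```

A list of pairs is used instead of a dictionary because a dictionary cannot hold the same key twice.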
Solr - Enterprise what?

• Multicore


• Replication


• Distributed search


• ...
Solr - Multi core

• Define multiple Solr cores inside a single Solr instance


• Each core maintains its own index


• Unified administration interface


• Runtime commands to create, swap, load, unload, delete cores
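The runtime commands are plain HTTP requests to the core admin handler. A minimal Python sketch that builds such URLs, assuming the handler is exposed at /admin/cores (the path is configurable) on the example instance; the core names and the helper name are made up.

```python
from urllib.parse import urlencode

# Assumed location of the core admin handler on the example instance.
CORE_ADMIN = "http://localhost:8983/solr/admin/cores"

def core_admin_url(action, **params):
    """Build a core admin URL for the given action and its parameters."""
    return CORE_ADMIN + "?" + urlencode({"action": action, **params})

print(core_admin_url("STATUS"))
print(core_admin_url("CREATE", name="core1", instanceDir="core1"))
print(core_admin_url("SWAP", core="core0", other="core1"))
print(core_admin_url("UNLOAD", core="core1"))
```

Swapping is handy for rebuilding an index offline in a spare core and then switching it with the live one in a single request.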
Solr - Replication

• Under high traffic it is useful to replicate a Solr instance and split the
  queries among the replicas (possibly with a load balancer in front)


• The master holds the original index


• Each slave polls the master for the latest index version


• A slave with an older version asks the master for the difference
  (rsync-like)


• In the meantime, the indexes remain available
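The polling step can be pictured with a toy model in Python. This is only a sketch of the idea of transferring the difference, not Solr's replication code: the function name, the version numbers and the file names are all made up.

```python
def files_to_fetch(master_version, master_files, slave_version, slave_files):
    """Return the index files a slave would need to download in this toy model."""
    if slave_version >= master_version:
        return set()  # the slave is up to date, nothing to transfer
    # Only files the slave does not already hold are transferred.
    return set(master_files) - set(slave_files)

# Hypothetical state: the master has committed a new segment since the last poll.
master = {"_1.cfs", "_2.cfs", "segments_5"}
slave = {"_1.cfs", "segments_4"}
print(sorted(files_to_fetch(5, master, 4, slave)))  # ['_2.cfs', 'segments_5']
```

The slave keeps serving queries from its current index while the missing files are copied, and switches to the new version once the transfer is complete.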
Solr - Distributed search

• When an index is too large, in terms of space or memory required, it can be
  useful to define two or more shards


• A shard is a Solr instance and can be searched or indexed independently


• It is also possible to query all the shards at once; the result is merged
  from the sub-results of each shard


• http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=category:information


• Note that the distribution of documents among the indexes is up to the user
  (or whoever feeds the indexes)
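Since the distribution of documents is left to whoever feeds the indexes, a common approach is to hash the unique key. The Python sketch below is one possible scheme, not something Solr prescribes: the helper names and the choice of CRC32 are assumptions, and the shard addresses are the two from the example URL above.

```python
import zlib
from urllib.parse import urlencode

# The two shard addresses used in the example request.
SHARDS = ["localhost:8983/solr", "localhost:7574/solr"]

def shard_for(doc_id, shards=SHARDS):
    """Pick a shard deterministically from the document's unique key."""
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]

def distributed_url(q, shards=SHARDS):
    """Build a request to the first shard that fans out to all shards."""
    params = {"shards": ",".join(shards), "indent": "true", "q": q}
    # Keep ':', '/' and ',' unescaped so the URL stays readable.
    return "http://" + shards[0] + "/select?" + urlencode(params, safe=":/,")

print(shard_for("doc1"))
print(distributed_url("category:information"))
```

A stable hash matters here: an update or delete for a given key must always be sent to the shard that holds the earlier version of that document.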
One step beyond...

• Solr in the cloud


• Spatial search


• Solr & UIMA :-)
References

• http://lucene.apache.org/solr/


• http://lucene.apache.org/solr/tutorial.html


• http://wiki.apache.org/solr/FrontPage
