Introduction to
Apache Lucene/Solr
April 2014 HDSG Meetup
Rahul Jain
@rahuldausa
Who am I?
• Software Engineer @ IVY Comptech, Hyderabad
• 7 years of programming learning experience
• Built a platform to search logs in near real time at a volume of 1 TB/day #
• Worked on Solr-search-based SEO/SEM software handling 40 billion records/month (Topic of next talk?)
• Areas of expertise/interest
– High traffic web applications
– JAVA/J2EE
– Big data, NoSQL
– Information retrieval, machine learning
# http://www.slideshare.net/lucenerevolution/building-a-near-real-time-search-engine-analytics-for-logs-using-solr
Agenda
• IR Overview
• Basic Concepts
• Lucene
• Solr
• Use-cases
• Solr In Action (demo)
• Q&A
Information Retrieval (IR)
“Information retrieval is the activity of
obtaining information resources (in the
form of documents) relevant to an
information need from a collection of
information resources. Searches can
be based on metadata or on full-text
(or other content-based) indexing”
- Wikipedia
Basic Concepts
• tf (t in d): term frequency in a document
– a measure of how often a term appears in the document
– the number of times term t appears in the currently scored document d
• idf (t): inverse document frequency
– a measure of whether the term is common or rare across all documents, i.e. how often the term appears across the index
– obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient
• boost (index): boost of the field at index time
• boost (query): boost of the field at query time
Basic Concepts
TF-IDF
TF-IDF = Term Frequency × Inverse Document Frequency
Credit: http://whatisgraphsearch.com/
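The tf and idf definitions from the Basic Concepts slides can be tried out on a toy corpus. The sketch below follows the textbook definitions given on the slides (raw count for tf, log of total documents over documents containing the term for idf); the corpus and the function names are made up for illustration, and Lucene's own similarity classes apply further smoothing and normalization that are not shown here.

```python
import math
from collections import Counter

# Toy corpus; the documents are invented for illustration.
docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "the quick dog jumps",
}

def tf(term, text):
    """Raw term frequency: how many times `term` occurs in `text`."""
    return Counter(text.split())[term]

def idf(term, corpus):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for text in corpus.values() if term in text.split())
    if containing == 0:
        return 0.0  # term absent from the corpus; avoid dividing by zero
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_id, corpus):
    """TF-IDF = term frequency x inverse document frequency."""
    return tf(term, corpus[doc_id]) * idf(term, corpus)

for term in ("the", "quick", "fox"):
    print(term, round(tf_idf(term, "d1", docs), 3))
```

A term such as "the", which appears in every document, gets an idf of zero and so contributes nothing to the score, while a rare term such as "fox" scores highest.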
Apache Lucene
Apache Lucene
• Fast, high performance, scalable search/IR library
• Open source
• Initially developed by Doug Cutting (also the creator of Hadoop)
• Indexing and searching
• Inverted index of documents
• Provides advanced search options such as synonyms, stopwords, similarity-based matching, and proximity search
• http://lucene.apache.org/
Lucene Internals - Inverted Index
Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
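An inverted index maps each term to the set of documents that contain it, so a lookup goes from term to documents rather than scanning every document. The following is a minimal sketch of that idea, not Lucene's actual on-disk structure; the sample documents and the helper names `build_inverted_index` and `search_all` are assumptions made for illustration.

```python
from collections import defaultdict

# Sample documents, invented for illustration.
docs = {
    1: "apache lucene is a search library",
    2: "apache solr is built on lucene",
    3: "solr is a search platform",
}

def build_inverted_index(corpus):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, *terms):
    """Return sorted ids of documents that contain every given term (AND)."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

index = build_inverted_index(docs)
print(search_all(index, "lucene"))
print(search_all(index, "solr", "search"))
```

A multi-term AND query is then just the intersection of the posting sets for each term.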
Lucene Internals (Contd.)
• Defines a document model
• An index contains documents.
• Each document consists of fields.
• Each field has attributes.
– What is the data type (FieldType)
– How to handle the content (Analyzers, Filters)
– Is it a stored field (stored="true") or an indexed field (indexed="true")
Indexing Pipeline
• Analyzer: creates tokens using a Tokenizer and/or by applying Filters (Token Filters)
• Each field can define an Analyzer for index time, for query time, or for both.
Credit: http://www.slideshare.net/otisg/lucene-introduction
Analysis Process - Tokenizer
WhitespaceAnalyzer
Simplest built-in analyzer
The quick brown fox jumps over the lazy dog.
[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
Tokens
Analysis Process - Tokenizer
SimpleAnalyzer
Lowercases and splits at non-letter boundaries
The quick brown fox jumps over the lazy dog.
[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
Tokens
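The two analyzer slides can be mimicked with a small tokenizer-plus-filters chain. This is only a Python imitation of the behavior shown above, not Lucene's Java API: the function names are hypothetical, and the letter-based tokenizer is an assumption about how splitting "at non-letter boundaries" works.

```python
import re

def whitespace_tokenizer(text):
    """Split on whitespace only; punctuation stays attached to tokens."""
    return text.split()

def letter_tokenizer(text):
    """Keep maximal runs of letters; digits and punctuation act as separators."""
    return re.findall(r"[^\W\d_]+", text)

def lowercase_filter(tokens):
    """Token filter that lowercases every token."""
    return [token.lower() for token in tokens]

def analyze(text, tokenizer, filters=()):
    """Run the tokenizer, then apply each token filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

sentence = "The quick brown fox jumps over the lazy dog."
whitespace_tokens = analyze(sentence, whitespace_tokenizer)
simple_tokens = analyze(sentence, letter_tokenizer, [lowercase_filter])
print(whitespace_tokens)
print(simple_tokens)
```

The first call keeps "The" capitalized and leaves the period on "dog.", as on the WhitespaceAnalyzer slide; the second lowercases everything and drops the period, as on the SimpleAnalyzer slide.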
Apache Solr
Apache Solr
• Created by Yonik Seeley for CNET
• Enterprise Search platform for Apache Lucene
• Open source
• Highly reliable, scalable, fault tolerant
• Supports distributed indexing (SolrCloud), replication, and load-balanced querying
• http://lucene.apache.org/solr
High level overview
Source: http://www.slideshare.net/erikhatcher/solr-search-at-the-speed-of-light
Apache Solr - Features
• full-text search
• faceted search (similar to a GROUP BY clause in an RDBMS)
• scalability
– caching
– replication
– distributed search
• near real-time indexing
• geospatial search
• and many more : highlighting, database integration, rich document
(e.g., Word, PDF) handling
How to start
It’s very easy.
1. Start Solr
    java -jar start.jar
2. Index your data
    java -jar post.jar *.xml
3. Search
    http://localhost:8983/solr
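Once Solr is running, step 3 can also be done from code over plain HTTP. The sketch below builds a request URL for the `/select` handler with the `q`, `wt` and `rows` parameters; the core name `collection1` is an assumption (it is the default in the Solr 4.x example setup, but yours may differ), and the `search` helper is hypothetical and only works against a running instance.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_BASE = "http://localhost:8983/solr"  # address from the slide

def build_select_url(query, core="collection1", rows=10):
    """Build a /select URL; `core` is an assumed core name, adjust as needed."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{SOLR_BASE}/{core}/select?{params}"

def search(query, core="collection1", rows=10):
    """Send the query to a running Solr instance and parse the JSON response."""
    with urlopen(build_select_url(query, core, rows)) as response:
        return json.load(response)

print(build_select_url("title:solr"))
```

Calling `search("title:solr")` would return the parsed response as a dictionary, assuming Solr is up and the core exists.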
Solr APIs
• HTTP GET/POST
• JSON/XML
• Clients
– SolrJ (embedded or HTTP)
– solr-ruby
– python, PHP, solrsharp
Solr – schema.xml
• Types, with index-time and query-time analyzers (similar to data types)
• Fields, with name, type and options
• Unique Key: unique identifier of a document, e.g. “id”
• Dynamic Fields: allow Solr to index fields that you did not explicitly define in your schema, e.g. fieldName: *_i or *_txts
• Copy Fields: a mechanism for making copies of fields so that you can apply several distinct field types to a single piece of incoming information. Field ‘a’ populates field ‘b’ with its value before tokenizing, so that ‘b’ can use a different analyzer/filter.
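How dynamic fields and copy fields behave can be illustrated with a small model of the schema lookup. This is a sketch of the concept only, not Solr's implementation: the field names, the type names, the first-match-wins rule for patterns, and both helper functions are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical schema, loosely modelled on schema.xml concepts.
EXPLICIT_FIELDS = {"id": "string", "title": "text_general"}
DYNAMIC_FIELDS = [("*_i", "int"), ("*_s", "string")]  # (pattern, field type)
COPY_FIELDS = [("title", "text")]                      # (source, destination)

def resolve_field_type(name):
    """Explicit fields win; otherwise the first matching dynamic pattern applies."""
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None  # unknown field: not covered by this schema

def apply_copy_fields(doc):
    """Copy raw source values into destination fields before any analysis.

    Assumes each destination is multi-valued and holds a list.
    """
    out = dict(doc)
    for source, dest in COPY_FIELDS:
        if source in doc:
            out.setdefault(dest, []).append(doc[source])
    return out

print(resolve_field_type("price_i"))
print(apply_copy_fields({"id": "1", "title": "Intro to Solr"}))
```

Because the copy happens on the raw value, the destination field is free to run its own analyzer over the same text.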
Solr – Content Analysis
• Field Attributes
– Name: name of the field
– Type: data type (FieldType) of the field
– Indexed: should it be indexed (indexed="true/false")
– Stored: should it be stored (stored="true/false")
– Required: is it a mandatory field (required="true/false")
– Multi-Valued: can it contain multiple values, e.g. text: pizza, food (multiValued="true/false")
e.g. <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
Solr – solrconfig.xml
• Data dir: where all index data will be stored
• Index configuration
• Cache configurations
• Request Handler configuration
• Search components, response writers, query
parsers
Query Types
• Single and multi-term queries
– e.g. fieldname:value or title:(software engineer)
• +, -, AND, OR, NOT operators
– e.g. title:(software AND engineer)
• Range queries on date or numeric fields
– e.g. timestamp:[* TO NOW] or price:[1 TO 100]
• Boost queries
– e.g. title:Engineer^1.5 OR text:Engineer
• Fuzzy search: a search for words that are similar in spelling
– e.g. roam~0.8 => noam
• Proximity search: uses a sloppy phrase query. The closer together the two terms appear, the higher the score.
– e.g. "apache lucene"~20 looks for all documents where the word “apache” occurs within 20 words of “lucene”
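Fuzzy and proximity matching both rest on simple distance measures, which can be sketched in a few lines. The code below computes the Levenshtein edit distance between two words and checks whether two terms occur within N positions of each other. It mirrors the wording on the slide rather than Lucene's exact fuzzy-similarity or slop calculations, and both function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def within_n_words(tokens, first, second, n):
    """True if `first` and `second` occur at most `n` positions apart."""
    pos_first = [i for i, token in enumerate(tokens) if token == first]
    pos_second = [i for i, token in enumerate(tokens) if token == second]
    return any(abs(i - j) <= n for i in pos_first for j in pos_second)

tokens = "apache solr is built on lucene".split()
print(edit_distance("roam", "noam"))
print(within_n_words(tokens, "apache", "lucene", 20))
```

"roam" and "noam" differ by a single substitution, which is why a fuzzy query can treat them as similar; in the sample sentence "apache" and "lucene" are five positions apart, so they fall inside a window of 20 but not a window of 3.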
Solr/Lucene Use-cases
• Search
• Analytics
• NoSQL datastore
• Auto-suggestion / Auto-correction
• Recommendation Engine (MoreLikeThis)
• Relevancy Engine (Feedback to other applications)
• Solr as a White-List
• GeoSpatial based Search
Search
• Application
– Eclipse, Hibernate search
• E-Commerce :
– Flipkart.com, Infibeam.com, Buy.com, Netflix.com, ebay.com
• Jobs
– Indeed.com, Simplyhired.com, Naukri.com
• Auto
– AOL.com
• Travel
– Cleartrip.com
• Social Network
– Twitter.com, LinkedIn.com, mylife.com
Source: http://www.quora.com/Which-major-companies-are-using-Solr-for-search
Search (Contd.)
• Search Engine
– Yandex.ru, DuckDuckGo.com
• Newspaper
– Guardian.co.uk
• Music/Movies
– Apple.com, Netflix.com
• Events
– Stubhub.com, Eventbrite.com
• Cloud Log Management
– Loggly.com
• Others
– Whitehouse.gov
Faceting
• Grouping results based on field value
• Facet on: field terms, queries, date ranges
• &facet=on&facet.field=job_title&facet.query=salary:[30000 TO 100000]
• http://wiki.apache.org/solr/SimpleFacetParameters
Source: www.career9.com, www.indeed.com
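What the `facet.field` and `facet.query` parameters return can be pictured as counts computed over the matching documents. The sketch below reproduces that counting in plain Python on invented job records; it illustrates the idea of faceting only and says nothing about how Solr computes facets internally. The helper names and sample data are assumptions.

```python
from collections import Counter

# Invented result set standing in for documents matched by a search.
jobs = [
    {"job_title": "Software Engineer", "salary": 90000},
    {"job_title": "Software Engineer", "salary": 120000},
    {"job_title": "Data Analyst", "salary": 45000},
    {"job_title": "Data Analyst", "salary": 20000},
    {"job_title": "Technical Writer", "salary": 60000},
]

def facet_field(results, field):
    """Count results per distinct value of `field` (like facet.field)."""
    return Counter(doc[field] for doc in results if field in doc)

def facet_query_range(results, field, low, high):
    """Count results whose `field` lies in [low, high] (like a range facet.query)."""
    return sum(1 for doc in results if field in doc and low <= doc[field] <= high)

print(facet_field(jobs, "job_title"))
print(facet_query_range(jobs, "salary", 30000, 100000))
```

The field facet gives one count per job title, and the query facet gives a single count for the salary range 30000 to 100000, matching the two parameters shown on the slide.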
Analytics
• Analytics source: Kibana.org, based on ElasticSearch and Logstash
• Image source: http://semicomplete.com/presentations/logstash-monitorama-2013/#/8
Autosuggestion
Source: www.drupal.org, www.yelp.com
Integration
• Clustering (Solr-Carrot2)
• Named Entity extraction (Solr-UIMA)
• SolrCloud (Solr-Zookeeper)
• Parsing of many Different File Formats (Solr-Tika)
• Machine Learning/Data Mining (Apache Mahout)
• Large scale Indexing (Hadoop)
References
• http://en.wikipedia.org/wiki/Tf%E2%80%93idf
• http://lucene.apache.org/core/4_5_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
• http://www.quora.com/Which-major-companies-are-using-Solr-for-search
• http://marc.info/?l=solr-user&m=137271228610366&w=2
• http://java.dzone.com/articles/apache-solr-get-started-get
Solr/Lucene Meetup
• Building Big Data Analytics Platforms using Elasticsearch
(Kibana)
• Saturday, April 19, 2014 10:00 AM
• IIIT Hyderabad
• URL: http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/events/150134392/
OR
• Search on Google …
Thanks!
@rahuldausa on Twitter and SlideShare
http://www.linkedin.com/in/rahuldausa
Find this interesting?
Join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
