Lucene for Solr Developers
uberconf - July 14, 2011
Presented by Erik Hatcher
erik.hatcher@lucidimagination.com
Lucid Imagination
http://www.lucidimagination.com
Lucene Core
• IndexWriter
• Directory
• IndexReader, IndexSearcher
• analysis: Analyzer, TokenStream,
  Tokenizer, TokenFilter
• Query
Solr Architecture
Customizing - Don't do it!

•   Unless you need to.
•   In other words... ensure you've given the built-in
    capabilities a try, asked on the e-mail list, and
    spelunked into at least Solr's code a bit to make
    some sense of the situation.
•   But we're here to roll up our sleeves, because we
    need to...
But first...
•   Look at Lucene and/or Solr source code as
    appropriate

•   Carefully read javadocs and wiki pages - lots of tips
    there

•   And, hey, search for what you're trying to do...

    •   Google, of course

    •   But try out LucidFind and other Lucene ecosystem
        specific search systems -
        http://www.lucidimagination.com/search/
Extension points
•   Tokenizer, TokenFilter, CharFilter

•   SearchComponent

•   RequestHandler

•   ResponseWriter

•   FieldType

•   Similarity

•   QParser

•   DataImportHandler hooks

    •   data sources

    •   entity processors

    •   transformers

•   several others
Factories
• FooFactory (most) everywhere.
  Sometimes there's BarPlugin style

• for sake of discussion... let's just skip the
  "factory" part
• In Solr, Factories and Plugins are used by
  configuration loading to parameterize and
  construct the underlying components
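
The factory idea can be modeled in a few lines of plain Java. This is only a sketch of the pattern, not Solr's plugin classes: `PrefixFilterFactory`, its `init(Map)` method, and the `prefix` argument are hypothetical names chosen for illustration. The config loader hands the factory its arguments once, then asks it for parameterized instances.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of the factory pattern; none of these names are Solr classes.
class PrefixFilterFactory {
    private String prefix = "";

    // Called once by the (imagined) config loader with the element's arguments.
    void init(Map<String, String> args) {
        prefix = args.getOrDefault("prefix", "");
    }

    // Called whenever a new, already-parameterized instance is needed.
    Function<String, String> create() {
        final String p = prefix;
        return token -> p + token;
    }

    // Demonstration helper: configure a factory, build an instance, apply it.
    static String demo(String prefixArg, String token) {
        Map<String, String> args = new HashMap<>();
        args.put("prefix", prefixArg);
        PrefixFilterFactory factory = new PrefixFilterFactory();
        factory.init(args);
        return factory.create().apply(token);
    }
}
```

Keeping configuration in the factory means the constructed object itself stays simple and can be created many times from one config entry.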
"Installing" plugins
• Compile .java to .class, JAR it up
• Put JAR files in either:
 • <solr-home>/lib
 • a shared lib when using multicore
 • anywhere, and register location in
    solrconfig.xml
• Hook in plugins as appropriate
Multicore sharedLib

<solr sharedLib="/usr/local/solr/customlib"
       persistent="true">
   <cores adminPath="/admin/cores">
      <core instanceDir="core1" name="core1"/>
      <core instanceDir="core2" name="core2"/>
   </cores>
</solr>
Plugins via solrconfig.xml

• <lib dir="/path/to/your/custom/jars" />
Analysis

• CharFilter
• Tokenizer
• TokenFilter
Primer

• Tokens, Terms
• Attributes: Type, Payloads, Offsets,
  Positions, Term Vectors
• part of the picture:
Version

• enum:
 • Version.LUCENE_31,
    Version.LUCENE_32, etc
• Version.onOrAfter(Version other)
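
A version enum like this is typically used to gate behavior changes. The sketch below is a plain-Java stand-in, not Lucene's `Version` class: `MatchVersion`, its constants, and the `behaviorFor` rule are made up for illustration. It relies only on the fact that enum constants compare in declaration order.

```java
// Simplified stand-in for a version enum; names here are illustrative only.
enum MatchVersion {
    V31, V32, V33;

    // True when this version is the same as, or newer than, the other one.
    boolean onOrAfter(MatchVersion other) {
        return compareTo(other) >= 0;
    }

    // Hypothetical rule: a behavior change that was introduced in V32.
    static String behaviorFor(MatchVersion configured) {
        return configured.onOrAfter(V32) ? "new-behavior" : "legacy-behavior";
    }
}
```

Code that asks "is the configured version on or after X" keeps old indexes working while newer configurations pick up the fix.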
CharFilter
• extend BaseCharFilter
• enables pre-tokenization filtering/morphing
  of incoming field value
• only affects tokenization, not stored value
• Built-in CharFilters: HTMLStripCharFilter,
  PatternReplaceCharFilter, and
  MappingCharFilter
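
To make the "only affects tokenization, not stored value" point concrete, here is a minimal sketch using only `java.io`. `DashToSpaceReader` is a hypothetical class, not a Lucene CharFilter; it swaps one character for another so offsets stay unchanged, which sidesteps the offset correction a real length-changing filter would need.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for a char filter: a Reader that rewrites characters
// before a tokenizer would see them. Not a Lucene class.
class DashToSpaceReader extends FilterReader {
    DashToSpaceReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == '-' ? ' ' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of input, in which case the loop body never runs
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '-') buf[i] = ' ';
        }
        return n;
    }

    // Drains the filtered reader: this is the text a tokenizer would receive.
    static String filtered(String stored) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new DashToSpaceReader(new StringReader(stored))) {
            for (int c = r.read(); c != -1; c = r.read()) sb.append((char) c);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }
}
```

The original string passed in is never modified; only the character stream handed onward is different.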
Tokenizer
•   common to extend CharTokenizer

•   implement -

    •   protected abstract boolean isTokenChar(int c);

•   optionally override -

    •   protected int normalize(int c)

•   extend Tokenizer directly for finer control

•   Popular built-in Tokenizers include: WhitespaceTokenizer,
    StandardTokenizer, PatternTokenizer, KeywordTokenizer,
    ICUTokenizer
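
The two hooks above can be illustrated with a small plain-Java model. This is a sketch of the idea, not Lucene's `CharTokenizer`: `SimpleCharTokenizer` and `LowerCaseLetterTokenizer` are hypothetical classes, and real tokenizers also track offsets and other attributes that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a character-driven tokenizer. The hook names mirror the
// two methods named above, but this is not Lucene's implementation.
abstract class SimpleCharTokenizer {
    // Decides whether a code point belongs inside a token.
    protected abstract boolean isTokenChar(int c);

    // Optional per-character rewrite; identity by default.
    protected int normalize(int c) {
        return c;
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(normalize(cp));
            } else if (current.length() > 0) {
                // a non-token character ends the current token
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }
}

// Letters form tokens, everything else separates them, tokens are lowercased.
class LowerCaseLetterTokenizer extends SimpleCharTokenizer {
    @Override
    protected boolean isTokenChar(int c) {
        return Character.isLetter(c);
    }

    @Override
    protected int normalize(int c) {
        return Character.toLowerCase(c);
    }
}
```

Working in `int` code points rather than `char` is what lets the same two hooks handle characters outside the Basic Multilingual Plane.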
TokenFilter

• a TokenStream whose input is another
  TokenStream
• Popular TokenFilters include:
  LowerCaseFilter, CommonGramsFilter,
  SnowballFilter, StopFilter,
  WordDelimiterFilter
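
"A TokenStream whose input is another TokenStream" is the decorator pattern. A minimal sketch, assuming a stream is modeled as a plain `Iterator<String>`: `LowerCaseStream` and `StopStream` are hypothetical classes, not Lucene's filters, and the stop word list is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical model: a filter is an Iterator that wraps another Iterator.
class LowerCaseStream implements Iterator<String> {
    private final Iterator<String> input;

    LowerCaseStream(Iterator<String> input) {
        this.input = input;
    }

    public boolean hasNext() {
        return input.hasNext();
    }

    public String next() {
        return input.next().toLowerCase(Locale.ROOT);
    }
}

// Drops tokens found in the stop set and passes everything else through.
class StopStream implements Iterator<String> {
    private final Iterator<String> input;
    private final Set<String> stopWords;
    private String pending; // next token to emit, once found

    StopStream(Iterator<String> input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
    }

    public boolean hasNext() {
        while (pending == null && input.hasNext()) {
            String candidate = input.next();
            if (!stopWords.contains(candidate)) pending = candidate;
        }
        return pending != null;
    }

    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String result = pending;
        pending = null;
        return result;
    }

    // Demonstration helper: lowercase first, then remove stop words.
    static List<String> demo(String... tokens) {
        Iterator<String> chain = new StopStream(
            new LowerCaseStream(Arrays.asList(tokens).iterator()),
            new HashSet<>(Arrays.asList("the", "a")));
        List<String> out = new ArrayList<>();
        while (chain.hasNext()) out.add(chain.next());
        return out;
    }
}
```

Order matters: lowercasing runs before the stop filter here, which is why "The" is removed by a lowercase stop list.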
Lucene's analysis APIs
• tricky business, what with Attributes
  (AttributeSource, AttributeFactory), State,
  characters, code points, Version, etc...
• Test!!!
 • BaseTokenStreamTestCase
 • Look at Lucene and Solr's test cases
Solr's Analysis Tools

• Admin analysis tool
• Field analysis request handler
• DEMO
Query Parsing


• String -> org.apache.lucene.search.Query
QParserPlugin
public abstract class QParserPlugin
    implements NamedListInitializedPlugin {

    public abstract QParser createParser(
      String qstr,
      SolrParams localParams,
      SolrParams params,
      SolrQueryRequest req);
}
QParser
public abstract class QParser {

    public abstract Query parse()
              throws ParseException;

}
Built-in QParsers
from QParserPlugin.java
  /** internal use - name to class mappings of builtin parsers */
  public static final Object[] standardPlugins = {
     LuceneQParserPlugin.NAME, LuceneQParserPlugin.class,
     OldLuceneQParserPlugin.NAME, OldLuceneQParserPlugin.class,
     FunctionQParserPlugin.NAME, FunctionQParserPlugin.class,
     PrefixQParserPlugin.NAME, PrefixQParserPlugin.class,
     BoostQParserPlugin.NAME, BoostQParserPlugin.class,
     DisMaxQParserPlugin.NAME, DisMaxQParserPlugin.class,
     ExtendedDismaxQParserPlugin.NAME, ExtendedDismaxQParserPlugin.class,
     FieldQParserPlugin.NAME, FieldQParserPlugin.class,
     RawQParserPlugin.NAME, RawQParserPlugin.class,
     TermQParserPlugin.NAME, TermQParserPlugin.class,
     NestedQParserPlugin.NAME, NestedQParserPlugin.class,
     FunctionRangeQParserPlugin.NAME, FunctionRangeQParserPlugin.class,
     SpatialFilterQParserPlugin.NAME, SpatialFilterQParserPlugin.class,
     SpatialBoxQParserPlugin.NAME, SpatialBoxQParserPlugin.class,
     JoinQParserPlugin.NAME, JoinQParserPlugin.class,
  };
Local Parameters

• {!qparser_name param=value}expression
 • or
• {!qparser_name param=value v=expression}
• Can substitute $references from request
  parameters
Param Substitution
solrconfig.xml
<requestHandler name="/document"
                class="solr.SearchHandler">
  <lst name="invariants">
    <str name="q">{!term f=id v=$id}</str>
  </lst>
</requestHandler>

Solr request
http://localhost:8983/solr/document?id=FOO37
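
How the `{!term f=id v=$id}` string turns into usable parameters can be sketched in plain Java. This is a deliberately simplified reader, not Solr's own parser: `LocalParamsSketch` is a hypothetical class, it handles only unquoted, space-separated key=value pairs, and storing the parser name under `type` and the query text under `v` is a convention assumed here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical local-params reader; not Solr's implementation.
class LocalParamsSketch {
    // Reads "{!name k=v ...}rest" into a map. The parser name goes under
    // "type" and the query text under "v".
    static Map<String, String> parse(String q, Map<String, String> requestParams) {
        Map<String, String> out = new LinkedHashMap<>();
        if (!q.startsWith("{!")) {
            out.put("v", q); // no local params: the whole string is the query
            return out;
        }
        int end = q.indexOf('}');
        if (end < 0) throw new IllegalArgumentException("missing '}' in " + q);
        String[] parts = q.substring(2, end).trim().split("\\s+");
        out.put("type", parts[0]);
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) continue; // skip anything that is not key=value
            String key = parts[i].substring(0, eq);
            String value = parts[i].substring(eq + 1);
            if (value.startsWith("$")) {
                // $name is replaced by the request parameter of that name
                value = requestParams.get(value.substring(1));
            }
            out.put(key, value);
        }
        String rest = q.substring(end + 1);
        if (!rest.isEmpty()) out.put("v", rest);
        return out;
    }
}
```

With the request above, the `$id` reference resolves to `FOO37`, so the handler's fixed `q` ends up as a term lookup on the `id` field for that value.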
Custom QParser

• Implement a QParserPlugin that creates your
  custom QParser
• Register in solrconfig.xml
 • <queryParser name="myparser"
    class="com.mycompany.MyQParserPlugin"/>
Update Processor

• Responsible for handling these commands:
 • add/update
 • delete
 • commit
 • merge indexes
Built-in Update Processors
•   RunUpdateProcessor
    •   Actually performs the operations, such as
        adding the documents to the index
•   LogUpdateProcessor
    •   Logs each operation
•   SignatureUpdateProcessor
    •   duplicate detection and optionally rejection
UIMA Update Processor
•   UIMA - Unstructured Information Management
    Architecture - http://uima.apache.org/

•   Enables UIMA components to augment
    documents

•   Entity extraction, automated categorization,
    language detection, etc

•   "contrib" plugin

•   http://wiki.apache.org/solr/SolrUIMA
Update Processor Chain
• UpdateProcessors are sequenced into a chain
• Each processor can abort the entire update
  or hand processing to the next processor in
  the chain
• Chains of update processor factories are
  specified in solrconfig.xml
• Update requests can select a chain with the
  update.processor parameter (renamed
  update.chain as of Solr 3.2)
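
The abort-or-hand-on behavior is a chain of responsibility. A minimal plain-Java model follows; `ChainLink`, `RejectEmptyLink`, `LogLink`, `RunLink`, and `ChainDemo` are hypothetical names, documents are reduced to strings, and this is not Solr's `UpdateRequestProcessor` API.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a processor chain; class names are illustrative only.
abstract class ChainLink {
    protected final ChainLink next;

    ChainLink(ChainLink next) {
        this.next = next;
    }

    // Default behavior: hand the document to the next link, if there is one.
    void processAdd(String doc, List<String> log) {
        if (next != null) next.processAdd(doc, log);
    }
}

// Aborts the whole update by throwing instead of calling the next link.
class RejectEmptyLink extends ChainLink {
    RejectEmptyLink(ChainLink next) { super(next); }

    @Override
    void processAdd(String doc, List<String> log) {
        if (doc.isEmpty()) throw new IllegalArgumentException("empty document");
        super.processAdd(doc, log);
    }
}

class LogLink extends ChainLink {
    LogLink(ChainLink next) { super(next); }

    @Override
    void processAdd(String doc, List<String> log) {
        log.add("log:" + doc);
        super.processAdd(doc, log);
    }
}

// Last link: stands in for the step that actually writes to the index.
class RunLink extends ChainLink {
    RunLink(ChainLink next) { super(next); }

    @Override
    void processAdd(String doc, List<String> log) {
        log.add("index:" + doc);
        super.processAdd(doc, log);
    }
}

class ChainDemo {
    static List<String> run(String doc) {
        ChainLink chain = new RejectEmptyLink(new LogLink(new RunLink(null)));
        List<String> log = new ArrayList<>();
        chain.processAdd(doc, log);
        return log;
    }
}
```

Because each link decides whether to call the next one, a custom step placed early in the chain can modify or veto a document before anything is indexed.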
Default update processor chain
From SolrCore.java
// construct the default chain
UpdateRequestProcessorFactory[] factories =
  new UpdateRequestProcessorFactory[]{
     new RunUpdateProcessorFactory(),
     new LogUpdateProcessorFactory()
  };

    Note: these steps have been swapped on trunk recently
Example Update Processor
•   What are the best facets to show for a particular
    query? Wouldn't it be nice to see the distribution of
    document "attributes" represented across a result
    set?

•   Learned this trick from the Smithsonian, who were
    doing it manually - add an indexed field containing the
    field names of the interesting other fields on the
    document.

•   Facet on that field "of field names" initially, then
    request facets on the top values returned.
Config for custom update processor
<updateRequestProcessorChain name="fields_used" default="true">
 <processor class="solr.processor.FieldsUsedUpdateProcessorFactory">
  <str name="fieldsUsedFieldName">attribute_fields</str>
  <str name="fieldNameRegex">.*_attribute</str>
 </processor>
 <processor class="solr.LogUpdateProcessorFactory" />
 <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
FieldsUsedUpdateProcessorFactory


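// Package names below are those of Solr 3.x
import java.util.regex.Pattern;

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class FieldsUsedUpdateProcessorFactory extends UpdateRequestProcessorFactory {
  // Both values are read from the processor's configuration in init()
  private String fieldsUsedFieldName;
  private Pattern fieldNamePattern;

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new FieldsUsedUpdateProcessor(req, rsp, this, next);
  }

  // ... next slide ...

}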
FieldsUsedUpdateProcessorFactory
  @Override
  public void init(NamedList args) {
    if (args == null) return;

    SolrParams params = SolrParams.toSolrParams(args);

    fieldsUsedFieldName = params.get("fieldsUsedFieldName");
    if (fieldsUsedFieldName == null) {
      throw new SolrException
          (SolrException.ErrorCode.SERVER_ERROR,
           "fieldsUsedFieldName must be specified");
    }

    // TODO check that fieldsUsedFieldName is a valid field name and multiValued

    String fieldNameRegex = params.get("fieldNameRegex");
    if (fieldNameRegex == null) {
      throw new SolrException
          (SolrException.ErrorCode.SERVER_ERROR,
           "fieldNameRegex must be specified");
    }
    fieldNamePattern = Pattern.compile(fieldNameRegex);

    super.init(args);
  }
// Assumed to be declared inside FieldsUsedUpdateProcessorFactory (the slides do
// not show the nesting); that is what lets it read the factory's private
// fieldNamePattern and fieldsUsedFieldName fields.
class FieldsUsedUpdateProcessor extends UpdateRequestProcessor {
  public FieldsUsedUpdateProcessor(SolrQueryRequest req,
                                   SolrQueryResponse rsp,
                                   FieldsUsedUpdateProcessorFactory factory,
                                   UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();

    // Collect the names of incoming fields that match the configured pattern
    ArrayList<String> usedFields = new ArrayList<String>();
    for (String f : doc.getFieldNames()) {
      if (fieldNamePattern.matcher(f).matches()) {
        usedFields.add(f);
      }
    }

    // Record the matching names in the "field of field names", then pass the
    // document on to the next processor in the chain
    doc.addField(fieldsUsedFieldName, usedFields.toArray());
    super.processAdd(cmd);
  }
}
FieldsUsedUpdateProcessor in action
schema.xml
  <dynamicField name="*_attribute" type="string" indexed="true" stored="true" multiValued="true"/>

Add some documents
solr.add([{:id=>1, :name => "Big Blue Shoes", :size_attribute => 'L', :color_attribute => 'Blue'},
          {:id=>2, :name => "Cool Gizmo", :memory_attribute => "16GB", :color_attribute => 'White'}])
solr.commit

Facet on attribute_fields
 - http://localhost:8983/solr/select?q=*:*&facet=on&facet.field=attribute_fields&wt=json&indent=on
      "facet_fields":{
          "attribute_fields":[
             "color_attribute",2,
             "memory_attribute",1,
             "size_attribute",1]}
Search Components
• Built-in: Clustering, Debug, Facet, Highlight,
  MoreLikeThis, Query, QueryElevation,
  SpellCheck, Stats, TermVector, Terms
• Non-distributed API:
 • prepare(ResponseBuilder rb)
 • process(ResponseBuilder rb)
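
The two methods form a two-phase lifecycle. The sketch below models the ordering as understood here for the non-distributed case: every component's prepare runs before any component's process. `PhaseComponent`, `NamedComponent`, and `LifecycleDemo` are hypothetical names, and the shared trace list stands in for the response builder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the two-phase lifecycle; not Solr's SearchComponent API.
interface PhaseComponent {
    void prepare(List<String> trace);
    void process(List<String> trace);
}

class NamedComponent implements PhaseComponent {
    private final String name;

    NamedComponent(String name) {
        this.name = name;
    }

    public void prepare(List<String> trace) {
        trace.add(name + ".prepare");
    }

    public void process(List<String> trace) {
        trace.add(name + ".process");
    }
}

class LifecycleDemo {
    // All components are prepared first, then all are processed, in list order.
    static List<String> handle(List<PhaseComponent> components) {
        List<String> trace = new ArrayList<>();
        for (PhaseComponent c : components) c.prepare(trace);
        for (PhaseComponent c : components) c.process(trace);
        return trace;
    }

    static List<String> demo() {
        return handle(Arrays.<PhaseComponent>asList(
            new NamedComponent("query"), new NamedComponent("facet")));
    }
}
```

This ordering is why a later component can use prepare to ask an earlier one for something it will need during process.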
Example - auto facet select
•   It would be nice if Solr could automatically select the
    field(s) to facet on, based on the profile of the results.
    For example, you're indexing disparate types of products,
    each with different attributes (color and size for apparel,
    memory_size for electronics, subject for books, etc.). A
    user searches for "ipod", and most matching products have
    color and memory_size attributes... so let's automatically
    facet on those fields.

•   https://issues.apache.org/jira/browse/SOLR-2641
AutoFacetSelectionComponent
•   Too much code for a slide, let's take a look in
    an IDE...

•   Basically -

    •   process() gets autofacet.field and autofacet.n
        request params, facets on that field, takes top N
        values, sets those as facet.field's

    •   Gotcha - need to call rb.setNeedDocSet(true)
        in prepare() as faceting needs it
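
The "takes top N values" step can be sketched on its own in plain Java. This is not the component's actual code: `TopFacetFields` is a hypothetical helper, and the facet counts are modeled as a simple map from field name to count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the top-N selection step only.
class TopFacetFields {
    // Returns up to n keys, highest count first. The sort is stable, so ties
    // keep the iteration order of the input map.
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> fields = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (fields.size() == n) break;
            fields.add(e.getKey());
        }
        return fields;
    }
}
```

The returned names are what would then be set as the facet.field values for the regular faceting step that runs next.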
SearchComponent config
<searchComponent name="autofacet"
     class="solr.AutoFacetSelectionComponent"/>
<requestHandler name="/searchplus"
                class="solr.SearchHandler">
  <arr name="components">
    <str>query</str>
    <str>autofacet</str>
    <str>facet</str>
    <str>debug</str>
  </arr>
</requestHandler>
autofacet success
http://localhost:8983/solr/searchplus?q=*:*&facet=on&autofacet.field=attribute_fields&wt=json&indent=on
{
  "response":{"numFound":2,"start":0,"docs":[
       {
         "size_attribute":["L"],
         "color_attribute":["Blue"],
         "name":"Big Blue Shoes",
         "id":"1",
         "attribute_fields":["size_attribute",
           "color_attribute"]},
       {
         "color_attribute":["White"],
         "name":"Cool Gizmo",
         "memory_attribute":["16GB"],
         "id":"2",
         "attribute_fields":["color_attribute",
           "memory_attribute"]}]
  },
  "facet_counts":{
     "facet_queries":{},
     "facet_fields":{
       "color_attribute":[
         "Blue",1,
         "White",1],
       "memory_attribute":[
         "16GB",1]}}}
Distributed-aware SearchComponents
•   SearchComponent has a few distributed mode
    methods:

    •   distributedProcess(ResponseBuilder)

    •   modifyRequest(ResponseBuilder rb,
        SearchComponent who, ShardRequest sreq)

    •   handleResponses(ResponseBuilder rb,
        ShardRequest sreq)

    •   finishStage(ResponseBuilder rb)
Testing

• AbstractSolrTestCase
• SolrTestCaseJ4
• SolrMeter
 • http://code.google.com/p/solrmeter/
For more information...
•   http://www.lucidimagination.com

•   LucidFind

    •   search Lucene ecosystem: mailing lists, wikis, JIRA, etc

    •   http://search.lucidimagination.com

•   Getting started with LucidWorks Enterprise:

    •   http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise

•   http://lucene.apache.org/solr - wiki, e-mail lists
Thank You!

  • 36. FieldsUsedUpdateProcessor in action schema.xml <dynamicField name="*_attribute" type="string" indexed="true" stored="true" multiValued="true"/> Add some documents solr.add([{:id=>1, :name => "Big Blue Shoes", :size_attribute => 'L', :color_attribute => 'Blue'}, {:id=>2, :name => "Cool Gizmo", :memory_attribute => "16GB", :color_attribute => 'White'}]) solr.commit Facet on attribute_fields - http://localhost:8983/solr/select?q=*:*&facet=on&facet.field=attribute_fields&wt=json&indent=on "facet_fields":{ "attribute_fields":[ "color_attribute",2, "memory_attribute",1, "size_attribute",1]}
  • 37. Search Components • Built-in: Clustering, Debug, Facet, Highlight, MoreLikeThis, Query, QueryElevation, SpellCheck, Stats, TermVector, Terms • Non-distributed API: • prepare(ResponseBuilder rb) • process(ResponseBuilder rb)
  • 38. Example - auto facet select • It sure would be nice if Solr could automatically select field(s) for faceting based on the profile of the results. For example, you're indexing disparate types of products, all with varying attributes (color and size for apparel, memory_size for electronics, subject for books, etc.), and a user searches for "ipod", where most matching products have color and memory_size attributes... let's automatically facet on those fields. • https://issues.apache.org/jira/browse/SOLR-2641
  • 39. AutoFacetSelection Component • Too much code for a slide, let's take a look in an IDE... • Basically - • process() gets the autofacet.field and autofacet.n request params, facets on that field, takes the top N values, and sets those as facet.field parameters • Gotcha - need to call rb.setNeedDocSet(true) in prepare() as faceting needs it
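The "take the top N values" step can be sketched on its own. This is a minimal sketch in plain Java, not the component's actual code (which is in SOLR-2641): `AutoFacetSketch`, `topFields`, and `slideCounts` are hypothetical names, and the counts are copied from the attribute_fields facet result shown on slide 36. Ties are assumed to keep their original order, which the real component may or may not do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of picking the N most common attribute fields. */
class AutoFacetSketch {

    /**
     * Returns the names of the n entries with the highest counts.
     * Entries with equal counts keep the order in which the map lists them.
     */
    static List<String> topFields(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so ties preserve the map's iteration order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        List<String> top = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < n; i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    /** Facet counts for attribute_fields as shown on slide 36. */
    static Map<String, Integer> slideCounts() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("color_attribute", 2);
        counts.put("memory_attribute", 1);
        counts.put("size_attribute", 1);
        return counts;
    }
}
```

Calling `topFields(slideCounts(), 2)` selects `color_attribute` and `memory_attribute`; each selected name would then be added to the request as a facet.field value before the regular facet component runs.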
  • 40. SearchComponent config <searchComponent name="autofacet" class="solr.AutoFacetSelectionComponent"/> <requestHandler name="/searchplus" class="solr.SearchHandler"> <arr name="components"> <str>query</str> <str>autofacet</str> <str>facet</str> <str>debug</str> </arr> </requestHandler>
  • 41. autofacet success http://localhost:8983/solr/searchplus?q=*:*&facet=on&autofacet.field=attribute_fields&wt=json&indent=on { "response":{"numFound":2,"start":0,"docs":[ { "size_attribute":["L"], "color_attribute":["Blue"], "name":"Big Blue Shoes", "id":"1", "attribute_fields":["size_attribute", "color_attribute"]}, { "color_attribute":["White"], "name":"Cool Gizmo", "memory_attribute":["16GB"], "id":"2", "attribute_fields":["color_attribute", "memory_attribute"]}] }, "facet_counts":{ "facet_queries":{}, "facet_fields":{ "color_attribute":[ "Blue",1, "White",1], "memory_attribute":[ "16GB",1]}}}
  • 42. Distributed-aware SearchComponents • SearchComponent has a few distributed mode methods: • distributedProcess(ResponseBuilder) • modifyRequest(ResponseBuilder rb, SearchComponent who, ShardRequest sreq) • handleResponses(ResponseBuilder rb, ShardRequest sreq) • finishStage(ResponseBuilder rb)
  • 43. Testing • AbstractSolrTestCase • SolrTestCaseJ4 • SolrMeter • http://code.google.com/p/solrmeter/
  • 44. For more information... • http://www.lucidimagination.com • LucidFind • search Lucene ecosystem: mailing lists, wikis, JIRA, etc. • http://search.lucidimagination.com • Getting started with LucidWorks Enterprise: • http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise • http://lucene.apache.org/solr - wiki, e-mail lists