Solr Indexing and Analysis Tricks
@ErikHatcher
Senior Solutions Architect, LucidWorks
Erik Hatcher's Relevant Professional Bio
•  Lucene/Solr committer
•  Apache Software Foundation member
•  Co-founder, Senior Solutions Architect, and Janitor at LucidWorks
•  Creator of Blacklight
•  Co-author of "Ant in Action" and "Lucene in Action"
Abstract

This session will introduce and demonstrate several
techniques for enhancing the search experience by
augmenting documents during indexing. First we'll survey
the analysis components available in Solr, and then we'll
delve into using Solr's update processing pipeline to
modify documents on the way in. The session will build
on Erik's "Poor Man's Entity Extraction" blog post at
http://www.searchhub.org/2013/06/27/poor-mans-entity-extraction-with-solr/
Poor Man’s Entity Extraction

•  acronyms: a searchable/filterable/facetable (but not stored)
field containing all three+ letter CAPS acronyms
•  key_phrases: a searchable/filterable/facetable (but also not
stored) field containing any key phrases matching a
provided list
•  links: a stored field containing http(s) links
•  extracted_locations: lat/long points that are mentioned in
the document content, indexed as geographically savvy
points, stored on the document for easy use in maps or
elsewhere
example_data.txt
The DUB airport is at 53.421389,-6.27
See also:
http://en.wikipedia.org/wiki/Dublin_Airport
End results
Challenges and needs
•  Analyzers and Query Parsers
    –  Analysis != query parsing
        •  Query parsers generally analyze “chunks” of the query expression and
           combine the results in various ways
    –  Synergy, working in conjunction
•  Query parsing
    –  q=Lucene Revolution in Dubhlinn
    –  q="Lucene Revolution"
    –  q=lucene [AND/OR] revolution
    –  On which field(s)? Which query parser?
•  Analysis
    –  +((content:lucen) (content:revolut) (content:dublin)) [from edismax]
Extracting with copyField
•  copyField content => acronyms
    –  Note that the destination of a copy field generally should not be stored
       (stored="false")
•  "caps" field type
    –  PatternCaptureGroupFilterFactory with pattern="((?:[A-Z]\.?){3,})"
•  "The Dublin airport, DUB, is at…" => DUB
•  Results are suitable for faceting, searching, and boosting, but they are
   indexed terms only, not "stored" values
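To see what the acronym pattern captures, it can be tried outside Solr. The sketch below is a plain JavaScript stand-in, not the Solr filter itself: it assumes the period in the pattern is meant as a literal (escaped) dot, and it runs the pattern over a whole string, whereas PatternCaptureGroupFilterFactory applies it to each token produced by the field type's tokenizer. The function name extractAcronyms is made up for illustration.

```javascript
// JavaScript approximation of the "caps" pattern: three or more capital
// letters, each optionally followed by a literal period (DUB, U.S.A.).
var caps_regexp = /((?:[A-Z]\.?){3,})/g;

// Hypothetical helper: returns every substring the pattern matches.
// With the g flag, String.match returns the full matches (or null).
function extractAcronyms(text) {
  return String(text).match(caps_regexp) || [];
}

console.log(extractAcronyms("The Dublin airport, DUB, is at…"));
```

Capitalized words such as "The" and "Dublin" are skipped because only one capital letter appears in a row; "DUB" is the only run of three.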
Extracting with ScriptUpdateProcessor
•  An update processor can manipulate (add, modify, delete) document fields
    –  Field values can be stored
•  update.chain=script
    –  With post.jar:
        •  java -Dauto -Dparams=update.chain=script -jar post.jar
    –  Or make the update chain the default
        •  <updateRequestProcessorChain default="true"…
// basic lat/long pattern matching eg "38.1384683,-78.4527887"
var location_regexp = /(-?\d{1,2}\.\d{2,7},-?\d{1,3}\.\d{2,7})/g;
var extracted_locations = getMatches(location_regexp, content);
doc.setField("extracted_locations", extracted_locations);
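The snippet above calls getMatches() and uses doc and content without showing where they come from. A minimal sketch of how the pieces could fit together follows. It makes several assumptions: getMatches is a hypothetical implementation (the slides do not show one), the lat/long pattern is written with its digit and dot escapes, link_regexp is an illustrative pattern for the links field, and the source text lives in a field named content. Solr's script update processor calls processAdd(cmd) for each added document; a real script would likely need the other hook functions as well, and inside Solr the matches may have to be copied into a java.util.ArrayList before being set on the document.

```javascript
// Hypothetical helper: collect every match of a regexp in a value.
function getMatches(regexp, value) {
  var text = String(value); // coerce Java strings when run inside Solr
  var matches = [];
  var m;
  if (!regexp.global) {
    // Without the g flag, exec() always restarts at 0; return one match.
    m = regexp.exec(text);
    return m ? [m[0]] : [];
  }
  regexp.lastIndex = 0; // global regexps keep state between calls
  while ((m = regexp.exec(text)) !== null) {
    matches.push(m[0]);
    if (m[0] === "") regexp.lastIndex++; // guard against empty matches
  }
  return matches;
}

// basic lat/long pattern matching eg "38.1384683,-78.4527887"
var location_regexp = /(-?\d{1,2}\.\d{2,7},-?\d{1,3}\.\d{2,7})/g;
// Assumption: a simple http(s) link pattern, not taken from the slides.
var link_regexp = /https?:\/\/\S+/g;

// Called by the script update processor for each added document.
function processAdd(cmd) {
  var doc = cmd.solrDoc; // the incoming SolrInputDocument
  var content = doc.getFieldValue("content");
  if (content === null || content === undefined) return;
  doc.setField("extracted_locations", getMatches(location_regexp, content));
  doc.setField("links", getMatches(link_regexp, content));
}
```

Run against the text of example_data.txt, this would put the airport's coordinates in extracted_locations and the Wikipedia URL in links.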
Analysis in ScriptUpdateProcessor
var analyzer =
    req.getCore().getLatestSchema()
       .getFieldTypeByName("<field type>")
       .getAnalyzer();
doc.setField("token_ss",
    getAnalyzerResult(analyzer, null, content));
getAnalyzerResult
function getAnalyzerResult(analyzer, fieldName, fieldValue) {
  var result = [];
  // Run the field value through the analyzer's tokenizer/filter chain
  var token_stream =
      analyzer.tokenStream(fieldName, new java.io.StringReader(fieldValue));
  var term_att = token_stream.getAttribute(
      Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute);
  token_stream.reset();  // required before the first incrementToken()
  while (token_stream.incrementToken()) {
    result.push(term_att.toString());  // collect each emitted term
  }
  token_stream.end();
  token_stream.close();
  return result;
}
Using analysis externally

•  http://localhost:8983/solr/collection1/analysis/field
–  ?analysis.fieldvalue=Dubhlinn
–  &analysis.fieldtype=just_synonyms
•  => dublin
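The same request can be assembled from a client program. The sketch below only builds the URL, using the host, port, core name, and field type shown above; the function name buildAnalysisUrl is made up for illustration, and wt=json is added on the assumption that a JSON response is wanted. Sending the request and picking the terms out of the response are left out, since the response layout depends on the Solr version.

```javascript
// Build a field-analysis request URL for a given field type and value.
function buildAnalysisUrl(baseUrl, fieldType, fieldValue) {
  var url = new URL(baseUrl);
  url.searchParams.set("analysis.fieldvalue", fieldValue); // text to analyze
  url.searchParams.set("analysis.fieldtype", fieldType);   // analyzer to use
  url.searchParams.set("wt", "json");                      // response format
  return url.toString();
}

var analysisUrl = buildAnalysisUrl(
  "http://localhost:8983/solr/collection1/analysis/field",
  "just_synonyms",
  "Dubhlinn"
);
console.log(analysisUrl);
```

Using URL and searchParams rather than string concatenation means values containing spaces or punctuation are encoded correctly.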