Rapid Prototyping with Solr

uberconf - July 14, 2011
Presented by Erik Hatcher
erik.hatcher@lucidimagination.com
Lucid Imagination
http://www.lucidimagination.com

About me...

• Co-author, “Lucene in Action”
• Committer, Lucene and Solr
• Lucene PMC and ASF member
• Member of Technical Staff / co-founder,
Lucid Imagination

About Lucid Imagination...
• Lucid Imagination provides commercial-grade
support, training, high-level consulting and value-
added software for Lucene and Solr.

• We make Lucene ‘enterprise-ready’ by offering:

• Free, certified distributions and downloads.

• Support, training, and consulting.

• LucidWorks Enterprise, a commercial search
platform built on top of Solr.

Abstract
Got data? Let's make it searchable! Rapid Prototyping with
Solr will demonstrate getting documents into Solr quickly,
provide some tips on adjusting Solr's schema to better match
your needs, and finally discuss how to showcase your
data in a flexible search user interface. We'll see how to
rapidly leverage faceting, highlighting, spell checking, and
debugging. Even after all that, there will be enough time left
to outline the next steps in developing your search
application and taking it to production.

What is Lucene?
• An open source Java-based IR library with best practice indexing
and query capabilities, fast and lightweight search and indexing.

• 100% Java (.NET, Perl and other versions too).

• Stable, mature API.

• Continuously improved and tuned over more than 10 years.

• Cleanly implemented, easy to embed in an application.

• Compact, portable index representation.

• Programmable text analyzers, spell checking and highlighting.

• Not a crawler or a text extraction tool.

Lucene's History
• Created by Doug Cutting in 1999

• built on ideas from search projects Doug created at Xerox PARC
and Apple.

• Donated to the Apache Software Foundation (ASF) in 2001.

• Became an Apache top-level project in 2005.

• Has grown and morphed through the years and is now both:

• A search library.

• An ASF Top-Level Project (TLP) encompassing several sub-projects.

• Lucene and Solr "merged" development in early 2010.

What is Solr?
• An open source search engine.

• Indexes content sources, processes query requests, returns
search results.

• Uses Lucene as the "engine", but adds full enterprise search
server features and capabilities.

• A web-based application that processes HTTP requests and
returns HTTP responses.

• Initially started in 2004 and developed by CNET as an in-house
project to add search capability for the company website.

• Donated to ASF in 2006.

What Version of Solr?

• There’s more than one answer!

• The current, released, stable version is 3.3

• The development release is referred to as “trunk”.

• This is where the new, less tested work goes on

• Also referred to as 4.0

• LucidWorks Enterprise is built on a trunk snapshot +
additional features.

Why prototype?

• Demonstrate Solr can handle your needs
• Stake/purse-holder buy-in
• It's quick, easy, and fun!
• The User Interface is the app

Workflow

• Ingest data
• Use
• Refine config/interactions, repeat

Got Data?
• Rich text files?
• Databases?
• Feeds (Atom/RSS/XML)?
• 3rd party repositories? (SharePoint,
Documentum, ...)
• CSV!!!!

Getting Started
• Download Solr
• http://lucene.apache.org/solr
• "Install" it
• unzip or tar -xvf
• Start it
• cd example; java -jar start.jar

e.g. Conference
Attendees

First Name,Last Name,Company,Title,Work Country

Erik,Hatcher,Lucid Imagination,"Member, Technical Staff", USA

.

.

.

First Try

curl "http://localhost:8983/solr/update/csv?stream.file=attendees.csv"

undefined field First Name

Dynamic Fields

<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>

Second try

curl "http://localhost:8983/solr/update/csv?stream.file=attendees.csv&fieldnames=first_s,last_s,company_s,title_t,country_s&header=true"

Document [null] missing required field: id
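
The request above can also be assembled in code rather than typed by hand. A minimal sketch in plain JavaScript: `buildCsvUpdateUrl` is a hypothetical helper (not part of Solr) that uses only the parameters shown on the slide.

```javascript
// Hypothetical helper (not a Solr API): assembles a CSV update URL from
// the parameters used on the slides: stream.file, fieldnames, header.
function buildCsvUpdateUrl(baseUrl, file, fieldnames, extraParams = {}) {
  const params = new URLSearchParams();
  params.set("stream.file", file);
  params.set("fieldnames", fieldnames.join(","));
  params.set("header", "true");
  // Optional extras such as commit=true or f.<field>.map=...
  for (const [name, value] of Object.entries(extraParams)) {
    params.set(name, value);
  }
  // URLSearchParams percent-encodes the commas; ordinary URL decoding
  // on the receiving side restores them.
  return baseUrl + "/update/csv?" + params.toString();
}

const secondTryUrl = buildCsvUpdateUrl(
  "http://localhost:8983/solr",
  "attendees.csv",
  ["first_s", "last_s", "company_s", "title_t", "country_s"]
);
```

Keeping the field list in an array makes it easy to rename columns between import attempts.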

uniqueKey

• Optional, Solr-specific feature
• generally "string" type
• schema.xml: <uniqueKey>id</uniqueKey>
• adding a document whose id already exists updates it
(delete + add)
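
The "delete + add" behaviour can be pictured with a toy in-memory model. This sketch only illustrates the semantics and is not how Solr stores documents; `addDocument` and the sample documents are made up for the example.

```javascript
// Toy model of uniqueKey semantics (illustration only, not Solr internals).
// The index is a Map keyed by the uniqueKey field, assumed here to be "id".
function addDocument(index, doc) {
  if (doc.id === undefined || doc.id === null) {
    // Mirrors the error message shown on the earlier slide.
    throw new Error("Document [null] missing required field: id");
  }
  const replaced = index.has(doc.id);
  index.delete(doc.id);   // "delete" any existing document with this id
  index.set(doc.id, doc); // "add" the new version
  return replaced;        // true when an older document was overwritten
}
```

Re-running an import with the same ids therefore leaves one document per id rather than duplicates.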

id
curl "http://localhost:8983/solr/update/csv?stream.file=attendees.csv&fieldnames=first_s,id,company_s,title_t,country_s&header=true"

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">40</int>
</lst>
</response>

Schema tinkering
• Removed all example field definitions

• Uncomment and adjust catch-all dynamic field:

• <dynamicField name="*" type="string"
multiValued="false"/>

• Ensure uniqueKey is appropriate

• unusual in this example, disabled it

• Make every document/field fully searchable!

• <copyField source="*" dest="text"/>

After adjusting config...

• Restart Solr
• Or... reload the core (when in multicore
mode)

Clean import
# Delete all documents
curl "http://localhost:8983/solr/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=true"

# Index your data
curl "http://localhost:8983/solr/update/csv?commit=true&stream.file=EuroCon2010.csv&fieldnames=first,last,company,title,country&header=true"
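
The percent-encoded `stream.body` is easier to read when generated from the XML it stands for. A small sketch, with `buildDeleteAllUrl` as a hypothetical helper name:

```javascript
// Hypothetical helper: builds the "delete all documents" update URL by
// percent-encoding the XML delete command passed as stream.body.
function buildDeleteAllUrl(baseUrl) {
  const body = "<delete><query>*:*</query></delete>";
  return baseUrl + "/update?stream.body=" + encodeURIComponent(body) + "&commit=true";
}

const deleteAllUrl = buildDeleteAllUrl("http://localhost:8983/solr");
```

`encodeURIComponent` also escapes `:` and `/`, so the result is slightly more heavily encoded than the hand-written URL, but it decodes to the same delete command.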

Facets
http://localhost:8983/solr/browse?facet.field=country
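
Outside the /browse UI, facet counts can also be read from a JSON response (`wt=json`). In Solr's default JSON layout, each facet field is a flat array of alternating names and counts. A sketch that pairs them up follows; `pairFacetCounts` is a hypothetical helper, the sample numbers are invented, and a missing-value bucket (when `facet.missing` is on) is assumed to arrive with a null name.

```javascript
// Hypothetical helper: turns the flat [name, count, name, count, ...]
// array from facet_counts.facet_fields.<field> into {name, count} pairs.
function pairFacetCounts(flat) {
  const pairs = [];
  for (let i = 0; i + 1 < flat.length; i += 2) {
    pairs.push({ name: flat[i], count: flat[i + 1] });
  }
  return pairs;
}

// Invented sample data in the assumed response shape.
const sampleFacetResponse = {
  facet_counts: { facet_fields: { country: ["USA", 42, "Germany", 7, null, 3] } }
};
const countryFacets = pairFacetCounts(sampleFacetResponse.facet_counts.facet_fields.country);
```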

Value Normalization

• http://localhost:8983/solr/update/csv?commit=true&stream.file=attendees.csv&fieldnames=first,last,company,title,country&header=true&f.country.map=Great+Britain:United+Kingdom
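
The effect of `f.country.map` can be pictured as a simple value substitution on each row. Solr applies the mapping server-side during the CSV import; the sketch below only emulates it on already-parsed rows, and `mapFieldValue` and the sample rows are made up.

```javascript
// Toy emulation of f.<field>.map=from:to on parsed rows.
// Solr does this inside the CSV update handler; this is for illustration.
function mapFieldValue(rows, field, from, to) {
  return rows.map(row =>
    row[field] === from ? { ...row, [field]: to } : row
  );
}

// Invented sample rows.
const normalizedRows = mapFieldValue(
  [{ first: "Ann", country: "Great Britain" }, { first: "Erik", country: "USA" }],
  "country", "Great Britain", "United Kingdom"
);
```

After normalization both spellings fall into the same facet bucket.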

Polishing

• Customize request handler mappings
• Edit templates
• hit display
• header/footer
• style

/browse
<requestHandler name="/browse" class="solr.SearchHandler">
<lst name="defaults">
<str name="wt">velocity</str><str name="v.layout">layout</str>
<str name="v.template">browse</str>

<str name="rows">10</str><str name="fl">*,score</str>

<str name="defType">lucene</str><str name="q">*:*</str>
<str name="debugQuery">true</str>
<str name="hl">on</str><str name="hl.fl">title</str>
<str name="hl.fragsize">0</str>
<str name="hl.alternateField">title</str>

<str name="facet">on</str>
<str name="facet.mincount">1</str>
<str name="facet.missing">true</str>
</lst>
<lst name="appends"><str name="facet.field">country</str></lst>
</requestHandler>

hit.vm

<div class="result-document">
<p>$doc.getFieldValue('first') $doc.getFieldValue('last')</p>
<p>$!doc.getFieldValue('title'), $!doc.getFieldValue('company')</p>
<p>$!doc.getFieldValue('country')</p>
</div>

Adding bells and
whistles
• jQuery
• <script type="text/javascript" src="/solr/admin/jquery.js"></script>
• TreeMap
• <script type="text/javascript" src="/scripts/treemap.js"></script>

TreeMap code
<script type="text/javascript">
function onLoad() {
jQuery("#treemap-country").treemap(640,480, {});
}
</script>
----------------------------
<body onload="onLoad();">
----------------------------
<table id="treemap-country">
#foreach($facet in $response.getFacetField('country').values)
<tr>
<td>#if($facet.name)$esc.html($facet.name)#{else}&lt;Unspecified&gt;#end</td>
<td>$facet.count</td>
<td>#if($facet.name)$esc.html($facet.name)#{else}Unspecified#end</td>
</tr>
#end
</table>

Ajax fun: giveaways

• Add a "static" templated page
• jQuery Ajax request
• snippet templated output

solrconfig.xml
"static" page
<requestHandler name="/giveaways"
class="solr.DumpRequestHandler">
<lst name="defaults">
<str name="wt">velocity</str>
<str name="v.template">giveaways</str>
<str name="v.layout">layout</str>
</lst>
</requestHandler>

giveaways.vm
<input type="button" value="Pick a Winner"
onClick="javascript:$('#winner').load('/solr/generate_winner?sort=random_' + new Date().getTime() + '+asc');">
<h2>And the winner is...</h2> <center><font
size="20"><div id="winner"></div></font></center>

fragment template
solrconfig.xml
<requestHandler name="/generate_winner" class="solr.SearchHandler">
<lst name="defaults">

<str name="v.template">winner</str>
<str name="rows">1</str>
<str name="fl">first,last</str>
<str name="defType">lucene</str>
<str name="q">*:*
-company:"Lucid Imagination"
-company:"Stone Circle Productions"</str>
</lst>
</requestHandler>

winner.vm
#set($winner=$response.results.get(0))
$winner.getFieldValue('first') $winner.getFieldValue('last')

Data.gov CSV catalog
URL,Title,Agency,Subagency,Category,Date Released,Date Updated,Time
Period,Frequency,Description,Data.gov Data Category Type,Specialized Data Category
Designation,Keywords,Citation,Agency Program Page,Agency Data Series Page,Unit of
Analysis,Granularity,Geographic Coverage,Collection Mode,Data Collection
Instrument,Data Dictionary/Variable List,Applicable Agency Information Quality
Guideline Designation,Data Quality Certification,Privacy and Confidentiality,Technical
Documentation,Additional Metadata,FGDC Compliance (Geospatial Only),Statistical
Methodology,Sampling,Estimation,Weighting,Disclosure Avoidance,Questionnaire
Design,Series Breaks,Non-response Adjustment,Seasonal Adjustment,Statistical
Characteristics,Feeds Access Point,Feeds File Size,XML Access Point,XML File Size,CSV/
TXT Access Point,CSV/TXT File Size,XLS Access Point,XLS File Size,KML/KMZ Access
Point,KML File Size,ESRI Access Point,ESRI File Size,Map Access Point,Data Extraction
Access Point,Widget Access Point
"http://www.data.gov/details/4","Next Generation Radar (NEXRAD) Locations","Department of Commerce","National Oceanic
and Atmospheric Administration","Geography and Environment","1991","Irregular as needed","1991 to present","Between 4
and 10 minutes","This geospatial rendering of weather radar sites gives access to an historical archive of Terminal
Doppler Weather Radar data and is used primarily for research purposes. The archived data includes base data and
derived products of the National Weather Service (NWS) Weather Surveillance Radar 88 Doppler (WSR-88D) next generation
(NEXRAD) weather radar. Weather radar detects the three meteorological base data quantities: reflectivity, mean radial
velocity, and spectrum width. From these quantities, computer processing generates numerous meteorological analysis
products for forecasts, archiving and dissemination. There are 159 operational NEXRAD radar systems deployed
throughout the United States and at selected overseas locations. At the Radar Operations Center (ROC) in Norman OK,
personnel from the NWS, Air Force, Navy, and FAA use this distributed weather radar system to collect the data needed
to warn of impending severe weather and possible flash floods; support air traffic safety and assist in the management
of air traffic flow control; facilitate resource protection at military bases; and optimize the management of water,
agriculture, forest, and snow removal. This data set is jointly owned by the National Oceanic and Atmospheric
Administration, Federal Aviation Administration, and Department of Defense.","Raw Data Catalog",...

Debugging
http://localhost:8983/solr/data.gov?q=searching&debugQuery=true

Mapping field values
• CSV update handler can map field values
• &f.privacy_and_confidentiality.map=YES:Yes
&f.data_quality_certification.map=YES:Yes

Splitting keywords
• CSV handler: f.keywords.split=true
• stored values are split, multivalued
• Or via schema
• Stored value remains as in original, single valued
<fieldType name="comma_separated" class="solr.TextField" omitNorms="true">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
</fieldType>
...
<field name="keywords" type="comma_separated" indexed="true" stored="true"/>
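
To see what a comma-with-optional-whitespace pattern does to a keywords value, here is a rough JavaScript emulation. It assumes the tokenizer pattern is meant to be `\s*,\s*`. `splitKeywords` is a hypothetical name, the sample string is invented, and the real tokenization happens inside Solr at index and query time.

```javascript
// Rough emulation of a pattern tokenizer that splits on commas with
// optional surrounding whitespace. Illustration only.
function splitKeywords(value) {
  return value
    .split(/\s*,\s*/)
    .filter(token => token.length > 0); // drop empty tokens, e.g. from ",,"
}

const keywordTokens = splitKeywords("weather , radar,NEXRAD locations");
```

Multi-word keywords such as "NEXRAD locations" stay intact as a single token, which a whitespace-based analyzer would not give you.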

Suggest
• Suggest terms as user types in search box
• Technique: jQuery autocomplete, Solr’s
TermsComponent, Velocity template
http://localhost:8983/solr/terms?terms.fl=suggest&terms.prefix=sola&terms.sort=count&wt=velocity&v.template=suggest
#foreach($t in $response.response.terms.suggest)
$t.key
#end

Suggest schema
<fieldType name="suggestable" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z])"
replacement="" replace="all"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"
enablePositionIncrements="true" />
</analyzer>
</fieldType>

...

<field name="suggest" type="suggestable"
indexed="true" stored="false" multiValued="true"/>
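
The analysis chain above can be approximated in a few lines to preview which terms end up in the suggest field. This is a sketch under loud assumptions: splitting on non-alphanumerics is only a stand-in for StandardTokenizer, the stopword list is an invented sample rather than the contents of stopwords.txt, and tokens left empty by the pattern replacement are dropped here for convenience.

```javascript
// Assumed sample stopwords; the real list lives in stopwords.txt.
const SAMPLE_STOPWORDS = new Set(["a", "an", "and", "the", "of", "to"]);

// Approximates: tokenizer -> lowercase -> strip non a-z -> stop filter.
function analyzeForSuggest(text) {
  return text
    .split(/[^A-Za-z0-9]+/)               // crude stand-in for StandardTokenizer
    .map(token => token.toLowerCase())    // LowerCaseFilterFactory
    .map(token => token.replace(/[^a-z]/g, "")) // PatternReplaceFilterFactory
    .filter(token => token.length > 0 && !SAMPLE_STOPWORDS.has(token)); // StopFilterFactory
}

const suggestTerms = analyzeForSuggest("The Solar-Powered Radar 88D");
```

Here "88D" shrinks to "d" because digits are stripped, which shows why it is worth eyeballing the indexed terms (for example in the Schema Browser) before wiring up autocomplete.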

Custom pages

• Document detail page
• Multiple query intersection comparison
with Venn visualization

Document detail
http://localhost:8983/solr/data.gov/document?id=http%3A%2F%2Fwww.data.gov%2Fdetails%2F61
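
Because the document id is itself a URL, it has to be percent-encoded before it goes into the `id` parameter. A small sketch, with `buildDocumentUrl` as a hypothetical helper:

```javascript
// Hypothetical helper: links to the document detail page for a given id.
// The id is a URL, so it must be encoded to survive as a query parameter.
function buildDocumentUrl(baseUrl, id) {
  return baseUrl + "/data.gov/document?id=" + encodeURIComponent(id);
}

const detailUrl = buildDocumentUrl("http://localhost:8983/solr", "http://www.data.gov/details/61");
```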

Document detail detail
solrconfig.xml
<requestHandler name="/data.gov/document" class="solr.SearchHandler">
<lst name="defaults">
<str name="v.template">document</str>
<str name="title">Data.gov data set</str>
<str name="q">{!raw f=id v=$id}</str>
</lst>
</requestHandler>
document.vm
#set($doc= $response.results.get(0))
<span><a href="$doc.getFieldValue('id')">$doc.getFieldValue('id')</a></span>

<table>
#foreach($fieldname in $doc.fieldNames)
<tr>
<td>$fieldname:</td>
<td>
#foreach($value in $doc.getFieldValues($fieldname))
$esc.html($value)
#end
</td>
</tr>
#end
</table>

Query intersection

• Just showing off.... how easy it is to do
something with a bit of visual impact
• Compare three independent queries,
intersecting them in a Venn diagram
visualization

Compare static page
solrconfig.xml
<requestHandler name="/data.gov/compare" class="solr.DumpRequestHandler">
<lst name="defaults">
<str name="v.template">compare</str>
<str name="title">Data.gov Query Comparison</str>
</lst>
</requestHandler>
compare.vm
<script type="text/javascript">
function generate_venn() {
var a=encodeURIComponent($("#a").val());
var b=encodeURIComponent($("#b").val());
var c=encodeURIComponent($("#c").val());
var ab='('+a+')+AND+('+b+')';
var ac='('+a+')+AND+('+c+')';
var bc='('+b+')+AND+('+c+')';
var abc='('+a+')+AND+('+b+')+AND+('+c+')';
$('#venn').load('/solr/select?q=*:*&wt=velocity&v.template=venn&rows=0&facet=on'
+'&facet.query={!key=a}'+a+'&facet.query={!key=b}'+b
+'&facet.query={!key=c}'+c+'&facet.query={!key=intersect_ab}'+ab+'&facet.query={!key=intersect_ac}'+ac
+'&facet.query={!key=intersect_bc}'+bc+'&facet.query={!key=intersect_abc}'+abc+'&q_a='+a+'&q_b='+b+'&q_c='+c
+'&q_ab='+ab+'&q_ac='+ac+'&q_bc='+bc+'&q_abc='+abc);
return false;
}
</script>
<form action="#" id="compare_form" onsubmit="return generate_venn()">
A: <input type="text" name="a" id="a" value="health"/>
B: <input type="text" name="b" id="b" value="weather"/>
C: <input type="text" name="c" id="c" value="ozone"/>
<input type="submit"/>
</form>
<div id="venn"></div>
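
The seven facet queries behind the Venn diagram follow a regular pattern, so they can be generated instead of concatenated by hand. A sketch using `URLSearchParams` for the encoding follows. `buildVennParams` is a hypothetical helper and leaves out the extra `q_*` echo parameters the page also sends.

```javascript
// Hypothetical helper: builds the request parameters for the three base
// queries plus their pairwise and three-way intersections.
function buildVennParams(a, b, c) {
  const and = (...terms) => terms.map(t => "(" + t + ")").join(" AND ");
  const queries = {
    a: a,
    b: b,
    c: c,
    intersect_ab: and(a, b),
    intersect_ac: and(a, c),
    intersect_bc: and(b, c),
    intersect_abc: and(a, b, c)
  };
  const params = new URLSearchParams({
    q: "*:*", rows: "0", facet: "on", wt: "velocity", "v.template": "venn"
  });
  // {!key=...} names each count so the template can look it up by key.
  for (const [key, query] of Object.entries(queries)) {
    params.append("facet.query", "{!key=" + key + "}" + query);
  }
  return params;
}

const vennParams = buildVennParams("health", "weather", "ozone");
```

`"/solr/select?" + vennParams.toString()` then gives a fully encoded URL that could be passed to `.load()`.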

Venn chart
venn.vm
#set($values = $response.response.facet_counts.facet_queries)
#set($params = $response.responseHeader.params)

<img src="http://chart.apis.google.com/chart?chs=600x400&cht=v&chd=t:$values.a,$values.b,$values.c,$values.intersect_ab,$values.intersect_ac,$values.intersect_bc,$values.intersect_abc&chdl=$esc.url($params.q_a)|$esc.url($params.q_b)|$esc.url($params.q_c)"/>
<ul>
<li>A: <a href="/solr/data.gov?q={!lucene}$params.q_a">$params.q_a</a> ($values.a)</li>
<li>B: <a href="/solr/data.gov?q={!lucene}$params.q_b">$params.q_b</a> ($values.b)</li>
<li>C: <a href="/solr/data.gov?q={!lucene}$params.q_c">$params.q_c</a> ($values.c)</li>
<li>A&B: <a href="/solr/data.gov?q={!lucene}$params.q_ab">$params.q_ab</a>
($values.intersect_ab)</li>
<li>A&C: <a href="/solr/data.gov?q={!lucene}$params.q_ac">$params.q_ac</a>
($values.intersect_ac)</li>
<li>B&C: <a href="/solr/data.gov?q={!lucene}$params.q_bc">$params.q_bc</a>
($values.intersect_bc)</li>
<li>A&B&C: <a href="/solr/data.gov?q={!lucene}$params.q_abc">$params.q_abc</a>
($values.intersect_abc)</li>
</ul>

Solritas
• Pronounced: so-LAIR-uh-toss

• Celeritas is a Latin word, translated as "swiftness" or
"speed". It is often given as the origin of the symbol c,
the universal notation for the speed of light -
http://en.wikipedia.org/wiki/Celeritas

• VelocityResponseWriter - simply passes the Solr
response through the Apache Velocity templating
engine

• http://wiki.apache.org/solr/VelocityResponseWriter

Solr Flare
• Ruby on Rails plugin
• facet field detection, autosuggest, saved
search, inverted facets, pie charts, Simile
Timeline and Exhibit integration
• Useful for rapid prototyping
• See Flare's big brother, Blacklight, for
production quality

Blacklight
• UVA radiation = blacklight
• libraries are much more than books
• opinionated
• Ruby on Rails: best choice for an
extensible user interface development
framework

Prototyping Tools

• CSV update handler - /update/csv
• Schema Browser
• Solritas, Flare, Blacklight, or...
• just HTML+JavaScript (wt=json)
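
For the plain HTML+JavaScript route, the client only needs to build a select URL with `wt=json` and read `response.numFound` and `response.docs` from the result. A minimal sketch follows. `buildSelectUrl` and `summarizeResponse` are hypothetical helpers, the `first`/`last` fields come from the attendee example, and the sample response is invented.

```javascript
// Hypothetical helper: builds a JSON search URL against a Solr base URL.
function buildSelectUrl(baseUrl, query, rows = 10) {
  const params = new URLSearchParams({ q: query, wt: "json", rows: String(rows) });
  return baseUrl + "/select?" + params.toString();
}

// Hypothetical helper: pulls the hit count and display names out of a
// parsed JSON response ({ response: { numFound, docs: [...] } }).
function summarizeResponse(json) {
  const { numFound, docs } = json.response;
  return { numFound, names: docs.map(doc => doc.first + " " + doc.last) };
}

// Invented sample response in the assumed shape.
const sampleSummary = summarizeResponse({
  response: { numFound: 1, start: 0, docs: [{ first: "Erik", last: "Hatcher" }] }
});
const sampleSelectUrl = buildSelectUrl("http://localhost:8983/solr", "country:USA");
```

In a page, the URL would be fetched with jQuery or another Ajax call and the summary rendered into the DOM.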

Test

• Performance
• Scalability
• Relevance
• Automate all of the above, start baselines
early, avoid regressions

Then what?
• Script the indexing process: full & delta
• Work with real users on actual needs
• Integrate with production systems
• Iterate on schema enhancements,
configuration tweaks such as caching
• Deploy to staging/production environments
and work at scale: collection size, real queries
and performance, hardware and JVM settings

LucidFind

http://www.lucidimagination.com/search/?q=user+interface

For more information...
• http://www.lucidimagination.com

• LucidFind

• search Lucene ecosystem: mailing lists, wikis, JIRA, etc

• http://search.lucidimagination.com

• Getting started with LucidWorks Enterprise:

• http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise

• http://lucene.apache.org/solr - wiki, e-mail lists
