Rapid Solr Schema Development (Phone directory)

Rapid Solr Schema Development
Alexandre Rafalovitch (@arafalov)
Apache Solr Committer
Montreal Solr/ML meetup May 2018

Phone directory - content
Names, often from multiple cultures
Addresses
Phone numbers
Company/Group
Locations
Other fun data
I use https://www.fakenamegenerator.com/ for demos
  Can generate bulk entries in CSV, tab-separated, SQL, etc.
  Many fields, languages, regions
  Warning: the export starts with an invisible byte order mark (BOM)
Slide 2

Today's exploration
Solr 7.3 (latest)
The smallest learning schema/configuration required
Rapid schema evolution workflow
Free-form and fielded user entry
Dealing with multiple languages
Dealing with alternative name spellings
Searching phone numbers by any-length suffix
Configuring Solr to simplify API interface
(Bonus points) Fit into a 40-minute presentation!
Slide 3

Today's dataset
http://www.fakenamegenerator.com/ - Bulk request (20000 identities) – Free and configurable!
Name sets: American, Arabic, Australian, Chinese, French, Hispanic, Polish, Russian, Russian (Cyrillic), Thai
Countries: Australia, Canada, France, Poland, Spain, United Kingdom, United States
Age range: 19 - 85 years old
Gender: 50% male, 50% female
Fields:
id,Gender,NameSet,Title,GivenName,MiddleInitial,Surname,StreetAddress,City,StateFull,ZipCode,CountryFull,EmailAddress,Username,TelephoneNumber,TelephoneCountryCode,Birthday,Age,TropicalZodiac,Color,Occupation,Company,BloodType,Kilograms,Centimeters,GUID,Latitude,Longitude
Renamed first field (Number) to id to fit Solr's naming convention
Removed BOM (in Vim, :set nobomb)
Slide 4
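
The two clean-up steps above (drop the BOM, rename Number to id) can also be scripted. A minimal sketch, assuming the export is UTF-8 and that the first header cell is literally Number; the function name and sample bytes are made up for illustration:

```python
import csv
import io

def prepare_csv(raw: bytes) -> str:
    """Strip a UTF-8 BOM and rename the first column from Number to id."""
    # "utf-8-sig" decodes UTF-8 and silently drops a leading BOM if present.
    text = raw.decode("utf-8-sig")
    rows = list(csv.reader(io.StringIO(text)))
    if rows and rows[0] and rows[0][0] == "Number":
        rows[0][0] = "id"  # Solr's conventional uniqueKey name
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

# Hypothetical three-column excerpt of the export, BOM included.
sample = b"\xef\xbb\xbfNumber,Gender,GivenName\n1,male,John\n"
cleaned = prepare_csv(sample)
print(cleaned)
```

Without this step the first header would be read as BOM+Number, so the id field would never be populated.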

First try – Solr's built-in schema
bin/solr start – standalone (non-clustered) server with no initial collections
bin/solr create -c demo1 – uses default configset, with 'schemaless' mode, not for production
Starts with 4 fields (id, _text_, _version_, _root_)
Auto-creates the rest on first occurrence
bin/post -c demo1 ../dataset.csv
auto-detect content type from extension
can bulk upload files
see techproducts shipped example
bin/solr start -e techproducts
For one file, can also do via Admin UI
DEMO
Slide 5

Schemaless schema – lessons learned
Imported 1 record
Failed on the second one, because ZipCode was detected as a number
Can fix that by explicit configuration and rebuilding – see films example
(example/films/README.txt)
Other issues
Dual fields for text and string
Everything multivalued "just in case", so no sorting, a messier API, etc
Many large files
managed-schema: 546 lines (without comments)
solrconfig.xml: 1364 lines (with comments)
Plus another 42 configuration files, mostly language stopwords
Homework: getting this to work – not enough time today
Slide 6
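
Why the second record fails can be shown with a toy model. This is not Solr's actual schemaless code, only a sketch of the idea of locking in a type guessed from the first value seen; the type names and sample postal codes are assumptions:

```python
def guess_type(value: str) -> str:
    """Guess a field type from a single sample value."""
    try:
        int(value)
        return "long"
    except ValueError:
        return "text"

def index_column(values):
    """Lock in the type guessed from the first value, then validate the rest."""
    locked = guess_type(values[0])
    accepted, rejected = [], []
    for value in values:
        if locked == "long" and guess_type(value) != "long":
            rejected.append(value)  # a numeric field cannot hold this value
        else:
            accepted.append(value)
    return locked, accepted, rejected

# Hypothetical ZipCode column: Australian, Canadian, French codes.
locked, accepted, rejected = index_column(["3126", "K1A 0B6", "75008"])
print(locked, accepted, rejected)
```

Declaring ZipCode explicitly as a string or text field sidesteps the guess entirely.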

Learning schema
managed-schema: start from nearly nothing – add as needed
solrconfig.xml: start from nearly all defaults – Most definitely NOT production ready
Not SolrCloud ready – add those as you scale
No extra field types – add as you need them
How small can we go?!?
Based on exploration done for my presentation at Lucene/Solr Revolution 2016
https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-2016 (slides and video)
https://github.com/arafalov/solr-deconstructing-films-example - repo
A bit out of date – schemaless mode has been tuned since
Today's version uses the latest Solr features
https://github.com/arafalov/solr-presentation-2018-may/commits/master (changes commit-by-commit)
Slide 7

Learning schema – managed-schema
<?xml version="1.0" encoding="UTF-8"?>
<schema name="smallest-config" version="1.6">
  <field name="id" type="string" required="true" indexed="true" stored="true" />
  <field name="_text_" type="text_basic" multiValued="true" indexed="true"
         stored="false" docValues="false"/>
  <dynamicField name="*" type="text_basic" indexed="true" stored="true"/>
  <copyField source="*" dest="_text_"/>
  <uniqueKey>id</uniqueKey>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
  <fieldType name="text_basic" class="solr.SortableTextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</schema>
Slide 8

Learning schema – solrconfig.xml
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>7.3.0</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">_text_</str>
      <str name="echoParams">all</str>
    </lst>
  </requestHandler>
</config>
Slide 9

Learning schema – create and index
2 files, 33 lines combined, including blanks – but will it search?
bin/solr create -c tinydir -d ../configs/smallest/ - provide custom config files to the collection
bin/post -c tinydir ../dataset.csv – Remember the BOM and renaming column Number->id
Does it search?
General search?
Case-insensitive search?
Range search: Centimeters:[* TO 99]
Fielded search?
Facet?
Sort?
Are ids preserved?
Are individual fields easy to work with (fl, etc)?
DEMO
Slide 10
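
The checklist above maps onto plain /select URLs. A sketch of building them, assuming a local Solr on localhost (the host is not given in the slides) and picking sample values such as john and Gender purely for illustration:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://localhost:8983/solr/tinydir"  # assumed host; core name from the slides

def select_url(params):
    """Build a /select URL; urlencode escapes ':', '[', ']', '*' and spaces."""
    return f"{BASE}/select?{urlencode(params)}"

def params_of(url):
    """Decode a URL's query string back into a dict, to check the round trip."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}

CHECKS = {
    "general": select_url({"q": "john"}),
    "case": select_url({"q": "JOHN"}),
    "range": select_url({"q": "Centimeters:[* TO 99]"}),
    "fielded": select_url({"q": "GivenName:John"}),
    "facet": select_url({"q": "*:*", "rows": 0, "facet": "on", "facet.field": "Gender"}),
    "sort": select_url({"q": "*:*", "sort": "Surname asc"}),
    "fl": select_url({"q": "*:*", "fl": "id,GivenName,Surname"}),
}

for name, url in CHECKS.items():
    print(name, url)
```

Escaping matters mostly for the range query, whose brackets and spaces are not valid in a raw URL.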

Learning schema - conclusion
It works! And ready to start being used from other parts of the project
Do NOT expose Solr directly to the Internet. Not until you are a Solr Wizard, the Gray.
managed-schema file has NOT changed – because of dynamicField
Still 21 lines
Would still keep the comments
Would still preserve field/type definitions
Will change on first AdminUI/API modification – gets rewritten
What else? Actual search-engine tuning!
Special cases
Numerics – e.g. for Range search
Spatial search – e.g. for Mapping/distance ranking
Multivalued fields
Dates
Special parsing (e.g. names/surnames)
Useful telephone number search
Relevancy tuning!
Slide 11

Evolving schema
Several possibilities
Admin UI
Delete schema field
Add schema field with new definition
Reindex
Sometimes causes a docValues-related exception; then the collection has to be rebuilt from scratch
Schema API (Admin UI uses a subset of it)
See: https://lucene.apache.org/solr/guide/7_3/schema-api.html
Also has Replace a Field
Also has Add/Delete Field Type
Great to use programmatically or with something like Postman (https://www.getpostman.com/)
Edit schema/solrconfig.xml directly and reload the collection
Not recommended for production, but OK with a single server/single developer
Remember to edit the actual schema, not the original config one
◦ Check "Instance" location in Admin UI, in the collection's Overview screen
Remember that in SolrCloud mode, the config files are NOT on disk (they are in ZooKeeper).
Slide 12
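
For the Schema API route, a request can be prepared programmatically. A minimal sketch, assuming a local Solr on localhost and using ZipCode as an illustrative field; it builds an add-field command but does not send it, so it runs without a server:

```python
import json
from urllib.request import Request

SCHEMA_URL = "http://localhost:8983/solr/tinydir/schema"  # assumed host; core from the slides

def add_field_request(name, field_type, **attrs):
    """Prepare a Schema API 'add-field' command as a JSON POST request."""
    payload = {"add-field": {"name": name, "type": field_type, **attrs}}
    return Request(
        SCHEMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Illustrative: pin ZipCode to the existing 'string' type instead of the dynamic default.
req = add_field_request("ZipCode", "string", indexed=True, stored=True)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# urllib.request.urlopen(req) would submit it; omitted to keep the sketch offline.
```

The referenced field type has to exist in the schema already, and existing documents still need reindexing afterwards.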

Evolving schema – add numeric fields
Numeric fields
  Age – int
  Centimeters (height?) – int
  Kilograms – float
Copy missing field types (pint, pfloat) from solr-7.3.0/server/solr/configsets/_default/conf/managed-schema
Map numeric fields explicitly
Delete the existing content, because the storage format changes radically
  bin/post -c tinydir -format solr -d "<delete><query>*:*</query></delete>"
Reload the core in Admin UI's Core Admin (menu is different in SolrCloud mode)
Index again
  bin/post -c tinydir ../dataset.csv
New queries
  facet=true&facet.range=Age&facet.range.start=0&facet.range.end=200&facet.range.gap=10
  Centimeters:[* TO 99] (again)
DEMO
Slide 13
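
Why Centimeters:[* TO 99] is worth repeating after the remap: on a text field a range compares terms as strings, on a numeric field as numbers. A simplified illustration with made-up heights (not Solr code, just the two comparison rules):

```python
heights = ["160", "99", "175", "100", "87"]  # made-up Centimeters values

def text_range_upto(values, upper):
    """Range over string terms: lexicographic comparison, as on a text field."""
    return [v for v in values if v <= upper]

def int_range_upto(values, upper):
    """Range over numbers, as on an int field."""
    return [v for v in values if int(v) <= upper]

as_text = text_range_upto(heights, "99")
as_int = int_range_upto(heights, 99)
print("text field:", as_text)
print("int field: ", as_int)
```

As strings, "160" sorts before "99" because "1" comes before "9", so the text-field version matches nearly everything.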

Evolving schema – spatial search
Solr supports extensive spatial search
https://lucene.apache.org/solr/guide/7_3/spatial-search.html
bounding-box with different shapes (circles, polygons, etc)
distance limiting or boosting
different options with different functionalities
LatLonPointSpatialField
SpatialRecursivePrefixTreeFieldType
BBoxField
All require combined Lat Lon coordinates (lat,lon)
We are providing separate Latitude and Longitude fields – need to merge them with a comma
Let's copy a field type and create a field:
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" geo="true"
           distErrPct="0.025" maxDistErr="0.001" distanceUnits="kilometers" />
<field name="location" type="location_rpt" indexed="true" stored="true" />
Remember to reload – no need to delete, as it is a new field
Next, we also need to give merge instructions with an Update Request Processor
Slide 14

Evolving schema – URPs
Update Request Processors
Deal with the data before it touches the schema
Can do pre-processing magic with many, many processors
See: https://lucene.apache.org/solr/guide/7_3/update-request-processors.html
See: http://www.solr-start.com/info/update-request-processors/ (mine)
Some are more magical than others and have shortcuts, e.g. TemplateUpdateProcessorFactory
All can be configured with chains in solrconfig.xml and applied explicitly or by default
That's how the schemaless mode works (default chain in solrconfig.xml of _default configset)
Also check the way dates are parsed in it, search for parse-date – can be used standalone
IgnoreFieldUpdateProcessorFactory could be useful to drop fields we don't want Solr to process at all (including in the collect-all _text_ field)
Let's reindex everything using the template to populate the new field:
bin/post -c tinydir -params "processor=template&template.field=location:{Latitude},{Longitude}" ../dataset.csv
Query:
q=*:*&rows=1&
fq={!geofilt sfield=location}&
pt=45.493444,-73.558154&d=100&
facet=on&facet.field=City&facet.mincount=1
DEMO
Slide 15
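
What the template and the geofilt filter do can be approximated outside Solr. A rough sketch: the template step is plain placeholder substitution, and the distance filter is modelled with the haversine formula (Solr's spatial field types use their own indexing, so results near the boundary may differ). The two sample documents are made up:

```python
import re
from math import asin, cos, radians, sin, sqrt

def apply_template(doc, spec):
    """Mimic a 'field:template' spec by filling {Name} placeholders from the doc."""
    target, template = spec.split(":", 1)
    filled = re.sub(r"\{([^}]+)\}", lambda m: str(doc[m.group(1)]), template)
    return {**doc, target: filled}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def geofilt(docs, pt, d, sfield="location"):
    """Keep docs whose 'lat,lon' value in sfield lies within d km of pt."""
    hits = []
    for doc in docs:
        lat, lon = (float(x) for x in doc[sfield].split(","))
        if haversine_km(pt[0], pt[1], lat, lon) <= d:
            hits.append(doc)
    return hits

raw = [  # made-up records with the dataset's column names
    {"id": "1", "City": "Montreal", "Latitude": "45.5017", "Longitude": "-73.5673"},
    {"id": "2", "City": "Toronto", "Latitude": "43.6532", "Longitude": "-79.3832"},
]
docs = [apply_template(d, "location:{Latitude},{Longitude}") for d in raw]
nearby = geofilt(docs, pt=(45.493444, -73.558154), d=100)
print([d["City"] for d in nearby])
```

The pt and d values are the ones from the query above; only documents within 100 km of that point survive the filter.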

Evolving schema – phone numbers
Search for John and look at the phone numbers (q=John&fl=TelephoneNumber):
03.99.56.91.63
(08) 9435 3911
79 196 65 43
306-724-3986
Can we search that?
TelephoneNumber:3911 – yes
TelephoneNumber:"65 43" – sort of (need to quote or know these are together)
TelephoneNumber:3986 – sort of: some at the end, some in the middle
Use Case: Just search the last digits (suffix) regardless of formatting
We have MANY analyzers, tokenizers, and character and token filters to help us with it
https://lucene.apache.org/solr/guide/7_3/understanding-analyzers-tokenizers-and-filters.html
http://www.solr-start.com/info/analyzers/ (mine)
Slide 16

Evolving schema – digits-only type
Let's define a super-custom field type:
<fieldType name="phone" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])"
            replacement="" replace="all"/>
    <filter class="solr.ReverseStringFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^0-9])"
            replacement="" replace="all"/>
    <filter class="solr.ReverseStringFilterFactory"/>
  </analyzer>
</fieldType>
Notice
Asymmetric analyzers
Reversing the string so that the trailing digits come first (make sure the reversal is in both analyzers, i.e. symmetric!)
Edge n-grams (3-30 character prefixes of the reversed digits) - makes the index larger, but the search very fast
Slide 17
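
The two analyzer chains can be traced by hand. A small simulation, not Solr code, that mirrors the filters above (keep digits, reverse, then edge n-grams of 3 to 30 characters on the index side only); the function names are illustrative:

```python
import re

def digits_reversed(text):
    """PatternReplace (drop non-digits) followed by ReverseString."""
    return re.sub(r"[^0-9]", "", text)[::-1]

def index_tokens(phone, min_gram=3, max_gram=30):
    """Index side: edge n-grams (prefixes) of the reversed digit string."""
    token = digits_reversed(phone)
    upper = min(max_gram, len(token))
    return {token[:n] for n in range(min_gram, upper + 1)}

def query_token(user_input):
    """Query side: same clean-up and reversal, but no n-grams."""
    return digits_reversed(user_input)

def matches(phone, user_input):
    """A hit means the single query token equals one of the indexed grams."""
    return query_token(user_input) in index_tokens(phone)

for phone, suffix in [("(08) 9435 3911", "3911"),
                      ("79 196 65 43", "65 43"),
                      ("306-724-3986", "3986"),
                      ("306-724-3986", "724")]:
    print(phone, suffix, matches(phone, suffix))
```

A prefix of the reversed number is a suffix of the original, which is why any-length suffix search (from 3 digits up) turns into an exact term lookup.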

Evolving schema – digits-only type - cont
Remap TelephoneNumber to it
<field name="TelephoneNumber" type="phone"
       indexed="true" stored="true" />
And reindex (don't forget our 'speed hack' for now):
bin/post -c tinydir -params "processor=template&template.field=location:{Latitude},{Longitude}" ../dataset.csv
Check terms in Admin UI Schema screen and do our test searches
TelephoneNumber:3911
TelephoneNumber:"65 43"
DEMO
Slide 18

Evolving schema – collapsing accents
Many languages have accents on letters
Frédéric, Thérèse, Jérôme
Many users can't be bothered to type them
Sometimes, they don't even know how to type them
Łódź, Kędzierzyn-Koźle
Can we just ignore the accents when we search?
Several ways, but let's use the simplest: inserting a filter into the text_basic type definition
<filter class="solr.ASCIIFoldingFilterFactory" />
Before the LowerCaseFilterFactory
Reload the collection and reindex – because the filter is symmetric (affects indexing)
Search without accents, general or fielded
Lodz, Frederic, Therese, GivenName:jerome
DEMO
Slide 19
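
The effect of folding can be previewed without Solr. An approximation only: ASCIIFoldingFilterFactory uses its own mapping table, while this sketch relies on Unicode decomposition plus a tiny hand-made table for letters such as Ł that do not decompose:

```python
import unicodedata

EXTRA = {"Ł": "L", "ł": "l"}  # no Unicode decomposition, so mapped by hand

def fold(text):
    """Drop accents: decompose each character and discard combining marks."""
    out = []
    for ch in text:
        ch = EXTRA.get(ch, ch)
        for part in unicodedata.normalize("NFKD", ch):
            if not unicodedata.combining(part):
                out.append(part)
    return "".join(out)

def analyze(text):
    """Fold first, then lowercase, in the same order as the filter chain."""
    return fold(text).lower()

for name in ["Frédéric", "Thérèse", "Jérôme", "Łódź", "Kędzierzyn-Koźle"]:
    print(name, "->", analyze(name))
```

Because the same chain runs at index and query time, both Jérôme and jerome end up as the same term.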

Evolving schema – Names and Surnames
Which names are similar to 'Alexandre'?
q=GivenName:Alexandre~2&
facet=on&facet.field=GivenName&facet.mincount=1
Alexander, Alexandra, Alexandrin, Leixandre, Alexandre, Alexandrie
We can't ask the user to enter arcane Solr syntax
Let's do a phonetic search instead
There are a bunch of different ways, each with its own tradeoffs
PhoneticFilterFactory, BeiderMorseFilterFactory, DaitchMokotoffSoundexFilterFactory, DoubleMetaphoneFilterFactory, ...
https://lucene.apache.org/solr/guide/7_3/phonetic-matching.html
Best to have one - or several - separate Field Type definitions with a copy field
Allows experimenting
Allows triggering them at different times (e.g. in advanced search, but not the general one)
Allows tuning them for relevancy by assigning different weights
Slide 20
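
The ~2 in the fuzzy query is an edit-distance limit. A sketch using plain Levenshtein distance on lowercased names (Lucene's fuzzy matching may additionally count a swap of adjacent letters as one edit, so treat this as an approximation); George is added as a made-up non-match:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

names = ["Alexander", "Alexandra", "Alexandrin", "Leixandre",
         "Alexandre", "Alexandrie", "George"]
close = [n for n in names if edit_distance("alexandre", n.lower()) <= 2]
print(close)
```

All six names from the facet output fall within two edits, which is also why fuzzy search alone cannot rank them sensibly.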

Side-trip into eDisMax and query parsers
How do we actually search multiple fields at once?
We've been using the default 'lucene' query parser so far, on either _text_ or a specific field
Solr has MANY parsers
General: "lucene", DisMax, Extended DisMax (edismax)
Specialized: Block Join, Boolean, Boost, Collapsing, Complex Phrase, Field, Filters, Function, Function Range, Graph, Join, Learning to Rank, ...
  https://lucene.apache.org/solr/guide/7_3/other-parsers.html
We already used the spatial geofilt query parser: fq={!geofilt sfield=location}
edismax can search against multiple fields, with different weights, boosts, ties, minimum-match specifications, etc
Choose with defType=edismax or {!edismax param=value param=value}search_string
Let's search for "George Brown" against (qf) "GivenName Surname Company StreetAddress City" and display the same fields only
DEMO
Try using http://splainer.io/ to review the results
Try with qf=GivenName^5 Surname^5 Company StreetAddress City
Slide 21
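
How qf boosts change the ordering can be seen in a toy model. This keeps only the boost and the "best field wins per term" idea behind DisMax; real scores also depend on term statistics, so the numbers are illustrative and the two documents are invented:

```python
def parse_qf(qf):
    """Turn 'GivenName^5 Company' into {'GivenName': 5.0, 'Company': 1.0}."""
    boosts = {}
    for part in qf.split():
        field, _, boost = part.partition("^")
        boosts[field] = float(boost) if boost else 1.0
    return boosts

def dismax_score(doc, query, qf, tie=0.0):
    """Per term: best matching field's boost, plus tie * the other matches."""
    boosts = parse_qf(qf)
    total = 0.0
    for term in query.lower().split():
        field_scores = [boost for field, boost in boosts.items()
                        if term in doc.get(field, "").lower().split()]
        if field_scores:
            best = max(field_scores)
            total += best + tie * (sum(field_scores) - best)
    return total

person = {"GivenName": "George", "Surname": "Brown", "Company": "Acme",
          "StreetAddress": "1 Main St", "City": "Leeds"}
street = {"GivenName": "Anna", "Surname": "Li", "Company": "Brown and Sons",
          "StreetAddress": "12 George Street", "City": "Bath"}

flat = "GivenName Surname Company StreetAddress City"
boosted = "GivenName^5 Surname^5 Company StreetAddress City"
for qf in (flat, boosted):
    print(qf, dismax_score(person, "George Brown", qf),
          dismax_score(street, "George Brown", qf))
```

With flat weights the two documents tie; boosting the name fields pushes the actual George Brown to the top.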

eDisMax exploration continues
Result: 149 records, but with matches scattered across all the field values
Enter RELEVANCY
Recall – did we find all documents?
Precision – did we find just the documents we needed?
Recall and Precision fight each other. Perfect recall is q=*:* ...
Ranking – First hit is very important, ones after that less so (not always)
Side note: Field sorting destroys ranking.
We were optimizing Recall
Dump everything into _text_ and let search sort it out
Optimizing for Precision may seem easy too
Under eDisMax, set mm=100%
DEMO
Slide 22
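
The precision/recall trade-off in numbers, using the standard definitions on a made-up collection of 20 documents of which 4 are relevant:

```python
def precision_recall(retrieved, relevant):
    """precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

all_ids = set(range(1, 21))   # hypothetical collection of 20 documents
relevant = {1, 2, 3, 4}       # the ones the user actually wanted

match_all = precision_recall(all_ids, relevant)  # like q=*:*
strict = precision_recall({1, 2}, relevant)      # like a very strict query
print("match all:", match_all)
print("strict:   ", strict)
```

Matching everything gives perfect recall at 20% precision; the strict query is perfectly precise but misses half of the relevant documents.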

eDisMax for ranking
It is a business decision what Precision and Recall mean for your use case
Often "find more just in case" and focus on "ranking better" is the right approach
Try
qf=GivenName^5 Surname^5 Company StreetAddress City (no mm)
qf=GivenName^5 Surname^5 Company StreetAddress City and mm=100%
qf=GivenName^5 Surname^5 _text_ and mm=100%
DEMO in Splainer
Relevancy business case for our names (GivenName, Surname)
UPPER/lower case does not matter
Exact spelling (with accents) matches best – new Field Type needed (actually original text_basic...)
Accent-free spelling matches next – existing text_basic and therefore dynamic field match is fine
Phonetic spelling matches lowest (but higher than fallback _text_ field) – new Field Type needed
Slide 23

Multiple fields for same content
<fieldType name="text_exact" class="solr.SortableTextField" positionIncrementGap="100">
  <analyzer>
    <!-- tokenizer not shown on the slide; an analyzer needs one, assumed same as the original text_basic -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- tokenizer not shown on the slide; StandardTokenizerFactory is an assumption -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldType>
<field name="GivenName_exact" type="text_exact" indexed="true" stored="false"/>
<field name="Surname_exact" type="text_exact" indexed="true" stored="false"/>
<field name="GivenName_ph" type="text_phonetic" indexed="true" stored="false"/>
<field name="Surname_ph" type="text_phonetic" indexed="true" stored="false"/>
<copyField source="GivenName" dest="GivenName_exact"/>
<copyField source="GivenName" dest="GivenName_ph"/>
<copyField source="Surname" dest="Surname_exact"/>
<copyField source="Surname" dest="Surname_ph"/>
Slide 24

Testing multiple representations
Our test cases
Frédéric, Thérèse, Jérôme
Check different analysis in Admin UI's Analysis screen
Can choose fields or field types from the drop-down; use types, as we have dynamic fields
Can also test analysis vs search and highlight the matches
Test search in Admin UI and Splainer, with eDisMax enabled, for Thérèse against different sets of Query Fields (qf)
Default search (qf=_text_)
GivenName
GivenName _text_
GivenName^10 _text_
GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_
DEMO
Slide 25

Simplify API usage
Original search URL: http://...:8983/solr/tinydir/select?defType=edismax&fl=.....
The good parameter set:
defType=edismax
qf=GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_
fl=GivenName Surname Company StreetAddress City CountryFull
Lock it in a dedicated request handler in solrconfig.xml
<requestHandler name="/namesearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">_text_</str>
    <str name="echoParams">all</str>
    <str name="defType">edismax</str>
    <str name="qf">GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_</str>
    <str name="fl">GivenName Surname Company StreetAddress City CountryFull</str>
  </lst>
</requestHandler>
Now: http://...:8983/solr/tinydir/namesearch?q=Thérèse
DEMO
Slide 26
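
What the handler's defaults section buys the client: the request carries only q, and anything it does send overrides the default. A sketch of that merge rule (the dictionary mirrors the handler above; the function name is illustrative):

```python
from urllib.parse import urlencode

# Mirrors the <lst name="defaults"> of the /namesearch handler.
DEFAULTS = {
    "df": "_text_",
    "echoParams": "all",
    "defType": "edismax",
    "qf": "GivenName_exact^15 GivenName^10 GivenName_ph^5 _text_",
    "fl": "GivenName Surname Company StreetAddress City CountryFull",
}

def effective_params(request_params, defaults=DEFAULTS):
    """Defaults apply unless the request supplies the same parameter."""
    return {**defaults, **request_params}

short_query = urlencode({"q": "Thérèse"})             # all the client has to send
params = effective_params({"q": "Thérèse"})
override = effective_params({"q": "Thérèse", "fl": "id"})
print(short_query)
print(params["defType"], "|", params["qf"])
print(override["fl"])
```

Since clients can still override defaults, parameters that must not change would need a stricter section of the handler configuration.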

Bonus magic
Based on previous work with Thai language: https://github.com/arafalov/solr-thai-test
Needs ICU libraries in solrconfig.xml
<lib path="../../../contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-7.3.0.jar" />
<lib path="../../../contrib/analysis-extras/lib/icu4j-59.1.jar" />
Field, type, and copyField definition in managed-schema:
<fieldType name="text_ru_en" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="ru-en" />
    <filter class="solr.BeiderMorseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.BeiderMorseFilterFactory" />
  </analyzer>
</fieldType>
<field name="GivenName_ruen" type="text_ru_en" indexed="true" stored="false"/>
<copyField source="GivenName" dest="GivenName_ruen"/>
Reload, reindex
Search
  GivenName:Zahar
  GivenName_ruen:Zahar
And BOOM!
Slide 27
