Query Parsing
Presented by Erik Hatcher
    27 February 2013




Description

    Interpreting what the user
    meant and what they ideally
    would like to find is tricky
    business. This talk will cover
    useful tips and tricks to better
    leverage and extend Solr's
    analysis and query parsing
    capabilities to more richly parse
    and interpret user queries.

Abstract

 In this talk, Solr's built-in query parsers will be
 detailed, including when and how to use them. Solr has
 nested query parsing capability, allowing for multiple
 query parsers to be used to generate a single query.
 The nested query parsing feature will be described
 and demonstrated. In many domains, e-commerce in
 particular, parsing queries often means interpreting
 which entities (e.g. products, categories, vehicles) the
 user likely means; this talk will conclude with
 techniques to achieve richer query interpretation.




Query parsers in Solr




What’s new in 4.x?
    • Surround query parser
    • _query_:"{!...}" ugliness now unnecessary
      - just use nested {!...} expressions
    • Coming to 4.2: "switch" query parser




"lucene"-syntax query parser
    • FieldType awareness
       - range queries, numerics
       - allows date math
       - reverses wildcard terms, if indexing used ReverseWildcardFilter
    • Magic fields
       - _val_: function query injection
       - _query_: nested query, to use a different query parser
    • Multi-term analysis (type="multiterm")
       - Analyzes prefix, wildcard, regex expressions to normalize
            diacritics, lowercase, etc
       - If not explicitly defined, all MultiTermAwareComponents from the
            query analyzer are used, or KeywordTokenizer for effectively
            no analysis



dismax
   • Simple constrained syntax
       - "supports phrases" +requiredTerms -prohibitedTerms loose terms
   • Spreads terms across specified query fields (qf) and entire query string
        across phrase fields (pf)
       - with field-specific boosting
       - and explicit and implicit phrase slop
       - scores each document with the maximum score for that document as produced by
             any subquery; primary score associated with the highest boost, not the sum of the
             field scores (as BooleanQuery would give)

   • Minimum match (mm) allows a gradient between requiring all terms (AND)
        and requiring any term (OR)
       - some number of terms must match, but not all necessarily, and can vary depending
             on number of actual query terms

   • Additive boost queries (bq) and boost functions (bf)
   • Debug output includes parsed boost and function queries



Specifying the query parser
    • defType=parser_name
      - defines main query parser
    • {!parser_name local=param...}expression
      - Can specify parser per query expression
    • These are equivalent:
      - q=ApacheCon NA 2013&defType=dismax&mm=2&qf=name
      - q={!dismax qf=name mm=2}ApacheCon NA 2013
      - q={!dismax qf=name mm=2 v='ApacheCon NA 2013'}
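
A client might build the three equivalent requests as follows. This is a minimal sketch using only the Python standard library; the `/select` path is assumed as the handler, and nothing is actually sent to Solr.

```python
from urllib.parse import urlencode

user_query = "ApacheCon NA 2013"

# 1. Main parser chosen with defType; dismax settings are request parameters.
by_deftype = {"q": user_query, "defType": "dismax", "mm": "2", "qf": "name"}

# 2. Parser and settings given as local params in front of the query text.
by_local_params = {"q": "{!dismax qf=name mm=2}" + user_query}

# 3. Same as 2, with the query text passed through the v local param.
by_v_param = {"q": "{!dismax qf=name mm=2 v='" + user_query + "'}"}

for params in (by_deftype, by_local_params, by_v_param):
    # urlencode takes care of URL escaping (spaces, braces, quotes).
    print("/select?" + urlencode(params))
```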

Local Parameter Substitution
    /document?id=13




Nested Query Parsing
   • Leverages the "lucene" query parser's _query_/{!...} trick
   • Example:
      - q={!dismax qf='title^2 body' v=$user_query} AND
          {!dismax qf='keywords^5 description^2' v=$topic}
      - &user_query=ApacheCon NA 2013
      - &topic=events
   • Setting the complex nested q parameter in a request handler
       can make the client request lean and clean
      - And even qf and other parameters can be substituted:
          • {!dismax qf=$title_qf pf=$title_pf v=$title_query}
          • &title_qf=title^5 subtitle^2...

   • Real world example, Stanford University Libraries:
      - http://searchworks.stanford.edu/advanced
      - Insanely complex sets of nested dismax's and qf/pf settings
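
The "lean and clean" client request can be illustrated with a short sketch. It assumes the nested q expression from the example either travels with the request or is configured in the request handler; the function name `build_request` and the `q_in_handler` flag are hypothetical.

```python
from urllib.parse import urlencode, parse_qs

# The nested expression from the example; it could live in the request
# handler's configuration instead of being sent by the client.
NESTED_Q = ("{!dismax qf='title^2 body' v=$user_query} AND "
            "{!dismax qf='keywords^5 description^2' v=$topic}")


def build_request(user_query, topic, q_in_handler=False):
    """Return the query string a client would send."""
    params = {"user_query": user_query, "topic": topic}
    if not q_in_handler:
        params["q"] = NESTED_Q
    return urlencode(params)


print(build_request("ApacheCon NA 2013", "events"))
print(build_request("ApacheCon NA 2013", "events", q_in_handler=True))
```

When the handler holds the q template, the client sends only the two values it actually knows about.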


edismax: extended dismax
   • "An advanced multi-field query parser based on the dismax parser"
       - Handles "lucene" syntax as well as dismax features
   • Fields available to user may be limited (uf)
       - including negations and dynamic fields, e.g. uf=* -cost -timestamp
   • Shingles query into 2 and 3 term phrases
       - Improves quality of results when query contains terms across multiple fields
       - pf2/pf3 and ps2/ps3
       - removes stop words from shingled phrase queries
   • multiplicative "boost" functions
   • Additional features
       - Query comprised entirely of "stopwords" optionally allowed
           • if indexed, but query analyzer is set to remove them
       - Allow "lowercaseOperators" by default; or/OR, and/AND



term query parser
    • FieldType aware, no analysis
      - converts to internal representation automatically
    • "raw" query parser is similar
      - though raw parser is not field type aware; no
         internal representation conversion
    • Best practice for filtering on single facet
       value
      - fq={!term f=facet_field}crazy:value :)
         • no query string escaping needed; but of course still
            need URL encoding when appropriate
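
A facet click handler could build such a filter like this. A minimal sketch: `term_filter` is a hypothetical helper name, and the facet value is the one from the slide.

```python
from urllib.parse import urlencode


def term_filter(field, value):
    """Filter on one exact indexed value; the value is used as-is,
    so no query-syntax escaping of ':' or ')' is needed."""
    return "{!term f=" + field + "}" + value


fq = term_filter("facet_field", "crazy:value :)")
print(fq)                     # {!term f=facet_field}crazy:value :)
print(urlencode({"fq": fq}))  # URL encoding is still applied here
```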




prefix query parser
    • No field type awareness
    • {!prefix f=field_name}prefixValue
      - Similar to Lucene query parser
         field_name:prefixValue*
      - Solr's "lucene" query parser has multiterm
         analysis capability, but the prefix query parser
         does not analyze




boost query parser
    • Multiplicative to wrapped query score
      - Internally used by edismax "boost"
    • {!boost b=recip(ms(NOW,mydatefield),3.16e-11,1,1)}foo
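
The numbers in that boost are easier to read once worked through. Solr's recip(x,m,a,b) computes a/(m*x+b), and 3.16e-11 is roughly one divided by the number of milliseconds in a year, so the multiplier is 1.0 for a brand-new document and about 0.5 for a one-year-old one. A sketch of the arithmetic (the helper names are illustrative):

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # about 3.16e10


def recip(x, m, a, b):
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms):
    """Multiplier produced by recip(ms(NOW,mydatefield),3.16e-11,1,1)."""
    return recip(age_ms, 3.16e-11, 1, 1)


for years in (0, 1, 10):
    print(years, round(recency_boost(years * MS_PER_YEAR), 2))
```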




field query parser
    • Same as handling of field:"Some Text" clause by Solr's
        "lucene" query parser
    • FieldType aware
       - TermQuery generated, unless field type has special handling
    • TextField
       -   PhraseQuery: if multiple tokens in different positions
       -   MultiPhraseQuery: if multiple tokens share some positions
       -   BooleanQuery: if multiple terms all in same position
       -   TermQuery: if only a single token

    • Other types that handle field queries specially:
       - currency, spatial types (point, latlon, etc)
       - {!field f=location}49.25,8.883333

surround query parser
   • Creates Lucene SpanQuery objects for fine-grained proximity
       matching, including use of wildcards
   • Uses infix and prefix notation
      - infix: AND/OR/NOT/nW/nN/()
      - prefix: AND/OR/nW/nN
      - Supports Lucene query parser basics
          • field:value, boost^5, wild?c*rd, prefix*
      - Proximity operators:
          • W: ordered
          • N: unordered

   • No analysis of clauses
      - requires user or search client to lowercase, normalize, etc
   • Example:
      - q={!surround}Apache* 4w Portland

join query parser
    • Pseudo-join
       - Field values from inner result set used to map to another field to select final
            result set
       - No information from inner result set carries to final result set, such as
            scores or field values (it's not SQL!)

    • Can join from another local Solr core
       - Allows for different types of entities to be indexed in separate indexes
            altogether, modeled into clean schemas
       - Separate cores can scale independently, especially with commit and
            warming issues

    • Syntax:
       - {!join from=... to=... [fromIndex=core_name]}query
    • For more information:
       - Yonik's Lucene Revolution 2011 presentation: http://vimeo.com/25015101
       - http://wiki.apache.org/solr/Join
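
The pseudo-join semantics can be modeled on in-memory data. This is a sketch under assumptions: the two lists stand in for two cores, and the field names and documents are invented for illustration.

```python
def pseudo_join(from_docs, to_docs, from_field, to_field, matches):
    """Collect from_field values of matching inner docs, then select
    outer docs whose to_field holds one of those values."""
    join_keys = {doc[from_field] for doc in from_docs if matches(doc)}
    return [doc for doc in to_docs if doc[to_field] in join_keys]


# Hypothetical "reviews" and "books" indexes.
reviews = [
    {"book_id": "1", "text": "great book on search"},
    {"book_id": "2", "text": "a cooking classic"},
    {"book_id": "1", "text": "search explained well"},
]
books = [{"id": "1", "title": "Book A"}, {"id": "2", "title": "Book B"}]

result = pseudo_join(reviews, books, "book_id", "id",
                     lambda doc: "search" in doc["text"].split())
print(result)  # [{'id': '1', 'title': 'Book A'}]
```

Only the join keys cross over: the result contains book documents alone, with no review text or review scores attached.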


spatial query parsers
    • Operates on geohash, latlon, and point types
    • geofilt
       - Exact distance filtering
       - fq={!geofilt sfield=location pt=10.312,-20.556 d=3.5}
    • bbox
       - Alternatively use a range query:
            • fq=location:[45,-94 TO 46,-93]
    • Can use in conjunction with geodist() function
       - Sorting:
            • sort=geodist() asc
       -   Returning distance:
            • fl=_dist_:geodist()
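
What geofilt and geodist() do conceptually can be sketched with the haversine formula. Assumptions: distances are in kilometers, the Earth radius constant is a common mean value that may differ slightly from the one Solr uses, and the documents are invented.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def geofilt(docs, pt, d):
    """Keep docs within d km of pt, nearest first (like sort=geodist() asc)."""
    near = [doc for doc in docs if haversine_km(*pt, *doc["location"]) <= d]
    return sorted(near, key=lambda doc: haversine_km(*pt, *doc["location"]))


docs = [
    {"id": "far", "location": (11.312, -20.556)},
    {"id": "near", "location": (10.32, -20.556)},
    {"id": "here", "location": (10.312, -20.556)},
]
print([doc["id"] for doc in geofilt(docs, (10.312, -20.556), 3.5)])
# ['here', 'near']
```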


frange: function range
    • Match a field term range, textual or numeric
    • Example:
      - fq={!frange l=0 u=2.2}sum(user_ranking,editor_ranking)
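
The example filter keeps documents whose function value lies between l and u. A sketch of that behavior on invented documents follows; the `incl` and `incu` arguments model the parser's inclusive-bound options, which default to inclusive.

```python
def frange(docs, func, l=None, u=None, incl=True, incu=True):
    """Keep docs whose func(doc) value falls inside the [l, u] range."""
    kept = []
    for doc in docs:
        value = func(doc)
        if l is not None and (value < l or (value == l and not incl)):
            continue
        if u is not None and (value > u or (value == u and not incu)):
            continue
        kept.append(doc)
    return kept


docs = [
    {"id": "a", "user_ranking": 1.0, "editor_ranking": 1.0},   # sum 2.0
    {"id": "b", "user_ranking": 2.0, "editor_ranking": 1.5},   # sum 3.5
    {"id": "c", "user_ranking": -1.0, "editor_ranking": 0.5},  # sum -0.5
]


def total(doc):
    return doc["user_ranking"] + doc["editor_ranking"]


print([doc["id"] for doc in frange(docs, total, l=0, u=2.2)])  # ['a']
```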




switch query parser
    • acts like a "switch/case" statement
    • Example:
      - fq={!switch
            case.all='*:*'
            case.yes='inStock:true'
            case.no='inStock:false'
            v=$in_stock}
      - &in_stock=yes
    • Solr 4.2+
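
The selection logic can be modeled in a few lines. This sketch assumes the behavior as I understand it: the value is trimmed, a bare `case` entry handles a blank value, `default` is the fallback, and an unmatched value without a default is an error. The function and variable names are illustrative.

```python
def switch(local_params, v):
    """Return the sub-query selected by v, mimicking the switch parser."""
    key = (v or "").strip()
    name = "case." + key if key else "case"
    if name in local_params:
        return local_params[name]
    if "default" in local_params:
        return local_params["default"]
    raise ValueError("no switch case matches %r and no default is set" % key)


in_stock_filter = {
    "case.all": "*:*",
    "case.yes": "inStock:true",
    "case.no": "inStock:false",
}
print(switch(in_stock_filter, "yes"))   # inStock:true
print(switch(in_stock_filter, " all ")) # *:*
```

This lets a client send a simple, fixed vocabulary (`in_stock=yes`) while the actual filter queries stay in server-side configuration.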



PostFilter
    • Queries implementing the PostFilter interface are consulted after the
        query and all other filters have narrowed the documents under
        consideration
    • Queries supporting PostFilter
       - frange, geofilt, bbox
    • Enabled by setting cache=false and cost >= 100
       - Example:
           • fq={!frange l=5 cache=false cost=200}div(log(popularity),sqrt(geodist()))
    • More info:
       - Advanced filter caching
           • http://searchhub.org/2012/02/10/advanced-filter-caching-in-solr/
       - Custom security filtering
           • http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
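
Why post filtering helps with expensive filters can be seen by counting evaluations. This is a toy model on invented documents, not how Solr's collectors are actually implemented: the cheap filters run first and the expensive check runs only on the survivors.

```python
def run_filters(docs, cheap_filters, post_filter):
    """Apply cheap filters first; run the expensive filter on what is left.
    Returns the matching docs and how often the expensive filter ran."""
    survivors = [doc for doc in docs if all(f(doc) for f in cheap_filters)]
    evaluations = 0
    results = []
    for doc in survivors:
        evaluations += 1
        if post_filter(doc):
            results.append(doc)
    return results, evaluations


# Hypothetical index: 1000 docs, one in ten in stock.
docs = [{"id": i, "inStock": i % 10 == 0, "popularity": i} for i in range(1000)]

results, evaluations = run_filters(
    docs,
    cheap_filters=[lambda doc: doc["inStock"]],
    post_filter=lambda doc: doc["popularity"] >= 500,  # stands in for a costly function
)
print(len(results), evaluations)  # 50 100
```

The costly check ran 100 times instead of 1000.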


Phonetic, Stem, Synonym
   • Users tend to expect loose matching
      - but with "more exact" matches ranked higher
   • Various mechanisms for loosening matching:
      - Phonetic sounds-like: cat/kat, similar/similer
      - Stemming: search/searches/searched/searching
      - Synonyms: cat/feline, dog/canine
   • Distinguish ranking between exact and looser matching:
      - copyField original to a new (unstored, yet indexed) field with
           desired looser matching analysis
      - query across original field and looser field, with higher boosting
           for original field
          • /select?q=ApatchyCon&defType=dismax&qf=name^5 name_phonetic
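
The effect of qf=name^5 name_phonetic can be imitated with a toy scorer. Everything here is an assumption made for illustration: `crude_key` is a made-up stand-in and NOT a real phonetic algorithm, matching is on the whole name, and the scores are just the boosts.

```python
def crude_key(text):
    """Toy sounds-like key; a real setup would use a phonetic filter."""
    return text.lower().replace("tch", "ch").replace("y", "e")


def score(name, query, exact_boost=5.0, loose_boost=1.0):
    """Best of the exact-field match and the looser-field match."""
    exact = exact_boost if name.lower() == query.lower() else 0.0
    loose = loose_boost if crude_key(name) == crude_key(query) else 0.0
    return max(exact, loose)


names = ["ApacheCon", "Lucene EuroCon"]
print([(name, score(name, "ApatchyCon")) for name in names])
# [('ApacheCon', 1.0), ('Lucene EuroCon', 0.0)]
print(score("ApacheCon", "apachecon"))  # 5.0
```

The misspelled query still finds ApacheCon through the loose field, while a correctly spelled query scores higher through the original field.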




Suggest things, not strings
    • Model It As You Need It
       - Leverage Lucene's Document/Field/Query/score & sort &
           highlight

    • Example 1: Selling automobile parts
       - Exact year/make/model is needed to pick the right parts
       - Suggest a vehicle as user types
          • from the main parts index: tricky, requires lots of special fields and analysis
               tricks and even then you're suggesting fields from "parts"
          • Another (better?) approach: model vehicles as a separate core, "search"
               when suggesting, return documents, not field terms

    • Example 2: Technical Conferences
       - /select?q=Con&wt=csv&fl=name
          • Lucene EuroCon
          • ApacheCon



Query parsing and relevancy
   • The query is the formula that determines
      each document's score
   • Tuning is about what your application needs
     - Build tests using your corpus and real-world
        queries and ranking expectations
     - Re-run tests frequently/continuously as query
        parameters are tweaked
   • Tooling, currently, is mostly in-house custom
     - but that's changing, stay tuned!
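
A ranking-expectation test does not need much tooling to get started. The sketch below assumes a `search` callable that returns ranked document ids; `fake_search` and the ids are placeholders for a real request against your own corpus.

```python
def failed_expectations(search, expectations, top_n=3):
    """Return (query, expected_id, top_ids) for every expectation where
    the expected document is missing from the top_n results."""
    failures = []
    for query, expected_id in expectations:
        top_ids = search(query)[:top_n]
        if expected_id not in top_ids:
            failures.append((query, expected_id, top_ids))
    return failures


def fake_search(query):
    """Stand-in for a real Solr request; returns ranked document ids."""
    canned = {"apachecon": ["conf-1", "conf-7"], "lucene": ["conf-7", "conf-2"]}
    return canned.get(query.lower(), [])


expectations = [("ApacheCon", "conf-1"), ("Lucene", "conf-2"), ("Solr", "conf-9")]
print(failed_expectations(fake_search, expectations, top_n=1))
# [('Lucene', 'conf-2', ['conf-7']), ('Solr', 'conf-9', [])]
```

Re-running the same expectations after each change to qf, pf, mm, or boosts shows which queries a tweak helped and which it broke.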



Development/troubleshooting
   • Analysis
      - /analysis/field
          • ?analysis.fieldname=name
          • &analysis.fieldvalue=NA ApacheCon 2013
          • &q=apachecon
          • &analysis.showmatch=true
      - Also /analysis/document
      - admin UI analysis tool
   • Query Parsing
      - &debug=query
   • Relevancy
      - &debug=results
          • shows scoring explanations


Future of Solr query parsing
    • JSON query parser
    • XML query parser
    • PayloadTermQuery parser




JSON query parser
   • https://issues.apache.org/jira/browse/SOLR-4351
   • Current patch enables these:
     - {'term':{'id':'13'}}
     - {'field':{'text':'ApacheCon'}}
     - {'frange':{'v':'mul(rating,2)', 'l':20,'u':24}}
     - {'join':{'from':'book_id', 'to':'id', 'v':{'term':{'text':'search'}}}}




XML query parser
   • Will allow a rich query "tree"
   • Parameters will fill in variables in a server-
      side XSLT query tree definition, or can
      provide full query tree
   • Useful for "advanced" query, multi-valued,
      input
   • https://issues.apache.org/jira/browse/SOLR-839




Payload term query parser
   • Solr supports indexing payload data on
      terms using DelimitedPayloadTokenFilter,
      but currently no support for querying with
      payloads
   • Requires custom Similarity implementation
      to provide score factor for payload data
   • Allows index-time weighting of terms
     - e.g. <b>bold words</b> weighted higher
   • https://issues.apache.org/jira/browse/SOLR-1485


BlockJoinQuery
   • https://issues.apache.org/jira/browse/SOLR-3076
   • Lucene provides a way to index a
      hierarchical "block" of documents and
      query it using ToParentBlockJoinQuery
      and ToChildBlockJoinQuery
     - Indexing a block is not yet supported by Solr
   • Example use case: What books greater than
      100 pages have paragraphs containing
      "information retrieval"?

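The example use case can be modeled on nested in-memory data to show what a parent-level block join returns. This is a conceptual sketch only: the books, field names, and helper function are invented, and real block joins work on contiguous blocks of index documents rather than nested dictionaries.

```python
def parents_with_matching_child(parents, parent_filter, child_matches, children_key):
    """Return parents that pass parent_filter and have at least one
    child for which child_matches is true."""
    return [
        parent for parent in parents
        if parent_filter(parent)
        and any(child_matches(child) for child in parent[children_key])
    ]


# Hypothetical books with their paragraphs as child documents.
books = [
    {"title": "Book A", "pages": 320,
     "paragraphs": ["intro to information retrieval", "scoring"]},
    {"title": "Book B", "pages": 80,
     "paragraphs": ["information retrieval in brief"]},
    {"title": "Book C", "pages": 450,
     "paragraphs": ["databases", "transactions"]},
]

hits = parents_with_matching_child(
    books,
    parent_filter=lambda book: book["pages"] > 100,
    child_matches=lambda paragraph: "information retrieval" in paragraph,
    children_key="paragraphs",
)
print([book["title"] for book in hits])  # ['Book A']
```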

Solr Query Parsing

  • 1. Query Parsing Presented by Erik Hatcher 27 February 2013 1
  • 2. Description Interpreting what the user meant and what they ideally would like to find is tricky business. This talk will cover useful tips and tricks to better leverage and extend Solr's analysis and query parsing capabilities to more richly parse and interpret user queries. 2
  • 3. Abstract In this talk, Solr's built-in query parsers will be detailed included when and how to use them. Solr has nested query parsing capability, allowing for multiple query parsers to be used to generate a single query. The nested query parsing feature will be described and demonstrated. In many domains, e-commerce in particular, parsing queries often means interpreting which entities (e.g. products, categories, vehicles) the user likely means; this talk will conclude with techniques to achieve richer query interpretation. 3
  • 6. What’s new in 4.x? • Surround query parser • _query_:”{!...}” ugliness now unnecessary - just use nested {!...} expressions • Coming to 4.2: "switch" query parser 6
  • 7. "lucene"-syntax query parser • FieldType awareness - range queries, numerics - allows date math - reverses wildcard terms, if indexing used ReverseWildcardFilter • Magic fields - _val_: function query injection - _query_: nested query, to use a different query parser • Multi-term analysis (type="multiterm") - Analyzes prefix, wildcard, regex expressions to normalize diacritics, lowercase, etc - If not explicitly defined, all MultiTermAwareComponent's from query analyzer are used, or KeywordTokenizer for effectively no analysis 7
  • 8. dismax • Simple constrained syntax - "supports phrases" +requiredTerms -prohibitedTerms loose terms • Spreads terms across specified query fields (qf) and entire query string across phrase fields (pf) - with field-specific boosting - and explicit and implicit phrase slop - scores each document with the maximum score for that document as produced by any subquery; primary score associated with the highest boost, not the sum of the field scores (as BooleanQuery would give) • Minimum match (mm) allows query fields gradient between AND and OR - some number of terms must match, but not all necessarily, and can vary depending on number of actual query terms • Additive boost queries (bq) and boost functions (bf) • Debug output includes parsed boost and function queries 8
  • 9. Specifying the query parser • defType=parser_name - defines main query parser • {!parser_name local=param...}expression - Can specify parser per query expression • These are equivalent: - q=ApacheCon NA 2013&defType=dismax&mm=2&qf=name - q={!dismax qf=name mm=2}ApacheCon NA 2013 - q={!dismax qf=name mm=2 v='ApacheCon NA 2013'} 9
  • 10. Local Parameter Substitution /document?id=13 10
  • 11. Nested Query Parsing • Leverages the "lucene" query parser's _query_/{!...} trick • Example: - q={!dismax qf='title^2 body' v=$user_query} AND {!dismax qf='keywords^5 description^2' v=$topic} - &user_query=ApacheCon NA 2013 - &topic=events • Setting the complex nested q parameter in a request handler can make the client request lean and clean - And even qf and other parameters can be substituted: • {!dismax qf=$title_qf pf=$title_pf v=$title_query} • &title_qf=title^5 subtitle^2... • Real world example, Stanford University Libraries: - http://searchworks.stanford.edu/advanced - Insanely complex sets of nested dismax's and qf/pf settings 11
  • 12. edismax: extended dismax • "An advanced multi-field query parser based on the dismax parser" - Handles "lucene" syntax as well as dismax features • Fields available to user may be limited (uf) - including negations and dynamic fields, e.g. uf=* -cost -timestamp • Shingles query into 2 and 3 term phrases - Improves quality of results when query contains terms across multiple fields - pf2/pf3 and ps2/ps3 - removes stop words from shingled phrase queries • multiplicative "boost" functions • Additional features - Query comprised entirely of "stopwords" optionally allowed • if indexed, but query analyzer is set to remove them - Allow "lowercaseOperators" by default; or/OR, and/AND 12
  • 13. term query parser • FieldType aware, no analysis - converts to internal representation automatically • "raw" query parser is similar - though raw parser is not field type aware; no internal representation conversion • Best practice for filtering on single facet value - fq={!term f=facet_field}crazy:value :) • no query string escaping needed; but of course still need URL encoding when appropriate 13
  • 14. prefix query parser • No field type awareness • {!prefix f=field_name}prefixValue - Similar to Lucene query parser field_name:prefixValue* - Solr's "lucene" query parser has multiterm analysis capability, but the prefix query parser does not analyze 14
  • 15. boost query parser • Multiplicative to wrapped query score - Internally used by edismax "boost" • {!boost b=recip(ms(NOW,mydatefield), 3.16e-11,1,1)}foo 15
  • 16. field query parser • Same as handling of field:"Some Text" clause by Solr's "lucene" query parser • FieldType aware - TermQuery generated, unless field type has special handling • TextField - PhraseQuery: if multiple tokens in different positions - MultiPhraseQuery: if multiple tokens share some positions - BooleanQuery: if multiple terms all in same position - TermQuery: if only a single token • Other types that handle field queries specially: - currency, spatial types (point, latlon, etc) - {!field f=location}49.25,8.883333 16
  • 17. surround query parser • Creates Lucene SpanQuery's for fine-grained proximity matching, including use of wildcards • Uses infix and prefix notation - infix: AND/OR/NOT/nW/nN/() - prefix: AND/OR/nW/nN - Supports Lucene query parser basics • field:value, boost^5, wild?c*rd, prefix* - Proximity operators: • N: ordered • W: unordered • No analysis of clauses - requires user or search client to lowercase, normalize, etc • Example: - q={!surround}Apache* 4w Portland 17
  • 18. join query parser • Pseudo-join - Field values from inner result set used to map to another field to select final result set - No information from inner result set carries to final result set, such as scores or field values (it's not SQL!) • Can join from another local Solr core - Allows for different types of entities to be indexed in separate indexes altogether, modeled into clean schemas - Separate cores can scale independently, especially with commit and warming issues • Syntax: - {!join from=... to=... [fromIndex=core_name]}query • For more information: - Yonik's Lucene Revolution 2011 presentation: http://vimeo.com/25015101 - http://wiki.apache.org/solr/Join 18
  • 19. spatial query parsers • Operates on geohash, latlon, and point types • geofilt - Exact distance filtering - fq={!geofilt sfield=location pt=10.312,-20.556 d=3.5} • bbox - Alternatively use a range query: • fq=location:[45,-94 TO 46,-93] • Can use in conjunction with geodist() function - Sorting: • sort=geodist() asc - Returning distance: • fl=_dist_:geodist() 19
  • 20. frange: function range • Match a field term range, textual or numeric • Example: - fq={!frange l=0 u=2.2} sum(user_ranking,editor_ranking) 20
  • 21. switch query parser • acts like a "switch/case" statement • Example: - fq={!switch case.all='*:*' case.yes='inStock:true' case.no='inStock:false' v=$in_stock} - &in_stock=yes • Solr 4.2+ 21
  • 22. PostFilter • Queries implementing the PostFilter interface are consulted after the query and all other filters have narrowed the documents under consideration • Queries supporting PostFilter - frange, geofilt, bbox • Enabled by setting cache=false and cost >= 100 - Example: • fq={!frange l=5 cache=false cost=200}div(log(popularity),sqrt(geodist())) • More info: - Advanced filter caching • http://searchhub.org/2012/02/10/advanced-filter-caching-in-solr/ - Custom security filtering • http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/ 22
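Why post filtering helps can be shown with a toy model: the expensive check only sees documents that survived the query and the regular filters. This is a simplified simulation with made-up names and data, not Solr's actual collector-based mechanism:

```python
def run_search(docs, query, cached_filters, post_filters):
    """Apply the query and regular filters first, then post filters by cost."""
    candidates = [d for d in docs if query(d) and all(f(d) for f in cached_filters)]
    # post_filters is a list of (cost, check) pairs; lower cost runs first.
    for _cost, check in sorted(post_filters, key=lambda pf: pf[0]):
        candidates = [d for d in candidates if check(d)]
    return candidates


calls = []  # records which documents reach the expensive check


def expensive_check(doc):
    calls.append(doc["id"])
    return doc["popularity"] >= 5


# Hypothetical documents.
docs = [
    {"id": 1, "text": "solr", "in_stock": True, "popularity": 9},
    {"id": 2, "text": "solr", "in_stock": False, "popularity": 9},
    {"id": 3, "text": "lucene", "in_stock": True, "popularity": 9},
    {"id": 4, "text": "solr", "in_stock": True, "popularity": 1},
]

hits = run_search(
    docs,
    query=lambda d: d["text"] == "solr",
    cached_filters=[lambda d: d["in_stock"]],
    post_filters=[(200, expensive_check)],
)
```

Here the expensive check runs for two of the four documents instead of all of them.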
  • 23. Phonetic, Stem, Synonym • Users tend to expect loose matching - but with "more exact" matches ranked higher • Various mechanisms for loosening matching: - Phonetic sounds-like: cat/kat, similar/similer - Stemming: search/searches/searched/searching - Synonyms: cat/feline, dog/canine • Distinguish ranking between exact and looser matching: - copyField original to a new (unstored, yet indexed) field with desired looser matching analysis - query across original field and looser field, with higher boosting for original field • /select?q=ApatchyCon&defType=dismax&qf=name^5 name_phonetic 23
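A sketch of building that request from a client, assuming the copyField into `name_phonetic` is already configured in the schema. `loose_match_url` is a hypothetical helper; the field names and boost come from the example above:

```python
from urllib.parse import urlencode


def loose_match_url(user_query, exact_field="name",
                    loose_field="name_phonetic", boost=5):
    """Query the exact field with a boost plus the looser field without one."""
    params = {
        "q": user_query,
        "defType": "dismax",
        # Exact matches score higher; phonetic matches still qualify.
        "qf": f"{exact_field}^{boost} {loose_field}",
    }
    return "/select?" + urlencode(params)


url = loose_match_url("ApatchyCon")
```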
  • 24. Suggest things, not strings • Model It As You Need It - Leverage Lucene's Document/Field/Query/score & sort & highlight • Example 1: Selling automobile parts - Exact year/make/model is needed to pick the right parts - Suggest a vehicle as user types • from the main parts index: tricky, requires lots of special fields and analysis tricks and even then you're suggesting fields from "parts" • Another (better?) approach: model vehicles as a separate core, "search" when suggesting, return documents, not field terms • Example 2: Technical Conferences - /select?q=Con&wt=csv&fl=name • Lucene EuroCon • ApacheCon 24
  • 25. Query parsing and relevancy • The query is the formula that determines each document's score • Tuning is about what your application needs - Build tests using your corpus and real-world queries and ranking expectations - Re-run tests frequently/continuously as query parameters are tweaked • Tooling, currently, is mostly in-house custom - but that's changing, stay tuned! 25
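One possible shape for such ranking tests, as a sketch: each expectation maps a real-world query to the documents that should come back first. `ranking_failures` and `fake_search` are made-up names; in practice `search` would call Solr with the query parameters under test:

```python
def ranking_failures(search, expectations):
    """Return (query, expected, actual) for every query whose top hits differ."""
    failures = []
    for query, expected_top in expectations.items():
        actual_top = search(query)[: len(expected_top)]
        if actual_top != expected_top:
            failures.append((query, expected_top, actual_top))
    return failures


def fake_search(query):
    """Stand-in for a real Solr call; returns document names in ranked order."""
    canned = {
        "apachecon": ["ApacheCon", "Lucene EuroCon"],
        "eurocon": ["ApacheCon", "Lucene EuroCon"],
    }
    return canned.get(query, [])


expectations = {
    "apachecon": ["ApacheCon"],
    "eurocon": ["Lucene EuroCon"],
}

failures = ranking_failures(fake_search, expectations)
```

Re-running a check like this after every parameter tweak shows which expectations a change broke.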
  • 26. Development/troubleshooting • Analysis - /analysis/field • ?analysis.fieldname=name • &analysis.fieldvalue=NA ApacheCon 2013 • &q=apachecon • &analysis.showmatch=true - Also /analysis/document - admin UI analysis tool • Query Parsing - &debug=query • Relevancy - &debug=results • shows scoring explanations 26
  • 27. Future of Solr query parsing • JSON query parser • XML query parser • PayloadTermQuery parser 27
  • 28. JSON query parser • https://issues.apache.org/jira/browse/SOLR-4351 • Current patch enables these: - {'term':{'id':'13'}} - {'field':{'text':'ApacheCon'}} - {'frange':{'v':'mul(rating,2)', 'l':20,'u':24}} - {'join':{'from':'book_id', 'to':'id', 'v':{'term': {'text':'search'}}}} 28
  • 29. XML query parser • Will allow a rich query "tree" • Parameters will fill in variables in a server-side XSLT query tree definition, or can provide full query tree • Useful for "advanced", multi-valued query input • https://issues.apache.org/jira/browse/SOLR-839 29
  • 30. Payload term query parser • Solr supports indexing payload data on terms using DelimitedPayloadTokenFilter, but there is currently no support for querying with payloads • Requires custom Similarity implementation to provide score factor for payload data • Allows index-time weighting of terms - e.g. <b>bold words</b> weighted higher • https://issues.apache.org/jira/browse/SOLR-1485 30
  • 31. BlockJoinQuery • https://issues.apache.org/jira/browse/SOLR-3076 • Lucene provides a way to index a hierarchical "block" of documents and query it using ToParentBlockJoinQuery and ToChildBlockJoinQuery - Indexing a block is not yet supported by Solr • Example use case: Which books longer than 100 pages have paragraphs containing "information retrieval"? 31