Boosting Documents in Solr by Recency, Popularity, and User Preferences
Timothy Potter [email_address], May 25, 2011

What I Will Cover
  • Recency boost
  • Popularity boost
  • Filtering based on user preferences

My Background
  • Timothy Potter
  • Large-scale distributed systems engineer specializing in Web and enterprise search, machine learning, and big data analytics
  • 5 years Lucene
      • Search solution for learning management systems
  • 2+ years Solr
      • Mobile app for magazine content
      • Solr + Mahout + Hadoop
      • FAST to Solr migration for a real estate portal
  • VinWiki: wine search and recommendation engine

Boost documents by age
  • Just do a descending sort by date = done?
  • Boost more recent documents and penalize older documents just for being old
  • Useful for news, business docs, and local search

Solr: Indexing
In schema.xml:
  <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
  <field name="pubdate" type="tdate" indexed="true" stored="true" required="true" />
Round the publish date before indexing (DateUtils from Commons Lang):
  Date published = DateUtils.round(item.getPublishedOnDate(), Calendar.HOUR);

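The rounding step can also be done in whatever language feeds the indexer. The following is a minimal sketch in Python, assuming the publish date is available as a timezone-aware datetime; the helper names `round_to_hour` and `to_solr_date` are hypothetical, and rounding halves upward is a choice made for this sketch.

```python
from datetime import datetime, timedelta, timezone


def round_to_hour(published: datetime) -> datetime:
    """Round a timezone-aware datetime to the nearest hour (halves round up)."""
    # Shift by half an hour, then drop everything below the hour.
    shifted = published.astimezone(timezone.utc) + timedelta(minutes=30)
    return shifted.replace(minute=0, second=0, microsecond=0)


def to_solr_date(value: datetime) -> str:
    """Format a datetime as the UTC ISO-8601 string used by Solr date fields."""
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical publish timestamp, rounded before it is sent to the pubdate field.
published = datetime(2011, 5, 25, 14, 42, 17, tzinfo=timezone.utc)
print(to_solr_date(round_to_hour(published)))
```

Rounding at index time keeps the number of distinct pubdate values small, which pairs with the NOW/HOUR rounding used at query time.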
FunctionQuery Basics
  • FunctionQuery: computes a value for each document
      • Ranking
      • Sorting
  • Functions: constant, literal, fieldvalue, ord, rord, sum, sub, product, pow, abs, log, sqrt, map, scale, query, linear, recip, max, min, ms, sqedist (squared Euclidean distance), hsin and ghhsin (haversine formula), geohash (convert to geohash), strdist

Solr: Query Time Boost
  • Use the recip function with the ms function:
      q={!boost b=$recency v=$qq}&recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)&qq=wine
  • Use edismax rather than dismax if possible:
      q=wine&boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)
  • recip is a highly tunable function: recip(x,m,a,b) implements a / (m*x + b)
      m = 3.16e-11, a = 0.08, b = 0.05, x = document age in milliseconds

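As an illustration of how a client might assemble the two request forms above, here is a small Python sketch. The base URL is a placeholder, the function names are hypothetical, and `defType=edismax` is added on the assumption that the request handler is not already configured for edismax.

```python
from urllib.parse import urlencode

# Boost function taken from the slide.
RECENCY = "recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)"

# Placeholder endpoint; substitute the real Solr select handler.
BASE_URL = "http://localhost:8983/solr/select"


def boost_parser_params(user_query: str) -> dict:
    """Parameters for the {!boost} form, using parameter dereferencing."""
    return {
        "q": "{!boost b=$recency v=$qq}",
        "recency": RECENCY,
        "qq": user_query,
    }


def edismax_params(user_query: str) -> dict:
    """Parameters for the edismax form with a multiplicative boost."""
    return {"defType": "edismax", "q": user_query, "boost": RECENCY}


# urlencode takes care of escaping braces, dollar signs, and commas.
print(BASE_URL + "?" + urlencode(boost_parser_params("wine")))
print(BASE_URL + "?" + urlencode(edismax_params("wine")))
```

Keeping the user's text in its own parameter (`qq`) means the boost expression never has to be spliced into the query string by hand.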
Tune Solr recip function
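One way to get a feel for the curve while tuning is to evaluate a / (m*x + b) offline for a few document ages. The sketch below uses the constants from the query-time boost slide; the sample ages and the 365-day year are arbitrary choices for illustration. With these constants, the boost is 1.6 for a brand-new document and roughly 0.076 at one year.

```python
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000  # about 3.15e10 milliseconds


def recip(x: float, m: float, a: float, b: float) -> float:
    """Same formula as Solr's recip(x,m,a,b): a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms: float, m: float = 3.16e-11, a: float = 0.08, b: float = 0.05) -> float:
    """Boost for a document of the given age, using the slide's constants by default."""
    return recip(age_ms, m, a, b)


# Print the multiplier for a handful of ages to see how fast it decays.
for years in (0, 0.25, 1, 2, 5):
    boost = recency_boost(years * MS_PER_YEAR)
    print(f"{years:>5} years old -> boost {boost:.3f}")
```

Changing a and b together moves the whole curve, while m controls how quickly age is converted into a penalty.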
Tips and Tricks
  • Boost should be a multiplier on the relevancy score
  • {!boost b=} syntax confuses the spell checker, so use spellcheck.q to be explicit:
      q={!boost b=$recency v=$qq}&spellcheck.q=wine
  • Bottom out the old-age penalty using max: max(recip(…), 0.20)
  • Not a one-size-fits-all solution; academic research focuses on when to apply it

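A floor on the boost is a max over the recip value and the floor (the speaker notes phrase it as max(boost,0.20)); a min would instead cap the boost for new documents. The Python sketch below shows the difference and works out the age at which the floor takes over, using the recip constants from the query-time boost slide. The function names are hypothetical.

```python
M, A, B = 3.16e-11, 0.08, 0.05  # recip constants from the slides
FLOOR = 0.20
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x: float, m: float = M, a: float = A, b: float = B) -> float:
    """a / (m*x + b), as in Solr's recip function."""
    return a / (m * x + b)


def floored_boost(age_ms: float, floor: float = FLOOR) -> float:
    """Recency boost that never drops below the floor."""
    return max(recip(age_ms), floor)


def crossover_age_ms(floor: float = FLOOR, m: float = M, a: float = A, b: float = B) -> float:
    """Age at which recip falls to the floor, from solving a / (m*x + b) = floor."""
    return (a / floor - b) / m


print("new document:", floored_boost(0))
print("ten years old:", floored_boost(10 * 365 * MS_PER_DAY))
print("floor applies after about", round(crossover_age_ms() / MS_PER_DAY), "days")
```

With these particular constants the floor is reached after about four months, so everything older than that is penalized equally.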
Boost by Popularity
  • Score based on number of unique views
  • Not known at indexing time
  • View count should be broken into time slots

Popularity Illustrated
Solr: ExternalFileField
In schema.xml:
  <fieldType name="externalPopularityScore" keyField="id" defVal="1" stored="false" indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
  <field name="popularity" type="externalPopularityScore" />

Popularity Boost: Nuts & Bolts
  • User activity on the Solr server is logged
  • A view counting job processes the logs
  • The job writes solr-home/data/external_popularity with one key=value entry per document: a=1.114, b=1.05, c=1.111, …
  • commit

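The file-writing step of the view counting job could look roughly like the Python sketch below. It assumes the scores arrive as a dict keyed by document id and that the job can write into the Solr data directory; the function name and the temp-file prefix are hypothetical. The temporary file deliberately does not start with `external_`, so a half-written file is not mistaken for the real one.

```python
import os
import tempfile


def write_external_file(scores: dict, data_dir: str, field: str = "popularity") -> str:
    """Write key=value lines to external_<field>, swapping the file in atomically."""
    target = os.path.join(data_dir, f"external_{field}")
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=data_dir, prefix=f"tmp_{field}_")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for doc_id, score in sorted(scores.items()):
            handle.write(f"{doc_id}={score}\n")
    os.replace(tmp_path, target)
    return target


# Demo against a throwaway directory instead of a real solr-home/data.
with tempfile.TemporaryDirectory() as demo_dir:
    path = write_external_file({"a": 1.114, "b": 1.05, "c": 1.111}, demo_dir)
    with open(path, encoding="utf-8") as handle:
        print(handle.read(), end="")
```

After the file is replaced, a commit makes Solr pick up the new values, as the slide's final step indicates.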
Popularity Tips & Tricks
  • For big, high-traffic sites, use log analysis
      • Perfect problem for MapReduce
      • Take a look at Hive for analyzing large volumes of log data
  • Minimum popularity score is 1 (not zero) … up to 2 or more:
      1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth …)
  • Watch out for spell checker "buildOnCommit"

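The weighted formula can be sketched in Python as follows. Only the three weights named on the slide are used; the slide elides any further terms, so none are invented here. How each slot's view count is normalized is not stated in the deck, so dividing by the highest count seen in that slot is purely an assumption of this sketch, as are the function and variable names.

```python
# Weights named on the slide; any further slots are elided there and omitted here.
WEIGHTS = {"recent": 0.4, "lastWeek": 0.3, "lastMonth": 0.2}


def popularity_score(views: dict, max_views: dict) -> float:
    """Return 1 + weighted sum of per-slot view shares.

    views: slot name -> unique views for one document.
    max_views: slot name -> highest unique view count of any document (assumed normalizer).
    """
    total = 0.0
    for slot, weight in WEIGHTS.items():
        peak = max_views.get(slot, 0)
        share = views.get(slot, 0) / peak if peak else 0.0
        total += weight * share
    return 1.0 + total


# Hypothetical counts for two documents.
peaks = {"recent": 200, "lastWeek": 500, "lastMonth": 900}
print(popularity_score({"recent": 200, "lastWeek": 100, "lastMonth": 90}, peaks))
print(popularity_score({}, peaks))  # never viewed: score stays at 1
```

Starting at 1 rather than 0 matters because the score is used as a multiplier: an unviewed document keeps its relevancy score instead of being zeroed out.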
Filtering By User Preferences
  • Easy approach is to build basic preference fields into the index:
      • Content types of interest: content_type
      • High-level categories of interest: category
      • Source of interest: source
  • We had too many categories and sources that a user could enable / disable to use basic filtering
  • Custom SearchComponent with a connection to a JDBC DataSource

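For the easy approach, a client could turn a user's disabled values into filter queries on the three index fields. This Python sketch is one possible shape, not the speaker's implementation; the function names and the sample values are hypothetical, and it expresses the preferences as exclusions.

```python
PREFERENCE_FIELDS = ("content_type", "category", "source")


def quote(term: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def exclusion_filters(disabled: dict) -> list:
    """Build one negative fq string per field that has disabled values."""
    filters = []
    for field in PREFERENCE_FIELDS:
        values = sorted(disabled.get(field, ()))
        if values:
            joined = " OR ".join(quote(value) for value in values)
            filters.append(f"-{field}:({joined})")
    return filters


# Hypothetical preferences for one user.
prefs = {"category": ["sports", "celebrity gossip"], "source": ["feed-17"]}
for fq in exclusion_filters(prefs):
    print("fq=" + fq)
```

This stays manageable with a handful of values per field; with hundreds of disabled categories and sources the filter strings become unwieldy, which is the limitation that motivates the custom component.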
Preferences Component
  • Connects to a database
  • Caches DocIdSet in a Solr FastLRUCache
  • Cached values marked as dirty using a simple timestamp passed in the request
  • Declared in solrconfig.xml:
      <searchComponent class="demo.solr.PreferencesComponent" name="pref">
        <str name="jdbcJndi">jdbc/solr</str>
      </searchComponent>

Preferences Filter
  • Parameters passed in the query string:
      • pref.id = primary key in db
      • pref.mod = preferences modified on timestamp, so the Solr side knows the database has been updated
  • Use simple SQL queries to compute a list of disabled categories, feeds, and types
  • Lucene FieldCaches for category, source, type
  • Custom SearchComponent included in the list of components for edismax search handler:
      <arr name="last-components">
        <str>pref</str>
      </arr>

Preferences Filter in Action
  • Diagram components: User, Preferences Db, Solr Server, LRU Cache, Preferences Component
  • User updates preferences in the Preferences Db
  • Query arrives with pref.id=123 and pref.mod=TS; pref.id and pref.mod are passed to the Preferences Component
  • If cached mod == pref.mod, read from cache
  • Otherwise, run SQL to compute excluded categories, sources, and types

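The cache check in this flow can be illustrated outside Solr. The Python sketch below is a stand-in for the idea only: it is not Solr's FastLRUCache or the speaker's component, the class and parameter names are hypothetical, and the loader callback takes the place of the SQL queries.

```python
from collections import OrderedDict


class PreferenceCache:
    """Small LRU cache keyed by pref.id; entries are stale when pref.mod differs."""

    def __init__(self, loader, max_entries: int = 1000):
        self._loader = loader            # callable: pref_id -> excluded values
        self._max_entries = max_entries
        self._entries = OrderedDict()    # pref_id -> (pref_mod, excluded values)

    def get(self, pref_id, pref_mod):
        cached = self._entries.get(pref_id)
        if cached is not None and cached[0] == pref_mod:
            self._entries.move_to_end(pref_id)   # mark as recently used
            return cached[1]
        # Missing or stale: recompute and remember the timestamp it belongs to.
        excluded = self._loader(pref_id)
        self._entries[pref_id] = (pref_mod, excluded)
        self._entries.move_to_end(pref_id)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)    # evict least recently used
        return excluded


# Stand-in loader that records how often the "database" is hit.
db_hits = []


def load_excluded(pref_id):
    db_hits.append(pref_id)
    return {"category": ["sports"], "source": [], "content_type": []}


cache = PreferenceCache(load_excluded, max_entries=2)
cache.get(123, "2011-05-25T10:00:00Z")   # miss: loads from the database
cache.get(123, "2011-05-25T10:00:00Z")   # hit: same pref.mod
cache.get(123, "2011-05-25T11:30:00Z")   # preferences changed: reloads
print("database lookups:", len(db_hits))
```

Passing the modification timestamp with every request means the search side never has to poll the database to find out whether its cached entry is still valid.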
Wrap Up
  • Use recip & ms functions to boost recent documents
  • Use ExternalFileField to load popularity scores calculated outside the index
  • Use a custom SearchComponent with a Solr FastLRUCache to filter documents using complex user preferences

Contact
  • Timothy Potter [email_address]
  • http://thelabdude.blogspot.com
  • http://www.linkedin.com/in/thelabdude

Editor's Notes

  • #2: Attendees will come away from this presentation with a good understanding of, and access to source code for, boosting and/or filtering documents by recency, popularity, and personal preferences. My solution improves upon the common "recip" based solution for boosting by document age. The framework also supports boosting documents by a popularity score, which is calculated and managed outside the index. I will present a few different ways to calculate popularity in a scalable manner. Lastly, my solution supports the concept of a personal document collection, where each user is only interested in a subset of the total number of documents in the index. My presentation will provide a good example of how to filter and/or boost results based on user preferences, which is a very common requirement of many Web applications.
  • #3: The one thing I'd like you to come away with today is confidence that Solr has powerful boosting capabilities built in, but they require some fine-tuning and experimentation. Some simple recipes for complementing core Solr functionality: I. Boost documents by age (recency / freshness boost). II. Boost documents by popularity. III. Filter results based on user preferences (personalized collection).
  • #4: Currently working at the National Renewable Energy Laboratory on building an infrastructure for storing and analyzing large volumes of smart-grid-related energy data using Hadoop technologies. I have been doing search work for the past 5 years, including a Lucene-based search solution for eLearning content, a Solr-based solution for online magazine content, and a FAST to Solr migration for a real estate portal. My other area of interest is Mahout; I've contributed a few bug fixes and several pages on the wiki, including working with Grant Ingersoll on benchmarking Mahout's distributed clustering algorithms in the Amazon cloud. Technical blog: http://thelabdude.blogspot.com/ Currently working on JSF2 components for Solr.
  • #5: All other things being equal, more recent documents are better. What's not covered is how to determine whether you should apply the boost. That's a more in-depth topic that is the focus of academic research, especially in relation to Web search: identification of recency-sensitive queries before ranking. Candidates for a recency boost: news and most magazine articles; business documents, perhaps with a less aggressive boost function. See: http://technicallypossible.wordpress.com/2011/03/13/identifying-queries-which-demand-recency-sensitive-results-in-web-search/
  • #6: Careful! TrieFields make it more efficient to do range searches on numeric fields indexed at full precision, but they don't actually do anything to round the fields for people who genuinely want their stored and indexed values to have only second/minute/hour/day precision regardless of what the initial raw data looks like. Currently, Solr doesn't have anything built in to round a date down to a different precision, such as minute / hour. Thus, you may need to do this yourself prior to indexing a document. See SOLR-741.
      // from commons DateUtils
      Date published = DateUtils.round(item.getPublishedOnDate(), Calendar.HOUR);
  • #8: In Solr 1.4+, the recommended approach is to use the recip function with the ms function. There are approximately 3.16e10 milliseconds in a year, so one can scale dates to fractions of a year with the inverse, or 3.16e-11:
      recip(ms(NOW/HOUR,pubdate),3.16e-11,1,1)
    For the standard query parser, you could do:
      q={!boost b=recip(ms(NOW/HOUR,pubdate),3.16e-11,1,1)}wine
    This uses the built-in boost function query, which relies on a Lucene FieldCache under the covers on the pubdate field (stored in the index as a long). ms(NOW/HOUR) uses a less precise measure of document age (the rounding clause), which helps reduce memory consumption.
    Lessons:
      1 - {!boost b=} syntax breaks spell-checking, so you need to use spellcheck.q to be explicit
      2 - Use edismax because it multiplies the boost, whereas dismax adds "bf"
      3 - Use a tdate field when indexing
      4 - Use ms(NOW/HOUR) and less precision when indexing
      5 - Use max(boost,0.20) to bottom out the age penalty
  • #9: A reciprocal function with recip(x,m,a,b) implementing a/(m*x+b). m, a, and b are constants; x is any numeric field or arbitrarily complex function. When a and b are equal and x >= 0, this function has a maximum value of 1 that drops as x increases. Increasing the value of a and b together results in a movement of the entire function to a flatter part of the curve. These properties can make this an ideal function for boosting more recent documents. See http://wiki.apache.org/solr/FunctionQuery
  • #10: Identification of recency-sensitive queries before ranking. See: http://technicallypossible.wordpress.com/2011/03/13/identifying-queries-which-demand-recency-sensitive-results-in-web-search/
  • #11: Score made of number of unique views in a time slot + avg rating / # of comments, etc. Must be computed outside of the index and refreshed periodically. You probably don't want to mix this with the age boost, as an older document might be really popular for some weird reason; think of old videos that become popular on YouTube. Identification of recency-sensitive queries before ranking; see: http://technicallypossible.wordpress.com/2011/03/13/identifying-queries-which-demand-recency-sensitive-results-in-web-search/
  • #12: The bar chart illustrates time slots. The popularity score favors more recent content. Document A is most popular; B was popular but is now on the decline; C has enjoyed consistent interest for a longer period but scores a little lower than A because of the recent interest in A.
  • #15: The most likely use case would be log-file analysis, which is an ideal problem for MapReduce. Question for the audience: who has heard of MapReduce?