Dynamic Sitemaps
Blacklight Virtual Summit
May 8, 2020
Charlie Morris
Lead Web Developer
Penn State University Libraries
Libraries Strategic Technologies
Discovery, Access and Web Services
Context on PSU Libraries
• Blacklight catalog (project name “BlackCat”) in Beta until Fall
• Vendor-provided search interface remains the primary catalog
product for the Libraries
• 7.5+ million records
• Solr 7.4, running in cloud mode, Blacklight 7+, Traject for ETL
• 100,000+ students across the commonwealth and around the world
Letting the bots in
• Initially disallowed all bots in robots.txt
• As part of a phased rollout ahead of the stable release, we invited
the bots in
November 5, 2019: removed the deny-all rule for robots,
prior to adding the sitemap
How do people find you?
• Probably through a search engine.
• Probably Google.
• This is not a revelation.
• Search engines like sitemaps, and they are especially critical for a
site made up entirely of dynamic links
A critical feature that is low-hanging fruit
• Let users find content in channels they trust and use on a daily basis
(not defending these search engines, more that they are the critical
path for users)
• Why not compete with Amazon? Could save patrons some money
and increase use of library resources
• This isn’t a new revelation, of course; it’s more like “low-hanging
fruit”
• Note: no sitemap option in core Blacklight
The challenge of sitemaps on a large
repository
• At most 50,000 URLs per sitemap file
• Solely dynamic links
Prior work
• Static sitemap generators
• https://github.com/jronallo/blacklight-sitemap
• https://github.com/kjvarga/sitemap_generator
• Operate by a scheduled task generating static files
A different approach: dynamic sitemaps
• Jack Reed of Stanford University Libraries and others created a
proof of concept (POC)
• Live query Solr for sitemap data
• Use a Rails controller to dictate what is displayed
• Use a Rails view to control the sitemap template
• Penn State University Libraries’ PR for the work:
• https://github.com/psu-libraries/psulib_blacklight/pull/511
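To make the controller/view split above concrete, here is a minimal plain-Ruby sketch of what the view step boils down to: turning the documents Solr returns into a sitemap <urlset>. This is not the code from the PR. The render_urlset helper, the /catalog/:id URL pattern, the timestamp field name, and the example host are all assumptions for illustration.

```ruby
require 'erb'

# Hypothetical helper: render a sitemap <urlset> from Solr documents.
# Each doc is assumed to be a Hash with 'id' and 'timestamp' keys.
def render_urlset(docs, base_url:)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
  docs.each do |doc|
    # Assumed record URL pattern: <base_url>/catalog/<id>
    loc = "#{base_url}/catalog/#{ERB::Util.url_encode(doc['id'].to_s)}"
    lines << '  <url>'
    lines << "    <loc>#{ERB::Util.html_escape(loc)}</loc>"
    lines << "    <lastmod>#{ERB::Util.html_escape(doc['timestamp'].to_s)}</lastmod>"
    lines << '  </url>'
  end
  lines << '</urlset>'
  lines.join("\n")
end

# Sample documents standing in for a live Solr response.
docs = [
  { 'id' => '12345', 'timestamp' => '2020-05-08T12:00:00Z' },
  { 'id' => 'a b', 'timestamp' => '2020-05-07T09:30:00Z' }
]
xml = render_urlset(docs, base_url: 'https://catalog.example.edu')
puts xml
```

In a Rails app the same loop would more likely live in an XML view template, with the controller supplying the documents.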
The Query Recipe
• Necessary piece: a unique base 16 (hexadecimal) encoded hash for
each record indexed in Solr (call it the “signature”)
• lucene as the query parser
• Query parameter for “the signature starts with…” (q)
• Return the id and timestamp fields (fl)
• Make sure Solr isn’t attempting to calculate facets (facet)
• Specify a large number (rows) to prevent paging
More on query parameters from the Solr RefGuide
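The recipe above could be assembled roughly like this. It is a sketch, not the production query: the sitemap_leaf_params helper, the timestamp field name, the rows value, and the assumption that signatures are stored as lowercase hexadecimal are all illustrative. Only the hashed_id_si field name comes from the slides.

```ruby
require 'uri'

# Hypothetical helper: Solr parameters for one sitemap leaf,
# i.e. "every document whose signature starts with <prefix>".
def sitemap_leaf_params(prefix, signature_field: 'hashed_id_si', rows: 10_000_000)
  # Assumes signatures are stored as lowercase hexadecimal strings.
  raise ArgumentError, 'prefix must be hexadecimal' unless prefix.match?(/\A[0-9a-f]+\z/)

  {
    'defType' => 'lucene',                    # lucene as the query parser
    'q' => "#{signature_field}:#{prefix}*",   # "the signature starts with..."
    'fl' => 'id,timestamp',                   # only the fields the sitemap needs
    'facet' => 'false',                       # skip facet calculation
    'rows' => rows.to_s                       # large enough to avoid paging
  }
end

params = sitemap_leaf_params('0a')
query_string = URI.encode_www_form(params)
puts query_string
```

Validating the prefix before interpolating it keeps arbitrary input from the URL out of the Solr query.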
Making the signature with Solr
<updateProcessor class="solr.processor.SignatureUpdateProcessorFactory" name="add_hash_id">
  <bool name="enabled">true</bool>
  <str name="signatureField">hashed_id_si</str>
  <bool name="overwriteDupes">false</bool>
  <str name="fields">id</str>
  <str name="signatureClass">solr.processor.Lookup3Signature</str>
</updateProcessor>
More on this from the Solr RefGuide
Add this to your update request processor chain (updateRequestProcessorChain in solrconfig.xml)
“Signature starts with” for “Dynamic leaves”
• Depending on the size of the index, create links to Solr queries for
every combination of hexadecimal values across X placeholders
• Example: 0 to F for one placeholder = 16 “leaves”
• GET /sitemap: shows a list of 16 links to leaves like /sitemap/0
• GET /sitemap/0: a sitemap with every document that has a signature that
starts with 0
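A minimal sketch of the leaf idea, in plain Ruby: enumerate every hexadecimal prefix for a given number of placeholders, then list them in a sitemap index. The helper names and the example host are hypothetical, and lowercase hex is assumed.

```ruby
HEX_DIGITS = '0123456789abcdef'.chars.freeze

# Every hexadecimal prefix of the given length, e.g. 1 placeholder => "0".."f".
def leaf_prefixes(placeholders)
  raise ArgumentError, 'placeholders must be positive' unless placeholders.positive?

  HEX_DIGITS.repeated_permutation(placeholders).map(&:join)
end

# Sitemap index (what GET /sitemap returns) linking to each leaf.
def render_sitemap_index(base_url, placeholders)
  entries = leaf_prefixes(placeholders).map do |prefix|
    "  <sitemap><loc>#{base_url}/sitemap/#{prefix}</loc></sitemap>"
  end
  ['<?xml version="1.0" encoding="UTF-8"?>',
   '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
   *entries,
   '</sitemapindex>'].join("\n")
end

index = render_sitemap_index('https://catalog.example.edu', 1)
puts index
```

Because the leaves are computed from the placeholder count alone, nothing has to be regenerated when records are added or removed; each leaf is resolved against Solr at request time.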
PSU Libraries Example: 4096 leaves
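4096 leaves corresponds to three hexadecimal placeholders (16 × 16 × 16). The sketch below estimates the average number of URLs per leaf for different placeholder counts, using the 7.5 million record figure from earlier in the talk. It assumes the hash spreads records evenly, which real data will only approximate, and the helper names are hypothetical.

```ruby
SITEMAP_URL_LIMIT = 50_000 # per-file limit in the sitemap protocol

# Number of leaves for a given number of hexadecimal placeholders.
def leaf_count(placeholders)
  16**placeholders
end

# Average URLs per leaf, assuming an even spread of signatures.
def average_leaf_size(record_count, placeholders)
  (record_count.to_f / leaf_count(placeholders)).ceil
end

records = 7_500_000
(1..3).each do |n|
  avg = average_leaf_size(records, n)
  status = avg <= SITEMAP_URL_LIMIT ? 'under' : 'over'
  puts "#{n} placeholder(s): #{leaf_count(n)} leaves, ~#{avg} URLs per leaf (#{status} the limit)"
end
```

More placeholders mean more, smaller leaves, which leaves headroom under the limit if the distribution is uneven or the index grows.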
Update robots.txt
Crawl-delay: 10
Sitemap: http://catalog.libraries.psu.edu/sitemap.xml
Early Returns
Slow growth…
But hey…
More on slow growth
Google has known about 7+ million documents since November, but
growth is about 10,000 items per month. At this rate it will take 62 years
for Google to finish up.
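The 62-year figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the 7.5 million record count from the context slide and a constant indexing rate; the helper name is hypothetical.

```ruby
# Years needed to index everything at a constant monthly rate.
def years_to_index(total_records, indexed_per_month)
  months = total_records.to_f / indexed_per_month
  months / 12
end

years = years_to_index(7_500_000, 10_000)
puts format('%.1f years', years)
```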
Light analysis
• About 20-50 visits a day
• 4,967 visits since launching in late November
• 4.4% of all traffic
• Screenshot below is daily visits over time from search engines via
Matomo Analytics (hey it used to be zero!)
Lessons learned
Google is mysterious
• Slow growth despite the fact that they know about all records
• Search of site:catalog.libraries.psu.edu still only shows a few
thousand records despite Google’s dashboard reporting over 30
thousand
Bing is problematic
• Bing needed to be throttled. It hit us hard enough to resemble a
denial-of-service (DoS) attack (thankful to have Sematext Performance
Monitoring to tattle on Bing)
• Used Bing webmaster tools to gain finer control over when the bot is
allowed to visit and how often
• Also set Crawl-delay to 10 in robots.txt (Google ignores this because
it’s smart enough not to DoS you)
• Not sure which of the above two factors solved the issue
Future
• Keep watching growth in Google Search Console
• Keep monitoring Matomo Analytics
• Talk with others about their experiences getting their repositories
indexed by Google and other search engines
Questions?
Incomplete gem: https://rubygems.org/gems/blacklight-sitemaps
Email cdm32@psu.edu
Twitter @cdmo
GitHub @cdmo
