Introduction to Apache Solr
Andrew Jackson
UK Web Archive Technical Lead
www.bl.uk 2
Web Archive Overall Architecture
Understanding Your Use Case(s)
• Full text search, right?
– Yes, but there are many variations and choices to make.
• Work with users to understand their information needs:
– Are they looking for…
• Particular (archived) web resources?
• Resources on a particular issue or subject?
• Evidence of trends over time?
– What aspects of the content do they consider important?
– What kind of outputs do they want?
Working With Historians…
• JISC AADDA Project:
– Initial index and UI of the 1996-2010 data
– Great learning experience and feedback
– http://domaindarkarchive.blogspot.co.uk/
• AHRC ‘Big Data’ Project:
– Second iteration of index and UI
– Bursary holders’ reports coming soon
– http://buddah.projects.history.ac.uk/
• Interested in trends and reflections of society
– Who links to who/what, over time?
Apache Solr & Lucene
• Apache Lucene:
– A Java library for full text indexes
• Apache Solr:
– A web service and API that exposes Lucene functionality as a
document database
– Supports SolrCloud mode for distributed searches
• See also:
– Elasticsearch (also built around Lucene)
– We ‘chose’ Solr before Elasticsearch existed
– http://solr-vs-elasticsearch.com/
Example: Indexing Quotes
• Quotes to be indexed:
– “To do is to be.” - Jean-Paul Sartre
– “To be is to do.” - Socrates
– “Do be do be do.” - Frank Sinatra
• Goals:
– Index the quotation for full-text search.
• e.g. Show me all quotes that contain “to be”.
– Index the author for faceted search.
• e.g. Show me all quotes by “Frank Sinatra”.
Lucene’s Inverted Indexes
Solr as a Document Database
• Solr indexes, stores and retrieves:
– Documents, composed of:
• Multiple Fields, each of which has a defined:
– Field Type, such as ‘text’, ‘string’, ‘int’, etc.
• The queries you can support depend on many
parameters, but the fields and their types are the most
critical factors.
– See Overview of Documents, Fields, and Schema Design
The Quotes As Solr Documents
• Our Documents contain three fields:
– ‘id’ field of type ‘string’
– ‘text’ field of type ‘text_general’
– ‘author’ field, of type ‘string’
• Example Documents:
– id: “1”, text: “To do is to be.”, author: “Jean-Paul Sartre”
– id: “2”, text: “To be is to do.”, author: “Socrates”
– id: “3”, text: “Do be do be do.”, author: “Frank Sinatra”
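As a rough sketch, the three example documents can be written as plain Python dictionaries and serialised into the JSON array that Solr's update handler accepts. The core name `quotes`, the localhost URL and the helper `build_update_request` are assumptions made for illustration; nothing is sent over the network here.

```python
import json

# The three example documents, one dict per Solr document.
quotes = [
    {"id": "1", "text": "To do is to be.", "author": "Jean-Paul Sartre"},
    {"id": "2", "text": "To be is to do.", "author": "Socrates"},
    {"id": "3", "text": "Do be do be do.", "author": "Frank Sinatra"},
]


def build_update_request(docs, base_url="http://localhost:8983/solr/quotes"):
    """Return (url, headers, body) for a JSON update request.

    base_url is a hypothetical local core; adjust it for a real install.
    """
    url = base_url + "/update?commit=true"
    headers = {"Content-Type": "application/json"}
    body = json.dumps(docs).encode("utf-8")
    return url, headers, body


url, headers, body = build_update_request(quotes)
print(url)
print(body.decode("utf-8"))
```

The returned triple could then be passed to any HTTP client as a POST request.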
Solr Update Flow
Analyzing The Text Field
• Analyzing the text on document 1:
– Input: “To do is to be.”, type = ‘text_general’
– Standard Tokeniser:
• ‘To’ ‘do’ ‘is’ ‘to’ ‘be’
– Lower Case Filter:
• ‘to’ ‘do’ ‘is’ ‘to’ ‘be’
• Adding the tokens to the index:
– ‘be’ => id:1
– ‘do’ => id:1
– …
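The analysis and indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's implementation: the regular expression is only a stand-in for the Standard Tokeniser, and the names `analyse` and `build_index` are invented for this example.

```python
import re
from collections import defaultdict


def analyse(text):
    """Split text into word tokens, then lower-case them."""
    tokens = re.findall(r"\w+", text)      # stand-in for the Standard Tokeniser
    return [t.lower() for t in tokens]     # Lower Case Filter


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyse(text):
            index[token].add(doc_id)
    return index


docs = {1: "To do is to be.", 2: "To be is to do.", 3: "Do be do be do."}
index = build_index(docs)
print(analyse(docs[1]))
print(sorted(index["to"]))
```

Looking up a term is then a single dictionary access, which is what makes an inverted index fast for full-text search.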
Analyzing The Author Field
• Analyzing the author on document 1:
– Input: “Jean-Paul Sartre”, type = ‘string’
– Strings are stored as is.
• Adding the tokens to the index:
– ‘Jean-Paul Sartre’ => id:1
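By contrast, a ‘string’ field contributes one term per value, which is what makes exact matching and facet counts straightforward. A small sketch under the same assumptions as before (illustrative helper names, not Solr's internals):

```python
from collections import Counter, defaultdict

authors = {1: "Jean-Paul Sartre", 2: "Socrates", 3: "Frank Sinatra"}


def build_string_index(values):
    """Index each value whole: no tokenising, no lower-casing."""
    index = defaultdict(set)
    for doc_id, value in values.items():
        index[value].add(doc_id)
    return index


def facet_counts(values, matching_ids):
    """Count how many of the matching documents carry each value."""
    return Counter(values[doc_id] for doc_id in matching_ids)


author_index = build_string_index(authors)
print(sorted(author_index["Frank Sinatra"]))
print(facet_counts(authors, [1, 2]))
```

Because the value is stored as is, a lookup for ‘Sartre’ or ‘frank sinatra’ finds nothing; only the exact string matches.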
Solr Query Flow
Query for text:“To be”
• Uses the same analyser as the indexer:
– “To be?”
– ST: “To” “be”
– LCF: “to” “be”
• Returns documents:
– 1
– 2
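A phrase query needs token positions as well as document ids. The sketch below assumes a simple positional index and a regex tokeniser standing in for the real analyser; `build_positional_index` and `phrase_query` are illustrative names, not Solr or Lucene APIs.

```python
import re
from collections import defaultdict


def analyse(text):
    """Tokenise and lower-case, as at index time."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def build_positional_index(docs):
    """Map token -> doc id -> list of positions of that token."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(analyse(text)):
            index[token][doc_id].append(pos)
    return index


def phrase_query(index, phrase):
    """Return ids of documents where the analysed terms occur consecutively."""
    terms = analyse(phrase)
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can match.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits


docs = {1: "To do is to be.", 2: "To be is to do.", 3: "Do be do be do."}
index = build_positional_index(docs)
print(phrase_query(index, "To be"))
```

Document 3 contains ‘be’ but never ‘to’, so it is excluded before positions are even checked.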
Solr’s Built-in UI
Solr Overall Flow
Choice: Ignore ‘stop words’?
• Removes common words, unrelated to subject/topic
– Input: “To do is to be”
– Standard Tokeniser:
• ‘To’ ‘do’ ‘is’ ‘to’ ‘be’
– Stop Words Filter (stopwords_en.txt):
• ‘do’
– Lower Case Filter:
• ‘do’
• Cannot support phrase search
– e.g. searching for “to be”
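The effect is easy to reproduce in a sketch. The stop list below is a short illustrative set chosen for this example and is not claimed to match the contents of stopwords_en.txt; the filter is applied ignoring case.

```python
import re

# Illustrative stop list (an assumption, not the actual stopwords_en.txt).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def analyse(text, stop_words=STOP_WORDS):
    """Tokenise, drop stop words (ignoring case), then lower-case."""
    tokens = re.findall(r"\w+", text)
    kept = [t for t in tokens if t.lower() not in stop_words]
    return [t.lower() for t in kept]


print(analyse("To do is to be"))
print(analyse("to be"))
```

With this list the query “to be” analyses to no terms at all, so there is nothing left to match as a phrase.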
Choice: Stemming?
• Attempts to group concepts together:
– "fishing", "fished", "fisher" => "fish"
– "argue", "argued", "argues", "arguing", "argus" => "argu"
• Sometimes confused:
– "axes" => "axe", or "axis"?
• Better at grouping related items together
• Makes precise phrase searching difficult
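To show the idea only, here is a deliberately naive suffix stripper. It is not the Porter algorithm or any stemmer shipped with Solr; the suffix list and the minimum stem length are assumptions picked so that the ‘fish’ example works.

```python
# Suffixes are tried in this order; the first one that applies is removed.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word, min_stem=3):
    """Strip one known suffix, provided at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word


for w in ["fishing", "fished", "fisher", "axes", "finger"]:
    print(w, "=>", naive_stem(w))
```

The last word illustrates the cost: a rule that helps ‘fisher’ also turns ‘finger’ into ‘fing’, which is why the choice of stemming algorithm matters.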
So Many Choices…
• Lots of text indexing options to tune:
– Punctuation and tokenization:
• is www.google.com one or three tokens?
– Stop word filter (“the” => “”)
– Lower case filter (“This” => “this”)
– Stemming (choice of algorithms too)
– Keywords (excepted from stemming)
– Synonyms (“TV” => “Television”)
– Possessive Filter (“Blair’s” => “Blair”)
– …and many more Tokenizers and Filters.
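The tokenisation question can be made concrete with two toy tokenisers. Neither is claimed to reproduce any particular Solr tokenizer; they only show how the choice changes what ends up in the index.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays inside tokens."""
    return text.split()


def alphanumeric_tokens(text):
    """Split on anything that is not an ASCII letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


print(whitespace_tokens("www.google.com"))
print(alphanumeric_tokens("www.google.com"))
```

With the first tokeniser a search for ‘google’ would not match the hostname; with the second it would, but a search for the exact hostname becomes a three-term phrase query.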
Even More Choices: Query Features
• As well as full-text search variations, we have
– Query parsers and features:
• Proximity, wildcards, term frequencies, relevance…
– Faceted search
– Numeric or Date values and range queries
– Geographic data and spatial search
– Snippets/fragments and highlighting
– Spell checking i.e. ‘Did you mean …?’
– MoreLikeThis
– Clustering
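Several of these features are switched on through request parameters. The sketch below builds a select URL that combines a phrase query, a facet on the author field, highlighting and an optional filter query. The base URL and core name are assumptions, the field names come from the quotes example, and the function `build_select_url` is invented for illustration; no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, phrase, author=None):
    """Build a Solr select URL for a phrase query with facets and highlights."""
    params = [
        ("q", f'text:"{phrase}"'),      # phrase query on the text field
        ("facet", "true"),              # enable faceting
        ("facet.field", "author"),      # count matches per author
        ("hl", "true"),                 # enable highlighting
        ("hl.fl", "text"),              # highlight snippets from the text field
        ("wt", "json"),                 # ask for a JSON response
    ]
    if author:
        params.append(("fq", f'author:"{author}"'))  # filter by exact author
    return base_url + "/select?" + urlencode(params)


url = build_select_url("http://localhost:8983/solr/quotes", "to be",
                       author="Frank Sinatra")
print(url)
```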
How to get started?
• Experimenting with the UKWA stack:
– Indexing:
• webarchive-discovery
– User Interfaces:
• Drupal Sarnia
• Shine (Play Framework, by UKWA)
• See https://github.com/ukwa/webarchive-discovery/wiki/Front-ends
The webarchive-discovery system
• The webarchive-discovery codebase is an indexing stack
that reflects our (UKWA) use cases
– Contains our choices, reflects our progress so far
– Turns ARC or WARC records into Solr Documents
– Highly robust against (W)ARC data quality problems
• Adds custom fields for web archiving
– Text extracted using Apache Tika
– Various other analysis features
• Workshop sessions will use our setup
– but this is only a starting point…
Features: Basic Metadata Fields
• From the file system:
– The source (W)ARC filename and offset
• From the WARC record:
– URL, host, domain, public suffix
– Crawl date(s)
• From the HTTP headers:
– Content length
– Content type (as served)
– Server software IDs
Features: Payload Analysis
• Binary hash, embedded metadata
• Format and preservation risk analysis:
– Apache Tika & DROID format and encoding ID
– Notes parse errors to spot access problems
– Apache Preflight PDF risk analysis
– XML root namespace
– Format signature generation tricks
• HTML links, elements used, licence/rights URL
• Image properties, dominant colours, face detection
Features: Text Analysis
• Text extraction from binary formats
• ‘Fuzzy’ hash (ssdeep) of text
– for similarity analysis
• Natural language detection
• UK postcode extraction and geo-indexing
• Experimental language analysis:
– Simplistic sentiment analysis
– Stanford NLP named entity extraction
– Initial GATE NLP analyser
Command-line Indexing Architecture
Hadoop Indexing Architecture
Scaling Solr
• We are operating outside Solr’s sweet spot:
– General recommendation is RAM = Index Size
– We have a 15TB index. That’s a lot of RAM.
• e.g. from this email
– “100 million documents [and 16-32GB] per node”
– “it's quite the fool's errand for average developers to try to
replicate the "heroic efforts" of the few.”
• So how to scale up?
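To put rough numbers on that gap: if RAM really had to equal index size, a 15TB index spread over nodes with 16-32GB each would need hundreds of machines. The calculation below assumes 1TB = 1000GB and that all of a node's RAM is available to cache the index, which is optimistic.

```python
import math


def nodes_needed(index_size_tb, ram_per_node_gb, gb_per_tb=1000):
    """Nodes required for total RAM to equal the index size."""
    return math.ceil(index_size_tb * gb_per_tb / ram_per_node_gb)


print(nodes_needed(15, 32))  # 32GB nodes
print(nodes_needed(15, 16))  # 16GB nodes
```

That works out at 469 nodes with 32GB each, or 938 nodes with 16GB each.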
Basic Index Performance Scaling
• One Query:
– Single-threaded binary search
– Seek-and-read speed is critical, not CPU
• Add RAID/SAN?
– More IOPS can support more concurrent queries
– BUT each query is no faster
• Want faster queries?
– Use SSD, and/or
– More RAM to cache more disk, and/or
– Split the data into more shards (on independent media)
Sharding & SolrCloud
• For > ~100 million documents, use shards
– More, smaller independent shards == faster search
• Shard generation:
– SolrCloud ‘Live’ shards
• We use Solr’s standard sharding
• Randomly distributes records
• Supports updates to records
– Manual sharding
• e.g. ‘static’ shards generated from files
• As used by the Danish web archive (see later today)
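The general shape of sharded search can be sketched as follows: each record is routed to one shard by hashing its id, every shard is searched independently, and the partial results are merged. The MD5-based routing here is only a stand-in and is not SolrCloud's actual router; all function names are illustrative.

```python
import hashlib
import re


def shard_for(doc_id, num_shards):
    """Pick a shard from a hash of the id (stand-in routing rule)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def distribute(docs, num_shards):
    """Place every document in exactly one shard."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards


def search(shards, term):
    """Query each shard separately, then merge the hits."""
    hits = []
    for shard in shards:
        for doc_id, text in shard.items():
            if term in re.findall(r"\w+", text.lower()):
                hits.append(doc_id)
    return sorted(hits)


docs = {"1": "To do is to be.", "2": "To be is to do.", "3": "Do be do be do."}
shards = distribute(docs, num_shards=2)
print(search(shards, "to"))
```

In a real deployment the per-shard searches run in parallel on independent media, which is where the speed-up comes from.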
Next Steps
• Prototype, Prototype, Prototype
– Expect to re-index
– Expect to iterate your front and back end systems
– Seek real user feedback
• Benchmark, Benchmark, Benchmark
– More on scaling issues and benchmarking this afternoon
• Work Together
– Share use cases, indexing tactics
– Share system specs, benchmarks
– Share code where appropriate
