Recent Additions to Lucene’s Arsenal
Shai Erera, Researcher, IBM
Adrien Grand, Elasticsearch
Who We Are
• Shai Erera
  – Working at IBM, Information Retrieval Research
  – Lucene/Solr committer and PMC member
  – http://shaierera.blogspot.com
  – shaie@apache.org
• Adrien Grand
  – @jpountz
  – Lucene/Solr committer and PMC member
  – Software engineer at Elasticsearch
The Replicator
Use cases: load balancing, failover, index backup
[Diagram labels: Load Balancer, Primary, Replicator, Replication Client (x2), Backup (x2)]
http://shaierera.blogspot.com/2013/05/the-replicator.html
Replication Components
• Replicator
  – Mediates between the client and server
  – Manages the published Revisions
  – Implementation for replication over HTTP
• Revision
  – Describes a list of files and metadata
  – Responsible for ensuring the files remain available as long as clients are replicating it
• ReplicationClient
  – Performs the replication operation on the replica side
  – Copies delta files and invokes ReplicationHandler upon successful copy
  – Always replicates the latest revision
• ReplicationHandler
  – Acts on the copied files
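The way these components cooperate can be sketched in plain Java without Lucene. Everything below (`ToyRevision`, `ToyReplicationClient`, `filesToCopy`, the file names) is hypothetical and only illustrates the idea of copying the delta files of the latest revision; it is not the API of the replicator module.

```java
import java.util.*;

// Toy model, not the Lucene API: a revision is a version plus the files it references.
final class ToyRevision {
    final int version;
    final Set<String> files;

    ToyRevision(int version, String... files) {
        this.version = version;
        this.files = new TreeSet<>(Arrays.asList(files));
    }
}

final class ToyReplicationClient {
    // Files already present on the replica side.
    private final Set<String> localFiles = new TreeSet<>();
    private int localVersion = -1;

    // Returns the delta: files of the latest revision that the replica lacks.
    List<String> filesToCopy(ToyRevision latest) {
        List<String> delta = new ArrayList<>();
        if (latest.version <= localVersion) {
            return delta; // already up to date, nothing to copy
        }
        for (String f : latest.files) {
            if (!localFiles.contains(f)) {
                delta.add(f);
            }
        }
        return delta;
    }

    // Stands in for "copy succeeded and the handler applied the files".
    void update(ToyRevision latest) {
        if (latest.version <= localVersion) {
            return; // never go back to an older revision
        }
        localFiles.addAll(filesToCopy(latest));
        localFiles.retainAll(latest.files); // drop files the new revision no longer references
        localVersion = latest.version;
    }

    public static void main(String[] args) {
        ToyReplicationClient client = new ToyReplicationClient();
        client.update(new ToyRevision(1, "_0.cfs", "segments_1"));
        ToyRevision r2 = new ToyRevision(2, "_0.cfs", "_1.cfs", "segments_2");
        System.out.println(client.filesToCopy(r2)); // only the files not yet on the replica
    }
}
```

Because segment files are immutable, a file that was already copied for an earlier revision never needs to be copied again, which is what keeps the delta small.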
Index Replication
• IndexRevision
  – Obtains a snapshot on the last commit through SnapshotDeletionPolicy
  – The snapshot is released when the revision is released by the Replicator
• IndexReplicationHandler
  – Copies the files to the index directory and fsyncs them
  – Aborts (rollback) on any error
  – Upon successful completion, invokes a callback (e.g. SearcherManager.maybeRefresh())
• Similar extensions for faceted index replication
  – IndexAndTaxonomyRevision: obtains snapshots on both the search and taxonomy indexes
  – IndexAndTaxonomyReplicationHandler: copies the files to the respective directories, keeping both in sync
Sample Code
// Server-side: publish a new Revision
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(indexWriter));

// Client-side: replicate a Revision
Replicator replicator; // either LocalReplicator or HttpReplicator

// refresh SearcherManager after index is updated
Callable<Boolean> callback = new Callable<Boolean>() {
  @Override
  public Boolean call() throws Exception {
    // index was updated, refresh manager
    return searcherManager.maybeRefresh();
  }
};
ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);
ReplicationClient client = new ReplicationClient(replicator, handler, factory);

client.updateNow(); // invoke client manually
// -- OR --
// check for updates every 30 seconds; the second argument names the update thread
client.startUpdateThread(30000, "replication-client");
Future Work
• Resume
  – Session level: don’t copy files that were already successfully copied
  – File level: don’t copy file parts that were already successfully copied
• Parallel Replication
  – Copy revision files in parallel
• Other replication strategies
  – Peer-to-peer
Index Sorting
How to trade indexing speed for search speed
Anatomy of a Lucene index
Index = collection of immutable segments
Segments store documents sequentially on disk
Add data = create a new segment
Segments eventually get merged together
Order of segments / documents in segments doesn’t matter
– the following segments are equivalent

Unsorted segment:
Id      1   3  10   4   7  20  42  11   9   8  15  18  30  31  99   5  12
Price   9   0   7   8   2   2   1   8  10   3   4   4   6  10   1   1  13

Same documents sorted by descending price:
Id     12   9  31   1   4  11  10  30  15  18   8   7  20  42  99   5   3
Price  13  10  10   9   8   8   7   6   4   4   3   2   2   1   1   1   0
Anatomy of a Lucene index
ordinal of a doc in a segment = doc id
used in the inverted index to refer to docs

shoe → 1, 3, 5, 8, 11, 13, 15

doc id  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
Id      1   3  10   4   7  20  42  11   9   8  15  18  30  31  99   5  12
Price   9   0   7   8   2   2   1   8  10   3   4   4   6  10   1   1  13
Top hits
Get top N=2 results:
– Create a priority queue of size N
– Accumulate matching docs

Id      1   3  10   4   7  20  42  11   9   8  15  18  30  31  99   5  12
Price   9   0   7   8   2   2   1   8  10   3   4   4   6  10   1   1  13

Priority queue contents (Ids) as matching docs are accumulated:
() → (3) → (3,4) → (4,20) → (4,9) → (4,9) → (9,31) → (9,31)
– () : create an empty priority queue
– automatic overflow of the priority queue removes the least entry
Top hits: (9,31)
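The accumulation above can be reproduced with a bounded min-heap. This is a minimal sketch in plain Java, not Lucene's collector code; it assumes the query is the "shoe" postings list from the previous slide, and the class and method names are made up for illustration. Ties on price are broken in favour of the lower doc id, which is an assumption of this sketch.

```java
import java.util.*;

final class TopHits {
    // Segment from the slide: ID[doc] and PRICE[doc], indexed by doc id.
    static final int[] ID    = {1, 3, 10, 4, 7, 20, 42, 11, 9, 8, 15, 18, 30, 31, 99, 5, 12};
    static final int[] PRICE = {9, 0, 7, 8, 2, 2, 1, 8, 10, 3, 4, 4, 6, 10, 1, 1, 13};
    // Postings list for the term "shoe" (doc ids).
    static final int[] SHOE = {1, 3, 5, 8, 11, 13, 15};

    // Returns the Ids of the n highest-priced matches, best first.
    static List<Integer> topN(int[] postings, int n) {
        // Min-heap: the head is the weakest hit currently kept.
        // Lower price is weaker; on equal price the higher doc id is weaker.
        Comparator<Integer> weakestFirst = (a, b) -> PRICE[a] != PRICE[b]
                ? Integer.compare(PRICE[a], PRICE[b])
                : Integer.compare(b, a);
        PriorityQueue<Integer> pq = new PriorityQueue<>(weakestFirst);
        for (int doc : postings) {
            pq.add(doc);
            if (pq.size() > n) {
                pq.poll(); // overflow: drop the least entry
            }
        }
        // Draining yields weakest first, so build the result back to front.
        LinkedList<Integer> ids = new LinkedList<>();
        while (!pq.isEmpty()) {
            ids.addFirst(ID[pq.poll()]);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(topN(SHOE, 2)); // prints [9, 31]
    }
}
```

Every matching doc is visited once, so the cost grows with the number of matches even though only N hits are returned.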
Early termination
Let’s do the same on a sorted index

Id     12   9  31   1   4  11  10  30  15  18   8   7  20  42  99   5   3
Price  13  10  10   9   8   8   7   6   4   4   3   2   2   1   1   1   0

Priority queue contents (Ids) as matching docs are accumulated:
() → (9) → (9,31) → (9,31) → (9,31) → (9,31) → (9,31) → (9,31)
– the priority queue never changes once it holds (9,31)
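On a segment stored in rank order, collection can simply stop after N matches. The sketch below is plain Java rather than Lucene code; `EarlyTermination` and `collect` are hypothetical names, and the postings list is the "shoe" list from the earlier slide, remapped by hand to the doc ids of the sorted segment.

```java
import java.util.*;

final class EarlyTermination {
    // Same documents, stored by descending price (the sorted segment on the slide).
    static final int[] ID = {12, 9, 31, 1, 4, 11, 10, 30, 15, 18, 8, 7, 20, 42, 99, 5, 3};
    // Postings for "shoe", expressed in the doc ids of the sorted segment.
    static final int[] SHOE = {1, 2, 4, 9, 12, 15, 16};

    // Adds the Ids of the first n matches to out and returns how many postings were visited.
    static int collect(int[] postings, int n, List<Integer> out) {
        int visited = 0;
        for (int doc : postings) {
            visited++;
            out.add(ID[doc]);
            if (out.size() >= n) {
                break; // doc order equals rank order, so later docs cannot rank higher
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        List<Integer> hits = new ArrayList<>();
        int visited = collect(SHOE, 2, hits);
        System.out.println(hits + " after visiting " + visited + " of " + SHOE.length + " postings");
    }
}
```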
Early termination
Pros
– makes finding the top hits much faster
– file-system cache-friendly
Cons
– only works for static ranks
  – not if the sort order depends on the query
– requires the index to be sorted
– doesn’t work for tasks that require visiting every doc:
  – total number of matches
  – faceting
Static ranks
Not uncommon!
Graph-based ranks
– Google’s PageRank
Facebook social search / Unicorn
– https://www.facebook.com/publications/219621248185635
Many more...

Doesn’t need to be the exact sort order
– heuristics when score is only a function of the static rank
Offline sorting
A live index can’t be kept sorted
– would require inserting docs between existing docs!
– segments are immutable
Offline sorting to the rescue:
– index as usual
– sort into a new index
– search!
Pros/cons
– super fast to search, the whole index is fully sorted
– but only works for static content
Offline Sorting
// in / out: the Directory holding the unsorted index and the Directory to write to

// open a reader on the unsorted index and create a sorted (but slow) view
DirectoryReader reader = DirectoryReader.open(in);
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
AtomicReader sortedReader = SortingAtomicReader.wrap(
    SlowCompositeReaderWrapper.wrap(reader), sorter);

// copy the content of the sorted reader to the new dir
IndexWriter writer = new IndexWriter(out, iwConf);
writer.addIndexes(sortedReader);
writer.close();
reader.close();
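What a "sorted view" has to do can be illustrated without Lucene: compute a permutation of doc ids and rewrite every postings list through it. The following is a sketch under that assumption; `OfflineSortSketch` and its methods are hypothetical and do not mirror the internals of `SortingAtomicReader`.

```java
import java.util.*;

final class OfflineSortSketch {
    // Returns newToOld: newToOld[newDoc] is the old doc id stored at position newDoc.
    static int[] sortByDescendingPrice(int[] price) {
        Integer[] order = new Integer[price.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Arrays.sort on objects is stable, so docs with equal prices keep their original order.
        Arrays.sort(order, (a, b) -> Integer.compare(price[b], price[a]));
        int[] newToOld = new int[order.length];
        for (int i = 0; i < order.length; i++) {
            newToOld[i] = order[i];
        }
        return newToOld;
    }

    // Rewrites a postings list so that it refers to the new doc ids, in increasing order.
    static int[] remapPostings(int[] postings, int[] newToOld) {
        int[] oldToNew = new int[newToOld.length];
        for (int n = 0; n < newToOld.length; n++) {
            oldToNew[newToOld[n]] = n;
        }
        int[] remapped = new int[postings.length];
        for (int i = 0; i < postings.length; i++) {
            remapped[i] = oldToNew[postings[i]];
        }
        Arrays.sort(remapped); // postings must stay ordered by doc id
        return remapped;
    }

    public static void main(String[] args) {
        int[] price = {9, 0, 7, 8, 2, 2, 1, 8, 10, 3, 4, 4, 6, 10, 1, 1, 13};
        int[] shoe = {1, 3, 5, 8, 11, 13, 15};
        int[] newToOld = sortByDescendingPrice(price);
        System.out.println(Arrays.toString(remapPostings(shoe, newToOld)));
    }
}
```

The re-sort of each postings list is part of why the sorted view is slow to read from, and why it is only used to write a new index once.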
Online sorting?
Sort segments independently
– wouldn’t require inserting data into existing segments
– collection could still be early-terminated on a per-segment basis
But segments are immutable
– they must be sorted before they are written
Online sorting?
2 sources of segments
– flush
– merge
flushed segments can’t be sorted
– Lucene writes stored fields to disk on the fly
– could be buffered but this would require a lot of memory
merged segments can be sorted
– create a sorted view over the segments to merge
– pass this view to SegmentMerger instead of the original segments
not a bad trade-off
– flushed segments are usually small & fast to collect
Online sorting?

Merged segments can easily take 99+% of the size of the index
Flushed segments come from:
– NRT reopens
– RAM buffer size limit hit
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
Online Sorting
IndexWriterConfig iwConf = new IndexWriterConfig(...);
// original MergePolicy finds the segments to merge
MergePolicy origMP = iwConf.getMergePolicy();
// SortingMergePolicy wraps the segments with a sorted view
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter);
// setup IndexWriter to use SortingMergePolicy
iwConf.setMergePolicy(sortingMP);
IndexWriter writer = new IndexWriter(dir, iwConf);
// index as usual
Early termination
Collect top N matches
Offline sorting
– index sorted globally
– early terminate after N matches have been collected
– no priority queue needed!
Online sorting
– no early termination on flushed segments
– early termination on merged segments
  – if N matches have been collected
  – or if current match is less than the top of the PQ
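The per-segment rules for online sorting can be sketched in plain Java as follows. This is an illustration under simplifying assumptions, not Lucene's collector: each toy segment holds only the prices of its matching docs in doc-id order, `sortedDescending` stands in for "this segment was produced by a sorting merge", and all names are hypothetical.

```java
import java.util.*;

final class PerSegmentCollector {
    // A toy segment: prices of the matching docs in doc-id order, plus a sorted flag.
    static final class Segment {
        final boolean sortedDescending; // true for merged segments, false for flushed ones
        final int[] prices;

        Segment(boolean sortedDescending, int... prices) {
            this.sortedDescending = sortedDescending;
            this.prices = prices;
        }
    }

    // Returns the n highest prices across all segments, best first.
    // visited[0] is incremented for every doc that was actually examined.
    static List<Integer> topN(List<Segment> segments, int n, int[] visited) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(); // min-heap: head is the weakest kept hit
        for (Segment seg : segments) {
            int collected = 0;
            for (int price : seg.prices) {
                if (seg.sortedDescending && pq.size() == n && price <= pq.peek()) {
                    break; // nothing further in this sorted segment can enter the queue
                }
                visited[0]++;
                pq.add(price);
                if (pq.size() > n) {
                    pq.poll();
                }
                if (seg.sortedDescending && ++collected >= n) {
                    break; // n hits from a sorted segment are enough
                }
            }
        }
        List<Integer> result = new ArrayList<>(pq);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment(false, 2, 9, 4),          // small flushed segment, fully visited
                new Segment(true, 13, 10, 10, 8, 1)); // merged segment, terminated early
        int[] visited = new int[1];
        System.out.println(topN(segments, 2, visited) + ", docs visited: " + visited[0]);
    }
}
```

Unlike the offline case, a priority queue is still needed here because hits from different segments have to be compared with each other.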
Early Termination
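class MyCollector extends Collector {
  private final Sorter sorter;        // same Sorter that was given to SortingMergePolicy
  private final int maxDocsToCollect; // N
  private boolean readerIsSorted;
  private int collected;

  MyCollector(Sorter sorter, int maxDocsToCollect) {
    this.sorter = sorter;
    this.maxDocsToCollect = maxDocsToCollect;
  }

  @Override
  public void setScorer(Scorer scorer) {} // scores are not used

  @Override
  public boolean acceptsDocsOutOfOrder() { return false; }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter);
    collected = 0;
  }

  @Override
  public void collect(int doc) throws IOException {
    // curVal (sort value of the current doc) and pq (the priority queue of
    // hits) are placeholders that the slide leaves undefined
    if (readerIsSorted &&
        (collected >= maxDocsToCollect || curVal <= pq.top())) {
      // Special exception that tells IndexSearcher to terminate
      // collection of the current segment
      throw new CollectionTerminatedException();
    }
    // collect hit
    collected++;
  }
}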
Questions?
