Scaling Lucene
The advent of Elasticsearch
Stéphane Gamard
Scalability
Scalability is defined along 3 main axes:
• Index Size - The number of entries upon which we act
• QPS - Number of requests serviced per second
• Time to operation - Time taken to be operational
Lucene
• IR library - Purely focused on TF-IDF
• Bounded by native resources - Vertical scaling
• NRT (near real-time) Inverse Lookup - Segments
In a nutshell, Lucene does not scale. Why?
Lucene
Segments: the Lucene storage
just a “bunch of files”
Lucene Indexing
In a “document” perspective
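Documents:
{#hello, #world}
{#there, #is, #a, #brown, #fox}
{#the, … , #kitchen}
…
Dictionary (term -> entry):
#a T1
#is T2
…
#fox T45
…
Inverse Lookup (entry -> documents):
T1 {#1, #33}
T2 {#2, … , #87}
…
T45 {#2, …}
…
The Dictionary and the Inverse Lookup together form a Segment.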
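As a rough sketch of the Dictionary and Inverse Lookup idea, the following Python builds a toy segment from a few documents. The function names (`build_segment`, `lookup`) and the sample documents are made up for illustration; this is not Lucene's actual on-disk format or API.

```python
from collections import defaultdict

def build_segment(documents):
    """Build a toy segment: a dictionary (term -> entry id) and an
    inverse lookup (entry id -> list of document ids)."""
    dictionary = {}                      # term -> entry id (T1, T2, ... on the slide)
    inverse_lookup = defaultdict(list)   # entry id -> document ids, in increasing order
    for doc_id, terms in enumerate(documents, start=1):
        for term in sorted(set(terms)):  # each term is counted once per document
            entry_id = dictionary.setdefault(term, len(dictionary) + 1)
            inverse_lookup[entry_id].append(doc_id)
    return dictionary, dict(inverse_lookup)

def lookup(segment, term):
    """Return the ids of the documents containing `term`."""
    dictionary, inverse_lookup = segment
    entry_id = dictionary.get(term)
    if entry_id is None:
        return []
    return inverse_lookup[entry_id]

# Hypothetical sample documents, already split into terms.
docs = [
    ["hello", "world"],
    ["there", "is", "a", "brown", "fox"],
    ["the", "fox", "sleeps"],
]
segment = build_segment(docs)
print(lookup(segment, "fox"))    # [2, 3]
print(lookup(segment, "panda"))  # []
```

Both structures grow with the input: the dictionary with every new term, the inverse lookup with every new document.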
Lucene Indexing
Factors of growth
• Dictionary Size - NLP*
• New Inverse Entries
Dictionary (term -> entry):
#a T1
#is T2
…
#fox T45
…
Inverse Lookup (entry -> documents):
T1 {#1, #33}
T2 {#2, … , #87}
…
T45 {#2, …}
…
Both are stored in the Segment.
Lucene Indexing
In a storage perspective
Segment
IndexReader(s)
IndexWriter
Lucene Indexing
In a storage perspective
IndexReader(s)
IndexWriter
Lucene Index
Lucene Indexing
The wonderful world of merging segments
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
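To make the idea of merging concrete, here is a minimal sketch in which a segment is modelled as a plain mapping from term to document ids. The names `search` and `merge_segments` are hypothetical, and real Lucene merging is considerably more involved (it is driven by a merge policy and also handles deleted documents); the sketch only shows that a search has to visit every segment, and that a merge folds several segments into one without changing the results.

```python
def search(segments, term):
    """Look the term up in every segment and combine the hits."""
    hits = []
    for segment in segments:
        hits.extend(segment.get(term, []))
    return sorted(hits)

def merge_segments(*segments):
    """Fold several toy segments into a single one."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            merged.setdefault(term, set()).update(doc_ids)
    return {term: sorted(doc_ids) for term, doc_ids in merged.items()}

# Two small hypothetical segments, as if written by two successive flushes.
segment_a = {"fox": [1], "brown": [1]}
segment_b = {"fox": [2], "kitchen": [2]}

before = search([segment_a, segment_b], "fox")  # two segments to visit
merged = merge_segments(segment_a, segment_b)
after = search([merged], "fox")                 # one segment to visit
print(before, after)  # [1, 2] [1, 2]
```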
Lucene Wrap-up
A Lucene Index is:
• A collection of segments
• One or more IndexReaders
• A single IndexWriter
Lucene Wrap-up
A single Lucene Index scales to:
• Index Size - Available HDD/RAM for segments
• QPS - Number of IndexReader threads
• T-to-Op - Speed at which the IndexWriter can ingest (IOPS)
It can only scale vertically!!!
Elasticsearch
Also known as the commodity scaling of Lucene ;)
There is no magic…
It’s about partitioning,
Using an index of indexes as its index.
Elasticsearch
A shard is the magic sauce of web scale
Elasticsearch Index: Lucene | Lucene | Lucene | Lucene | Lucene
Elasticsearch
Document Indexing
Lucene Lucene Lucene Lucene Lucene
• Distributed
• Routing
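Routing can be sketched as hashing a routing key (by default the document id) and taking the result modulo the number of shards. In the sketch below, `zlib.crc32` is only a stand-in for a stable hash, and `pick_shard`, `index_document`, and the in-memory `shards` list are hypothetical names, not Elasticsearch APIs.

```python
import zlib

NUMBER_OF_SHARDS = 5  # five Lucene indexes, as on the slide

def pick_shard(routing_key, number_of_shards=NUMBER_OF_SHARDS):
    """Map a routing key to a shard number with a stable hash."""
    # crc32 is an assumption made for this sketch; any stable hash would do.
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards

# Each shard is modelled as a plain list standing in for one Lucene index.
shards = [[] for _ in range(NUMBER_OF_SHARDS)]

def index_document(doc_id, document):
    """Send the document to the single shard chosen by routing."""
    shard_number = pick_shard(doc_id)
    shards[shard_number].append((doc_id, document))
    return shard_number

for i in range(20):
    index_document(f"doc-{i}", {"body": f"document number {i}"})

print([len(shard) for shard in shards])  # documents held by each shard
```

Because the shard number depends on the number of shards, changing that number would send existing keys to different shards.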
Elasticsearch
Request
Lucene Lucene Lucene Lucene Lucene
• Parallel
• Aggregated
{search: {…}}
Elasticsearch
In a nutshell
• Distributed - One IndexWriter per shard, so writes are spread across shards
• Parallel - One IndexReader per shard, so a request runs on all shards in parallel
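The parallel and aggregated request path can be sketched as a scatter-gather: the same query runs on every shard at once, and the partial results are merged into one answer. The scoring (a raw term count) and the names `search_shard` and `search_index` are assumptions made for the example, not how Elasticsearch scores or names things.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term, size):
    """Return the shard-local top hits as (score, doc_id) pairs."""
    hits = []
    for doc_id, text in shard:
        score = text.split().count(term)  # toy score: term frequency
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)

def search_index(shards, term, size=3):
    """Scatter the query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = list(pool.map(lambda s: search_shard(s, term, size), shards))
    all_hits = [hit for partial in partial_results for hit in partial]
    return heapq.nlargest(size, all_hits)

# Two hypothetical shards holding (doc_id, text) pairs.
shards = [
    [("d1", "brown fox"), ("d2", "hello world")],
    [("d3", "fox fox kitchen")],
]
print(search_index(shards, "fox", size=2))  # [(2, 'd3'), (1, 'd1')]
```

The merge step is where the cost of aggregation shows up: it has to wait for the slowest shard before it can answer.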
Clustering
How to leverage ES to scale Lucene
Lucene
Sample sizing for xM indexed documents
• 2 Threads - 1 searcher, 1 writer
• 2G RAM - Lucene Cache
• 30G disk - Index size
Elasticsearch Index
Clustering
Single Machine Scope: 8 cores, 16G RAM, 500G HDD
• Lucene - 2T/2G/30G
• Lucene - 2T/2G/30G
• Lucene - 2T/2G/30G
• Lucene - 2T/2G/30G
Can sustain 4 times xM documents
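The figure of 4 can be checked with a short calculation, assuming one thread per core and that the scarcest resource decides how many Lucene indexes fit on a machine. The function name `shards_per_machine` is made up for this sketch.

```python
def shards_per_machine(cores, ram_gb, disk_gb,
                       shard_threads=2, shard_ram_gb=2, shard_disk_gb=30):
    """Number of 2T/2G/30G Lucene indexes that fit on one machine.

    Assumes one thread per core; the scarcest resource sets the limit.
    """
    by_threads = cores // shard_threads   # 8 // 2    = 4
    by_ram = ram_gb // shard_ram_gb       # 16 // 2   = 8
    by_disk = disk_gb // shard_disk_gb    # 500 // 30 = 16
    return min(by_threads, by_ram, by_disk)

print(shards_per_machine(cores=8, ram_gb=16, disk_gb=500))  # 4
```

With these numbers the threads run out first, well before RAM or disk.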
Clustering
1 machine -> 4 * xM documents
[Chart axes: # Documents, QPS]
Clustering
2 machines -> 2 * 4 * xM documents
• 4 Threads - 3 searchers, 1 writer
• 4G RAM - Lucene Cache
• 60G disk - Index size
[Chart axes: # Documents, QPS]
Clustering
4 machines -> 2 * 4 * xM documents
Twice the QPS
[Chart axes: # Documents, QPS]
Clustering
Is there a limit to this scalability?
[Chart axes: # Documents, QPS]
Clustering
4 machines -> 4 * 4 * xM documents
• 8 Threads - 7 searchers, 1 writer
• 8G RAM - Lucene Cache
• 120G disk - Index size
[Chart axes: # Documents, QPS]
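The machine counts in this section can be reproduced with a little arithmetic. The sketch assumes, as one reading of the slides, that machines either hold distinct shards (more documents) or hold extra copies of the same shards (more QPS); `cluster_capacity` and its `copies` parameter are hypothetical names.

```python
def cluster_capacity(machines, copies=1, shards_per_machine=4):
    """Return (documents as a multiple of xM, QPS multiplier).

    `copies` is how many machines hold the same set of shards; this
    split between capacity and throughput is an assumption of the sketch.
    """
    if machines % copies != 0:
        raise ValueError("machines must be a multiple of copies")
    distinct_machines = machines // copies
    return distinct_machines * shards_per_machine, copies

print(cluster_capacity(1))            # (4, 1)  -> 4 * xM documents
print(cluster_capacity(2))            # (8, 1)  -> 2 * 4 * xM documents
print(cluster_capacity(4, copies=2))  # (8, 2)  -> 2 * 4 * xM documents, twice the QPS
print(cluster_capacity(4))            # (16, 1) -> 4 * 4 * xM documents
```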
Clustering
Rules of thumb
• Threads - the core scalability factor
• IOPS - generally the limiting factor for horizontal scaling
• RAM - generally the limiting factor for vertical scaling
ES is generally excellent with its parameters
Clustering
Health
• Redundancy - auto-balance shards for best possible HA
• Timing - Warmup and Commit points
• Latency - Result merging (especially on remote aggregations)

From Lucene to Elasticsearch, a short explanation of horizontal scalability