Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited
How does Lucene store your data?
Adrien Grand
@jpountz
Apache Lucene/Solr committer
Software engineer @ Elasticsearch
Outline
●Segments
●What does a segment store?
●Improvements since Lucene 4.0
Segments
Segments
●Every segment is a fully functional index
●High numbers of segments trigger merges
●Merge: Copy all live data from several segments into a new one
Segments
●Immutable (up to deletes)
● SSD-friendly (no write amplification)
● great for caches (including the FS cache)
● easy incremental backups
●Merged together when there are too many of them
● Expunges deleted documents
●An IndexReader is a point-in-time view over a fixed number of segments
● Need to reopen to see changes
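The segment lifecycle above can be sketched with a toy model. This is only an illustration of the idea (immutable data, deletes as flipped bits, merges that copy live documents); the `Segment` class and `merge` function are hypothetical names and not Lucene's API.

```python
class Segment:
    """Toy segment: document data never changes, only the live-docs bits do."""

    def __init__(self, docs):
        self.docs = tuple(docs)          # immutable document data
        self.live = [True] * len(docs)   # a delete only flips one of these bits

    def delete(self, doc_id):
        self.live[doc_id] = False

    def live_docs(self):
        return [d for d, alive in zip(self.docs, self.live) if alive]


def merge(segments):
    """Copy all live data from several segments into a new one."""
    merged = []
    for seg in segments:
        merged.extend(seg.live_docs())
    return Segment(merged)


a = Segment(["doc A", "doc B"])
b = Segment(["doc C"])
a.delete(1)          # "doc B" is only marked as deleted, its data stays in place
c = merge([a, b])    # the merge expunges it
print(c.docs)
```

In this sketch the original segment still holds both documents after the delete, which is the property that makes segments cache-friendly and easy to back up incrementally.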
What does a segment store?
What is in a segment?
Component             | Stores                                       | Useful for
Segment & Field infos | Metadata                                     | Getting doc count / index options
Live docs             | Non-deleted docs                             | Excluding deleted docs from results
Inverted index        | The mapping from terms to docs and positions | Finding matching docs
Norms                 | Index-time boosts                            | Scoring
Doc values            | Any number or (small) bytes                  | Sorting, faceting, custom scoring
Stored fields         | The original doc                             | Result summaries
Term vectors          | Single doc inverted index                    | Highlighting, MoreLikeThis
What is in a segment?
Component      | API
Field infos    | AtomicReader.getFieldInfos()
Live docs      | AtomicReader.getLiveDocs()
Inverted index | AtomicReader.fields()
Norms          | AtomicReader.getNormValues(String field)
Doc values     | AtomicReader.get*DocValues(String field)
Stored fields  | AtomicReader.document(int docID, StoredFieldVisitor visitor)
Term vectors   | AtomicReader.getTermVectors(int docID)
Doc IDs
●Lucene gives sequential doc IDs to all documents in a segment, from 0 (inclusive) to AtomicReader.maxDoc() (exclusive)
●Uniquely identifies documents inside a segment
● i.e., if the inverted index API says that document 42 matches the term "bbuzz", I can query the stored fields API with the same ID
●Allows for efficient storage
● doc IDs can be used as ordinals
● Small & dense ints are easy to compress
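A minimal sketch of doc IDs acting as ordinals shared by several per-segment structures. The two containers below are plain Python stand-ins (assumptions for illustration), not Lucene data structures.

```python
# Two per-segment structures, both addressed by the same dense doc IDs.
stored_fields = ["hello bbuzz", "lucene rocks", "bbuzz 2013"]  # list index = doc ID

inverted_index = {}  # term -> doc IDs in increasing order
for doc_id, text in enumerate(stored_fields):
    for term in text.split():
        inverted_index.setdefault(term, []).append(doc_id)

max_doc = len(stored_fields)                  # doc IDs range over [0, max_doc)
matches = inverted_index["bbuzz"]             # doc IDs found via the inverted index...
hits = [stored_fields[d] for d in matches]    # ...reused to fetch the stored fields
print(matches, hits)
```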
Detour: bit packing
●Efficient technique to store blocks of small ints
● Supports random access
● Special case: bits per value = 1 is a bit set
●Say you want to store
● 5 30 1 1 10 12
● Raw data: 6 * 32 = 192 bits
● Packed: 6 * 5 = 30 bits, since every value fits in 5 bits (84% size reduction!)
00000000000000000000000000000101 = 5
00000000000000000000000000011110 = 30
00000000000000000000000000000001 = 1
00000000000000000000000000000001 = 1
00000000000000000000000000001010 = 10
00000000000000000000000000001100 = 12
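The example above can be reproduced with a small sketch. It packs the values into one Python integer rather than into a `long[]` as a real implementation would; `pack` and `get` are hypothetical helper names.

```python
def pack(values, bits_per_value):
    """Pack non-negative ints into one big integer, bits_per_value bits each."""
    packed = 0
    for i, v in enumerate(values):
        if v >> bits_per_value:
            raise ValueError(f"{v} does not fit in {bits_per_value} bits")
        packed |= v << (i * bits_per_value)
    return packed


def get(packed, bits_per_value, index):
    """Random access: shift to the slot and mask out the other values."""
    mask = (1 << bits_per_value) - 1
    return (packed >> (index * bits_per_value)) & mask


values = [5, 30, 1, 1, 10, 12]
bits = max(values).bit_length()   # smallest width that fits the largest value
packed = pack(values, bits)
print(bits, len(values) * bits, [get(packed, bits, i) for i in range(len(values))])
```

With `bits_per_value = 1` the same two functions behave like a bit set, which is the special case mentioned above.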
Fixed-length data
●Dense doc IDs are great for single-valued fixed-length data
● Store data sequentially
● Data for doc N is at offset N * dataLength
● Allows for fast and memory-efficient lookups
●Live docs (1 bit per value)
●Norms (1 byte per value)
●Numeric doc values
● Blocks with independent numbers of bits per value
[Figure: three consecutive blocks of 4096 values each]
● Block idx
○ docID / 4096
● Idx in block
○ docID % 4096
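A sketch of the block layout for numeric doc values, assuming each block simply records its own bit width next to its packed values. The block size is shrunk from 4096 to 4 to keep the example readable, and `encode_blocks` and `lookup` are hypothetical names.

```python
BLOCK_SIZE = 4  # 4096 on the slide; tiny here so the blocks are easy to inspect


def encode_blocks(values, block_size=BLOCK_SIZE):
    """Split values into blocks, each packed with its own number of bits per value."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        bits = max(1, max(chunk).bit_length())
        packed = 0
        for i, v in enumerate(chunk):
            packed |= v << (i * bits)
        blocks.append((bits, packed))
    return blocks


def lookup(blocks, doc_id, block_size=BLOCK_SIZE):
    bits, packed = blocks[doc_id // block_size]   # block idx
    idx = doc_id % block_size                     # idx in block
    return (packed >> (idx * bits)) & ((1 << bits) - 1)


values = [3, 1, 0, 2, 1000, 900, 950, 999, 7, 7]
blocks = encode_blocks(values)
print([bits for bits, _ in blocks])   # one bit width per block
print(lookup(blocks, 5))
```

Because the width is chosen per block, a few large values only make their own block more expensive instead of the whole field.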
Variable-length data
[Figure labels: end addresses, bytes]
●Binary doc values
●Stored fields
●Term vectors
●Need one level of indirection: store end addresses
● Easy to compress since end addresses are increasing
● Only store endAddress - (docID+1) * avgLength
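The end-address indirection can be sketched as follows. The integer average and the helper names `encode` and `read` are assumptions made for the example; the stored deltas can be negative in general, so a real encoder would need to handle signed values before bit packing.

```python
def encode(chunks):
    """Concatenate variable-length values and keep only small end-address deltas."""
    data = b"".join(chunks)
    avg_length = len(data) // len(chunks)   # integer average (an assumption)
    deltas = []
    end = 0
    for doc_id, chunk in enumerate(chunks):
        end += len(chunk)
        # Store endAddress - (docID + 1) * avgLength instead of endAddress itself.
        deltas.append(end - (doc_id + 1) * avg_length)
    return data, avg_length, deltas


def read(data, avg_length, deltas, doc_id):
    """Rebuild the start and end addresses of one document, then slice."""
    end = deltas[doc_id] + (doc_id + 1) * avg_length
    start = 0 if doc_id == 0 else deltas[doc_id - 1] + doc_id * avg_length
    return data[start:end]


chunks = [b"lucene", b"es", b"bbuzz", b"solr"]
data, avg_length, deltas = encode(chunks)
print(avg_length, deltas)
print(read(data, avg_length, deltas, 2))
```

The raw end addresses here are 6, 8, 13 and 17, while the stored deltas stay close to zero, which is what makes them cheap to pack.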
String data
●Terms index
●Sorted (Set) doc values
●MemoryPostingsFormat
●Suggesters
●FST: automaton with weighted arcs
○ compact thanks to shared prefixes/suffixes
●Stack = 1
●Star = 2
●Stop = 3
●Top = 4
[Figure: FST for these four terms, with arcs s/1, t, a, c, k, r/1, o/2, p, t/4, o]
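A simplified sketch of how weighted arcs encode the outputs: the value of a term is the sum of the weights along its path. This toy version is a trie that only shares prefixes (a real FST also shares suffixes), and `Node`, `build` and `lookup` are hypothetical names rather than Lucene's FST API.

```python
class Node:
    def __init__(self):
        self.arcs = {}          # label -> [weight, target Node]
        self.output = None      # full output if a term ends here
        self.final_weight = 0   # remainder added when a term ends here
        self.min = None         # smallest output in this subtree


def build(terms):
    """terms: non-empty dict mapping term -> non-negative int output."""
    root = Node()
    for term, out in terms.items():
        node = root
        for ch in term:
            node = node.arcs.setdefault(ch, [0, Node()])[1]
        node.output = out

    def subtree_min(node):
        candidates = [] if node.output is None else [node.output]
        candidates += [subtree_min(arc[1]) for arc in node.arcs.values()]
        node.min = min(candidates)
        return node.min

    def assign(node, base):
        # Push as much weight as possible towards the root.
        if node.output is not None:
            node.final_weight = node.output - base
        for arc in node.arcs.values():
            arc[0] = arc[1].min - base
            assign(arc[1], arc[1].min)

    subtree_min(root)
    assign(root, 0)
    return root


def lookup(root, term):
    node, total = root, 0
    for ch in term:
        if ch not in node.arcs:
            return None
        weight, node = node.arcs[ch]
        total += weight
    return None if node.output is None else total + node.final_weight


fst = build({"stack": 1, "star": 2, "stop": 3, "top": 4})
print(fst.arcs["s"][0], fst.arcs["t"][0])
print([lookup(fst, t) for t in ("stack", "star", "stop", "top")])
```

The non-zero weights this sketch produces (s/1, r/1, o/2, t/4) match the arc labels listed for the figure.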
Inverted index
●Terms index: map a term prefix to a block in the dict
○ FST
●Terms dictionary: statistics + pointer in postings lists
●Postings lists: encodes matching docs in sorted order
○ + positions + offsets
Original data                            | 1 2 4 11 42 43            | (6 * 4 = 24 bytes)
Split into blocks of 3 (128 in practice) | 1 2 4 | 11 42 43          |
Delta-encode                             | 1 1 2 | 11 31 1           |
Pack values                              | 2 [1 1 2] | 5 [11 31 1]   | (1+1+1+2 = 5 bytes)
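The block encoding steps can be sketched as below. The sketch computes the packed size instead of producing actual bytes, restarts the delta at each block as in the example, and assumes one header byte per block for the bit width; `encode_postings` and `decode_postings` are hypothetical names.

```python
def encode_postings(doc_ids, block_size=3):
    """Return [(bits_per_value, deltas), ...] and the total packed size in bytes."""
    blocks = []
    total_bytes = 0
    for start in range(0, len(doc_ids), block_size):
        block = doc_ids[start:start + block_size]
        # Delta-encode inside the block; the first value is kept as-is.
        deltas = [block[0]] + [b - a for a, b in zip(block, block[1:])]
        bits = max(1, max(deltas).bit_length())
        payload_bytes = (bits * len(deltas) + 7) // 8   # round up to whole bytes
        total_bytes += 1 + payload_bytes                # 1 header byte + payload
        blocks.append((bits, deltas))
    return blocks, total_bytes


def decode_postings(blocks):
    doc_ids = []
    for _, deltas in blocks:
        value = 0
        for d in deltas:
            value += d
            doc_ids.append(value)
    return doc_ids


blocks, size = encode_postings([1, 2, 4, 11, 42, 43])
print(blocks, size)
print(decode_postings(blocks))
```

Sorted doc IDs are what make the deltas small, so the postings order is part of the compression scheme and not just a convenience for merging lists.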
Improvements since Lucene 4.0
Improvements since Lucene 4.0
●LUCENE-4399 (4.1): no seek on write
●LUCENE-4498 (4.1): terms "pulsed" when freq=1
●Compression:
● LUCENE-3892 (4.1): postings encoding moved from vInt to packed ints: smaller & faster!
● LUCENE-4226 (4.1): compressed stored fields
● LUCENE-4599 (4.2): compressed term vectors
● LUCENE-4547 (4.2): better doc values:
○ blocks of packed ints for numbers
○ compression of addresses for binary
○ FST for Sorted (Set)
● LUCENE-4936 (4.4): compression for date DV
Performance
●http://people.apache.org/~mikemccand/lucenebench/Term.html
Detour: LZ4
●Super simple, blazing fast compression codec
●http://code.google.com/p/lz4/
●https://github.com/jpountz/lz4-java
●Example
● L: literals
● R: reference = (offset decrement, length)
● 1 2 3 6 7 6 7 6 7 6 7 8 9 1 2 3 6 7 10
● L 1 2 3 6 7 R(2,6) L 8 9 R(13,5) L 10
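A sketch of decoding the example above. The token tuples are a made-up in-memory representation for illustration and not the real LZ4 byte format; `decode` is a hypothetical name.

```python
def decode(tokens):
    """Expand literals and back-references into the original sequence."""
    out = []
    for token in tokens:
        if token[0] == "L":
            out.extend(token[1])                # literals are copied verbatim
        else:
            _, offset, length = token           # ("R", offset decrement, length)
            start = len(out) - offset
            for i in range(length):
                # Copy one value at a time so a reference may overlap
                # the data it is currently producing (length > offset).
                out.append(out[start + i])
    return out


tokens = [
    ("L", [1, 2, 3, 6, 7]),
    ("R", 2, 6),
    ("L", [8, 9]),
    ("R", 13, 5),
    ("L", [10]),
]
print(decode(tokens))
```

The reference R(2,6) is longer than its offset, so it keeps re-reading values it has just written; that is how a short repeating pattern such as 6 7 is expanded.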
Detour: LZ4
●https://github.com/ning/jvm-compressor-benchmark
Twitter benchmark
●Quick benchmark on a Twitter corpus
● 160908 tweets
● WhitespaceAnalyzer
Field      | Type   | Indexed | Stored | Doc values | Term vectors
id         | long   | yes     | yes    | -          | -
created_at | long   | -       | yes    | numeric    | -
user.name  | string | yes     | yes    | sorted     | -
text       | text   | yes     | yes    | -          | yes
Twitter benchmark
               | Lucene 4.0 | Lucene 4.4 (not released yet) | Difference
Inverted index | 23.3M      | 20.5M                         | -12%
Norms          | 157K       | 157K                          | +0%
Doc values     | 3.4M       | 3.1M                          | -9%
Stored fields  | 21.2M      | 15.7M                         | -26%
Term vectors   | 23.5M      | 15.5M                         | -34%
Overall        | ~71.5M     | ~55.0M                        | -23%
Questions?

Berlin Buzzwords 2013 - How does lucene store your data?
