Search @twitter 
Michael Busch 
@michibusch 
michael@twitter.com 
buschmi@apache.org
Agenda 
• Introduction 
• Search Architecture 
• Lucene Extensions 
• Outlook
Search at Twitter: Presented by Michael Busch, Twitter
Introduction
• Twitter has more than 284 million monthly active users. 
• 500 million tweets are sent per day. 
• More than 300 billion tweets have been sent since the company's founding in 2006. 
• Tweets-per-second record: one-second peak of 143,199 TPS. 
• More than 2 billion search queries per day.
Search Architecture
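[Diagram: The RT stream delivers raw tweets to the Analyzer/Partitioner, which writes analyzed tweets to the RT index (Earlybird) partitions. Raw tweets are also stored in the tweet archive on HDFS, where a Mapreduce Analyzer produces analyzed tweets for the Archive index. Search requests reach the Blender, which searches the RT index and the Archive index.]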
Search Architecture 
[Diagram: the same pipeline with its inputs. Tweets and updates (deletes and engagement, e.g. retweets/favs) arrive via a queue; the Analyzer/Partitioner writes to the RT index (Earlybird), and the Mapreduce Analyzer builds the Archive index from HDFS. The Blender serves search requests.]
Search Architecture 
[Diagram: The Blender receives search requests and searches the RT index (Earlybird), the Archive index, and User search; the social graph is shown as a further input.] 
• Blender is our Thrift service aggregator 
• Queries multiple Earlybirds, merges results
Search Architecture 
[Diagram: three indexes, RT index (Earlybird), Archive index, and User search, each built on Lucene] 
• For historic reasons, these used to be entirely different codebases, but had similar features/technologies 
• Over time, cross-dependencies were introduced to share code
Search Architecture 
[Diagram: RT index (Earlybird), Archive index, and User search on top of a shared Lucene Extensions layer, which sits on Lucene] 
• New Lucene extension package 
• This package is truly generic and has no dependency on an actual product/index 
• It contains Twitter’s extensions for real-time search, a thin segment management layer and other features
Lucene Extensions
Lucene Extension Library 
• Abstraction layer for Lucene index segments 
• Real-time writer for in-memory index segments 
• Schema-based Lucene document factory 
• Real-time faceting
Lucene Extension Library 
• API layer for Lucene segments 
  • *IndexSegmentWriter 
  • *IndexSegmentAtomicReader 
• Two implementations 
  • In-memory: RealtimeIndexSegmentWriter (and reader) 
  • On-disk: LuceneIndexSegmentWriter (and reader)
Lucene Extension Library 
• IndexSegments can be built ... 
  • in realtime 
  • on Mesos or Hadoop (Mapreduce) 
  • locally on serving machines 
• Cluster-management code that deals with IndexSegments 
  • Share segments across serving machines using HDFS 
  • Can rebuild segments (e.g. to upgrade Lucene version, change data schema, etc.)
Lucene Extension Library 
[Diagram: HDFS connects several Earlybirds with the segment builders: Mesos, Hadoop (MR), and the RT pipeline]
RealtimeIndexSegmentWriter 
• Modified Lucene index implementation optimized for realtime search 
• IndexWriter buffer is searchable (no need to flush to allow searching) 
• In-memory 
• Lock-free concurrency model for best performance
Concurrency - Definitions 
• Pessimistic locking 
  • A thread holds an exclusive lock on a resource while an action is performed [mutual exclusion] 
  • Usually used when conflicts are expected to be likely 
• Optimistic locking 
  • Operations are attempted atomically without holding a lock; conflicts can be detected, and retry logic is often used when they occur 
  • Usually used when conflicts are expected to be the exception
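The optimistic pattern described above can be sketched with a compare-and-set retry loop. This is a minimal illustration only: the class `OptimisticCounter` and its method `addCapped` are hypothetical names, not part of Twitter's library; `AtomicLong.compareAndSet` is the standard JDK call.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class; not taken from the talk.
final class OptimisticCounter {
    private final AtomicLong value = new AtomicLong();

    // Optimistic update: read, compute, then try to publish atomically.
    // If another thread changed the value in between, the CAS fails and we retry.
    long addCapped(long delta, long cap) {
        while (true) {
            long current = value.get();
            long next = Math.min(cap, current + delta);
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // Conflict detected: loop and retry with the fresh value.
        }
    }

    long get() {
        return value.get();
    }

    public static void main(String[] args) throws InterruptedException {
        OptimisticCounter counter = new OptimisticCounter();
        Runnable work = () -> {
            for (int i = 0; i < 10_000; i++) {
                counter.addCapped(1, Long.MAX_VALUE);
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(counter.get()); // prints 20000: no update is lost
    }
}
```

No thread ever holds a lock here; a failed compare-and-set only means that some other thread made progress in the meantime.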
Concurrency - Definitions 
• Non-blocking algorithm 
  Ensures that threads competing for shared resources do not have their execution indefinitely postponed by mutual exclusion. 
• Lock-free algorithm 
  A non-blocking algorithm is lock-free if there is guaranteed system-wide progress. 
• Wait-free algorithm 
  A non-blocking algorithm is wait-free if there is guaranteed per-thread progress. 
* Source: Wikipedia
Concurrency 
• Having a single writer thread simplifies our problem: no locks have to be used to protect data structures from corruption (only one thread modifies data) 
• But: we have to make sure that all readers always see a consistent state of all data structures -> this is much harder than it sounds! 
• In Java, it is not guaranteed that one thread will see changes that another thread makes in program execution order, unless the same memory barrier is crossed by both threads -> safe publication 
• Safe publication can be achieved in different, subtle ways. Read the great book “Java Concurrency in Practice” by Brian Goetz for more information!
Java Memory Model 
• Program order rule 
Each action in a thread happens-before every action in that thread that comes 
later in the program order. 
• Volatile variable rule 
A write to a volatile field happens-before every subsequent read of that same 
field. 
• Transitivity 
If A happens-before B, and B happens-before C, then A happens-before C. 
* Source: Brian Goetz: Java Concurrency in Practice
Concurrency 
[Diagram: Thread 1 and Thread 2 over time; RAM holds int x = 0, with a cache in front of it] 
Thread 1: 
  x = 5; 
  Thread 1 writes x=5 to its cache; RAM still holds 0 
Thread 2: 
  while(x != 5); 
  This condition may never become false: nothing guarantees that Thread 2 ever sees the write!
Concurrency 
[Diagram: same setup as before, with a second field b that is volatile] 
Thread 1: 
  int x; 
  volatile int b; 
  x = 5; 
  b = 1; 
  Thread 1 writes b=1 to RAM, because b is volatile 
Thread 2: 
  int dummy = b; (read volatile b) 
  while(x != 5); 
happens-before 
• Program order rule: Each action in a thread happens-before every action in that thread that comes later in the program order. Here: x = 5 happens-before b = 1, and int dummy = b happens-before while(x != 5). 
• Volatile variable rule: A write to a volatile field happens-before every subsequent read of that same field. Here: b = 1 happens-before int dummy = b, provided the read comes after the write. 
• Transitivity: If A happens-before B, and B happens-before C, then A happens-before C. Here: x = 5 happens-before while(x != 5). 
This condition will be false, i.e. x==5 
Memory barrier: the volatile write and the volatile read of b form the barrier that both threads cross. 
• Note: x itself doesn’t have to be volatile. There can be many variables like x, but we need only a single volatile field.
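The x/b scenario can be reproduced in a few lines of Java. This is a sketch, not the demo from the talk: the class name `VolatilePiggyback` and the method `observe` are made up, and the reader spins on `b` (instead of reading it once into `dummy`) so that its volatile read is guaranteed to come after the write.

```java
// Hypothetical demo class mirroring the slide's variables x and b.
final class VolatilePiggyback {
    static int x;            // plain field, deliberately not volatile
    static volatile int b;   // the single volatile field

    // Returns the value of x that the reader thread observed.
    static int observe() {
        int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            // Spin on the volatile read until the write b = 1 is observed.
            while (b != 1) {
                Thread.onSpinWait();
            }
            // The read of b saw b = 1, so x = 5 happens-before this read of x.
            seen[0] = x;
        });
        reader.start();

        x = 5;   // ordinary write
        b = 1;   // volatile write: publishes everything written before it

        try {
            reader.join();   // join makes the reader's write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(observe()); // prints 5
    }
}
```

Removing `volatile` from `b` would remove the happens-before edge; the reader might then spin forever or see a stale `x`.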
Demo
Concurrency 
[Diagram: IndexWriter and IndexReader timelines; maxDoc is volatile] 
IndexWriter: 
  write 100 docs 
  maxDoc = 100 
  write more docs 
IndexReader: 
  in IR.open(): read maxDoc 
  search up to maxDoc 
happens-before: the write maxDoc = 100 happens-before the read of maxDoc in IR.open() 
• Only maxDoc is volatile. All other fields that IW writes to and IR reads from don’t need to be!
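The maxDoc pattern can be sketched as a single-writer, append-only structure. The class `TinySegment` and its per-document `int` payload are hypothetical simplifications, not Twitter's or Lucene's code; only the publication order (fill the slot, then bump the volatile counter) reflects the slide.

```java
// Hypothetical, simplified segment: one writer thread appends, many readers search.
final class TinySegment {
    private final int[] docLengths;   // plain array, written only by the writer
    private volatile int maxDoc;      // the only volatile field

    TinySegment(int capacity) {
        docLengths = new int[capacity];   // fixed capacity keeps the sketch short
    }

    // Writer thread only: fill the slot first, then publish it via maxDoc.
    void addDocument(int length) {
        int docId = maxDoc;               // single writer, so no race on this read
        docLengths[docId] = length;       // ordinary write
        maxDoc = docId + 1;               // volatile write: publishes the slot
    }

    // Reader threads: read maxDoc once, then only touch docs below it.
    long totalLength() {
        int limit = maxDoc;               // volatile read, like IR.open()
        long sum = 0;
        for (int docId = 0; docId < limit; docId++) {
            sum += docLengths[docId];     // visible because of the happens-before edge
        }
        return sum;
    }

    int maxDoc() {
        return maxDoc;
    }

    public static void main(String[] args) {
        TinySegment segment = new TinySegment(1000);
        for (int i = 0; i < 100; i++) {
            segment.addDocument(10);
        }
        // prints "100 docs, total length 1000"
        System.out.println(segment.maxDoc() + " docs, total length " + segment.totalLength());
    }
}
```

A reader that opened at maxDoc = 100 never looks at slots 100 and above, so the writer can keep appending without coordinating with it.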
Wait-free 
• Not a single exclusive lock 
• Writer thread can always make progress 
• Optimistic locking (retry-logic) in a few places for searcher thread 
• Retry logic very simple and guaranteed to always make progress
In-memory Real-time Index 
• Highly optimized for GC - all data is stored in blocked native arrays 
• v1: Optimized for tweets with a term position limit of 255 
• v2: Support for 32 bit positions without performance degradation 
• v2: Basic support for out-of-order posting list inserts
In-memory Real-time Index 
• RT term dictionary 
  • Term lookups using a lock-free hashtable in O(1) 
  • v2: Additional probabilistic, lock-free skip list maintains ordering on terms 
  • Perfect skip list not an option: out-of-order inserts would require rebalancing, which is impractical with our lock-free index 
  • In a probabilistic skip list the tower height of a new (out-of-order) item can be determined without knowing its insert position by simply rolling a die
In-memory Real-time Index 
• Perfect skip list 
  [Diagram: perfect skip list] 
  Inserting a new element in the middle of this skip list requires re-balancing the towers. 
• Probabilistic skip list 
  [Diagram: probabilistic skip list] 
  Tower height is determined by rolling a die BEFORE knowing the insert location; the tower height never has to change for an element, simplifying memory allocation and concurrency.
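Choosing a tower height before knowing the insert position can be sketched as follows. The promotion probability of 1/2 and the height cap are assumptions made for this illustration (the slides do not state the values used), and `TowerHeight.roll` is a made-up helper.

```java
import java.util.Random;

// Hypothetical helper; probability 1/2 per level is an assumed, common choice.
final class TowerHeight {
    // Keep flipping a fair coin: every "heads" adds one level, up to maxHeight.
    static int roll(Random random, int maxHeight) {
        int height = 1;
        while (height < maxHeight && random.nextBoolean()) {
            height++;
        }
        return height;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        int[] histogram = new int[9];   // index = height, heights 1..8
        for (int i = 0; i < 100_000; i++) {
            histogram[roll(random, 8)]++;
        }
        // Roughly half of all towers have height 1, a quarter height 2, and so on.
        for (int h = 1; h <= 8; h++) {
            System.out.println("height " + h + ": " + histogram[h]);
        }
    }
}
```

Because the height depends only on the random source, it can be fixed when the element is allocated and never needs to change afterwards.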
Schema-based Document factory 
• Apps provide one ThriftSchema per index and create a ThriftDocument for each document 
• SchemaDocumentFactory translates ThriftDocument -> Lucene Document using the Schema 
  • Default field values 
  • Extended field settings 
  • Type-system on top of DocValues 
  • Validation
Schema-based Document factory 
[Diagram: Thrift Document + Schema -> SchemaDocumentFactory -> Lucene Document] 
• Validation 
• Fill in default values 
• Apply correct Lucene field settings 
Decouples core package from specific product/index. Similar to Solr/ElasticSearch.
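The validate-and-fill-defaults step can be illustrated with plain Java maps. Everything here is a hypothetical stand-in: `SketchDocumentFactory` and `FieldSpec` are invented names, a `Map` replaces both the Thrift document and the Lucene document, and Lucene field settings are left out entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: plain maps instead of Thrift and Lucene classes.
final class SketchDocumentFactory {
    // Per-field schema entry: expected type plus an optional default value.
    static final class FieldSpec {
        final Class<?> type;
        final Object defaultValue;   // null means the field is required

        FieldSpec(Class<?> type, Object defaultValue) {
            this.type = type;
            this.defaultValue = defaultValue;
        }
    }

    private final Map<String, FieldSpec> schema;

    SketchDocumentFactory(Map<String, FieldSpec> schema) {
        this.schema = schema;
    }

    // Validates the input against the schema and fills in default values.
    Map<String, Object> create(Map<String, Object> input) {
        for (String name : input.keySet()) {
            if (!schema.containsKey(name)) {
                throw new IllegalArgumentException("unknown field: " + name);
            }
        }
        Map<String, Object> document = new LinkedHashMap<>();
        for (Map.Entry<String, FieldSpec> entry : schema.entrySet()) {
            FieldSpec spec = entry.getValue();
            Object value = input.getOrDefault(entry.getKey(), spec.defaultValue);
            if (value == null) {
                throw new IllegalArgumentException("missing field: " + entry.getKey());
            }
            if (!spec.type.isInstance(value)) {
                throw new IllegalArgumentException("wrong type for: " + entry.getKey());
            }
            document.put(entry.getKey(), value);
        }
        return document;
    }

    public static void main(String[] args) {
        Map<String, FieldSpec> schema = new LinkedHashMap<>();
        schema.put("text", new FieldSpec(String.class, null));
        schema.put("lang", new FieldSpec(String.class, "en"));
        SketchDocumentFactory factory = new SketchDocumentFactory(schema);
        System.out.println(factory.create(Map.of("text", "hello"))); // {text=hello, lang=en}
    }
}
```

The factory knows nothing about tweets or users; all product-specific knowledge lives in the schema it is given.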
Outlook
Outlook 
• Support for parallel (sliced) segments to support partial segment rebuilds and other cool posting list update patterns 
• Add remaining missing Lucene features to RT index 
  • Index term statistics for ranking 
  • Term vectors 
  • Stored fields
Questions? 
Michael Busch 
@michibusch 
michael@twitter.com 
buschmi@apache.org
Backup Slides
Searching for top entities within Tweets 
• Task: Find the best photos in a subset of tweets 
• We could use a Lucene index, where each photo is a document 
• Problem: How to update existing documents when the same photos are 
tweeted again? 
• In-place posting list updates are hard 
• Lucene’s updateDocument() is a delete/add operation - expensive and not 
order-preserving
• Could we use our existing time-ordered tweet index? 
• Facets!
Searching for top entities within Tweets 
• Inverted index: Query -> Doc ids 
• Inverted index: Term id -> Term label 
• Forward index: Doc id -> Document metadata 
• Facet index: Doc id -> Term ids 
Storing tweet metadata 
• Facet index: Doc id -> Term ids
Searching for top entities within Tweets 
[Diagram: the query produces matching doc ids (5, 15, 9000, 9002, 100000, 100090); for each matching doc id, the facet index yields term ids, which are counted in a top-k heap] 
Top-k heap, early in the scan: 
Id      Count 
48239   8 
31241   2 
Top-k heap, later in the scan: 
Id      Count 
48239   15 
31241   12 
85932   8 
6748    3 
• Weighted counts (from engagement features) used for relevance scoring 
• All query operators can be used. E.g. find best photos in San Francisco tweeted by people I follow
Searching for top entities within Tweets 
[Diagram: the inverted index maps each term id in the top-k heap to its term label] 
Id      Count   Label 
48239   45      pic.twitter.com/jknui4w 
31241   23      pic.twitter.com/dslkfj83 
85932   15      pic.twitter.com/acm3ps 
6748    11      pic.twitter.com/948jdsd 
74294   8       pic.twitter.com/dsjkf15h 
3728    5       pic.twitter.com/irnsoa32
Summary 
• Indexing tweet entities (e.g. photos) as facets makes it possible to search and rank top entities using a tweet index 
• All query operators supported 
• Documents don’t need to be reindexed 
• Approach reusable for different use cases, e.g.: best vines, hashtags, @mentions, etc.
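The facet-counting approach summarized above can be sketched in plain Java. This is an illustration under assumptions: `FacetTopK` is a made-up class, the facet index is modeled as an in-memory map from doc id to term ids, and counts are unweighted (the slides mention weighted counts from engagement features).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch; the real facet index is not a HashMap.
final class FacetTopK {
    // facetIndex: doc id -> term ids of the entities (e.g. photos) in that tweet.
    static List<Map.Entry<Integer, Integer>> topK(
            int[] matchingDocIds, Map<Integer, int[]> facetIndex, int k) {
        // Count how often each term id occurs among the matching documents.
        Map<Integer, Integer> counts = new HashMap<>();
        for (int docId : matchingDocIds) {
            for (int termId : facetIndex.getOrDefault(docId, new int[0])) {
                counts.merge(termId, 1, Integer::sum);
            }
        }
        // A min-heap of size k keeps the k highest counts seen so far.
        Comparator<Map.Entry<Integer, Integer>> byCount = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<Integer, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll();   // drop the entry with the smallest count
            }
        }
        List<Map.Entry<Integer, Integer>> result = new ArrayList<>(heap);
        result.sort(byCount.reversed());   // highest count first
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> facetIndex = Map.of(
                5, new int[] {48239, 31241},
                15, new int[] {48239},
                9000, new int[] {48239, 31241, 85932});
        // Term id -> term label lookup, as done by the inverted index on the slides.
        Map<Integer, String> labels = Map.of(
                48239, "pic.twitter.com/jknui4w",
                31241, "pic.twitter.com/dslkfj83",
                85932, "pic.twitter.com/acm3ps");
        for (Map.Entry<Integer, Integer> e : topK(new int[] {5, 15, 9000}, facetIndex, 2)) {
            System.out.println(labels.get(e.getKey()) + " " + e.getValue());
        }
        // prints:
        // pic.twitter.com/jknui4w 3
        // pic.twitter.com/dslkfj83 2
    }
}
```

The matching doc ids can come from any query, which is why all query operators remain usable; labels are resolved only for the k winners.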

Search at Twitter: Presented by Michael Busch, Twitter

  • 2. Search @twitter Agenda ‣ Introduction - Search Architecture - Lucene Extensions - Outlook
  • 5. Introduction Twitter has more than 284 million monthly active users.
  • 6. Introduction 500 million tweets are sent per day.
  • 7. Introduction More than 300 billion tweets have been sent since company founding in 2006.
  • 8. Introduction Tweets-per-second record: one-second peak of 143,199 TPS.
  • 9. Introduction More than 2 billion search queries per day.
  • 10. Search @twitter Agenda - Introduction ‣ Search Architecture - Lucene Extensions - Outlook
  • 13. RT index Search Architecture RT stream Analyzer/ Partitioner RT index (Earlybird) Blender Archive index RT index Mapreduce Analyzer raw tweets Tweet archive HDFS Search requests writes searches analyzed tweets analyzed tweets raw tweets
  • 14. RT index Search Architecture Tweets Analyzer/ Partitioner RT index (Earlybird) Blender Archive index RT index queue HDFS Search requests Updates Deletes/ Engagement (e.g. retweets/favs) writes searches Mapreduce Analyzer
  • 15. RT index Search Architecture RT index (Earlybird) Social graph Social Blender Archive index RT index User search Search requests writes searches • Blender is our Thrift service aggregator • Queries multiple Earlybirds, merges results Social graph graph
  • 16. Search Architecture RT index (Earlybird) Archive index User search
  • 17. Search Architecture RT index (Earlybird) Archive index • For historic reasons, these used to be entirely different codebases, but had similar features/ technologies • Over time cross-dependencies were introduced to share code User search Lucene
  • 18. Search Architecture RT index (Earlybird) Archive index User search Lucene Extensions Lucene • New Lucene extension package • This package is truly generic and has no dependency on an actual product/index • It contains Twitter’s extensions for real-time search, a thin segment management layer and other features
  • 19. Search @twitter Agenda - Introduction - Search Architecture ‣ Lucene Extensions - Outlook
  • 22. Lucene Extension Library • Abstraction layer for Lucene index segments • Real-time writer for in-memory index segments • Schema-based Lucene document factory • Real-time faceting
  • 23. Lucene Extension Library • API layer for Lucene segments • *IndexSegmentWriter • *IndexSegmentAtomicReader • Two implementations • In-memory: RealtimeIndexSegmentWriter (and reader) • On-disk: LuceneIndexSegmentWriter (and reader)
  • 24. Lucene Extension Library • IndexSegments can be built ... • in realtime • on Mesos or Hadoop (Mapreduce) • locally on serving machines • Cluster-management code that deals with IndexSegments • Share segments across serving machines using HDFS • Can rebuild segments (e.g. to upgrade Lucene version, change data schema, etc.)
  • 25. Lucene Extension Library HDFS EEEaararlyrlylbybbirirdirdd Mesos Hadoop (MR) RT pipeline
  • 26. RealtimeIndexSegmentWriter • Modified Lucene index implementation optimized for realtime search • IndexWriter buffer is searchable (no need to flush to allow searching) • In-memory • Lock-free concurrency model for best performance
  • 27. Concurrency - Definitions • Pessimistic locking • A thread holds an exclusive lock on a resource, while an action is performed [mutual exclusion] • Usually used when conflicts are expected to be likely • Optimistic locking • Operations are tried to be performed atomically without holding a lock; conflicts can be detected; retry logic is often used in case of conflicts • Usually used when conflicts are expected to be the exception
  • 28. Concurrency - Definitions • Non-blocking algorithm: Ensures that threads competing for shared resources do not have their execution indefinitely postponed by mutual exclusion. • Lock-free algorithm: A non-blocking algorithm is lock-free if there is guaranteed system-wide progress. • Wait-free algorithm: A non-blocking algorithm is wait-free if there is guaranteed per-thread progress. * Source: Wikipedia
  • 29. Concurrency • Having a single writer thread simplifies our problem: no locks have to be used to protect data structures from corruption (only one thread modifies data) • But: we have to make sure that all readers always see a consistent state of all data structures -> this is much harder than it sounds! • In Java, it is not guaranteed that one thread will see changes that another thread makes in program execution order, unless the same memory barrier is crossed by both threads -> safe publication • Safe publication can be achieved in different, subtle ways. Read the great book “Java Concurrency in Practice” by Brian Goetz for more information!
  • 30. Java Memory Model • Program order rule: Each action in a thread happens-before every action in that thread that comes later in the program order. • Volatile variable rule: A write to a volatile field happens-before every subsequent read of that same field. • Transitivity: If A happens-before B, and B happens-before C, then A happens-before C. * Source: Brian Goetz: Java Concurrency in Practice
  • 31. Concurrency RAM 0 int x; Cache Thread 1 Thread 2 time
  • 32. Concurrency Cache 5 RAM 0 int x; Thread 1 Thread 2 x = 5; Thread 1 writes x=5 to cache time
  • 33. Concurrency Cache 5 RAM 0 int x; Thread 1 Thread 2 x = 5; time while(x != 5); This condition will likely never become false!
  • 34. Concurrency RAM 0 int x; Cache Thread 1 Thread 2 time
  • 35. Concurrency RAM 0 int x; Thread 1 writes b=1 to RAM, because b is volatile 5 x = 5; 1 Cache Thread 1 Thread 2 time volatile int b; b = 1;
  • 36. Concurrency RAM 0 int x; 5 x = 5; 1 Cache Thread 1 Thread 2 time volatile int b; b = 1; Read volatile b int dummy = b; while(x != 5);
  • 37. Concurrency RAM 0 int x; 5 x = 5; 1 Cache Thread 1 Thread 2 time volatile int b; b = 1; int dummy = b; while(x != 5); happens-before • Program order rule: Each action in a thread happens-before every action in that thread that comes later in the program order.
  • 38. Concurrency RAM 0 int x; 5 x = 5; 1 Cache Thread 1 Thread 2 time volatile int b; b = 1; int dummy = b; while(x != 5); happens-before • Volatile variable rule: A write to a volatile field happens-before every subsequent read of that same field.
  • 39. Concurrency RAM 0 int x; 5 x = 5; 1 Cache Thread 1 Thread 2 time volatile int b; b = 1; int dummy = b; while(x != 5); happens-before • Transitivity: If A happens-before B, and B happens-before C, then A happens-before C.
  • 40. Concurrency RAM 0 int x; 5 x = 5; 1 Cache Thread 1 Thread 2 time volatile int b; b = 1; int dummy = b; while(x != 5); This condition will be false, i.e. x==5 • Note: x itself doesn’t have to be volatile. There can be many variables like x, but we need only a single volatile field.
  • 41. Concurrency RAM 0 int x; 5 x = 5; 1 Cache Thread 1 Thread 2 time volatile int b; b = 1; int dummy = b; while(x != 5); Memory barrier • Note: x itself doesn’t have to be volatile. There can be many variables like x, but we need only a single volatile field.
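The happens-before chain built up on the preceding slides can be written as a small Java class. This is a sketch under stated assumptions: the class `Piggyback`, its methods, and the `-1` "nothing published yet" sentinel are hypothetical, and unlike the slide the reader here polls the volatile field `b` rather than reading it once, so the example terminates deterministically.

```java
final class Piggyback {
    private int x;            // plain field, deliberately NOT volatile
    private volatile int b;   // the single volatile field

    /** Writer thread: the plain write comes first, the volatile write last. */
    void publish(int value) {
        x = value;   // program order: happens-before the write to b
        b = 1;       // volatile write
    }

    /** Reader thread: volatile read first, then the plain read. Returns -1 if nothing is published yet. */
    int tryRead() {
        if (b == 1) {   // volatile read that observes the volatile write
            return x;   // by transitivity, the write to x is visible here
        }
        return -1;
    }

    /** Runs one writer and one reader and returns the value the reader observed. */
    static int demo() {
        Piggyback shared = new Piggyback();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            int value;
            while ((value = shared.tryRead()) == -1) {
                Thread.yield();   // spin until the volatile flag becomes visible
            }
            seen[0] = value;
        });
        reader.start();
        shared.publish(5);
        try {
            reader.join();   // join also makes seen[0] visible to the calling thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Any number of plain fields written before `b = 1` would be covered by the same volatile write, which is the point of the note on the slide.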
  • 43. Demo
  • 45. Concurrency IndexWriter IndexReader time write 100 docs maxDoc = 100 in IR.open(): read maxDoc search up to maxDoc write more docs maxDoc is volatile
  • 46. Concurrency IndexWriter IndexReader time write 100 docs maxDoc = 100 in IR.open(): read maxDoc search up to maxDoc write more docs maxDoc is volatile happens-before • Only maxDoc is volatile. All other fields that IW writes to and IR reads from don’t need to be!
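The maxDoc pattern can be sketched with an append-only array and a single volatile counter. The names `AppendOnlySegment` and `Snapshot` are hypothetical and stand in for the real writer and reader classes, which are not shown in the slides; the sketch assumes exactly one writer thread and a fixed capacity.

```java
final class AppendOnlySegment {
    private final int[] docs;      // plain array, written only by the single writer thread
    private volatile int maxDoc;   // number of documents that are safely published

    AppendOnlySegment(int capacity) {
        docs = new int[capacity];
    }

    /** Must be called by the single writer thread only. */
    void add(int value) {
        int next = maxDoc;    // only the writer changes maxDoc, so this value is stable
        docs[next] = value;   // plain write into a slot no reader looks at yet
        maxDoc = next + 1;    // volatile write publishes the slot
    }

    /** Reader side: read maxDoc once, as in IR.open(), and search only up to that point. */
    Snapshot open() {
        return new Snapshot(maxDoc);
    }

    final class Snapshot {
        private final int limit;

        private Snapshot(int limit) {
            this.limit = limit;
        }

        int size() {
            return limit;
        }

        /** Counts matching documents among the first `limit` slots only. */
        int count(int value) {
            int hits = 0;
            for (int i = 0; i < limit; i++) {
                if (docs[i] == value) {
                    hits++;
                }
            }
            return hits;
        }
    }
}
```

Documents added after `open()` are invisible to that snapshot, so a search sees a consistent view while the writer keeps appending.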
  • 47. Wait-free • Not a single exclusive lock • Writer thread can always make progress • Optimistic locking (retry-logic) in a few places for searcher thread • Retry logic very simple and guaranteed to always make progress
  • 48. In-memory Real-time Index • Highly optimized for GC - all data is stored in blocked native arrays • v1: Optimized for tweets with a term position limit of 255 • v2: Support for 32 bit positions without performance degradation • v2: Basic support for out-of-order posting list inserts
  • 50. In-memory Real-time Index • RT term dictionary • Term lookups using a lock-free hashtable in O(1) • v2: Additional probabilistic, lock-free skip list maintains ordering on terms • Perfect skip list not an option: out-of-order inserts would require rebalancing, which is impractical with our lock-free index • In a probabilistic skip list the tower height of a new (out-of-order) item can be determined without knowing its insert position by simply rolling a die
  • 51. In-memory Real-time Index • Perfect skip list
  • 52. In-memory Real-time Index • Perfect skip list Inserting a new element in the middle of this skip list requires re-balancing the towers.
  • 53. In-memory Real-time Index • Probabilistic skip list
  • 54. In-memory Real-time Index • Probabilistic skip list Tower height determined by rolling a die BEFORE knowing the insert location; tower height never has to change for an element, simplifying memory allocation and concurrency.
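A small single-threaded skip list shows why the random tower height helps with out-of-order inserts. This is an illustrative sketch only: `SketchSkipList`, the height cap of 16, and the coin-flip distribution are assumptions, and the lock-free aspects of the real term dictionary are not modeled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SketchSkipList {
    static final int MAX_HEIGHT = 16;   // assumed cap on tower height

    private static final class Node {
        final int key;
        final Node[] next;   // the tower; its length is fixed when the node is created

        Node(int key, int height) {
            this.key = key;
            this.next = new Node[height];
        }
    }

    private final Node head = new Node(Integer.MIN_VALUE, MAX_HEIGHT);
    private final Random rng;

    SketchSkipList(long seed) {
        rng = new Random(seed);
    }

    /** Flips a fair coin until tails or the cap is reached. */
    static int randomHeight(Random rng) {
        int height = 1;
        while (height < MAX_HEIGHT && rng.nextBoolean()) {
            height++;
        }
        return height;
    }

    void insert(int key) {
        int height = randomHeight(rng);   // decided BEFORE the insert position is known
        Node node = new Node(key, height);
        Node current = head;
        for (int level = MAX_HEIGHT - 1; level >= 0; level--) {
            while (current.next[level] != null && current.next[level].key < key) {
                current = current.next[level];
            }
            if (level < height) {
                // Splice the new node in at this level; existing towers are never resized.
                node.next[level] = current.next[level];
                current.next[level] = node;
            }
        }
    }

    boolean contains(int key) {
        Node current = head;
        for (int level = MAX_HEIGHT - 1; level >= 0; level--) {
            while (current.next[level] != null && current.next[level].key < key) {
                current = current.next[level];
            }
        }
        Node candidate = current.next[0];
        return candidate != null && candidate.key == key;
    }

    /** Walks the bottom level, which always holds every key in sorted order. */
    List<Integer> keysInOrder() {
        List<Integer> keys = new ArrayList<>();
        for (Node node = head.next[0]; node != null; node = node.next[0]) {
            keys.add(node.key);
        }
        return keys;
    }
}
```

Because each node's tower is allocated once and never grows, an insert in the middle only rewires pointers of its neighbors instead of rebalancing other towers.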
  • 55. Schema-based Document factory • Apps provide one ThriftSchema per index and create a ThriftDocument for each document • SchemaDocumentFactory translates ThriftDocument -> Lucene Document using the Schema • Default field values • Extended field settings • Type-system on top of DocValues • Validation
  • 56. Schema-based Document factory Schema Lucene Document SchemaDocument Factory Thrift Document • Validation • Fill in default values • Apply correct Lucene field settings
  • 57. Schema-based Document factory Schema Lucene Document SchemaDocument Factory Thrift Document • Validation • Fill in default values • Apply correct Lucene field settings Decouples core package from specific product/index. Similar to Solr/Elasticsearch.
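The validate-and-fill-defaults step can be sketched without any Thrift or Lucene dependency. Everything here is a stand-in: `SchemaSketch`, `FieldDef`, and the use of plain maps for the input and output documents are assumptions made for illustration, not the actual ThriftSchema or SchemaDocumentFactory API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SchemaSketch {
    /** One field definition: the expected Java type and an optional default value. */
    static final class FieldDef {
        final Class<?> type;
        final Object defaultValue;   // null means "no default, the field is required"

        FieldDef(Class<?> type, Object defaultValue) {
            this.type = type;
            this.defaultValue = defaultValue;
        }
    }

    private final Map<String, FieldDef> fields = new LinkedHashMap<>();

    SchemaSketch field(String name, Class<?> type, Object defaultValue) {
        fields.put(name, new FieldDef(type, defaultValue));
        return this;
    }

    /** Validates the input against the schema and returns a document with defaults filled in. */
    Map<String, Object> toDocument(Map<String, Object> input) {
        for (String name : input.keySet()) {
            if (!fields.containsKey(name)) {
                throw new IllegalArgumentException("unknown field: " + name);
            }
        }
        Map<String, Object> document = new LinkedHashMap<>();
        for (Map.Entry<String, FieldDef> entry : fields.entrySet()) {
            String name = entry.getKey();
            FieldDef def = entry.getValue();
            Object value = input.containsKey(name) ? input.get(name) : def.defaultValue;
            if (value == null) {
                throw new IllegalArgumentException("missing required field: " + name);
            }
            if (!def.type.isInstance(value)) {
                throw new IllegalArgumentException("wrong type for field: " + name);
            }
            document.put(name, value);
        }
        return document;
    }
}
```

Keeping the rules in the schema object means the translation code itself knows nothing about a particular product or index.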
  • 58. Search @twitter Agenda - Introduction - Search Architecture - Lucene Extensions ‣ Outlook
  • 61. Outlook • Support for parallel (sliced) segments to support partial segment rebuilds and other cool posting list update patterns • Add remaining missing Lucene features to RT index • Index term statistics for ranking • Term vectors • Stored fields
  • 65. Searching for top entities within Tweets • Task: Find the best photos in a subset of tweets • We could use a Lucene index, where each photo is a document • Problem: How to update existing documents when the same photos are tweeted again? • In-place posting list updates are hard • Lucene’s updateDocument() is a delete/add operation - expensive and not order-preserving
  • 66. Searching for top entities within Tweets • Task: Find the best photos in a subset of tweets • Could we use our existing time-ordered tweet index? • Facets!
  • 67. Searching for top entities within Tweets Query Doc ids Inverted index Term id Term label Forward Doc id index Document Metadata Facet index Doc id Term ids
  • 68. Storing tweet metadata Facet Doc id index Term ids
  • 69. 5 15 9000 9002 100000 100090 Matching doc id Facet index Term ids Top-k heap Id Count 48239 8 31241 2 Query Searching for top entities within Tweets
  • 70. 5 15 9000 9002 100000 100090 Matching doc id Facet index Term ids Top-k heap Id Count 48239 15 31241 12 85932 8 6748 3 Query Searching for top entities within Tweets
  • 71. Searching for top entities within Tweets 5 15 9000 9002 100000 100090 Matching doc id Facet index Term ids Top-k heap Id Count 48239 15 31241 12 85932 8 6748 3 Query Weighted counts (from engagement features) used for relevance scoring
  • 72. Searching for top entities within Tweets 5 15 9000 9002 100000 100090 Matching doc id Facet index Term ids Top-k heap Id Count 48239 15 31241 12 85932 8 6748 3 Query All query operators can be used. E.g. find best photos in San Francisco tweeted by people I follow
  • 73. Searching for top entities within Tweets Inverted Term id index Term label
  • 74. Searching for top entities within Tweets Id Count Label Count pic.twitter.com/jknui4w 45 pic.twitter.com/dslkfj83 23 pic.twitter.com/acm3ps 15 pic.twitter.com/948jdsd 11 pic.twitter.com/dsjkf15h 8 pic.twitter.com/irnsoa32 5 48239 45 31241 23 85932 15 6748 11 74294 8 3728 5 Inverted index
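The counting flow on the preceding slides (matching doc ids, facet index lookup, top-k heap, label resolution) can be sketched as follows. The class `FacetTopK` and its map-based facet index are hypothetical simplifications, and the sketch uses plain counts rather than the weighted counts from engagement features mentioned in the talk.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class FacetTopK {
    /**
     * facetIndex maps doc id -> facet term ids.
     * Returns up to k {termId, count} pairs, highest count first.
     */
    static List<long[]> topK(int[] matchingDocIds, Map<Integer, int[]> facetIndex, int k) {
        Map<Integer, Long> counts = new HashMap<>();
        for (int docId : matchingDocIds) {
            int[] termIds = facetIndex.get(docId);
            if (termIds == null) {
                continue;   // this document carries no facets
            }
            for (int termId : termIds) {
                counts.merge(termId, 1L, Long::sum);
            }
        }
        // Min-heap ordered by count: the root is always the weakest of the current top k.
        Comparator<long[]> byCount = (a, b) -> Long.compare(a[1], b[1]);
        PriorityQueue<long[]> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<Integer, Long> entry : counts.entrySet()) {
            heap.add(new long[] {entry.getKey(), entry.getValue()});
            if (heap.size() > k) {
                heap.poll();   // evict the entry with the smallest count
            }
        }
        List<long[]> result = new ArrayList<>(heap);
        result.sort(byCount.reversed());
        return result;
    }

    /** Resolves term ids to labels, as the inverted index does in the final step. */
    static List<String> labels(List<long[]> top, Map<Integer, String> termLabels) {
        List<String> out = new ArrayList<>();
        for (long[] entry : top) {
            out.add(termLabels.get((int) entry[0]) + " " + entry[1]);
        }
        return out;
    }
}
```

The matching doc ids can come from any query, so the ranking step stays independent of which query operators were used.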
  • 75. Summary • Indexing tweet entities (e.g. photos) as facets makes it possible to search and rank top entities using a tweet index • All query operators supported • Documents don’t need to be reindexed • Approach reusable for different use cases, e.g.: best vines, hashtags, @mentions, etc.