Search Engines: How They Work and Why You Need Them
Agenda
1. Why build search engines?
2. Search indexes
3. Open source tools
4. Interesting challenges
What do you even do all day?
We have Google.
@scarletdrive
Not all search engines are web search engines.
google.com                       potatoparcel.com
Large scope (entire internet)    Small scope (just a few potatoes)
No control over content          Total control over content
Many use cases                   Optimize for selling potatoes
Most websites have a custom search engine.
Why build search engines?
● Keep it local and customize it
Let’s try to search my store.
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
Query: cat
SELECT *
FROM items
WHERE title LIKE '%cat%'
n = items in database
m = max length of title strings
Cost of the LIKE scan: roughly n·m character comparisons
If m is capped (m = 250), it is a constant, so the scan is O(n)
n            n · m (m = 250)
10           2 500
100          25 000
1 000        250 000
10 000       2 500 000
100 000      25 000 000
1 000 000    250 000 000
Why build search engines?
● Keep it local and customize it
● Improve performance
Problems with LIKE substring matching:
● Search for “cat” incorrectly returns “vacation hat for dog”
● Search for “cat” doesn’t return “kitten mittens”
● Search for “cats” doesn’t return “cat hat” or “red cat mittens”
SELECT *
FROM items
WHERE title LIKE '%cats%'
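These problems are not specific to SQL: any plain substring match behaves the same way. A minimal Python sketch, assuming the eight titles from the items table are held in a dict; the `substring_search` helper is hypothetical and only mimics the `LIKE '%...%'` pattern for illustration:

```python
# Titles copied from the items table, keyed by id.
ITEMS = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def substring_search(query):
    """Mimic WHERE title LIKE '%query%': scan every title for the substring."""
    return [item_id for item_id, title in ITEMS.items() if query in title]


cat_ids = substring_search("cat")    # includes id 4 ("vaCATion"), misses id 7
cats_ids = substring_search("cats")  # only finds the literal string "cats"
print(cat_ids)
print(cats_ids)
```

Every query touches every title, which is the n·m cost discussed earlier, and the match has no notion of word boundaries, plurals, or synonyms.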
SELECT * FROM items
WHERE title LIKE 'cat' OR title LIKE 'cats'
OR title LIKE 'cat %' OR title LIKE 'cats %'
OR title LIKE '% cat' OR title LIKE '% cats'
OR title LIKE '% cat %' OR title LIKE '% cats %'
OR title LIKE '% cat.%' OR title LIKE '% cats.%'
OR title LIKE '%.cat %' OR title LIKE '%.cats %'
OR title LIKE '%.cat.%' OR title LIKE '%.cats.%'
OR title LIKE '% cat,%' OR title LIKE '% cats,%'
OR title LIKE '%,cat %' OR title LIKE '%,cats %'
OR title LIKE '%,cat,%' OR title LIKE '%,cats,%'
OR title LIKE '% cat-%' OR title LIKE '% cats-%'
OR title LIKE '%-cat %' OR title LIKE '%-cats %'
OR title LIKE '%-cat-%' OR title LIKE '%-cats-%'
...
Why build search engines?
● Keep it local and customize it
● Improve performance
● Improve quality of results
But how?
Agenda
1. Why build search engines? ✓
2. Search indexes
3. Open source tools
4. Interesting challenges
red [1, 6]
cat [1, 3, 5]
mitten [2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
Inverted Index
Terminology
● A document is a single searchable unit
  (e.g. the row: 7 kitten mittens 11.99)
● A field is a defined value in a document
  (e.g. id, title, price)
● A term is a value extracted from the source in order to build the index
  (e.g. kitten, mitten)
● An inverted index is an internal data structure which maps terms to IDs
  (e.g. cat → [1, 3, 5])
● An index is a collection of documents (including many inverted indexes)
title inverted index
red [1, 6]
cat [1, 3, 5]
mitten [2, 7]
blue [2, 3, 6]
... ...

price inverted index
5.00 [5]
8.00 [3]
0-10.00 [3, 5]
11.99 [7, 8]
... ...

items
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
... ... ...
Terminology
● A search index can have many inverted indexes
● A search engine can have many search indexes

items index
  title inverted index
  price inverted index
blog-posts index
  title inverted index
  post inverted index
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance
● Improve quality of results
red [1, 6]
cat [1, 3, 5]
mitten [2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
Query: cat
Looking up one term in the inverted index: O(1)

cat → [1, 3, 5], then fetch the matching documents:
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00

r = number of results found
Lookup plus fetching the results: O(1+r)
...but we usually only ask for a fixed number of results at a time (for example 25)
O(25) → O(1)
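The lookup-then-fetch cost above can be sketched in Python. This assumes the title inverted index is a plain dict (a hash map, so a term lookup is constant time on average) and that documents are stored in a second dict keyed by id; real engines use more elaborate on-disk structures. The `search` function and its `limit` parameter are hypothetical names for illustration:

```python
# Inverted index for the title field, copied from the slide.
TITLE_INDEX = {
    "red": [1, 6],
    "cat": [1, 3, 5],
    "mitten": [2, 7],
    "blue": [2, 3, 6],
    "hat": [3, 4, 5, 6],
    "dog": [2, 4, 6, 8],
    "vacation": [4],
    "kitten": [7],
    "boot": [8],
}

# Document store: id -> (title, price), copied from the items table.
ITEMS = {
    1: ("red cat mittens", 14.99),
    2: ("blue dog mittens", 24.99),
    3: ("blue hat for cats", 8.00),
    4: ("vacation hat for dog", 12.99),
    5: ("cat hat", 5.00),
    6: ("red and blue dog hat", 10.49),
    7: ("kitten mittens", 11.99),
    8: ("dog booties", 11.99),
}


def search(term, limit=25):
    """Return up to `limit` (id, title, price) rows for one already-normalized term."""
    ids = TITLE_INDEX.get(term, [])  # one dict lookup, independent of n
    return [(i, *ITEMS[i]) for i in ids[:limit]]  # at most `limit` fetches


for row in search("cat"):
    print(row)
```

Because `limit` is fixed, the work per query stays bounded no matter how many items the store holds.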
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance ✓
● Improve quality of results
But at what cost?
Trade-offs
● Space
● System complexity
● Pre-processing time
Query time: O(1)
Index time: O(n·m·p)
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance ✓
○ At the expense of space, complexity, and pre-processing effort
● Improve quality of results
Let’s talk about how we build it.
red [1, 6]
cat [1, 3, 5]
mitten [2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
How did we do this??
Step 1: Tokenization
string: “cat hat”
array: [“cat”, “hat”]
Step 2: Normalization
● Stemming
○ “cats” → “cat”
○ “walking” → “walk”
● Stop words
○ Remove “the”, “and”, “to”, etc...
Step 3: Filters
● Lowercase
○ “Dog” → “dog”
● Synonyms
○ “colour” → “color”
○ “t-shirt” → “tshirt”
○ “canadian” → “canada”
○ “kitten” → “cat”
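The three steps can be sketched as a small analysis pipeline. This is an illustrative sketch, not how any particular engine implements it: the suffix-stripping `stem` function is a deliberately naive stand-in for a real stemmer, the stop-word and synonym tables are tiny hypothetical samples, and lowercasing is applied first (an ordering assumption) so that the later lookups match.

```python
import re

# Hypothetical sample tables; real deployments use much larger lists.
STOP_WORDS = {"the", "and", "to", "for", "a"}
SYNONYMS = {
    "colour": "color",
    "t-shirt": "tshirt",
    "canadian": "canada",
    "kitten": "cat",
}


def tokenize(text):
    """Step 1: split a string into word tokens (hyphens kept inside words)."""
    return re.findall(r"[\w-]+", text)


def stem(token):
    """Naive suffix stripping, e.g. cats -> cat, walking -> walk, booties -> boot."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3]
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    """Run tokenization, normalization, and filters; return the list of terms."""
    terms = []
    for token in tokenize(text):
        token = token.lower()              # filter: lowercase
        if token in STOP_WORDS:            # normalization: drop stop words
            continue
        token = stem(token)                # normalization: stemming
        terms.append(SYNONYMS.get(token, token))  # filter: synonyms
    return terms


print(analyze("Blue hat for CATS"))
print(analyze("kitten mittens"))
```

The output of `analyze` is what gets written into the inverted index, one posting per term per document.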
Quality Problems
1. “cat” search returned “vacation hat for dog”
id title price
4 vacation hat for dog 12.99
Terms in the index:
cat [1, 3, 5]
hat [4]
dog [4]
vacation [4]
Query “cat” looks up the whole term “cat”; document 4 is only listed under “hat”, “dog”, and “vacation”, so it is no longer returned.
Quality Problems
2. “cats” search does not return “red cat mittens”
id title price
1 red cat mittens 14.99
red [1]
cat [1]
mitten [1]
All transformations performed on the input data for the index are also performed on the query.
Query: “cats” → “cat”, which matches document 1.
Quality Problems
3. “cat” search does not return “kitten mittens”
id title price
7 kitten mittens 11.99
cat [7]
mitten [7]
The synonym filter indexes “kitten” as “cat”, so the query “cat” now matches document 7.
3 ½. Search for “kitten” still returns “kitten mittens”
Query: “kitten” → “cat”, which matches document 7.
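The three fixes can be checked end to end with a short sketch. It assumes a compact analyzer built from lookup tables (the `STEMS` and `SYNONYMS` tables below are hypothetical and cover only the words in the items table), builds the title inverted index, and runs the same analyzer on the query before the lookup:

```python
from collections import defaultdict

ITEMS = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}

STOP_WORDS = {"and", "for"}
STEMS = {"cats": "cat", "mittens": "mitten", "booties": "boot"}  # toy stemmer
SYNONYMS = {"kitten": "cat"}


def analyze(text):
    """Lowercase, split on whitespace, drop stop words, stem, map synonyms."""
    terms = []
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        token = STEMS.get(token, token)
        terms.append(SYNONYMS.get(token, token))
    return terms


def build_index(items):
    """Map each term to the ids of the documents whose title contains it."""
    index = defaultdict(list)
    for item_id, title in items.items():
        for term in analyze(title):
            if item_id not in index[term]:
                index[term].append(item_id)
    return index


INDEX = build_index(ITEMS)


def search(query):
    """Analyze the query exactly like the documents, then look up each term."""
    results = []
    for term in analyze(query):
        for item_id in INDEX.get(term, []):
            if item_id not in results:
                results.append(item_id)
    return results


print(search("cat"))
print(search("cats"))
print(search("kitten"))
```

All three queries resolve to the single term "cat", so they return the same ids, and document 4 ("vacation hat for dog") is absent from every result.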
Did we solve it?
● Keep it local ✓ and customize it ✓
● Improve performance ✓
○ At the expense of space, complexity, and pre-processing effort
● Improve quality of results ✓
○ By performing special pre-processing steps
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools
4. Interesting challenges
I want a search engine... do I have to build it myself?
● Inverted index
● Basic tokenization, normalization, and filters
● Replication, sharding, and distribution
● Caching and warming
● Advanced tokenization, normalization, and filters
● Plugins
● ...and more!
Which one should I pick?
It doesn’t matter
● Most projects work well with either
● Getting configuration right is most important
● Test with your own data, your own queries
Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
Solr vs. Elasticsearch by Kelvin Tan
http://solr-vs-elasticsearch.com/
Which one should I pick?
Better for advanced customization
Easier to learn, faster to start up, better docs
~ ~ WARNING: Toria’s personal opinion ~ ~
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools ✓
4. Interesting challenges
Interesting Challenge: Scalability
Too much traffic? → Replication (every update must reach every replica)
Too much data? → Sharding and distribution
Replication, Sharding, and Distribution
8 shards (A,B,C,D,E,F,G,H)
3 replicas each
6 servers
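One way to picture the numbers above is a simple round-robin placement. This is a sketch under stated assumptions: "3 replicas each" is read as three copies of every shard in total, and the round-robin rule is a hypothetical placement policy, not the layout from the original diagram or the allocation logic of any real engine.

```python
from collections import defaultdict

SHARDS = "ABCDEFGH"        # 8 shards
REPLICAS_PER_SHARD = 3     # assumed to mean 3 copies of each shard in total
SERVERS = 6


def place_replicas():
    """Assign the 24 shard copies to servers 0..5 in round-robin order."""
    placement = defaultdict(list)  # server number -> shard copies it hosts
    copy_number = 0
    for shard in SHARDS:
        for _ in range(REPLICAS_PER_SHARD):
            placement[copy_number % SERVERS].append(shard)
            copy_number += 1
    return dict(placement)


placement = place_replicas()
for server, shards in sorted(placement.items()):
    print(server, shards)
```

With 8 × 3 = 24 copies over 6 servers, each server hosts 4 copies, and the three copies of any one shard land on three different servers, so losing a single server never removes every copy of a shard.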
Interesting Challenge: Relevance
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00
22 feather cat toy 7.99
124 cat and mouse t-shirt 24.50
128 cat t-shirt 31.80
329 “cats rule” sticker 0.99
420 catnip joint for cats 5.99
455 cat toy 7.00
... ... ...
When there are many results, what order should we display them in?
tf-idf
TF(term) = (# times this term appears in doc) / (total # terms in doc)
IDF(term) = log_e(total number of docs / # docs which contain this term)

Relevance with tf-idf
Query: “cat”
1. The orange cat is a very good cat.
2. My cat ate an orange.
3. Cats are the best and I will give every cat a special cat toy.

1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 1/5 = 0.20
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3) = 0, the same for every document, so TF alone decides the order
Result order = [1, 3, 2]
Relevance with tf-idf
Query: “cat”
Adding extra “cat” tokens to document 2 moves it to the top:
1. The orange cat is a very good cat.
2. My cat ate an orange. Cat cat cat!
3. Cats are the best and I will give every cat a special cat toy.

1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 4/8 = 0.50
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3) = 0
Result order = [2, 1, 3]
Relevance with tf-idf
Query: “orange cat”
1. The orange cat is a good cat.
2. My cat ate an orange.
(assume 100 records which all contain “cat” in them; only these 2 contain “orange”)

IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9

TF(cat, doc1) = 2/7 = 0.29
TF(cat, doc2) = 1/5 = 0.20
TF(orange, doc1) = 1/7 = 0.14
TF(orange, doc2) = 1/5 = 0.20

score(doc1) = TF(cat, doc1)*IDF(cat) + TF(orange, doc1)*IDF(orange) = 0.29*0.0 + 0.14*3.9 = 0.55
score(doc2) = TF(cat, doc2)*IDF(cat) + TF(orange, doc2)*IDF(orange) = 0.20*0.0 + 0.20*3.9 = 0.78

Result order = [2, 1]
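The “orange cat” calculation can be reproduced with a few lines of Python. This is a sketch of the textbook formulas from the slides, not the scoring code of any search engine; the corpus statistics (`TOTAL_DOCS`, `DOC_FREQ`) are the assumed values from the example, and the helper names are hypothetical. Because it uses unrounded intermediate values, the first score comes out near 0.56 rather than the 0.55 obtained from rounded inputs.

```python
import math

DOCS = {
    1: "The orange cat is a good cat.",
    2: "My cat ate an orange.",
}

# Corpus statistics assumed by the example: 100 docs, all contain "cat",
# and 2 of them contain "orange".
TOTAL_DOCS = 100
DOC_FREQ = {"cat": 100, "orange": 2}


def tokenize(text):
    """Lowercase, drop periods, split on whitespace."""
    return text.lower().replace(".", "").split()


def tf(term, tokens):
    """Term frequency: occurrences of the term / total terms in the doc."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """Inverse document frequency with the natural logarithm."""
    return math.log(TOTAL_DOCS / DOC_FREQ[term])


def score(query, text):
    """Sum of TF * IDF over the query terms."""
    tokens = tokenize(text)
    return sum(tf(term, tokens) * idf(term) for term in tokenize(query))


scores = {doc_id: score("orange cat", text) for doc_id, text in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Since IDF(cat) is zero, the term "cat" contributes nothing to either score; the ranking is decided entirely by how prominent the rarer term "orange" is in each document.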
tf-idf
bm25
https://elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
Relevance Challenges
● Prevent keyword stuffing or other “gaming the system”
● Phrase matching
● Fuzzy matching
● User factors: language, location
● Other factors: quality, recency, randomness, diversity
Interesting Challenges
We covered:
● Scalability
● Relevance
We did not cover:
● Query understanding
● Numeric range search
● Faceted search
● Autocomplete
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools ✓
4. Interesting challenges ✓
Thanks!

Search Engines: How They Work and Why You Need Them

  • 1. Search Engines How They Work and Why You Need Them
  • 4. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 5. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 6. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 7. What do you even do all day? We have Google. @scarletdrive
  • 8. Not all search engines are web search engines. @scarletdrive
  • 9. google.com potatoparcel.com Large scope (entire internet) Small scope (just a few potatoes) No control over content Total control over content Many use cases Optimize for selling potatoes
  • 12. Most websites have a custom search engine. @scarletdrive
  • 13. Why build search engines? ● Keep it local and customize it
  • 15. Let’s try to search my store. @scarletdrive
  • 16. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99
  • 17. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 cat SELECT * FROM items WHERE title LIKE ‘%cat%’
  • 18. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 cat SELECT * FROM items WHERE title LIKE ‘%cat%’
  • 19. n = items in database m = max length of title strings n·m
  • 20. n = items in database m = max length of title strings = 250 O(n)
  • 21. n n · m (m=250) 10 2 500 100 25 000 1 000 250 000 10 000 2 500 000 100 000 25 000 000 1 000 000 250 000 000
  • 22. Why build search engines? ● Keep it local and customize it ● Improve performance
  • 23. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 SELECT * FROM items WHERE title LIKE ‘%cat%’
  • 24. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 ● Search for “cat” incorrectly returns “vacation hat for dog” SELECT * FROM items WHERE title LIKE ‘%cat%’
  • 25. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 ● Search for “cat” incorrectly returns “vacation hat for dog” ● Search for “cat” doesn’t return “kitten mittens” SELECT * FROM items WHERE title LIKE ‘%cat%’
  • 26. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 ● Search for “cat” incorrectly returns “vacation hat for dog” ● Search for “cat” doesn’t return “kitten mittens” ● Search for “cats” doesn’t return “cat hat” or “red cat mittens” SELECT * FROM items WHERE title LIKE ‘%cats%’
  • 27. SELECT * FROM items WHERE title LIKE ‘cat’ OR title LIKE ‘cats’ OR title LIKE ‘cat %’ OR title LIKE ‘cats %’ OR title LIKE ‘% cat’ OR title LIKE ‘% cats’ OR title LIKE ‘% cat %’ OR title LIKE ‘% cats %’ OR title LIKE ‘% cat.%’ OR title LIKE ‘% cats.%’ OR title LIKE ‘%.cat %’ OR title LIKE ‘%.cats %’ OR title LIKE ‘%.cat.%’ OR title LIKE ‘%.cats.%’ OR title LIKE ‘% cat,%’ OR title LIKE ‘% cats,%’ OR title LIKE ‘%,cat %’ OR title LIKE ‘%,cats %’ OR title LIKE ‘%,cat,%’ OR title LIKE ‘%,cats,%’ OR title LIKE ‘% cat-%’ OR title LIKE ‘% cats-%’ OR title LIKE ‘%-cat %’ OR title LIKE ‘%-cats %’ OR title LIKE ‘%-cat-%’ OR title LIKE ‘%-cats-%’ ...
  • 28. Why build search engines? ● Keep it local and customize it ● Improve performance ● Improve quality of results
  • 30. Agenda 1. Why build search engines? ✓ 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 31. red [1, 6] cat [1, 3, 5] mitten [2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99
  • 32. red [1, 6] cat [1, 3, 5] mitten [2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] Inverted Index
  • 33. Terminology ● A document is a single searchable unit red [1, 6] cat [1, 3, 5] mitten [2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] 7 kitten mittens 11.99
  • 34. Terminology ● A document is a single searchable unit ● A field is a defined value in a document red [1, 6] cat [1, 3, 5] mitten [2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 7 kitten mittens 11.99
  • 35. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index red [1, 6] cat [1, 3, 5] mitten [2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 7 kitten mittens 11.99
  • 36. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index ● An inverted index is an internal data structure which maps terms to IDs red [1, 6] cat [1, 3, 5] mitten [2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8]
  • 37. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index ● An inverted index is an internal data structure which maps terms to IDs ● An index is a collection of documents (including many inverted indexes) red [1, 6] cat [1, 3, 5] mitten [2, 7] blue [2, 3, 6] ... ... 5.00 [5] 8.00 [3] 0-10.00 [3, 5] 11.99 [7, 8] ... ... id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 ... ... ...
  • 38. items indexTerminology ● A search index can have many inverted indexes ● A search engine can have many search indexes title inverted index price inverted index blog-posts index title inverted index post inverted index
  • 39. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ● Improve quality of results
  • 40. red [1, 6] cat [1, 3, 5] mitten [2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] cat
  • 41. O(1)
  • 42. red [1, 6] cat [1, 3, 5] mitten [2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] cat id title price 1 red cat mittens 14.99 3 blue hat for cats 8.00 5 cat hat 5.00
  • 43. r = number of results found O(1+r)
  • 44. ...but we usually only ask for a fixed number of results at a time O(25) → O(1)
  • 45. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ✓ ● Improve quality of results
  • 47. Trade-offs ● Space ● System complexity ● Pre-processing time
  • 49. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ✓ ○ At the expense of space, complexity, and pre-processing effort ● Improve quality of results
  • 50. Let’s talk about how we build it. @scarletdrive
  • 51. red [1, 6] cat [1, 3, 5] mitten [2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 How did we do this??
  • 52. Step 1: Tokenization string: “cat hat” array: [“cat”, “hat”] Image from aliexpress.com
  • 53. Image from aliexpress.com Step 2: Normalization ● Stemming ○ “cats” → “cat” ○ “walking” → “walk” ● Stop words ○ Remove “the”, “and”, “to”, etc...
  • 54. Image from aliexpress.com Step 3: Filters ● Lowercase ○ “Dog” → “dog” ● Synonyms ○ “colour” → “color” ○ “t-shirt” → “tshirt” ○ “canadian” → “canada” ○ “kitten” → “cat”
  • 55. Quality Problems 1. “cat” search returned “vacation hat for dog”
  • 56. Quality Problems 1. “cat” search returned “vacation hat for dog” id title price 4 vacation hat for dog 12.99 cat [1, 3, 5] hat [4] dog [4] vacation [4]
  • 57. Quality Problems 1. “cat” search returned “vacation hat for dog” cat [1, 3, 5] hat [4] dog [4] vacation [4] cat id title price 4 vacation hat for dog 12.99
  • 58. Quality Problems 1. “cat” search returned “vacation hat for dog” 2. “cats” search does not return “red cat mittens”
  • 59. Quality Problems 2. “cats” search does not return “red cat mittens” id title price 1 red cat mittens 14.99 red [1] cat [1] mitten [1] →
  • 60. All transformations performed on the input data for the index are also performed on the query
  • 61. Quality Problems 2. “cats” search does not return “red cat mittens” id title price 1 red cat mittens 14.99 red [1] cat [1] mitten [1] cats cat
  • 62. Quality Problems 1. “cat” search returned “vacation hat for dogs” 2. “cats” search does not return “red cat mittens” 3. “cat” search does not return “kitten mittens”
  • 63. Quality Problems 3. “cat” search does not return “kitten mittens” id title price 7 kitten mittens 11.99 cat [7] mitten [7]
  • 64. Quality Problems 3. “cat” search does not return “kitten mittens” cat [7] mitten [7] id title price 7 kitten mittens 11.99 cat
  • 65. Quality Problems 3 ½ search for “kitten” still returns “kitten mittens” cat [7] mitten [7] id title price 7 kitten mittens 11.99 kitten cat
  • 66. Did we solve it? ● Keep it local ✓ and customize it ✓ ● Improve performance ✓ ○ At the expense of space, complexity, and pre-processing effort ● Improve quality of results ✓ ○ By performing special pre-processing steps
  • 67. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools 4. Interesting challenges
  • 68. I want a search engine... do I have to build it myself? @scarletdrive
  • 70. ● Inverted index ● Basic tokenization, normalization, and filters ● Replication, sharding, and distribution ● Caching and warming ● Advanced tokenization, normalization, and filters ● Plugins ● ...and more!
  • 71. Which one should I pick? It doesn’t matter
  • 72. Which one should I pick? ● Most projects work well with either ● Getting configuration right is most important ● Test with your own data, your own queries Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability Solr vs. Elasticsearch by Kelvin Tan http://solr-vs-elasticsearch.com/
  • 73. Which one should I pick? Better for advanced customization Easier to learn, faster to start up, better docs ~ ~ WARNING: Toria’s personal opinion ~ ~
  • 74. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools ✓ 4. Interesting challenges
  • 79. Replication, Sharding, and Distribution 8 shards (A,B,C,D,E,F,G,H) 3 replicas each 6 servers
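
The slide's diagram is not in this transcript, so the following is only one possible way to spread 8 shards with 3 replicas each over 6 servers. The `place_replicas` function and its stride rule are hypothetical; they are not how Elasticsearch or Solr allocate shards.

```python
from collections import defaultdict

def place_replicas(shards, replicas, servers):
    """Assign every replica of every shard to a server.

    Copies of the same shard are spaced `stride` servers apart so they
    never share a server. Assumes `servers` is a multiple of `replicas`.
    """
    stride = servers // replicas
    placement = defaultdict(list)
    for s, shard in enumerate(shards):
        for r in range(replicas):
            server = (s + r * stride) % servers
            placement[server].append(f"{shard}{r + 1}")
    return dict(placement)

layout = place_replicas("ABCDEFGH", replicas=3, servers=6)
for server in sorted(layout):
    print(server, layout[server])
```

With these numbers every server holds 4 of the 24 shard copies, and losing any one server still leaves two copies of each shard it held.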
  • 82. id title price 1 red cat mittens 14.99 3 blue hat for cats 8.00 5 cat hat 5.00 22 feather cat toy 7.99 124 cat and mouse t-shirt 24.50 128 cat t-shirt 31.80 329 “cats rule” sticker 0.99 420 catnip joint for cats 5.99 455 cat toy 7.00 ... ... ... When there are many results, what order should we display them in?
  • 84. Relevance with tf-idf. TF(term) = # times this term appears in doc / total # terms in doc. IDF(term) = log_e(total number of docs / # docs which contain this term). Docs: 1. The orange cat is a very good cat. 2. My cat ate an orange. 3. Cats are the best and I will give every cat a special cat toy. Query: “cat”. TF(cat): doc 1 = 2/8 = 0.25; doc 2 = 1/5 = 0.20; doc 3 = 3/14 = 0.21. IDF(cat) = log_e(3/3) = 0.0. Result order = [1, 3, 2]
  • 85. Relevance with tf-idf. TF(term) = # times this term appears in doc / total # terms in doc. IDF(term) = log_e(total number of docs / # docs which contain this term). Docs: 1. The orange cat is a very good cat. 2. My cat ate an orange. Cat cat cat! 3. Cats are the best and I will give every cat a special cat toy. Query: “cat”. TF(cat): doc 1 = 2/8 = 0.25; doc 2 = 4/8 = 0.50; doc 3 = 3/14 = 0.21. IDF(cat) = log_e(3/3) = 0.0. Result order = [2, 1, 3]
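
The term-frequency ranking on these two slides can be reproduced with a few lines. This is a sketch: the `stem`, `tokens`, `tf`, and `rank` helpers are hypothetical, and the plural stripping is a crude assumption that lets “Cats” count as “cat”, as the slide's 3/14 implies.

```python
import re

def stem(token):
    # Naive plural stripping (assumption, not a real stemmer).
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def tokens(text):
    """Lowercase, keep alphabetic runs only, and stem each one."""
    return [stem(t) for t in re.findall(r"[a-z]+", text.lower())]

def tf(term, text):
    """Occurrences of `term` divided by the total number of terms."""
    toks = tokens(text)
    return toks.count(term) / len(toks)

def rank(term, docs):
    """Doc ids ordered by descending term frequency."""
    return sorted(docs, key=lambda d: tf(term, docs[d]), reverse=True)

docs = {
    1: "The orange cat is a very good cat.",
    2: "My cat ate an orange.",
    3: "Cats are the best and I will give every cat a special cat toy.",
}
print(rank("cat", docs))  # [1, 3, 2]

# Keyword stuffing: repeating "cat" pushes doc 2 to the top.
stuffed = dict(docs)
stuffed[2] = "My cat ate an orange. Cat cat cat!"
print(rank("cat", stuffed))  # [2, 1, 3]
```

Because every document contains “cat”, IDF(cat) is 0 and the order here comes from TF alone.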
  • 86. Relevance with tf-idf. TF(term) = # times this term appears in doc / total # terms in doc. IDF(term) = log_e(total number of docs / # docs which contain this term). Docs: 1. The orange cat is a good cat. 2. My cat ate an orange. (Assume 100 records which all contain “cat” in them.) Query: “orange cat”. IDF(cat) = log_e(100/100) = 0.0. IDF(orange) = log_e(100/2) = 3.9
  • 87. Relevance with tf-idf. TF(term) = # times this term appears in doc / total # terms in doc. IDF(term) = log_e(total number of docs / # docs which contain this term). Docs: 1. The orange cat is a good cat. 2. My cat ate an orange. Query: “orange cat”. IDF(cat) = log_e(100/100) = 0.0. IDF(orange) = log_e(100/2) = 3.9. score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55. score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
  • 88. Relevance with tf-idf. TF(term) = # times this term appears in doc / total # terms in doc. IDF(term) = log_e(total number of docs / # docs which contain this term). Docs: 1. The orange cat is a good cat. 2. My cat ate an orange. Query: “orange cat”. IDF(cat) = log_e(100/100) = 0.0. IDF(orange) = log_e(100/2) = 3.9. score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55. score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78. Result order = [2, 1]. Slide annotations: 3/7 = 0.43, 2/5 = 0.40, 1/7 = 0.14, 1/5 = 0.20
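
The “orange cat” scoring can be checked with a short script. This is a sketch under the slide's own assumption of 100 records, all containing “cat” and 2 containing “orange”; the `TOTAL_DOCS` and `DOC_FREQ` constants and the helper names are hypothetical.

```python
import math
import re

# Corpus statistics assumed on the slide, not computed from real data.
TOTAL_DOCS = 100
DOC_FREQ = {"cat": 100, "orange": 2}

def tokens(text):
    """Lowercase and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

def tf(term, text):
    """Occurrences of `term` divided by the total number of terms."""
    toks = tokens(text)
    return toks.count(term) / len(toks)

def idf(term):
    """Natural log of (total docs / docs containing the term)."""
    return math.log(TOTAL_DOCS / DOC_FREQ[term])

def score(query, text):
    """Sum of tf * idf over the query terms."""
    return sum(tf(t, text) * idf(t) for t in tokens(query))

doc1 = "The orange cat is a good cat."
doc2 = "My cat ate an orange."

for name, doc in (("doc1", doc1), ("doc2", doc2)):
    print(name, round(score("orange cat", doc), 2))
```

Without intermediate rounding doc1 scores about 0.56 rather than the slide's 0.55, and doc2 about 0.78. The order [2, 1] is unchanged: “cat” contributes nothing because its IDF is 0, so the rarer term “orange” decides the ranking.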
  • 90. Relevance Challenges ● Prevent keyword stuffing or other “gaming the system” ● Phrase matching ● Fuzzy matching ● User factors: language, location ● Other factors: quality, recency, randomness, diversity
  • 91. Interesting Challenges. We covered: ● Scalability ● Relevance. We did not cover: ● Query understanding ● Numeric range search ● Faceted search ● Autocomplete
  • 92. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools ✓ 4. Interesting challenges ✓