Approaching Join Index
yet another join algorithm
Mikhail Khludnev
principal engineer
• Grid Dynamics is a Silicon Valley-based leading provider of scalable, next-generation
commerce technology solutions
• Record of outperformance with Tier 1 retail clients
• Fortune 1000 client relationships
About Me
● principal engineer at Grid Dynamics
● spoke at the last few Lucene Revolution conferences
● contributed the BlockJoin query parser for Solr - {!parent}
● blogged about it at http://blog.griddynamics.com/
● tried to fix threading in DataImportHandler
http://google.com/+MikhailKhludnev
You are expected to know
● how Lucene searches/filters
● how it counts facets
● that there are segments
● what DocValues are
● why joins are needed
● RDBMS joins: nested loop join, sort-merge join and hash join.
I’m expected to know
● query-time join
● index-time join
● yet another join
Lucene/Solr/Elastic Is Strong
● searching
○ filtering
● analytics
○ facets
○ pivots
○ stats
There is a weakness
● robust joins
○ multiple entities
○ relations
Sample documents: one product and its SKUs
PROD_ID: 1, TYPE: PROD, BRAND: Nike, NAME: Shoes, PRICE: $50
SKU_ID: 11, PROD_ID: 1, TYPE: SKU, SIZE: 7, COLOR: Black
SKU_ID: 12, PROD_ID: 1, TYPE: SKU, SIZE: 8, COLOR: Green
SKU_ID: 13, PROD_ID: 1, TYPE: SKU, SIZE: 8, COLOR: Blue
Joins in General
[diagram, built up in steps: parents and children related 1:M through PK=FK]
Executing Join Query
[diagram, built up in steps: query q and filter fq applied to the joined sets of documents]
Join in General
parents ∩ join-relation ∩ children
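The intersection above can be illustrated with a small sketch. It is not Lucene code: documents are plain dicts modeled on the sample PROD/SKU records, the second product (Puma) is a hypothetical addition, and `join` is an illustrative helper name.

```python
# Toy catalog; list positions play the role of doc numbers.
docs = [
    {"TYPE": "PROD", "PROD_ID": 1, "BRAND": "Nike"},
    {"TYPE": "SKU", "SKU_ID": 11, "PROD_ID": 1, "SIZE": 7, "COLOR": "Black"},
    {"TYPE": "SKU", "SKU_ID": 12, "PROD_ID": 1, "SIZE": 8, "COLOR": "Green"},
    {"TYPE": "SKU", "SKU_ID": 13, "PROD_ID": 1, "SIZE": 8, "COLOR": "Blue"},
    {"TYPE": "PROD", "PROD_ID": 2, "BRAND": "Puma"},  # hypothetical extra product
    {"TYPE": "SKU", "SKU_ID": 21, "PROD_ID": 2, "SIZE": 8, "COLOR": "Blue"},
]


def join(docs, parent_pred, child_pred):
    """Return parent doc numbers: parents ∩ join-relation ∩ children."""
    parents = {i for i, d in enumerate(docs)
               if d["TYPE"] == "PROD" and parent_pred(d)}
    children = {i for i, d in enumerate(docs)
                if d["TYPE"] == "SKU" and child_pred(d)}
    # The join relation is the set of (parent, child) pairs with PK = FK.
    relation = {(p, c)
                for p, dp in enumerate(docs) if dp["TYPE"] == "PROD"
                for c, dc in enumerate(docs) if dc["TYPE"] == "SKU"
                and dp["PROD_ID"] == dc["PROD_ID"]}
    return sorted({p for p, c in relation if p in parents and c in children})


# Nike products that have a blue SKU.
result = join(docs, lambda d: d["BRAND"] == "Nike", lambda d: d["COLOR"] == "Blue")
```

Materializing the whole relation is what a real engine avoids; the join strategies on the following slides differ in how they represent and traverse it.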
JoinUtil
[diagram, built up over four slides: q selects documents whose join values are read from the per-document column FK[doc#] ("17", "17", "25", "25", "56", "56", "56", "25", "4", "61", "25"); the collected terms ("25", "17", ...) are looked up in the PK term dictionary ("1"→△, …, "17"→△, "25"→△, ...); the documents found are intersected with fq]
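The query-time flow on the JoinUtil slides can be approximated in a few lines. This is a sketch under assumptions, not the Lucene implementation: the term dictionary is a sorted Python list, the "to"-side doc numbers (100 to 104) are made up, and `join_util` is an illustrative name.

```python
from bisect import bisect_left

# FK value of every "from"-side document, addressed by doc number (FK[doc#]).
fk_by_doc = ["17", "17", "25", "25", "56", "56", "56", "25", "4", "61", "25"]

# PK term dictionary: sorted terms, each with its postings ("to"-side doc numbers).
# The doc numbers 100..104 are hypothetical.
pk_terms = ["17", "25", "4", "56", "61"]
pk_postings = [[100], [101], [102], [103], [104]]


def join_util(q_hits, fq_hits):
    # Phase 1: collect the distinct FK terms of the documents matching q.
    collected = sorted({fk_by_doc[doc] for doc in q_hits})
    # Phase 2: seek every collected term in the PK term dictionary (the term enum).
    joined = set()
    for term in collected:
        i = bisect_left(pk_terms, term)
        if i < len(pk_terms) and pk_terms[i] == term:
            joined.update(pk_postings[i])
    # Phase 3: intersect the joined documents with the filter.
    return sorted(joined & set(fq_hits))


hits = join_util(q_hits=[0, 2, 7], fq_hits=[101, 103])
```

Phase 2 costs one dictionary seek per collected term, which is one way to read the "expensive term enum" remark made about JoinUtil later in the deck.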
Block Join
[diagram, built up over six slides: documents laid out by doc# in parent/child blocks; a bit set over doc# (1 0 0 1 0 0 1 0 0 1 0) marks the parent documents; q and fq are evaluated against these blocks]
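A minimal sketch of the block-join idea follows. It assumes the usual Lucene block layout in which children are indexed immediately before their parent, so the parent of a child is the next set bit in the parent bit set; the concrete bit set and the function names are illustrative.

```python
# parent_bits[doc#] is 1 for parent documents; each block is children followed by the parent.
# This particular layout (three blocks of two children each) is an assumption.
parent_bits = [0, 0, 1, 0, 0, 1, 0, 0, 1]


def parent_of(child_doc):
    """Scan forward to the next set bit, which is the parent closing the block."""
    doc = child_doc
    while not parent_bits[doc]:
        doc += 1
    return doc


def to_parent_block_join(child_hits, parent_filter_hits):
    # Map matching children to their parents, then apply the parent-side filter.
    parents = {parent_of(c) for c in child_hits if not parent_bits[c]}
    return sorted(parents & set(parent_filter_hits))


block_hits = to_parent_block_join(child_hits=[0, 4], parent_filter_hits=[2, 5, 8])
```

No term lookup is involved, only doc-number arithmetic, which is why searching is fast; the price is that the relation is fixed by document order, so a change to one document means reindexing its whole block.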
Comparison
             JoinUtil   BlockJoin
searching    slow       fast
reindexing   fast       slow
Comparison
             JoinUtil         BlockJoin
searching    slow   <  ?  <   fast
reindexing   fast   >  ?  >   slow
Join Index
[diagram, built up over six slides: a per-document column doc#[doc#] stores doc numbers of related documents (values shown: 3, 6, 0, 3, 10, 0, 6, 3 on one side and 2, 4, 1, 6, 5 10, 9, 8 on the other); q and fq are connected by following these references]
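The doc#[doc#] idea can be sketched as a plain array lookup. The layout below is an assumption made for illustration (each child stores the doc number of its parent, parents store a sentinel); it does not reproduce the values on the slides, and `join_via_index` is a hypothetical name.

```python
NO_PARENT = -1

# join_index[doc#] = doc# of the related parent. Docs 3, 6 and 10 are parents.
# Unlike block join, children need not sit next to their parent.
join_index = [3, 6, 3, NO_PARENT, 10, 3, NO_PARENT, 6, 10, 10, NO_PARENT]


def join_via_index(child_hits, parent_filter_hits):
    allowed = set(parent_filter_hits)
    result = set()
    for child in child_hits:
        parent = join_index[child]  # one array read, no term dictionary seek
        if parent != NO_PARENT and parent in allowed:
            result.add(parent)
    return sorted(result)


index_hits = join_via_index(child_hits=[0, 1, 4], parent_filter_hits=[3, 6])
```

Because the array holds doc numbers, it has to be recomputed whenever doc numbers change, which is consistent with the deck rating reindexing for this approach as very slow.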
meanwhile… in LUCENE-6352
GlobalOrdinalsQuery
[diagram, built up over four slides]
SortedDocValues per document: "17", "17", "25", "25", "56", "56", "56", "25", "4", "61", "25"
Ordinal dictionary: 0 "17", 1 "25", 2 "4", 3 "56", 4 "61"
Ordinals per document: 0 0 1 1 3 3 3 1 2 4 1
Bit set over ordinals collected from q: 1 1 0 0 0, then checked against fq
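The ordinal-based join can be sketched as follows. It assumes a single join column shared by both sides and replaces Lucene's SortedDocValues with Python lists; `global_ordinals_join` is an illustrative name, not the LUCENE-6352 API.

```python
# Join value of every document, addressed by doc number.
values = ["17", "17", "25", "25", "56", "56", "56", "25", "4", "61", "25"]

# Sorted dictionary of distinct values; a value's position is its ordinal.
dictionary = sorted(set(values))
ordinals = [dictionary.index(v) for v in values]


def global_ordinals_join(q_hits, fq_hits):
    # Phase 1: mark the ordinals seen on documents matching q.
    seen = [False] * len(dictionary)
    for doc in q_hits:
        seen[ordinals[doc]] = True
    # Phase 2: keep fq documents whose ordinal is marked. Only integers are compared.
    return [doc for doc in sorted(fq_hits) if seen[ordinals[doc]]]


ordinal_hits = global_ordinals_join(q_hits=[0, 2], fq_hits=[1, 3, 4, 9])
```

With q matching documents 0 and 2, the marked ordinals are 0 and 1, the same 1 1 0 0 0 pattern shown on the slide. Terms are never looked up at query time, which removes the term enumeration that slows JoinUtil.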
Benchmarking 2.9 M docs
https://github.com/m-khl/lucene-solr/tree/dvjoin-benchmark-5-1
Latency, ms (the bigger the worse)
[bar chart; bar labels: BlockJoin (i-time), Join Index, JoinUtil (q-time), GlobalOrdinals; values shown: 7, 28, 14]
Comparison
             JoinUtil   JoinIndex   GlobalOrdinals   BlockJoin
searching    slow       fast        fast             faster anyway
reindexing   fast       uber slow   fast             slow
Indexing is still a problem
[diagram: the doc#[doc#] join index columns from the Join Index slides]
Further Plans 2.0
● incremental join-index update
● perhaps just calculate and cache it
● or put to dedicated index
● join in both directions
● calculate an optimal execution plan for segment enumeration
● edge case for benchmark
Summary
● Joins in General
● JoinUtil vs Block-join vs GlobalOrdinals
● updatable DocValues
● opportunities for improving query-time joins:
○ eliminate term enum
○ choose lower cardinality side for enumeration
○ GlobalOrdinalsJoin
References
● Searching relational content with Lucene's BlockJoinQuery
http://blog.mikemccandless.com/
● Solr Experience: search parent-child relations. Part I
Solr block-join support
http://blog.griddynamics.com/
● https://wiki.apache.org/solr/Join
● http://www.slideshare.net/martijnvg/document-relations
● SOLR-6234 - {!scorejoin }
● LUCENE-6352
● Updatable DocValues Under the Hood
http://shaierera.blogspot.com/
● Subject: How to openIfChanged the most recent merge?
at: java-dev@lucene.apache.org
Thanks for Joining us!
https://goo.gl/hjsYZW
Off scope
Joins’ Zoo in Lucene
True Joins
● query-time join
○ JoinUtil
○ {!join }
○ {!scorejoin } - SOLR-6234
● index-time join aka block-join {!parent}
Workarounds
● term positions/SpanQueries
● FieldCollapsing/Grouping
● term decoration
○ spatial
● multivalue fields
Two phase update problem
Subject: How to openIfChanged the most recent merge?
at: java-dev@lucene.apache.org
JoinUtil
● query-time
● indexing is fast
● searching is slow, why?
○ expensive term enum
○ single enumeration order
BlockJoin
● index-time
● reindexing the whole block is expensive, and it is mandatory
● searching is darn fast, however
○ can’t reorder child docs
store ref = (segment#, doc#)
put refs to the previous and current segments in DV
when adding a new segment, join IDs with the previous segments
- for a parent, just ref all children docnums
- for children, add a plain field refsToSeg:seg#
when scoring parents on some segment
- buffer them with the link refs, then
- intersect the buffered link refs with the children query on previous segments
- search all segments for refsToSeg:seg#, intersect with the children query, obtain the parent ref from DV and intersect it with the buffered refs
