Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Trey Grainger
Director of Engineering, Search & Recommendations
2015.10.15
About Me
Trey Grainger
Director of Engineering, Search & Recommendations
• Joined CareerBuilder in 2007 as a Software Engineer
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Mining Massive Datasets (in progress) – Stanford University
Fun outside of CB:
• Co-author of Solr in Action, plus a handful of research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
Agenda
• Introduction
• Defining the problem – the need for Semantic Search
• Building an Intent Engine
- Type-ahead prediction
- Spelling Correction
- Entity / Entity-type Resolution
- Semantic Query Parsing
- Query Augmentation
- The Knowledge Graph
• Conclusion
At CareerBuilder, Solr Powers...
Search by the Numbers
Powering 50+ Search Experiences Including:
• 100 million+ searches per day
• 30+ software developers, data scientists + analysts
• 500+ search servers
• 1.5 billion+ documents indexed and searchable
• 1 global search technology platform
...and many more
What’s the problem we’re trying to solve today?
User’s Query:
machine learning research and development Portland, OR software
engineer AND hadoop, java
Traditional Query Parsing:
(machine AND learning AND research AND development AND portland)
OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR"
AND "software engineer" AND hadoop AND java
Semantically Expanded Query:
("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d")
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
But we also really want “things”, not “strings”…
Job Level Job title Company
Job Title Company School + Degree
Knowledge Graph and Intent Engine
Search Box
Intent Engine
• Type-ahead Prediction
• Spelling Correction
• Entity / Entity Type Resolution
• Semantic Query Parsing
• Query Augmentation
• Knowledge Graph
Relevancy Engine (“re-expressing intent”)
• Machine-learned Ranking
User Feedback (Clarifying Intent)
Query Re-writing
Search Results
Type-ahead Predictions
Semantic Autocomplete
• Shows top terms for any search
• Breaks out job titles, skills, companies,
related keywords, and other
categories
• Understands abbreviations, alternate
forms, misspellings
• Supports full Boolean syntax and
multi-term autocomplete
• Enables fielded search on entities, not
just keywords
Spelling Correction*
*Google “Solr Spell Check Component”
Entity / Entity-type Resolution
Differentiating related terms
Synonyms: cpa => certified public accountant
rn => registered nurse
r.n. => registered nurse
Ambiguous Terms*: driver => driver (trucking) ~80% likelihood
driver => driver (software) ~20% likelihood
Related Terms: r.n. => nursing, bsn
hadoop => mapreduce, hive, pig
*differentiated based upon user and query context
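The three kinds of relationships above can be sketched as simple lookups. This is only an illustration: the tables are seeded with the slide's examples, while the context cue words and the weighting of 5 per matching cue are hypothetical values chosen for the example, not something taken from the talk.

```python
# Synonyms map a surface form to one canonical form (examples from the slide).
SYNONYMS = {
    "cpa": "certified public accountant",
    "rn": "registered nurse",
    "r.n.": "registered nurse",
}

# Ambiguous terms carry a prior likelihood per sense (values from the slide).
SENSES = {
    "driver": {"driver (trucking)": 0.8, "driver (software)": 0.2},
}

# Hypothetical context cues per sense; a real system would learn these.
SENSE_CUES = {
    "driver (trucking)": {"cdl", "truck", "delivery"},
    "driver (software)": {"linux", "kernel", "firmware"},
}


def resolve(term, context=()):
    """Return a canonical form for term, using context words to pick a sense."""
    term = term.lower()
    if term in SYNONYMS:
        return SYNONYMS[term]
    if term in SENSES:
        context_words = {word.lower() for word in context}

        def score(sense):
            # Each matching cue multiplies the prior by 5 (arbitrary weight).
            hits = len(SENSE_CUES.get(sense, set()) & context_words)
            return SENSES[term][sense] * (5 ** hits)

        return max(SENSES[term], key=score)
    return term
```

With no context, `resolve("driver")` falls back to the more likely trucking sense; passing `context=["linux"]` tips it to the software sense.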
Building a Taxonomy of Entities
Many ways to generate this:
• Topic Modelling
• Clustering of documents
• Statistical Analysis of interesting phrases
• Buy a dictionary (often doesn’t work for
domain-specific search problems)
• …
Our strategy:
Generate a model of domain-specific phrases by
mining query logs for commonly searched phrases within the domain [1]
[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
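As a rough illustration of mining query logs for commonly searched phrases, the sketch below counts word n-grams across a list of queries and keeps those seen at least `min_count` times. It is plain frequency counting under assumed parameter names (`max_len`, `min_count`); the method in the cited paper [1] is more involved and is not reproduced here.

```python
from collections import Counter


def common_phrases(queries, max_len=3, min_count=2):
    """Count multi-word n-grams across queries and keep the frequent ones."""
    counts = Counter()
    for query in queries:
        tokens = query.lower().split()
        # Consider every contiguous run of 2..max_len words as a candidate.
        for n in range(2, max_len + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    log = [
        "java developer",
        "senior java developer",
        "registered nurse",
        "registered nurse atlanta",
    ]
    print(common_phrases(log))
```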
Entity-type Recognition
Build classifiers trained on external data sources
(Wikipedia, DBpedia, WordNet, etc.), as well as
on data from our own domain.
The subject for a future talk / research paper…
java developer
registered nurse
emergency room
director
job title
skill
job level
location
work type
Portland, OR
part-time
Semantic Query Parsing
Query Parsing: The whole is greater than the sum of the parts
project manager vs. "project" AND "manager"
building architect vs. "building" AND "architect"
software architect vs. "software" AND "architect"
Consider: a "software architect" designs and builds software
a "building architect" uses software to design architecture
User’s Query:
machine learning research and
development Portland, OR software
engineer AND hadoop java
Traditional Query Parsing:
(machine AND learning AND research
AND development AND portland)
OR (software AND engineer AND
hadoop AND java)
≠
Identifying the correct phrase (not just the parts) is crucial here!
Probabilistic Query Parser
Goal: given a query, predict which
combinations of keywords should be
combined together as phrases
Example:
senior java developer hadoop
Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"
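The candidate parsings can be generated mechanically: each gap between adjacent keywords is either a phrase boundary or not, so a query of n keywords has 2^(n-1) contiguous segmentations (8 for the four-word example). The sketch below enumerates them and picks the highest-scoring one. The scoring, a product of per-phrase scores with a small default for unknown phrases, and the score values used in the example are assumptions made for illustration, not the parser described in the talk.

```python
from itertools import product


def possible_parsings(query):
    """Enumerate every way to split the query into contiguous phrases."""
    tokens = query.split()
    if not tokens:
        return []
    parsings = []
    # One boolean per gap between tokens: True means "start a new phrase here".
    for breaks in product([True, False], repeat=len(tokens) - 1):
        phrases, current = [], [tokens[0]]
        for token, is_break in zip(tokens[1:], breaks):
            if is_break:
                phrases.append(" ".join(current))
                current = [token]
            else:
                current.append(token)
        phrases.append(" ".join(current))
        parsings.append(phrases)
    return parsings


def best_parsing(query, phrase_score, default=0.01):
    """Pick the parsing whose phrases have the highest product of scores."""
    def score(parsing):
        total = 1.0
        for phrase in parsing:
            total *= phrase_score.get(phrase, default)
        return total

    return max(possible_parsings(query), key=score)


if __name__ == "__main__":
    # Hypothetical phrase scores, e.g. derived from query-log frequencies.
    scores = {"senior": 0.5, "java": 0.4, "developer": 0.4,
              "hadoop": 0.5, "java developer": 0.6}
    print(best_parsing("senior java developer hadoop", scores))
```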
Input: senior hadoop developer java ruby on rails perl
Semantic Search Architecture – Query Parsing
1) Generate the previously discussed taxonomy of
Domain-specific phrases
• You can mine query logs or actual text of documents for
significant phrases within your domain [1]
2) Feed these phrases to SolrTextTagger [2] (uses Lucene FST
for high-throughput term lookups)
3) Use SolrTextTagger to perform entity extraction
on incoming queries (tagging documents is also possible)
4) Also invoke probabilistic parser to dynamically identify
unknown phrases from a corpus of data (language model)
5) Shown on next slides:
Pass extracted entities to a Query Augmentation phase to
rewrite the query with enhanced semantic understanding
[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of
Domain-specific Jargon," in IEEE Big Data 2014.
[2] https://github.com/OpenSextant/SolrTextTagger
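Step 3 can be approximated in a few lines to show the idea of dictionary-based entity extraction on a query. The sketch below is a pure-Python stand-in that does greedy longest-match lookup against a set of known phrases; it does not call SolrTextTagger or use an FST, and the function name, `max_len` parameter, and phrase list are assumptions for the example.

```python
def tag_entities(query, known_phrases, max_len=4):
    """Greedy longest-match tagging of known phrases in a query.

    Returns (text, is_known) pairs in query order.
    """
    tokens = query.lower().split()
    lookup = {phrase.lower() for phrase in known_phrases}
    tagged, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so "ruby on rails" beats "ruby".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lookup:
                match = candidate
                break
        if match is None:
            tagged.append((tokens[i], False))
            i += 1
        else:
            tagged.append((match, True))
            i += len(match.split())
    return tagged


if __name__ == "__main__":
    phrases = {"hadoop developer", "ruby on rails", "ruby", "java"}
    query = "senior hadoop developer java ruby on rails perl"
    for text, known in tag_entities(query, phrases):
        print(("ENTITY " if known else "TOKEN  ") + text)
```

Tokens left untagged here ("senior", "perl") are the ones a probabilistic parser, as in step 4, would still need to handle.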
Query Augmentation
Semantic Search Architecture – Query Augmentation
Keywords: machine learning
Search Behavior, Application Behavior, etc.
Job Title Classifier, Skills Extractor, Job Level Classifier, etc.
Known keyword phrases (FST):
• java developer
• machine learning
• registered nurse
Related Phrases
machine learning:
{ data mining .9,
matlab .8,
data scientist .75,
artificial intelligence .7,
neural networks .55 }
Common Job Titles
machine learning:
{ software engineer .65,
data manager .3,
data scientist .25,
hadoop engineer .2 }
Related Occupations
machine learning:
{ 15-1031.00 .58 Computer Software Engineers, Applications
15-1011.00 .55 Computer and Information Scientists, Research
15-1032.00 .52 Computer Software Engineers, Systems Software }
Semantic Query Augmentation
Modified Query:
keywords:((machine learning)^10 OR
{ AT_LEAST_2: ("data mining"^0.9, matlab^0.8,
"data scientist"^0.75, "artificial intelligence"^0.7,
"neural networks"^0.55) })
{ BOOST_TO_TOP: ( job_title:(
"software engineer" OR "data manager" OR
"data scientist" OR "hadoop engineer")) }
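A minimal sketch of the rewriting step, assuming the related phrases and their weights are already available as a dictionary: it emits a standard Lucene boosted OR query. The AT_LEAST_2 and BOOST_TO_TOP operators in the slide's modified query appear to be custom syntax and are not reproduced; the function name and the default boost of 10 for the original phrase are assumptions for the example.

```python
def augment_query(field, phrase, related, main_boost=10):
    """Build a boosted OR query from a phrase and its weighted related phrases."""
    def quoted(text):
        # Quote each phrase so multi-word terms are matched as phrases.
        return '"' + text.replace('"', '\\"') + '"'

    clauses = [f"{quoted(phrase)}^{main_boost}"]
    # Strongest related phrases first, each boosted by its relatedness weight.
    for term, weight in sorted(related.items(), key=lambda item: -item[1]):
        clauses.append(f"{quoted(term)}^{weight}")
    return f"{field}:(" + " OR ".join(clauses) + ")"


if __name__ == "__main__":
    related = {
        "data mining": 0.9,
        "matlab": 0.8,
        "data scientist": 0.75,
        "artificial intelligence": 0.7,
        "neural networks": 0.55,
    }
    print(augment_query("keywords", "machine learning", related))
```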
Query Enrichment
Document Enrichment
Document Enrichment
Knowledge Graph
Serves as a “data science toolkit” API that allows dynamically navigating and pivoting through
multiple levels of relationships between items in our domain. Compare the relationships of skills to
keywords, job titles to skills to keywords, skills to government occupation codes, skills to experience
level, etc.
Knowledge Graph API
Core similarity engine, exposed via API
Any product can leverage our core relationship scoring
engine to score any list of entities against any other list
Full domain support
Keywords, job titles, skills, companies, job levels,
locations, and all other taxonomies.
Intersections, overlaps, & relationship
scoring, many levels deep
Users can either provide a list of items to score, or else have the
system dynamically discover the most related items (or both).
So how does it work?
Foreground vs. Background Analysis
Every term is scored against its context. The more
commonly the term appears within its foreground
context versus its background context, the more
relevant it is to the specified foreground context.
countFG(x) - totalDocsFG * probBG(x)
z = --------------------------------------------------------
sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
{ "type":"keywords", "values":[
{ "value":"hive", "relatedness":0.9773, "popularity":369 },
{ "value":"java", "relatedness":0.9236, "popularity":15653 },
{ "value":".net", "relatedness":0.5294, "popularity":17683 },
{ "value":"bee", "relatedness":0.0, "popularity":0 },
{ "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
{ "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }
We are essentially boosting terms which are more related to some known feature
(and ignoring terms which are equally likely to appear in the background corpus)
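The z formula above translates directly into code. In the sketch below, `count_fg` is the number of foreground documents containing the term, `total_docs_fg` is the size of the foreground set, and `prob_bg` is the fraction of background documents containing the term; these argument names are assumptions for the example. The relatedness values in the sample response look normalized to a bounded range, and since that normalization is not given in the slides, only the raw z-score is computed here.

```python
import math


def z_score(count_fg, total_docs_fg, prob_bg):
    """Score how over- or under-represented a term is in the foreground set."""
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    if std_dev == 0:
        # The term never (or always) occurs in the background: no signal.
        return 0.0
    return (count_fg - expected) / std_dev


if __name__ == "__main__":
    # Hypothetical counts: 100 foreground docs, term in 10% of the background.
    print(z_score(50, 100, 0.1))  # far above expectation: positive score
    print(z_score(10, 100, 0.1))  # exactly as expected: zero
    print(z_score(1, 100, 0.1))   # below expectation: negative score
```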
Foreground Query: "Hadoop"
Knowledge Graph – Potential Use Cases
Cross-walk between Types
• Have an ID field, but want to enable free text search
on the most associated entity with that ID?
• Have a “state” (geo) search box, but want to accept
any free-text location and map it to the right state?
• Have an old classification taxonomy and want to
know how the values from the old system now map
into the new values?
Build User Profiles from Search Logs
• If someone searches for “Java”, and then “JQuery”,
and then “CSS”, and then “JSP”, what do those have
in common?
• What if they search for “Java”, and then “C++”, and
then “Assembly”?
Discover Relationships Between Anything
• If I want to become a data scientist and know
Python, what libraries should I learn?
• If my last job was mid-level software engineer and
my current job is Engineering Lead, what are my
most likely next roles?
Traverse arbitrarily deep, Sort on anything
• Build an instant co-occurrence matrix, sort the top
values by their relatedness, and then add in any
number of additional dimensions (RAM permitting).
Data Cleansing
• Have dirty taxonomies and need to figure out which
items don’t belong?
• Need to understand the conceptual cohesion of a
document (vs spammy or off-topic content)?
2014-2015 Publications & Presentations
Books:
Solr in Action - A comprehensive guide to implementing scalable search using Apache Solr
Research papers:
● Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific jargon - 2014
● Towards a Job title Classification System - 2014
● Augmenting Recommendation Systems Using a Model of Semantically-related Terms
Extracted from User Behavior - 2014
● sCooL: A system for academic institution name normalization - 2014
● PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems - 2014
● SKILL: A System for Skill Identification and Normalization – 2015
● Carotene: A Job Title Classification System for the Online Recruitment Domain - 2015
● WebScalding: A Framework for Big Data Web Services - 2015
● A Pipeline for Extracting and Deduplicating Domain-Specific Knowledge Bases - 2015
● Macau: Large-Scale Skill Sense Disambiguation in the Online Recruitment Domain - 2015
● Improving the Quality of Semantic Relationships Extracted from Massive User Behavioral Data – 2015
● Query Sense Disambiguation Leveraging Large Scale User Behavioral Data - 2015
Speaking Engagements:
● Over a dozen in the last year: Lucene/Solr Revolution 2014, WSDM 2014, Atlanta Solr Meetup, Atlanta Big Data Meetup, Second
International Symposium on Big Data and Data Analytics, RecSys 2014, IEEE Big Data Conference 2014 (x2), AAAI/IAAI 2015, IEEE Big Data
2015 (x6), Lucene/Solr Revolution 2015
So What’s Next?
[Semantic Search Architecture – Query Augmentation diagram, repeated from the Query Augmentation section]
This Piece:
How do you construct the
best possible queries?
The answer… Learning to Rank
(Machine-learned Ranking)
That can be a topic for next time…
[Knowledge Graph and Intent Engine architecture diagram, repeated from earlier in the deck]
Additional References:
Contact Info
Yes, WE ARE HIRING @ . Come talk with me if you are interested…
Trey Grainger
trey.grainger@careerbuilder.com
@treygrainger
http://solrinaction.com
Conference discount (43% off): lusorevcftw
Other presentations:
http://www.treygrainger.com

Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine

  • 1. Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine Trey Grainger Director of Engineering, Search & Recommendations 2015.10.15
  • 2. Trey Grainger Director of Engineering, Search & Recommendations • Joined CareerBuilder in 2007 as a Software Engineer • MBA, Management of Technology – Georgia Tech • BA, Computer Science, Business, & Philosophy – Furman University • Mining Massive Datasets (in progress) - Stanford University Fun outside of CB: • Co-author of Solr in Action, plus a handful of research papers • Frequent conference speaker • Founder of Celiaccess.com, the gluten-free search engine • Lucene/Solr contributor About Me
  • 3. Agenda • Introduction • Defining the problem – the need for Semantic Search • Building an Intent Engine - Type-ahead prediction - Spelling Correction - Entity / Entity-type Resolution - Semantic Query Parsing - Query Augmentation - The Knowledge Graph • Conclusion Knowledge Graph
  • 4. At CareerBuilder, Solr Powers...At CareerBuilder, Solr Powers...
  • 5. Search by the Numbers 5 Powering 50+ Search Experiences Including: 100million + Searches per day 30+ Software Developers, Data Scientists + Analysts 500+ Search Servers 1,5billion + Documents indexed and searchable 1 Global Search Technology platform ...and many more
  • 6. What’s the problem we’re trying to solve today? User’s Query: machine learning research and development Portland, OR software engineer AND hadoop, java Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java) Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java Semantically Expanded Query: ("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence") AND ("research and development"^10 OR "r&d") AND AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
  • 7. But we also really want “things”, not “strings”… Job Level Job title Company Job Title Company School + Degree
  • 8. Type-ahead Prediction Knowledge Graph and Intent Engine Search Box Semantic Query Parsing Intent Engine Spelling Correction Entity / Entity Type Resolution Machine-learned Ranking Relevancy Engine (“re-expressing intent”) User Feedback (Clarifying Intent) Query Re-writing Search Results Query Augmentation Knowledge Graph
  • 10. Semantic Autocomplete • Shows top terms for any search • Breaks out job titles, skills, companies, related keywords, and other categories • Understands abbreviations, alternate forms, misspellings • Supports full Boolean syntax and multi-term autocomplete • Enables fielded search on entities, not just keywords
  • 11. Spelling Correction* *Google “Solr Spell Check Component”
  • 14. Differentiating related terms Synonyms: cpa => certified public accountant rn => registered nurse r.n. => registered nurse Ambiguous Terms*: driver => driver (trucking) ~80% likelihood driver => driver (software) ~20% likelihood Related Terms: r.n. => nursing, bsn hadoop => mapreduce, hive, pig *differentiated based upon user and query context
  • 15. Building a Taxonomy of Entities Many ways to generate this: • Topic Modelling • Clustering of documents • Statistical Analysis of interesting phrases • Buy a dictionary (often doesn’t work for domain-specific search problems) • … Our strategy: Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain [1] [1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
  • 16. Entity-type Recognition Build classifiers trained on External data sources (Wikipedia, DBPedia, WordNet, etc.), as well as from our own domain. The subject for a future talk / research paper… java developer registered nurse emergency room director job title skill job level location work type Portland, OR part-time
  • 18. Query Parsing: The whole is greater than the sum of the parts project manager vs. "project" AND "manager" building architect vs. "building" AND "architect" software architect vs. "software" AND "architect" Consider: a "software architect" designs and builds software a "building architect" uses software to design architecture User’s Query: machine learning research and development Portland, OR software engineer AND hadoop java Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java) ≠ Identifying the correct phrase (not just the parts) is crucial here!
  • 20. Probabilistic Query Parser Goal: given a query, predict which combinations of keywords should be combined together as phrases Example: senior java developer hadoop Possible Parsings: senior, java, developer, hadoop; "senior java", developer, hadoop; "senior java developer", hadoop; "senior java developer hadoop"; "senior java", "developer hadoop"; senior, "java developer", hadoop; senior, java, "developer hadoop"
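A minimal sketch of the candidate-generation step only, assuming phrases are always contiguous runs of keywords. The function name `candidate_parsings` is hypothetical, and the probabilistic scoring that picks the best candidate is not shown.

```python
from itertools import product


def candidate_parsings(query):
    """Enumerate every split of a query into contiguous phrases.

    Each of the n-1 gaps between adjacent tokens is either a phrase
    boundary or not, so an n-token query has 2**(n-1) candidates.
    """
    tokens = query.split()
    if not tokens:
        return []
    parsings = []
    for breaks in product([True, False], repeat=len(tokens) - 1):
        phrases, current = [], [tokens[0]]
        for token, is_break in zip(tokens[1:], breaks):
            if is_break:
                # Close the current phrase and start a new one.
                phrases.append(" ".join(current))
                current = [token]
            else:
                current.append(token)
        phrases.append(" ".join(current))
        parsings.append(phrases)
    return parsings


parsings = candidate_parsings("senior java developer hadoop")
for phrases in parsings:
    print(phrases)
```

For the four-keyword example this enumerates eight candidates: the seven listed on the slide plus senior, "java developer hadoop". A language model would then score each candidate.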
  • 21. Input: senior hadoop developer java ruby on rails perl
  • 22. Semantic Search Architecture – Query Parsing 1) Generate the previously discussed taxonomy of domain-specific phrases • You can mine query logs or actual text of documents for significant phrases within your domain [1] 2) Feed these phrases to SolrTextTagger [2] (uses Lucene FST for high-throughput term lookups) 3) Use SolrTextTagger to perform entity extraction on incoming queries (tagging documents is also possible) 4) Also invoke the probabilistic parser to dynamically identify unknown phrases from a corpus of data (language model) 5) Shown on next slides: pass extracted entities to a Query Augmentation phase to rewrite the query with enhanced semantic understanding [1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014. [2] https://github.com/OpenSextant/SolrTextTagger
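The entity-extraction step (3) can be illustrated with a plain-Python stand-in. This is not the SolrTextTagger API: the Lucene FST is replaced by a Python set, the matching strategy (greedy longest match) is an assumption, and `tag_known_phrases` and `KNOWN_PHRASES` are hypothetical names.

```python
def tag_known_phrases(query, known_phrases, max_len=4):
    """Greedy longest-match tagging of known phrases in a query.

    Returns (text, is_known_phrase) pairs in query order.
    """
    tokens = query.lower().split()
    tagged = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest window first so "ruby on rails" beats "ruby".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]) in known_phrases:
                match_len = n
                break
        if match_len:
            tagged.append((" ".join(tokens[i:i + match_len]), True))
            i += match_len
        else:
            # Unknown token: left for the probabilistic parser to handle.
            tagged.append((tokens[i], False))
            i += 1
    return tagged


# Hypothetical phrase list; a real one would come from query-log mining.
KNOWN_PHRASES = {"hadoop developer", "java", "ruby on rails", "ruby", "perl"}
tags = tag_known_phrases(
    "senior hadoop developer java ruby on rails perl", KNOWN_PHRASES
)
print(tags)
```

Tokens that match nothing in the phrase list (here, "senior") are the ones a probabilistic parser would need to resolve in step 4.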
  • 24. Semantic Search Architecture – Query Augmentation Keywords: machine learning Components: FST, Knowledge Graph Sources: Search Behavior, Application Behavior, etc.; Job Title Classifier, Skills Extractor, Job Level Classifier, etc. Known keyword phrases: java developer, machine learning, registered nurse Related Phrases: machine learning: { data mining .9, matlab .8, data scientist .75, artificial intelligence .7, neural networks .55 } Common Job Titles: machine learning: { software engineer .65, data manager .3, data scientist .25, hadoop engineer .2 } Related Occupations: machine learning: { 15-1031.00 .58 Computer Software Engineers, Applications; 15-1011.00 .55 Computer and Information Scientists, Research; 15-1032.00 .52 Computer Software Engineers, Systems Software } Modified Query: keywords:((machine learning)^10 OR { AT_LEAST_2: ("data mining"^0.9, matlab^0.8, "data scientist"^0.75, "artificial intelligence"^0.7, "neural networks"^0.55)) } { BOOST_TO_TOP: ( job_title:( "software engineer" OR "data manager" OR "data scientist" OR "hadoop engineer")) }
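A rough sketch of how related phrases and their weights might be turned into a boosted query, using only standard Lucene boost syntax (`term^weight`). The `AT_LEAST_2` and `BOOST_TO_TOP` directives on the slide look like custom operators and are deliberately left out; `augment_query` is a hypothetical helper, not the presenter's implementation.

```python
def augment_query(phrase, related, original_boost=10):
    """Build a Lucene-style boosted OR clause from related phrases."""
    def quote(term):
        # Quote multi-word terms so they are matched as phrases.
        return f'"{term}"' if " " in term else term

    clauses = [f"{quote(phrase)}^{original_boost}"]
    # Highest-weighted related phrases first.
    for term, weight in sorted(related.items(), key=lambda kv: -kv[1]):
        clauses.append(f"{quote(term)}^{weight}")
    return "keywords:(" + " OR ".join(clauses) + ")"


# Weights copied from the "Related Phrases" box on the slide.
related_phrases = {
    "data mining": 0.9,
    "matlab": 0.8,
    "data scientist": 0.75,
    "artificial intelligence": 0.7,
    "neural networks": 0.55,
}
augmented = augment_query("machine learning", related_phrases)
print(augmented)
```

The original phrase keeps a much higher boost than any expansion, so documents matching the user's own words still rank first.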
  • 29. Knowledge Graph API: Serves as a “data science toolkit” API that allows dynamically navigating and pivoting through multiple levels of relationships between items in our domain. Compare the relationships of skills to keywords, job titles to skills to keywords, skills to government occupation codes, skills to experience level, etc. • Core similarity engine, exposed via API: Any product can leverage our core relationship scoring engine to score any list of entities against any other list. • Full domain support: Keywords, job titles, skills, companies, job levels, locations, and all other taxonomies. • Intersections, overlaps, & relationship scoring, many levels deep: Users can either provide a list of items to score, or else have the system dynamically discover the most related items (or both).
  • 30. Knowledge Graph: So how does it work? Foreground vs. Background Analysis: Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context. z = (countFG(x) - totalDocsFG * probBG(x)) / sqrt(totalDocsFG * probBG(x) * (1 - probBG(x))) Foreground Query: "Hadoop" { "type":"keywords", "values":[ { "value":"hive", "relatedness":0.9773, "popularity":369 }, { "value":"java", "relatedness":0.9236, "popularity":15653 }, { "value":".net", "relatedness":0.5294, "popularity":17683 }, { "value":"bee", "relatedness":0.0, "popularity":0 }, { "value":"teacher", "relatedness":-0.2380, "popularity":9923 }, { "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] } We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus).
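The z-score formula on this slide can be sketched directly in Python. The parameter names mirror the slide's symbols; the counts in the example are made up for illustration, and the zero-variance guard is an added assumption.

```python
import math


def z_score(count_fg, total_docs_fg, prob_bg):
    """z-score of a term's foreground count versus its background rate."""
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    if std_dev == 0:
        # Term absent from, or present in every document of, the background.
        return 0.0
    return (count_fg - expected) / std_dev


# Made-up counts for illustration only: 1000 foreground documents and a
# term that appears in 10% of the background corpus.
overrepresented = z_score(count_fg=300, total_docs_fg=1000, prob_bg=0.1)
as_expected = z_score(count_fg=100, total_docs_fg=1000, prob_bg=0.1)
underrepresented = z_score(count_fg=20, total_docs_fg=1000, prob_bg=0.1)
print(overrepresented, as_expected, underrepresented)
```

A term that appears as often as the background predicts scores near zero, and over- or under-represented terms score positive or negative. The relatedness values in the JSON fall between -1 and 1, which suggests the raw z-score is normalized afterward; that step is not described on the slide and is not reproduced here.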
  • 31. Knowledge Graph – Potential Use Cases Cross-walk between Types • Have an ID field, but want to enable free text search on the most associated entity with that ID? • Have a “state” (geo) search box, but want to accept any free-text location and map it to the right state? • Have an old classification taxonomy and want to know how the values from the old system now map into the new values? Build User Profiles from Search Logs • If someone searches for “Java”, and then “JQuery”, and then “CSS”, and then “JSP”, what do those have in common? • What if they search for “Java”, and then “C++”, and then “Assembly”? Discover Relationships Between Anything • If I want to become a data scientist and know Python, what libraries should I learn? • If my last job was mid-level software engineer and my current job is Engineering Lead, what are my most likely next roles? Traverse arbitrarily deep, Sort on anything • Build an instant co-occurrence matrix, sort the top values by their relatedness, and then add in any number of additional dimensions (RAM permitting). Data Cleansing • Have dirty taxonomies and need to figure out which items don’t belong? • Need to understand the conceptual cohesion of a document (vs spammy or off-topic content)? Knowledge Graph
  • 32. 2014-2015 Publications & Presentations Books: Solr in Action - A comprehensive guide to implementing scalable search using Apache Solr Research papers: ● Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific jargon - 2014 ● Towards a Job title Classification System - 2014 ● Augmenting Recommendation Systems Using a Model of Semantically-related Terms Extracted from User Behavior - 2014 ● sCooL: A system for academic institution name normalization - 2014 ● PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems - 2014 ● SKILL: A System for Skill Identification and Normalization - 2015 ● Carotene: A Job Title Classification System for the Online Recruitment Domain - 2015 ● WebScalding: A Framework for Big Data Web Services - 2015 ● A Pipeline for Extracting and Deduplicating Domain-Specific Knowledge Bases - 2015 ● Macau: Large-Scale Skill Sense Disambiguation in the Online Recruitment Domain - 2015 ● Improving the Quality of Semantic Relationships Extracted from Massive User Behavioral Data - 2015 ● Query Sense Disambiguation Leveraging Large Scale User Behavioral Data - 2015 Speaking Engagements: ● Over a dozen in the last year: Lucene/Solr Revolution 2014, WSDM 2014, Atlanta Solr Meetup, Atlanta Big Data Meetup, Second International Symposium on Big Data and Data Analytics, RecSys 2014, IEEE Big Data Conference 2014 (x2), AAAI/IAAI 2015, IEEE Big Data 2015 (x6), Lucene/Solr Revolution 2015
  • 34. Semantic Search Architecture – Query Augmentation (diagram repeated from slide 24) This Piece: How do you construct the best possible queries? The answer… Learning to Rank (Machine-learned Ranking). That can be a topic for next time…
  • 35. Knowledge Graph and Intent Engine (architecture diagram) Search Box; Intent Engine: Type-ahead Prediction, Spelling Correction, Entity / Entity Type Resolution, Semantic Query Parsing, Query Augmentation, Knowledge Graph; Query Re-writing; Relevancy Engine (“re-expressing intent”): Machine-learned Ranking; Search Results; User Feedback (Clarifying Intent)
  • 37. Contact Info Yes, WE ARE HIRING @ . Come talk with me if you are interested… Trey Grainger [email protected] @treygrainger http://solrinaction.com Conference discount (43% off): lusorevcftw Other presentations: http://www.treygrainger.com